Re: DCT: D compiler as a collection of libraries
On Saturday, 19 May 2012 at 20:35:10 UTC, Marco Leise wrote: Am Fri, 11 May 2012 10:01:28 +0200 schrieb Roman D. Boiko r...@d-coding.com: There were several discussions about the need for a D compiler library. I propose my draft implementation of a lexer for community review: https://github.com/roman-d-boiko/dct

(I decided to comment on both my post and your reply.) I've got *a lot* of feedback from the community, which caused a significant redesign (but also a delay of as yet unknown duration). I'll write more on that later. Thanks a lot to everybody!

The lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). It looks like I'm going to replace this implementation almost completely.

The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). See below. My commitment is to have at least the front end built this year (conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). By the D specification I mean http://dlang.org/language-reference.html, or (when the specification is not clear to me) one of the following: TDPL, the current DMD implementation, or community feedback. (Note that I may assume that I understand some aspect, but later revisit it if needed.)

Please post any feedback here. A dedicated project web-site will be created later. It is going to be http://d-coding.com, but it is not usable yet (I don't have much web-design practice). It is hosted on https://github.com/roman-d-boiko/roman-d-boiko.github.com; pull requests are welcome.

A general-purpose D front-end library has been discussed several times. So at least for some day-dreaming about better IDE support and code analysis tools it has huge value. It is good to see that someone finally takes the time to implement one.
My only concern is that not enough planning went into the design. Could you name a few specific concerns? The reason for this thread was to reflect on the design early. OTOH, I didn't spend time documenting my design goals and tradeoffs, so the discussion turned into a brainstorm and wasn't always productive. But now that I see how much value I can get even without doing my homework, I'm much more likely to provide better documentation and ask more specific questions.

I think about things like brainstorming and collecting possible use cases from the community, or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decision was 'important' and which was just the author's way of doing it. I'm going to pick several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and to plan work into milestones, in order to ensure the project is feasible.

Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of how such modules would look. Well, seriously *considering* such inclusion would help improve the quality of the project. But it is not necessarily what will happen. Currently I think that my goals are close to those of use case 2 from Jonathan's reply. But until the project is reasonably complete it is not the time to argue whether to include it (or part of it).

Also, if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and about implementing them in a seemingly worse way. Not a goal currently, because that would make the project significantly less likely to ever be completed.
That said, you give the impression of being really dedicated to the project, even giving yourself a timeline, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools. Well, I hope that some more people will join the project once it stabilizes and delivers something useful. By the way, is anybody *potentially* interested in joining?

To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly, I ended up with a similar design for the Token struct to yours: a separate array for line # lookup, a TokenType/TokenKind enum, and a trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins, though?). Switch code generation is something temporary that works currently and helps me evaluate possible problems with the future design, which is planned to be
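The "separate array for line # lookup" idea both designs converged on can be sketched as follows. This is a hypothetical illustration (names like LineIndex and locate are mine, not from DCT or Marco's highlighter): tokens store only a byte offset, and (line, column) is recovered on demand with a binary search over recorded line starts.

```d
import std.range : assumeSorted;
import std.typecons : Tuple, tuple;

// Hypothetical sketch: record the byte offset of each line start once
// per source file, then recover (line, column) lazily per token.
struct LineIndex
{
    immutable(size_t)[] lineStarts; // sorted; lineStarts[0] == 0

    static LineIndex build(string source)
    {
        size_t[] starts = [0];
        foreach (i, c; source)
            if (c == '\n')
                starts ~= i + 1; // next line begins after the newline
        return LineIndex(cast(immutable) starts);
    }

    // 1-based (line, column) of a byte offset, e.g. a token's startIndex.
    Tuple!(size_t, "line", size_t, "column") locate(size_t offset) const
    {
        // The number of line starts <= offset is the 1-based line number.
        immutable line = assumeSorted(lineStarts).lowerBound(offset + 1).length;
        return tuple!("line", "column")(line, offset - lineStarts[line - 1] + 1);
    }
}
```

Building the index is O(source length) once per edit, while each lookup is O(log lines), which matches the thread's assumption that positions are needed rarely (mostly for diagnostics shown to the user).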
Re: DCT: D compiler as a collection of libraries
Am Sun, 20 May 2012 10:09:34 +0200 schrieb Roman D. Boiko r...@d-coding.com: Could you name a few specific concerns? Mostly my own gut feeling that things that sound great in my head turn out to bite me in the end. Things that one just doesn't think of because of the limited horizon everyone has, and probably a lack of experience in the field of compiler/tools writing. On the other hand, I have good experience with working by community feedback. So the rough edges in the design may already be ironed out by now.

I'm going to pick up several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and plan work into milestones, in order to ensure the project is feasible. There is one feature I remember caused some headaches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add and remove projects to your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion).

C bindings could be an option. (As in: the smallest common denominator.) They allow existing tools (written in Java, C#, Python, ...) to use your library. Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;). I'm sure it will :) (I'm going to elaborate on this some time later). I'm curious. There are several EoF conditions: \0, \x1A, __EOF__ and physical EoF. And any loop would need to check for all.
Preallocation could eliminate the need to check for physical EoF, but would make it impossible to avoid copying the input string, and also would not allow passing a function a slice of a string to be lexed. I found that I usually either load from file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception) or I am editing the file, in which case I recreate the source string after every key stroke anyway. I can still pass slices of that string to functions though. Not sure what you mean. It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion. (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 (end-of-transmission, ^D)?)

The way it works is: Parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect they emit an error and drop back to the surrounding parser state. When they reach the file level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed. In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else.

** Errors ** I generally keep the start and end column, in case someone wants that. This functionality has been the priority from the beginning. Not implemented yet, but designed for. Evaluating column and line only on demand follows from the assumption that such information is needed rarely (primarily to display information to the user). My new design also addresses the use cases where it is needed frequently. [...] Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them. Also, immutability of data structures will enable many optimizations. No more questions! -- Marco
Re: DCT: D compiler as a collection of libraries
On Sunday, 20 May 2012 at 17:42:59 UTC, Marco Leise wrote: There is one feature I remember caused some headaches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add and remove projects to your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion). The opposite situation: I didn't think about a parser per project :) So I guess no need to worry here.

C bindings could be an option. (As in: the smallest common denominator.) They allow existing tools (written in Java, C#, Python, ...) to use your library. Yeah, but I'm far from there yet.

Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;). I'm sure it will :) (I'm going to elaborate on this some time later). I'm curious. Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. But that is not compatible with storing tokens in an array, since it would cause reallocating memory each time, so another data structure should be used (e.g., a linked list, or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns like average size of data, complexity of parsing algorithms, requirements, etc.).
I found that I usually either load from file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception) or I am editing the file, in which case I recreate the source string after every key stroke anyway. I can still pass slices of that string to functions though. Not sure what you mean. Answered below.

It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion. (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 (end-of-transmission, ^D)?) D has the following EoF cases: \0, \x1A, physical EoF, and the __EOF__ special token when not inside a comment or certain string literals. ^D is not in this list.

The way it works is: Parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect they emit an error and drop back to the surrounding parser state. When they reach the file level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed. In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else. If you want to pass a slice of the input string to a function, you cannot append \0 to it without copying data. If you don't append some pre-defined character, you must check for length *and* all supported terminating characters. On the contrary, your design might not require passing slices, and if the language syntax allows deterministic parsing (when you know what to expect next), there is no need for checking EoF.
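The trade-off discussed here can be made concrete with a small sketch (hypothetical helper names, not code from either project): a sentinel-terminated buffer lets every scanning loop stop on a single character test, while lexing a borrowed slice needs a bounds check on each step.

```d
// With an owned buffer ending in '\0' (other terminators such as \x1A
// could be normalized to '\0' when the buffer is created), one character
// test per step suffices.
size_t skipSpacesSentinel(const(char)* p)
{
    auto start = p;
    while (*p == ' ' || *p == '\t') // '\0' sentinel ends the loop
        ++p;
    return p - start;
}

// With a borrowed slice, an extra length check is required on every
// iteration, since we may not append a sentinel without copying.
size_t skipSpacesSlice(const(char)[] s)
{
    size_t i;
    while (i < s.length && (s[i] == ' ' || s[i] == '\t'))
        ++i;
    return i;
}
```

D string literals happen to be '\0'-terminated, so the pointer version can be tried directly on a literal's .ptr; for arbitrary slices only the second version is safe.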
Re: DCT: D compiler as a collection of libraries
Am Sun, 20 May 2012 20:37:07 +0200 schrieb Roman D. Boiko r...@d-coding.com: Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;). I'm sure it will :) (I'm going to elaborate on this some time later). I'm curious. Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. But that is not compatible with storing tokens in an array, since it would cause reallocating memory each time, so another data structure should be used (e.g., a linked list, or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns like average size of data, complexity of parsing algorithms, requirements, etc.). I only thought about the single-use situation, not the situation when editing files. Now the idea has evolved a bit, and you probably thought about storing the scope hierarchy in a tree, too. It is still difficult to write a fast parser when someone can add/remove a '{' somewhere at the top of datetime.d, which changes the interpretation of the rest of the file. Mono-D has some hiccups worse than those (several seconds even) - maybe a bug. I'll keep my fingers crossed.

If you want to pass a slice of the input string to a function, you cannot append \0 to it without copying data. If you don't append some pre-defined character, you must check for length *and* all supported terminating characters. On the contrary, your design might not require passing slices, and if the language syntax allows deterministic parsing (when you know what to expect next), there is no need for checking EoF. Now I get it! -- Marco
Re: DCT: D compiler as a collection of libraries
Am Fri, 11 May 2012 10:01:28 +0200 schrieb Roman D. Boiko r...@d-coding.com: There were several discussions about the need for a D compiler library. I propose my draft implementation of a lexer for community review: https://github.com/roman-d-boiko/dct The lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least the front end built this year (conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feedback here. A dedicated project web-site will be created later.

A general-purpose D front-end library has been discussed several times. So at least for some day-dreaming about better IDE support and code analysis tools it has huge value. It is good to see that someone finally takes the time to implement one. My only concern is that not enough planning went into the design. I think about things like brainstorming and collecting possible use cases from the community, or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decision was 'important' and which was just the author's way of doing it.

Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of how such modules would look. Also, if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and about implementing them in a seemingly worse way.
That said, you give the impression of being really dedicated to the project, even giving yourself a timeline, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools.

To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly, I ended up with a similar design for the Token struct to yours: a separate array for line # lookup, a TokenType/TokenKind enum, and a trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins, though?). Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;). One notable difference is that I only check for isEoF in one location, since I append \0 to the source code as a stop token (that can also serve as an end-of-line indicator). (The general idea is to move checks outside of loops.)

** Errors ** I generally keep the start and end column, in case someone wants that. A real-world example:

    ubyte x = ...;
    if (x >= 0 && x <= 8) ...

Line 2, warning: Comparison is always true. At first glance you say "No, a byte can become larger than 8.", but the compiler is just complaining about the first part of the expression here, which would be clarified by e.g. some ASCII/Unicode art:

    Line 2, warning: Comparison is always true:
    if (x >= 0 && x <= 8) ...
        ^^^^^^

** Code highlighting ** Especially Gtk's TextView (which GtkSourceView is based on) isn't made for several thousand tokens. The text view will take a second per 2 tokens or so. The source view works around that by highlighting in chunks, but you can still feel the delay when you jump to the end of the file in gedit, and even MonoDevelop suffers from the bad performance.
If I understood your posts correctly, you are already planning for minimal change sets. It is *very* important to only update changed parts in a syntax-highlighting editor. So for now I ended up writing an 'insertText' and a 'deleteText' method for the parser, which take 'start index', 'text' and 'start index', 'end index' respectively, and return a list of removed and added tokens. If possible, make sure that your library can be used with GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately the bindings for the last two could use an update, but the API help should give you an idea of how to use them most efficiently. -- Marco
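The insertText/deleteText scheme Marco describes could be shaped roughly like this in D. This is purely an illustrative sketch of the API surface (the Token, TokenDelta and IncrementalLexer names are mine, not from either project):

```d
// Minimal token type for the sketch; the real one carries more fields.
struct Token
{
    size_t startIndex; // byte offset in the source
    string spelling;   // characters the token was lexed from
}

// Each edit reports exactly which tokens vanished and which appeared,
// so an editor can re-highlight just the affected region.
struct TokenDelta
{
    Token[] removed; // tokens invalidated by the edit
    Token[] added;   // tokens produced by relexing the affected range
}

interface IncrementalLexer
{
    // Insert `text` at byte offset `start`; relex only the affected lines.
    TokenDelta insertText(size_t start, string text);

    // Delete the byte range [start, end); relex only the affected lines.
    TokenDelta deleteText(size_t start, size_t end);
}
```

Returning deltas rather than a full token stream is what makes the scheme compatible with GtkSourceView/Scintilla-style highlighters, which invalidate and restyle only the edited region.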
Re: DCT: D compiler as a collection of libraries
Le 15/05/2012 21:59, Roman D. Boiko a écrit : On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote: On 05/14/2012 05:00 PM, Roman D. Boiko wrote: Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose? Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing. I'd like to be able to do efficient token lookup based on its start index in the file (e.g., for autocompletion and calculating column/line numbers on demand). I also plan to have the ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep the history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and preserving history. I don't worry much about keeping all tokens in memory, because the information they contain is useful and needed for most scenarios. If, on the contrary, the information is no longer used, it will be garbage collected. This is a significant redesign of the current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. An alpha release for community review is planned on May 21, and design documentation 10 days after that.

The only information that is useful over time is the position. This is the one that you want to calculate lazily. In other terms, what you want to keep in memory is of no interest. A circular buffer as Timon proposes is a better idea.
Re: DCT: D compiler as a collection of libraries
Le 14/05/2012 21:21, Roman D. Boiko a écrit : On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote: On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote: What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the compiler mode... and there actually is an error, then just lex it again, to produce a pretty error message ;) ... The problem with multiple lexer implementations is that it might become much more difficult to maintain them. Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity goes from two factors: different implementations and how to combine them. I try to avoid this. If both have the same interface, metaprograming can do the magic for you.
Re: DCT: D compiler as a collection of libraries
On Tuesday, 15 May 2012 at 09:33:31 UTC, deadalnix wrote: Le 14/05/2012 21:21, Roman D. Boiko a écrit : Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity comes from two factors: different implementations and how to combine them. I try to avoid this. If both have the same interface, metaprogramming can do the magic for you. Well, I didn't state that there is no way to solve the issue. Also, I added "unless the difference is minor".
Re: DCT: D compiler as a collection of libraries
On 05/14/2012 05:00 PM, Roman D. Boiko wrote: On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote: I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied. Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose? Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
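Timon's "circular buffer of value-type tokens" can be sketched like this (my toy types, not code from DCT or any actual parser): a parser rarely needs more than a few tokens of context, so a fixed-size ring of token values avoids keeping or reallocating the whole stream.

```d
// Minimal token type for the sketch.
struct Token
{
    size_t startIndex;
    string spelling;
}

// Fixed-capacity ring: pushing past capacity overwrites the oldest
// token, so memory use is constant regardless of file size.
struct TokenRing(size_t N)
{
    private Token[N] buf;
    private size_t head;  // index of the oldest retained token
    private size_t count; // number of valid tokens (<= N)

    void push(Token t)
    {
        buf[(head + count) % N] = t;
        if (count < N)
            ++count;
        else
            head = (head + 1) % N; // full: drop the oldest
    }

    // The i-th most recent token; back(0) is the last one pushed.
    ref Token back(size_t i)
    {
        assert(i < count);
        return buf[(head + count - 1 - i) % N];
    }
}
```

The trade-off debated in the thread is visible here: the ring is fast and allocation-free, but tokens that fall off the end are gone, so lookup-by-offset scenarios (autocompletion, on-demand line/column) need a separate structure.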
Re: DCT: D compiler as a collection of libraries
On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote: On 05/14/2012 05:00 PM, Roman D. Boiko wrote: Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose? Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing. I'd like to be able to do efficient token lookup based on its start index in the file (e.g., for autocompletion and calculating column/line numbers on demand). I also plan to have ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and preserving history. I don't worry much about keeping all tokens in memory, because information they contain is useful and needed for most scenarios. If, on the contrary, information is no longer used, it will be garbage collected. This is a significant redesign of current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. Alpha release for community review is planned on May 21, and design documentation in 10 days after that.
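The Token[N][] layout mentioned above can be sketched as follows. This is an illustrative guess at the idea, not DCT code (TokenStore and findByOffset are hypothetical names): tokens are appended into fixed-size chunks, so filled chunks never move in memory and could be shared between document versions, and since start indices are strictly increasing, lookup by file offset is a binary search.

```d
// Minimal token type for the sketch.
struct Token
{
    size_t startIndex;
    string spelling;
}

struct TokenStore(size_t N = 256)
{
    private Token[N][] chunks;
    private size_t count; // total number of tokens stored

    void put(Token t)
    {
        if (count % N == 0)
            chunks.length += 1; // start a new chunk; old chunks stay put
        chunks[count / N][count % N] = t;
        ++count;
    }

    ref Token opIndex(size_t i) { return chunks[i / N][i % N]; }

    // Index of the last token starting at or before `offset`
    // (e.g. "which token is under the cursor" for autocompletion).
    size_t findByOffset(size_t offset)
    {
        size_t lo = 0, hi = count; // invariant: answer is in [lo, hi)
        while (hi - lo > 1)
        {
            immutable mid = (lo + hi) / 2;
            if (this[mid].startIndex <= offset)
                lo = mid;
            else
                hi = mid;
        }
        return lo;
    }
}
```

Appending never relocates existing chunks (only the outer slice of chunk references grows), which is what makes the structure friendly to immutable, history-preserving designs.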
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote: I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.

Currently I think about making token a class instead of struct. A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

```d
// Represents a lexed token
struct Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling;   // characters from which this token has been lexed
    TokenKind kind;    // enum; each keyword and operator has a dedicated kind
    ubyte annotations; // meta information like whether a token is valid, or an
                       // integer literal is signed, long, hexadecimal, etc.
}
```

Making it a class would give several benefits:

* allow not to worry about allocating a big array of tokens. E.g., on a 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require a chunk of contiguous memory four times smaller if it were an array of class references, because each would consume only 8 bytes instead of 32.
* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kinds of extension may be needed)

* there would be no need to copy data from tokens into the AST; passing an object would be enough (again, copying 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead

It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also there is an additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other). These considerations are mostly about performance. I think there is also some impact on design, but I couldn't find anything significant (given that currently I see a token as merely a data structure without associated behavior). Could anybody suggest other pros and cons? Which option would you choose?
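The size arithmetic above can be checked directly with .sizeof. A minimal sketch (the TokenKind members are placeholders): on a 64-bit target the struct occupies 32 bytes (8 + 16 + 2 + 1, padded to pointer alignment), while an array of class references would store 8-byte references, at the cost of a heap allocation per token and an indirection on every field access.

```d
enum TokenKind : ushort { identifier, keyword, operator } // placeholder members

struct TokenS
{
    size_t startIndex; // 8 bytes on 64-bit
    string spelling;   // 16 bytes (pointer + length)
    TokenKind kind;    // 2 bytes
    ubyte annotations; // 1 byte; struct padded to pointer alignment
}

class TokenC
{
    size_t startIndex;
    string spelling;
    TokenKind kind;
    ubyte annotations;
}

// Holds on both 32- and 64-bit targets: the struct is 4 machine words;
// a class reference is 1 word (for a class type, .sizeof is the size of
// the reference, not of the instance on the heap).
static assert(TokenS.sizeof == 4 * (void*).sizeof);
static assert(TokenC.sizeof == (void*).sizeof);
```

Note that the instance behind each TokenC reference still holds the same four fields plus the object header, so the class version trades contiguous-allocation pressure for higher total memory and pointer chasing, which is exactly the trade-off discussed in the replies.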
Re: DCT: D compiler as a collection of libraries
On Monday, 14 May 2012 at 15:00:37 UTC, Roman D. Boiko wrote: Could anybody suggest other pros and cons? Which option would you choose? Further discussion on this topic (struct vs class) is at http://forum.dlang.org/thread/asdrqlaydzcdpqwsb...@forum.dlang.org
Re: DCT: D compiler as a collection of libraries
On Monday, 14 May 2012 at 16:30:21 UTC, deadalnix wrote: Le 14/05/2012 17:00, Roman D. Boiko a écrit : Making it a class would give several benefits: * allow not to worry about allocating a big array of tokens. E.g., on a 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require a chunk of contiguous memory four times smaller if it were an array of class references, because each would consume only 8 bytes instead of 32. Why is this a benefit? An NNTP error: 400 load at 23.60, try later prevented me from answering :) Because it might be difficult to find a big chunk of available memory (3.5M vs 14M for this particular case).

* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kinds of extension may be needed) I'm pretty sure that D's tokens will not change that much. If the need isn't identified right now, I'd advocate for YAGNI. Agree.

* there would be no need to copy data from tokens into the AST; passing an object would be enough (again, copying 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead Yes, but now you add pressure on the GC and add indirections. I'm not sure it's worth it. It seems to me like a premature optimization. It looks so. Thanks.

It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also there is an additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other). These considerations are mostly about performance. I think there is also some impact on design, but I couldn't find anything significant (given that currently I see a token as merely a data structure without associated behavior). Could anybody suggest other pros and cons?
Which option would you choose? You are over engineering the whole stuff. I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.
Re: DCT: D compiler as a collection of libraries
On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote: You are over engineering the whole stuff. I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals. What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the compiler mode... and there actually is an error, then just lex it again, to produce a pretty error message ;)

    try { lex(mode.compiler); }
    catch { lex(mode.ide); } // calculates column etc., whatever info it needs
Re: DCT: D compiler as a collection of libraries
On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote: On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote: You are over engineering the whole stuff. I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals. What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the compiler mode... and there actually is an error, then just lex it again, to produce a pretty error message ;) try { lex(mode.compiler); } catch { lex(mode.ide); // calculates column etc. what ever info it needs. } So far it doesn't seem expensive to tolerate errors and proceed. The only thing I miss is some sort of specification when to stop including characters into token spelling and start a new token. I don't think I'll use backtracking for that in the nearest future. If I did, I would really separate part of lexer and provide two implementations for that part. Given this, accepting errors and moving on simply requires some finite set of rules about boundaries of invalid tokens. I also think structural code editing concepts will help here, but I didn't do any research on this topic yet. The problem with multiple lexer implementations is that it might become much more difficult to maintain them.
Re: DCT: D compiler as a collection of libraries
On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote: On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote: What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the compiler mode... and there actually is an error, then just lex it again, to produce a pretty error message ;) ... The problem with multiple lexer implementations is that it might become much more difficult to maintain them. Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity goes from two factors: different implementations and how to combine them. I try to avoid this.
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 05:41:16 UTC, Jonathan M Davis wrote: It's great that you're working on this. We definitely need more of this sort of stuff. However, I would point out that I'm not sure that it will be acceptable for Phobos (it _may_ be, but it's not quite what we've been looking for). There are two types of lexers/parsers that we've been looking to add to Phobos. 1. A direct port of the lexer from dmd's C++ front-end. The API would be D-ified to use ranges, but the internals would be almost identical to the C++ front-end so that porting changes back and forth would be very easy. It also would make it easier to replace the C++ front-end with a D one later if the D one is very close to the C++ one. 2. A lexer/parser generator using templates and/or CTFE to generate a lexer/parser of whatever language using a BNF or EBNF grammar - probably similar to what Philippe Sigaud has done to generate a parser using PEGs. Such could be used to generate a lexer/parser for D or any other language. What you seem to be doing is creating a D-specific parser from scratch, which is great, but it's not quite what we've been looking to add to Phobos. That doesn't necessarily mean that we can't or won't add it to Phobos once it's complete (especially if nothing else gets done first), but it also may not be acceptable, because it's not either #1 or #2. That would have to be decided once it's done and ready to be reviewed for inclusion in Phobos. I'm not trying to discourage you at all. I'm just pointing out that what you're doing isn't quite what we're looking for. Regardless, it _would_ be interesting to be able to compare implementations of #1 and #2 against what you're doing and see which performs better. - Jonathan M Davis Yes, my project has different goals from #1 or #2. The primary need I'm trying to address is making development of tools for D development much easier. Tools are needed for D to become more widely used in practice. 
Secondary (a nice-to-have) would be to build a compiler on top of that. I plan to do both, but I don't intend to be backwards-compatible with DMD backend. I hope that at least part of my code will either influence or be reused in the reference D compiler (whatever that will be in the future). It is important to have a reference frontend implementation which can be used in many tools, otherwise a lot of effort would be completely wasted and no tool would comply with the D specification and compilers.
Re: DCT: D compiler as a collection of libraries
On 5/12/12 12:17 PM, Roman D. Boiko wrote: On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote: As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to the token in the Lexer#lex() method) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied. Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem like an unreasonable amount of work. Any other volunteers for this? Sure, I'll do it and provide some benchmarks. Thanks for creating the issue.
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 13:50, Ary Manzana wrote: On 5/11/12 4:22 PM, Roman D. Boiko wrote: What about line and column information? Indices of the first code unit of each line are stored inside the lexer and a function will compute Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by a reverse walk till the previous end of line, etc. It will be possible to calculate locations either taking special token sequences (e.g., #line 3 ab/c.d) into account or discarding them. But then how do you efficiently compute line numbers (if a reverse walk is at all efficient)? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist. SDC uses struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a freelist or that sort of thing. When the token is discarded, the location is discarded too. The same goes for AST nodes.
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 13:24:11 UTC, deadalnix wrote: On 11/05/2012 13:50, Ary Manzana wrote: Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist. SDC uses struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a freelist or that sort of thing. When the token is discarded, the location is discarded too. The same goes for AST nodes. 1. If a struct is a field of a heap-allocated object, it will be allocated and garbage-collected along with that object. Only if it exists solely on the stack (i.e., in a method body) is the GC not used. 2. If a struct is often allocated and deallocated, a free list would allow skipping its initialization, which is especially important if the struct is large. This is true for both heap- and stack-allocated data.
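The point about value-type fields can be shown directly: embedding a struct in a class costs no extra allocation, because the struct is stored inline in the containing object. This is a generic illustration, not SDC's or DCT's actual declarations:

```d
struct Location
{
    uint line;
    uint column;
}

// The Location is part of the object's memory block: allocating
// the AstNode allocates the Location too, and the GC collects both
// together. There is no separate heap cell for the struct.
class AstNode
{
    Location loc;
}

// In a struct-typed token, the Location lives wherever the Token
// lives: on the stack, in an array, or inline in another aggregate.
struct Token
{
    Location loc;
}
```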
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote: 1. If a struct is a field of a heap-allocated object, it will be allocated and garbage-collected along with that object. Only if it exists solely on the stack (i.e., in a method body) is the GC not used. As far as I can tell, it won't be allocated on its own, since it is stored in the garbage-collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object. Sorry for the confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack-allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context (I don't do many allocations of Location), nor in a majority of other contexts. As is often the case, performance optimization should be performed only when benchmarks show it is needed.
Re: DCT: D compiler as a collection of libraries
On 05/12/2012 11:08 PM, Roman D. Boiko wrote: On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote: 1. If a struct is a field of heap allocated object, it will be allocated and garbage collected. Only if it only exists on stack (i.e., in method body), GC is not used. As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object. Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed. Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 22:19:55 UTC, Timon Gehr wrote: On 05/12/2012 11:08 PM, Roman D. Boiko wrote: On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote: 1. If a struct is a field of heap allocated object, it will be allocated and garbage collected. Only if it only exists on stack (i.e., in method body), GC is not used. As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object. Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed. Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient? Do you mean something like what I summarized from Ary: http://forum.dlang.org/post/pggkobonzmgdwjigy...@forum.dlang.org ? If that is the case, could you please add your concerns to the dedicated issue: https://github.com/roman-d-boiko/dct/issues/15 ? Otherwise, if by overcomplicating you meant using a free list, I'd like to clarify that I don't plan to introduce it yet because I see no reason to use it in my context. Thanks
DCT: D compiler as a collection of libraries
There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least the front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feedback here. A dedicated project web-site will be created later.
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 10:01, Roman D. Boiko wrote: There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least the front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feedback here. A dedicated project web-site will be created later. (Re-posting here) A couple of questions: * What's the state of the lexer * Does it convert numerical literals and similar to their actual values * Does it retain full source information * Is there an example we can look at to see how the API is used * Does it have a range based interface -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote: (Re-posting here) A couple of questions: * What's the sate of the lexer I consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, implementation handles almost all possible cases. Because of rewrites it is most likely broken at this moment, I'm going to fix it ASAP (in a day or two). Lexer will provide a random-access range of tokens (this is not done yet). Each token contains: * start index (position in the original encoding, 0 corresponds to the first code unit after BOM), * token value encoded as UTF-8 string, * token kind (e.g., token.kind = TokenKind.Float), * possibly enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real) * Does it convert numerical literals and similar to their actual values It is planned to add a post-processor for that as part of parser, please see README.md for some more details. * Does it retain full source information Yes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespaces. Information about code unit indices in the original encoding is preserved, too. * Is there an example we can look at to see how the API is used TBD soon (see Roadmap in the readme file) * Does it have a range based interface Yes, this is what I consider one of its strengths.
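Pieced together from the description above, a token might look roughly like this. The field and enum names are guesses based on the post (`TokenKind.Float`, `FloatAnnotation.Hex`), not DCT's actual declarations:

```d
enum TokenKind : ubyte { Identifier, Float /* , ... */ }

// Annotations are bit flags, so several can be combined with |.
enum FloatAnnotation : uint { None = 0, Hex = 1, Real = 2 }

struct Token
{
    size_t startIndex; // position in the original encoding; 0 corresponds
                       // to the first code unit after the BOM
    string value;      // spelling, re-encoded as UTF-8 (whitespace tokens too)
    TokenKind kind;    // e.g. TokenKind.Float
    uint annotations;  // e.g. FloatAnnotation.Hex | FloatAnnotation.Real
}
```

Notably absent are line and column fields; as discussed elsewhere in the thread, locations are computed on demand from the start index.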
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 10:01, Roman D. Boiko wrote: There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least the front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feedback here. A dedicated project web-site will be created later. I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler should be built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's likely this would be just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 10:58, Roman D. Boiko wrote: On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote: (Re-posting here) A couple of questions: * What's the sate of the lexer I consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, implementation handles almost all possible cases. Because of rewrites it is most likely broken at this moment, I'm going to fix it ASAP (in a day or two). I see. Lexer will provide a random-access range of tokens (this is not done yet). Ok. Each token contains: * start index (position in the original encoding, 0 corresponds to the first code unit after BOM), * token value encoded as UTF-8 string, * token kind (e.g., token.kind = TokenKind.Float), * possibly enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real) What about line and column information? * Does it convert numerical literals and similar to their actual values It is planned to add a post-processor for that as part of parser, please see README.md for some more details. Isn't that a job for the lexer? * Does it retain full source information Yes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespaces. Information about code unit indices in the original encoding is preserved, too. That's sounds good. * Is there an example we can look at to see how the API is used TBD soon (see Roadmap in the readme file) * Does it have a range based interface Yes, this is what I consider one of its strengths. I see. Thanks. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 11.05.2012 11:02, Jacob Carlborg wrote: For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much. Or a pure D version of it with these features: very fast parsing/lexing (there needs to be a benchmark environment from the very start), and easy to extend and fix. I think that is what Walter wants from a D-ish frontend in the first place.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote: On 2012-05-11 10:58, Roman D. Boiko wrote: Each token contains: * start index (position in the original encoding, 0 corresponds to the first code unit after BOM), * token value encoded as UTF-8 string, * token kind (e.g., token.kind = TokenKind.Float), * possibly enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real) What about line and column information? Indices of the first code unit of each line are stored inside the lexer and a function will compute Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by a reverse walk till the previous end of line, etc. It will be possible to calculate locations either taking special token sequences (e.g., #line 3 ab/c.d) into account or discarding them. * Does it convert numerical literals and similar to their actual values It is planned to add a post-processor for that as part of the parser, please see README.md for some more details. Isn't that a job for the lexer? That might be done in the lexer for efficiency reasons (to avoid lexing the token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed. Also I don't think that performance would be improved, given the ratio of the number of literals to the total number of tokens and the need to store additional information per token if it is done in the lexer. I will elaborate on that later.
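The on-demand scheme described above amounts to a binary search over the stored line-start indices, which avoids a linear reverse walk. A minimal sketch in D follows; the names (`LineIndex`, `locate`) are illustrative, not DCT's actual API:

```d
import std.range : assumeSorted;

struct Location
{
    size_t line;   // 1-based
    size_t column; // 1-based, in code units
}

struct LineIndex
{
    // Index of the first code unit of each line, in ascending order;
    // lineStarts[0] == 0 for the first line.
    size_t[] lineStarts;

    /// O(log n) lookup instead of storing line/column in every token.
    Location locate(size_t index)
    {
        auto sorted = assumeSorted(lineStarts);
        // Count the line starts at or before `index`:
        // that count is the 1-based line number.
        auto line = sorted.lowerBound(index + 1).length;
        auto column = index - lineStarts[line - 1] + 1;
        return Location(line, column);
    }
}
```

For example, with lineStarts = [0, 10, 25], an index of 12 falls on line 2, column 3. Handling #line directives would layer a second mapping on top of this, translating physical lines to logical ones.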
Re: DCT: D compiler as a collection of libraries
On 11.05.2012 11:23, Roman D. Boiko wrote: On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote: does the parser/lexer allow half-finished syntax parsing? for being usable in an IDE for syntax-highlighting while coding? That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot). Try to syntax-highlight while coding - that's the scenario; parts of the source code aren't fully valid while writing.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote: If think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler should be built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's easy this would be just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater change if the frontend was a port of the DMD frontend and not changed too much. My plan is to create frontend that would be much better than existing, both in design and implementation. I decided to work on this full time for several months. Front end will not produce the same data as DMD front end does, so most likely it will not be suitable to replace existing C++ implementation. But since no information will be lost I will consider creating a separate project which would build output compatible with existing (not identical, I don't think that would be feasible).
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 09:21:29 UTC, dennis luehring wrote: Am 11.05.2012 11:02, schrieb Jacob Carlborg: For this to happen, for Walter to start using this, I think there would be a greater change if the frontend was a port of the DMD frontend and not changed too much. or a pure D version of it with the features: -very fast in parsing/lexing - there need to be a benchmark enviroment from the very start Will add that to May roadmap. -easy to extend and fix and think that is what walter wants from an D-ish frontend in the first place I plan to go this way.
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 11:22, Roman D. Boiko wrote: What about line and column information? Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 ab/c.d), or discarding them. Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct? * Does it convert numerical literals and similar to their actual values It is planned to add a post-processor for that as part of parser, please see README.md for some more details. Isn't that a job for the lexer? That might be done in lexer for efficiency reasons (to avoid lexing token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed. That might be the case. But I don't think it belongs in the parser. Also I don't think that performance would be improved given the ratio of number of literals to total number of tokens and the need to store additional information per token if it is done in lexer. I will elaborate on that later. Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming value to lexeme/spelling/representation, or something like that, and then name the new property value. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 11:31, Roman D. Boiko wrote: My plan is to create frontend that would be much better than existing, both in design and implementation. I decided to work on this full time for several months. That's good news. Front end will not produce the same data as DMD front end does, so most likely it will not be suitable to replace existing C++ implementation. That's too bad. But since no information will be lost I will consider creating a separate project which would build output compatible with existing (not identical, I don't think that would be feasible). Ok. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 11:23, Roman D. Boiko wrote: On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote: does the parser/lexer allow half-finished syntax parsing? for being useable in an IDE for syntax-highlighting while coding? That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot). Example from TextMate: * void - keyword is colored * module main - nothing colored until I type a semicolon * module main; - keyword and main is colored (differently) * void foo - keyword is colored * void foo ( - keyword and foo is colored (differently) * struct F - keyword and F is colored (differently) * Literals are always colored. * User-defined constants are always colored. It's basically any token that is all upper case -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 11.05.2012 11:33, Roman D. Boiko wrote: -very fast in parsing/lexing - there needs to be a benchmark environment from the very start Will add that to the May roadmap. Are you using slices to prevent copying everything around? The parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time, because speed matters very much here.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 09:28:36 UTC, dennis luehring wrote: On 11.05.2012 11:23, Roman D. Boiko wrote: On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote: does the parser/lexer allow half-finished syntax parsing? for being usable in an IDE for syntax-highlighting while coding? That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot). Try to syntax-highlight while coding - that's the scenario; parts of the source code aren't fully valid while writing. It depends on the IDE. For example, Sublime Text (and most likely TextMate) uses regexes for syntax highlighting. That makes it impossible to use for D in some scenarios (like nested block comments). Any IDE that provides an API for coloring will get correct information if the code is valid. If it is not valid, it is only possible to handle specific kinds of errors; in general there will always be cases when highlighting (or some parsing / semantic analysis information) is incorrect. I aim to handle common errors gracefully. In practice, I think, it is possible to handle 99% of problems, but this requires a lot of design / specification work.
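A concrete case where regex-based highlighters fall short: D's /+ +/ block comments nest to arbitrary depth, and matching balanced delimiters is beyond what a regular language can express. The snippet below is valid D that a naive pattern would mis-highlight:

```d
/+ outer comment
   /+ nested comment +/
   still inside the outer comment: a highlighter that stops at the
   first closing delimiter would wrongly color this line as code
+/
void main() {}
```

A real lexer tracks the nesting depth with a counter, which is exactly the kind of state a single regular expression cannot keep.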
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote: On 2012-05-11 11:22, Roman D. Boiko wrote: It will be possible to calculate locations either taking special token sequences (e.g., #line 3 ab/c.d) into account or discarding them. Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct? There is a method for that in the Lexer interface; to me it looks like it belongs there and not in the token. A version accepting a token and producing a pair of start/end Locations will be added. * Does it convert numerical literals and similar to their actual values It is planned to add a post-processor for that as part of the parser, please see README.md for some more details. Isn't that a job for the lexer? That might be done in the lexer for efficiency reasons (to avoid lexing the token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed. That might be the case. But I don't think it belongs in the parser. I will provide example code and a dedicated post later to illustrate my point. Also I don't think that performance would be improved, given the ratio of the number of literals to the total number of tokens and the need to store additional information per token if it is done in the lexer. I will elaborate on that later. Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming value to lexeme/spelling/representation, or something like that, and then name the new property value. I was going to rename value, but couldn't find a nice term. Thanks for your suggestions! As for the property with the strongly typed literal value, currently I plan to put it into the AST.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 10:40:43 UTC, Roman D. Boiko wrote: On Friday, 11 May 2012 at 10:01:17 UTC, dennis luehring wrote: Am 11.05.2012 11:33, schrieb Roman D. Boiko: -very fast in parsing/lexing - there need to be a benchmark enviroment from the very start Will add that to May roadmap. are using slices for prevent coping everything around? the parser/lexer need to be as fast as the original one - maybe even faster - else it won't replace walters at any time - because speed does matter here very much I tried optimizing code when it didn't complicate design. And more optimizations will be added later. It would be interesting to have benchmarks for comparing performance, but I don't plan to branch and edit DMD front end to prepare it for benchmarking (any time soon). Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D)
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 11:31, Roman D. Boiko wrote: On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote: I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler should be built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's likely this would be just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much. My plan is to create a frontend that would be much better than the existing one, both in design and implementation. I decided to work on this full time for several months. If this is your plan, I'm very happy. We really should discuss to avoid duplicating effort as we are doing right now.
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 12:01, dennis luehring wrote: On 11.05.2012 11:33, Roman D. Boiko wrote: -very fast in parsing/lexing - there needs to be a benchmark environment from the very start Will add that to the May roadmap. Are you using slices to prevent copying everything around? The parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time, because speed matters very much here. The best optimization is the one that brings your code from a non-working state to a working state.
Re: DCT: D compiler as a collection of libraries
On 5/11/12 4:22 PM, Roman D. Boiko wrote: What about line and column information? Indices of the first code unit of each line are stored inside the lexer, and a function will compute the Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand, and is not needed frequently. So the column is calculated by a reverse walk to the previous end of line, etc. It will be possible to calculate Locations either taking into account special token sequences (e.g., #line 3 ab/c.d) or discarding them. But then how do you efficiently (if a reverse walk is any efficient) compute line numbers? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not wasting a lot of memory: just 8 bytes more for each token in the freelist.
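The freelist idiom Ary links to can be sketched as follows. This is a minimal illustration of the dlang.org/memory.html#freelists pattern, not DCT's actual Token type; the fields and method names here are assumed:

```d
// Minimal sketch of the freelist idiom from dlang.org/memory.html#freelists.
// The Token fields are illustrative, not DCT's actual layout.
class Token
{
    Token next;   // freelist link, only meaningful while the token is free
    size_t index; // start of the token in the source

    static Token freelist;

    // Reuse a discarded instance if one is available.
    static Token alloc()
    {
        if (freelist !is null)
        {
            auto t = freelist;
            freelist = t.next;
            return t;          // recycled, no new GC allocation
        }
        return new Token;
    }

    // Return this instance to the freelist for later reuse.
    void free()
    {
        next = freelist;
        freelist = this;
    }
}
```

A consumer that processes one token at a time would call `Token.alloc()` before lexing a token and `free()` once it is done with it, so the buffer of live tokens stays tiny.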
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote: Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D) Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 11:50:14 UTC, deadalnix wrote: On 11/05/2012 11:31, Roman D. Boiko wrote: My plan is to create a frontend that would be much better than the existing one, both in design and implementation. I decided to work on this full time for several months. If this is your plan, I'm very happy. We really should discuss to avoid duplicating effort as we are doing right now. I have started similar work, and it is currently more advanced than DCT is. I kind of decapitated SDC to do it. Maybe we should join efforts instead of doing two separate projects? That makes sense. Is it possible to switch SDC to the Boost license? I'm trying to keep it for all DCT code.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote: On 2012-05-11 12:35, Roman D. Boiko wrote: On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote: Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct? There is a method for that in the Lexer interface; to me it looks like it belongs there and not in the token. A version accepting a token and producing a pair of start/end Locations will be added. Found it now, calculateFor. I'm not sure if it's the most intuitive name though. I get the feeling: calculate what? calculateLocation was the original name, but I don't like repeating the return type in method names, so I decided to change it in a way that makes it clear that another renaming is needed ;) Any suggestions? That might be the case. But I don't think it belongs in the parser. I will provide example code and a dedicated post later to illustrate my point. I guess I'll have to wait for that then :) I'll try to do that ahead of the roadmap, it is important.
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 14:04, Roman D. Boiko wrote: That makes sense. Is it possible to switch SDC to the Boost license? I'm trying to keep it for all DCT code. Let me do a clean package of my code this weekend. For now it is mixed with SDC source code, which was enough as I was working alone, but isn't quite right for people to join the effort.
Re: DCT: D compiler as a collection of libraries
On 11.05.2012 13:50, Ary Manzana wrote: On 5/11/12 4:22 PM, Roman D. Boiko wrote: What about line and column information? Indices of the first code unit of each line are stored inside the lexer, and a function will compute the Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand, and is not needed frequently. So the column is calculated by a reverse walk to the previous end of line, etc. It will be possible to calculate Locations either taking into account special token sequences (e.g., #line 3 ab/c.d) or discarding them. But then how do you efficiently (if a reverse walk is any efficient) compute line numbers? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not wasting a lot of memory: just 8 bytes more for each token in the freelist. It would be better to add something like a ColumnLine-collector thing that, if applied, is able to hold this information, so there is no waste if it's not needed. I think there are several parts that can work like that.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 11:55:47 UTC, Roman D. Boiko wrote: On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote: Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D) Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :) Mono-D is written in C#, VisualD uses D -- so it actually should be easier to integrate into the second one :)
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote: From the beginning, I'm thinking of AST macros using CTFE. Could you please elaborate? I plan to strictly follow the published D specification. Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included in the specification yet * the specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from the specification * a change is compatible with the specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce a post-processor after the lexer) Some more exceptions might be added later, but the goal is to minimize differences.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote: Mono-D is written in C#, VisualD uses D -- so it actually should be easier to integrate into the second one :) Sorry, I meant D-IDE. But there might also be a reason to consume the D implementation from C#. I would happily collaborate to make it usable for that. I have nothing against VisualD, just didn't think about it yet.
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 14:14, Roman D. Boiko wrote: On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote: From the beginning, I'm thinking of AST macros using CTFE. Could you please elaborate? I plan to strictly follow the published D specification. Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included in the specification yet * the specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from the specification * a change is compatible with the specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce a post-processor after the lexer) Some more exceptions might be added later, but the goal is to minimize differences. More explicitly, the goal isn't to implement a different language than D. Simply doing the parsing/AST building in a way that would allow AST macros to be introduced later. Your 3 points seem reasonable. Mine were: * Implement something that can parse D as it is currently defined/implemented (if dmd's behavior and the spec differ, it is handled on a per-case basis). * Discard all deprecated features. Don't even try to implement them, even if dmd supports them currently. * Do the parsing in several steps to allow different tools to work with it. I think we both have very compatible goals. Let me do a clean package of it and write about the design goals. I don't have that much time right now to do it; I will this weekend.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 11:50:29 UTC, Ary Manzana wrote: On 5/11/12 4:22 PM, Roman D. Boiko wrote: What about line and column information? Indices of the first code unit of each line are stored inside the lexer, and a function will compute the Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand, and is not needed frequently. So the column is calculated by a reverse walk to the previous end of line, etc. It will be possible to calculate Locations either taking into account special token sequences (e.g., #line 3 ab/c.d) or discarding them. But then how do you efficiently (if a reverse walk is any efficient) compute line numbers? I borrowed the trick from Brian Schott: tokens will be stored as a sorted array. Sorting is done based on their indices in the source code (they are naturally sorted immediately, no need to run any algorithm for that). The same is true for information about line start indices. Now, given an index, we can do a binary search in such sorted arrays and get the corresponding line number or token (whatever we need). Walking along UTF code points starting from the index which corresponds to the beginning of the line, and taking tab characters into account, it is possible to calculate the column reasonably fast. My assumption is that column numbers are needed for use cases like error reporting or other messages for the user, and thus there is no need to pre-calculate them for each token (or character). For such use cases efficiency should really be enough. In general, I precalculate and store only information which is needed frequently. Other information is evaluated lazily. Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance.
If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). But the information from tokens is not discarded (at least, this is the requirement for DCT). So my choice is to keep it in tokens instead of converting it to some other form. That also implies that Token is free from any auxiliary information which is not necessary for common use cases. So adding line and column information is not wasting a lot of memory: just 8 bytes more for each token in the freelist. It is inefficient to calculate it during lexing, the scanning algorithm becomes less robust and more complicated, and in most cases we won't need it anyway. I will add a hook to plug in such functionality (when needed) once I know why it is necessary.
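The on-demand scheme described above (binary search over stored line-start indices, then a forward walk from the line start) can be sketched roughly like this. The names and the 1-based Location struct are assumptions for illustration, not DCT's actual API:

```d
import std.range : assumeSorted;

// Illustrative 1-based location; not DCT's actual type.
struct Location { size_t line; size_t column; }

// lineStarts holds the sorted index of the first code unit of each line,
// as recorded by the lexer while scanning.
Location locate(const(size_t)[] lineStarts, size_t index)
{
    // Binary search: the number of line starts <= index is the
    // 1-based line number. lowerBound(index + 1) keeps exactly
    // the elements strictly less than index + 1, i.e. <= index.
    auto line = assumeSorted(lineStarts).lowerBound(index + 1).length;

    // Code-unit offset from the line start. A real implementation would
    // walk UTF code points and expand tab characters to get a true column.
    auto column = index - lineStarts[line - 1] + 1;

    return Location(line, column);
}
```

Each lookup is O(log n) in the number of lines, which is the trade-off being debated in this thread: nothing is paid per character during lexing, and the cost is only incurred when a Location is actually requested.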
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 12:30:01 UTC, deadalnix wrote: Your 3 points seem reasonable. Mine were: * Implement something that can parse D as it is currently defined/implemented (if dmd's behavior and the spec differ, it is handled on a per-case basis). All differences should be documented. * Discard all deprecated features. Don't even try to implement them, even if dmd supports them currently. Yes, I forgot this one. Actually, I didn't discard imaginary floats, because I don't know what exactly will be done instead and it is easy to keep them. * Do the parsing in several steps to allow different tools to work with it. I was thinking about a pool of analysers, each of which would add some information. This could be more than is needed for semantic analysis. An analyser would be created each time information (e.g., some indexing) is needed for a particular use case. I think we both have very compatible goals. Let me do a clean package of it and write about the design goals. I don't have that much time right now to do it; I will this weekend. I saw your ast_as_lib branch of SDC, but didn't dig deeper.
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 14:07, Roman D. Boiko wrote: On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote: Found it now, calculateFor. I'm not sure if it's the most intuitive name though. I get the feeling: calculate what? calculateLocation was the original name, but I don't like repeating the return type in method names, so I decided to change it in a way that makes it clear that another renaming is needed ;) Any suggestions? My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought. I guess I'll have to wait for that then :) I'll try to do that ahead of the roadmap, it is important. Cool. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote: On 2012-05-11 14:07, Roman D. Boiko wrote: On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote: Found it now, calculateFor. I'm not sure if it's the most intuitive name though. I get the feeling: calculate what? calculateLocation was the original name, but I don't like repeating the return type in method names, so I decided to change it in a way that makes it clear that another renaming is needed ;) Any suggestions? My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought. What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); The problem with placing it in Token is that Token should not know anything about the source as a whole.
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 14:14, Roman D. Boiko wrote: On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote: From the beginning, I'm thinking of AST macros using CTFE. Could you please elaborate? I plan to strictly follow the published D specification. That won't be easy; nobody knows what the specification is. TDPL, DMD or dlang.org? Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included in the specification yet * the specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from the specification * a change is compatible with the specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce a post-processor after the lexer) Some more exceptions might be added later, but the goal is to minimize differences. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 15:01, Roman D. Boiko wrote: What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); That is better, although I would prefer to pass in a token (assuming that is where the index is declared). Then it would be an implementation detail that the index is used to get the location. Another option would be to turn it around: struct/class Location { Location find/locate (Token token); } The problem with placing it in Token is that Token should not know anything about the source as a whole. I see. For convenience there could be properties defined that just forward the call to some other function. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 15:01, Roman D. Boiko wrote: On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote: On 2012-05-11 14:07, Roman D. Boiko wrote: On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote: Found it now, calculateFor. I'm not sure if it's the most intuitive name though. I get the feeling: calculate what? calculateLocation was the original name, but I don't like repeating the return type in method names, so I decided to change it in a way that makes it clear that another renaming is needed ;) Any suggestions? My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought. What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); The problem with placing it in Token is that Token should not know anything about the source as a whole. I don't really see the benefit of this. You are trading an O(1) operation for an O(log(n)) one. It can only be faster in specific cases, which should be measured. It is likely to be slower on compiling erroneous code or when used for something else than compiling, and likely to be insignificant when compiling correct code (I suspect the performance win is negligible compared to the time required to build a fully operational executable). You are overcomplicating things for little to no benefit.
Re: DCT: D compiler as a collection of libraries
? My guess is that the spec is TDPL + TDPL errata. dlang.org should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org as people notice it's not in sync with TDPL. If TDPL doesn't cover it, then the community should review it. On Fri, May 11, 2012 at 3:17 PM, Jacob Carlborg d...@me.com wrote: On 2012-05-11 14:14, Roman D. Boiko wrote: On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote: From the beginning, I'm thinking of AST macros using CTFE. Could you please elaborate? I plan to strictly follow the published D specification. That won't be easy; nobody knows what the specification is. TDPL, DMD or dlang.org? Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included in the specification yet * the specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from the specification * a change is compatible with the specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce a post-processor after the lexer) Some more exceptions might be added later, but the goal is to minimize differences. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote: On 11/05/2012 15:01, Roman D. Boiko wrote: The problem with placing it in Token is that Token should not know anything about the source as a whole. I don't really see the benefit of this. You are trading an O(1) operation for an O(log(n)) one. It can only be faster in specific cases, which should be measured. It would be interesting to see benchmarks. But anyway, log(1M) is just 20, and I haven't seen any source code with 1M lines :). The main point is design, not performance. I'm trying to build a minimal design that handles common use cases very well, and others well enough. I don't see why a Location would be needed for every Token. It is likely to be slower on compiling erroneous code or when used for something else than compiling, and likely to be insignificant when compiling correct code (I suspect the performance win is negligible compared to the time required to build a fully operational executable). IMO, it is unrelated. You are overcomplicating things for little to no benefit. Well, it depends... However, I will provide a design rationale document this month.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 13:25:53 UTC, Jacob Carlborg wrote: On 2012-05-11 15:01, Roman D. Boiko wrote: What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); That is better, although I would prefer to pass in a token (assuming that is where the index is declared). Then it would be an implementation detail that the index is used to get the location. It is also what I was planning to do after I posted my last suggestion. Unless I discover some conceptual problem with that, which is unlikely. Another option would be to turn it around: struct/class Location { Location find/locate (Token token); } The problem with placing it in Token is that Token should not know anything about the source as a whole. I see. For convenience there could be properties defined that just forward the call to some other function. Both these cases would introduce a circular dependency between the Lexer and its output data (Location or Token). It is possible to break such a dependency by complicating things even more, or to live with it. But I don't like circular dependencies when there is no compelling reason to introduce them.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 13:30:49 UTC, Rory McGuire wrote: ? My guess is that the spec is TDPL + TDPL errata. dlang.org should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org as people notice it's not in sync with TDPL. If TDPL doesn't cover it, then the community should review it. Whenever I'm in doubt, I try to analyze both TDPL and dlang.org, and also consult the DMD sources. But in general, when I write specification I mean dlang.org. I agree that it should be maintained, but I understand that it is difficult to do. I'll try to summarize the problems I discover (I have already found many), but it is a quite time-consuming activity.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote: On 11/05/2012 15:01, Roman D. Boiko wrote: The problem with placing it in Token is that Token should not know anything about the source as a whole. I don't really see the benefit of this. You are trading an O(1) operation for an O(log(n)) one. It can only be faster in specific cases, which should be measured. Technically, I'm trading the N*O(1) operations needed to track line and column while consuming each character for M*O(log(n)) operations when calculating them on demand. N = number of characters, n = number of lines and M = number of actual usages of Location. My assumption is that M << N (M is much smaller than N).
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 12:20:27 UTC, Roman D. Boiko wrote: On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote: Mono-D is written in C#, VisualD uses D -- so it actually should be easier to integrate into the second one :) Sorry, I meant D-IDE. But there might also be a reason to consume the D implementation from C#. I would happily collaborate to make it usable for that. Oops, D-IDE is also C#...
Re: DCT: D compiler as a collection of libraries
On 2012-05-11 15:30, Rory McGuire wrote: ? My guess is that the spec is TDPL + TDPL errata. dlang.org should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org as people notice it's not in sync with TDPL. If TDPL doesn't cover it, then the community should review it. None of these are in sync. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 16:02, Roman D. Boiko wrote: On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote: On 11/05/2012 15:01, Roman D. Boiko wrote: The problem with placing it in Token is that Token should not know anything about the source as a whole. I don't really see the benefit of this. You are trading an O(1) operation for an O(log(n)) one. It can only be faster in specific cases, which should be measured. Technically, I'm trading the N*O(1) operations needed to track line and column while consuming each character for M*O(log(n)) operations when calculating them on demand. N = number of characters, n = number of lines and M = number of actual usages of Location. My assumption is that M << N (M is much smaller than N). N can easily be the number of tokens.
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 14:45:45 UTC, Jacob Carlborg wrote: On 2012-05-11 15:30, Rory McGuire wrote: ? My guess is that the spec is TDPL + TDPL errata. dlang.org should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org as people notice it's not in sync with TDPL. If TDPL doesn't cover it, then the community should review it. None of these are in sync. I suggest choosing a dedicated place where it is possible to discuss differences and other problems with the specification. This forum?
Re: DCT: D compiler as a collection of libraries
On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote: On 11/05/2012 16:02, Roman D. Boiko wrote: Technically, I'm trading the N*O(1) operations needed to track line and column while consuming each character for M*O(log(n)) operations when calculating them on demand. N = number of characters, n = number of lines and M = number of actual usages of Location. My assumption is that M << N (M is much smaller than N). N can easily be the number of tokens. Yes, if you are looking for a token by its index, not for a location. E.g., for autocompletion it is needed to find the last token before the cursor location. But that is not related to location search. Also, please note that I oversimplified the formula for complexity. It also has other components. My reply was just an additional minor comment. The motivation is design, not performance. One additional reason for such separation: with my approach it is possible to calculate a Location either taking into account information from special token sequences, or ignoring it. How would you do that with eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot you in the leg :) Real-world usage (at least, anticipated scenarios) should be the basis for design. Sometimes this contradicts intuition and requires you to look at the problem from a different side.
Re: DCT: D compiler as a collection of libraries
Yip, but TDPL still has to take precedence because that is the one that Walter + Andrei + the community put the most focused effort into. On Fri, May 11, 2012 at 4:45 PM, Jacob Carlborg d...@me.com wrote: On 2012-05-11 15:30, Rory McGuire wrote: ? My guess is that the spec is TDPL + TDPL errata. dlang.org should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org as people notice it's not in sync with TDPL. If TDPL doesn't cover it, then the community should review it. None of these are in sync. -- /Jacob Carlborg
Re: DCT: D compiler as a collection of libraries
On Friday, May 11, 2012 23:00:15 Rory McGuire wrote: Yip, but TDPL still has to take precedence because that is the one that Walter + Andrei + the community put the most focused effort into. It doesn't necessarily. Each place where they differ is examined individually and decided on its own merits. There is no automatic TDPL says it's so, so it's so. There _are_ cases where things have even been explicitly changed after TDPL was released, making TDPL out-of-date for that particular item (e.g. strong vs weak purity is not discussed in TDPL at all, because it didn't exist at the time). In general, however, TDPL _is_ correct, and when there's a difference, it's simply a case of the documentation or compiler not having been updated accordingly. We try to avoid making any changes that would conflict with TDPL. - Jonathan M Davis
Re: DCT: D compiler as a collection of libraries
On 5/11/12 10:14 PM, Roman D. Boiko wrote: On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote: On 11/05/2012 16:02, Roman D. Boiko wrote: Technically, I'm trading the N*O(1) operations needed to track line and column while consuming each character for M*O(log(n)) operations when calculating them on demand. N = number of characters, n = number of lines and M = number of actual usages of Location. My assumption is that M << N (M is much smaller than N). N can easily be the number of tokens. Yes, if you are looking for a token by its index, not for a location. E.g., for autocompletion it is needed to find the last token before the cursor location. But that is not related to location search. Also, please note that I oversimplified the formula for complexity. It also has other components. My reply was just an additional minor comment. The motivation is design, not performance. One additional reason for such separation: with my approach it is possible to calculate a Location either taking into account information from special token sequences, or ignoring it. How would you do that with eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot you in the leg :) Real-world usage (at least, anticipated scenarios) should be the basis for design. Sometimes this contradicts intuition and requires you to look at the problem from a different side. As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to the token in the Lexer#lex() method.) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords.
How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.
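For contrast, the eager approach Ary describes can be sketched like this. The SourceRange here is hypothetical (DCT's actual type may differ), and the 1-based counting is an assumption:

```d
// Sketch of eager line/column tracking in the input range feeding the lexer.
// Names and 1-based counting are assumptions, not DCT's actual API.
struct SourceRange
{
    string text;
    size_t pos;
    size_t line = 1;
    size_t column = 1;

    @property bool empty() const { return pos >= text.length; }
    @property char front() const { return text[pos]; }

    // Update line/column as each code unit is consumed, as in Ary's snippet.
    void popFront()
    {
        if (text[pos] == '\n') { ++line; column = 1; }
        else                   { ++column; }
        ++pos;
    }
}
// The lexer would copy `line` and `column` into each token as it starts one.
```

This is the O(1)-per-character bookkeeping the thread is weighing against the on-demand binary search: a constant cost on every character consumed, in exchange for free location reads afterwards.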
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote: As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to the token in the Lexer#lex() method.) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied. Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be an unreasonable amount of work. Any other volunteers for this?
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote: As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to token on the Lexer#lex() method) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied. Summary of your points (I deliberately emphasize some of them more than you did; please correct me if I misinterpreted anything):
* storing location for each token is simple and cheap
* SourceRange is needed; an instance per token; it should be a class, not a struct
* syntax highlighting doesn't need to keep all tokens in memory
* but it must know the column and line of each token
* for this use case DCT has awful performance
* to build an AST, lines and columns are required
* information from tokens must be copied and possibly transformed in order to put it into the AST; after such transformation the tokens are not needed any more
Those are all valid points, given that we don't have any benchmarks and usage examples. The good news is that use cases like highlighting, parsing, autocompletion, etc.
are the core functionality for which DCT should be designed. So if it fails any of them, the design needs to be revisited. But I will need some time to thoroughly address each of these issues (either prove that it is not relevant, or fix the problem). I will definitely complete this work during May and will regularly report my progress here. In general, I tend to disagree with most of them, because I have already thought about each one and designed accordingly, but it is important to give them enough attention. Where I'm failing miserably at the moment is in providing actual usage code and benchmarks, so I'm going to focus on that. Thanks for your feedback!
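One design that would satisfy the highlighting use case is exposing the lexer as a lazy input range of tokens, so only the current token is alive at any time. A hypothetical sketch (none of these names are DCT's real API; `Lexer.tryLex` and `markSpan` are invented for illustration):

```d
enum TokenKind { keyword, identifier, operator, whitespace /* ... */ }

struct Token
{
    TokenKind kind;
    string text;
    size_t line, column;
}

// An input range of tokens: nothing is retained after popFront.
struct TokenRange
{
    Lexer lexer;     // hypothetical lexer type
    Token current;
    bool done;

    @property bool empty() const { return done; }
    @property Token front() const { return current; }
    void popFront() { done = !lexer.tryLex(current); }
}

// A keyword-only highlighter consumes the range token by token:
void highlightKeywords(TokenRange tokens)
{
    foreach (tok; tokens)
        if (tok.kind == TokenKind.keyword)
            markSpan(tok.line, tok.column, tok.text.length); // editor callback
}
```

With this shape the consumer decides whether tokens are stored (a parser might buffer a few for lookahead; a highlighter stores none), which directly answers the "how can I tell DCT to *not* store all tokens" objection.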
Re: DCT: D compiler as a collection of libraries
On Friday, May 11, 2012 10:01:28 Roman D. Boiko wrote: There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least the front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feedback here. A dedicated project web-site will be created later. It's great that you're working on this. We definitely need more of this sort of stuff. However, I would point out that I'm not sure that it will be acceptable for Phobos (it _may_ be, but it's not quite what we've been looking for). There are two types of lexers/parsers that we've been looking to add to Phobos. 1. A direct port of the lexer from dmd's C++ front-end. The API would be D-ified to use ranges, but the internals would be almost identical to the C++ front-end, so that porting changes back and forth would be very easy. It also would make it easier to replace the C++ front-end with a D one later if the D one is very close to the C++ one. 2. A lexer/parser generator using templates and/or CTFE to generate a lexer/parser for whatever language using a BNF or EBNF grammar - probably similar to what Philippe Sigaud has done to generate a parser using PEGs. Such a generator could be used to produce a lexer/parser for D or any other language. What you seem to be doing is creating a D-specific parser from scratch, which is great, but it's not quite what we've been looking to add to Phobos.
That doesn't necessarily mean that we can't or won't add it to Phobos once it's complete (especially if nothing else gets done first), but it also may not be acceptable, because it's not either #1 or #2. That would have to be decided once it's done and ready to be reviewed for inclusion in Phobos. I'm not trying to discourage you at all. I'm just pointing out that what you're doing isn't quite what we're looking for. Regardless, it _would_ be interesting to be able to compare implementations of #1 and #2 against what you're doing and see which performs better. - Jonathan M Davis
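For a taste of what option #2 relies on, D's CTFE plus string mixins can turn a declarative description into code at compile time. A toy sketch, far simpler than a real (E)BNF or PEG generator like Philippe Sigaud's, showing only the underlying mechanism (all names here are invented for illustration):

```d
// Runs at compile time (via CTFE) when used as a mixin argument:
// builds the source text of a string switch from a keyword list.
string generateKeywordSwitch(const string[] keywords)
{
    string code = "switch (s) {";
    foreach (kw; keywords)
        code ~= `case "` ~ kw ~ `": return true; `;
    code ~= "default: return false; }";
    return code;
}

bool isKeyword(string s)
{
    // The generated switch is compiled into this function; no runtime
    // string building or table lookup setup is involved.
    mixin(generateKeywordSwitch(["if", "else", "while", "return"]));
}
```

A full generator would apply the same idea to a whole grammar: parse the grammar string at compile time and mix in the resulting recognizer functions.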