> Date: Thu, 14 May 2015 08:57:30 -0400
> From: David Mertens <[email protected]>
> To: [email protected]
> Subject: Re: [Tinycc-devel] TinyCC REPL
> Message-ID:
>   <ca+4ieyxajmiut2gc8eehm-dva26fof41d6ppfet2qworcqp...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>> I pushed to the mob your changes to the win32/include/winapi
>
> Those changes were the minimum necessary so that tcc could compile perl.h
> on Windows. Is that a sensible criterion for including them in tcc's
> source code?
>
> David

I agree with Sergey: compiling Perl is a good reason for header changes.

> Date: Thu, 14 May 2015 09:35:12 -0400
> From: David Mertens <[email protected]>
> To: [email protected]
> Subject: Re: [Tinycc-devel] TinyCC REPL
> Message-ID:
>   <ca+4ieyuvehefrcjyhmnaeq2n6lbbvpexlafpzryawu7581k...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"

Out of curiosity, have you added an assert() to confirm that the character
set is ASCII rather than something else? I think ASCII is the norm, but
mainframes aren't the only place I've heard of non-ASCII encodings. (I'm
not saying you should handle it, just test for it; handling can be
implemented by someone with such a machine - I think the case I heard of
was embedded.)

> Hey Sergey,
>
> Thanks for this! I am impressed by your efforts! Unfortunately, I think
> this is a bit premature. I have more work to do on my exsymtab fork
> before I was going to bring it up for inclusion in tcc itself. (Indeed,
> I am not entirely sure that it is appropriate for inclusion in tcc.)
> Here are a couple of reasons we should wait a little while longer:
>
> 1. The extended symbol table API is not documented at all.
> 2. The symbol table copy takes O(N_sym^2) to run. It might be possible
>    to speed this up, but for the moment I plan to work around it by
>    implementing symbol table caching. I have not yet completed that
>    work, so I consider the exsymtab project to be incomplete at the
>    moment.

You refer to what I understand to be the symbol table as an Array Mapped
Trie. I'm not familiar with the structure, but it looks like an array that
someone shoved a bunch of trie data structures into. How exactly does
::filled_bits work?
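In case it helps pin down my question, here is what I imagine a
filled_bits lookup looks like in a typical array mapped trie. This is a
minimal sketch with names of my own invention (amt_node, amt_child), not
tcc's actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each node keeps a bitmask of which of its possible children exist,
 * plus a dense array holding only the children that do exist.  The
 * child for symbol `pos` lives at the popcount of the mask bits that
 * sit below `pos`. */
typedef struct amt_node {
    uint64_t filled_bits;        /* bit i set => child for symbol i exists */
    struct amt_node **children;  /* dense array, one slot per set bit */
} amt_node;

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear lowest set bit */
        n++;
    }
    return n;
}

static amt_node *amt_child(const amt_node *n, unsigned pos)
{
    uint64_t bit = (uint64_t)1 << pos;
    if (!(n->filled_bits & bit))
        return NULL;  /* no child for this symbol */
    /* index into the dense array = number of set bits below `pos` */
    return n->children[popcount64(n->filled_bits & (bit - 1))];
}
```

If that is roughly what tcc's version does, the array stays exactly as
large as the number of live children, at the cost of one popcount per
lookup step.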
Is it a bit field indicating which children (placed, I assume, in relation
to their parent) have data, and which don't?

> 4. As implemented, this cuts the maximum number of token symbols in
>    half. (I needed to use one of the bits to indicate "extended
>    symbol".)
> 5. The current token lookup is based on a compressed trie that
>    explicitly only supports A-Z, a-z, 0-9, and _. It does not support $
>    in identifiers and would need to be revised in order to do so. I
>    chose to implement Phil Bagwell's Array Mapped Trie
>    <http://lampwww.epfl.ch/papers/triesearches.pdf.gz> in the belief
>    that it would perform better than a hash for lookup. Once I add
>    symbol table caching, I hope to add (or switch to) Array Compressed
>    Tries for even better cache utilization. But currently, I rely on
>    having 63 allowed characters in identifiers.

Assuming that my understanding above is right (or at least the bit-field
part is), how about reducing the allowed number of children for a single
trie node, and using a custom encoding to indicate that some particular
trie child node (let's say the first) is a "continuation" node, containing
the OTHER half (or third, or whatever) of the children? The source already
mentions the assumption that the upper-case letters are used more rarely.
Depending on the chosen method (something derived from UTF-8, perhaps?
Nice and simple, after all, and implementable purely via comparisons:
instead of indicating extension bytes, the highest set bits would indicate
"subsidiary nodes"), extension to full Unicode might be possible (if
improbable, and ill-advised).

As an example, assuming that only one continuation node was allowed per
parent, the parent would have 62 - 2 bits, for 60 symbols if the highest
bit was set, and 61 otherwise. Meanwhile, the continuation node would
always have 62.
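The arithmetic can be spelled out as code. This is only a sketch of the
bit budget under my one-continuation-node assumption; the constant 62 and
both helper functions are mine, not tcc's:

```c
#include <assert.h>

/* Usable bits per node's filled_bits under the scheme above. */
enum { NODE_BITS = 62 };

static int parent_capacity(int has_continuation)
{
    /* With a continuation: one bit flags it, and the continuation
     * child itself consumes a child slot, so 62 - 2 = 60 symbols.
     * Without one: only the flag bit is lost, so 61 symbols. */
    return has_continuation ? NODE_BITS - 2 : NODE_BITS - 1;
}

static int total_capacity(void)
{
    /* Parent (with continuation) plus the continuation node itself,
     * minus one value reserved for "extended symbol":
     * 60 + 62 - 1 = 121. */
    return parent_capacity(1) + NODE_BITS - 1;
}
```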
The result is up to 122 - "extended symbol" == 121 available symbols: even
if you add $, an ASCII C++ operator declaration doesn't use even as many
as 100 distinct characters, much less C, so for relatively little cost (if
the assumptions about character occurrence ratios are "close enough") you
can easily cover more than is needed, thereby leaving room for, e.g.,
other languages at a later point in time.

> 7. A separate idea that I plan to pursue on my fork is to extend how
>    tcc pulls data in from file handles. I would like to make it
>    hookable so that I could write hooks in my Perl module and have it
>    interact directly with Perl's parser, rather than pulling all of the
>    C code into a temporary buffer. This may go beyond the wishes of the
>    community and merits further discussion.

It's been talked about before. I believe that Grischka vetoed it at the
time. Still, for e.g. shipping a "boot environment" with TCC (for crazy
no-filesystem experimenting), it would be invaluable.

_______________________________________________
Tinycc-devel mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/tinycc-devel
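For what it's worth, the hookable-input idea discussed above could look
something like the following. The whole API here (tcc_read_hook, drain,
str_hook) is hypothetical; tcc has no such interface today:

```c
#include <assert.h>
#include <string.h>

/* Instead of tcc opening files itself, the embedder supplies a read
 * callback that fills `buf` with up to `len` bytes and returns the
 * number of bytes written (0 on end of input). */
typedef int (*tcc_read_hook)(void *opaque, char *buf, int len);

/* A toy consumer that drains a hook the way a tokenizer might,
 * returning the total number of bytes seen. */
static long drain(tcc_read_hook hook, void *opaque)
{
    char buf[64];
    long total = 0;
    int n;
    while ((n = hook(opaque, buf, (int)sizeof buf)) > 0)
        total += n;
    return total;
}

/* Example hook: serve bytes from an in-memory string, the way a Perl
 * binding (or a no-filesystem "boot environment") might. */
struct strsrc { const char *p; };

static int str_hook(void *opaque, char *buf, int len)
{
    struct strsrc *s = opaque;
    int n = (int)strlen(s->p);
    if (n > len)
        n = len;
    memcpy(buf, s->p, (size_t)n);
    s->p += n;
    return n;
}
```

The point is that nothing in the consumer knows (or cares) whether the
bytes came from a file, a pipe, or another parser's buffer.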
