Re: [Libreoffice] [Crazy Ideas] Discuss
On 11/29/2010 07:48 PM, Mattias Johnsson wrote:
> On 30 November 2010 11:34, Joe Smith j...@martnet.com wrote:
>> I was also having a lot of trouble learning anything from running OOo under gdb. Gdb was acting weird and I couldn't step through the code and poke around. I ended up trying to do it by adding a printf, rebuild, run, rinse, repeat. No fun; less progress.
>
> Did you turn off compiler optimisations? I had the same problem (gdb hopping around in a non-intuitive way, and the values of some useful variables were optimised out) until I turned them off.
>
> If that's the problem, you can do it by configuring with --enable-debug, or if you're just building a single module you can do
>
>   build -- debug=t dbglevel=2
>
> I think dbglevel=1 also turns them off, and will give less noise, but I haven't checked.
>
> Cheers,
> Mattias

Whee! I was resisting another full build; I didn't realize this was part of the problem.

Thanks!

Joe
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: [Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library
Joe Smith wrote:
> I've looked at the code a bit, and it seems like there is indeed only one point of contact with the rest of the suite, textsearch.cxx, which handles all types of text searches (normal, regexp, fuzzy), and calls Regexpr::re_search(), which calls re_match2() to run the actual regexp match. So the structure makes it easy to replace the regexp code in one place.
>
> Unfortunately, the way the functions work does not match well with the Boost RE classes, although I'm sure it would be possible with an interface layer. For example, the Boost engine handles locale-specific issues internally, whereas OOo's engine knows almost nothing about character case or multi-character sequences. Instead, it preps the text to be searched by running it through a filter. I don't understand the i18n character encoding issues well enough to guess what that filter is actually doing or how it should be handled.

Hi Joe,

hm - then I think a combination of those two approaches might be a winning strategy - LibO uses icu for all that nifty transliteration stuff and whatnot. I notice that newer boost versions also optionally support icu, maybe that already gives us good enough coverage - I'd be tempted to just give it a whirl, and add it as an optional, experimental feature to have people play with it.

Cheers,

-- Thorsten
Re: [Libreoffice] [Crazy Ideas] Discuss
I like the idea of replacing the internal regexp in favor of a more fully developed regexp evaluator. The goal should be to get rid of the weaker regexp module. A question I don't know how to answer is which is the best replacement. As Thorsten points out, ideally the replacement should be enabled for localization and transliteration - again more stuff I don't understand well at all.

However, looking at textsearch.cxx in Open Grok -- http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165 -- one can see this comment before the various types of calls to a search routine:

  // use transliteration here, but only if not RegEx, which does it different

One can also see other exclusions of the regexp search algorithm from the transliteration search prep and search result code in textsearch.cxx around the calls to the search routines, but I'm not absolutely sure that exclusion is complete.

If the regexp search truly *never* uses transliteration then the swap out will be simpler and the change-over may actually enable transliteration. I haven't looked at the internal code of the regexp - perhaps it 'does its own thing' internally for transliteration...
Re: [Libreoffice] [Crazy Ideas] Discuss
On 11/29/2010 06:39 PM, John LeMoyne Castle wrote:
> ...
> However, looking at textsearch.cxx in Open Grok -- http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165 -- one can see this comment before the various types of calls to a search routine:
>
>   // use transliteration here, but only if not RegEx, which does it different
>
> One can also see other exclusions of the regexp search algorithm from the transliteration search prep and search result code in textsearch.cxx around the calls to the search routines, but I'm not absolutely sure that exclusion is complete.
>
> If the regexp search truly *never* uses transliteration then the swap out will be simpler and the change-over may actually enable transliteration. I haven't looked at the internal code of the regexp - perhaps it 'does its own thing' internally for transliteration...

Right. I have only a vague idea what transliteration means here. From a web search I can see that it must be an attempt to deal with things like accented characters (Is a the same as ä, or not? Is ss the same as ß?), but I couldn't find any clear description of exactly what the transliteration was doing.

There is a letter-case filter applied to the text before a regex search, changing all characters to one single case, lower case for English text. If the user indicates that case is significant, the filter is not applied.

The actual searches get a text buffer and a pair of indices (first, last) indicating the region to search. The results are returned as a list of matches, also with indices into the text buffer. The code does a lot of adjusting of the indices, I suppose to account for character-level changes due to the transliteration, but again, I can't really tell what the adjustment code is supposed to do.

I was also having a lot of trouble learning anything from running OOo under gdb. Gdb was acting weird and I couldn't step through the code and poke around. I ended up trying to do it by adding a printf, rebuild, run, rinse, repeat. No fun; less progress.

My thought was maybe to just avoid all that and start out with an extension testbed that uses the Boost regexp. I'm sure I can get access to paragraphs of text without any transliteration or filtering, and see how well the Boost functions work. If that goes well, then move on to replacing code.

I think Boost looks like the way to go, since it has a lot of functionality, supports Unicode (16- or 32-bit chars), and OOo already uses it.

Performance could be a problem. I saw a comment in the code somewhere saying that performance is critical for some spreadsheets--I assume because Calc's lookups default to using regular expression matching. As far as I can see, that's a faulty design: the lookups should not use regexp matching unless it is specifically requested, but it may be too late to change that now.

I've seen benchmarks indicating that the Boost regexp is fairly fast compared to other regexp engines, but I'm guessing that it's still slower than the current primitive engine.

Joe
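One plausible reason for the index adjustments described above, offered as a guess and not checked against the i18npool code: a filter that only changes case leaves every character at its original offset, but a transliteration that changes length (ß to ss, say) shifts everything after it, so match offsets found in the filtered text have to be mapped back to the original buffer. The sketch below shows that idea with a toy filter. `Filtered`, `Range`, `foldForSearch` and `toOriginal` are hypothetical names, and the folding rules are deliberately minimal (ASCII lower-casing plus ß to ss).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container: the filtered text plus, for every filtered
// character, the index of the original character that produced it.
struct Filtered {
    std::wstring text;
    std::vector<std::size_t> origin;  // origin.size() == text.size() + 1
};

// Hypothetical half-open [start, end) range of offsets.
struct Range {
    std::size_t start;
    std::size_t end;
};

// Toy filter (an assumption, not OOo's real transliteration):
// lower-case ASCII letters and expand U+00DF (sharp s) to "ss".
Filtered foldForSearch(const std::wstring& original)
{
    Filtered out;
    for (std::size_t i = 0; i < original.size(); ++i) {
        const wchar_t c = original[i];
        if (c == L'\u00DF') {
            out.text += L"ss";          // one character becomes two
            out.origin.push_back(i);
            out.origin.push_back(i);
        } else if (c >= L'A' && c <= L'Z') {
            out.text += static_cast<wchar_t>(c - L'A' + L'a');
            out.origin.push_back(i);
        } else {
            out.text += c;
            out.origin.push_back(i);
        }
    }
    out.origin.push_back(original.size());  // sentinel for offsets at the end
    assert(out.origin.size() == out.text.size() + 1);
    return out;
}

// Map a match found in the filtered text back to the original buffer.
// The end is taken from the last matched character so that a match ending
// in the middle of an expansion still covers the whole original character.
Range toOriginal(const Filtered& f, std::size_t start, std::size_t end)
{
    assert(start <= end && end <= f.text.size());
    const std::size_t origStart = f.origin[start];
    const std::size_t origEnd = (end > start) ? f.origin[end - 1] + 1 : origStart;
    return {origStart, origEnd};
}
```

For example, folding L"Die Stra\u00DFe" gives "die strasse"; a search for "strasse" matches at filtered offsets 4 to 11, and `toOriginal` maps that back to 4 to 10, which is "Straße" in the original. Whichever engine runs the search only ever sees the filtered text.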
Re: [Libreoffice] [Crazy Ideas] Discuss
On 30 November 2010 11:34, Joe Smith j...@martnet.com wrote:
> I was also having a lot of trouble learning anything from running OOo under gdb. Gdb was acting weird and I couldn't step through the code and poke around. I ended up trying to do it by adding a printf, rebuild, run, rinse, repeat. No fun; less progress.

Did you turn off compiler optimisations? I had the same problem (gdb hopping around in a non-intuitive way, and the values of some useful variables were optimised out) until I turned them off.

If that's the problem, you can do it by configuring with --enable-debug, or if you're just building a single module you can do

  build -- debug=t dbglevel=2

I think dbglevel=1 also turns them off, and will give less noise, but I haven't checked.

Cheers,
Mattias
Re: [Libreoffice] [Crazy Ideas] Discuss
On Tue, 2010-11-30 at 11:48 +1100, Mattias Johnsson wrote:
> On 30 November 2010 11:34, Joe Smith j...@martnet.com wrote:
>> I was also having a lot of trouble learning anything from running OOo under gdb. Gdb was acting weird and I couldn't step through the code and poke around. I ended up trying to do it by adding a printf, rebuild, run, rinse, repeat. No fun; less progress.
>
> Did you turn off compiler optimisations? I had the same problem (gdb hopping around in a non-intuitive way, and the values of some useful variables were optimised out) until I turned them off.

Yup, this is good advice. Also, gdb at one point had issues with setting breakpoints in class constructors. I don't know if this has been resolved in more recent releases, but it's something to keep in mind in case you still use a version of gdb with this issue unresolved.

gdb also tends to quit when you try to step through parts of the code where no debug symbols are available. This may happen when you've rebuilt a module only partially with debug symbols, and/or step into code of other modules that have not been rebuilt with debug symbols.

> If that's the problem, you can do it by configuring with --enable-debug, or if you're just building a single module you can do
>
>   build -- debug=t dbglevel=2

build debug=t alone should turn off compiler optimization; it goes before the '--', not after. I wouldn't recommend dbglevel=2 unless you know what you are getting with dbglevel=2.

> I think dbglevel=1 also turns them off, and will give less noise, but I haven't checked.

It's the debug=t part that turns off compiler optimization. dbglevel=# controls the amount of debug messages that other devs have put in (if I understand David's mail correctly, that is).

Kohei

--
Kohei Yoshida, LibreOffice hacker, Calc
kyosh...@novell.com
[Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library
Anyone interested in discussing this crazy idea?

Replace home-grown regexp parser with some std library
http://wiki.documentfoundation.org/Development/Crazy_Ideas#Replace_home-grown_regexp_parser_with_some_std_library

I've been thinking about this since I found my first bug in OOo's oddball regex engine two years ago. Even though it's a feature for die-hard geeks, I would love to see OOo's quirky, complicated and non-standard regex engine replaced with something solid, standard and externally supported.

I've looked at the code a bit, and it seems like there is indeed only one point of contact with the rest of the suite, textsearch.cxx, which handles all types of text searches (normal, regexp, fuzzy), and calls Regexpr::re_search(), which calls re_match2() to run the actual regexp match. So the structure makes it easy to replace the regexp code in one place.

Unfortunately, the way the functions work does not match well with the Boost RE classes, although I'm sure it would be possible with an interface layer. For example, the Boost engine handles locale-specific issues internally, whereas OOo's engine knows almost nothing about character case or multi-character sequences. Instead, it preps the text to be searched by running it through a filter. I don't understand the i18n character encoding issues well enough to guess what that filter is actually doing or how it should be handled.

That's as far as I've gotten, although I have some ideas for some prototype code. I'd love to get some input from someone more experienced with OOo's code, or even to discuss how the regexp support fits at the application level.

Joe
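For discussion, here is a rough idea of what the interface layer mentioned above might look like. It is only a sketch under several assumptions: `MatchRange` and `searchRegion` are hypothetical names, not functions from textsearch.cxx; the caller is assumed to pass a text buffer plus the region to search and to expect match offsets relative to the whole buffer; and std::regex stands in for the Boost classes so the example builds without extra dependencies. It makes no attempt at the i18n filtering question.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: half-open [start, end) offsets into the
// caller's buffer.
struct MatchRange {
    std::size_t start;
    std::size_t end;
};

// Hypothetical adapter: search text[first, last) with a library regex
// engine and report every match as offsets into the full buffer.
std::vector<MatchRange> searchRegion(const std::wstring& text,
                                     const std::wstring& pattern,
                                     std::size_t first, std::size_t last,
                                     bool caseSensitive)
{
    std::vector<MatchRange> results;
    if (first > last || last > text.size())
        return results;  // invalid region: report no matches

    std::regex_constants::syntax_option_type flags =
        std::regex_constants::ECMAScript;
    if (!caseSensitive)
        flags = flags | std::regex_constants::icase;  // engine handles case itself
    const std::wregex re(pattern, flags);

    const auto regionBegin = text.cbegin() + static_cast<std::ptrdiff_t>(first);
    const auto regionEnd = text.cbegin() + static_cast<std::ptrdiff_t>(last);
    for (std::wsregex_iterator it(regionBegin, regionEnd, re), stop;
         it != stop; ++it) {
        // Convert the match iterators back to offsets in the full buffer.
        const auto s = static_cast<std::size_t>((*it)[0].first - text.cbegin());
        const auto e = static_cast<std::size_t>((*it)[0].second - text.cbegin());
        assert(s <= e && e <= last);  // a match never leaves the region
        results.push_back({s, e});
    }
    return results;
}
```

Calling `searchRegion(L"foo Bar baz bar", L"ba[rz]", 4, 15, false)` should yield the ranges 4..7, 8..11 and 12..15; with `caseSensitive` set to true the first one ("Bar") drops out. Keeping the offsets relative to the whole buffer means the calling code would not need to know which engine ran the match.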