Re: [Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library

2010-11-30 Thread Joe Smith

On 11/29/2010 07:48 PM, Mattias Johnsson wrote:
> On 30 November 2010 11:34, Joe Smith <j...@martnet.com> wrote:
>> I was also having a lot of trouble learning anything from running OOo under
>> gdb. Gdb was acting weird and I couldn't step through the code and poke
>> around. I ended up trying to do it by adding a printf, rebuild, run, rinse,
>> repeat. No fun; less progress.
>
> Did you turn off compiler optimisations? I had the same problem (gdb
> hopping around in a non-intuitive way, and the values of some useful
> variables were optimised out) until I turned them off.
>
> If that's the problem, you can do it by configuring with
> --enable-debug, or if you're just building a single module you can do
> build -- debug=t dbglevel=2
>
> I think dbglevel=1 also turns them off, and will give less noise,
> but I haven't checked.
>
> Cheers,
> Mattias

Whee! I was resisting another full build; I didn't realize this was part 
of the problem.


Thanks!

Joe

___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: [Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library

2010-11-29 Thread Thorsten Behrens
Joe Smith wrote:
> I've looked at the code a bit, and it seems like there is indeed only one point
> of contact with the rest of the suite, textsearch.cxx, which handles all types
> of text searches (normal, regexp & fuzzy), and calls Regexpr::re_search(), which
> calls re_match2() to run the actual regexp match.
>
> So the structure makes it easy to replace the regexp code in one place.
>
> Unfortunately, the way the functions work does not match well with the Boost RE
> classes, although I'm sure it would be possible with an interface layer.
>
> For example, the Boost engine handles locale-specific issues internally, whereas
> OOo's engine knows almost nothing about character case or multi-character
> sequences. Instead, it preps the text to be searched by running it through a
> filter. I don't understand the i18n & character encoding issues well enough to
> guess what that filter is actually doing or how it should be handled.

Hi Joe,

hm - then I think a combination of those two approaches might be a
winning strategy - LibO uses icu for all that nifty transliteration
stuff & what not.

I notice that newer boost versions also optionally support icu,
maybe that already gives us good enough coverage - I'd be tempted to
just give it a whirl, and add it as an optional, experimental
feature to have people play with it.

Cheers,

-- Thorsten




Re: [Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library

2010-11-29 Thread John LeMoyne Castle

I like the idea of replacing the internal regexp in favor of a more fully
developed regexp evaluator.  The goal should be to get rid of the weaker
regexp module.  A question I don't know how to answer is which is the best
replacement.  As Thorsten points out, ideally the replacement should be
enabled for localization and transliteration - again more stuff I don't
understand well at all.  

However, looking at textsearch.cxx in OpenGrok --
http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165
-- one can see this comment before the various types of calls to a search
routine:
// use transliteration here, but only if not RegEx, which does it different

One can also see other exclusions of the regexp search algorithm from the
transliteration search prep and search result code in textsearch.cxx around
the calls to the search routines, but I'm not absolutely sure that the exclusion
is complete.  If the regexp search truly *never* uses transliteration then
the swap out will be simpler and the change-over may actually enable
transliteration.  I haven't looked at the internal code of the regexp -
perhaps it 'does its own thing' internally for transliteration...


Re: [Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library

2010-11-29 Thread Joe Smith

On 11/29/2010 06:39 PM, John LeMoyne Castle wrote:
> ...
> However, looking at textsearch.cxx in OpenGrok --
> http://opengrok.go-oo.org/xref/libs-gui/i18npool/source/search/textsearch.cxx#165
> -- one can see this comment before the various types of calls to a search
> routine:
> // use transliteration here, but only if not RegEx, which does it different
>
> One can also see other exclusions of the regexp search algorithm from the
> transliteration search prep and search result code in textsearch.cxx around
> the calls to the search routines, but I'm not absolutely sure that the exclusion
> is complete.  If the regexp search truly *never* uses transliteration then
> the swap out will be simpler and the change-over may actually enable
> transliteration.  I haven't looked at the internal code of the regexp -
> perhaps it 'does its own thing' internally for transliteration...

Right. I have only a vague idea what transliteration means here. From 
a web search I can see that it must be an attempt to deal with things 
like accented characters (Is a the same as ä, or not? Is ss the 
same as ß?), but I couldn't find any clear description of exactly what 
the transliteration was doing.


There is a letter-case filter applied to the text before a regex search, 
changing all characters to one single case, lower case for English text. 
If the user indicates that case is significant, the filter is not applied.
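
As a rough illustration of the case filter described above (this is not the actual OOo code), the sketch below lowercases the text before handing it to the engine, unless the caller asks for a case-sensitive search. It uses std::regex as a stand-in engine and ASCII-only std::tolower. The names foldCase and searchWithCaseFilter are made up for this example, and it assumes the caller already supplies a lower-case pattern when case is not significant.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Hypothetical helper: fold ASCII text to lower case.
// Real i18n code would need locale-aware folding; this is ASCII only.
std::string foldCase(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// Hypothetical search: returns the offset of the first match, or -1.
// When caseSensitive is false the text is run through the filter first,
// so the engine itself never has to know about letter case.
// Assumption: in that mode the pattern is already written in lower case.
long searchWithCaseFilter(const std::string& pattern, const std::string& text,
                          bool caseSensitive) {
    const std::string haystack = caseSensitive ? text : foldCase(text);
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(haystack, m, re)) return -1;
    return static_cast<long>(m.position(0));
}
```

Because this filter replaces each character with exactly one character, match offsets in the filtered text are also valid offsets in the original text. That stops being true as soon as a filter changes the length of the text.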


The actual searches get a text buffer and a pair of indices (first, 
last) indicating the region to search. The results are returned as a 
list of matches, also with indices into the text buffer. The code does a 
lot of adjusting of the indices, I suppose to account for 
character-level changes due to the transliteration, but again, I can't 
really tell what the adjustment code is supposed to do.
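
One plausible reason for that kind of index adjustment, sketched under heavy assumptions: if a filter changes the length of the text (for example by expanding ß to ss), match positions in the filtered text no longer line up with the original buffer and have to be mapped back. The code below is a toy illustration of that idea only. It is not taken from textsearch.cxx, it uses std::wregex as a stand-in engine, and FilteredText, expandSharpS and searchOriginal are hypothetical names.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result of a filtering pass: the rewritten text plus, for each
// character of it, the index of the original character it came from.
struct FilteredText {
    std::wstring text;
    std::vector<std::size_t> origin;  // origin[i] = index in the original text
};

// Toy "transliteration": expand U+00DF (sharp s) to "ss", copy everything else.
FilteredText expandSharpS(const std::wstring& original) {
    FilteredText out;
    for (std::size_t i = 0; i < original.size(); ++i) {
        if (original[i] == L'\u00DF') {
            out.text += L"ss";
            out.origin.push_back(i);  // both output characters map to the same source
            out.origin.push_back(i);
        } else {
            out.text += original[i];
            out.origin.push_back(i);
        }
    }
    return out;
}

// Search the filtered text, then translate the match back into a half-open
// [first, last) range of indices in the original text.
// Returns {npos, npos} when nothing matches (empty matches count as no match).
std::pair<std::size_t, std::size_t> searchOriginal(const std::wstring& pattern,
                                                   const std::wstring& original) {
    const std::size_t npos = std::wstring::npos;
    const FilteredText filtered = expandSharpS(original);
    const std::wregex re(pattern);
    std::wsmatch m;
    if (!std::regex_search(filtered.text, m, re) || m.length(0) == 0) {
        return {npos, npos};
    }
    const std::size_t start = static_cast<std::size_t>(m.position(0));
    const std::size_t lastChar = start + static_cast<std::size_t>(m.length(0)) - 1;
    return {filtered.origin[start], filtered.origin[lastChar] + 1};
}
```

Here a seven-character match in the filtered text ("Strasse") maps back to a six-character range in the original ("Straße"), which is the sort of bookkeeping the adjustment code may be doing.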


I was also having a lot of trouble learning anything from running OOo 
under gdb. Gdb was acting weird and I couldn't step through the code and 
poke around. I ended up trying to do it by adding a printf, rebuild, 
run, rinse, repeat. No fun; less progress.


My thought was maybe to just avoid all that and start out with an 
extension testbed that uses the Boost regexp. I'm sure I can get access 
to paragraphs of text without any transliteration or filtering, and see 
how well the Boost functions work. If that goes well, then move on to 
replacing code.
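
A testbed along those lines could start very small. The sketch below mimics the calling convention described earlier in this message (a text buffer, a first/last region, results as indices into the buffer) but uses std::regex as a stand-in, since it needs no extra libraries. The function name findAllInRegion is hypothetical, and nothing here reflects the real OOo or Boost interfaces.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical testbed function: find all matches of `pattern` inside the
// half-open region [first, last) of `text`. Each result is a pair of indices
// into the full buffer, not into the region.
std::vector<std::pair<std::size_t, std::size_t>>
findAllInRegion(const std::string& pattern, const std::string& text,
                std::size_t first, std::size_t last) {
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    if (first > last || last > text.size()) return matches;  // invalid region

    const std::regex re(pattern);
    const auto begin = text.begin() + static_cast<std::ptrdiff_t>(first);
    const auto end = text.begin() + static_cast<std::ptrdiff_t>(last);
    for (std::sregex_iterator it(begin, end, re), done; it != done; ++it) {
        // position(0) is relative to `begin`, so shift it by `first`.
        const std::size_t start = first + static_cast<std::size_t>(it->position(0));
        matches.emplace_back(start, start + static_cast<std::size_t>(it->length(0)));
    }
    return matches;
}
```

Calling it on a paragraph with a restricted region returns only the matches that lie entirely inside that region, which makes it easy to compare against what the existing search reports for the same input.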


I think Boost looks like the way to go, since it has a lot of 
functionality, supports Unicode (16- or 32-bit chars), and OOo already 
uses it.


Performance could be a problem. I saw a comment in the code somewhere 
saying that performance is critical for some spreadsheets--I assume 
because Calc's lookups default to using regular expression matching.


As far as I can see, that's a faulty design: the lookups should not use 
regexp matching unless it is specifically requested, but it may be too 
late to change that now.


I've seen benchmarks indicating that the Boost regexp is fairly fast 
compared to other regexp engines, but I'm guessing that it's still 
slower than the current primitive engine.
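
To put rough numbers on the lookup concern, a small timing harness could compare a plain substring lookup with a regex lookup over the same cells. The sketch below is only an assumption about how such a comparison might be set up. It uses std::regex rather than Boost or the OOo engine, so its timings say nothing about either, and countLiteral, countRegex and microsecondsFor are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Count the cells that contain `needle` using a plain substring search.
std::size_t countLiteral(const std::vector<std::string>& cells, const std::string& needle) {
    std::size_t hits = 0;
    for (const auto& cell : cells)
        if (cell.find(needle) != std::string::npos) ++hits;
    return hits;
}

// Count the cells in which the regular expression `re` matches somewhere.
std::size_t countRegex(const std::vector<std::string>& cells, const std::regex& re) {
    std::size_t hits = 0;
    for (const auto& cell : cells)
        if (std::regex_search(cell, re)) ++hits;
    return hits;
}

// Time one call of `fn` in microseconds using a monotonic clock.
template <typename Fn>
long long microsecondsFor(Fn fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Wrapping each counter in microsecondsFor over a large vector of cell strings, with the regex compiled once outside the timed call, would give a first impression of the gap between literal and regex lookups on a given machine.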


Joe



Re: [Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library

2010-11-29 Thread Mattias Johnsson
On 30 November 2010 11:34, Joe Smith <j...@martnet.com> wrote:
> I was also having a lot of trouble learning anything from running OOo under
> gdb. Gdb was acting weird and I couldn't step through the code and poke
> around. I ended up trying to do it by adding a printf, rebuild, run, rinse,
> repeat. No fun; less progress.

Did you turn off compiler optimisations? I had the same problem (gdb
hopping around in a non-intuitive way, and the values of some useful
variables were optimised out) until I turned them off.

If that's the problem, you can do it by configuring with
--enable-debug, or if you're just building a single module you can do
build -- debug=t dbglevel=2

I think dbglevel=1 also turns them off, and will give less noise,
but I haven't checked.

Cheers,
Mattias


Re: [Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library

2010-11-29 Thread Kohei Yoshida
On Tue, 2010-11-30 at 11:48 +1100, Mattias Johnsson wrote:
> On 30 November 2010 11:34, Joe Smith <j...@martnet.com> wrote:
>> I was also having a lot of trouble learning anything from running OOo under
>> gdb. Gdb was acting weird and I couldn't step through the code and poke
>> around. I ended up trying to do it by adding a printf, rebuild, run, rinse,
>> repeat. No fun; less progress.
>
> Did you turn off compiler optimisations? I had the same problem (gdb
> hopping around in a non-intuitive way, and the values of some useful
> variables were optimised out) until I turned them off.

Yup, this is good advice.  Also, gdb at one point had issues with
setting breakpoints in class constructors.  I don't know if this has
been resolved yet in the more recent releases, but it's something to keep in
mind in case you still use a version of gdb with this issue unresolved.
gdb also tends to quit when you try to step through parts of code where
no debug symbols are available.  This may happen when you've rebuilt a
module only partially with debug symbols, and/or step into code of other
modules that have not been rebuilt with debug symbols.

> If that's the problem, you can do it by configuring with
> --enable-debug, or if you're just building a single module you can do
> build -- debug=t dbglevel=2

build debug=t alone should turn off compiler optimization; note that
debug=t goes before the '--', not after.  I wouldn't recommend dbglevel=2
unless you know what you are getting with it.

> I think dbglevel=1 also turns them off, and will give less noise,
> but I haven't checked.

It's the debug=t part that turns off compiler optimization.  dbglevel=#
controls the amount of debug messages that other devs have put in (if I
understand David's mail correctly, that is).

Kohei

-- 
Kohei Yoshida, LibreOffice hacker, Calc
kyosh...@novell.com



[Libreoffice] [Crazy Ideas] Discuss Replace regexp parser with std library

2010-11-26 Thread Joe Smith
Anyone interested in discussing this crazy idea?

Replace home-grown regexp parser with some std library
http://wiki.documentfoundation.org/Development/Crazy_Ideas#Replace_home-grown_regexp_parser_with_some_std_library

I've been thinking about this since I found my first bug in OOo's oddball regex
engine two years ago. Even though it's a feature for die-hard geeks, I would
love to see OOo's quirky, complicated and non-standard regex engine replaced
with something solid, standard and externally supported.

I've looked at the code a bit, and it seems like there is indeed only one point
of contact with the rest of the suite, textsearch.cxx, which handles all types
of text searches (normal, regexp & fuzzy), and calls Regexpr::re_search(), which
calls re_match2() to run the actual regexp match.

So the structure makes it easy to replace the regexp code in one place.

Unfortunately, the way the functions work does not match well with the Boost RE
classes, although I'm sure it would be possible with an interface layer.
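
To make the interface-layer idea a little more concrete, here is a minimal sketch of what such a layer might look like. Everything in it is an assumption for illustration: RegexEngine, MatchRange, StdRegexEngine and makeEngine are hypothetical names, the backend is std::regex rather than Boost, and the real signatures of Regexpr::re_search() and re_match2() are not reproduced here.

```cpp
#include <cstddef>
#include <memory>
#include <regex>
#include <string>

// Hypothetical match result: half-open [first, last) indices into the buffer.
struct MatchRange {
    bool found;
    std::size_t first;
    std::size_t last;
};

// Hypothetical interface the search code would call instead of a concrete engine.
class RegexEngine {
public:
    virtual ~RegexEngine() = default;
    virtual MatchRange search(const std::string& text, std::size_t first,
                              std::size_t last) const = 0;
};

// One possible backend, built on std::regex purely for illustration.
class StdRegexEngine : public RegexEngine {
public:
    explicit StdRegexEngine(const std::string& pattern) : re_(pattern) {}

    MatchRange search(const std::string& text, std::size_t first,
                      std::size_t last) const override {
        if (first > last || last > text.size()) return {false, 0, 0};
        const auto begin = text.begin() + static_cast<std::ptrdiff_t>(first);
        const auto end = text.begin() + static_cast<std::ptrdiff_t>(last);
        std::smatch m;
        if (!std::regex_search(begin, end, m, re_)) return {false, 0, 0};
        // position(0) is relative to `begin`, so shift it by `first`.
        const std::size_t start = first + static_cast<std::size_t>(m.position(0));
        return {true, start, start + static_cast<std::size_t>(m.length(0))};
    }

private:
    std::regex re_;
};

// Factory so callers depend only on the interface, not on the backend.
std::unique_ptr<RegexEngine> makeEngine(const std::string& pattern) {
    return std::unique_ptr<RegexEngine>(new StdRegexEngine(pattern));
}
```

With a layer like this, the single point of contact would only ever see RegexEngine, so a Boost-based backend could later be swapped in behind makeEngine without touching the callers.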

For example, the Boost engine handles locale-specific issues internally, whereas
OOo's engine knows almost nothing about character case or multi-character
sequences. Instead, it preps the text to be searched by running it through a
filter. I don't understand the i18n & character encoding issues well enough to
guess what that filter is actually doing or how it should be handled.

That's as far as I've gotten, although I have some ideas for some prototype
code. I'd love to get some input from someone more experienced with OOo's code,
or even to discuss how the regexp support fits at the application level.

Joe

