Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
in moz\zipped

I found some info on the openoffice.org wiki. The next hurdle seems to be getting the MS SDK. I'm downloading the Win7 version as it's smaller and hopefully okay, but it looks like it will take all night. I might have this working in a couple of weeks at this rate.

Also, I noticed some oddities:

    checking size of long... 0!!

configure kept picking up /usr/bin/csc.exe, which seems to be some kind of Scheme interpreter. I just pulled in all the Cygwin dev tools, so I guess I ended up with it that way. I renamed the file to something else and it is now picking up the DotNet version.

It also appears that ATL and COM are unavailable when using VCExpress, so presumably some features will go missing.

Regards
Steven Butler
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
On 2011-02-02 at 13:56, tlillqv...@novell.com wrote:
> checking size of long... 0!!

Equally fun is the one that follows immediately:

    checking whether byte ordering is bigendian... yes

The result of that test must not really be used anywhere either. I will remove it, too.

--tml
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
> No chance it's used to support writing some of the binary file formats out uniformly across different endians?

Nope. OOo/LibreOffice has had its own stuff for all such things for a very long time. The configure script is a relatively late addition to OOo/LibreOffice.

Anyway, I noticed that SIZEOF_LONG and WORDS_BIGENDIAN are checked in set_soenv.in for some rare platforms (MIPS, PowerPC, S390x), so I will not remove them from configure.in. I will just bypass the failing tests on Windows and hardcode the correct values (even if they are not needed as such on Windows).

--tml
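For readers unsure what the two configure checks are meant to report, here is a minimal C++ sketch that computes the same two values at run time. The function names are invented for this illustration and are not part of the LibreOffice build, which takes these values from configure.in and set_soenv.in.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: the value "checking size of long" should report.
std::size_t sizeOfLong()
{
    return sizeof(long);
}

// Hypothetical helper: the answer "checking whether byte ordering is
// bigendian" should report. On a big-endian machine the most significant
// byte of the probe value is stored first.
bool isBigEndian()
{
    const std::uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x01;
}
```

With MSVC on x86 or x64 Windows, long is 4 bytes and the byte order is little-endian, so the "0" and "yes" results quoted above look like failed tests rather than real measurements.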
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
Hurray, I have finally got past the bootstrap phase and I'm leaving it to build in the background today while I'm at work.

I had a number of small issues that I had to resolve, including being unable to execute some of the installers that were downloaded. Running

    chmod 755 src/*.exe src/*.EXE

seemed to resolve that, but there were a couple of other issues too.

I will send an update tonight if all goes well.

Regards,
Steven Butler
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
Sorry Tor - forgot to reply-all and sent only to you previously... resending to the list.

On 3 February 2011 10:35, Steven Butler sebut...@gmail.com wrote:
> I will send an update tonight if all goes well.

It seems to have failed building VCL - there is an error stating

    f268: Error: The image(s) check ... could not be found.

(my elision as different PC)

Could this have something to do with removing icons that was done recently? I may have to wait till tonight and try to update the git checkout(s) to see if that helps. I also had errors building in a number of other subprojects that I'll need to look into tonight.

Regards,
Steven Butler
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
On 1 February 2011 07:53, Tor Lillqvist tlillqv...@novell.com wrote:
> With the clarification that it is the *Cygwin* command line, yes.
>
> > seems I already have gnu make in my path on windows from the mingw
>
> Nah, that is not usable for this. It must be the Cygwin make that is used for this (and other Cygwin tools, as described in the wiki). (The Cygwin make is as such not used for much in the LO build process, just at the very top level. For the rest LO's own dmake is used.)
>
> And to avoid any possibility of confusion, make sure your non-related development environment(s) don't show up in any environment variables (PATH, LIBS, etc) in the LO build environment.
>
> --tml

Ok, I've not done any more work on developing this as I have been working on getting a win32 (actually 64 bit win 7) build environment working tonight. I haven't got very far but I will try to note the steps I've taken as I go.

I'm going to have to give up for the night, as it is complaining about the MozillaBuildSetup tools, which are 79 MB and coming down from ftp.mozilla.com at dialup speed :(

I'm very new to git so I gave up on --reference and just did a straight clone from my SMB share, which was relatively quick, but of course I only got the bootstrap. Once I get past the bootstrap stage, is the git part going to grab relative to bootstrap or go straight to libreoffice.org? I was thinking about manually cloning each of the repositories in the clone directory if necessary to short circuit this.

Here are my steps so far:

1. Install Cygwin - pick all development tools and install (much later)

2. Clone the bootstrap git project from the SMB share and copy the src files.

3. In the Cygwin shell, autogen failed with an odd error related to native programs and symlinks. I got past this by doing the following:

    cd /bin
    rm /usr/bin/awk
    cp /usr/bin/gawk.exe awk.exe
    cp /usr/bin/gzip.exe gunzip.exe

4. After this, it seemed to pick my MSVC2008 Express install as the compiler (I also had several Cygwin gcc versions installed but it seems to have ignored them), then I needed to add the JDK 6 home to the config options:

    ./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/

5. I now find I need the Mozilla build tools and another config option. Download http://ftp.mozilla.org/pub/mozilla.org/mozilla/libraries/win32/MozillaBuildSetup-Latest.exe (very slow 2 hour download :( ). I will install it in the morning if it has finished downloading...

I'll keep adding to this list in case it helps someone else out.

--
Regards,
Steven Butler
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
Hi,

On 1 February 2011 22:26, Michael Meeks michael.me...@novell.com wrote:
> Hi Steve,
>
> > 3. In Cygwin shell, the autogen failed with an odd error related to Native programs and symlinks. I got past this by doing the following:
> >     cd /bin
> >     rm /usr/bin/awk
> >     cp /usr/bin/gawk.exe awk.exe
> >     cp /usr/bin/gzip.exe gunzip.exe
>
> Urk; I guess we should try to patch/fix our autogen.sh to work more nicely - or is this unavoidable ?

It says it's because non-Cygwin programs (native Windows) can't execute them - I have no idea where they are used (or if they are used), so I followed some hints off the net to make it stop complaining :)

> > 4. After this, it seemed to pick my MSVC2008 Express install as the compiler (I also had several cygwin gcc versions installed but it seems to have ignored them), then I needed to add the jdk 6 home to the config option
> >     ./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/

After finally installing mozilla-build with the following steps:

7. Rerun autogen:

    ./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/ --with-mozilla-build=/cygdrive/c/mozilla-build/

   which fails with:

    configure: error: Building SeaMonkey is supported with Microsoft Visual Studio 2005 only.

8. I downloaded the prebuilt SeaMonkey from here: http://tools.openoffice.org/moz_prebuild/OOo3.2/ and grabbed the 3 files starting with WNT, but I'm not sure what to do with them...

Where should I put these to make it all go?

Regards
Steven Butler
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
2011/2/1 Steven Butler sebut...@gmail.com:
> configure: error: Building SeaMonkey is supported with Microsoft Visual Studio 2005 only.
>
> 8. I downloaded prebuilt seamonkey from here: http://tools.openoffice.org/moz_prebuild/OOo3.2/ grabbed 3 files started with WNT but I'm not sure what to do with them...
>
> Where should I put these to make it all go?

clone\libs-extern-sys\moz\zipped\

Cheers,
Andras
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
On Tue, Feb 1, 2011 at 11:44 PM, Steven Butler sebut...@gmail.com wrote:
> Hi,
>
> On 1 February 2011 22:26, Michael Meeks michael.me...@novell.com wrote:
> > Hi Steve,
> >
> > > 3. In Cygwin shell, the autogen failed with an odd error related to Native programs and symlinks. I got past this by doing the following:
> > >     cd /bin
> > >     rm /usr/bin/awk
> > >     cp /usr/bin/gawk.exe awk.exe
> > >     cp /usr/bin/gzip.exe gunzip.exe
> >
> > Urk; I guess we should try to patch/fix our autogen.sh to work more nicely - or is this unavoidable ?
>
> It says its because non-cygwin programs (native Windows) can't execute them - I have no idea where they are used (or if they are used) so I followed some hints off the net to make it stop complaining :)
>
> > > 4. After this, it seemed to pick my MSVC2008 Express install as the compiler (I also had several cygwin gcc versions installed but it seems to have ignored them), then I needed to add the jdk 6 home to the config option
> > >     ./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/
>
> After finally installing mozilla-build with the following steps:
>
> 7. Rerun autogen:
>     ./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/ --with-mozilla-build=/cygdrive/c/mozilla-build/
>     configure: error: Building SeaMonkey is supported with Microsoft Visual Studio 2005 only.
>
> 8. I downloaded prebuilt seamonkey from here: http://tools.openoffice.org/moz_prebuild/OOo3.2/ grabbed 3 files started with WNT but I'm not sure what to do with them...
>
> Where should I put these to make it all go?

In the directory moz/zipped.

--
Jesús Corrius je...@softcatala.org
Document Foundation founding member
Skype: jcorrius | Twitter: @jcorrius
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
On Mon, 2011-01-31 at 15:17 +, Michael Meeks wrote:
> Hi Steve,
>
> On Sat, 2011-01-29 at 21:45 +1000, Steve Butler wrote:
> > If the thesaurus is only loaded when the user pops it up, then couldn't mythes be taught to generate its own in-memory index from the dictionary and not bother with an index file at all?
>
> Right. I think we could easily serialize a small skip-list to disk too - if we simply store ~8 or ~32 or so indexes into the data - we can parse only a fraction of it, and pop that in our home directory. We could also drop the MyThes code too as a dependency to manage.
>
> The code using it is in: lingucomponent/source/thesaurus/libnth/nthesimp.cxx
>
> > BTW, if I did that I'd probably do some major surgery on mythes and just use STL because it basically is doing C style memory management and processing and I think I would screw it up if I started messing with it. The only problem with simplifying it with STL constructs is that I would want to change the interface (string vs char *), maybe use STL vectors for the list of synonyms, etc.
>
> Heh; sure.
>
> > By this stage it's not looking much like mythes anymore ...

FWIW, I'm sure Nemeth would be interested if you e.g. wanted to create a reimpl of mythes that was faster than the original and perhaps simply designate the optimized version the new mythes version with an API/ABI change :-)

C.
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
Hi Michael

On 1 February 2011 01:17, Michael Meeks michael.me...@novell.com wrote:
> Hi Steve,
>
> Sure - so; in response to user input I suspect we can take a second to parse the thesaurus; we have around 20Mb of text to load for en_US; perhaps 32Mb is a reasonable upper-bound; it does seem a lot to parse so quickly.

Where it will hurt is if it is not in cache and the user has some background task running that hits the disk. An example might be on Windows with virus scanning (or viruses :) ).

> Right. I think we could easily serialize a small skip-list to disk too - if we simply store ~8 or ~32 or so indexes into the data - we can parse only a fraction of it, and pop that in our home directory. We could also drop the MyThes code too as a dependency to manage.

I'm not sure what you mean by a skip list, unless you simply mean a file similar to the existing .idx, or just a list of offsets for where the words are so that loading the whole file can be skipped. The trouble with that approach is that readahead will likely pull in the whole file anyway, as the words aren't generally _that_ far apart in it, so you'll still do all the IO and just skip a bit of the CPU time.

> The code using it is in: lingucomponent/source/thesaurus/libnth/nthesimp.cxx
>
> > BTW, if I did that I'd probably do some major surgery on mythes and just use STL because it basically is doing C style memory management and processing and I think I would screw it up if I started messing with it. The only problem with simplifying it with STL constructs is that I would want to change the interface (string vs char *), maybe use STL vectors for the list of synonyms, etc.
>
> Heh; sure.

I've cooled off on this a bit, as performance is slower when using lots of strings etc. I was able to change the approach to loading the idx to treat it as one big buffer, which sped it up considerably too. This did mean resorting to lots of pointer tomfoolery, but it is easy to clean up as there are only 3 allocations instead of 100k+ worth.

> I guess we could re-write it inside lingucomponent then (?) but we should prolly get a better understanding of how frequently this code is called first - is it hooked into from the spell checking code ? or is it really just the Tools-Language-Thesaurus ?

It's actually hooked into the right click menu (probably amongst other things). The first time you right click on a word, the dictionary for the current locale is loaded before the right click menu shows up. After that, it uses the cached thesaurus dictionary for subsequent lookups. If you look in your right-click menu, you'll notice a thesaurus list of synonyms shows up (assuming the word is found) :).

Regards,
Steven Butler
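The "one big buffer" approach to loading the idx could look roughly like the sketch below. It is not the code from the patch: the struct and function names are hypothetical, and the index layout (an encoding line, a count line, then one "word|offset" line per entry) is an assumption made for the example.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical container: the whole index file lives in `buffer`, and
// `words` holds pointers into it, so no per-word string is allocated.
// Do not copy or move the struct after parsing, or the pointers dangle.
struct BufferedIndex
{
    std::string buffer;              // file contents, modified in place
    std::vector<const char*> words;  // NUL-terminated words inside buffer
    std::vector<long> offsets;       // .dat offsets, parallel to words
};

// Assumed layout: line 1 = encoding, line 2 = entry count,
// then "word|offset" lines in sorted order.
void parseIndex(BufferedIndex& idx)
{
    idx.words.clear();
    idx.offsets.clear();
    if (idx.buffer.empty())
        return;
    char* p = &idx.buffer[0];
    char* const end = p + idx.buffer.size();
    int lineNo = 0;
    while (p < end)
    {
        char* eol = static_cast<char*>(std::memchr(p, '\n', end - p));
        if (eol == nullptr)
            eol = end;
        if (lineNo >= 2)  // skip the encoding and count lines
        {
            char* bar = static_cast<char*>(std::memchr(p, '|', eol - p));
            if (bar != nullptr)
            {
                *bar = '\0';  // terminate the word in place
                idx.words.push_back(p);
                idx.offsets.push_back(std::strtol(bar + 1, nullptr, 10));
            }
        }
        ++lineNo;
        p = (eol < end) ? eol + 1 : end;
    }
}
```

The two vectors can still reallocate while growing; reserving space from the count line first would keep the total at three allocations, which matches the figure mentioned above.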
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
On 1 February 2011 06:30, Caolán McNamara caol...@redhat.com wrote:
> FWIW, I'm sure Nemeth would be interested if you e.g. wanted to create a reimpl of mythes that was faster than the original and perhaps simply designate the optimized version the new mythes version with an API/ABI change :-)

I don't think there is any need for an API or ABI change, as I'm shying away from an STL reimplementation.

If optimisation is desired (probably not needed), reducing the string allocations by reading in the whole index file certainly helps: I cut the time to load the US dictionary from 0.046 seconds to 0.019 seconds with a hot cache. The speedup is similar on a cold cache, but I can't recall the numbers exactly - something like 0.1 seconds down to 0.05 seconds.

I thought it would be possible to use the STL algorithms to do the binary search and/or use the map, but using all those strings and a map takes considerably longer than all the strdups in the original (I recall about 0.08 seconds to load the index using an STL map). I didn't measure lookup time but it would be very similar. Using STL vectors made it comparable, but then it turns out binary_search only tells you if an item exists, not its index, which is kind of annoying. :)

So at this point I think an STL rewrite would not result in a performance improvement, and so would be an academic exercise.

Regards,
Steven Butler
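On the binary_search point: std::lower_bound returns an iterator rather than a bool, so the index can be recovered from it. A minimal sketch, assuming the words are held in a sorted std::vector<std::string> (the function name findIndex is made up for this example):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the position of `word` in the sorted vector, or -1 if it is absent.
long findIndex(const std::vector<std::string>& sortedWords,
               const std::string& word)
{
    // lower_bound finds the first element that is not less than `word`.
    std::vector<std::string>::const_iterator it =
        std::lower_bound(sortedWords.begin(), sortedWords.end(), word);
    if (it != sortedWords.end() && *it == word)
        return static_cast<long>(it - sortedWords.begin());
    return -1;
}
```

This only addresses the missing index; it does not change the load-time cost of building a container of strings that is described above.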
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
On 31 January 2011 23:14, Tor Lillqvist tlillqv...@novell.com wrote:

(Hmm, was this message intentionally not to the list?)

An accident, list cc'd.

So after looking at the wiki I wasn't able to find any instructions on how one would go about building the Windows installer. Are the instructions the same as for other platforms?

Yes and no; if you mean some make install or make dev-install, those don't make sense. (I don't even know what they might do in a Windows LibreOffice build environment.) Just doing a normal build successfully on Windows, you end up with an (MSI-based) installer. And if you happen to have NSIS on the machine (optional), also a NSIS wrapper of that, a single executable.

What toolchain needs to be installed? Is it cygwin, mingw, or MSVC?

Cygwin and MSVC2008 or MSVC2010. The Express editions are supposed to work, I think.

I could use MSVC2008 express which I already have installed.

Would the build work over an SMB share? I don't really want to redownload the whole lot (bandwidth is limited on Australian broadband plans) - so failing doing it over SMB, would copying my existing git repos over to the Windows machine allow an attempt at a build without too much breakage? There's obviously a lot of linux product already in that build tree.

And a bunch of other dependencies, but I think their download should now be nicely automated, at least in master. Not 100% automated in the 3-3 branch. (Note that in the 3-3 branch I think one should not attempt a Windows build in the new way (directly in the directory from the bootstrap repo), but just do it the old way, in the directory from the build repo.)

Once the toolchain is there, is there a special target for the windows installer?

It gets built in the instsetoo_native module. The actual dmake target name used in its util/makefile.mk is something like openoffice_en-US I think (yes, we should change those openoffice strings there to libreoffice). The MSI installer (.msi and .cab files, setup.exe, and various small other bits) ends up in wntmsci12.pro/LibreOffice/msi/install/native/en-US or somesuch place.

So do I simply type make at the command line under windows?

Hmmm. seems I already have gnu make in my path on windows from the mingw Ruby build dev build framework as well. I wonder if that would work okay or if I'd need to remove that tool chain from my path to stop things getting confusing?

Time for work... will hopefully look into this tonight ...

--tml

--
Regards,
Steven Butler
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
Hi Michael,

On 29 January 2011 21:45, Steve Butler sebut...@gmail.com wrote:
> I thought I would discuss your idea about not using the index at all to see what reception it gets, but I think you may also have been suggesting a similar thing: are the index files even useful on modern gear? I can populate the en_US index in memory from the .dat file with the C++ code in 0.287 s after dropping all cache, and 0.188s when the cache is hot. I do admit that my desktop is pretty quick though, with 4 cores, SATA II drives etc.

I have plugged the idxdict.cpp code (modified) into the mythes index loader and made it load from the .dat file directly. The index file is no longer touched.

Here's some comparison timings on the above system (measured with gettimeofday either side of the call in swriter).

Using an INDEX FILE:

US Thesaurus - cold OS cache
2011/01/30 04:21:37.887449: Loaded in 0.097378 seconds.

US Thesaurus - hot OS cache
2011/01/30 04:22:37.338682: Loaded in 0.044813 seconds.

USING NO INDEX FILE:

US Thesaurus - cold OS cache
2011/01/30 10:07:42.186452: Loaded in 0.253337 seconds.

US Thesaurus - hot OS cache
2011/01/30 10:08:01.737888: Loaded in 0.130883 seconds.

As can be seen from these numbers, it is around 3x slower for the US thesaurus regardless of hot/cold cache.

> BTW, if I did that I'd probably do some major surgery on mythes and just use STL because it basically is doing C style memory management and processing and I think I would screw it up if I started messing with it. The only problem with simplifying it with STL constructs is that I would want to change the interface (string vs char *), maybe use STL vectors for the list of synonyms, etc.

I've kept the public interface of mythes the same with my changes (but the index file name in the constructor is ignored), apart from this one:

    const char* get_th_encoding();

I didn't change the mentry struct or code dealing with reading an entry from the dat file at all. The offset is loaded straight from the std::map by word lookup but then falls back to the mythes C style code. It might be possible to make the index creation run quicker by avoiding use of so many std::strings but I probably wouldn't do this as it will make it harder to understand.

I did remove some private member functions that were no longer needed, and some private data is now using std::string and std::map (as per idxdict).

Now, assuming anyone thinks this is a good idea and the tradeoff of initial lookup speed vs installation size is appropriate, I would appreciate pointers as to how we would go about packaging up such a change when it is completely isolated to messing about with 3rd party source.

Naturally if this approach was selected then building the .idx files and adding them to the language pack zips would need to be removed. A further option could be to have it use idx files if they exist, but fallback to using only the .dat files.

Changes are LGPLv3+,MPL licensed.

I've attached the two altered files here in case anyone wants to have a look and provide feedback on the approach. As this is simply proof of concept for the timing, I haven't tested against memory leaks or corruption of data yet. I'm also not sure how to format it as the original code is not well formatted.
Regards,
Steven Butler

#ifndef _MYTHES_HXX_
#define _MYTHES_HXX_

// some maximum sizes for buffers
#define MAX_WD_LEN 200
#define MAX_LN_LEN 16384

// a meaning with definition, count of synonyms and synonym list
struct mentry {
    char*  defn;
    int    count;
    char** psyns;
};

#include <iostream>
#include <fstream>
#include <string>
#include <map>

typedef std::map<std::string, long> WordLocationMap;

class MyThes
{
    std::string encoding;  /* stores text encoding; */
    WordLocationMap wordList;
    FILE *pdfile;

    // disallow copy-constructor and assignment-operator for now
    MyThes();
    MyThes(const MyThes &);
    MyThes & operator = (const MyThes &);

public:
    MyThes(const char* idxpath, const char* datpath);
    ~MyThes();

    // lookup text in index and return number of meanings
    // each meaning entry has a definition, synonym count and pointer
    // when complete return the *original* meaning entry and count via
    // CleanUpAfterLookup to properly handle memory deallocation
    int Lookup(const char * pText, int len, mentry** pme);

    void CleanUpAfterLookup(mentry** pme, int nmean);

    const char* get_th_encoding();

private:
    // Open index and dat files and load list array
    int thInitialize (const char* indxpath, const char* datpath);

    // internal close and cleanup dat and idx files
    int thCleanup ();

    /*
    // read a text line (\n terminated) stripping off line terminator
    int readLine(FILE * pf, char * buf, int nc);

    // binary search on null terminated character strings
    int binsearch(char * wrd, char* list[], int nlst);
    */
};

#endif

#include "COPYING"
#include <stdio.h>
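The second attachment is cut off above, so the loader itself is not shown. As a rough illustration only, building a WordLocationMap straight from the .dat data might look like the sketch below. The function name buildIndex is hypothetical, and the .dat layout used here (an encoding line, then for each entry a "word|N" line followed by N meaning lines, with the stored offset pointing at the "word|N" line) is an assumption of the example, not something taken from the attachment.

```cpp
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, long> WordLocationMap;

// Hypothetical loader: scans the .dat stream once and records, for each
// entry, the byte offset of its "word|N" line.
WordLocationMap buildIndex(std::istream& dat, std::string& encoding)
{
    WordLocationMap index;
    std::getline(dat, encoding);  // assumed: first line names the encoding
    std::string line;
    while (true)
    {
        const long offset = static_cast<long>(dat.tellg());
        if (!std::getline(dat, line))
            break;
        const std::string::size_type bar = line.find('|');
        if (bar == std::string::npos)
            continue;  // not an entry line; skip it
        index[line.substr(0, bar)] = offset;
        // Meaning lines also contain '|', so skip them by count.
        const long meanings = std::strtol(line.c_str() + bar + 1, nullptr, 10);
        for (long i = 0; i < meanings && std::getline(dat, line); ++i)
        {
            // meaning lines are only read later, on lookup
        }
    }
    return index;
}
```

A lookup would then seek the .dat file to the stored offset and hand over to the existing entry-reading code, which is the split described in the message above.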
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
On Sun, Jan 30, 2011 at 4:32 AM, Steve Butler sebut...@gmail.com wrote:
> Hi Michael,
>
> On 29 January 2011 21:45, Steve Butler sebut...@gmail.com wrote:
>
> Here's some comparison timings on the above system (measured with gettimeofday either side of the call in swriter).
>
> Using an INDEX FILE:
>
> US Thesaurus - cold OS cache
> 2011/01/30 04:21:37.887449: Loaded in 0.097378 seconds.
>
> US Thesaurus - hot OS cache
> 2011/01/30 04:22:37.338682: Loaded in 0.044813 seconds.
>
> USING NO INDEX FILE:
>
> US Thesaurus - cold OS cache
> 2011/01/30 10:07:42.186452: Loaded in 0.253337 seconds.
>
> US Thesaurus - hot OS cache
> 2011/01/30 10:08:01.737888: Loaded in 0.130883 seconds.
>
> As can be seen from these numbers, it is around 3x slower for the US thesaurus regardless of hot/cold cache.
>
> [...]
>
> Now, assuming anyone thinks this is a good idea and the tradeoff of initial lookup speed vs installation size is appropriate, I would appreciate pointers as to how we would go about packaging up such a change when it is completely isolated to messing about with 3rd party source.
>
> Naturally if this approach was selected then building the .idx files and adding them to the language pack zips would need to be removed. A further option could be to have it use idx files if they exist, but fallback to using only the .dat files.

I have only skimmed this thread, so forgive me if i missed the mark but: why not generate the index at install time ? that will still achieve the goal of reducing the size of the installer, without the performance hit at runtime no?

Norbert

> Regards,
> Steven Butler
Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
Hi Norbert,

> I have only skimmed this thread, so forgive me if i missed the mark but: why not generate the index at install time ? that will still achieve the goal of reducing the size of the installer, without the performance hit at runtime no?

The option to build the index at install time was also discussed and was the original goal, and it has definitely not been ruled out. My understanding is that Michael was not keen to do this on Linux, but keen to try it in the Windows installer.

From my perspective it was a lot easier to patch some code into something I could test (mythes) than something I could not (the Windows installer), so I thought I'd start with the easier option.

It is important to keep in mind that the performance hit is once-off per instance of swriter, and happens the first time you right click on a word in a specific language. With this implementation, once this is done, the whole index is cached in an STL map, so performance should be around the same as before (looking up a word in an STL Rbtree vs a binary search on a char ** structure).

It would also be possible to generate a dictionary index on first use, but of course that would mean having to generate an index per user, so I'm not entertaining that idea seriously.

Regards,
Steven Butler
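The once-off cost described here is the usual load-on-first-use pattern. A minimal sketch follows, assuming the index is a std::map from word to .dat offset as in the attached header; the class name LazyIndex and the loader callback are hypothetical and only stand in for whatever actually parses the .dat file.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

typedef std::map<std::string, long> WordLocationMap;

// Hypothetical wrapper: the index is built on the first lookup and then
// kept in memory for every later lookup in the same process.
class LazyIndex
{
public:
    explicit LazyIndex(std::function<WordLocationMap()> loader)
        : loader_(std::move(loader)), loaded_(false) {}

    // Returns the .dat offset for `word`, or -1 if it is not in the index.
    long offsetOf(const std::string& word)
    {
        if (!loaded_)
        {
            index_ = loader_();  // the once-off cost is paid here
            loaded_ = true;
        }
        const WordLocationMap::const_iterator it = index_.find(word);
        return it == index_.end() ? -1 : it->second;
    }

private:
    std::function<WordLocationMap()> loader_;
    WordLocationMap index_;
    bool loaded_;
};
```

One such object per language would give the behaviour described above: the first right click in a language pays for the load, and later lookups only pay for a map search.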
[Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)
Hi Michael,

> Then 'mythes' seems to be used in lingucomponent/ somewhere - I suppose that is where to be digging for the user code. I suspect if we can read and index this file in two seconds - and it is used in response to user input - there may not really be a lot of value in indexing it ahead of time, but ... ;-) worth playing with that.

I haven't had a look at this yet as I thought getting a script to analyze the existing thesaurus files would be helpful to get those errors looked at.

I thought I would discuss your idea about not using the index at all to see what reception it gets, but I think you may also have been suggesting a similar thing: are the index files even useful on modern gear? I can populate the en_US index in memory from the .dat file with the C++ code in 0.287 s after dropping all cache, and 0.188s when the cache is hot. I do admit that my desktop is pretty quick though, with 4 cores, SATA II drives etc.

If the thesaurus is only loaded when the user pops it up, then couldn't mythes be taught to generate its own in-memory index from the dictionary and not bother with an index file at all?

BTW, if I did that I'd probably do some major surgery on mythes and just use STL because it basically is doing C style memory management and processing and I think I would screw it up if I started messing with it. The only problem with simplifying it with STL constructs is that I would want to change the interface (string vs char *), maybe use STL vectors for the list of synonyms, etc.

By this stage it's not looking much like mythes anymore ...

Regards
Steven Butler