Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-02 Thread Steven Butler
in moz\zipped

I found some info on the openoffice.org wiki. The next hurdle seems to
be getting the MS SDK.  I'm downloading the Win7 version as it's smaller
and hopefully okay, but it looks like it will take all night.

I might have this working in a couple of weeks at this rate.

Also, I noticed some oddities.

checking size of long... 0!!

configure kept picking up /usr/bin/csc.exe, which seems to be some kind
of Scheme interpreter.  I had just pulled in all the Cygwin dev tools, so
I guess that is how I ended up with it.  I renamed the file to something
else and configure now picks up the .NET version.

It also appears that when using VCExpress, ATL and COM are out, so
presumably some features will go missing.

Regards
Steven Butler
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-02 Thread Tor Lillqvist
 On 2011-02-02 at 13:56, tlillqv...@novell.com wrote:
  checking size of long... 0!!

Equally fun is the one that follows immediately:

checking whether byte ordering is bigendian... yes

The result of that test cannot really be used anywhere either. I will
remove it, too.

--tml




Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-02 Thread Tor Lillqvist
 No chance it's used to support writing some of the binary file formats
 out uniformly across different endians?

Nope. OOo/LibreOffice has had its own code for all such things for a very
long time. The configure script is a relatively late addition to
OOo/LibreOffice.

Anyway, I noticed that SIZEOF_LONG and WORDS_BIGENDIAN are checked in
set_soenv.in for some rare platforms (MIPS, PowerPC, S390x), so I will not
remove them from configure.in. I will just bypass the failing tests on
Windows and hardcode the correct values (even if they are not needed as
such on Windows).
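
For readers unfamiliar with the two checks, here is a minimal C++ sketch
of what SIZEOF_LONG and WORDS_BIGENDIAN describe. The helper names
sizeOfLong and isBigEndian are made up for this illustration; they are not
taken from configure.in or set_soenv.in.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: the value a working "size of long" check reports.
inline std::size_t sizeOfLong()
{
    return sizeof(long);
}

// Hypothetical helper: true when the most significant byte of a
// multi-byte integer is stored at the lowest address.
inline bool isBigEndian()
{
    const std::uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x01;
}
```

On an x86 Windows build with MSVC, sizeOfLong() gives 4 and isBigEndian()
gives false, so the "0" and "yes" reported by configure cannot be right.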

--tml




Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-02 Thread Steven Butler
Hurray,

I have finally got past the bootstrap phase and I'm leaving it to build
in the background today while I'm at work.

I had a number of small issues that I had to resolve, including being
unable to execute some of the installers that were downloaded.  A
chmod 755 src/*.exe src/*.EXE seemed to resolve that, but there were a
couple of other issues too.

I will send an update tonight if all goes well.

Regards,
Steven Butler


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-02 Thread Steven Butler
Sorry Tor - forgot to reply-all and sent only to you previously...
resending to the list.


On 3 February 2011 10:35, Steven Butler sebut...@gmail.com wrote:

 I will send an update tonight if all goes well.

It seems to have failed building VCL - there is an error stating
f268: Error: The image(s) check ... could not be found. (The elision is
mine, as I am on a different PC.)

Could this have something to do with removing icons that was done
recently?  I may have to wait till tonight and try to update the git
checkout(s) to see if that helps.

I also had errors building in a number of other subprojects that I'll
need to look into tonight.

Regards,
Steven Butler


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-01 Thread Steven Butler
On 1 February 2011 07:53, Tor Lillqvist tlillqv...@novell.com wrote:

 With the clarification that it is the *Cygwin* command line, yes.

 seems I already have gnu make in my path on windows from the mingw

 Nah, that is not usable for this. It must be the Cygwin make that is used for 
 this (and other Cygwin tools, as described in the wiki). (The Cygwin make is 
 as such not used for much in the LO build process, just at the very top 
 level. For the rest LO's own dmake is used.) And to avoid any possibility 
 of confusion, make sure your non-related development environment(s) don't 
 show up in any environment variables (PATH, LIBS, etc) in the LO build 
 environment.

 --tml

Ok, I've not done any more work on developing this as I have been
working on getting a win32 (actually 64 bit win 7) build environment
working tonight.

I haven't got very far but I will try to note the steps I've taken as
I go.  I'm currently going to have to give up for the night as it is
complaining about the MozillaBuildSetup tools and it's 79 MB and
coming down from ftp.mozilla.com at dialup speed :(

I'm very new to git so I gave up on --reference and just did a
straight clone from my SMB share, which was relatively quick, but of
course I only got the bootstrap.  Once I get past bootstrap stage, is
the git part going to grab relative to bootstrap or go straight to
libreoffice.org?  I was thinking about manually cloning each of the
repositories in the clone directory if necessary to short circuit
this.

Here are my steps so far:

1. Install Cygwin - pick all development tools and install (much later)
2. Cloned the bootstrap git project from the SMB share and copied the src files.
3. In Cygwin shell, the autogen failed with an odd error related to
Native programs and symlinks.  I got past this by doing the following:
cd /bin
rm /usr/bin/awk
cp /usr/bin/gawk.exe awk.exe
cp /usr/bin/gzip.exe gunzip.exe
4. After this, it seemed to pick my MSVC2008 Express install as the
compiler (I also had several cygwin gcc versions installed but it
seems to have ignored them), then I needed to add the jdk 6 home to
the config option
./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/
5. I now find I need the Mozilla build tools and another config option.
Download
http://ftp.mozilla.org/pub/mozilla.org/mozilla/libraries/win32/MozillaBuildSetup-Latest.exe
(a very slow 2 hour download :( ).  I will install it in the morning if
it has finished downloading...

I'll keep adding to this list in case it helps someone else out.

-- 
Regards,
Steven Butler


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-01 Thread Steven Butler
Hi,

On 1 February 2011 22:26, Michael Meeks michael.me...@novell.com wrote:
 Hi Steve,

 3. In Cygwin shell, the autogen failed with an odd error related to
 Native programs and symlinks.  I got past this by doing the following:
     cd /bin
     rm /usr/bin/awk
       cp /usr/bin/gawk.exe awk.exe
       cp /usr/bin/gzip.exe gunzip.exe

        Urk; I guess we should try to patch/fix our autogen.sh to work more
 nicely - or is this unavoidable ?

It says it's because non-Cygwin programs (native Windows) can't execute
them - I have no idea where they are used (or if they are used) so I
followed some hints off the net to make it stop complaining :)

 4. After this, it seemed to pick my MSVC2008 Express install as the
 compiler (I also had several cygwin gcc versions installed but it
 seems to have ignored them), then I needed to add the jdk 6 home to
 the config option
     ./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/

After finally installing mozilla-build with the following steps:

7. Rerun autogen:
./autogen.sh --with-jdk-home=/cygdrive/c/Program\ Files/Java/jdk1.6.0_18/ \
    --with-mozilla-build=/cygdrive/c/mozilla-build/

configure: error: Building SeaMonkey is supported with Microsoft
Visual Studio 2005 only.
8. I downloaded prebuilt seamonkey from here:
http://tools.openoffice.org/moz_prebuild/OOo3.2/
I grabbed the 3 files starting with WNT but I'm not sure what to do with
them...

Where should I put these to make it all go?

Regards
Steven Butler


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-01 Thread Andras Timar
2011/2/1 Steven Butler sebut...@gmail.com:

        configure: error: Building SeaMonkey is supported with Microsoft
 Visual Studio 2005 only.
 8. I downloaded prebuilt seamonkey from here:
 http://tools.openoffice.org/moz_prebuild/OOo3.2/
        grabbed 3 files started with WNT but I'm not sure what to do with 
 them...

 Where should I put these to make it all go?

clone\libs-extern-sys\moz\zipped\

Cheers,
Andras


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-02-01 Thread Jesús Corrius
On Tue, Feb 1, 2011 at 11:44 PM, Steven Butler sebut...@gmail.com wrote:
 Hi,

 On 1 February 2011 22:26, Michael Meeks michael.me...@novell.com wrote:
 Hi Steve,

 3. In Cygwin shell, the autogen failed with an odd error related to
 Native programs and symlinks.  I got past this by doing the following:
     cd /bin
     rm /usr/bin/awk
       cp /usr/bin/gawk.exe awk.exe
       cp /usr/bin/gzip.exe gunzip.exe

        Urk; I guess we should try to patch/fix our autogen.sh to work more
 nicely - or is this unavoidable ?

 It says its because non-cygwin programs (native Windows) can't execute
 them - I have no idea where they are used (or if they are used) so I
 followed some hints off the net to make it stop complaining :)

 4. After this, it seemed to pick my MSVC2008 Express install as the
 compiler (I also had several cygwin gcc versions installed but it
 seems to have ignored them), then I needed to add the jdk 6 home to
 the config option
     ./autogen.sh --with-jdk-home=/cygdrive/c/Program\ 
 Files/Java/jdk1.6.0_18/

 After finally installing mozilla-build with the following steps:

 7. Rerun autogen:
        ./autogen.sh --with-jdk-home=/cygdrive/c/Program\
 Files/Java/jdk1.6.0_18/
 --with-mozilla-build=/cygdrive/c/mozilla-build/

        configure: error: Building SeaMonkey is supported with Microsoft
 Visual Studio 2005 only.
 8. I downloaded prebuilt seamonkey from here:
 http://tools.openoffice.org/moz_prebuild/OOo3.2/
        grabbed 3 files started with WNT but I'm not sure what to do with 
 them...

 Where should I put these to make it all go?

In the directory moz/zipped.

-- 
Jesús Corrius je...@softcatala.org
Document Foundation founding member
Skype: jcorrius | Twitter: @jcorrius


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-31 Thread Caolán McNamara
On Mon, 2011-01-31 at 15:17 +, Michael Meeks wrote:
 Hi Steve,
 On Sat, 2011-01-29 at 21:45 +1000, Steve Butler wrote:
  If the thesaurus is only loaded when the user pops it up, then
  couldn't mythes be taught to generate its own in-memory index
  from the dictionary and not bother with an index file at all?
 
   Right. I think we could easily serialize a small skip-list to disk too
 - if we simply store ~8 or ~32 or so indexes into the data - we can
 parse only a fraction of it, and pop that in our home directory. We
 could also drop the MyThes code as a dependency to manage.
 
   The code using it is in:
 
   lingucomponent/source/thesaurus/libnth/nthesimp.cxx
 
  BTW, if I did that I'd probably do some major surgery on mythes and
  just use STL because it basically is doing C style memory management
  and processing and I think I would screw it up if I started messing
  with it.  The only problem with simplifying it with STL constructs is
  that I would want to change the interface (string vs char *), maybe
  use STL vectors for the list of synonyms, etc.
 
   Heh; sure.
 
  By this stage it's not looking much like mythes anymore ...

FWIW, I'm sure Nemeth would be interested if you e.g. wanted to create a
reimpl of mythes that was faster than the original and perhaps simply
designate the optimized version the new mythes version with an API/ABI
change :-)

C.



Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-31 Thread Steven Butler
Hi Michael

On 1 February 2011 01:17, Michael Meeks michael.me...@novell.com wrote:
 Hi Steve,

        Sure - so; in response to user input I suspect we can take a second to
 parse the thesaurus; we have around 20Mb of text to load for en_US;
 perhaps 32Mb is a reasonable upper-bound; it does seem a lot to parse so
 quickly.

Where it will hurt is if it is not in cache and the user has some
background task running that hits the disk.

An example might be on Windows with virus scanning (or viruses :) ).

        Right. I think we could easily serialize a small skip-list to disk too
 - if we simply store ~8 or ~32 or so indexes into the data - we can
 parse only a fraction of it, and pop that in our home directory. We
 could also drop the MyThes code as a dependency to manage.

I'm not sure what you mean by a skip list unless you simply mean a
similar file to the existing .idx, or just a list of offsets for where
the words are to skip loading the whole file.  The trouble with that
approach is the readahead will likely pull in the whole file anyway as
the words aren't generally _that_ far apart in it, so you'll still do
all the IO and just skip a bit of the CPU time.
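
One possible reading of the skip-list idea, sketched in C++: keep only
every Nth (word, offset) pair in memory and start scanning the data from
the nearest earlier sample. This assumes the entries in the data file are
sorted by word, which this thread does not confirm. Entry, makeSparseIndex
and scanStartFor are hypothetical names, not MyThes code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One (word, byte offset) pair, as a full index would hold it.
typedef std::pair<std::string, long> Entry;

// Keep only every `stride`-th entry of a full index sorted by word.
std::vector<Entry> makeSparseIndex(const std::vector<Entry>& full,
                                   std::size_t stride)
{
    std::vector<Entry> sparse;
    if (stride == 0)
        return sparse;
    for (std::size_t i = 0; i < full.size(); i += stride)
        sparse.push_back(full[i]);
    return sparse;
}

// Ordering used by std::upper_bound: search word vs. sampled entry.
static bool wordBeforeEntry(const std::string& word, const Entry& e)
{
    return word < e.first;
}

// Offset at which a forward scan for `word` should start: the offset of
// the last sampled word that is <= `word`, or -1 if `word` sorts before
// every sample.
long scanStartFor(const std::vector<Entry>& sparse, const std::string& word)
{
    std::vector<Entry>::const_iterator it =
        std::upper_bound(sparse.begin(), sparse.end(), word, wordBeforeEntry);
    if (it == sparse.begin())
        return -1;
    --it;
    return it->second;
}
```

A sketch like this mainly saves CPU time and memory; as noted above, OS
readahead may still pull most of the file in.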


        The code using it is in:

        lingucomponent/source/thesaurus/libnth/nthesimp.cxx

 BTW, if I did that I'd probably do some major surgery on mythes and
 just use STL because it basically is doing C style memory management
 and processing and I think I would screw it up if I started messing
 with it.  The only problem with simplifying it with STL constructs is
 that I would want to change the interface (string vs char *), maybe
 use STL vectors for the list of synonyms, etc.

        Heh; sure.

I've cooled off on this a bit as performance is slower when using lots
of strings etc.  I was able to change the approach to loading the idx
to treat it as a big buffer, which sped it up considerably too.  This did
mean resorting to lots of pointer tomfoolery but it is easy to clean up
as there are only 3 allocations instead of 100k+.
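
A rough sketch of the big-buffer approach, not the actual patch: the whole
index is held in one buffer, split in place, and the word table just
points into it. It assumes index lines of the form word|offset and skips
any line without a '|' (such as header lines). BufferIndex and
parseIndexLines are illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical in-memory index. `words` points into `text`, so a copy of
// this struct would still point at the original's buffer.
struct BufferIndex {
    std::vector<char>        text;     // whole index, split in place
    std::vector<const char*> words;    // one pointer per entry
    std::vector<long>        offsets;  // one .dat offset per entry
};

// Parse "word|offset\n" lines from `data` into `out`.
void parseIndexLines(const char* data, std::size_t len, BufferIndex& out)
{
    // Count lines up front so each vector is sized once.
    const std::size_t maxLines =
        static_cast<std::size_t>(std::count(data, data + len, '\n')) + 1;

    out.text.clear();
    out.text.reserve(len + 1);
    out.text.assign(data, data + len);
    out.text.push_back('\0');          // terminator for the last line

    out.words.clear();
    out.words.reserve(maxLines);
    out.offsets.clear();
    out.offsets.reserve(maxLines);

    char* p = &out.text[0];
    char* const end = p + len;
    while (p < end) {
        char* eol = static_cast<char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        if (!eol)
            eol = end;                 // last line without a newline
        *eol = '\0';
        char* bar = std::strchr(p, '|');
        if (bar) {                     // lines without '|' are skipped
            *bar = '\0';
            out.words.push_back(p);
            out.offsets.push_back(std::strtol(bar + 1, 0, 10));
        }
        p = eol + 1;
    }
}
```

Cleanup is then just a matter of letting the three vectors go out of
scope, rather than freeing one string per word.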

        I guess we could re-write it inside lingucomponent then (?) but we
 should prolly get a better understanding of how frequently this code is
 called first - is it hooked into from the spell checking code ? or is it
 really just the Tools-Language-Thesaurus ?

It's actually hooked into the right click menu (probably amongst other
things).  The first time you right click on a word, the dictionary for
the current locale is loaded before the right click menu shows up.
After that, it uses the cached thesaurus dictionary for subsequent
lookups.

If you look in your right-click menu, you'll notice a thesaurus list
of synonyms shows up (assuming the word is found) :).

Regards,
Steven Butler


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-31 Thread Steven Butler
On 1 February 2011 06:30, Caolán McNamara caol...@redhat.com wrote:
 FWIW, I'm sure Nemeth would be interested if you e.g. wanted to create a
 reimpl of mythes that was faster than the original and perhaps simply
 designate the optimized version the new mythes version with an API/ABI
 change :-)

I don't think there is any need for an API or ABI change as I'm shying
away from an STL reimplementation.  If optimisation is desired
(probably not needed), reducing the string allocations by reading in
the whole index file certainly helps: loading the US dictionary went
from 0.046 seconds to 0.019 seconds with a hot cache.  The speedup is
similar with a cold cache but I can't recall the numbers exactly -
something like 0.1 seconds down to 0.05 seconds.

I thought it would be possible to use the STL algorithms to do the
binary search and/or use the map, but using all those strings and a
map takes considerably longer than all the strdups in the original (I
recall about 0.08 seconds to load the index using an STL map).  I didn't
measure lookup time but it would be very similar.

Using STL vectors made it comparable, but then it turns out
binary_search only tells you if an item exists, not its index which is
kind of annoying. :)
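
For what it's worth, std::lower_bound does return a position, so the
index can be recovered with one extra equality check. A minimal sketch
(indexOf is a hypothetical helper, and the vector is assumed to be sorted
already):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return the position of `word` in the sorted vector, or -1 if absent.
long indexOf(const std::vector<std::string>& sortedWords,
             const std::string& word)
{
    // lower_bound finds the first element that is not less than `word`.
    std::vector<std::string>::const_iterator it =
        std::lower_bound(sortedWords.begin(), sortedWords.end(), word);
    if (it == sortedWords.end() || *it != word)
        return -1;
    return static_cast<long>(it - sortedWords.begin());
}
```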

So at this point I think an STL rewrite would not result in a
performance improvement, so would be an academic exercise.

Regards,
Steven Butler


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-31 Thread Steven Butler
On 31 January 2011 23:14, Tor Lillqvist tlillqv...@novell.com wrote:
 (Hmm, was this message intentionally not to the list?)

An accident; list cc'd.

 So after looking at the wiki I wasn't able to find any instructions on
 how one would go about building the Windows installer.  Are the
 instructions the same as for other platforms?

 Yes and no; if you mean some make install or make dev-install, those 
 don't make sense. (I don't even know what they might do in a Windows 
 LibreOffice build environment.)

 Just doing a normal build successfully on Windows, you end up with an 
 (MSI-based) installer. And if you happen to have NSIS on the machine 
 (optional), also a NSIS wrapper of that, a single executable.

 What toolchain needs to be installed?  Is it cygwin, mingw, or MSVC?

 Cygwin and MSVC2008 or MSVC2010. The Express editions are supposed to work, I 
 think.

I could use MSVC2008 express which I already have installed.  Would
the build work over an SMB share?  I don't really want to redownload
the whole lot (bandwidth is limited on Australian broadband plans) -
so failing doing it over SMB, would copying my existing git repos over
to the Windows machine allow an attempt at a build without too much
breakage?  There's obviously a lot of linux product already in that
build tree.

 And a bunch of other dependencies, but I think their download should now be 
 nicely automated, at least in master. Not 100% automated in the 3-3 branch. 
 (Note that in the 3-3 branch I  think one should not attempt a Windows build 
 in the new way (directly in the directory from the bootstrap repo), but 
 just do it the old way, in the directory from the build repo.)

 Once the toolchain is there, is there a special target for the windows
 installer?

 It gets built in the instsetoo_native module. The actual dmake target name 
 used in its util/makefile.mk is something like openoffice_en-US I think (yes, 
 we should change those openoffice strings there to libreoffice). The MSI 
 installer (.msi and .cab files, setup.exe, and various small other bits) ends 
 up in wntmsci12.pro/LibreOffice/msi/install/native/en-US or somesuch place.

So do I simply type make at the command line under windows?  Hmmm.
seems I already have gnu make in my path on windows from the mingw
Ruby build dev build framework as well.  I wonder if that would work
okay or if I'd need to remove that tool chain from my path to stop
things getting confusing?

Time for work... will hopefully look into this tonight ...


 --tml
-- 
Regards,
Steven Butler


Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-30 Thread Steve Butler
Hi Michael,

On 29 January 2011 21:45, Steve Butler sebut...@gmail.com wrote:

 I thought I would discuss your idea about not using the index at all
 to see what reception it gets, but I think you may also have been
 suggesting a similar thing:
 are the index files even useful on modern gear?

 I can populate the en_US index in memory from the .dat file with the
 C++ code in 0.287 s after dropping all cache, and 0.188s when the
 cache is hot.

 I do admit that my desktop is pretty quick though, with 4 cores, SATA
 II drives etc.

I have plugged the idxdict.cpp code (modified) into the mythes index
loader and made it load from the .dat file directly.  The index file
is no longer touched.

Here are some comparison timings on the above system (measured with
gettimeofday either side of the call in swriter).

Using an INDEX FILE:
US Thesaurus - cold OS cache
2011/01/30 04:21:37.887449: Loaded in 0.097378 seconds.
US Thesaurus - hot OS cache
2011/01/30 04:22:37.338682: Loaded in 0.044813 seconds.

USING NO INDEX FILE:
US Thesaurus - cold OS cache
2011/01/30 10:07:42.186452: Loaded in 0.253337 seconds.
US Thesaurus - hot OS cache
2011/01/30 10:08:01.737888: Loaded in 0.130883 seconds.

As can be seen from these numbers, loading without the index file is
roughly 2.6x (cold cache) to 2.9x (hot cache) slower for the US thesaurus.
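
The measurement approach can be sketched as follows. This is an
assumption about the general shape, not the code that was actually used
in swriter; elapsedSeconds and timeCall are made-up names, and
gettimeofday is POSIX only.

```cpp
#include <sys/time.h>

// Seconds elapsed between two gettimeofday() samples.
double elapsedSeconds(const timeval& start, const timeval& end)
{
    return static_cast<double>(end.tv_sec - start.tv_sec)
         + static_cast<double>(end.tv_usec - start.tv_usec) / 1e6;
}

// Hypothetical helper: wall-clock time taken by one call of `fn`.
template <typename Fn>
double timeCall(Fn fn)
{
    timeval start, end;
    gettimeofday(&start, 0);
    fn();                      // e.g. the thesaurus load being measured
    gettimeofday(&end, 0);
    return elapsedSeconds(start, end);
}
```

Wall-clock timing like this includes disk waits, which is why the cold
and hot cache figures differ so much.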

 BTW, if I did that I'd probably do some major surgery on mythes and
 just use STL because it basically is doing C style memory management
 and processing and I think I would screw it up if I started messing
 with it.  The only problem with simplifying it with STL constructs is
 that I would want to change the interface (string vs char *), maybe
 use STL vectors for the list of synonyms, etc.

I've kept the public interface of mythes the same with my changes (but
the index file name in the constructor is ignored), apart from this
one:
const char* get_th_encoding();

I didn't change the mentry struct or code dealing with reading an
entry from the dat file at all.  The offset is loaded straight from
the std::map by word lookup but then falls back to the mythes C style
code.

It might be possible to make the index creation run quicker by
avoiding use of so many std::strings but I probably wouldn't do this
as it will make it harder to understand.

I did remove some private member functions that were no longer needed,
and some private data is now using std::string and std::map (as
per idxdict).

Now, assuming anyone thinks this is a good idea and the tradeoff of
initial lookup speed vs installation size is appropriate, I would
appreciate pointers as to how we would go about packaging up such a
change when it is completely isolated to messing about with 3rd party
source.  Naturally if this approach was selected then building the
.idx files and adding them to the language pack zips would need to be
removed.  A further option could be to have it use idx files if they
exist, but fallback to using only the .dat files.
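
The fallback option could look roughly like this. It is only a sketch,
under the assumption that "exists" means "can be opened for reading";
IndexSource, isReadable and chooseIndexSource are invented names.

```cpp
#include <fstream>
#include <string>

enum IndexSource { FromIdxFile, FromDatFile };

// True if the file at `path` can be opened for reading.
inline bool isReadable(const std::string& path)
{
    std::ifstream f(path.c_str(), std::ios::in | std::ios::binary);
    return f.is_open();
}

// Prefer a shipped .idx file; otherwise build the index from the .dat.
inline IndexSource chooseIndexSource(const std::string& idxPath)
{
    return isReadable(idxPath) ? FromIdxFile : FromDatFile;
}
```

This would let language packs that still ship .idx files keep the faster
first load, while packs without them would still work.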

Changes are LGPLv3+,MPL licensed.  I've attached the two altered files
here in case anyone wants to have a look and provide feedback on the
approach.

As this is simply proof of concept for the timing, I haven't tested
against memory leaks or corruption of data yet.

I'm also not sure how to format it as the original code is not well formatted.

Regards,
Steven Butler
#ifndef _MYTHES_HXX_
#define _MYTHES_HXX_

// some maximum sizes for buffers
#define MAX_WD_LEN 200
#define MAX_LN_LEN 16384


// a meaning with definition, count of synonyms and synonym list
struct mentry {
  char*  defn;
  int  count;
  char** psyns;
};
#include <stdio.h>   // FILE
#include <iostream>
#include <fstream>
#include <string>
#include <map>

typedef std::map<std::string, long> WordLocationMap;

class MyThes
{

	std::string  encoding;   /* stores text encoding; */
	WordLocationMap wordList;
 
FILE  *pdfile;

	// disallow copy-constructor and assignment-operator for now
	MyThes();
	MyThes(const MyThes &);
	MyThes & operator = (const MyThes &);

public:
	MyThes(const char* idxpath, const char* datpath);
	~MyThes();

// lookup text in index and return number of meanings
	// each meaning entry has a definition, synonym count and pointer
// when complete return the *original* meaning entry and count via 
// CleanUpAfterLookup to properly handle memory deallocation

int Lookup(const char * pText, int len, mentry** pme); 

void CleanUpAfterLookup(mentry** pme, int nmean);

const char* get_th_encoding(); 

private:
// Open index and dat files and load list array
int thInitialize (const char* indxpath, const char* datpath);

// internal close and cleanup dat and idx files
int thCleanup ();
/*
// read a text line (\n terminated) stripping off line terminator
int readLine(FILE * pf, char * buf, int nc);

// binary search on null terminated character strings
int binsearch(char * wrd, char* list[], int nlst);
*/
};

#endif





#include "COPYING"
#include <stdio.h>

Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-30 Thread Norbert Thiebaud
On Sun, Jan 30, 2011 at 4:32 AM, Steve Butler sebut...@gmail.com wrote:
 Hi Michael,

 On 29 January 2011 21:45, Steve Butler sebut...@gmail.com wrote:

 Here's some comparison timings on the above system (measured with
 gettimeofday either side of the call in swriter).

 Using an INDEX FILE:
 US Thesaurus - cold OS cache
 2011/01/30 04:21:37.887449: Loaded in 0.097378 seconds.
 US Thesaurus - hot OS cache
 2011/01/30 04:22:37.338682: Loaded in 0.044813 seconds.

 USING NO INDEX FILE:
 US Thesaurus - cold OS cache
 2011/01/30 10:07:42.186452: Loaded in 0.253337 seconds.
 US Thesaurus - hot OS cache
 2011/01/30 10:08:01.737888: Loaded in 0.130883 seconds.

 As can be seen from these numbers, it is around 3x slower for the US
 thesaurus regardless of hot/cold cache.

[...]
 Now, assuming anyone thinks this is a good idea and the tradeoff of
 initial lookup speed vs installation size is appropriate, I would
 appreciate pointers as to how we would go about packaging up such a
 change when it is completely isolated to messing about with 3rd party
 source.  Naturally if this approach was selected then building the
 .idx files and adding them to the language pack zips would need to be
 removed.  A further option could be to have it use idx files if they
 exist, but fallback to using only the .dat files.

I have only skimmed this thread, so forgive me if i missed the mark but:

why not generate the index at install time ?
that will still achieve the goal of reducing the size of the
installer, without the performance hit at runtime no?

Norbert

 Regards,
 Steven Butler



Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-30 Thread Steve Butler
Hi Norbert,


 I have only skimmed this thread, so forgive me if i missed the mark but:

 why not generate the index at install time ?
 that will still achieve the goal of reducing the size of the
 installer, without the performance hit at runtime no?


The option to build the index at install time was also discussed and
was the original goal, and has definitely not been ruled out.  My
understanding is that Michael was not keen to do this on Linux, but was
keen to try it in the Windows installer.

From my perspective it was a lot easier to patch some code into
something I could test (mythes) than something I could not (the
windows installer), so I thought I'd start with an easier option.

It is important to keep in mind that the performance hit is a one-off
per instance of swriter, and happens the first time you right click on
a word in a specific language.  With this implementation, once this is
done, the whole index is cached in an STL map so performance should be
around the same as before (looking up a word in an STL red-black tree vs
a binary search on a char ** structure).
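
To make the comparison concrete, here is a small sketch of the two lookup
styles side by side. Both are O(log n) in the number of words;
findInArray and findInMap are illustrative names, and the array is
assumed to be sorted in strcmp order.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <map>
#include <string>

// Comparator for std::bsearch: `key` is the word, `elem` points at one
// slot of the char* array.
static int compareWords(const void* key, const void* elem)
{
    return std::strcmp(static_cast<const char*>(key),
                       *static_cast<const char* const*>(elem));
}

// C style: index of `word` in a sorted array of C strings, or -1.
int findInArray(const char* word, const char* const* list, int n)
{
    const void* hit = std::bsearch(word, list, static_cast<std::size_t>(n),
                                   sizeof(const char*), compareWords);
    if (!hit)
        return -1;
    return static_cast<int>(static_cast<const char* const*>(hit) - list);
}

// STL style: offset stored for `word` in the map, or -1.
long findInMap(const std::map<std::string, long>& wordList,
               const std::string& word)
{
    std::map<std::string, long>::const_iterator it = wordList.find(word);
    return it == wordList.end() ? -1 : it->second;
}
```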

It would also be possible to generate a dictionary index on first use,
but of course that would mean having to generate an index per user so
I'm not entertaining that idea seriously.

Regards,
Steven Butler


[Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

2011-01-29 Thread Steve Butler
Hi Michael,


        Then 'mythes' seems to be used in lingucomponent/ somewhere - I suppose
 that is where to be digging for the user code. I suspect if we can read
 and index this file in two seconds - and it is used in response to user
 input - there may not really be a lot of value in indexing it ahead of
 time, but ... ;-) worth playing with that.

I haven't had a look at this yet as I thought getting a script to
analyze the existing thesaurus files would be helpful to get those
errors looked at.

I thought I would discuss your idea about not using the index at all
to see what reception it gets, but I think you may also have been
suggesting a similar thing:
are the index files even useful on modern gear?

I can populate the en_US index in memory from the .dat file with the
C++ code in 0.287 s after dropping all cache, and 0.188s when the
cache is hot.

I do admit that my desktop is pretty quick though, with 4 cores, SATA
II drives etc.

If the thesaurus is only loaded when the user pops it up, then
couldn't mythes be taught to generate its own in-memory index
from the dictionary and not bother with an index file at all?

BTW, if I did that I'd probably do some major surgery on mythes and
just use STL because it basically is doing C style memory management
and processing and I think I would screw it up if I started messing
with it.  The only problem with simplifying it with STL constructs is
that I would want to change the interface (string vs char *), maybe
use STL vectors for the list of synonyms, etc.

By this stage it's not looking much like mythes anymore ...

Regards
Steven Butler