Mathieu,
Thanks! I am reviewing this patch.
cheers,
Michal Pryc
Mathieu Dimanche wrote:
> Hi everyone
>
> Using a home-compiled SVN version (rev. 1090) on Ubuntu Gutsy (7.10),
> I wanted to index my Thunderbird emails properly but encountered some
> problems and strange behavior I felt compelled to fix. So here's a
> patch against rev. 1090 with these improvements (in Changelog order):
>
> 1) Non-ASCII characters in Thunderbird emails
>
> The TB extension currently creates temporary TMS files in
> ~/.xesam/ThunderbirdEmails/ToIndex/, which trackerd then indexes
> asynchronously. These files are XML-like and hold the indexable
> information in CDATA sections.
>
> One problem I encountered concerns the encoding of the strings in
> these CDATA sections. The TB extension fetches Author, Recipients and
> Subject from an nsIMsgDBHdr component as read from the mail header,
> i.e. still MIME-encoded. This means that special characters (French
> accented letters, the copyright symbol, and so on) were left encoded.
> For example, a subject with an "é" in it, such as "Notification d'état
> de la distribution", was given to trackerd through the TMS file as
> "=\?ISO-8859-1\?Q\?Notification_d'=E9tat_de_la_distribution\?=", which
> made indexing the individual words very ineffective. Worse, some
> characters made trackerd fail to index the TMS file at all.
>
> The same happened with recipient lists when, say, someone's surname
> contained a non-ASCII character, and with the "From:" header info.
>
> So the TB extension had to be made to decode these problematic
> strings. Luckily, the nsIMsgDBHdr component offers a simple way to do
> that through its mime2DecodedXXX members. Quite easy.
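
For reference, the decoding step described above can be sketched outside Thunderbird. The extension itself relies on the mime2Decoded members of nsIMsgDBHdr; the snippet below only illustrates the same RFC 2047 decoding with Python's standard email.header module. The helper name decode_mime_header is made up for this sketch, and the sample subject is the one quoted above, written without the backslashes.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Decode RFC 2047 encoded-words into a plain Unicode string."""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() reassembles them into a Header that str() renders
    # as Unicode text.
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    raw = "=?ISO-8859-1?Q?Notification_d'=E9tat_de_la_distribution?="
    print(decode_mime_header(raw))
```

Once decoded like this, the subject is made of ordinary words that an indexer can split on whitespace.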
>
> So the TMS files now contained ISO-8859-1 encoded data. But trackerd
> refused to read these files, as the GNOME functions used to read and
> parse the TMS files expect UTF-8 encoded content. So, OK, let's force
> the extension to encode the whole TMS file in UTF-8. This was done
> through an nsIConverterOutputStream component plugged into the
> nsIFileOutputStream previously used to write the file [1].
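
The re-encoding step can be illustrated in the same spirit. This is only a sketch of the idea (decode the header bytes to Unicode text, then let the output stream encode everything as UTF-8), not the extension's code. The <subject> element name and the .tms suffix are assumptions made for the example, since the real TMS layout is not shown in this mail.

```python
import os
import tempfile


def write_utf8(path: str, text: str) -> None:
    """Write text to path, encoding it as UTF-8 on the way out."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def demo() -> bytes:
    """Write a one-line TMS-like file and return its raw bytes."""
    # Bytes as they might arrive from an ISO-8859-1 encoded mail header.
    latin1_bytes = "Notification d'état".encode("iso-8859-1")
    text = latin1_bytes.decode("iso-8859-1")  # bytes -> Unicode text
    # Hypothetical element name; the real TMS layout may differ.
    payload = "<subject><![CDATA[" + text + "]]></subject>\n"
    fd, path = tempfile.mkstemp(suffix=".tms")
    os.close(fd)
    try:
        write_utf8(path, payload)
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)
```

In the returned bytes the "é" is stored as the two-byte UTF-8 sequence 0xC3 0xA9 rather than the single ISO-8859-1 byte 0xE9, which is what a UTF-8 reader expects.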
>
> What does the patch change, then?
> * Author, Recipients and Subject are always readable and indexable,
> even when composed with non-ASCII characters
> * TMS files are encoded in UTF-8
>
> For info, I indexed my 36000+ emails (a lot of archived spam for
> training anti-spam software), mainly in French and English, and every
> single one was indexed AND shows up nicely in t-s-t search results.
>
>
> 2) Email Recipients and CCs string format
>
> Recipients without a name attached were indexed as "[EMAIL PROTECTED]
> [EMAIL PROTECTED]".
> Recipients with a name attached were indexed as "[EMAIL PROTECTED] Name".
>
> I was expecting "correct" email contact format like "Name
> <[EMAIL PROTECTED]>" or "[EMAIL PROTECTED]"
>
> The patch restores this expected behaviour.
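
The expected contact format can be shown with Python's standard email.utils.formataddr. This is an illustration of the target format only, not the patch itself; format_recipient is a made-up helper name, and user@example.org is a placeholder, since the addresses in this mail are redacted.

```python
from email.utils import formataddr


def format_recipient(name: str, address: str) -> str:
    """Return 'Name <address>', or just the address when no name is set."""
    return formataddr((name, address))


if __name__ == "__main__":
    print(format_recipient("Name", "user@example.org"))
    print(format_recipient("", "user@example.org"))
```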
>
>
> 3) tracker-search-tool emails not showing recipient(s)
>
> t-s-t only showed Subject, Sender and Date.
>
> With the patch, the Recipient is shown too (French label translation
> provided).
>
> TODO : multiple "To :" headers seem to be indexed when appropriate,
> but only the LAST one shows up here.
>
>
> 4) tracker-preferences "Choose a folder" and "Enter a file glob"
> dialogs are not translatable
>
> Well, with the patch, they are (French translations provided).
>
>
> 5) tracker-preferences "Use additional memory for faster indexing"
> translations
>
> The word "additional" was originally misspelled ("additonal").
> Translators translated the string correctly, and then the typo was
> corrected in the source but not in the po files. So I corrected the
> typo in all the po files, and now this option is properly translated.
>
>
> 6) hits/items transition
>
> As seen on bug #464516 [2], using item(s) instead of hit(s) is a good
> idea. Modified the French translations to reflect this (élément(s)
> instead of résultat(s)).
>
>
> 7) trackerd --help uses the system's locale
>
> On my system, LC_ALL was empty, so trackerd's help and usage text was
> always written in the default English, instead of matching my
> LC_MESSAGES="fr_FR.UTF-8".
> So, it's fixed.
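
For context, POSIX resolves the message locale by taking the first non-empty value among LC_ALL, LC_MESSAGES and LANG, so an empty LC_ALL should fall through to LC_MESSAGES. How the patch achieves this inside trackerd is not shown in this mail; the sketch below only models that lookup order, and effective_messages_locale is a made-up name.

```python
def effective_messages_locale(env: dict) -> str:
    """Return the locale POSIX would use for messages, given env vars."""
    # Precedence: LC_ALL overrides LC_MESSAGES, which overrides LANG.
    # An empty value counts as unset.
    for variable in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(variable, "")
        if value:
            return value
    return "C"  # POSIX default when nothing is set


if __name__ == "__main__":
    print(effective_messages_locale({"LC_ALL": "", "LC_MESSAGES": "fr_FR.UTF-8"}))
```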
>
>
> 8) bug #467151 : "Language Typo: It's Portuguese not Portugese"
>
> Fixed.
>
>
> 9) bug #504003: "empty line when adding 'Ignored File Patterns'"
>
> Fixed.
>
> In fact, this was a strange behaviour. Having "NoIndexFileTypes=;" in
> ~/.config/tracker/tracker.cfg made tracker-preferences have a blank
> item in the Ignore FileTypes list, whereas having "NoIndexFileTypes="
> didn't. This behaviour comes from the g_key_file_get_string_list
> function call in _get_string_list.c/_get_string_list() function.
> Pretty sure the glib people should be alerted about this because it's
> very counter-intuitive, and nothing in the documentation [3] leads us
> to expect this kind of behaviour.
>
> Of course, every time an empty list (ending with a semi-colon) was
> fetched from ~/.config/tracker/tracker.cfg, this behaviour appeared.
> But no more.
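
The behaviour described above can be modelled in a few lines. This is a simplified model of a ';'-terminated list value, written as an assumption about what happens, and is not glib's actual parsing code. The names parse_list and drop_empty are made up, and dropping empty entries is just one possible way to avoid the blank item.

```python
def parse_list(value: str) -> list:
    """Simplified model of a key-file list where ';' terminates items."""
    pieces = value.split(";")
    # A trailing ';' terminates the last item instead of starting a new
    # one, so the final empty piece produced by split() is discarded.
    if pieces and pieces[-1] == "":
        pieces.pop()
    return pieces


def drop_empty(items: list) -> list:
    """Remove empty entries so that a lone ';' yields an empty list."""
    return [item for item in items if item]


if __name__ == "__main__":
    print(parse_list(""))              # value of "NoIndexFileTypes="
    print(parse_list(";"))             # value of "NoIndexFileTypes=;"
    print(drop_empty(parse_list(";")))
```

Under this model a bare ";" yields one empty item, which matches the blank row seen in tracker-preferences, while an empty value yields no items at all.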
>
>
> 10) bug #498041: "Thunderbird indexing option grayed out on Debian
> unstable"
>
> Fixed. Made TB indexing usable.
>
>
> 11) bug #464323: "critical warning : tracker_indexer_get_hits"
>
> Fixed. Something to do with the stopwords.
>
>
>
> I hope I respected the coding style (please review) and that someone
> will commit the patch soon.
> If committed, please assign the fixed bugs to me, I'll close them.
> BTW, I'll comment them with an explanation and link to this mail.
>
>
> Mathieu
>
>
> -------------------------------
> [1] http://developer.mozilla.org/en/docs/Writing_textual_data
> [2] http://bugzilla.gnome.org/show_bug.cgi?id=464516
> [3]
> http://library.gnome.org/devel/glib/2.14/glib-Key-value-file-parser.html#g-key-file-get-string-list
>
>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> tracker-list mailing list
> [email protected]
> http://mail.gnome.org/mailman/listinfo/tracker-list
>