[notmuch] Notmuch's search view sucks

2009-12-04 Thread Baruch Even
Karl Wiberg wrote:
> On Fri, Dec 4, 2009 at 1:29 AM, Carl Worth wrote:
>> And a step beyond that would support different languages for
>> different emails, but that sounds like something "hard" to identify.
> 
> But probably not as hard as identifying spam. It could probably be
> done with a simple Bayesian filter counting word frequencies---but
> it'd be much better if somebody else had already solved the problem,
> since this smells suspiciously like something that ought to be a
> separate project and put in a library ... does anyone know if such a
> project already exists? I know Google can do it ...
> 
> It'd be very cool to have notmuch automatically tag messages according
> to what language they're in.

What we should have is an interface to run an external program to 
classify a message when it's newly introduced and another that runs when 
tags are changed so that machine learning can be made to work when the 
user changes tags.
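
A minimal sketch of what such an interface could look like, assuming a made-up convention: the classifier is an external command that reads the raw message on stdin and prints one tag per line, and the trainer is a second command that receives a small JSON description of a tag change on stdin. None of these names or conventions exist in notmuch; they are purely illustrative.

```python
import json
import subprocess
import sys

def classify_message(raw_message, command):
    """Run a (hypothetical) external classifier; return the tags it prints."""
    result = subprocess.run(command, input=raw_message,
                            capture_output=True, check=True)
    lines = result.stdout.decode().splitlines()
    return [line.strip() for line in lines if line.strip()]

def notify_tag_change(message_id, added, removed, command):
    """Tell a (hypothetical) external trainer that the user changed tags."""
    payload = json.dumps({"id": message_id,
                          "added": sorted(added),
                          "removed": sorted(removed)})
    subprocess.run(command, input=payload.encode(), check=True)

# Stand-in classifier for demonstration only: tags a message "lang-sv"
# if it contains the bytes "hej", otherwise "lang-en".
DEMO_CLASSIFIER = [
    sys.executable, "-c",
    "import sys; d = sys.stdin.buffer.read(); "
    "print('lang-sv' if b'hej' in d else 'lang-en')",
]

# Stand-in trainer: just checks that it received valid JSON.
DEMO_TRAINER = [sys.executable, "-c", "import sys, json; json.load(sys.stdin)"]
```

With this shape, the mail store only needs to know two command lines; the learning itself stays in whatever external program the user prefers.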

Baruch



[notmuch] Notmuch's search view sucks

2009-12-04 Thread Olly Betts
Karl Wiberg writes:
> On Fri, Dec 4, 2009 at 1:29 AM, Carl Worth wrote:
> > And a step beyond that would support different languages for
> > different emails, but that sounds like something "hard" to identify.
> 
> But probably not as hard as identifying spam. It could probably be
> done with a simple Bayesian filter counting word frequencies---but
> it'd be much better if somebody else had already solved the problem,
> since this smells suspiciously like something that ought to be a
> separate project and put in a library ... does anyone know if such a
> project already exists?

There's TextCat:

http://www.let.rug.nl/vannoord/TextCat/

It looks at n-gram frequencies, and can guess pretty reliably from
even a fairly small amount of text.
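
For illustration, here is a rough sketch of the n-gram idea: build a ranked list of the most frequent character n-grams per language, then pick the language whose ranking is closest to the document's. This is not TextCat's code. The function names are invented, and the two training samples are tiny made-up sentences, so a real profile would need far more text.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"          # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Break ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy training data (an assumption for the example, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund in der sonne",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling `guess_language("the dog and the fox", PROFILES)` compares the input against both toy profiles and returns the name of the closer one.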

TextCat is in Perl.  I don't know if there's a C or C++ implementation
but it isn't a huge piece of code - finding a good technique was the
clever part of it.

Cheers,
Olly



[notmuch] Notmuch's search view sucks

2009-12-04 Thread Carl Worth
On Fri, 04 Dec 2009 06:52:38 -0500, Aaron Ecay wrote:
> The same algorithm is implemented in C here:
> http://www.mnogosearch.org/guesser/
> 
> Licensed under the GPL and includes presets for ~50 languages.

That indeed does look very interesting, (at least what I can get from
google's cache of the website, as the server seems to be down just
now). Oh, but I can just "apt-get source mnogosearch" and find
src/mguesser.c and src/guesser.c at least.

> A potential drawback is that it doesn't handle raw HTML very well,
> according to the documentation.

Shouldn't really be an issue. Notmuch will already want to de-tagify
HTML before indexing anyway.
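
As a sketch of what "de-tagifying" could mean in practice, the following uses Python's standard html.parser to keep only the text content and drop script and style bodies. It is an illustration, not what notmuch does; the class and function names are invented for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent paragraphs apart, at the cost of
    # splitting a word that has inline markup in the middle of it.
    return " ".join(" ".join(parser.chunks).split())
```

The resulting plain text could then be handed to a language guesser or an indexer.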

-Carl



[notmuch] Notmuch's search view sucks

2009-12-04 Thread Karl Wiberg
On Fri, Dec 4, 2009 at 1:29 AM, Carl Worth wrote:
> And a step beyond that would support different languages for
> different emails, but that sounds like something "hard" to identify.

But probably not as hard as identifying spam. It could probably be
done with a simple Bayesian filter counting word frequencies---but
it'd be much better if somebody else had already solved the problem,
since this smells suspiciously like something that ought to be a
separate project and put in a library ... does anyone know if such a
project already exists? I know Google can do it ...

It'd be very cool to have notmuch automatically tag messages according
to what language they're in.

-- 
Karl Wiberg, kha at treskal.com
   subrabbit.wordpress.com
   www.treskal.com/kalle


[notmuch] Notmuch's search view sucks

2009-12-04 Thread Aaron Ecay
--- On December 4, 2009, Olly Betts wrote:

[...]

> TextCat is in Perl.  I don't know if there's a C or C++ implementation but
> it isn't a huge piece of code - finding a good technique was the clever part
> of it.

The same algorithm is implemented in C here:
http://www.mnogosearch.org/guesser/

Licensed under the GPL and includes presets for ~50 languages.  A potential
drawback is that it doesn't handle raw HTML very well, according to the
documentation.

Aaron


___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[notmuch] Notmuch's search view sucks

2009-12-03 Thread Carl Worth
On Thu, 03 Dec 2009 14:33:51 +0100, Gregor Hoffleit wrote:
> First, a short introduction: I was a mutt user for ages. When I read
> about Sup, I was intrigued. After a short evaluation period, I switched
> to Sup, which I have now been using for six months.

Hi Gregor, welcome to notmuch!

> But. Compared to Sup, the current notmuch clients suck :-)

Hey, we like our rough edges *really* rough, dontcha know?

> I'm experimenting with a notmuch web client (currently 'evenless'),
> trying to replicate much of the feeling of Sup, in a web client.

Hey, that sounds really interesting! I'll definitely look forward to
what you come up with.

> Also, any l10n (e.g. of time representation) would have to be hardcoded
> as well (btw, does anybody know of a library for human-readable time
> representations which supports l10n and i18n?).

I'd love to see one. The quick scan I did for human-readable time
formatting found stuff in languages like perl, python, and ruby, but I
didn't notice much in C. I also didn't look closely enough to see if any
of these have multi-language support.

> So perhaps it's better to move the polishing into the client (Yeah!
> Python to the rescue! ;-). But then, 'notmuch search' would need to
> return some raw representation of the date field as well.

Good point. There's actually a weird mix of raw and cooked output from
the notmuch command line right now. As you noticed, "notmuch search"
cooks the date too much, (and in a way useful only to English speakers).

Meanwhile, the "notmuch show" output is far too raw to be read without a
client prettying it up. (The message{ header{ body{ body} header}
message} stuff is almost as bad as XML.)

> Any comment? Any other thoughts about this?

I think I'd like to see notmuch output get both more cooked and more raw
at the same time. I'd like things to be more cooked by default,
("notmuch show" shouldn't print the ugly delimiters, should indent
messages, and should start up a pager). And then we just need options
that frontends can pass to get the raw output, (but quoted
safely---which the current "notmuch show" output is *not*).
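
To make the quoting problem concrete, here is a small sketch of one way raw output could be made safe for frontends: encode each message as a single JSON line, so that delimiter-like text inside a body cannot be mistaken for structure. JSON is an assumption chosen for this example, not a format proposed in this thread, and the function name is invented.

```python
import json

def emit_message(headers, body):
    """Encode one message as a single line of JSON (illustrative format)."""
    return json.dumps({"headers": headers, "body": body}, ensure_ascii=False)

# A body that would confuse a parser relying on bare delimiters.
tricky_body = "body} header} message}\nnot really the end"

line = emit_message({"Subject": "test"}, tricky_body)
decoded = json.loads(line)
```

Because the encoder escapes newlines and quotes, a frontend can read one line per message and recover the body exactly.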

-Carl

PS. If you're worried about multi-lingualization issues for notmuch,
you'll want to know that notmuch is (for now) unconditionally
instructing Xapian to use an English-language stemmer when indexing
mail. Obviously we'll want to support a configuration option for
specifying a default stemmer, (Xapian has stemmers for many languages I
believe). And a step beyond that would support different languages for
different emails, but that sounds like something "hard" to identify.



[notmuch] Notmuch's search view sucks

2009-12-03 Thread Gregor Hoffleit
Hi there,

First, a short introduction: I was a mutt user for ages. When I read
about Sup, I was intrigued. After a short evaluation period, I switched
to Sup, which I have now been using for six months.

Sup has many rough edges of its own, and it's not that easy to fix some
of them from the current codebase. notmuch looks like a clean restart of
the same idea, but with a different architecture. I like the concept of
a command line tool with a minimal set of functionality as a common
core, upon which different clients can build.


But. Compared to Sup, the current notmuch clients suck :-)


Today: Sup's search-results-mode. It has a lot of polish that's plainly
missing from notmuch.el (or notmuch.vim):

- Sup's display is much terser than notmuch's; still,
- Sup manages to display the first few words of the first unread message
  in the thread.
- If a thread contains many authors, Sup shows only the first names.
  If that's still too long to fit, it cuts off at some point.
- User's name is rewritten as 'me'.
- The message date format needs only 8 characters (notmuch: 12).
- Message count is only displayed when necessary (>1).
- Threads with unread messages are bold (or highlighted).
- Threads with attachments are marked with an "@".
- Threads with mails to the user are marked with a ">".
- Different colors for tags and message content.

All in all, 'notmuch search' is a raw representation of field values,
while Sup's search-results-mode shows a polished and terse
interpretation of the same values, for human beings, even optimized for
the current display width.
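
As an illustration of the author handling described in the list above, here is a small sketch of how a client might do it. It only follows the description given here, not Sup's actual code; the function name and the default width are made up.

```python
def format_authors(authors, me, width=20):
    """Render a thread's author list tersely, in the style described above."""
    # The user's own name is shown as 'me'.
    names = ["me" if author == me else author for author in authors]
    full = ", ".join(names)
    if len(full) <= width:
        return full
    # Too long: fall back to first names only.
    short = ", ".join((name.split() or [name])[0] for name in names)
    # Still too long: cut off at the column width.
    return short[:width]
```

For example, with `me` set to "Gregor Hoffleit" and a width of 20, the authors Carl Worth, Gregor Hoffleit and Olly Betts come out as "Carl, me, Olly".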

Now notmuch.el and notmuch.vim just display the output of 'notmuch
search', verbatim (perhaps enhanced with coloring based on regexes).


I'm experimenting with a notmuch web client (currently 'evenless'),
trying to replicate much of the feeling of Sup, in a web client.

First, I took the output of 'notmuch search', parsed it and tried to
reformat it like Sup. That worked well for all fields but the date
field: In contrast to the other fields, notmuch's date representation
is intended for direct consumption by humans (English-speaking, that is
;-).


I noticed this entry in TODO:

Add a "--format" option to "notmuch search", (something printf-like
for selecting what gets printed).

Since I'm not eager to write a format parser, I started to implement
--format as an enumerated option of type notmuch_format_t. So far, I have
NOTMUCH_FORMAT_DEFAULT and NOTMUCH_FORMAT_SUP. do_search_threads() does
the real work. In notmuch-time.c, I have implemented an alternative nice
and terse time representation, notmuch_time_relative8_date().
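
To give an idea of what a date that fits in 8 characters could look like, here is a sketch in Python. It is not the C function mentioned above; the name `relative8` and the chosen formats are assumptions for this example. It sticks to digits and single-letter units so that nothing depends on the locale.

```python
from datetime import datetime

def relative8(then, now):
    """Format `then` relative to `now` in at most 8 characters."""
    seconds = int((now - then).total_seconds())
    if 0 <= seconds < 3600:
        return f"{seconds // 60}m"          # minutes ago, e.g. "30m"
    if 0 <= seconds < 86400:
        return f"{seconds // 3600}h"        # hours ago, e.g. "21h"
    if 0 <= seconds < 7 * 86400:
        return f"{seconds // 86400}d"       # days ago, e.g. "3d"
    if then.year == now.year:
        return then.strftime("%m-%d")       # month-day, e.g. "11-01"
    return then.strftime("%Y-%m")           # year-month, e.g. "2008-05"
```

Dates in the future skip the relative branches and are shown as plain month-day or year-month.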

I realized, though, that at this point I would have to hardcode things
like ANSI coloring into NOTMUCH_FORMAT_SUP.

Also, any l10n (e.g. of time representation) would have to be hardcoded
as well (btw, does anybody know of a library for human-readable time
representations which supports l10n and i18n?).


So perhaps it's better to move the polishing into the client (Yeah!
Python to the rescue! ;-). But then, 'notmuch search' would need to
return some raw representation of the date field as well.


Any comment? Any other thoughts about this?



Regards,
Gregor Hoffleit

