Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-08-12 Thread Sebastian Spaeth
On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:
 So what would be a good format?  One possibility would be to
 NULL-delimit the query part; as distasteful as I find that, this part
 of the search output isn't meant for user consumption.  Though I fear
 this is endemic to the dual role the search output currently plays as
 both user and computer readable.

Perhaps we take a cue from xargs and have a command line switch for \n
vs \0 output?

--null
-0
  Input items are terminated by a null character instead of by
  whitespace, and the quotes and backslash are not special (every
  character is taken literally).
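The robustness win of the xargs convention is easy to demonstrate outside Emacs. Here is a minimal sketch in Python (purely illustrative; the record contents are invented and this is not notmuch's actual output format): records terminated by NUL survive fields that contain newlines or parens, which is exactly what trips up the current \n-delimited parse.

```python
import io

def read_nul_records(stream):
    """Yield UTF-8 records terminated by b'\\x00' from a binary stream."""
    buf = b""
    while True:
        chunk = stream.read(4096)
        if not chunk:
            break
        buf += chunk
        # A record may span chunk boundaries; only emit complete ones.
        while b"\x00" in buf:
            record, _, buf = buf.partition(b"\x00")
            yield record.decode("utf-8")
    if buf:  # trailing unterminated record, if any
        yield buf.decode("utf-8")

# Fields containing parens and even embedded newlines parse unambiguously.
raw = b"thread:0001 subject with (parens)\x00thread:0002 multi\nline\x00"
records = list(read_nul_records(io.BytesIO(raw)))
print(records)
```

The same loop written against an Emacs process filter would buffer process output and emit a search result per NUL, with no quoting rules to get wrong.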


___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-08-12 Thread Austin Clements
Quoth Sebastian Spaeth on Aug 12 at 10:07 am:
 On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:
  So what would be a good format?  One possibility would be to
  NULL-delimit the query part; as distasteful as I find that, this part
  of the search output isn't meant for user consumption.  Though I fear
  this is endemic to the dual role the search output currently plays as
  both user and computer readable.
 
 Perhaps we take a cue from xargs and have a command line switch for \n
 vs \0 output?
 
 --null
 -0
   Input items are terminated by a null character instead of by
   whitespace, and the quotes and backslash are not special (every
   character is taken literally).

This was one of the approaches I considered, but given that JSON
parsing (with my optimized json.el) is nearly as fast as regexp-based
parsing (which would also be the fastest way to parse \0-separated
output) and is flexible, robust, and structured (unlike any simple
delimited text format), I concluded we should just go with JSON.  If
we care about speed enough to introduce another format, I'd propose
S-expressions over a new text format, since they double parsing
performance compared to text, have the same structural benefits as
JSON, and JSON and S-expressions could even share the same formatting
code (so there's no chance of representation divergence).
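For what it's worth, the "shared formatting code" idea is cheap to prototype. The sketch below (Python for illustration; the field names are hypothetical, not notmuch's actual schema) walks one result structure and emits both JSON and an S-expression plist from the same tree, so the two representations cannot diverge:

```python
import json

def to_sexp(value):
    """Serialize the same nested structure as an Emacs-style plist sexp."""
    if isinstance(value, dict):
        return "(" + " ".join(f":{k} {to_sexp(v)}" for k, v in value.items()) + ")"
    if isinstance(value, list):
        return "(" + " ".join(to_sexp(v) for v in value) + ")"
    if isinstance(value, str):
        # JSON string escaping happens to be read-compatible for simple
        # strings like these; a real implementation would escape for elisp.
        return json.dumps(value)
    return str(value)

# Hypothetical search result; one structure, two serializations.
result = {"thread": "000123", "matched": 1, "total": 3, "tags": ["inbox", "unread"]}
print(json.dumps(result))
print(to_sexp(result))
```

In notmuch itself this would be one formatter vtable with a JSON backend and a sexp backend, rather than two independent printers.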

BTW, reviving the JSON search parser is on my shortlist.  I should
also bundle up my optimized json.el, since it should seriously boost
the performance of notmuch-show, too.


JSON parsing performance (was Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter')

2011-07-20 Thread Austin Clements
Quoth myself on Jul 13 at  2:57 pm:
 Quoth Pieter Praet on Jul 13 at  4:16 pm:
  Jamie Zawinski once said/wrote [1]:
'Some people, when confronted with a problem, think I know,
I'll use regular expressions. Now they have two problems.'
  
  With this in mind, I set out to get rid of this whole regex mess altogether,
  by populating the search buffer using Notmuch's JSON output instead of doing
  brittle text matching tricks.
  
  Looking for some documentation, I stumbled upon a long-forgotten gem [2].
  
  David's already done pretty much all of the work for us!
 
 Yes, similar thoughts were running through my head as I futzed with
 the formatting for this.  My concern with moving to JSON for search
 buffers is that parsing it is about *30 times slower* than the current
 regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413
 result search buffer).  I think JSON makes a lot of sense for show
 buffers because there's generally less data and it has a lot of
 complicated structure.  Search results, on the other hand, have a very
 simple, regular, and constrained structure, so JSON doesn't buy us
 nearly as much.
 
 JSON is hard to parse because, like the text search output, it's
 designed for human consumption (of course, unlike the text search
 output, it's also designed for computer consumption).  There's
 something to be said for the debuggability and generality of this and
 JSON is very good for exchanging small objects, but it's a remarkably
 inefficient way to exchange large amounts of data between two
 programs.
 
 I guess what I'm getting at, though it pains me to say it, is perhaps
 search needs a fast, computer-readable interchange format.  The
 structure of the data is so simple and constrained that this could be
 altogether trivial.
 
 Or maybe I need a faster computer.

Or maybe I need to un-lame my benchmark.

TL;DR: We should use JSON for search results, but possibly not the
json.el shipped with Emacs.

I realized that my text benchmark didn't capture the cost of
extracting the match strings.  re-search-forward records matches as
buffer positions, which don't get realized into strings until you call
match-string.  Hence, match-string is quite expensive.

Also, Emacs' json.el is slow, so I perked it up.  My modified json.el
is ~3X faster, particularly for string-heavy output like notmuch's.
Though now I'm well into the realm of "eq is faster than =" and M-x
disassemble, so unless I missed something big, this is as fast as it
gets.

While I was still thinking about new IPC formats, I realized that the
text format and the Emacs UI are already tightly coupled, so why not
go all the way and use S-expressions for IPC?  I now think JSON is
fast enough to use, but S-expressions still have a certain appeal.
They share most of the benefits of JSON; structure and extensibility
in particular.  Further, while the content of some ad-hoc format could
easily diverge from both the text and JSON formats, S-expressions
could exactly parallel the JSON content (with a little more
abstraction, they could even share the same format code).  For kicks,
I included an S-expression benchmark.  It beats out the text parser by
a factor of two and the optimized JSON parser by a factor of three.

Here are the results for my 1,413-result search buffer and timeworn
computer:

                 Time    Normalized
--format=text    0.148s  1.00x
--format=json    0.598s  4.04x
custom json.el   0.209s  1.41x
 + string keys   0.195s  1.32x
S-expressions    0.066s  0.45x

I don't have time right now, but next week I might be able to look
through and update dme's JSON-based search code.


The benchmark and modified json.el are attached.

The benchmark is written so you can open it and eval-buffer, then C-x
C-e the various calls in the comments.  You can either
make-text/make-json, or run notmuch manually, pipe the results into
files text and json, and open them in Emacs.

Please excuse the modified json.el code; it's gone through zero
cleanup.
(defmacro time-it (repeat &rest body)
  (declare (indent 1))
  (when (not (numberp repeat))
    (push repeat body)
    (setq repeat 1))
  (let ((start-time (gensym)) (i (gensym)))
    `(let ((,start-time (get-internal-run-time)))
       (dotimes (,i ,repeat)
         ,@body)
       (/ (float-time (time-subtract (get-internal-run-time) ,start-time))
          ,repeat))))
;; Text

(defun make-text ()
  (with-current-buffer (get-buffer-create "text")
    (erase-buffer)
    (call-process "notmuch" nil t nil "search" "--format=text" "--"
                  "tag:x/notmuch")))

(defun time-text ()
  (with-current-buffer "text"
    (time-it 10
      (goto-char (point-min))
      (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t)
        (let* ((thread-id (match-string 1))
               (date (match-string 2))
               (count (match-string 3))
               (authors (match-string 4))
               (subject 

Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-07-19 Thread servilio
What about having notmuch encode the elements composing the line,
printing them with a separator that would itself be encoded if it
appeared inside an element, and then doing the reverse in Emacs? One
such encoding might be URL-encoding.
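A rough sketch of the idea (Python here just for illustration; the field values are invented): percent-encode every element so the separator byte can never occur inside one, then split and decode on the other end:

```python
from urllib.parse import quote, unquote

SEP = " "

def encode_line(fields):
    # safe="" forces even ':' and '/' to be encoded, so only SEP
    # can ever appear between fields.
    return SEP.join(quote(f, safe="") for f in fields)

def decode_line(line):
    return [unquote(f) for f in line.split(SEP)]

fields = ["thread:0001", "id with spaces (and parens)"]
line = encode_line(fields)
roundtrip = decode_line(line)
print(line)
print(roundtrip == fields)
```

On the Emacs side the decode step would be a `split-string` plus an unhex pass, which is still a much simpler contract than the current regexp.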

Servilio


Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-07-16 Thread Pieter Praet
On Wed, 13 Jul 2011 14:57:21 -0400, Austin Clements amdra...@mit.edu wrote:
 Quoth Pieter Praet on Jul 13 at  4:16 pm:
  On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu 
  wrote:
   Quoth Pieter Praet on Jul 11 at 10:43 pm:
TL;DR: I can haz regex pl0x?
   
   Oof, what a pain.  I'm happy to change the output format of search; I
   hadn't realized how difficult it would be to parse.  In fact, I'm not
   sure it's even parsable by regexp, because the message ID's themselves
   could contain parens.
   
   So what would be a good format?  One possibility would be to
   NULL-delimit the query part; as distasteful as I find that, this part
   of the search output isn't meant for user consumption.  Though I fear
   this is endemic to the dual role the search output currently plays as
   both user and computer readable.
   
   I've also got the code to do everything using document ID's instead of
   message ID's.  As a side-effect, it makes the search output clean and
   readily parsable since document ID's are just numbers.  Hence, there
   are no quoting or escaping issues (plus the output is much more
   compact).  I haven't sent this to the list yet because I haven't had a
   chance to benchmark it and determine if the performance benefits make
   exposing document ID's worthwhile.
  
  Jamie Zawinski once said/wrote [1]:
'Some people, when confronted with a problem, think I know,
I'll use regular expressions. Now they have two problems.'
  
  With this in mind, I set out to get rid of this whole regex mess altogether,
  by populating the search buffer using Notmuch's JSON output instead of doing
  brittle text matching tricks.
  
  Looking for some documentation, I stumbled upon a long-forgotten gem [2].
  
  David's already done pretty much all of the work for us!
 
 Yes, similar thoughts were running through my head as I futzed with
 the formatting for this.  My concern with moving to JSON for search
 buffers is that parsing it is about *30 times slower* than the current
 regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413
 result search buffer).  I think JSON makes a lot of sense for show
 buffers because there's generally less data and it has a lot of
 complicated structure.  Search results, on the other hand, have a very
 simple, regular, and constrained structure, so JSON doesn't buy us
 nearly as much.

That seems about right. Using the entire Notmuch mailing list archive,
processing JSON ends up taking 23x longer (see test in attachment).

 JSON is hard to parse because, like the text search output, it's
 designed for human consumption (of course, unlike the text search
 output, it's also designed for computer consumption).  There's
 something to be said for the debuggability and generality of this and
 JSON is very good for exchanging small objects, but it's a remarkably
 inefficient way to exchange large amounts of data between two
 programs.
 
 I guess what I'm getting at, though it pains me to say it, is perhaps
 search needs a fast, computer-readable interchange format.  The
 structure of the data is so simple and constrained that this could be
 altogether trivial.

I guess that's our only option then. Could you implement it for me?
I'll make sure to rebase my patch series in an acceptable time frame.

An extra output format shouldn't be that much of a problem though, if we
further compartmentalize the code. What are your thoughts on (in the
long term) moving to a plugin-based architecture? Eg. enable something
like this:

  ./input/{Maildir, ...}
  ./output/{plain, JSON, ...}
  ./filters/{crypto, ...}
  ./backends/{Xapian, ...}
  ./uis/{Emacs, VIM, web, ...}

 Or maybe I need a faster computer.

That's what M$ Tech Support would want you to believe :)
What we need is slower computers, so devs are forced to count cycles again.
The rise of netbooks has thankfully done wonders in this respect.

 If anyone is curious, here's how I timed the parsing.
 
 (defmacro time-it (code)
   `(let ((start-time (get-internal-run-time)))
      ,code
      (float-time (time-subtract (get-internal-run-time) start-time))))
 
 (with-current-buffer "json"
   (goto-char (point-min))
   (time-it (json-read)))
 
 (with-current-buffer "text"
   (goto-char (point-min))
   (time-it
    (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t))))


Peace

-- 
Pieter


regexp-vs-json.org
Description: Binary data


Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-07-13 Thread Pieter Praet
On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:
 Quoth Pieter Praet on Jul 11 at 10:43 pm:
  TL;DR: I can haz regex pl0x?
 
 Oof, what a pain.  I'm happy to change the output format of search; I
 hadn't realized how difficult it would be to parse.  In fact, I'm not
 sure it's even parsable by regexp, because the message ID's themselves
 could contain parens.
 
 So what would be a good format?  One possibility would be to
 NULL-delimit the query part; as distasteful as I find that, this part
 of the search output isn't meant for user consumption.  Though I fear
 this is endemic to the dual role the search output currently plays as
 both user and computer readable.
 
 I've also got the code to do everything using document ID's instead of
 message ID's.  As a side-effect, it makes the search output clean and
 readily parsable since document ID's are just numbers.  Hence, there
 are no quoting or escaping issues (plus the output is much more
 compact).  I haven't sent this to the list yet because I haven't had a
 chance to benchmark it and determine if the performance benefits make
 exposing document ID's worthwhile.

Jamie Zawinski once said/wrote [1]:
  'Some people, when confronted with a problem, think I know,
  I'll use regular expressions. Now they have two problems.'

With this in mind, I set out to get rid of this whole regex mess altogether,
by populating the search buffer using Notmuch's JSON output instead of doing
brittle text matching tricks.

Looking for some documentation, I stumbled upon a long-forgotten gem [2].

David's already done pretty much all of the work for us!

Unfortunately, it doesn't apply cleanly to master anymore.

David, would you mind rebasing it?


Peace

-- 
Pieter

[1] http://www.jwz.org/hacks/
[2] id:1290777202-14040-1-git-send-email-...@dme.org


Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-07-13 Thread David Edmondson
* pie...@praet.org [2011-07-13 Wed 15:16]
 David, would you mind rebasing it?

I'm sorry, I'm not likely to have time to do this.


Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-07-13 Thread Austin Clements
Quoth Pieter Praet on Jul 13 at  4:16 pm:
 On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:
  Quoth Pieter Praet on Jul 11 at 10:43 pm:
   TL;DR: I can haz regex pl0x?
  
  Oof, what a pain.  I'm happy to change the output format of search; I
  hadn't realized how difficult it would be to parse.  In fact, I'm not
  sure it's even parsable by regexp, because the message ID's themselves
  could contain parens.
  
  So what would be a good format?  One possibility would be to
  NULL-delimit the query part; as distasteful as I find that, this part
  of the search output isn't meant for user consumption.  Though I fear
  this is endemic to the dual role the search output currently plays as
  both user and computer readable.
  
  I've also got the code to do everything using document ID's instead of
  message ID's.  As a side-effect, it makes the search output clean and
  readily parsable since document ID's are just numbers.  Hence, there
  are no quoting or escaping issues (plus the output is much more
  compact).  I haven't sent this to the list yet because I haven't had a
  chance to benchmark it and determine if the performance benefits make
  exposing document ID's worthwhile.
 
 Jamie Zawinski once said/wrote [1]:
   'Some people, when confronted with a problem, think I know,
   I'll use regular expressions. Now they have two problems.'
 
 With this in mind, I set out to get rid of this whole regex mess altogether,
 by populating the search buffer using Notmuch's JSON output instead of doing
 brittle text matching tricks.
 
 Looking for some documentation, I stumbled upon a long-forgotten gem [2].
 
 David's already done pretty much all of the work for us!

Yes, similar thoughts were running through my head as I futzed with
the formatting for this.  My concern with moving to JSON for search
buffers is that parsing it is about *30 times slower* than the current
regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413
result search buffer).  I think JSON makes a lot of sense for show
buffers because there's generally less data and it has a lot of
complicated structure.  Search results, on the other hand, have a very
simple, regular, and constrained structure, so JSON doesn't buy us
nearly as much.

JSON is hard to parse because, like the text search output, it's
designed for human consumption (of course, unlike the text search
output, it's also designed for computer consumption).  There's
something to be said for the debuggability and generality of this and
JSON is very good for exchanging small objects, but it's a remarkably
inefficient way to exchange large amounts of data between two
programs.

I guess what I'm getting at, though it pains me to say it, is perhaps
search needs a fast, computer-readable interchange format.  The
structure of the data is so simple and constrained that this could be
altogether trivial.

Or maybe I need a faster computer.


If anyone is curious, here's how I timed the parsing.

(defmacro time-it (code)
  `(let ((start-time (get-internal-run-time)))
     ,code
     (float-time (time-subtract (get-internal-run-time) start-time))))

(with-current-buffer "json"
  (goto-char (point-min))
  (time-it (json-read)))

(with-current-buffer "text"
  (goto-char (point-min))
  (time-it
   (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t))))


Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'

2011-07-11 Thread Austin Clements
Quoth Pieter Praet on Jul 11 at 10:43 pm:
 TL;DR: I can haz regex pl0x?

Oof, what a pain.  I'm happy to change the output format of search; I
hadn't realized how difficult it would be to parse.  In fact, I'm not
sure it's even parsable by regexp, because the message ID's themselves
could contain parens.

So what would be a good format?  One possibility would be to
NULL-delimit the query part; as distasteful as I find that, this part
of the search output isn't meant for user consumption.  Though I fear
this is endemic to the dual role the search output currently plays as
both user and computer readable.

I've also got the code to do everything using document ID's instead of
message ID's.  As a side-effect, it makes the search output clean and
readily parsable since document ID's are just numbers.  Hence, there
are no quoting or escaping issues (plus the output is much more
compact).  I haven't sent this to the list yet because I haven't had a
chance to benchmark it and determine if the performance benefits make
exposing document ID's worthwhile.
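To illustrate why numeric document IDs sidestep quoting entirely, here is a trivial sketch (Python, with made-up IDs; not notmuch's actual output): parsing is just split-and-int, with no escaping rules at all:

```python
# Hypothetical doc-ID search output: one line per thread, IDs invented.
output = "101 205 333\n417 52\n"

# No quoting, no regexp, no ambiguity: every token is a number.
threads = [[int(tok) for tok in line.split()]
           for line in output.splitlines() if line]
print(threads)
```

The trade-off, as noted above, is that document IDs are an internal Xapian detail, so exposing them needs to be worth it.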