Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:

> So what would be a good format? One possibility would be to NULL-delimit the query part; as distasteful as I find that, this part of the search output isn't meant for user consumption. Though I fear this is endemic to the dual role the search output currently plays as both user and computer readable.

Perhaps we take a cue from xargs and have a command line switch for \n vs. \0 output?

    --null, -0
        Input items are terminated by a null character instead of by
        whitespace, and the quotes and backslash are not special (every
        character is taken literally).

___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch
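For illustration, consuming NUL-delimited output on the Emacs side would be trivial. This is only a sketch under the thread's assumption that such a `--null` switch existed (it is hypothetical, not an actual notmuch flag):

```elisp
;; Sketch: splitting a NUL-delimited query field, assuming a
;; hypothetical "--null" output mode as proposed above.
(defun notmuch-split-null-delimited (str)
  "Split STR on NUL characters, dropping any trailing empty field."
  (split-string str "\0" t))

;; Message IDs containing parens or spaces pose no quoting problem:
(notmuch-split-null-delimited "id:foo@bar\0id:weird(id)@baz\0")
;; => ("id:foo@bar" "id:weird(id)@baz")
```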
Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
Quoth Sebastian Spaeth on Aug 12 at 10:07 am:

> On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:
> > So what would be a good format? One possibility would be to NULL-delimit the query part; as distasteful as I find that, this part of the search output isn't meant for user consumption. Though I fear this is endemic to the dual role the search output currently plays as both user and computer readable.
>
> Perhaps we take a cue from xargs and have a command line switch for \n vs. \0 output?
>
>     --null, -0
>         Input items are terminated by a null character instead of by
>         whitespace, and the quotes and backslash are not special (every
>         character is taken literally).

This was one of the approaches I considered, but given that JSON parsing (with my optimized json.el) is nearly as fast as regexp-based parsing (which would also be the fastest way to parse \0-separated output) and is flexible, robust, and structured (unlike any simple delimited text format), I concluded we should just go with JSON.

If we care about speed enough to introduce another format, I'd propose S-expressions over a new text format, since they double parsing performance compared to text, have the same structural benefits as JSON, and JSON and S-expressions could even share the same formatting code (so there's no chance of representation divergence).

BTW, reviving the JSON search parser is on my shortlist. I should also bundle up my optimized json.el, since it should seriously boost the performance of notmuch-show, too.
JSON parsing performance (was Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter')
Quoth myself on Jul 13 at 2:57 pm:

> Quoth Pieter Praet on Jul 13 at 4:16 pm:
> > Jamie Zawinski once said/wrote [1]: 'Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems.' With this in mind, I set out to get rid of this whole regex mess altogether, by populating the search buffer using Notmuch's JSON output instead of doing brittle text matching tricks. Looking for some documentation, I stumbled upon a long-forgotten gem [2]. David's already done pretty much all of the work for us!
>
> Yes, similar thoughts were running through my head as I futzed with the formatting for this.
>
> My concern with moving to JSON for search buffers is that parsing it is about *30 times slower* than the current regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413 result search buffer). I think JSON makes a lot of sense for show buffers because there's generally less data and it has a lot of complicated structure. Search results, on the other hand, have a very simple, regular, and constrained structure, so JSON doesn't buy us nearly as much.
>
> JSON is hard to parse because, like the text search output, it's designed for human consumption (of course, unlike the text search output, it's also designed for computer consumption). There's something to be said for the debuggability and generality of this and JSON is very good for exchanging small objects, but it's a remarkably inefficient way to exchange large amounts of data between two programs.
>
> I guess what I'm getting at, though it pains me to say it, is perhaps search needs a fast, computer-readable interchange format. The structure of the data is so simple and constrained that this could be altogether trivial.
>
> Or maybe I need a faster computer. Or maybe I need to un-lame my benchmark.

TL;DR: We should use JSON for search results, but possibly not the json.el shipped with Emacs.

I realized that my text benchmark didn't capture the cost of extracting the match strings.
re-search-forward records matches as buffer positions, which don't get realized into strings until you call match-string. Hence, match-string is quite expensive. Also, Emacs' json.el is slow, so I perked it up. My modified json.el is ~3X faster, particularly for string-heavy output like notmuch's. Though now I'm well into the realm of eq is faster than = and M-x disassemble, so unless I missed something big, this is as fast as it gets.

While I was still thinking about new IPC formats, I realized that the text format and the Emacs UI are already tightly coupled, so why not go all the way and use S-expressions for IPC? I now think JSON is fast enough to use, but S-expressions still have a certain appeal. They share most of the benefits of JSON; structure and extensibility in particular. Further, while the content of some ad-hoc format could easily diverge from both the text and JSON formats, S-expressions could exactly parallel the JSON content (with a little more abstraction, they could even share the same format code). For kicks, I included an S-expression benchmark. It beats out the text parser by a factor of two and the optimized JSON parser by a factor of three.

Here are the results for my 1,413 result search buffer and timeworn computer:

                    Time    Normalized
    --format=text   0.148s  1.00x
    --format=json   0.598s  4.04x
    custom json.el  0.209s  1.41x
    + string keys   0.195s  1.32x
    S-expressions   0.066s  0.45x

I don't have time right now, but next week I might be able to look through and update dme's JSON-based search code.

The benchmark and modified json.el are attached. The benchmark is written so you can open it and eval-buffer, then C-x C-e the various calls in the comments. You can either make-text/make-json, or run notmuch manually, pipe the results into files text and json, and open them in Emacs. Please excuse the modified json.el code; it's gone through zero cleanup.
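The speed of S-expression parsing comes from the fact that Emacs can consume a whole result list with the built-in `read`, with no Lisp-level scanning at all. A minimal sketch (the plist-shaped result below is hypothetical, not notmuch's actual S-expression output):

```elisp
;; Sketch: parsing S-expression search output with the built-in `read'.
;; The plist layout is invented for illustration; the point is that a
;; single `read' call replaces all Lisp-level tokenizing.
(with-temp-buffer
  (insert "((:thread \"000c\" :count \"[1/1]\" :subject \"hi\"))")
  (goto-char (point-min))
  (read (current-buffer)))
;; => ((:thread "000c" :count "[1/1]" :subject "hi"))
```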
(defmacro time-it (repeat &rest body)
  (declare (indent 1))
  (when (not (numberp repeat))
    (push repeat body)
    (setq repeat 1))
  (let ((start-time (gensym)) (i (gensym)))
    `(let ((,start-time (get-internal-run-time)))
       (dotimes (,i ,repeat) ,@body)
       (/ (float-time (time-subtract (get-internal-run-time) ,start-time))
          ,repeat))))

;; Text

(defun make-text ()
  (with-current-buffer (get-buffer-create "text")
    (erase-buffer)
    (call-process "notmuch" nil t nil "search" "--format=text" "--" "tag:x/notmuch")))

(defun time-text ()
  (with-current-buffer "text"
    (time-it 10
      (goto-char (point-min))
      (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t)
        (let* ((thread-id (match-string 1))
               (date (match-string 2))
               (count (match-string 3))
               (authors (match-string 4))
               (subject (match-string 5))
               (tags (match-string 6))))))))
Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
What about having notmuch encode the elements composing the line, printing the elements with a separator that would itself be encoded if it appeared in an element, and then doing the reverse in Emacs? One such encoding might be URL-encoding.

Servilio
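A round trip of this idea is easy to sketch with Emacs' url-util functions. The `";"` separator is illustrative only, and the field values are made up:

```elisp
;; Sketch of the URL-encoding scheme suggested above: percent-encode
;; each field, join on a separator, then split and decode on the other
;; side.  Separator and sample fields are hypothetical.
(require 'url-util)

(let* ((fields '("thread:000c" "Bad; subject (with parens)"))
       (line (mapconcat #'url-hexify-string fields ";")))
  ;; Any ";" inside a field was percent-encoded, so splitting is safe.
  (mapcar #'url-unhex-string (split-string line ";")))
;; => ("thread:000c" "Bad; subject (with parens)")
```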
Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
On Wed, 13 Jul 2011 14:57:21 -0400, Austin Clements amdra...@mit.edu wrote:

> Quoth Pieter Praet on Jul 13 at 4:16 pm:
> > On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:
> > > Quoth Pieter Praet on Jul 11 at 10:43 pm:
> > > > TL;DR: I can haz regex pl0x?
> > >
> > > Oof, what a pain. I'm happy to change the output format of search; I hadn't realized how difficult it would be to parse. In fact, I'm not sure it's even parsable by regexp, because the message ID's themselves could contain parens.
> > >
> > > So what would be a good format? One possibility would be to NULL-delimit the query part; as distasteful as I find that, this part of the search output isn't meant for user consumption. Though I fear this is endemic to the dual role the search output currently plays as both user and computer readable.
> > >
> > > I've also got the code to do everything using document ID's instead of message ID's. As a side-effect, it makes the search output clean and readily parsable since document ID's are just numbers. Hence, there are no quoting or escaping issues (plus the output is much more compact). I haven't sent this to the list yet because I haven't had a chance to benchmark it and determine if the performance benefits make exposing document ID's worthwhile.
> >
> > Jamie Zawinski once said/wrote [1]: 'Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems.' With this in mind, I set out to get rid of this whole regex mess altogether, by populating the search buffer using Notmuch's JSON output instead of doing brittle text matching tricks. Looking for some documentation, I stumbled upon a long-forgotten gem [2]. David's already done pretty much all of the work for us!
>
> Yes, similar thoughts were running through my head as I futzed with the formatting for this.
> My concern with moving to JSON for search buffers is that parsing it is about *30 times slower* than the current regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413 result search buffer). I think JSON makes a lot of sense for show buffers because there's generally less data and it has a lot of complicated structure. Search results, on the other hand, have a very simple, regular, and constrained structure, so JSON doesn't buy us nearly as much.

That seems about right. Using the entire Notmuch mailing list archive, processing JSON ends up taking 23x longer (see test in att).

> JSON is hard to parse because, like the text search output, it's designed for human consumption (of course, unlike the text search output, it's also designed for computer consumption). There's something to be said for the debuggability and generality of this and JSON is very good for exchanging small objects, but it's a remarkably inefficient way to exchange large amounts of data between two programs. I guess what I'm getting at, though it pains me to say it, is perhaps search needs a fast, computer-readable interchange format. The structure of the data is so simple and constrained that this could be altogether trivial.

I guess that's our only option then. Could you implement it for me? I'll make sure to rebase my patch series in an acceptable time frame.

An extra output format shouldn't be that much of a problem though, if we further compartmentalize the code. What are your thoughts on (in the long term) moving to a plugin-based architecture? E.g. enable something like this:

    ./input/{Maildir, ...}
    ./output/{plain, JSON, ...}
    ./filters/{crypto, ...}
    ./backends/{Xapian, ...}
    ./uis/{Emacs, VIM, web, ...}

> Or maybe I need a faster computer.

That's what M$ Tech Support would want you to believe :) What we need is slower computers, so devs are forced to count cycles again. The rise of netbooks has thankfully done wonders in this respect.
> If anyone is curious, here's how I timed the parsing.
>
>     (defmacro time-it (code)
>       `(let ((start-time (get-internal-run-time)))
>          ,code
>          (float-time (time-subtract (get-internal-run-time) start-time))))
>
>     (with-current-buffer "json"
>       (goto-char (point-min))
>       (time-it (json-read)))
>
>     (with-current-buffer "text"
>       (goto-char (point-min))
>       (time-it
>        (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t))))

Peace

-- 
Pieter

Attachment: regexp-vs-json.org
Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:

> Quoth Pieter Praet on Jul 11 at 10:43 pm:
> > TL;DR: I can haz regex pl0x?
>
> Oof, what a pain. I'm happy to change the output format of search; I hadn't realized how difficult it would be to parse. In fact, I'm not sure it's even parsable by regexp, because the message ID's themselves could contain parens.
>
> So what would be a good format? One possibility would be to NULL-delimit the query part; as distasteful as I find that, this part of the search output isn't meant for user consumption. Though I fear this is endemic to the dual role the search output currently plays as both user and computer readable.
>
> I've also got the code to do everything using document ID's instead of message ID's. As a side-effect, it makes the search output clean and readily parsable since document ID's are just numbers. Hence, there are no quoting or escaping issues (plus the output is much more compact). I haven't sent this to the list yet because I haven't had a chance to benchmark it and determine if the performance benefits make exposing document ID's worthwhile.

Jamie Zawinski once said/wrote [1]: 'Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems.'

With this in mind, I set out to get rid of this whole regex mess altogether, by populating the search buffer using Notmuch's JSON output instead of doing brittle text matching tricks. Looking for some documentation, I stumbled upon a long-forgotten gem [2]. David's already done pretty much all of the work for us! Unfortunately, it doesn't apply cleanly to master anymore. David, would you mind rebasing it?

Peace

-- 
Pieter

[1] http://www.jwz.org/hacks/
[2] id:1290777202-14040-1-git-send-email-...@dme.org
Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
* pie...@praet.org [2011-07-13 Wed 15:16]
> David, would you mind rebasing it?

I'm sorry, I'm not likely to have time to do this.
Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
Quoth Pieter Praet on Jul 13 at 4:16 pm:

> On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements amdra...@mit.edu wrote:
> > Quoth Pieter Praet on Jul 11 at 10:43 pm:
> > > TL;DR: I can haz regex pl0x?
> >
> > Oof, what a pain. I'm happy to change the output format of search; I hadn't realized how difficult it would be to parse. In fact, I'm not sure it's even parsable by regexp, because the message ID's themselves could contain parens.
> >
> > So what would be a good format? One possibility would be to NULL-delimit the query part; as distasteful as I find that, this part of the search output isn't meant for user consumption. Though I fear this is endemic to the dual role the search output currently plays as both user and computer readable.
> >
> > I've also got the code to do everything using document ID's instead of message ID's. As a side-effect, it makes the search output clean and readily parsable since document ID's are just numbers. Hence, there are no quoting or escaping issues (plus the output is much more compact). I haven't sent this to the list yet because I haven't had a chance to benchmark it and determine if the performance benefits make exposing document ID's worthwhile.
>
> Jamie Zawinski once said/wrote [1]: 'Some people, when confronted with a problem, think I know, I'll use regular expressions. Now they have two problems.' With this in mind, I set out to get rid of this whole regex mess altogether, by populating the search buffer using Notmuch's JSON output instead of doing brittle text matching tricks. Looking for some documentation, I stumbled upon a long-forgotten gem [2]. David's already done pretty much all of the work for us!

Yes, similar thoughts were running through my head as I futzed with the formatting for this.

My concern with moving to JSON for search buffers is that parsing it is about *30 times slower* than the current regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413 result search buffer).
I think JSON makes a lot of sense for show buffers because there's generally less data and it has a lot of complicated structure. Search results, on the other hand, have a very simple, regular, and constrained structure, so JSON doesn't buy us nearly as much.

JSON is hard to parse because, like the text search output, it's designed for human consumption (of course, unlike the text search output, it's also designed for computer consumption). There's something to be said for the debuggability and generality of this and JSON is very good for exchanging small objects, but it's a remarkably inefficient way to exchange large amounts of data between two programs.

I guess what I'm getting at, though it pains me to say it, is perhaps search needs a fast, computer-readable interchange format. The structure of the data is so simple and constrained that this could be altogether trivial.

Or maybe I need a faster computer.

If anyone is curious, here's how I timed the parsing.

    (defmacro time-it (code)
      `(let ((start-time (get-internal-run-time)))
         ,code
         (float-time (time-subtract (get-internal-run-time) start-time))))

    (with-current-buffer "json"
      (goto-char (point-min))
      (time-it (json-read)))

    (with-current-buffer "text"
      (goto-char (point-min))
      (time-it
       (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t))))
Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
Quoth Pieter Praet on Jul 11 at 10:43 pm:

> TL;DR: I can haz regex pl0x?

Oof, what a pain. I'm happy to change the output format of search; I hadn't realized how difficult it would be to parse. In fact, I'm not sure it's even parsable by regexp, because the message ID's themselves could contain parens.

So what would be a good format? One possibility would be to NULL-delimit the query part; as distasteful as I find that, this part of the search output isn't meant for user consumption. Though I fear this is endemic to the dual role the search output currently plays as both user and computer readable.

I've also got the code to do everything using document ID's instead of message ID's. As a side-effect, it makes the search output clean and readily parsable since document ID's are just numbers. Hence, there are no quoting or escaping issues (plus the output is much more compact). I haven't sent this to the list yet because I haven't had a chance to benchmark it and determine if the performance benefits make exposing document ID's worthwhile.