date:20170318

Re: [PATCH] rename libutil.a to libnotmuch_util.a

2017-03-18 Thread David Bremner

David Bremner  writes:

> Apparently some systems (MacOS?) have a system library called libutil
> and the name conflict causes problems. Since this library is quite
> notmuch specific, rename it to something less generic.

pushed.

d
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch

Re: memory leak cleanup for notmuch show

2017-03-18 Thread David Bremner

David Bremner  writes:

> Inspired by Jeff's patch, I updated the memory test suite to test
> notmuch-show, this series is the result. I also include Jeff's
> original patch in this series.
>

Series pushed,

d

Re: [PATCH] lib/message.cc: fix Coverity finding (use after free)

2017-03-18 Thread David Bremner

Tomi Ollila  writes:

> - const char *data;
>  
> - data = message->doc.get_data ().c_str ();
> + std::string datastr = message->doc.get_data ();
> + const char *data = datastr.c_str ();
>  

Pushed,

d

Re: [RFC patch 2/2] lib: index message files with duplicate message-ids

2017-03-18 Thread David Bremner

Daniel Kahn Gillmor  writes:

> On Thu 2017-03-16 20:34:22 -0400, David Bremner wrote:
>> Daniel Kahn Gillmor  writes:
>>>  0) what happens when one of the files gets deleted from the message
>>> store? do the terms it contributes get removed from the index?
>>
>> That's a good question, and an issue I hadn't thought about.
>> Currently there's no way to do this short of deleting all the terms for
>> all the files (excepting tags and properties, presumably) and
>> reindexing. This will require some more thought, I think.
>
> i didn't mean to raise the concern to drag this work down, i just want
> to make sure the problem is on the table.  dropping all terms on
> deletion and re-indexing remaining files with the same message ID isn't
> terribly efficient, but i don't think it's going to be terribly costly
> either.  we're not talking about hundreds of files per message-id in
> most normal cases; usually only two (sent-to-self,
> recvd-from-mailing-list), and maybe a half-dozen at most (messages sent
> to multiple mailboxes that all forward to me).

I can think of 3 general approaches at the moment. They each have (at
least) one gotcha; more precisely they each require some added
complexity somewhere else in the codebase.

One is this one, just add all the terms to one xapian document. The
gotcha is needing some reindexing facility (we want this for other
reasons, so that might not be so bad).

The second approach that occurs to me is to still add the terms to one
xapian document, but to prefix them with a number identifying the file
copy (1, 2, etc.). The complexity here is in the generation of queries:
each term needs to be expanded into an OR, e.g. SUBJECT:foo or
1#SUBJECT:foo or 2#SUBJECT:foo. I'm not really sure offhand how to do
that without field processors. I'm also not sure about the performance
impact.

The third approach is to create extra xapian documents per file, which have
a different document type (from the notmuch point of view). Here the
complexity will be dealing with the returned documents from a xapian
query. We can probably use a wildcard search on the type (mail, mail1,
mail2, etc...) to make the queries reasonably easy. My gut feeling is
that this is the "right" approach, although it will be a bit more
complicated to get started.  It will also require changing our idea of
threads in the "structured output" where a thread looks something like

(thread
   (message
      (instance/file)
      (instance/file))
   (message
      (instance/file)))
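
For the second approach above, the query expansion could be sketched as
follows. This is only an illustration of the "N#PREFIX:term" form mentioned
there; expand_term is a hypothetical helper, not a notmuch or Xapian
function, and real query generation would presumably live in the query
parser rather than in shell.

```shell
#!/bin/bash
# expand_term TERM COPIES (hypothetical helper): OR together the unnumbered
# term and one numbered variant per file copy, using the "N#PREFIX:term"
# form from the message above.
expand_term () {
    local term=$1 copies=$2 query=$1 i
    for ((i = 1; i <= copies; i++)); do
        query+=" or ${i}#${term}"
    done
    printf '%s\n' "$query"
}

# Expand a subject term for a message that has two file copies.
expand_term SUBJECT:foo 2
```

The number of variants grows with the number of file copies per message,
which is where the performance question raised above comes in.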


Re: memory leak cleanup for notmuch show

2017-03-18 Thread Tomi Ollila

On Sat, Mar 18 2017, David Bremner  wrote:

> Inspired by Jeff's patch, I updated the memory test suite to test
> notmuch-show, this series is the result. I also include Jeff's
> original patch in this series.

series LGTM. tests pass (fedora 25, gmime 2.6.20, xapian 1.2.24)

Tomi

Re: [PATCH 1/6] perf-test: use 'eval' in memory_run

2017-03-18 Thread Tomi Ollila

On Sat, Mar 18 2017, David Bremner  wrote:

> This allows the use of redirection in the tests
> ---
>  performance-test/perf-test-lib.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/performance-test/perf-test-lib.sh b/performance-test/perf-test-lib.sh
> index 00d2f1c6..c89d5aab 100644
> --- a/performance-test/perf-test-lib.sh
> +++ b/performance-test/perf-test-lib.sh
> @@ -149,7 +149,7 @@ memory_run ()
>  
>  printf "[ %d ]\t%s\n" $test_count "$1"
>  
> -NOTMUCH_TALLOC_REPORT="$talloc_log" valgrind --leak-check=full --log-file="$log_file" $2
> +NOTMUCH_TALLOC_REPORT="$talloc_log" eval "valgrind --leak-check=full --log-file='$log_file' $2"

For the record, this would have worked w/o the double quotes (which I
thought it would not), but quoting is somewhat safer in case someone
copies this for some other use. If there were literal '>'s in the line,
then with these quotes the redirections are done inside 'eval'; without
quotes, the '>' redirection would be done before 'eval' runs. But
when '>' is given in variable $2, the redirection is done in eval
(after the shell has expanded the line for it) in any case.
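
The point about where the redirection happens can be checked with a small
script. This is a minimal sketch, not the code in perf-test-lib.sh:
run_plain and run_eval are hypothetical stand-ins for memory_run without
and with eval, and the temporary directory only keeps the demo
self-contained (it assumes the temporary path contains no whitespace).

```shell
#!/bin/bash
# Hypothetical stand-ins for memory_run; each receives a command string in $1.
run_plain () { $1; }         # no eval: a '>' inside $1 is an ordinary argument
run_eval ()  { eval "$1"; }  # eval: the expanded string is parsed again, so '>' redirects

tmp=$(mktemp -d)

# Without eval, echo receives '>' and the file name as arguments; no file is written.
plain_out=$(run_plain "echo hello > $tmp/plain.txt")

# With eval, the redirection is performed and nothing reaches stdout.
eval_out=$(run_eval "echo hello > $tmp/eval.txt")

printf 'plain stdout: %s\n' "$plain_out"
printf 'eval stdout:  %s (file contains: %s)\n' "$eval_out" "$(cat "$tmp/eval.txt")"
```

This is why the tests can pass "1>/dev/null" inside the command string
once memory_run uses eval.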

All that said, this particular change is OK...

Tomi

>  
>  awk '/LEAK SUMMARY/,/suppressed/ { sub(/^==[0-9]*==/," "); print }' "$log_file"
>  echo


[PATCH] perf-test/mem: add simple memory tests for notmuch search

2017-03-18 Thread David Bremner

Just copy and replace from the show tests. Currently these show no
major leaks.
---
 performance-test/M03-search.sh | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100755 performance-test/M03-search.sh

diff --git a/performance-test/M03-search.sh b/performance-test/M03-search.sh
new file mode 100755
index ..8d026eee
--- /dev/null
+++ b/performance-test/M03-search.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+test_description='search'
+
+. ./perf-test-lib.sh || exit 1
+
+memory_start
+
+memory_run 'search *' "notmuch search '*' 1>/dev/null"
+memory_run 'search --format=json *' "notmuch search --format=json '*' 1>/dev/null"
+memory_run 'search --format=sexp *' "notmuch search --format=sexp '*' 1>/dev/null"
+
+memory_done
-- 
2.11.0


RE: [PATCH] test: add known broken test for indexing html

2017-03-18 Thread David Bremner

Jeffrey Stedfast  writes:

> Hey David,
>
> I actually have an HTML tokenizer for MimeKit for (among other things) this 
> type of purpose. Perhaps I need to port that to C and include that with GMime 
> 
>
> https://github.com/jstedfast/MimeKit/tree/master/MimeKit/Text
>
> Jeff

That's probably a good idea in your abundant spare time ;).  More
generally though we've thought about letting users provide filters to
convert attachments (e.g. .odt / .docx / .pdf) to text. I'm not sure
about the performance hit, but I guess that would work for html as well.
I guess in principle it should be possible to write a GMime filter that
manages the child process.

d

[PATCH 3/6] fix memory leaks in notmuch-show.c:format_headers_sprinter()

2017-03-18 Thread David Bremner

From: Jeffrey Stedfast 

internet_address_list_to_string() and
g_mime_message_get_date_as_string() return allocated string buffers
and not const, so from what I can tell from taking a look at the
sprinter-sexp.c’s sexp_string() function, the code leaks the
recipients_string as well as the date string.
---
 notmuch-show.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/notmuch-show.c b/notmuch-show.c
index aff93803..095595e2 100644
--- a/notmuch-show.c
+++ b/notmuch-show.c
@@ -202,8 +202,9 @@ format_headers_sprinter (sprinter_t *sp, GMimeMessage *message,
  * reflected in the file devel/schemata. */
 
 InternetAddressList *recipients;
-const char *recipients_string;
+char *recipients_string;
 const char *reply_to_string;
+char *date_string;
 
 sp->begin_map (sp);
 
@@ -218,6 +219,7 @@ format_headers_sprinter (sprinter_t *sp, GMimeMessage *message,
 if (recipients_string) {
sp->map_key (sp, "To");
sp->string (sp, recipients_string);
+   g_free (recipients_string);
 }
 
 recipients = g_mime_message_get_recipients (message, GMIME_RECIPIENT_TYPE_CC);
@@ -225,6 +227,7 @@ format_headers_sprinter (sprinter_t *sp, GMimeMessage *message,
 if (recipients_string) {
sp->map_key (sp, "Cc");
sp->string (sp, recipients_string);
+   g_free (recipients_string);
 }
 
 recipients = g_mime_message_get_recipients (message, GMIME_RECIPIENT_TYPE_BCC);
@@ -232,6 +235,7 @@ format_headers_sprinter (sprinter_t *sp, GMimeMessage *message,
 if (recipients_string) {
sp->map_key (sp, "Bcc");
sp->string (sp, recipients_string);
+   g_free (recipients_string);
 }
 
 reply_to_string = g_mime_message_get_reply_to (message);
@@ -248,7 +252,9 @@ format_headers_sprinter (sprinter_t *sp, GMimeMessage *message,
sp->string (sp, g_mime_object_get_header (GMIME_OBJECT (message), "References"));
 } else {
sp->map_key (sp, "Date");
-   sp->string (sp, g_mime_message_get_date_as_string (message));
+   date_string = g_mime_message_get_date_as_string (message);
+   sp->string (sp, date_string);
+   g_free (date_string);
 }
 
 sp->end (sp);
-- 
2.11.0


[PATCH 2/6] perf-test: add simple memory tests for notmuch-show

2017-03-18 Thread David Bremner

These are probably too slow to run with the full corpus
---
 performance-test/M02-show.sh | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100755 performance-test/M02-show.sh

diff --git a/performance-test/M02-show.sh b/performance-test/M02-show.sh
new file mode 100755
index ..d73035ea
--- /dev/null
+++ b/performance-test/M02-show.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+test_description='show'
+
+. ./perf-test-lib.sh || exit 1
+
+memory_start
+
+memory_run 'show *' "notmuch show '*' 1>/dev/null"
+memory_run 'show --format=json *' "notmuch show --format=json '*' 1>/dev/null"
+memory_run 'show --format=sexp *' "notmuch show --format=sexp '*' 1>/dev/null"
+
+memory_done
-- 
2.11.0


[PATCH 4/6] cli/show: fix some memory leaks in format_part_text

2017-03-18 Thread David Bremner

Mimic Jeff Stedfast's changes to format_headers_sprinter, clean up use
of internet_address_list_to_string and
g_mime_message_get_date_as_string.
---
 notmuch-show.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/notmuch-show.c b/notmuch-show.c
index 095595e2..b0afc29e 100644
--- a/notmuch-show.c
+++ b/notmuch-show.c
@@ -460,7 +460,8 @@ format_part_text (const void *ctx, sprinter_t *sp, mime_node_t *node,
 if (GMIME_IS_MESSAGE (node->part)) {
GMimeMessage *message = GMIME_MESSAGE (node->part);
InternetAddressList *recipients;
-   const char *recipients_string;
+   char *recipients_string;
+   char *date_string;
 
printf ("\fheader{\n");
if (node->envelope_file)
@@ -471,11 +472,15 @@ format_part_text (const void *ctx, sprinter_t *sp, mime_node_t *node,
recipients_string = internet_address_list_to_string (recipients, 0);
if (recipients_string)
printf ("To: %s\n", recipients_string);
+   g_free (recipients_string);
recipients = g_mime_message_get_recipients (message, GMIME_RECIPIENT_TYPE_CC);
recipients_string = internet_address_list_to_string (recipients, 0);
if (recipients_string)
printf ("Cc: %s\n", recipients_string);
-   printf ("Date: %s\n", g_mime_message_get_date_as_string (message));
+   g_free (recipients_string);
+   date_string = g_mime_message_get_date_as_string (message);
+   printf ("Date: %s\n", date_string);
+   g_free (date_string);
printf ("\fheader}\n");
 
printf ("\fbody{\n");
-- 
2.11.0


[PATCH 6/6] cli/show: unref crlf filter.

2017-03-18 Thread David Bremner

Mimic the handling of the other filter g_objects. This cleans up a
fair sized memory leak.
---
 notmuch-show.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/notmuch-show.c b/notmuch-show.c
index 43ee9021..7451d5ab 100644
--- a/notmuch-show.c
+++ b/notmuch-show.c
@@ -276,6 +276,7 @@ show_text_part_content (GMimeObject *part, GMimeStream *stream_out,
 {
 GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
 GMimeStream *stream_filter = NULL;
+GMimeFilter *crlf_filter = NULL;
 GMimeDataWrapper *wrapper;
 const char *charset;
 
@@ -287,8 +288,10 @@ show_text_part_content (GMimeObject *part, GMimeStream *stream_out,
return;
 
 stream_filter = g_mime_stream_filter_new (stream_out);
+crlf_filter = g_mime_filter_crlf_new (FALSE, FALSE);
 g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
-g_mime_filter_crlf_new (FALSE, FALSE));
+crlf_filter);
+g_object_unref (crlf_filter);
 
 charset = g_mime_object_get_content_type_parameter (part, "charset");
 if (charset) {
-- 
2.11.0


memory leak cleanup for notmuch show

2017-03-18 Thread David Bremner

Inspired by Jeff's patch, I updated the memory test suite to test
notmuch-show, this series is the result. I also include Jeff's
original patch in this series.


[PATCH 1/6] perf-test: use 'eval' in memory_run

2017-03-18 Thread David Bremner

This allows the use of redirection in the tests
---
 performance-test/perf-test-lib.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/performance-test/perf-test-lib.sh b/performance-test/perf-test-lib.sh
index 00d2f1c6..c89d5aab 100644
--- a/performance-test/perf-test-lib.sh
+++ b/performance-test/perf-test-lib.sh
@@ -149,7 +149,7 @@ memory_run ()
 
 printf "[ %d ]\t%s\n" $test_count "$1"
 
-NOTMUCH_TALLOC_REPORT="$talloc_log" valgrind --leak-check=full --log-file="$log_file" $2
+NOTMUCH_TALLOC_REPORT="$talloc_log" eval "valgrind --leak-check=full --log-file='$log_file' $2"
 
 awk '/LEAK SUMMARY/,/suppressed/ { sub(/^==[0-9]*==/," "); print }' "$log_file"
 echo
-- 
2.11.0


[PATCH 5/6] cli/show: fix usage of g_mime_content_type_to_string

2017-03-18 Thread David Bremner

It returns an "allocated string", which needs to be freed.
---
 notmuch-show.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/notmuch-show.c b/notmuch-show.c
index b0afc29e..43ee9021 100644
--- a/notmuch-show.c
+++ b/notmuch-show.c
@@ -438,6 +438,7 @@ format_part_text (const void *ctx, sprinter_t *sp, mime_node_t *node,
notmuch_message_get_flag (message, NOTMUCH_MESSAGE_FLAG_EXCLUDED) ? 1 : 0,
notmuch_message_get_filename (message));
 } else {
+   char *content_string;
const char *disposition = _get_disposition (meta);
const char *cid = g_mime_object_get_content_id (meta);
const char *filename = leaf ?
@@ -454,7 +455,10 @@ format_part_text (const void *ctx, sprinter_t *sp, mime_node_t *node,
printf (", Filename: %s", filename);
if (cid)
printf (", Content-id: %s", cid);
-   printf (", Content-type: %s\n", g_mime_content_type_to_string (content_type));
+
+   content_string = g_mime_content_type_to_string (content_type);
+   printf (", Content-type: %s\n", content_string);
+   g_free (content_string);
 }
 
 if (GMIME_IS_MESSAGE (node->part)) {
@@ -495,8 +499,9 @@ format_part_text (const void *ctx, sprinter_t *sp, mime_node_t *node,
show_text_part_content (node->part, stream_stdout, 0);
g_object_unref(stream_stdout);
} else {
-   printf ("Non-text part: %s\n",
-   g_mime_content_type_to_string (content_type));
+   char *content_string = g_mime_content_type_to_string (content_type);
+   printf ("Non-text part: %s\n", content_string);
+   g_free (content_string);
}
 }
 
@@ -564,6 +569,7 @@ format_part_sprinter (const void *ctx, sprinter_t *sp, mime_node_t *node,
 GMimeObject *meta = node->envelope_part ?
GMIME_OBJECT (node->envelope_part) : node->part;
 GMimeContentType *content_type = g_mime_object_get_content_type (meta);
+char *content_string;
 const char *disposition = _get_disposition (meta);
 const char *cid = g_mime_object_get_content_id (meta);
 const char *filename = GMIME_IS_PART (node->part) ?
@@ -592,7 +598,9 @@ format_part_sprinter (const void *ctx, sprinter_t *sp, mime_node_t *node,
 }
 
 sp->map_key (sp, "content-type");
-sp->string (sp, g_mime_content_type_to_string (content_type));
+content_string = g_mime_content_type_to_string (content_type);
+sp->string (sp, content_string);
+g_free (content_string);
 
 if (disposition) {
sp->map_key (sp, "content-disposition");
-- 
2.11.0


RE: [PATCH] test: add known broken test for indexing html

2017-03-18 Thread David Bremner

Jeffrey Stedfast  writes:

> Base64 encoded inline image data is always within the src attribute
> value of an <img> tag and will always begin with "data:" followed by
> the mime-type and then followed by ";base64," so it's pretty easy to
> spot.
>
> While on this topic, why index HTML attribute values at all? Other
> than perhaps some known ones like perhaps the 'alt' value of <img>
> tags?
>
> I would argue that the only portion of any HTML that you should be
> indexing at all for searching is the character data between tags.
>

I should mention that we also have a fair amount of base64 gunk from
inline PGP signatures. I'm not sure if it's just ugly to look at when
dumping the database terms, or if it actually makes a measurable
difference in time/space usage.

d

RE: [PATCH] test: add known broken test for indexing html

2017-03-18 Thread David Bremner

Jeffrey Stedfast  writes:

> Hi David,
>
> Base64 encoded inline image data is always within the src attribute value of 
> an <img> tag and will always begin with "data:" followed by the mime-type and 
> then followed by ";base64," so it's pretty easy to spot.
>
> While on this topic, why index HTML attribute values at all? Other than 
> perhaps some known ones like perhaps the 'alt' value of <img> tags?
>
> I would argue that the only portion of any HTML that you should be indexing 
> at all for searching is the character data between tags.

We're not currently parsing the HTML, so none of these distinctions are
really available to us. Maybe adding an HTML parser is the right
solution, but it's a bit non-trivial.

d

RE: [PATCH] test: add known broken test for indexing html

2017-03-18 Thread Jeffrey Stedfast

Hi David,

Base64 encoded inline image data is always within the src attribute value of an 
<img> tag and will always begin with "data:" followed by the mime-type and then 
followed by ";base64," so it's pretty easy to spot.
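
As a rough sketch of how such data could be spotted, the filter below drops
anything of the form data:<mime-type>;base64,<payload> from its input.
strip_data_uris is a hypothetical name and the regular expression is an
assumption about what counts as a mime-type and a payload; it is not how
notmuch filters text, and it does not handle payloads broken across lines.

```shell
#!/bin/bash
# strip_data_uris (hypothetical helper): read text on stdin and remove
# base64 "data:" URIs, i.e. "data:", a mime-type, ";base64," and the payload.
strip_data_uris () {
    sed -E 's#data:[A-Za-z0-9.+/-]+;base64,[A-Za-z0-9+/=]+##g'
}

# The src attribute is emptied; the alt text and the rest of the tag survive.
printf '%s\n' '<img alt="smile" src="data:image/png;base64,iVBORw0KGgo=">' | strip_data_uris
```

A line-oriented filter like this only covers the inline-image case; other
base64 wrappers would each need their own pattern.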

While on this topic, why index HTML attribute values at all? Other than perhaps 
some known ones like perhaps the 'alt' value of <img> tags?

I would argue that the only portion of any HTML that you should be indexing at 
all for searching is the character data between tags.

Hope my $0.02 helps,

Jeff

> -Original Message-
> From: notmuch [mailto:notmuch-boun...@notmuchmail.org] On Behalf Of
> David Bremner
> Sent: Saturday, March 18, 2017 9:25 AM
> To: notmuch@notmuchmail.org
> Subject: [PATCH] test: add known broken test for indexing html
> 
> 'quite' on IRC reported that notmuch new was grinding to a halt during initial
> indexing, and we eventually narrowed the problem down to some html parts
> with large embedded images. These cause the number of terms added to
> the Xapian database to explode (the first 400 messages generated 4.6M
> unique terms), and of course the resulting terms are not much use for
> searching.
> ---
> 
> I'm not sure the best approach to fix this. Workarounds include limiting the
> size of the part indexed, and skipping html parts. The latter is easy, but
> probably too drastic.  A nice solution might be a filter similar to the 
> existing
> one that strips out uuencoded text but for base64. Alas base64 crud seems
> to come with all kinds of syntactic wrappers, so it's probably harder to 
> filter.
> 
> 
>  test/T680-html-indexing.sh   | 12 +++
>  test/corpora/README  |  3 ++
>  test/corpora/html/embedded-image | 69
> 
>  3 files changed, 84 insertions(+)
>  create mode 100755 test/T680-html-indexing.sh
>  create mode 100644 test/corpora/html/embedded-image
> 
> diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
> new file mode 100755
> index ..78768c4f
> --- /dev/null
> +++ b/test/T680-html-indexing.sh
> @@ -0,0 +1,12 @@
> +#!/usr/bin/env bash
> +test_description="indexing of html parts"
> +. ./test-lib.sh || exit 1
> +
> +add_email_corpus html
> +
> +test_begin_subtest 'embedded images should not be indexed'
> +test_subtest_known_broken
> +notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
> +test_expect_equal_file /dev/null OUTPUT
> +
> +test_done
> diff --git a/test/corpora/README b/test/corpora/README
> index 77c48e6e..c9a35fed 100644
> --- a/test/corpora/README
> +++ b/test/corpora/README
> @@ -9,3 +9,6 @@ default
>  broken
>The broken corpus contains messages that are broken and/or RFC
>non-compliant, ensuring we deal with them in a sane way.
> +
> +html
> +  The html corpus contains html parts
> diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
> new file mode 100644
> index ..40851530
> --- /dev/null
> +++ b/test/corpora/html/embedded-image
> @@ -0,0 +1,69 @@
> +From: =?utf-8?b?bWFsbW9ib3Jn?= 
> +To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= 
> +Date: Tue, 19 Jul 2016 11:54:24 +0200
> +X-Feed2Imap-Version: 1.2.5
> +Message-Id: 
> +Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
> +Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
> +MIME-Version: 1.0
> +
> +
> +--=-1468922508-176605-12427-9500-21-=
> +Content-Type: text/plain; charset=utf-8; format=flowed
> +Content-Transfer-Encoding: 8bit
> +
> +
> +
> +Malmö 2016-07-09
> +
> +I skrivande stund är vi i färd med att avetablera vår entreprenad på
> +Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett
> +större dräneringsarbete som i sin tur har inneburit vissa
> +trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några
> +veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och
> +vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort.
> +Nu kommer den vackra fastigheten att klara sig torrskodd under många år
> +framöver [A]
> +
> +
> +
> +[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
> +--
> +Feed: Förvaltnings AB Malmöborg
> +
> +Item: Tack alla trafikanter och fotgängare!
> +
> +Date: 2016-07-19 11:54:24 +0200
> +Author: malmoborg
> +Filed under: Nyheter
> +
> +--=-1468922508-176605-12427-9500-21-=
> +Content-Type: text/html; charset=utf-8
> +Content-Transfer-Encoding: 8bit
> +
> + +borderspacing="0">  +cellpadding="4" cellspacing="2">  +align="right">Feed:  +href="http://malmoborg.se;> Förvaltnings AB Malmöborg 
> +Item:  +href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/
> +">Tack alla trafikanter och fotgängare! 
> +
> +
> +Malmö 2016-07-09
> +I

Re: Github an licensing

2017-03-18 Thread David Bremner

Mark Walters  writes:

>> On irc rlb pointed me to
>> https://joeyh.name/blog/entry/removing_everything_from_github/
>>
>> IANAL so I don't know whether it is a real problem, a hypothetical
>> problem or not a problem.
>
> I got a couple of links sent to me privately which look relevant:
>
>  * https://news.ycombinator.com/item?id=13767373
>  * https://news.ycombinator.com/item?id=13766933
>
> (Hacker News threads, but from people claiming to be lawyers)

Another comment, from the FSF


https://www.fsf.org/blogs/licensing/do-githubs-updated-terms-of-service-conflict-with-copyleft


[PATCH] test: add known broken test for indexing html

2017-03-18 Thread David Bremner

'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.
---

I'm not sure the best approach to fix this. Workarounds include
limiting the size of the part indexed, and skipping html parts. The
latter is easy, but probably too drastic.  A nice solution might be a
filter similar to the existing one that strips out uuencoded text but
for base64. Alas base64 crud seems to come with all kinds of syntactic
wrappers, so it's probably harder to filter.


 test/T680-html-indexing.sh   | 12 +++
 test/corpora/README  |  3 ++
 test/corpora/html/embedded-image | 69 
 3 files changed, 84 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index ..78768c4f
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index ..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= 
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= 
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: 
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på 
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större 
+dräneringsarbete som i sin tur har inneburit vissa 
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några 
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi 
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu 
+kommer den vackra fastigheten att klara sig torrskodd under många år 
+framöver [A]
+
+ 
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+-- 
+Feed: Förvaltnings AB Malmöborg
+
+Item: Tack alla trafikanter och fotgängare!
+
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+
+
+Feed:
+http://malmoborg.se;>
+Förvaltnings AB Malmöborg
+
+Item:
+http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/;>Tack
 alla trafikanter och fotgängare!
+
+
+
+Malmö 2016-07-09
+I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 
3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större 
dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på 
Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren 
är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda 
vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att 
klara sig torrskodd under många år framöver  
+
+
+
+Date:2016-07-19 11:54:24 +0200
+Author:malmoborg
+Filed 
under:Nyheter
+
+
+--=-1468922508-176605-12427-9500-21-=--
-- 
2.11.0
