[PATCH v2 1/2] test: add known broken tests for known broken RFC 2047 encodings

2013-09-14 Thread David Bremner
Jani Nikula  writes:

> Some common broken RFC 2047 encodings that we currently let gmime
> parse strictly. We could tell gmime to be forgiving in what it accepts
> as RFC 2047 encoding, making these tests pass.

Pushed this version.

d


[PATCH v2 1/2] test: add known broken tests for known broken RFC 2047 encodings

2013-09-11 Thread Tomi Ollila
On Wed, Sep 11 2013, Jani Nikula  wrote:

> Some common broken RFC 2047 encodings that we currently let gmime
> parse strictly. We could tell gmime to be forgiving in what it accepts
> as RFC 2047 encoding, making these tests pass.
> ---

V2 LGTM.

Tomi

>  test/encoding | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/test/encoding b/test/encoding
> index 2e1326e..7372b6b 100755
> --- a/test/encoding
> +++ b/test/encoding
> @@ -29,4 +29,22 @@ add_message '[content-type]="text/plain; 
> charset=iso-8859-2"' \
>  output=$(notmuch search tučňáččí 2>&1 | notmuch_show_sanitize)
>  test_expect_equal "$output" "thread:0002   2001-01-05 [1/1] 
> Notmuch Test Suite; ISO-8859-2 encoded message (inbox unread)"
>  
> +test_begin_subtest "RFC 2047 encoded word with spaces"
> +test_subtest_known_broken
> +add_message '[subject]="=?utf-8?q?encoded word with spaces?="'
> +output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
> +test_expect_equal "$output" "thread:0003   2001-01-05 [1/1] 
> Notmuch Test Suite; encoded word with spaces (inbox unread)"
> +
> +test_begin_subtest "RFC 2047 encoded words back to back"
> +test_subtest_known_broken
> +add_message '[subject]="=?utf-8?q?encoded-words-back?==?utf-8?q?to-back?="'
> +output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
> +test_expect_equal "$output" "thread:0004   2001-01-05 [1/1] 
> Notmuch Test Suite; encoded-words-backto-back (inbox unread)"
> +
> +test_begin_subtest "RFC 2047 encoded words without space before or after"
> +test_subtest_known_broken
> +add_message '[subject]="=?utf-8?q?encoded?=word without=?utf-8?q?space?=" '
> +output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
> +test_expect_equal "$output" "thread:0005   2001-01-05 [1/1] 
> Notmuch Test Suite; encodedword withoutspace (inbox unread)"
> +
>  test_done
> -- 
> 1.8.4.rc3
>
> ___
> notmuch mailing list
> notmuch at notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch


[PATCH v2 1/2] test: add known broken tests for known broken RFC 2047 encodings

2013-09-11 Thread Jani Nikula
Some common broken RFC 2047 encodings that we currently let gmime
parse strictly. We could tell gmime to be forgiving in what it accepts
as RFC 2047 encoding, making these tests pass.
---
 test/encoding | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/test/encoding b/test/encoding
index 2e1326e..7372b6b 100755
--- a/test/encoding
+++ b/test/encoding
@@ -29,4 +29,22 @@ add_message '[content-type]="text/plain; charset=iso-8859-2"' \
 output=$(notmuch search tučňáččí 2>&1 | notmuch_show_sanitize)
 test_expect_equal "$output" "thread:0002   2001-01-05 [1/1] Notmuch Test Suite; ISO-8859-2 encoded message (inbox unread)"
 
+test_begin_subtest "RFC 2047 encoded word with spaces"
+test_subtest_known_broken
+add_message '[subject]="=?utf-8?q?encoded word with spaces?="'
+output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
+test_expect_equal "$output" "thread:0003   2001-01-05 [1/1] Notmuch Test Suite; encoded word with spaces (inbox unread)"
+
+test_begin_subtest "RFC 2047 encoded words back to back"
+test_subtest_known_broken
+add_message '[subject]="=?utf-8?q?encoded-words-back?==?utf-8?q?to-back?="'
+output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
+test_expect_equal "$output" "thread:0004   2001-01-05 [1/1] Notmuch Test Suite; encoded-words-backto-back (inbox unread)"
+
+test_begin_subtest "RFC 2047 encoded words without space before or after"
+test_subtest_known_broken
+add_message '[subject]="=?utf-8?q?encoded?=word without=?utf-8?q?space?=" '
+output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
+test_expect_equal "$output" "thread:0005   2001-01-05 [1/1] Notmuch Test Suite; encodedword withoutspace (inbox unread)"
+
 test_done
-- 
1.8.4.rc3
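
To make the three broken cases in this patch concrete, here is a minimal Python sketch of what "forgiving" decoding could look like. It is not gmime's implementation and not part of the notmuch test suite: decode_lenient and its helpers are hypothetical names, and the loose regular expression is only an assumption about how one might match encoded words that contain spaces or lack surrounding whitespace.

```python
import base64
import re

# Deliberately loose: the payload may contain spaces, and an encoded word
# need not be surrounded by whitespace (both are violations of RFC 2047).
_ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?(.*?)\?=")
_HEX_ESCAPE = re.compile(r"=([0-9A-Fa-f]{2})")


def _decode_q(payload):
    """Decode the "Q" encoding: "_" means space, "=XX" is a hex-escaped byte."""
    out = bytearray()
    pos = 0
    for match in _HEX_ESCAPE.finditer(payload):
        out += payload[pos:match.start()].replace("_", " ").encode("utf-8")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += payload[pos:].replace("_", " ").encode("utf-8")
    return bytes(out)


def _decode_word(match):
    """Decode one encoded word; fall back to the raw text if that fails."""
    charset, encoding, payload = match.groups()
    try:
        if encoding.lower() == "b":
            raw = base64.b64decode(payload)
        else:
            raw = _decode_q(payload)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or undecodable payload: leave the text untouched.
        return match.group(0)


def decode_lenient(value):
    """Decode every encoded word found in value, however it is delimited."""
    parts = []
    pos = 0
    for match in _ENCODED_WORD.finditer(value):
        literal = value[pos:match.start()]
        # RFC 2047 drops whitespace that only separates two encoded words.
        if not (pos > 0 and literal.isspace()):
            parts.append(literal)
        parts.append(_decode_word(match))
        pos = match.end()
    parts.append(value[pos:])
    return "".join(parts)
```

Fed the three subjects from the patch, this sketch returns the strings the known-broken tests expect. A strict parser would instead leave some or all of the raw "=?utf-8?q?...?=" text in place.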



[PATCH v2 1/2] test: add known broken tests for known broken RFC 2047 encodings

2013-09-11 Thread Austin Clements
v2 LGTM.

Quoth Jani Nikula on Sep 11 at  8:36 pm:
> Some common broken RFC 2047 encodings that we currently let gmime
> parse strictly. We could tell gmime to be forgiving in what it accepts
> as RFC 2047 encoding, making these tests pass.
> ---
>  test/encoding | 18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/test/encoding b/test/encoding
> index 2e1326e..7372b6b 100755
> --- a/test/encoding
> +++ b/test/encoding
> @@ -29,4 +29,22 @@ add_message '[content-type]="text/plain; 
> charset=iso-8859-2"' \
>  output=$(notmuch search tučňáččí 2>&1 | notmuch_show_sanitize)
>  test_expect_equal "$output" "thread:0002   2001-01-05 [1/1] 
> Notmuch Test Suite; ISO-8859-2 encoded message (inbox unread)"
>  
> +test_begin_subtest "RFC 2047 encoded word with spaces"
> +test_subtest_known_broken
> +add_message '[subject]="=?utf-8?q?encoded word with spaces?="'
> +output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
> +test_expect_equal "$output" "thread:0003   2001-01-05 [1/1] 
> Notmuch Test Suite; encoded word with spaces (inbox unread)"
> +
> +test_begin_subtest "RFC 2047 encoded words back to back"
> +test_subtest_known_broken
> +add_message '[subject]="=?utf-8?q?encoded-words-back?==?utf-8?q?to-back?="'
> +output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
> +test_expect_equal "$output" "thread:0004   2001-01-05 [1/1] 
> Notmuch Test Suite; encoded-words-backto-back (inbox unread)"
> +
> +test_begin_subtest "RFC 2047 encoded words without space before or after"
> +test_subtest_known_broken
> +add_message '[subject]="=?utf-8?q?encoded?=word without=?utf-8?q?space?=" '
> +output=$(notmuch search id:${gen_msg_id} 2>&1 | notmuch_show_sanitize)
> +test_expect_equal "$output" "thread:0005   2001-01-05 [1/1] 
> Notmuch Test Suite; encodedword withoutspace (inbox unread)"
> +
>  test_done


Encodings

2011-07-13 Thread Patrick Totzke
Hi Uwe,

On Wed, Jul 13, 2011 at 09:04:47AM +0200, Uwe Kleine-König wrote:
> > But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
> I think it would be right to enforce that tags are utf-8 encoded.
> Otherwise the users get strange results if they change their locale.

I agree that it would be very nice indeed if it was safe to assume
all tags are utf-8. But I also see that it's a bit of an effort
to ensure this, as all UIs would have to explicitly recode
stuff that isn't utf-8.
It seems to be a consciously made design decision to allow
other encodings for tags, which is up for discussion of course.
All I'm saying is that the bindings should conform. And if it's
not safe to assume utf-8 here, we shouldn't decode as such.

I'm unsure what happens in all the new get_part() parts of the API.
If all MIME part text there is also returned as utf-8, it would only
be consistent to bend tag encodings to utf-8 as well. But I doubt that's the case.
Can anyone clarify this?
/Patrick
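
One way a binding could hand out text without assuming that every tag is valid utf-8 is sketched below. This is only an illustration in Python 3 (the thread itself concerns Python 2.7), not what the notmuch bindings do; tag_to_text and text_to_tag are hypothetical names, and the sketch relies on the standard "surrogateescape" error handler.

```python
def tag_to_text(raw_tag):
    """Decode tag bytes to str; bytes that are not valid UTF-8 become lone surrogates."""
    return raw_tag.decode("utf-8", errors="surrogateescape")


def text_to_tag(text):
    """Inverse of tag_to_text: recover exactly the bytes that were stored."""
    return text.encode("utf-8", errors="surrogateescape")


# A tag written under a Latin-1 locale survives the round trip unchanged.
latin1_tag = "café".encode("latin-1")
round_tripped = text_to_tag(tag_to_text(latin1_tag))
```

The design point is that decoding never raises and never loses information, so a tag in some other encoding can still be passed back to the library byte for byte.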
-- next part --
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: Digital signature
URL: 
<http://notmuchmail.org/pipermail/notmuch/attachments/20110713/2c748d5a/attachment.pgp>


Encodings

2011-07-13 Thread Uwe Kleine-König
Hi Patrick,

On Tue, Jul 12, 2011 at 10:29:58PM +0100, Patrick Totzke wrote:
> I noticed that commit 687366b920caa5de6ea0b66b70cf2a11e5399f7b
> breaks things with Database.get_all_tags:
> 
> -->%-
> AttributeErrorTraceback (most recent call last)
> 
> /home/pazz/projects/alot/ in ()
> 
> /usr/local/lib/python2.7/dist-packages/notmuch/tag.pyc in next(self)
>  86 # No need to call nmlib.notmuch_tags_valid(self._tags);
> 
>  87 # Tags._get safely returns None, if there is no more valid 
> tag.
> 
> ---> 88 tag = Tags._get(self._tags).decode('utf-8')
>  89 if tag is None:
>  90 self._tags = None
> 
> AttributeError: 'NoneType' object has no attribute 'decode'
> %<---
> 
> The reason is that the Tags.next() tries to decode before it tests if tag is 
> None.
> Now, we _could_ apply a patch like this one here:
> 
> -->%-
> diff --git a/bindings/python/notmuch/tag.py b/bindings/python/notmuch/tag.py
> index 65a9118..2ae670d 100644
> --- a/bindings/python/notmuch/tag.py
> +++ b/bindings/python/notmuch/tag.py
> @@ -85,12 +85,12 @@ class Tags(object):
>  raise NotmuchError(STATUS.NOT_INITIALIZED)
>  # No need to call nmlib.notmuch_tags_valid(self._tags);
>  # Tags._get safely returns None, if there is no more valid tag.
> -tag = Tags._get(self._tags).decode('utf-8')
> +tag = Tags._get(self._tags)
>  if tag is None:
>  self._tags = None
>  raise StopIteration
>  nmlib.notmuch_tags_move_to_next(self._tags)
> -return tag
> +return tag.decode('utf-8')
>  
>  def __nonzero__(self):
>  """Implement bool(Tags) check that can be repeatedly used
> ---%<-
> 
> But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
I think it would be right to enforce that tags are utf-8 encoded.
Otherwise the users get strange results if they change their locale.

Best regards
Uwe
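
The "strange results" after a locale change can be illustrated with a few lines of Python. This is a sketch with a made-up tag name, assuming the user interface passes tag bytes through in whatever encoding the locale uses; can_read_as_utf8 is a hypothetical helper.

```python
tag = "büro"  # hypothetical tag name containing a non-ASCII character

stored_latin1 = tag.encode("latin-1")  # bytes handed over under a Latin-1 locale
query_utf8 = tag.encode("utf-8")       # bytes handed over under a UTF-8 locale

# The same visible tag yields two different byte strings, so a byte-wise
# lookup for one does not find the other.
same_bytes = stored_latin1 == query_utf8


def can_read_as_utf8(raw):
    """Return True if raw is valid UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Here same_bytes is False, and the Latin-1 bytes are not even valid utf-8, which is the situation in which a blind decode('utf-8') in the bindings would fail.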


Encodings

2011-07-12 Thread Patrick Totzke
Hiya,

I noticed that commit 687366b920caa5de6ea0b66b70cf2a11e5399f7b
breaks things with Database.get_all_tags:

-->%-
AttributeError                            Traceback (most recent call last)

/home/pazz/projects/alot/ in ()

/usr/local/lib/python2.7/dist-packages/notmuch/tag.pyc in next(self)
     86         # No need to call nmlib.notmuch_tags_valid(self._tags);
     87         # Tags._get safely returns None, if there is no more valid tag.
---> 88         tag = Tags._get(self._tags).decode('utf-8')
     89         if tag is None:
     90             self._tags = None

AttributeError: 'NoneType' object has no attribute 'decode'
%<---

The reason is that Tags.next() tries to decode before it tests whether tag is
None.
Now, we _could_ apply a patch like this one here:

-->%-
diff --git a/bindings/python/notmuch/tag.py b/bindings/python/notmuch/tag.py
index 65a9118..2ae670d 100644
--- a/bindings/python/notmuch/tag.py
+++ b/bindings/python/notmuch/tag.py
@@ -85,12 +85,12 @@ class Tags(object):
             raise NotmuchError(STATUS.NOT_INITIALIZED)
         # No need to call nmlib.notmuch_tags_valid(self._tags);
         # Tags._get safely returns None, if there is no more valid tag.
-        tag = Tags._get(self._tags).decode('utf-8')
+        tag = Tags._get(self._tags)
         if tag is None:
             self._tags = None
             raise StopIteration
         nmlib.notmuch_tags_move_to_next(self._tags)
-        return tag
+        return tag.decode('utf-8')
 
     def __nonzero__(self):
         """Implement bool(Tags) check that can be repeatedly used
---%<-

But as Carl says, we cannot guarantee that a tag is utf8 encoded anyway.
So I'd suggest we just revert the commit in question.
best,
/p
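
The ordering problem described in this message can be reproduced without libnotmuch. The following Python 3 sketch uses a stand-in class; TagIterator and its list-backed _get are hypothetical and only mimic the "returns None when exhausted" behaviour that the comment in tag.py attributes to Tags._get.

```python
class TagIterator:
    """Stand-in for the bindings' Tags class, backed by a plain list of bytes."""

    def __init__(self, raw_tags):
        self._pending = list(raw_tags)

    def _get(self):
        # Mimics Tags._get: returns None once there is no more valid tag.
        return self._pending.pop(0) if self._pending else None

    def __iter__(self):
        return self

    def __next__(self):
        tag = self._get()
        if tag is None:
            # Test for exhaustion first; calling .decode() on None is
            # exactly what raised the AttributeError in the traceback.
            raise StopIteration
        return tag.decode("utf-8")


tags = list(TagIterator([b"inbox", b"unread"]))
```

With the check placed before the decode, iteration ends cleanly instead of raising AttributeError on the final call. Whether decoding as utf-8 is appropriate at all is the separate question raised in the rest of the thread.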



Encodings

2011-07-12 Thread Patrick Totzke
Hi!

As discussed on IRC, if notmuch stores header values in utf8,
it's safe to decode them to unicode instances here.
best,
/p


On Mon, Jul 11, 2011 at 08:03:38AM -0700, Carl Worth wrote:
> On Mon, 11 Jul 2011 16:04:17 +0200, Sebastian Spaeth wrote:
> > The answer is that things are very implicit. notmuch.h speaks of
> > strings but never mentions encodings
> 
> Much of this was intentional on my part.
> 
> For example, I intentionally avoided restrictions on what could be
> stored as a tag in the database, (other than the terminating character
> implied by "string" of course).
> 
> > So, can we document what encoding we are expected to pass in the various
> > APIs
> 
> Yes, let's clarify documentation wherever we need to.
> 
> > For some of the stuff we read directly from the files, eg
> > arbitrary headers, we can probably be least sure
> 
> The headers should be decoded to utf-8, (via
> g_mime_utils_header_decode_text), before being stored in the database.
> 
> > but are e.g. the returned tags always utf-8?
> 
> No. The tag data is returned exactly as the user presented it.
> 
> > I would love to make the python bindings use unicode() instances in
> > cases where we can be sure to actually receive utf-8 encoded strings.
> > 
> > Encodings make my brain hurt. Unfortunately one cannot simply ignore
> > them.
> 
> I think a lot of the pain here is due to some bad design decisions in
> python itself. Of course, my saying that doesn't make things any easier
> for you.
> 
> But do tell me what more we can do to clarify behavior or documentation.
> 
> -Carl
> 
> -- 
> carl.d.worth at intel.com



> ___
> notmuch mailing list
> notmuch at notmuchmail.org
> http://notmuchmail.org/mailman/listinfo/notmuch

-- next part --
A non-text attachment was scrubbed...
Name: 0001-unicode-return-value-for-Message.get_header.patch
Type: text/x-diff
Size: 1728 bytes
Desc: not available
URL: 
<http://notmuchmail.org/pipermail/notmuch/attachments/20110712/62f17e9c/attachment.patch>


Re: Encodings

2011-07-12 Thread Patrick Totzke

From 988a9832d714dfa0f91b2b1185a50acb4a6ca4b5 Mon Sep 17 00:00:00 2001
From: pazz 
Date: Tue, 12 Jul 2011 19:47:39 +0100
Subject: [PATCH 1/8] unicode return value for Message.get_header()

As discussed in IRC, notmuch recodes mailheaders to
utf-8, so we can safely decode them into unicode instances.
---
 bindings/python/notmuch/message.py |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bindings/python/notmuch/message.py b/bindings/python/notmuch/message.py
index 763d2c6..4a43a88 100644
--- a/bindings/python/notmuch/message.py
+++ b/bindings/python/notmuch/message.py
@@ -379,14 +379,16 @@ class Message(object):
 
 :param header: The name of the header to be retrieved.
It is not case-sensitive (TODO: confirm).
-:type header: str
-:returns: The header value as string
+:type header: str or unicode instance
+:returns: The header value as a unicode string
 :exception: :exc:`NotmuchError`
 
 * STATUS.NOT_INITIALIZED if the message 
   is not initialized.
 * STATUS.NULL_POINTER, if no header was found
 """
+if isinstance(header, unicode):
+header = header.encode('utf-8')
 if self._msg is None:
 raise NotmuchError(STATUS.NOT_INITIALIZED)
 
@@ -394,7 +396,7 @@ class Message(object):
 header = Message._get_header (self._msg, header)
 if header == None:
 raise NotmuchError(STATUS.NULL_POINTER)
-return header
+return header.decode('utf-8')
 
 def get_filename(self):
 """Returns the file path of the message file
-- 
1.7.4.1
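
The pattern in this patch (accept text or bytes for the header name, always return text) can be sketched in Python 3 as follows. This is not the bindings' code: get_header here is a free function, lookup is a hypothetical stand-in for Message._get_header, and KeyError stands in for NotmuchError(STATUS.NULL_POINTER).

```python
def get_header(lookup, name):
    """Return a header value as text.

    lookup is any callable mapping a bytes header name to a UTF-8 encoded
    bytes value, or to None if the header does not exist.
    """
    if isinstance(name, str):
        # Encode text names on the way in, as the patch does for unicode.
        name = name.encode("utf-8")
    value = lookup(name)
    if value is None:
        raise KeyError(name)
    # Safe only because header values are assumed to be stored as UTF-8.
    return value.decode("utf-8")


# Hypothetical header store standing in for the notmuch database.
headers = {b"subject": "Grüße aus Köln".encode("utf-8")}
subject = get_header(headers.get, "subject")
```

Both get_header(headers.get, "subject") and get_header(headers.get, b"subject") return the same str, which is the behaviour the patch aims for with unicode instances in Python 2.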





Encodings

2011-07-11 Thread Sebastian Spaeth
Hi all,
after I was notified about how notmuch's python bindings perform
differently depending on whether we hand it (byte-based) ASCII strings
or unicode, I tried to disentangle what encodings to expect and send it
to. The answer is that things are very implicit. notmuch.h speaks of
strings but never mentions encodings, xapian docs don't mention
encodings but ojwb confirmed that it expects utf-8.

So, can we document what encoding we are expected to pass in the various
APIs and where we can guarantee to actually return UTF-8 encoded
strings? For some of the stuff we read directly from the files, eg
arbitrary headers, we can probably be least sure, but are e.g. the
returned tags always utf-8?

I would love to make the python bindings use unicode() instances in
cases where we can be sure to actually receive utf-8 encoded strings.

Encodings make my brain hurt. Unfortunately one cannot simply ignore
them.

Sebastian


Re: Encodings

2011-07-11 Thread Carl Worth
On Mon, 11 Jul 2011 16:04:17 +0200, Sebastian Spaeth wrote:
> The answer is that things are very implicit. notmuch.h speaks of
> strings but never mentions encodings

Much of this was intentional on my part.

For example, I intentionally avoided restrictions on what could be
stored as a tag in the database, (other than the terminating character
implied by "string" of course).

> So, can we document what encoding we are expected to pass in the various
> APIs

Yes, let's clarify documentation wherever we need to.

> For some of the stuff we read directly from the files, eg
> arbitrary headers, we can probably be least sure

The headers should be decoded to utf-8, (via
g_mime_utils_header_decode_text), before being stored in the database.

> but are e.g. the returned tags always utf-8?

No. The tag data is returned exactly as the user presented it.

> I would love to make the python bindings use unicode() instances in
> cases where we can be sure to actually receive utf-8 encoded strings.
> 
> Encodings make my brain hurt. Unfortunately one cannot simply ignore
> them.

I think a lot of the pain here is due to some bad design decisions in
python itself. Of course, my saying that doesn't make things any easier
for you.

But do tell me what more we can do to clarify behavior or documentation.

-Carl

-- 
carl.d.wo...@intel.com


___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch

