Re: [notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-21 Thread Carl Worth
Hi James,

I just got to a point in my outstanding rework where I thought it would
make sense to pull this patch series in, (I'm adding support for storing
multiple filenames in a single mail document).

I took a closer look at this series, and I think it's still independent,
so I'll finish up what I'm doing and then add this series on top later.

But I can at least answer some of the questions you asked for now:

>   Does the re-indexing replace the old terms?

Before this patch, there's there's not yet any "re-indexing" in
notmuch. So we'll basically need to think about what we want to do here.

As this patch is written, (just calling into the existing _index_file
function), the re-indexing only adds new terms, (and doesn't delete
any). That's probably correct. We're using file size as an heuristic
that the larger file is a superset of the smaller file, but it doesn't
guarantee that the smaller file doesn't contain any unique terms. So I'd
be extremely hesitant to drop any terms here.

>   In the case
>   where you had a collision with different text this could
>   make a search return mails that don't contain that text.
>   I don't think it's a big issue though, even if that is the
>   case.

That's correct. As mentioned in a previous thread, this is likely only a
big issue in the face of deliberate message-ID spoofing or so. In that
thread we talked about some ideas for mitigating that. But I don't think
we need to solve that problem before applying this patch series.

-Carl


pgpIsgTCssm3P.pgp
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-21 Thread Carl Worth
Hi James,

I just got to a point in my outstanding rework where I thought it would
make sense to pull this patch series in, (I'm adding support for storing
multiple filenames in a single mail document).

I took a closer look at this series, and I think it's still independent,
so I'll finish up what I'm doing and then add this series on top later.

But I can at least answer some of the questions you asked for now:

>   Does the re-indexing replace the old terms?

Before this patch, there's there's not yet any "re-indexing" in
notmuch. So we'll basically need to think about what we want to do here.

As this patch is written, (just calling into the existing _index_file
function), the re-indexing only adds new terms, (and doesn't delete
any). That's probably correct. We're using file size as an heuristic
that the larger file is a superset of the smaller file, but it doesn't
guarantee that the smaller file doesn't contain any unique terms. So I'd
be extremely hesitant to drop any terms here.

>   In the case
>   where you had a collision with different text this could
>   make a search return mails that don't contain that text.
>   I don't think it's a big issue though, even if that is the
>   case.

That's correct. As mentioned in a previous thread, this is likely only a
big issue in the face of deliberate message-ID spoofing or so. In that
thread we talked about some ideas for mitigating that. But I don't think
we need to solve that problem before applying this patch series.

-Carl
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 



[notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-19 Thread James Westby
When we see a message where we already have the file
id stored, check if the size is larger. If it is then
re-index and set the file size and name to be the
new message.
---

  Here's the (quite simple) patch to implement indexing the
  largest copy of each mail that we have.

  Does the re-indexing replace the old terms? In the case
  where you had a collision with different text this could
  make a search return mails that don't contain that text.
  I don't think it's a big issue though, even if that is the
  case.

  Thanks,

  James

 lib/database.cc   |4 +++-
 lib/index.cc  |   27 +++
 lib/message.cc|   31 ++-
 lib/notmuch-private.h |   13 +
 lib/notmuch.h |5 +++--
 5 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index d834d94..64f29b9 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1000,7 +1000,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
} else {
-   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
+   ret = _notmuch_message_possibly_reindex (message, filename, size);
+   if (!ret)
+   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
goto DONE;
}

diff --git a/lib/index.cc b/lib/index.cc
index 125fa6c..14c3268 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -312,3 +312,30 @@ _notmuch_message_index_file (notmuch_message_t *message,

 return ret;
 }
+
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+const char *filename,
+const off_t size)
+{
+off_t realsize = size;
+off_t stored_size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, &realsize);
+if (ret)
+goto DONE;
+stored_size = _notmuch_message_get_filesize (message);
+if (realsize > stored_size) {
+   ret = _notmuch_message_index_file (message, filename);
+   if (ret)
+   goto DONE;
+   ret = _notmuch_message_set_filesize (message, filename, realsize);
+   _notmuch_message_set_filename (message, filename);
+   _notmuch_message_sync (message);
+}
+
+  DONE:
+return ret;
+
+}
diff --git a/lib/message.cc b/lib/message.cc
index 2bfc5ed..cc32741 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -427,23 +427,38 @@ _notmuch_message_set_filename (notmuch_message_t *message,
 }

 notmuch_status_t
-_notmuch_message_set_filesize (notmuch_message_t *message,
+_notmuch_message_size_on_disk (notmuch_message_t *message,
   const char *filename,
-  const off_t size)
+  off_t *size)
 {
 struct stat st;
-off_t realsize = size;
 notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;

-if (realsize < 0) {
+if (*size < 0) {
if (stat (filename, &st)) {
ret = NOTMUCH_STATUS_FILE_ERROR;
goto DONE;
} else {
-   realsize = st.st_size;
+   *size = st.st_size;
}
 }

+  DONE:
+return ret;
+}
+
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+  const char *filename,
+  const off_t size)
+{
+off_t realsize = size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, &realsize);
+if (ret)
+goto DONE;
+
 message->doc.add_value (NOTMUCH_VALUE_FILESIZE,
 Xapian::sortable_serialise (realsize));

@@ -451,6 +466,12 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
 return ret;
 }

+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message)
+{
+return Xapian::sortable_unserialise (message->doc.get_value 
(NOTMUCH_VALUE_FILESIZE));
+}
+
 const char *
 notmuch_message_get_filename (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 1ba3055..cf65fd9 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -199,6 +199,14 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
   const char *filename,
   const off_t size);

+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message);
+
+notmuch_status_t
+_notmuch_message_size_on_disk (notmuch_message_t *message,
+  const char *filename,
+  off_t *size);
+
 void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);

@@ -218,6 +226,11 @@ notmuch_status_t
 _notmuch_message_index_file (notmuch_message_t *message,
 const char *filename);

+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+const char 

[notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-18 Thread James Westby
When we see a message where we already have the file
id stored, check if the size is larger. If it is then
re-index and set the file size and name to be the
new message.
---

  Here's the (quite simple) patch to implement indexing the
  largest copy of each mail that we have.

  Does the re-indexing replace the old terms? In the case
  where you had a collision with different text this could
  make a search return mails that don't contain that text.
  I don't think it's a big issue though, even if that is the
  case.

  Thanks,

  James

 lib/database.cc   |4 +++-
 lib/index.cc  |   27 +++
 lib/message.cc|   31 ++-
 lib/notmuch-private.h |   13 +
 lib/notmuch.h |5 +++--
 5 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index d834d94..64f29b9 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1000,7 +1000,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
} else {
-   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
+   ret = _notmuch_message_possibly_reindex (message, filename, size);
+   if (!ret)
+   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
goto DONE;
}
 
diff --git a/lib/index.cc b/lib/index.cc
index 125fa6c..14c3268 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -312,3 +312,30 @@ _notmuch_message_index_file (notmuch_message_t *message,
 
 return ret;
 }
+
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+const char *filename,
+const off_t size)
+{
+off_t realsize = size;
+off_t stored_size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, &realsize);
+if (ret)
+goto DONE;
+stored_size = _notmuch_message_get_filesize (message);
+if (realsize > stored_size) {
+   ret = _notmuch_message_index_file (message, filename);
+   if (ret)
+   goto DONE;
+   ret = _notmuch_message_set_filesize (message, filename, realsize);
+   _notmuch_message_set_filename (message, filename);
+   _notmuch_message_sync (message);
+}
+
+  DONE:
+return ret;
+
+}
diff --git a/lib/message.cc b/lib/message.cc
index 2bfc5ed..cc32741 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -427,23 +427,38 @@ _notmuch_message_set_filename (notmuch_message_t *message,
 }
 
 notmuch_status_t
-_notmuch_message_set_filesize (notmuch_message_t *message,
+_notmuch_message_size_on_disk (notmuch_message_t *message,
   const char *filename,
-  const off_t size)
+  off_t *size)
 {
 struct stat st;
-off_t realsize = size;
 notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
 
-if (realsize < 0) {
+if (*size < 0) {
if (stat (filename, &st)) {
ret = NOTMUCH_STATUS_FILE_ERROR;
goto DONE;
} else {
-   realsize = st.st_size;
+   *size = st.st_size;
}
 }
 
+  DONE:
+return ret;
+}
+
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+  const char *filename,
+  const off_t size)
+{
+off_t realsize = size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, &realsize);
+if (ret)
+goto DONE;
+
 message->doc.add_value (NOTMUCH_VALUE_FILESIZE,
 Xapian::sortable_serialise (realsize));
 
@@ -451,6 +466,12 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
 return ret;
 }
 
+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message)
+{
+return Xapian::sortable_unserialise (message->doc.get_value 
(NOTMUCH_VALUE_FILESIZE));
+}
+
 const char *
 notmuch_message_get_filename (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 1ba3055..cf65fd9 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -199,6 +199,14 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
   const char *filename,
   const off_t size);
 
+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message);
+
+notmuch_status_t
+_notmuch_message_size_on_disk (notmuch_message_t *message,
+  const char *filename,
+  off_t *size);
+
 void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);
 
@@ -218,6 +226,11 @@ notmuch_status_t
 _notmuch_message_index_file (notmuch_message_t *message,
 const char *filename);
 
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+c