[notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-21 Thread Carl Worth
Hi James,

I just got to a point in my outstanding rework where I thought it would
make sense to pull this patch series in, (I'm adding support for storing
multiple filenames in a single mail document).

I took a closer look at this series, and I think it's still independent,
so I'll finish up what I'm doing and then add this series on top later.

But I can at least answer some of the questions you asked for now:

>   Does the re-indexing replace the old terms?

Before this patch, there's there's not yet any "re-indexing" in
notmuch. So we'll basically need to think about what we want to do here.

As this patch is written, (just calling into the existing _index_file
function), the re-indexing only adds new terms, (and doesn't delete
any). That's probably correct. We're using file size as an heuristic
that the larger file is a superset of the smaller file, but it doesn't
guarantee that the smaller file doesn't contain any unique terms. So I'd
be extremely hesitant to drop any terms here.

>   In the case
>   where you had a collision with different text this could
>   make a search return mails that don't contain that text.
>   I don't think it's a big issue though, even if that is the
>   case.

That's correct. As mentioned in a previous thread, this is likely only a
big issue in the face of deliberate message-ID spoofing or so. In that
thread we talked about some ideas for mitigating that. But I don't think
we need to solve that problem before applying this patch series.

-Carl
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: 



Re: [notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-21 Thread Carl Worth
Hi James,

I just got to a point in my outstanding rework where I thought it would
make sense to pull this patch series in, (I'm adding support for storing
multiple filenames in a single mail document).

I took a closer look at this series, and I think it's still independent,
so I'll finish up what I'm doing and then add this series on top later.

But I can at least answer some of the questions you asked for now:

   Does the re-indexing replace the old terms?

Before this patch, there's there's not yet any re-indexing in
notmuch. So we'll basically need to think about what we want to do here.

As this patch is written, (just calling into the existing _index_file
function), the re-indexing only adds new terms, (and doesn't delete
any). That's probably correct. We're using file size as an heuristic
that the larger file is a superset of the smaller file, but it doesn't
guarantee that the smaller file doesn't contain any unique terms. So I'd
be extremely hesitant to drop any terms here.

   In the case
   where you had a collision with different text this could
   make a search return mails that don't contain that text.
   I don't think it's a big issue though, even if that is the
   case.

That's correct. As mentioned in a previous thread, this is likely only a
big issue in the face of deliberate message-ID spoofing or so. In that
thread we talked about some ideas for mitigating that. But I don't think
we need to solve that problem before applying this patch series.

-Carl


pgpIsgTCssm3P.pgp
Description: PGP signature
___
notmuch mailing list
notmuch@notmuchmail.org
http://notmuchmail.org/mailman/listinfo/notmuch


[notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-19 Thread James Westby
When we see a message where we already have the file
id stored, check if the size is larger. If it is then
re-index and set the file size and name to be the
new message.
---

  Here's the (quite simple) patch to implement indexing the
  largest copy of each mail that we have.

  Does the re-indexing replace the old terms? In the case
  where you had a collision with different text this could
  make a search return mails that don't contain that text.
  I don't think it's a big issue though, even if that is the
  case.

  Thanks,

  James

 lib/database.cc   |4 +++-
 lib/index.cc  |   27 +++
 lib/message.cc|   31 ++-
 lib/notmuch-private.h |   13 +
 lib/notmuch.h |5 +++--
 5 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index d834d94..64f29b9 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1000,7 +1000,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
} else {
-   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
+   ret = _notmuch_message_possibly_reindex (message, filename, size);
+   if (!ret)
+   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
goto DONE;
}

diff --git a/lib/index.cc b/lib/index.cc
index 125fa6c..14c3268 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -312,3 +312,30 @@ _notmuch_message_index_file (notmuch_message_t *message,

 return ret;
 }
+
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+const char *filename,
+const off_t size)
+{
+off_t realsize = size;
+off_t stored_size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, );
+if (ret)
+goto DONE;
+stored_size = _notmuch_message_get_filesize (message);
+if (realsize > stored_size) {
+   ret = _notmuch_message_index_file (message, filename);
+   if (ret)
+   goto DONE;
+   ret = _notmuch_message_set_filesize (message, filename, realsize);
+   _notmuch_message_set_filename (message, filename);
+   _notmuch_message_sync (message);
+}
+
+  DONE:
+return ret;
+
+}
diff --git a/lib/message.cc b/lib/message.cc
index 2bfc5ed..cc32741 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -427,23 +427,38 @@ _notmuch_message_set_filename (notmuch_message_t *message,
 }

 notmuch_status_t
-_notmuch_message_set_filesize (notmuch_message_t *message,
+_notmuch_message_size_on_disk (notmuch_message_t *message,
   const char *filename,
-  const off_t size)
+  off_t *size)
 {
 struct stat st;
-off_t realsize = size;
 notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;

-if (realsize < 0) {
+if (*size < 0) {
if (stat (filename, )) {
ret = NOTMUCH_STATUS_FILE_ERROR;
goto DONE;
} else {
-   realsize = st.st_size;
+   *size = st.st_size;
}
 }

+  DONE:
+return ret;
+}
+
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+  const char *filename,
+  const off_t size)
+{
+off_t realsize = size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, );
+if (ret)
+goto DONE;
+
 message->doc.add_value (NOTMUCH_VALUE_FILESIZE,
 Xapian::sortable_serialise (realsize));

@@ -451,6 +466,12 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
 return ret;
 }

+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message)
+{
+return Xapian::sortable_unserialise (message->doc.get_value 
(NOTMUCH_VALUE_FILESIZE));
+}
+
 const char *
 notmuch_message_get_filename (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 1ba3055..cf65fd9 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -199,6 +199,14 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
   const char *filename,
   const off_t size);

+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message);
+
+notmuch_status_t
+_notmuch_message_size_on_disk (notmuch_message_t *message,
+  const char *filename,
+  off_t *size);
+
 void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);

@@ -218,6 +226,11 @@ notmuch_status_t
 _notmuch_message_index_file (notmuch_message_t *message,
 const char *filename);

+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+const char *filename,
+ 

[notmuch] [PATCH] Reindex larger files that duplicate ids we have

2009-12-18 Thread James Westby
When we see a message where we already have the file
id stored, check if the size is larger. If it is then
re-index and set the file size and name to be the
new message.
---

  Here's the (quite simple) patch to implement indexing the
  largest copy of each mail that we have.

  Does the re-indexing replace the old terms? In the case
  where you had a collision with different text this could
  make a search return mails that don't contain that text.
  I don't think it's a big issue though, even if that is the
  case.

  Thanks,

  James

 lib/database.cc   |4 +++-
 lib/index.cc  |   27 +++
 lib/message.cc|   31 ++-
 lib/notmuch-private.h |   13 +
 lib/notmuch.h |5 +++--
 5 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index d834d94..64f29b9 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1000,7 +1000,9 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
if (ret)
goto DONE;
} else {
-   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
+   ret = _notmuch_message_possibly_reindex (message, filename, size);
+   if (!ret)
+   ret = NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
goto DONE;
}
 
diff --git a/lib/index.cc b/lib/index.cc
index 125fa6c..14c3268 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -312,3 +312,30 @@ _notmuch_message_index_file (notmuch_message_t *message,
 
 return ret;
 }
+
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+const char *filename,
+const off_t size)
+{
+off_t realsize = size;
+off_t stored_size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, realsize);
+if (ret)
+goto DONE;
+stored_size = _notmuch_message_get_filesize (message);
+if (realsize  stored_size) {
+   ret = _notmuch_message_index_file (message, filename);
+   if (ret)
+   goto DONE;
+   ret = _notmuch_message_set_filesize (message, filename, realsize);
+   _notmuch_message_set_filename (message, filename);
+   _notmuch_message_sync (message);
+}
+
+  DONE:
+return ret;
+
+}
diff --git a/lib/message.cc b/lib/message.cc
index 2bfc5ed..cc32741 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -427,23 +427,38 @@ _notmuch_message_set_filename (notmuch_message_t *message,
 }
 
 notmuch_status_t
-_notmuch_message_set_filesize (notmuch_message_t *message,
+_notmuch_message_size_on_disk (notmuch_message_t *message,
   const char *filename,
-  const off_t size)
+  off_t *size)
 {
 struct stat st;
-off_t realsize = size;
 notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
 
-if (realsize  0) {
+if (*size  0) {
if (stat (filename, st)) {
ret = NOTMUCH_STATUS_FILE_ERROR;
goto DONE;
} else {
-   realsize = st.st_size;
+   *size = st.st_size;
}
 }
 
+  DONE:
+return ret;
+}
+
+notmuch_status_t
+_notmuch_message_set_filesize (notmuch_message_t *message,
+  const char *filename,
+  const off_t size)
+{
+off_t realsize = size;
+notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
+
+ret = _notmuch_message_size_on_disk (message, filename, realsize);
+if (ret)
+goto DONE;
+
 message-doc.add_value (NOTMUCH_VALUE_FILESIZE,
 Xapian::sortable_serialise (realsize));
 
@@ -451,6 +466,12 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
 return ret;
 }
 
+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message)
+{
+return Xapian::sortable_unserialise (message-doc.get_value 
(NOTMUCH_VALUE_FILESIZE));
+}
+
 const char *
 notmuch_message_get_filename (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 1ba3055..cf65fd9 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -199,6 +199,14 @@ _notmuch_message_set_filesize (notmuch_message_t *message,
   const char *filename,
   const off_t size);
 
+off_t
+_notmuch_message_get_filesize (notmuch_message_t *message);
+
+notmuch_status_t
+_notmuch_message_size_on_disk (notmuch_message_t *message,
+  const char *filename,
+  off_t *size);
+
 void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);
 
@@ -218,6 +226,11 @@ notmuch_status_t
 _notmuch_message_index_file (notmuch_message_t *message,
 const char *filename);
 
+notmuch_status_t
+_notmuch_message_possibly_reindex (notmuch_message_t *message,
+const