Re: De-duping attachments

2010-09-16 Thread Marc Patermann
Hi, Shuvam Misra wrote on 15.09.2010 at 03:40: How difficult or easy would it be to modify Cyrus to strip all attachments from emails and store them separately in files? In the message file, replace the attachment with a special tag which will point to the attachment file. Whenever the

Re: De-duping attachments

2010-09-15 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 08:40:59AM +0530, Shuvam Misra wrote: Dear Rob, I had reservations about some of these things too. :( In particular, I was wondering about having to remember and recreate the exact transfer-encoding. If both of us forward the same attachment in two emails, and one

Re: De-duping attachments

2010-09-15 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 09:15:13AM +0530, Shuvam Misra wrote: Dear Bron, http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413 2TB - US $109. Don't want to nit-pick here, but the effective price we pay is about ten times this. Yeah, so? It's going down. That's a large

Re: De-duping attachments

2010-09-15 Thread Simon Matter
On Wed, Sep 15, 2010 at 09:15:13AM +0530, Shuvam Misra wrote: Dear Bron, http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413 2TB - US $109. Don't want to nit-pick here, but the effective price we pay is about ten times this. Yeah, so? It's going down. That's a large

Re: De-duping attachments

2010-09-15 Thread Eric Luyten
On Wed, September 15, 2010 10:01 am, Simon Matter wrote: I guess much more efficient than a compressing filesystem would be a compressing and de-duping filesystem or disk storage in this case. Has anyone tried this with a Cyrus message store with lots of corporate message data stored on it?

Re: De-duping attachments

2010-09-15 Thread Shuvam Misra
Dear Bron, So you save, what, 50%. Does that sound about right? Do you have statistics on how much space you'd save with this theoretical patch? No, and this is the first thing I want to do. I'm getting some simple utilities developed which will run all week (niced suitably) and extract and

Re: De-duping attachments

2010-09-15 Thread Shuvam Misra
The sparse file idea is brilliant! Never occurred to me. :) We'd have to store the reference-pointer in the message file, so we would omit the actual attachment but eat up perhaps 50 bytes to keep the reference to the file. Shuvam 1. Completely rewrite the message file removing the

Re: De-duping attachments

2010-09-15 Thread Shuvam Misra
Makes sense. There might be some size based logic here too - only bother applying this on messages over 20k, and where the attachment is at least 20k in size. Anything smaller than that is pretty pointless. Yes, absolutely. Left to myself, I'd not have bothered with any attachment less than

Re: De-duping attachments

2010-09-15 Thread Nik Conwell
Great thread. Here are some real-world numbers based on our spools here at BU. One of our masters has 4,800 users, 22,000 mailboxes, and is using about 374G of disk. Based on the md5 files for these users there are 6,046,363 messages. If I look at the first md5 value (md5 on the msg if I

Re: De-duping attachments

2010-09-15 Thread Simon Matter
On Wed, September 15, 2010 10:01 am, Simon Matter wrote: I guess much more efficient than a compressing filesystem would be a compressing and de-duping filesystem or disk storage in this case. Has anyone tried this with a Cyrus message store with lots of corporate message data stored on

Re: De-duping attachments

2010-09-15 Thread Eric Luyten
On Wed, September 15, 2010 2:12 pm, Simon Matter wrote: You said ZFS, did you consider testing its built-in deduping? (If it's even there in Solaris 10?) Simon, OpenSolaris has had it (block-level dedup) for about a year, but it is too recent an addition to the commercial Solaris 10 to

Re: De-duping attachments

2010-09-15 Thread Joseph Brennan
Outside the Cyrus box: The MIMEDefang milter has a built-in function (optional of course) to remove an attachment, write it to a file, and replace the attachment part with a text part giving a web link to the file. The files could be on a slower type of disk drive than you need for email

Re: De-duping attachments

2010-09-15 Thread Gavin McCullagh
Hi, On Wed, 15 Sep 2010, Nik Conwell wrote: Isn't the easy hack for dedup just looking at the above md5 files and then doing appropriate hard links? This could be done by a nightly trawl of the spool space. A bigger win would be to separate the headers from the messages but that's a lot
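
The "nightly trawl plus hard links" idea quoted above can be sketched roughly as follows. This is a minimal illustration, not anything from Cyrus itself: the function names and the `.dedup-tmp` suffix are hypothetical, it hashes the files directly rather than reading any existing md5 index, and it assumes the whole spool sits on one filesystem (hard links cannot cross filesystems) and that nothing else is writing to the files while it runs.

```python
import filecmp
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(spool_root):
    """Replace byte-identical files under spool_root with hard links.

    Returns the number of files that were turned into links.
    """
    seen = {}  # digest -> first path seen with that content
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(spool_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            original = seen.setdefault(file_digest(path), path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy, or already linked to it
            # Guard against hash collisions with a full byte comparison.
            if not filecmp.cmp(original, path, shallow=False):
                continue
            # Link under a temporary name, then atomically swap it in.
            tmp = path + ".dedup-tmp"  # hypothetical suffix
            os.link(original, tmp)
            os.replace(tmp, path)
            linked += 1
    return linked
```

Note that this only helps when whole message files are identical; two messages carrying the same attachment but different headers would not be merged.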

Re: De-duping attachments

2010-09-15 Thread Patrick Goetz
On 09/14/2010 11:55 PM, Rob Mueller wrote: Eg. An architectural firm might end up sending big blueprint documents back and forth between each other a lot, so they'd gain a lot from deduplication. Not to throw a damp towel on this discussion, but isn't this really an administrative problem

Re: De-duping attachments

2010-09-15 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 05:24:11PM +0100, Gavin McCullagh wrote: Hi, On Wed, 15 Sep 2010, Nik Conwell wrote: Isn't the easy hack for dedup just looking at the above md5 files and then doing appropriate hard links? This could be done by a nightly trawl of the spool space. A bigger

De-duping attachments

2010-09-14 Thread Shuvam Misra
How difficult or easy would it be to modify Cyrus to strip all attachments from emails and store them separately in files? In the message file, replace the attachment with a special tag which will point to the attachment file. Whenever the message is fetched for any reason, the original
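
To make the proposal concrete, here is a rough sketch of the "find the attachments and store each one once" half of the idea, using Python's standard `email` package rather than Cyrus code. Everything named here is an assumption for illustration: `extract_attachments`, the in-memory `store` dict standing in for a directory of attachment files, the `min_size` cut-off, and the choice of SHA-256 as the key. It does not rewrite the message file or insert a pointer tag; it only shows which references such a tag would need to carry.

```python
import hashlib
from email import message_from_bytes


def extract_attachments(raw_message, store, min_size=0):
    """Store each attachment's decoded bytes once, keyed by content hash.

    raw_message: the full RFC 822 message as bytes.
    store: a dict-like mapping digest -> bytes (stand-in for a file store).
    Returns a list of (filename, digest) references found in the message.
    """
    msg = message_from_bytes(raw_message)
    refs = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_disposition() != "attachment":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None or len(payload) < min_size:
            continue
        digest = hashlib.sha256(payload).hexdigest()
        store.setdefault(digest, payload)  # keep only the first copy
        refs.append((part.get_filename(), digest))
    return refs
```

Two messages with different bodies but the same attachment then yield the same digest and a single stored copy.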

Re: De-duping attachments

2010-09-14 Thread Rob Mueller
How difficult or easy would it be to modify Cyrus to strip all attachments from emails and store them separately in files? In the message file, replace the attachment with a special tag which will point to the attachment file. Whenever the message is fetched for any reason, the original

Re: De-duping attachments

2010-09-14 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 12:13:03PM +1000, Rob Mueller wrote: How difficult or easy would it be to modify Cyrus to strip all attachments from emails and store them separately in files? In the message file, replace the attachment with a special tag which will point to the attachment file.

Re: De-duping attachments

2010-09-14 Thread Shuvam Misra
Dear Rob, I had reservations about some of these things too. :( In particular, I was wondering about having to remember and recreate the exact transfer-encoding. If both of us forward the same attachment in two emails, and one encodes in quoted-printable, the other in base64, Cyrus had better be
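
The transfer-encoding concern can be illustrated with a small sketch: the same attachment encoded as base64 and as quoted-printable differs on the wire, so a de-duplicator would have to key on the decoded bytes and remember each message's original encoding separately. The helper names below are hypothetical, and SHA-256 is just one possible choice of key.

```python
import base64
import hashlib
import quopri


def decode_part(body, cte):
    """Undo a Content-Transfer-Encoding; unknown encodings pass through."""
    cte = cte.strip().lower()
    if cte == "base64":
        return base64.b64decode(body)
    if cte == "quoted-printable":
        return quopri.decodestring(body)
    return body  # 7bit, 8bit, binary


def content_key(body, cte):
    """Hash the decoded bytes so the key ignores the transfer encoding."""
    return hashlib.sha256(decode_part(body, cte)).hexdigest()


payload = b"Quarterly report \xe2\x82\xac 42\n" * 50
as_base64 = base64.encodebytes(payload)
as_qp = quopri.encodestring(payload)
```

Here `as_base64` and `as_qp` are different byte strings, yet `content_key` gives the same value for both. Reproducing the original message byte for byte would still require storing the encoding and its exact line wrapping alongside the reference, which is the difficulty raised above.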

Re: De-duping attachments

2010-09-14 Thread Shuvam Misra
Dear Bron, http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413 2TB - US $109. Don't want to nit-pick here, but the effective price we pay is about ten times this. To set up a mail server with a few TB of disk space, we usually land up deploying a separate chassis with RAID

Re: De-duping attachments

2010-09-14 Thread Rob Mueller
A 500-user company can easily acquire an email archive of 2-5TB. I don't care how much the IO load of that archive server increases, but I'd like to reduce disk space utilisation. If the customer can stick to 2TB of It would be interesting to measure the amount of duplication that is going