[ldap] Re: best practice to attach binary documents to entries?
Ignoring the issue of DB efficiency for a moment (it's a moot point, I think; any directory server today uses a DB engine with exactly the same storage and data management technologies as any other DBMS), it's important to note that the LDAP protocol (and current implementations of LDAP) is not suited to filesystem-style operations. In particular, remote filesystem protocols generally break up large data access requests into blocks (page-sized or some other VM-efficient size). LDAP doesn't work that way; it moves data around as a single monolithic piece. This has efficiency implications for the LDAP server, the network, and the LDAP client.

Accessing a large document that resides inside an LDAP server will temporarily consume a large amount of memory (slightly in excess of the actual document size) while the entry is accessed and transmitted to the client. In contrast, a server designed for file operations can transmit a series of small blocks, thus using very small amounts of memory on the server side and imposing a lower sustained load on the network. Even with an optimized server that streams data efficiently from disk to ASN.1 to the network, the client will have to keep everything in memory at once while it parses the received data.

So yes, while I generally like using LDAP as much as possible, I don't believe this is a wise use of the protocol. Use a pointer to a document server, something that's purpose-built for serving files.

--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
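To make the contrast in the message above concrete, here is a minimal sketch in Python (standard library only) of the two access patterns it describes: reading a document as one monolithic value versus handing it out in fixed-size blocks. The function names and the 64 KiB block size are illustrative assumptions, not anything taken from an LDAP or file-server implementation.

```python
import io
from typing import BinaryIO, Iterator

# Assumed block size for illustration; the thread only says
# "page sized or some other VM-efficient size".
BLOCK_SIZE = 64 * 1024


def read_monolithic(source: BinaryIO) -> bytes:
    """Read the whole document into memory at once.

    This mirrors how an attribute value is handled as a single piece.
    """
    return source.read()


def iter_blocks(source: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield the document one block at a time, as a file server could."""
    while True:
        block = source.read(block_size)
        if not block:
            break
        yield block


def peak_block_bytes(source: BinaryIO, block_size: int = BLOCK_SIZE) -> int:
    """Return the largest amount of data held at once while streaming."""
    return max((len(block) for block in iter_blocks(source, block_size)), default=0)


if __name__ == "__main__":
    document = bytes(1_000_000)  # stand-in for a 1 MB document
    print("monolithic bytes held:", len(read_monolithic(io.BytesIO(document))))
    print("block-wise bytes held:", peak_block_bytes(io.BytesIO(document)))
```

With the block-wise reader, the amount of data held at any moment is bounded by the block size rather than by the document size, which is the property the message attributes to purpose-built file servers.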
[ldap] Re: best practice to attach binary documents to entries?
You won't need a consistency-checking tool if you keep the documents on the file server under revision control software such as CVS, Bazaar, Darcs, etc.

Regards,
David Damon
Senior Systems Integration Analyst

On 01/14/2009 10:07 AM, "Mark H. Wood" wrote:

> Option 4: store the documents on some file server -- it doesn't have to be the directory's host -- under any convenient naming scheme. Define an attribute to hold the list of file paths associated with a given object.
>
> Mapping files by path allows you to change your mind about how you want to arrange them in the filesystem without having to rearrange what has already been stored, and decouples the file store from dependency on the structure of your directory.
>
> You still need to build the consistency-checking tool mentioned by another poster.
>
> --
> Mark H. Wood, Lead System Programmer [email protected]
> Friends don't let friends publish revisable-form documents.
[ldap] Re: best practice to attach binary documents to entries?
Option 4: store the documents on some file server -- it doesn't have to be the directory's host -- under any convenient naming scheme. Define an attribute to hold the list of file paths associated with a given object.

Mapping files by path allows you to change your mind about how you want to arrange them in the filesystem without having to rearrange what has already been stored, and decouples the file store from dependency on the structure of your directory.

You still need to build the consistency-checking tool mentioned by another poster.

--
Mark H. Wood, Lead System Programmer [email protected]
Friends don't let friends publish revisable-form documents.
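As a rough illustration of the consistency check that Option 4 still requires, the sketch below compares the file paths recorded in directory entries with the files actually present on the file server. It is a simplified sketch under stated assumptions: the directory contents are passed in as a plain dictionary (DN to list of paths) rather than fetched over LDAP, and the example DNs and paths are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def check_consistency(
    directory_refs: Dict[str, List[str]],
    stored_files: Iterable[str],
) -> Tuple[Dict[str, List[str]], Set[str]]:
    """Compare paths referenced by entries with files on the file server.

    Returns (dangling, orphans):
      dangling: DN -> sorted list of referenced paths that do not exist
      orphans:  stored files that no entry references
    """
    stored = set(stored_files)
    referenced: Set[str] = set()
    dangling: Dict[str, List[str]] = {}

    for dn, paths in directory_refs.items():
        referenced.update(paths)
        missing = sorted(path for path in paths if path not in stored)
        if missing:
            dangling[dn] = missing

    orphans = stored - referenced
    return dangling, orphans


if __name__ == "__main__":
    # Hypothetical data: in practice the first argument would come from an
    # LDAP search and the second from walking the file server.
    refs = {
        "uid=jdoe,ou=people,dc=example,dc=com": ["/docs/a1.pdf", "/docs/a2.pdf"],
        "uid=asmith,ou=people,dc=example,dc=com": ["/docs/b1.tiff"],
    }
    files = ["/docs/a1.pdf", "/docs/b1.tiff", "/docs/old.pdf"]
    print(check_consistency(refs, files))
```

Because the entries hold paths rather than relying on a directory layout derived from the DN, the check only needs set comparisons and does not depend on how the files are arranged.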
[ldap] Re: best practice to attach binary documents to entries?
We have agreement, then. I prefer not to store large binary stuff in Directory Server, but if the binaries never change, it may not be such a big deal. Of course, I would choose MySQL for the database :)

On Jan 14, 2009, at 8:53 AM, Adam Tauno Williams wrote:

>> I think most people who have looked into this would agree with Terry. I think that if you choose option 1, you will find that your directory software is designed to return relatively small amounts of data and is just not efficient at moving large blobs of data like the documents that you are thinking of storing. You will want to do proof-of-concept performance testing before committing to this approach to make sure the delivered system would have adequate response time under load.
>
> We store some BLOBs in LDAP (such as a user's desktop wallpaper). If they are of "reasonable" size it works very well. When I tested (which was some time and versions ago) it was loading/updating the BLOBs that hurt performance and ballooned the logs. I think it works well for items that are read-mostly; I wouldn't put BLOBs in the DIT that are frequently changed.
>
>> In option 2 it is true that you will have to maintain two repositories, and it will be difficult for you to keep them consistent. Many kinds of system bugs and failures will cause an update to be completed on one repository and not the other. If you choose this approach, be sure to develop a utility which will check consistency between the two repositories.
>
> Agree. I wonder why you'd want to build a document repository on LDAP at all? I'm a fan of LDAP but it seems, IMO, ill suited for that purpose.
>
>> Option 3 attracted a lot of interest in the 90's when database companies like Informix and Oracle were positioning their DBMS products as the place to store all of your data, in whatever form. I believe that there were a number of success stories in that area. There seems to be less interest now. I gather it is just very difficult to create one DBMS product that can efficiently support many concurrent updates (as a DBMS must), many concurrent queries (as a DBMS must) and also serve big blobs of read-only data (like documents).
>
> As an Informix shop I think the loss of interest is just because it is now commonplace and barely worth mentioning. Again, if the BLOBs are read-mostly, performance is very good and a modern RDBMS can feed them to a client very efficiently. However, you do have to take BLOBs into account in your configuration; Informix (and other) RDBMSs allow [and recommend] you to create separate partitions (or whatever specific term the RDBMS in question uses) where the BLOBs are stored apart from transactional data.
>
>> The first two capabilities add a lot of system overhead that works against the third capability. On the plus side, a DBMS will help you a lot in keeping its repository consistent with the directory repository. It may be expensive though. I am writing of enterprise-level DBMSs like Oracle, DB2, etc. that
>
> I'd recommend DB2, which has a free version with unlimited connections, for doing this kind of work if you need a free (as in beer) RDBMS.
>
> --
> Adam Tauno Williams, Network & Systems Administrator
> Consultant - http://www.whitemiceconsulting.com
> Developer - http://www.opengroupware.org
[ldap] Re: best practice to attach binary documents to entries?
> I think most people who have looked into this would agree with Terry. I think that if you choose option 1, you will find that your directory software is designed to return relatively small amounts of data and is just not efficient at moving large blobs of data like the documents that you are thinking of storing. You will want to do proof-of-concept performance testing before committing to this approach to make sure the delivered system would have adequate response time under load.

We store some BLOBs in LDAP (such as a user's desktop wallpaper). If they are of "reasonable" size it works very well. When I tested (which was some time and versions ago) it was loading/updating the BLOBs that hurt performance and ballooned the logs. I think it works well for items that are read-mostly; I wouldn't put BLOBs in the DIT that are frequently changed.

> In option 2 it is true that you will have to maintain two repositories, and it will be difficult for you to keep them consistent. Many kinds of system bugs and failures will cause an update to be completed on one repository and not the other. If you choose this approach, be sure to develop a utility which will check consistency between the two repositories.

Agree. I wonder why you'd want to build a document repository on LDAP at all? I'm a fan of LDAP but it seems, IMO, ill suited for that purpose.

> Option 3 attracted a lot of interest in the 90's when database companies like Informix and Oracle were positioning their DBMS products as the place to store all of your data, in whatever form. I believe that there were a number of success stories in that area. There seems to be less interest now. I gather it is just very difficult to create one DBMS product that can efficiently support many concurrent updates (as a DBMS must), many concurrent queries (as a DBMS must) and also serve big blobs of read-only data (like documents).

As an Informix shop I think the loss of interest is just because it is now commonplace and barely worth mentioning. Again, if the BLOBs are read-mostly, performance is very good and a modern RDBMS can feed them to a client very efficiently. However, you do have to take BLOBs into account in your configuration; Informix (and other) RDBMSs allow [and recommend] you to create separate partitions (or whatever specific term the RDBMS in question uses) where the BLOBs are stored apart from transactional data.

> The first two capabilities add a lot of system overhead that works against the third capability. On the plus side, a DBMS will help you a lot in keeping its repository consistent with the directory repository. It may be expensive though. I am writing of enterprise-level DBMSs like Oracle, DB2, etc. that

I'd recommend DB2, which has a free version with unlimited connections, for doing this kind of work if you need a free (as in beer) RDBMS.

--
Adam Tauno Williams, Network & Systems Administrator
Consultant - http://www.whitemiceconsulting.com
Developer - http://www.opengroupware.org
[ldap] Re: best practice to attach binary documents to entries?
If your server correctly handles attribute options, you can store both pieces of information (the file type and the content) in a single attribute:

  cv;pdf:
  cv;tiff:

It's up to the client to manage the option, though, as it won't mean anything to the server. (Don't use the ;binary option; it has a specific semantic on the server.)

Otherwise, you can define subclassed attributes to handle the different kinds of files.

--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org
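Since the message above leaves the handling of the option to the client, here is a minimal client-side sketch of how that might look. It assumes the entry has already been retrieved as a dictionary mapping attribute descriptions (such as `cv;pdf`) to lists of values, and that each document attribute carries exactly one option naming the file type; the function names and the `cv` attribute are illustrative, not part of any LDAP library.

```python
from typing import Dict, List, Tuple


def split_attribute_description(description: str) -> Tuple[str, List[str]]:
    """Split 'cv;pdf' into ('cv', ['pdf']).

    Attribute names and options are compared case-insensitively,
    so both are lowercased here.
    """
    name, *options = description.split(";")
    return name.lower(), [option.lower() for option in options]


def documents_by_type(
    entry: Dict[str, List[bytes]],
    attribute: str = "cv",
) -> Dict[str, List[bytes]]:
    """Group the values of one attribute by the file type in its option.

    Assumption: exactly one option per description, used as the file type.
    Descriptions with no option or several options are skipped.
    """
    grouped: Dict[str, List[bytes]] = {}
    for description, values in entry.items():
        name, options = split_attribute_description(description)
        if name != attribute.lower() or len(options) != 1:
            continue
        grouped.setdefault(options[0], []).extend(values)
    return grouped


if __name__ == "__main__":
    # Hypothetical entry contents; the byte strings are placeholders.
    entry = {
        "cn": [b"Jane Doe"],
        "cv;pdf": [b"pdf-bytes"],
        "CV;TIFF": [b"tiff-bytes"],
    }
    print(documents_by_type(entry))
```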
[ldap] Re: best practice to attach binary documents to entries?
Zhang Weiwu,

I think most people who have looked into this would agree with Terry. I think that if you choose option 1, you will find that your directory software is designed to return relatively small amounts of data and is just not efficient at moving large blobs of data like the documents that you are thinking of storing. You will want to do proof-of-concept performance testing before committing to this approach to make sure the delivered system would have adequate response time under load.

In option 2 it is true that you will have to maintain two repositories, and it will be difficult for you to keep them consistent. Many kinds of system bugs and failures will cause an update to be completed on one repository and not the other. If you choose this approach, be sure to develop a utility which will check consistency between the two repositories. Such a utility will tend to run slowly, just because it has to search both repositories exhaustively. Make sure the utility completes fast enough that you can run it frequently, or after a known system failure, and so detect your consistency problems quickly when they are small, manageable, and not yet noticed by your customers.

Keep in mind that if the utility runs while your repositories are in use and being updated, false inconsistencies may appear due to updates that have completed in one repository and not the other; the longer the utility takes to complete, the more "false positives" you will have. If "false positives" are not sorted out quickly, people will lose confidence in the consistency checker output and the false positives will mask real problems.

On the plus side, it is probably easiest to design a high-performance product using this option because your documents will be served by software that is designed specifically for moving big chunks of data (like files containing documents) around, and your directory information will be served by software specifically designed for efficient searches.

Option 3 attracted a lot of interest in the 90's when database companies like Informix and Oracle were positioning their DBMS products as the place to store all of your data, in whatever form. I believe that there were a number of success stories in that area. There seems to be less interest now. I gather it is just very difficult to create one DBMS product that can efficiently support many concurrent updates (as a DBMS must), many concurrent queries (as a DBMS must) and also serve big blobs of read-only data (like documents). The first two capabilities add a lot of system overhead that works against the third capability. On the plus side, a DBMS will help you a lot in keeping its repository consistent with the directory repository. It may be expensive though.

I am writing of enterprise-level DBMSs like Oracle, DB2, etc. that developed in an update-intensive transaction processing environment. There are other DBMSs like MySQL that grew out of a read-mostly industry environment. I don't know much about them and what I wrote above may not be true of them.

Good luck,
Mark

Terry Gardner wrote:

> Best to point to a document server, not store in directory server.
>
> On Jan 13, 2009, at 7:49 PM, Zhang Weiwu wrote:
>
>> Hello. In one of the directories we manage, it is desirable to attach documents to the entries, e.g. attach multiple CVs to an employee entry. What would be the best practice for such a requirement?
>>
>> 1. Directly attach it to the entry using a binary attribute. The downside: the file name is lost (because a binary attribute holds the file content as its value but not the filename). If the file type is limited to types that contain proper metadata (e.g. TIFF, PDF) then we can use the document title inside the document as the filename.
>>
>> 2. Maintain a directory on the server file system with the same name as the DN of the entry. The LDAP client (which is a web application) should try to get the files from there through ftp or http. Downside: two repositories of data must be maintained.
>>
>> 3. Set up an SQL database holding these data. Downside: same as above.
>>
>> Currently I am thinking about solution 1, partly because I think limiting document types to TIFF/PDF is helpful for management reasons as well, so this limitation wouldn't hurt me too much. However, this is my first time trying to offer binary documents to users. What do you recommend?
>>
>> Thanks & best regards
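The message above warns that a consistency checker run against live repositories will report "false positives" caused by updates that are still in flight. One possible way to reduce them, which is an assumption of this sketch and not something the poster proposes, is to report only discrepancies that are still present in a second scan taken after a short settling delay. The `scan` callable is hypothetical and stands for whatever routine lists the currently inconsistent items.

```python
import time
from typing import Callable, Set


def confirmed_discrepancies(
    scan: Callable[[], Set[str]],
    settle_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Return only the discrepancies seen in two consecutive scans.

    scan: hypothetical callable returning the identifiers (for example
          file paths or DNs) that currently look inconsistent.
    settle_seconds: assumed delay that gives in-flight updates time to
          complete in both repositories before the second scan.
    """
    first = scan()
    if not first:
        return set()  # nothing suspicious, no second scan needed
    sleep(settle_seconds)
    second = scan()
    # Items that disappeared were most likely updates in progress.
    return first & second


if __name__ == "__main__":
    # Simulated scans: "b" persists, "a" was an update still in flight.
    results = iter([{"a", "b"}, {"b", "c"}])
    print(confirmed_discrepancies(lambda: next(results), settle_seconds=0))
```

The trade-off is a second pass over the suspicious items, which lengthens the run; the delay would have to be tuned to how long updates normally take to reach both repositories.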
[ldap] Re: best practice to attach binary documents to entries?
Best to point to a document server, not store in directory server.

On Jan 13, 2009, at 7:49 PM, Zhang Weiwu wrote:

> Hello. In one of the directories we manage, it is desirable to attach documents to the entries, e.g. attach multiple CVs to an employee entry. What would be the best practice for such a requirement?
>
> 1. Directly attach it to the entry using a binary attribute. The downside: the file name is lost (because a binary attribute holds the file content as its value but not the filename). If the file type is limited to types that contain proper metadata (e.g. TIFF, PDF) then we can use the document title inside the document as the filename.
>
> 2. Maintain a directory on the server file system with the same name as the DN of the entry. The LDAP client (which is a web application) should try to get the files from there through ftp or http. Downside: two repositories of data must be maintained.
>
> 3. Set up an SQL database holding these data. Downside: same as above.
>
> Currently I am thinking about solution 1, partly because I think limiting document types to TIFF/PDF is helpful for management reasons as well, so this limitation wouldn't hurt me too much. However, this is my first time trying to offer binary documents to users. What do you recommend?
>
> Thanks & best regards
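For option 2 in the original question (a server directory named after the entry's DN, fetched over http), a DN contains characters such as "=" and "," that are awkward in URLs. The sketch below shows one way a web client might build the document URL by percent-encoding the DN and the filename as single path segments. The base URL, the DN, and the layout `<base>/<DN>/<filename>` are assumptions made for illustration.

```python
from urllib.parse import quote


def document_url(base_url: str, dn: str, filename: str) -> str:
    """Build the URL of a document stored under a directory named after the DN.

    Both the DN and the filename are percent-encoded with no "safe"
    characters, so each stays a single path segment even if it
    contains "/", "," or "=".
    """
    return "{}/{}/{}".format(
        base_url.rstrip("/"),
        quote(dn, safe=""),
        quote(filename, safe=""),
    )


if __name__ == "__main__":
    # Hypothetical server and entry.
    print(document_url(
        "http://docs.example.com/",
        "uid=jdoe,ou=People,dc=example,dc=com",
        "cv 2009.pdf",
    ))
```

The server side would need to apply the same encoding when creating the directories, otherwise the two naming schemes drift apart.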
