[ldap] Re: best practice to attach binary documents to entries?

2009-01-14 Thread Howard Chu
Ignoring the issue of DB efficiency for a moment (it's a moot point I think; 
any directory server today is using a DB engine with exactly the same 
storage/data management technologies as any other DBMS), it's important to 
note that the LDAP protocol (and current implementations of LDAP) is not 
suited to filesystem-style operations. In particular, remote filesystem 
protocols generally break up large data access requests into blocks (page-sized 
or some other VM-efficient size). LDAP doesn't work that way; it moves 
data around as a single monolithic piece. This has efficiency implications for 
the LDAP server and the network, as well as for the LDAP client. Accessing a large 
document that resides inside an LDAP server will temporarily consume a large 
amount of memory (slightly in excess of the actual document size) while the 
entry is accessed and transmitted to the client. In contrast, a server 
designed for file operations will be able to transmit a series of small 
blocks, thus using very small amounts of memory on the server side, as well as 
imposing a lower sustained load on the network. Even with an optimized server 
that streams data efficiently from disk to ASN.1 to the network, the client 
will have to keep everything in memory at once while it parses the received data.
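
To make the memory asymmetry concrete, here is a toy Python sketch (the document and block sizes are arbitrary, and an in-memory buffer stands in for the wire): a monolithic read must buffer the entire value at once, while block-wise access never holds more than one block.

```python
import io

# Hypothetical 8 MiB "document" standing in for a large binary attribute value.
DOC_SIZE = 8 * 1024 * 1024
document = io.BytesIO(b"\x00" * DOC_SIZE)

# LDAP-style access: the whole attribute value is materialized at once,
# so peak buffering equals the document size.
whole = document.read()
peak_monolithic = len(whole)

# File-protocol-style access: the same data moves as a series of small
# blocks, so peak buffering is a single block regardless of document size.
document.seek(0)
BLOCK = 64 * 1024
peak_blocked = 0
while True:
    chunk = document.read(BLOCK)
    if not chunk:
        break
    peak_blocked = max(peak_blocked, len(chunk))

print(peak_monolithic)  # 8388608
print(peak_blocked)     # 65536
```

This is an illustration of buffering behavior only, not a benchmark of any particular server.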


So yes, while I generally like using LDAP as much as possible, I don't believe 
this is a wise use of the protocol. Use a pointer to a document server, 
something that's purpose-built for serving files.

--
   -- Howard Chu
   CTO, Symas Corp.   http://www.symas.com
   Director, Highland Sun http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/




[ldap] Re: best practice to attach binary documents to entries?

2009-01-14 Thread ddamon
You won't need a consistency-checking tool if you use revision control 
software such as CVS, Bazaar, or Darcs on your documents on the file server.

Regards,
David Damon
Senior Systems Integration Analyst






[ldap] Re: best practice to attach binary documents to entries?

2009-01-14 Thread Mark H. Wood
Option 4:  store the documents on some file server -- it doesn't have
to be the directory's host -- under any convenient naming scheme.
Define an attribute to hold the list of file paths associated with a
given object.

Mapping files by path allows you to change your mind about how you
want to arrange them in the filesystem without having to rearrange
what has already been stored, and decouples the file store from
dependency on the structure of your directory.

You still need to build the consistency-checking tool mentioned by
another poster.
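
As a sketch of this option (the `documentPath` attribute name is hypothetical, not a standard schema element), a client might resolve the path-valued attribute against the file-store root like this:

```python
import os
import tempfile

# Hypothetical attribute name; the actual schema definition is up to you.
DOC_ATTR = "documentPath"

def resolve_documents(entry, store_root):
    """Map the path-valued attribute to real files under the file store.

    Returns (found, missing) lists of absolute paths, so a dangling
    reference in the directory is immediately visible to the client.
    """
    found, missing = [], []
    for rel in entry.get(DOC_ATTR, []):
        path = os.path.join(store_root, rel)
        (found if os.path.isfile(path) else missing).append(path)
    return found, missing

# Demo with a throwaway file store and a dict standing in for an entry.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "cv"))
    open(os.path.join(root, "cv", "jsmith-2009.pdf"), "wb").close()

    entry = {
        "cn": ["John Smith"],
        DOC_ATTR: ["cv/jsmith-2009.pdf", "cv/jsmith-2008.pdf"],
    }
    found, missing = resolve_documents(entry, root)
    print(len(found), len(missing))  # 1 1
```

Because the stored values are plain relative paths, rearranging the file store only requires rewriting attribute values, not renaming to match the DIT.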

-- 
Mark H. Wood, Lead System Programmer   [email protected]
Friends don't let friends publish revisable-form documents.




[ldap] Re: best practice to attach binary documents to entries?

2009-01-14 Thread Terry Gardner
We have agreement, then. I prefer not to store large binary content in a 
directory server, but if the binaries never change, it may not be such a 
big deal. Of course, I would choose MySQL for the database :)




[ldap] Re: best practice to attach binary documents to entries?

2009-01-14 Thread Adam Tauno Williams
> I think most people who have looked into this would agree with Terry.  I 
> think that if you choose option 1, you will find that your directory 
> software is designed to return relatively small amounts of data and is 
> just not efficient at moving large blobs of data like the documents that 
> you are thinking of storing.  You will want to do proof-of-concept 
> performance testing before committing to this approach to make sure the 
> delivered system would have adequate response time under load.

We store some BLOBs in LDAP (such as a user's desktop wallpaper).  If
they are of "reasonable" size it works very well.  When I tested (which
was some time and versions ago) it was loading/updating the BLOBs that
hurt performance and ballooned the logs.  I think it works well for
items that are read-mostly; I wouldn't put BLOBs in the DIT that are
frequently changed.

> In option 2 it is true that you will have to maintain two repositories, 
> and it will be difficult for you to keep them consistent.  Many kinds of 
> system bugs and failures will cause an update to be completed on one 
> repository and not the other.  If you choose this approach, be sure to 
> develop a utility which will check consistency between the two 
> repositories. 

Agree.  I wonder why you'd want to build a document repository on LDAP
at all?  I'm a fan of LDAP but it seems, IMO, ill suited for that
purpose.

> Option 3 attracted a lot of interest in the 90's when database companies 
> like Informix and Oracle were positioning their DBMS products as the 
> place to store all of your data, in whatever form.  I believe that there 
> were a number of success stories in that area.  There seems to be less 
> interest now.  I gather it is just very difficult to create one DBMS 
> product that can efficiently support many concurrent updates (as a DBMS 
> must), many concurrent queries (as a DBMS must) and also serve big blobs 
> of read-only data (like documents).

As an Informix shop, I think the loss of interest is just because it is
now commonplace and barely worth mentioning.  Again, if the BLOBs are
read-mostly, performance is very good and a modern RDBMS can feed them to
a client very efficiently.  However, you do have to take BLOBs into
account in your configuration; Informix (and other) RDBMSs allow [and
recommend] that you create separate partitions (or whatever specific term
the RDBMS in question uses) where the BLOBs are stored apart from
transactional data.

>   The first two capabilities add a 
> lot of system overhead that works against the third capability.  On the 
> plus side, a DBMS will help you a lot in keeping its repository 
> consistent with the directory repository.  It may be expensive though.  
> I am writing of enterprise-level DBMSs like Oracle, DB2, etc. that 

I'd recommend DB2, which has a connection-unlimited free version, for
doing this kind of work if you need a free (as in beer) RDBMS.
-- 
Adam Tauno Williams, Network & Systems Administrator
Consultant - http://www.whitemiceconsulting.com
Developer - http://www.opengroupware.org




[ldap] Re: best practice to attach binary documents to entries?

2009-01-13 Thread Emmanuel Lecharny
If your server correctly handles attribute options, you can store both 
pieces of information (the file type and the content) in a single 
attribute type:


cv;pdf: <PDF content>
cv;tiff: <TIFF content>

It's up to the client to manage the option, though, as it won't make any 
sense to the server (don't use the ;binary option; it has a specific 
semantic on the server).


Otherwise, you can define subclassed attributes to handle the different 
kinds of files.
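
As a rough illustration of how a client might manage such options (the splitting rule follows the usual `type;option` attribute-description syntax; the entry contents are placeholder bytes, not real documents):

```python
def split_attribute_description(attrdesc):
    """Split an LDAP attribute description into (type, options).

    Options follow the base attribute type, separated by ';'
    (e.g. "cv;pdf" -> ("cv", ["pdf"])); both are case-insensitive.
    """
    base, *options = attrdesc.split(";")
    return base.lower(), [o.lower() for o in options]

# A client can group option-tagged values under one logical attribute.
entry = {
    "cv;pdf": [b"%PDF-1.4 ..."],   # placeholder bytes, not a real PDF
    "cv;tiff": [b"II*\x00 ..."],   # placeholder bytes, not a real TIFF
}
by_type = {}
for attrdesc, values in entry.items():
    base, options = split_attribute_description(attrdesc)
    for opt in options or [None]:
        by_type.setdefault(base, {})[opt] = values

print(sorted(by_type["cv"]))  # ['pdf', 'tiff']
```

The server sees `cv;pdf` and `cv;tiff` as distinct attribute descriptions; only the client gives the option this file-type meaning.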


--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org





[ldap] Re: best practice to attach binary documents to entries?

2009-01-13 Thread Mark P. Anderson

Zhang Weiwu,

I think most people who have looked into this would agree with Terry.  I 
think that if you choose option 1, you will find that your directory 
software is designed to return relatively small amounts of data and is 
just not efficient at moving large blobs of data like the documents that 
you are thinking of storing.  You will want to do proof-of-concept 
performance testing before committing to this approach to make sure the 
delivered system would have adequate response time under load.


In option 2 it is true that you will have to maintain two repositories, 
and it will be difficult for you to keep them consistent.  Many kinds of 
system bugs and failures will cause an update to be completed on one 
repository and not the other.  If you choose this approach, be sure to 
develop a utility which will check consistency between the two 
repositories.  Such a utility will tend to run slowly, just because it 
has to search both repositories exhaustively.  Make sure the utility 
completes fast enough that you can run it frequently, or after a known 
system failure, and so detect your consistency problems quickly when 
they are small, manageable, and not yet noticed by your customers.  Keep 
in mind that if the utility runs while your repositories are in use and 
being updated, false inconsistencies may appear due to updates that have 
completed in one repository and not the other; the longer the utility 
takes to complete, the more "false positives" you will have.  If "false 
positives" are not sorted out quickly, people will lose confidence in 
the consistency checker output and the false positives will mask real 
problems.  On the plus side, it is probably easiest to design a 
high-performance product using this option because your documents will 
be served by software that is designed specifically for moving big 
chunks of data (like files containing documents) around, and your 
directory information will be served by software specifically designed 
for efficient searches.
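
A minimal sketch of such a checker, with a second-pass hook to weed out false positives from in-flight updates (the repository scans are stubbed with static lists; real scans would walk the directory and the file store):

```python
def check_consistency(list_directory_refs, list_store_files, recheck=None):
    """Report documents referenced in the directory but absent from the
    file store, and orphan files with no directory reference.

    `recheck`, if given, is called on the first-pass discrepancies and
    returns those still inconsistent; re-sampling after a pause filters
    out "false positives" caused by updates still in flight.
    """
    refs = set(list_directory_refs())
    files = set(list_store_files())
    missing = refs - files    # referenced but not stored
    orphans = files - refs    # stored but not referenced
    if recheck is not None:
        missing, orphans = recheck(missing, orphans)
    return sorted(missing), sorted(orphans)

# Demo with static snapshots standing in for real repository scans.
directory_refs = ["cv/a.pdf", "cv/b.pdf", "cv/inflight.pdf"]
store_files = ["cv/a.pdf", "cv/b.pdf", "cv/orphan.pdf"]

def second_pass(missing, orphans):
    # Pretend the in-flight upload completed between the two passes.
    return missing - {"cv/inflight.pdf"}, orphans

missing, orphans = check_consistency(
    lambda: directory_refs, lambda: store_files, recheck=second_pass)
print(missing, orphans)  # [] ['cv/orphan.pdf']
```

The faster each pass completes, the smaller the window in which in-flight updates can masquerade as inconsistencies.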


Option 3 attracted a lot of interest in the 90's when database companies 
like Informix and Oracle were positioning their DBMS products as the 
place to store all of your data, in whatever form.  I believe that there 
were a number of success stories in that area.  There seems to be less 
interest now.  I gather it is just very difficult to create one DBMS 
product that can efficiently support many concurrent updates (as a DBMS 
must), many concurrent queries (as a DBMS must) and also serve big blobs 
of read-only data (like documents).  The first two capabilities add a 
lot of system overhead that works against the third capability.  On the 
plus side, a DBMS will help you a lot in keeping its repository 
consistent with the directory repository.  It may be expensive though.  
I am writing of enterprise-level DBMSs like Oracle, DB2, etc. that 
were developed in an update-intensive transaction-processing environment.
There are other DBMSs, like MySQL, that grew out of a read-mostly 
industry environment.  I don't know much about them, and what I wrote 
above may not be true of them.


Good luck,

Mark

Terry Gardner wrote:

Best to point to a document server, not store in directory server.

On Jan 13, 2009, at 7:49 PM, Zhang Weiwu wrote:


Hello. In one of the directories we are managing, it is desirable to attach
documents to the entries, e.g. attach multiple CVs to an employee entry.

What would be the best practice for such a requirement?

  1. Directly attach it to the entry using a binary attribute. The
     downside: the file name is lost (a binary attribute holds the file
     content as its value but not the filename). If the file type is
     limited to types that contain proper metadata (e.g. TIFF, PDF),
     then we can use the document title inside the document as the
     filename.
  2. Maintain a directory on the server file system with the same name
     as the DN of the entry. The LDAP client (which is a web application)
     should try to get the files from there through FTP or HTTP.
     Downside: maintaining two repositories of data.
  3. Set up an SQL database holding the data. Downside: same as above.

Currently I am leaning toward solution 1, partly because I think limiting
document types to TIFF/PDF is helpful for management reasons as well, so
this limitation wouldn't hurt me too much.

However, this is my first time trying to offer binary documents to users.
What do you recommend?

Thanks & best regards









[ldap] Re: best practice to attach binary documents to entries?

2009-01-13 Thread Terry Gardner

Best to point to a document server, not store in directory server.

On Jan 13, 2009, at 7:49 PM, Zhang Weiwu wrote:

Hello. In one of the directories we are managing, it is desirable to attach
documents to the entries, e.g. attach multiple CVs to an employee entry.

What would be the best practice for such a requirement?

  1. Directly attach it to the entry using a binary attribute. The
     downside: the file name is lost (a binary attribute holds the file
     content as its value but not the filename). If the file type is
     limited to types that contain proper metadata (e.g. TIFF, PDF),
     then we can use the document title inside the document as the
     filename.
  2. Maintain a directory on the server file system with the same name
     as the DN of the entry. The LDAP client (which is a web application)
     should try to get the files from there through FTP or HTTP.
     Downside: maintaining two repositories of data.
  3. Set up an SQL database holding the data. Downside: same as above.

Currently I am leaning toward solution 1, partly because I think limiting
document types to TIFF/PDF is helpful for management reasons as well, so
this limitation wouldn't hurt me too much.

However, this is my first time trying to offer binary documents to users.
What do you recommend?

Thanks & best regards