Sergio wrote:
1) As our database will be holding most of the data, I thought about the
following schema: storing the documents inside BLOBs in the database (in
case we need to access them using some other criteria) AND in Jackrabbit's
repository. While storing those documents using Jackrabbit, I plan to keep
the RDBMS' pointers (probably the document's record primary key) using
properties. The question is: does this make sense? Is it a common practice?
And if not, what is the standard approach?
Well, the recommended approach is to replace your RDBMS with Jackrabbit.
2) Do I need to define node types for representing my documents? If not, is
there some standard type I can use?
For files and folders there are nt:file and nt:folder. See:
http://wiki.apache.org/jackrabbit/NodeTypeRegistry and of course the JSR 170
specification.
3) I have read that Jackrabbit is able to read inside some document types,
how do you accomplish that? Using TextExtractors?
Correct. See: http://jackrabbit.apache.org/jackrabbit-text-extractors.html
How? Could you point me
to some examples? I failed to find any. Does it depend on the way I store
those documents? If so, how do you do it?
The text extractors only work with nt:resource nodes. This means your content
structure would look like this:
+ my.pdf (nt:file)
  - jcr:created=20080101 (DATE)
  + jcr:content (nt:resource)
    - jcr:mimeType=application/pdf (STRING)
    - jcr:lastModified=20080101 (DATE)
    - jcr:data=<pdf-binary> (BINARY)
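
As a rough sketch, the structure above could be created through the JCR API
along these lines. The class name PdfImporter and the method addPdf are
made up for illustration, and the code assumes the javax.jcr API jar is on
the classpath and that you already have a parent node from an open session:

```java
import java.io.InputStream;
import java.util.Calendar;
import javax.jcr.Node;
import javax.jcr.RepositoryException;

// Hypothetical helper, not part of Jackrabbit.
class PdfImporter {

    // Adds <name> as an nt:file below parent, with the binary stored in
    // an nt:resource child called jcr:content.
    static Node addPdf(Node parent, String name, InputStream pdf)
            throws RepositoryException {
        Node file = parent.addNode(name, "nt:file");
        Node content = file.addNode("jcr:content", "nt:resource");
        content.setProperty("jcr:mimeType", "application/pdf");
        content.setProperty("jcr:lastModified", Calendar.getInstance());
        content.setProperty("jcr:data", pdf);
        return file;
    }
}
```

Nothing is persisted until you call save() on the session (or on the parent
node) afterwards. jcr:created is not set here because the repository fills
it in itself.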
regards
marcel