On Wed, Feb 17, 2010 at 18:44, Boomah <nickda...@gmail.com> wrote:
> Logically I have a "Contract" that has a bit of meta data (e.g. id) and
> maybe a pdf file associated with it.
>
> I set up a TransientRepository and added a node for each "Contract" with a
> unique path. I then set a property on this node called id with the
> associated string value of the id:
>
> node.setProperty("id", "123")
>
> I have a lot of these (about 200000) and when I do a search on the id using:
>
> SELECT * FROM [nt:unstructured] WHERE id = '123'
>
> it seems to take longer than I would expect.

How long does the query itself take, and how long does iterating over
the result nodes and working with them take? The latter can be slow if
you have a flat hierarchy (many child nodes directly under one parent),
for which Jackrabbit isn't optimized. A typical approach is to add
intermediate levels such as date folders, e.g. 2010/02/03. See also
https://issues.apache.org/jira/browse/JCR-642
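
As a rough sketch of the date-folder idea, the helper below derives a
node path from a date and the contract id using only the JDK. The class
name ContractPaths, the "contracts" root folder and the choice of which
date to use are assumptions for illustration, not something Jackrabbit
prescribes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: spreads contract nodes over year/month/day folders
// so that no single parent node ends up with a huge number of children.
final class ContractPaths {
    // yyyy/MM/dd yields one folder level each for year, month and day.
    private static final DateTimeFormatter DATE_FOLDERS =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");

    /** Builds a relative path such as contracts/2010/02/03/123. */
    static String pathFor(LocalDate date, String contractId) {
        return "contracts/" + date.format(DATE_FOLDERS) + "/" + contractId;
    }

    public static void main(String[] args) {
        // Prints contracts/2010/02/03/123
        System.out.println(pathFor(LocalDate.of(2010, 2, 3), "123"));
    }
}
```

The intermediate folders still have to be created in the repository
before the contract node is added under them; this only computes the
path.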

> 1) So my first question is, is there an index on my id property by default?
> If not, how do I add an index to it?

All properties except binaries are indexed by default.

> If the "Contract" has a pdf file associated with it, at the moment I'm just
> adding a BinaryValue to the same node:
>
> node.setProperty("pdfFile", new BinaryValue(pdfInputStream))
>
> 2) My next question is, does the pdf file get indexed such that I can search
> for text inside it? If not how can I add it in such a way that it does? Once
> it has been, what is the SQL2 to query for the string "test"?

Binary properties are full-text indexed if they are part of an nt:file,
i.e. the jcr:content/jcr:data property. (When storing files in the
repository, you should always use the standard nt:file node type anyway;
it pays off for integrations.) Built-in text extractors (using Apache
Tika) first try to extract the text, which is then full-text indexed.
PDF is supported, using PDFBox.
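
For the "what is the SQL2" part, a minimal sketch of a JCR-SQL2
full-text statement, built with plain JDK code. It assumes the extracted
text is indexed on the jcr:content node (nt:resource) of each nt:file;
the class name FullTextQuery and the selector name r are made up for
illustration.

```java
// Hypothetical helper that builds a JCR-SQL2 full-text query string.
final class FullTextQuery {
    /** Returns a statement that full-text searches nt:resource nodes. */
    static String forTerm(String term) {
        // A single quote inside a SQL2 string literal is escaped by doubling it.
        String escaped = term.replace("'", "''");
        return "SELECT * FROM [nt:resource] AS r WHERE CONTAINS(r.*, '"
                + escaped + "')";
    }

    public static void main(String[] args) {
        // Prints the statement for the search term "test"
        System.out.println(forTerm("test"));
    }
}
```

The resulting string would be passed to QueryManager.createQuery(statement,
Query.JCR_SQL2). Under the assumption above, the hits are the jcr:content
nodes, so the nt:file is the parent of each result node.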

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetsc...@day.com
