On Wed, Feb 17, 2010 at 18:44, Boomah <nickda...@gmail.com> wrote:
> Logically I have a "Contract" that has a bit of meta data (e.g. id) and
> maybe a pdf file associated with it.
>
> I set up a TransientRepository and added a node for each "Contract" with a
> unique path. I then set a property on this node called id with the
> associated string value of the id:
>
> node.setProperty("id", "123")
>
> I have a lot of these (about 200000) and when I do a search on the id using:
>
> "SELECT * FROM [nt:unstructured] WHERE id = '123'
>
> it seems to take longer than I would expect.

How long does the query take? How long does iterating over the result
nodes and working with them take? The latter could be slow if you have a
flat hierarchy, for which Jackrabbit isn't optimized. A typical approach
is to introduce intermediate levels such as date folders, e.g. 2010/02/03.
See also https://issues.apache.org/jira/browse/JCR-642

> 1) So my first question is, is there an index on my id property by default?
> If not, how do I add an index to it?

All properties except binaries are indexed by default.

> If the "Contract" has a pdf file associated with it, at the moment I'm just
> adding a BinaryValue to the same node:
>
> node.setProperty("pdfFile", new BinaryValue(pdfInputStream))
>
> 2) My next question is, does the pdf file get indexed such that I can search
> for text inside it? If not how can I add it in such a way that it does? Once
> it has been, what is the SQL2 to query for the string "test"?

Binary properties are indexed if they are part of an nt:file, i.e. the
jcr:content/jcr:data property. (When storing files in the repository, you
should always use the standard nt:file node type anyway; it pays off for
integrations.) A range of built-in text extractors (using Apache Tika)
will first try to extract the text, which is then full-text indexed. PDF
is supported, using PDFBox.

Regards,
Alex

--
Alexander Klimetschek
alexander.klimetsc...@day.com
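
P.S. To make the date-folder suggestion above concrete, here is a minimal
sketch of how a per-contract path could be derived. The class name
ContractPaths, the method pathFor and the "contracts" root folder are all
made up for illustration; only the yyyy/MM/dd layout comes from the example
above. It uses plain Java and does not touch the JCR API.

```java
import java.time.LocalDate;
import java.util.Locale;

// Hypothetical helper, not part of Jackrabbit or the JCR API.
public final class ContractPaths {

    private ContractPaths() {
    }

    /**
     * Builds a repository-relative path such as contracts/2010/02/17/123,
     * so that contracts are spread over year/month/day folders instead of
     * all sitting under one parent node.
     */
    public static String pathFor(LocalDate date, String contractId) {
        if (contractId == null || contractId.isEmpty() || contractId.indexOf('/') >= 0) {
            throw new IllegalArgumentException("invalid contract id: " + contractId);
        }
        // Zero-padded month and day keep the folder names sortable.
        return String.format(Locale.ROOT, "contracts/%04d/%02d/%02d/%s",
                date.getYear(), date.getMonthValue(), date.getDayOfMonth(), contractId);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(LocalDate.of(2010, 2, 17), "123"));
    }
}
```

Which date to use (creation date, signing date, ...) is up to the
application. The intermediate folder nodes have to exist before the
contract node is added below them.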