Re: jackrabbit 2.0 binary search indexing
On Fri, Feb 19, 2010 at 1:13 AM, ChadDavis chadmichaelda...@gmail.com wrote:
> On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek aklim...@day.com wrote:
>> On Thu, Feb 18, 2010 at 18:35, ChadDavis chadmichaelda...@gmail.com wrote:
>>> I'm looking for information on how to enable binary search indexing. I found documentation for pre-2.0 jackrabbit, and reference to the fact that Tika is now used internally for the binary indexing. However, I can't find any documentation of how to enable the binary indexing . . ..
>>
>> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data property. The mimetype for text extraction is taken from the jcr:content/jcr:mimeType property. I don't know if you can enable it for other binary properties.
>
> Just to clarify, you are saying that the binary indexing, as long as I'm using the JCR built-in node types for my binary file storage, e.g.
>
> nt:file
>   -- jcr:content nt:resource
>     -- jcr:data ( binary property with my file ),
>
> occurs automatically? If so, then something's not working for me. Can you recommend some troubleshooting tips? How can I determine whether the binaries are being indexed? Note, I'm doing a full text search and it DOES hit other node properties, etc.

Make sure you have configured all the extractors you need. This is done in the SearchIndex configuration. You can also look this up on the wiki.

Ard
Re: jackrabbit 2.0 binary search indexing
> Make sure you have configured all the extractors you need. This is done in the SearchIndex configuration. You can also look this up on the wiki.

Is this necessary for Jackrabbit 2.0?
Re: jackrabbit 2.0 binary search indexing
On Thu, Feb 18, 2010 at 18:35, ChadDavis chadmichaelda...@gmail.com wrote:
> I'm looking for information on how to enable binary search indexing. I found documentation for pre-2.0 jackrabbit, and reference to the fact that Tika is now used internally for the binary indexing. However, I can't find any documentation of how to enable the binary indexing . . ..

It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data property. The mimetype for text extraction is taken from the jcr:content/jcr:mimeType property. I don't know if you can enable it for other binary properties.

For the search configuration in general, see [1]. I don't know if the TextExtractor config described there is still valid with respect to the use of Tika in Jackrabbit 2.0.

[1] http://wiki.apache.org/jackrabbit/Search#Search_Configuration

Regards,
Alex

--
Alexander Klimetschek
alexander.klimetsc...@day.com
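For orientation, here is a minimal sketch of the SearchIndex element that the search configuration refers to, as it typically appears in a workspace.xml or repository.xml. Only the index class and the `path` parameter are shown. The thread does not settle whether explicit text extractor settings are still needed in Jackrabbit 2.0, so none are listed here; treat this as an assumption-laden illustration and check it against the configuration shipped with your version.

```xml
<!-- Sketch only: a bare SearchIndex element; verify against your own repository.xml. -->
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Directory where the Lucene index for this workspace is stored. -->
  <param name="path" value="${wsp.home}/index"/>
</SearchIndex>
```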
Re: jackrabbit 2.0 binary search indexing
On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek aklim...@day.com wrote:
> On Thu, Feb 18, 2010 at 18:35, ChadDavis chadmichaelda...@gmail.com wrote:
>> I'm looking for information on how to enable binary search indexing. I found documentation for pre-2.0 jackrabbit, and reference to the fact that Tika is now used internally for the binary indexing. However, I can't find any documentation of how to enable the binary indexing . . ..
>
> It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data property. The mimetype for text extraction is taken from the jcr:content/jcr:mimeType property. I don't know if you can enable it for other binary properties.

Just to clarify, you are saying that the binary indexing, as long as I'm using the JCR built-in node types for my binary file storage, e.g.

nt:file
  -- jcr:content nt:resource
    -- jcr:data ( binary property with my file ),

occurs automatically? If so, then something's not working for me. Can you recommend some troubleshooting tips? How can I determine whether the binaries are being indexed? Note, I'm doing a full text search and it DOES hit other node properties, etc.
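One way to check whether binaries are being indexed is to run a full-text query for a word that occurs only inside the file body, not in any node name or other property. The sketch below only builds such an XPath statement. The class and method names are hypothetical helpers, not part of Jackrabbit, and the query shape (`jcr:contains` scoped to the `jcr:content` child of an nt:file) is an assumption about what your repository accepts.

```java
import java.util.Objects;

/** Illustrative helper (not part of Jackrabbit) that builds a JCR XPath full-text query. */
public final class FullTextQuerySketch {

    private FullTextQuerySketch() {}

    /**
     * Escapes a term for a single-quoted XPath string literal by doubling single quotes.
     * This does not escape the full-text search syntax itself (for example, a leading '-').
     */
    public static String escapeLiteral(String term) {
        Objects.requireNonNull(term, "term");
        return term.replace("'", "''");
    }

    /** Builds a statement matching nt:file nodes whose jcr:content child contains the term. */
    public static String fileContains(String term) {
        return "//element(*, nt:file)[jcr:contains(jcr:content, '"
                + escapeLiteral(term) + "')]";
    }

    public static void main(String[] args) {
        // Print the statement so it can be pasted into a query tool or passed to the JCR API.
        System.out.println(fileContains("jackrabbit"));
    }
}
```

The resulting string can be passed to `session.getWorkspace().getQueryManager().createQuery(statement, Query.XPATH)` and executed. If a term that exists only in the binary content returns the file node, text extraction is working for that mime type; if it returns nothing, check the jcr:mimeType value on the jcr:content node.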
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
My binary files are all PDFs, so the text is extracted with PdfBox toolkit and the full text becomes keyword searchable. All done using the default configuration, except I extended nt:resource to add a few attributes. The mimeType attribute will be application/octet-stream. Perhaps there is no plug-in that knows how to extract text from your binary files?

--
This message contains privileged and confidential information only for use by the intended recipient. If you are not the intended recipient of this message, you must not disseminate, copy or use it in any manner. If you have received this message in error, please advise the sender by reply e-mail.
Please ensure all e-mail attachments are scanned for viruses prior to opening or using.
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
On Thu, Feb 18, 2010 at 5:30 PM, ross.dy...@ipaustralia.gov.au wrote:
> My binary files are all PDFs, so the text is extracted with PdfBox toolkit and the full text becomes keyword searchable. All done using the default configuration, except I extended nt:resource to add a few attributes. The mimeType attribute will be application/octet-stream. Perhaps there is no plug-in that knows how to extract text from your binary files?

I tried pdf, word, and a plain text file . . . how long does it take for a doc to be indexed?
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
I only have a small dataset in my test application (100 docs), it certainly only takes a few seconds to be available for the keyword search.
Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
On Thu, Feb 18, 2010 at 5:41 PM, ross.dy...@ipaustralia.gov.au wrote:
> I only have a small dataset in my test application (100 docs), it certainly only takes a few seconds to be available for the keyword search.

I figured it out. User error.