Re: jackrabbit 2.0 binary search indexing

2010-02-19 Thread Ard Schrijvers
On Fri, Feb 19, 2010 at 1:13 AM, ChadDavis chadmichaelda...@gmail.com wrote:
 On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek aklim...@day.com 
 wrote:
 On Thu, Feb 18, 2010 at 18:35, ChadDavis chadmichaelda...@gmail.com wrote:
 I'm looking for information on how to enable binary search indexing.
 I found documentation for pre-2.0 Jackrabbit, and a reference to the
 fact that Tika is now used internally for the binary indexing.
 However, I can't find any documentation on how to enable the binary
 indexing...

 It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
 property. The MIME type for text extraction is taken from the
 jcr:content/jcr:mimeType property. I don't know if you can enable it
 for other binary properties.


 Just to clarify, you are saying that the binary indexing, as long as
 I'm using the JCR built-in node types for my binary file storage, e.g.
 nt:file -> jcr:content (nt:resource) -> jcr:data (binary property
 with my file), occurs automatically?

 If so, then something's not working for me.  Can you recommend some
 troubleshooting tips?  How can I determine whether the binaries are
 being indexed?  Note, I'm doing a full text search and it DOES hit
 other node properties, etc.

Make sure you have all the extractors you need configured. This is
done in the SearchIndex configuration. You can also look this up on
the wiki.

Ard




Re: jackrabbit 2.0 binary search indexing

2010-02-19 Thread ChadDavis
 Make sure you have all the extractors you need configured. This is
 done in the SearchIndex configuration. You can also look this up on
 the wiki.


Is this necessary for Jackrabbit 2.0?


Re: jackrabbit 2.0 binary search indexing

2010-02-18 Thread Alexander Klimetschek
On Thu, Feb 18, 2010 at 18:35, ChadDavis chadmichaelda...@gmail.com wrote:
 I'm looking for information on how to enable binary search indexing.
 I found documentation for pre-2.0 Jackrabbit, and a reference to the
 fact that Tika is now used internally for the binary indexing.
 However, I can't find any documentation on how to enable the binary
 indexing...

It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
property. The MIME type for text extraction is taken from the
jcr:content/jcr:mimeType property. I don't know if you can enable it
for other binary properties.
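
Since the MIME type for extraction is read from jcr:content/jcr:mimeType, it can help to set a specific type when storing a file rather than a generic one. The following is only a minimal sketch of that idea: MimeTypeGuess, its lookup table, and both method names are hypothetical and not part of Jackrabbit or the JCR API, and the assumption that a generic application/octet-stream value will not match any extractor is not confirmed in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Jackrabbit: picks a value to store in
// jcr:content/jcr:mimeType based on the file name.
class MimeTypeGuess {
    static final String GENERIC = "application/octet-stream";

    // Small illustrative table; extend it for the formats you actually store.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("txt", "text/plain");
    }

    // Returns a specific MIME type when the extension is known, else the generic one.
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return GENERIC; // no extension to go on
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, GENERIC);
    }

    // True when the type is missing or generic. Assumption: such a value
    // gives the indexer nothing to choose a text extractor by.
    static boolean isGeneric(String mimeType) {
        return mimeType == null || GENERIC.equalsIgnoreCase(mimeType.trim());
    }

    public static void main(String[] args) {
        for (String name : new String[] {"report.pdf", "notes.txt", "blob"}) {
            String type = forFileName(name);
            System.out.println(name + " -> " + type + (isGeneric(type) ? " (generic)" : ""));
        }
    }
}
```

The returned string would be the value written to the jcr:mimeType property of the nt:resource node before saving.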

For the search configuration in general, see [1]. I don't know if the
TextExtractor config described there is still valid with respect to
the use of Tika in Jackrabbit 2.0.

[1] http://wiki.apache.org/jackrabbit/Search#Search_Configuration

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetsc...@day.com


Re: jackrabbit 2.0 binary search indexing

2010-02-18 Thread ChadDavis
On Thu, Feb 18, 2010 at 2:39 PM, Alexander Klimetschek aklim...@day.com wrote:
 On Thu, Feb 18, 2010 at 18:35, ChadDavis chadmichaelda...@gmail.com wrote:
 I'm looking for information on how to enable binary search indexing.
 I found documentation for pre-2.0 Jackrabbit, and a reference to the
 fact that Tika is now used internally for the binary indexing.
 However, I can't find any documentation on how to enable the binary
 indexing...

 It is enabled for all nt:file binaries, i.e. the jcr:content/jcr:data
 property. The MIME type for text extraction is taken from the
 jcr:content/jcr:mimeType property. I don't know if you can enable it
 for other binary properties.


Just to clarify, you are saying that the binary indexing, as long as
I'm using the JCR built-in node types for my binary file storage, e.g.
nt:file -> jcr:content (nt:resource) -> jcr:data (binary property
with my file), occurs automatically?

If so, then something's not working for me.  Can you recommend some
troubleshooting tips?  How can I determine whether the binaries are
being indexed?  Note, I'm doing a full text search and it DOES hit
other node properties, etc.
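
One way to check whether binaries are indexed is to search for a word that occurs only inside a stored file and restrict the query to nt:file nodes. The sketch below only builds such a query string. FileContentQuery and xpathFor are hypothetical names, and the use of jcr:contains with the jcr:content child node as its scope is an assumption about what Jackrabbit accepts, not something confirmed in this thread.

```java
// Hypothetical helper: builds a JCR 1.0 XPath full-text query string that
// looks for a term in the content of nt:file nodes.
class FileContentQuery {

    static String xpathFor(String term) {
        if (term == null || term.trim().isEmpty()) {
            throw new IllegalArgumentException("search term must not be empty");
        }
        // Double single quotes so the term stays inside the XPath string literal.
        // Simplification: special characters of the jcr:contains search syntax
        // itself are not handled here.
        String escaped = term.trim().replace("'", "''");
        // Assumption: jcr:contains may be scoped to the jcr:content child node.
        return "//element(*, nt:file)[jcr:contains(jcr:content, '" + escaped + "')]";
    }

    public static void main(String[] args) {
        System.out.println(xpathFor("quarterly"));
    }
}
```

The resulting string would be passed to QueryManager.createQuery(statement, Query.XPATH). If the query finds the word when it is stored in an ordinary string property but not when it occurs only inside the file, the problem is likely on the extraction side rather than in the query.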


Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

2010-02-18 Thread Ross . Dyson
My binary files are all PDFs, so the text is extracted with the PDFBox
toolkit and the full text becomes keyword searchable.
All done using the default configuration, except I extended nt:resource
to add a few attributes.

The mimeType attribute will be application/octet-stream.
Perhaps there is no plug-in that knows how to extract text from your
binary files?





--
This message contains privileged and confidential information only 
for use by the intended recipient.  If you are not the intended 
recipient of this message, you must not disseminate, copy or use 
it in any manner.  If you have received this message in error, 
please advise the sender by reply e-mail.  Please ensure all 
e-mail attachments are scanned for viruses prior to opening or 
using.


Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

2010-02-18 Thread ChadDavis
On Thu, Feb 18, 2010 at 5:30 PM,  ross.dy...@ipaustralia.gov.au wrote:
 My binary files are all PDFs, so the text is extracted with the PDFBox
 toolkit and the full text becomes keyword searchable.
 All done using the default configuration, except I extended nt:resource
 to add a few attributes.

 The mimeType attribute will be application/octet-stream.
 Perhaps there is no plug-in that knows how to extract text from your
 binary files?

I tried PDF, Word, and a plain text file... How long does it take
for a document to be indexed?









Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

2010-02-18 Thread Ross . Dyson
I only have a small dataset in my test application (100 docs), and it
certainly takes only a few seconds for a document to become available
for keyword search.
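
Because the delay between saving a document and finding it by keyword is only described here as "a few seconds", a test can poll for the hit instead of sleeping for a fixed time. This is a generic sketch under that assumption: IndexPolling and waitUntil are hypothetical names, and the check passed in stands in for whatever code runs the full-text query and reports whether it returned a hit.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper, not part of Jackrabbit or the JCR API.
class IndexPolling {

    // Calls check repeatedly until it returns true or the timeout elapses.
    // Returns true if the check succeeded, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier check, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without a hit
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "run the full-text query and see whether it has hits".
        boolean found = waitUntil(() -> System.currentTimeMillis() - start >= 50, 2000, 10);
        System.out.println("searchable: " + found);
    }
}
```

A timeout of several seconds with a short interval keeps the test fast when indexing is quick while still giving a clear failure when the document never becomes searchable.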

ChadDavis chadmichaelda...@gmail.com wrote on 19/02/2010 11:33:27 AM:

 From: ChadDavis chadmichaelda...@gmail.com
 To: users@jackrabbit.apache.org
 Date: 19/02/2010 11:34 AM
 Subject: Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]
 
 On Thu, Feb 18, 2010 at 5:30 PM,  ross.dy...@ipaustralia.gov.au wrote:
  My binary files are all PDFs, so the text is extracted with the PDFBox
  toolkit and the full text becomes keyword searchable.
  All done using the default configuration, except I extended nt:resource
  to add a few attributes.
 
  The mimeType attribute will be application/octet-stream.
  Perhaps there is no plug-in that knows how to extract text from your
  binary files?
 
 I tried PDF, Word, and a plain text file... How long does it take
 for a document to be indexed?
 
 
 
 
 



Re: jackrabbit 2.0 binary search indexing [SEC=UNCLASSIFIED]

2010-02-18 Thread ChadDavis
On Thu, Feb 18, 2010 at 5:41 PM,  ross.dy...@ipaustralia.gov.au wrote:
 I only have a small dataset in my test application (100 docs), and it
 certainly takes only a few seconds for a document to become available
 for keyword search.

I figured it out.  User error.