Hello Patrick,
could file 3 perhaps have replaced file 2 and file 1? Did you call
session.save() after adding each file?
Do I understand correctly that you now at least get a hit for
/jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]
where you did not have this one before?
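
To rule out typos when re-running the query for other keywords, here is a minimal sketch of how the statement above could be assembled. The class and method names (ContainsQuery, build) are made up for illustration and are not part of Jackrabbit. The sketch only doubles single quotes so the term cannot end the XPath string literal early; it does not handle the special characters of the full-text syntax itself.

```java
class ContainsQuery {

    /**
     * Builds an XPath full-text query for all nodes of the given type.
     * Hypothetical helper: a single quote in the term is doubled so it
     * stays inside the quoted XPath string literal.
     */
    static String build(String nodeType, String term) {
        String escaped = term.replace("'", "''");
        return "/jcr:root//element(*, " + nodeType + ")"
                + "[(jcr:contains(., '" + escaped + "'))]";
    }

    public static void main(String[] args) {
        // Same statement as the one quoted in this thread.
        System.out.println(build("nt:resource", "MyKeyWord"));
        // A term containing a quote keeps the literal well-formed.
        System.out.println(build("nt:resource", "Patrick's file"));
    }
}
```

The resulting string is what you would pass to QueryManager.createQuery(statement, Query.XPATH) on your workspace.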
Ard
>
> Hi Ard,
>
> Thanks for your answer, especially the part concerning the
> logs. That made me realize that they were disabled. Shame
> on me! ;-) Anyway, the logs showed me that some jars were
> missing from the classpath.
> After correcting that, I re-created my repository with one
> node to which I attached 3 files (that means creating an
> nt:file node with an nt:resource node for each attached file).
> My files are:
> 1. I set up the jcr:data property with a String, as you asked
>    me to do. I put text/plain as mimetype (since the field is
>    mandatory).
> 2. jcr:data is set up with a stream on a simple text file
>    (mimetype: text/plain).
> 3. jcr:data is set up with a stream on a Word document file
>    (mimetype: application/msword).
>
> I created these nodes, and here are extracts of the logs that I
> got related to indexing (note that there is no error in
> the whole log file, only debug output).
>
> file 1:
> DEBUG - persisting change log {#addedStates=15, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 172ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 31 ms.
>
> file 2:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 79ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 0 ms.
> DEBUG - got EventStateCollection
>
> file 3:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 125ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, application/msword, )
> DEBUG - onEvent: indexing finished in 78 ms.
> DEBUG - got EventStateCollection
>
>
> Checking the state of the index with Luke, I could see
> that file 3 (Word) was tokenized, but the content of
> files 1 and 2 doesn't appear anywhere, even though the
> respective properties and nodes do appear!
> Consequently, when I run the following XPath query:
> /jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]
>
> The only result is the Word Document...
>
> What happened with the 2 other files?
> Maybe the mimetype is wrong (text/plain) ?
> Or did I forget to define something ?
> Maybe I did something wrong in my filter definition, which is:
> <param name="textFilterClasses"
> value="org.apache.jackrabbit.extractor.PlainTextExtractor,
> org.apache.jackrabbit.extractor.MsWordTextExtractor,
> org.apache.jackrabbit.extractor.MsExcelTextExtractor,
> org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
> org.apache.jackrabbit.extractor.PdfTextExtractor,
> org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
> org.apache.jackrabbit.extractor.RTFTextExtractor,
> org.apache.jackrabbit.extractor.HTMLTextExtractor,
> org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>
>
> I thought that
> org.apache.jackrabbit.extractor.PlainTextExtractor could
> handle simple text files...
> As you can see, it is getting better, but I still need a
> little help ;-) so if you have any idea, don't hesitate.
>
> Thank you in advance,
> BR
> Patrick
>
>
>
> ----- Original Message ----
> From: Ard Schrijvers <[EMAIL PROTECTED]>
> To: [email protected]; Patrick Wider <[EMAIL PROTECTED]>
> Sent: Monday, 22 October 2007, 14:59:53
> Subject: RE: Binary Content Search Problem...
>
> Hello Patrick,
>
>
> > Patrick Wider wrote:
> >
> > Of course the files do contain 'myKeyWord'. The text file
> > contains it for sure, but in the Document, 'myKeyWord'
> > is wrapped in bold and italic styles. I don't think the styles
> > cause any problems. On the other hand, I have no idea how the
> > extractors work ;-) it's just a guess.
>
> Just for pinpointing the problem, what happens if:
>
> 1) you search for a word that is not in bold or italic?
> 2) you replace inputstr with "a string to test myKeyWord"
> and then do the search again?
>
> You might want to turn on the logging for the indexing and
> extractors; perhaps it reveals some problems. Furthermore,
> after adding a binary doc, you might want to open the most
> recently created index folder with Luke [1] and see whether the
> binary data is present as tokens in the index.
>
> Regards Ard
>
> [1] http://www.getopt.org/luke/
>