Re : Re : Binary Content Search Problem...

Patrick Wider Tue, 23 Oct 2007 03:20:31 -0700

Hi,
I really don't think file 3 replaces the previous ones. I create one "top" 
node (called "Homepage"), to which I attach 3 different nodes using 
Homepage.addNode(...) (typed as: wider:file > 'nt:file', 'mix:referenceable'; 
maybe there is something missing in my NodeType definition?). I also attach 
a separate nt:resource node to each of them. It goes like this:


   File fileTXT = new File("C:/JackRabbit/testresources/JackRabbittest.txt");
   File fileDOC = new File("C:/JackRabbit/testresources/JackRabbittest.doc");

   // File 1: jcr:data is set from a plain String (mimetype is text/plain)
   Node file1 = homepage.addNode("MyStringName", "wider:file");
   Node res1 = file1.addNode("jcr:content", "nt:resource");
   res1.setProperty("jcr:mimeType", mimetype);
   res1.setProperty("jcr:encoding", "");
   res1.setProperty("jcr:lastModified", cal);   
   res1.setProperty("jcr:data", "My String with MyKeyWord Content toto");
   session.save();

   // File 2: jcr:data is streamed from the text file (mimetype is text/plain)
   Node file2 = homepage.addNode(fileTXT.getName(), "wider:file");
   Node res2 = file2.addNode("jcr:content", "nt:resource");
   res2.setProperty("jcr:mimeType", mimetype);
   res2.setProperty("jcr:encoding", "");
   res2.setProperty("jcr:lastModified", cal);
   InputStream inputTXT = new FileInputStream(fileTXT);
   res2.setProperty("jcr:data", inputTXT);
   session.save();

   // File 3: jcr:data is streamed from the Word document (mimetype is application/msword)
   Node file3 = homepage.addNode(fileDOC.getName(), "wider:file");
   Node res3 = file3.addNode("jcr:content", "nt:resource");
   res3.setProperty("jcr:mimeType", mimetype);
   res3.setProperty("jcr:encoding", "");
   res3.setProperty("jcr:lastModified", cal);
   InputStream inputDOC = new FileInputStream(fileDOC);
   res3.setProperty("jcr:data", inputDOC);
   session.save();
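
The snippet above uses a single `mimetype` variable without showing where its value comes from. As a rough sketch only, one way to choose it per file is a lookup by file extension. The class `MimeTypes`, the method `forFileName` and the `application/octet-stream` fallback are hypothetical names and choices for illustration, not part of Jackrabbit or of my real code:

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: picks a MIME type from a file name's extension. */
final class MimeTypes {
    // Only the two types that appear in this thread; extend as needed.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "doc", "application/msword");

    private static final String FALLBACK = "application/octet-stream";

    private MimeTypes() {}

    /** Returns the MIME type for the given name, or the fallback if unknown. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all, or a trailing dot
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(forFileName("JackRabbittest.txt"));
        System.out.println(forFileName("JackRabbittest.doc"));
    }
}
```

With such a helper, each resource node would get `MimeTypes.forFileName(file.getName())` instead of one shared value.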


Yes, my query now returns one hit: the doc file, even though MyKeyWord 
appears in all 3 contents.

Before, I got no results because of the missing jars. Now that problem is 
resolved and the Word document is indexed! 
But the simple text file is not... weird, isn't it?
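
One detail that might be worth ruling out: the code sets jcr:encoding to "", and the log extract quoted below shows extractText(stream, text/plain, ) with an empty last argument. Whether that has any effect on Jackrabbit's PlainTextExtractor is not established anywhere in this thread. The sketch below is plain Java only and just shows how an empty or unknown encoding name could be detected and replaced by a fallback before reading a text stream. `EncodingCheck`, `isUsableEncoding`, `readText` and the UTF-8 fallback are all hypothetical choices for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Jackrabbit. */
final class EncodingCheck {
    private EncodingCheck() {}

    /** True only if the name is non-empty and names a charset this JVM supports. */
    static boolean isUsableEncoding(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // syntactically invalid charset name
        }
    }

    /** Reads the whole stream as text, falling back to UTF-8 (an assumption). */
    static String readText(InputStream in, String encoding) {
        Charset charset = isUsableEncoding(encoding)
                ? Charset.forName(encoding)
                : StandardCharsets.UTF_8;
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] data = "My String with MyKeyWord Content toto"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(isUsableEncoding(""));
        System.out.println(readText(new ByteArrayInputStream(data), ""));
    }
}
```

A simple experiment along the same lines would be to leave jcr:encoding unset, or to set it to a real charset name, and then check the index again.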

BR, Patrick

----- Original Message ----
From: Ard Schrijvers <[EMAIL PROTECTED]>
To: [email protected]; Patrick Wider <[EMAIL PROTECTED]>
Sent: Tuesday, 23 October 2007, 11:55:29
Subject: RE: Re : Binary Content Search Problem...


Hello Patrick,

Didn't file 3 perhaps replace file 2 and file 1? Did you do a session.save() 
after each file? 

Do I understand correctly that you now at least get a hit for  

/jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]

where you did not have this one before?
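
For repeating this test with other keywords, a minimal sketch of building the same kind of statement from a search term could look like the following. `ContainsQuery` and `forResourceText` are hypothetical names. The sketch only doubles apostrophes so the term stays a valid XPath string literal; it does not handle the search-expression syntax of jcr:contains itself:

```java
/** Hypothetical helper that builds an XPath full-text query string. */
final class ContainsQuery {
    private ContainsQuery() {}

    /** Returns an XPath statement matching nt:resource nodes containing the term. */
    static String forResourceText(String term) {
        // An apostrophe inside an XPath string literal is written twice.
        String escaped = term.replace("'", "''");
        return "/jcr:root//element(*, nt:resource)[jcr:contains(., '"
                + escaped + "')]";
    }

    public static void main(String[] args) {
        System.out.println(forResourceText("MyKeyWord"));
    }
}
```

The resulting string can then be passed to the JCR query API, for example session.getWorkspace().getQueryManager().createQuery(statement, Query.XPATH).execute().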

Ard

> 
> Hi Ard,
> 
> Thanks for your answer, especially the part concerning the 
> logs. That made me realize they were disabled... shame 
> on me! ;-) Anyway, the logs showed me that some jars were 
> missing from the classpath.
> After fixing that, I re-created my repository with one 
> node to which I attached 3 files (that means creating an 
> nt:file node with an nt:resource node for each attached file). 
> My files are:
> 1. jcr:data is set with a String, as you asked me to do. I put 
>    text/plain as the mimetype (since the field is mandatory).
> 2. jcr:data is set with a stream on a simple text file 
>    (mimetype: text/plain).
> 3. jcr:data is set with a stream on a Word document file 
>    (mimetype: application/msword).
> 
> I created these nodes, and here are the extracts of the logs that I 
> got related to indexing (note that there is no error entry in 
> the whole log file, only debug).
> 
> file 1:
> DEBUG - persisting change log {#addedStates=15, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 172ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 31 ms.
> 
> file 2:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 79ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 0 ms.
> DEBUG - got EventStateCollection
> 
> file 3:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 125ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, application/msword, )
> DEBUG - onEvent: indexing finished in 78 ms.
> DEBUG - got EventStateCollection
> 
> 
> Checking the state of the index with Luke, I could see 
> that file 3 (Word) was tokenized, but the content of 
> files 1 and 2 doesn't appear anywhere, even though the 
> respective properties and nodes do appear!
> Consequently, when I run the following XPath query:
> /jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]
> 
> the only result is the Word document...
> 
> What happened with the 2 other files?
> Maybe the mimetype is wrong (text/plain) ?
> Or did I forget to define something ?
> Maybe I did something wrong in my filter definition, which is:
>    <param name="textFilterClasses" 
>    value="org.apache.jackrabbit.extractor.PlainTextExtractor,
>      org.apache.jackrabbit.extractor.MsWordTextExtractor,
>      org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>      org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>      org.apache.jackrabbit.extractor.PdfTextExtractor,
>      org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>      org.apache.jackrabbit.extractor.RTFTextExtractor,
>      org.apache.jackrabbit.extractor.HTMLTextExtractor,
>      org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> 
> 
> I thought that 
> org.apache.jackrabbit.extractor.PlainTextExtractor could 
> handle simple text files... 
> As you can see, it is getting better, but I still need a 
> little help ;-) so if you have any ideas, don't hesitate.
> 
> Thank you in advance,
> BR
> Patrick
> 
> 
> 
> ----- Original Message ----
> From: Ard Schrijvers <[EMAIL PROTECTED]>
> To: [email protected]; Patrick Wider <[EMAIL PROTECTED]>
> Sent: Monday, 22 October 2007, 14:59:53
> Subject: RE: Binary Content Search Problem...
> 
> Hello Patrick,
> 
> 
> > Patrick Wider wrote:
> > 
> > Of course the files do contain 'myKeyWord'. The text file 
> > contains it for sure, but in the Word document 'myKeyWord' 
> > is wrapped in bold and italic styles. I don't think the styles 
> > cause any problems; on the other hand, I have no idea how the 
> > extractors work ;-) it's just a guess...
> 
> Just for pinpointing the problem, what happens if:
> 
> 1) you search for a word that is not with bold or italic styles?
> 2) if you replace inputstr with "a string to test myKeyWord", 
> and then do the search again
> 
> You might want to turn on logging for the indexing and 
> extractors; perhaps it reveals some problems. Furthermore, 
> after adding a binary doc you might want to look at the most 
> recently created index folder with Luke [1] and see whether the 
> binary data is present as tokens in the index.
> 
> Regards Ard
> 
> [1] http://www.getopt.org/luke/
> 
> >
> 
> 
>      


      