Hello.
I'm new to Jackrabbit.
I'm trying to index the content of different types of documents (Word, PDF,
XML, ...).
I've configured the SearchIndex in my workspace.xml like this:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
<param name="path" value="${wsp.home}/index"/>
<param name="supportHighlighting" value="true"/>
<param name="textFilterClasses"
value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
org.apache.jackrabbit.extractor.MsExcelTextExtractor,
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
org.apache.jackrabbit.extractor.PdfTextExtractor,
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
org.apache.jackrabbit.extractor.RTFTextExtractor,
org.apache.jackrabbit.extractor.HTMLTextExtractor,
org.apache.jackrabbit.extractor.XMLTextExtractor"/>
</SearchIndex>
When I create a document in the repository, I add the content in this way:
contenido = nodo.addNode("jcr:content", "nt:resource");
contenido.setProperty("jcr:data", J_OperacionesSesion
.getValueFactory().createBinary(is));
MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
String mime = mimetypes.getContentType(nodo.getName());
contenido.setProperty("jcr:mimeType", "application/pdf");
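Note that the snippet above computes `mime` but then stores a hard-coded "application/pdf". A minimal sketch of deriving the type from the file name with a fallback could look like the following. The class name `MimeTypeSketch`, the method `resolveMimeType` and the fallback value are hypothetical, and it uses the JDK's `URLConnection.guessContentTypeFromName` rather than `MimetypesFileTypeMap`, only to keep the example free of extra dependencies.

```java
import java.net.URLConnection;

public class MimeTypeSketch {

    // Generic binary type used when the file name gives no hint
    // (an assumption of this sketch, not something Jackrabbit requires).
    static final String FALLBACK = "application/octet-stream";

    /** Guesses a MIME type from the file name extension, or returns the fallback. */
    public static String resolveMimeType(String fileName) {
        // Looks the extension up in the JDK's built-in file name map;
        // returns null when the extension is missing or unknown.
        String guessed = URLConnection.guessContentTypeFromName(fileName);
        return guessed != null ? guessed : FALLBACK;
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        System.out.println(resolveMimeType("informe.pdf"));
        System.out.println(resolveMimeType("sin-extension"));
    }
}
```

The result could then be stored with `contenido.setProperty("jcr:mimeType", MimeTypeSketch.resolveMimeType(nodo.getName()));` instead of the literal value.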
After creating the document, this warning is logged:
16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
    at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
    at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
    at java.util.concurrent.FutureTask.run(FutureTask.java:123)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
    at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
    ... 13 more
Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
    ... 16 more
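The trace mixes two package roots, `org.pdfbox` and `org.apache.pdfbox`, which might mean that two different PDFBox versions are on the classpath; this is only a guess from the class names. A minimal sketch for checking where each class is loaded from could look like this. The class `PdfBoxClasspathCheck` and its `locate` method are hypothetical helpers, and the two class names are taken from the trace above.

```java
import java.security.CodeSource;

public class PdfBoxClasspathCheck {

    /**
     * Returns the location (usually a jar URL) a class is loaded from,
     * or null if the class cannot be found on the classpath.
     */
    public static String locate(String className) {
        try {
            // Load without initializing, so no static initializers run.
            Class<?> c = Class.forName(className, false,
                    PdfBoxClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                // JDK classes typically have no code source.
                return "(bootstrap or unknown location)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return null;
        } catch (LinkageError e) {
            // Class file present but not loadable; treated as missing here.
            return null;
        }
    }

    public static void main(String[] args) {
        // Both names appear in the stack trace above.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine"
        };
        for (String name : names) {
            String where = locate(name);
            System.out.println(name + " -> " + (where == null ? "not found" : where));
        }
    }
}
```

If both classes are found, and in different jars, that would be consistent with an older PDFBox jar sitting next to the one Tika uses. The check has to run with the same classpath as the repository to be meaningful.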
Later, when I search for the document, filtering by content, like this:
String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS (documento.*, 'ubicacion')";
(arch:documento extends nt:file)
no documents are found.
Can you help me, please?
Thanks and regards.
Sergio Rojas Buitrago
Software Development
Document Management
Ronda de Toledo s/n
13003. Ciudad Real
España
T +34 926 27 08 49
Ext: 237849