I'm using Jackrabbit 2.1.2 deployed in Tomcat 6.0, managed from Eclipse.
For the text extractors, I get the necessary library from the following Maven
dependency:
<dependency>
<groupId>org.apache.jackrabbit</groupId>
<artifactId>jackrabbit-text-extractors</artifactId>
<version>1.6.4</version>
</dependency>
Is there any other useful information I should provide?
Regards.
-----Original Message-----
From: [email protected] [mailto:[email protected]] On behalf of
Justin Edelson
Sent: Thursday, December 16, 2010 16:26
To: [email protected]
Subject: Re: FullText Indexing
Sergio-
The ClassCastException and the NoSuchMethodException you posted on
d...@ suggest a classpath problem. I would suggest posting the details
of your deployment: which JARs you are using, app server details, etc.
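
One way to check for a classpath problem like the one suggested above is to ask the JVM where the classes named in the stack trace are loaded from. The following is only a minimal sketch, assuming it is run inside the web application (for example from a test servlet) so that it sees the same class loader as Jackrabbit. `ClasspathProbe`, `locate` and `copiesOf` are illustrative names, not part of Jackrabbit, Tika or PDFBox.

```java
import java.io.IOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

public class ClasspathProbe {

    /** Uses the context class loader, falling back to the system loader. */
    private static ClassLoader loader() {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        return loader != null ? loader : ClassLoader.getSystemClassLoader();
    }

    /** Returns the location (usually a JAR URL) a class is loaded from. */
    public static String locate(String className) {
        try {
            // 'false' avoids running static initializers of the probed class.
            Class<?> type = Class.forName(className, false, loader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "no code source (probably a JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not found: " + className;
        } catch (LinkageError e) {
            return "not loadable: " + className;
        }
    }

    /** Lists every copy of a resource (e.g. a .class file) visible to the loader. */
    public static List<URL> copiesOf(String resourcePath) {
        try {
            return Collections.list(loader().getResources(resourcePath));
        } catch (IOException e) {
            throw new IllegalStateException("Cannot scan for " + resourcePath, e);
        }
    }

    public static void main(String[] args) {
        // Both class names are taken from the quoted stack trace.
        System.out.println(locate("org.apache.pdfbox.util.PDFStreamEngine"));
        System.out.println(locate("org.pdfbox.util.operator.ShowTextGlyph"));
        // More than one URL here would mean the same class is deployed twice.
        System.out.println(copiesOf("org/apache/pdfbox/util/PDFStreamEngine.class"));
    }
}
```

If both the `org.apache.pdfbox` and the older `org.pdfbox` classes resolve to JARs in the same web application, two PDFBox generations are probably deployed side by side, which could explain the ClassCastException.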
Justin
On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio <[email protected]>wrote:
> Hello.
>
>
>
> I'm a newbie in Jackrabbit.
>
>
>
> I'm trying to index the content of different types of documents (Word,
> PDF, XML, ...).
>
> I've configured the SearchIndex in my workspace.xml like this:
>
>
>
> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>   <param name="path" value="${wsp.home}/index"/>
>   <param name="supportHighlighting" value="true"/>
>   <param name="textFilterClasses"
>          value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
>                 org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>                 org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>                 org.apache.jackrabbit.extractor.PdfTextExtractor,
>                 org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>                 org.apache.jackrabbit.extractor.RTFTextExtractor,
>                 org.apache.jackrabbit.extractor.HTMLTextExtractor,
>                 org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> </SearchIndex>
>
>
>
>
>
> When I create a document in the repository, I add the content in this way:
>
>
>
> contenido = nodo.addNode("jcr:content", "nt:resource");
> contenido.setProperty("jcr:data",
>         J_OperacionesSesion.getValueFactory().createBinary(is));
>
> MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> String mime = mimetypes.getContentType(nodo.getName());
> contenido.setProperty("jcr:mimeType", "application/pdf");
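
Note that the quoted snippet computes `mime` but then stores a hard-coded "application/pdf", so every document is indexed as a PDF regardless of its real type. The following is only a sketch of one alternative, assuming the node name carries a usable file extension. It uses the JDK's built-in `URLConnection.guessContentTypeFromName` instead of `MimetypesFileTypeMap`, and `MimeTypes`, `resolve` and `DEFAULT_TYPE` are hypothetical names introduced for illustration.

```java
import java.net.URLConnection;

public class MimeTypes {

    /** Fallback used when the file name gives no hint about the type. */
    static final String DEFAULT_TYPE = "application/octet-stream";

    /** Guesses a MIME type from a file name using the JDK's built-in table. */
    public static String resolve(String fileName) {
        String guessed = URLConnection.guessContentTypeFromName(fileName);
        return guessed != null ? guessed : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        // Example file names are made up for this sketch.
        System.out.println(resolve("informe.pdf"));
        System.out.println(resolve("sin_extension"));
    }
}
```

With a helper like this, the last line of the snippet could become `contenido.setProperty("jcr:mimeType", MimeTypes.resolve(nodo.getName()));`. This does not address the ClassCastException itself, only which extractor is chosen for each document.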
>
>
>
> After creating the document, this warning is logged:
>
>
>
> 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:123)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
>     at java.lang.Thread.run(Thread.java:595)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
>     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>     at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>     at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>     ... 13 more
> Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
>     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>     ... 16 more
>
>
>
> Later, I search for the document by content with this query
> (arch:documento extends nt:file):
>
> String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS(documento.*, 'ubicacion')";
>
> No documents are found.
>
>
>
>
>
> Can you help me, please?
>
>
>
>
>
> Thanks and regards.
>
>
>
>
>
> *Sergio Rojas Buitrago*
>
> Software Development
> Document Management
>
> Ronda de Toledo s/n
> 13003. Ciudad Real
> España
>
> T +34 926 27 08 49
>
> Ext: 237849
>
>
>
> [email protected]
> www.indra.es
>
>
>
>
> ------------------------------
> This email and any file attached to it (when applicable) contain(s)
> confidential information that is exclusively addressed to its recipient(s).
> If you are not the indicated recipient, you are informed that reading,
> using, disseminating and/or copying it without authorisation is forbidden in
> accordance with the legislation in effect. If you have received this email
> by mistake, please immediately notify the sender of the situation by
> resending it to their email address.
> Avoid printing this message if it is not absolutely necessary.
>