Re: FullText Indexing

Justin Edelson Thu, 16 Dec 2010 07:26:14 -0800

Sergio-
The ClassCastException and the NoSuchMethodException you posted on
d...@ point to a classpath problem. I would suggest posting the details
of your deployment: which JARs you are using, app server details, etc.
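
The trace you quoted mixes classes from org.apache.pdfbox and org.pdfbox,
which may mean two different PDFBox JARs are visible at the same time. A
minimal sketch for checking which JAR each class is loaded from is below.
The class WhereIsClass and its locate method are hypothetical names used
only for illustration; the class names in main are taken from the stack
trace in this thread.

```java
import java.net.URL;
import java.security.CodeSource;

public class WhereIsClass {

    /** Describes where the named class is loaded from, without initializing it. */
    static String locate(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = WhereIsClass.class.getClassLoader();
        }
        try {
            Class<?> type = Class.forName(className, false, loader);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            URL location = (source == null) ? null : source.getLocation();
            return (location == null)
                    ? "(bootstrap or unknown source)"
                    : location.toString();
        } catch (ClassNotFoundException e) {
            return "(not loadable: " + e + ")";
        } catch (LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Class names copied from the stack trace in this thread.
        String[] names = {
            "org.apache.pdfbox.util.PDFStreamEngine",
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

The output is only meaningful if this runs with the same class loader as
Jackrabbit (for example, from inside the same web application), since a
standalone run sees a different classpath.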


Justin

On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio <[email protected]> wrote:

> Hello.
>
> I’m new to Jackrabbit.
>
> I’m trying to index the content of different types of documents (Word,
> PDF, XML, …).
>
> I’ve configured the SearchIndex in my workspace.xml in this way:
>
> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>     <param name="path" value="${wsp.home}/index"/>
>     <param name="supportHighlighting" value="true"/>
>     <param name="textFilterClasses"
>            value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
>                   org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>                   org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>                   org.apache.jackrabbit.extractor.PdfTextExtractor,
>                   org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>                   org.apache.jackrabbit.extractor.RTFTextExtractor,
>                   org.apache.jackrabbit.extractor.HTMLTextExtractor,
>                   org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> </SearchIndex>
>
>
>
>
>
> When I create a document in the repository, I add the content in this way:
>
> contenido = nodo.addNode("jcr:content", "nt:resource");
> contenido.setProperty("jcr:data",
>         J_OperacionesSesion.getValueFactory().createBinary(is));
>
> MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> String mime = mimetypes.getContentType(nodo.getName());
> contenido.setProperty("jcr:mimeType", "application/pdf");
>
> After creating the document, this warning is logged:
>
>
>
> 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> org.apache.tika.exception.TikaException: Unable to extract PDF content
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>       at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>       at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:123)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
>       at java.lang.Thread.run(Thread.java:595)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
>       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>       at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>       at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>       ... 13 more
> Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
>       at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>       ... 16 more
>
>
>
> Later, when I search for the document, filtering by content, in this way:
>
>
>
> String consulta = "SELECT * FROM [arch:documento] AS documento "
>         + "WHERE CONTAINS(documento.*, 'ubicacion')";
>
> (arch:documento extends nt:file)
>
>
>
> No documents were found.
>
>
>
>
>
> Can you help me, please?
>
>
>
>
>
> Thanks and regards.
>
>
>
>
>
> *Sergio Rojas Buitrago*
