Then how can I configure the SearchIndex in my workspace.xml to work with the
Tika text extractors?
If I don't specify textFilterClasses, no error or warning is thrown when I
create a document, but the search doesn't find any results.
At this point, I don't know whether it is the indexer or my search query that
is failing. My query is:
String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS(documento.*, 'ubicacion')";
arch:documento is a subtype of nt:file.
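One way to tell an indexing problem from a query problem is to run the same CONTAINS constraint against a second node type and compare the results. The sketch below only builds the two JCR-SQL2 strings; it does not talk to a repository. ContainsQueries and its methods are hypothetical helpers, not part of Jackrabbit or the JCR API, and the idea that the extracted text may be indexed on the jcr:content (nt:resource) node rather than on the nt:file node is an assumption that this thread does not confirm.

```java
// Hypothetical helper for building diagnostic queries; not a Jackrabbit or JCR class.
final class ContainsQueries {

    /** Wraps the term in single quotes, doubling embedded quotes for a JCR-SQL2 string literal. */
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    /** Builds: SELECT * FROM [nodeType] AS selector WHERE CONTAINS(selector.*, 'term') */
    static String contains(String nodeType, String selector, String term) {
        return "SELECT * FROM [" + nodeType + "] AS " + selector
                + " WHERE CONTAINS(" + selector + ".*, " + quote(term) + ")";
    }

    public static void main(String[] args) {
        // The query from the message, against the nt:file subtype.
        System.out.println(contains("arch:documento", "documento", "ubicacion"));
        // Same constraint against the content nodes (assumption: the text may be indexed here).
        System.out.println(contains("nt:resource", "contenido", "ubicacion"));
    }
}
```

If the second query returns rows and the first does not, the extractor is probably working and the query scope is the thing to look at. If neither returns anything, the indexer is the more likely suspect. The helper only handles string-literal quoting, not the operators of the full-text search syntax.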
The content was added to the node this way:
contenido = nodo.addNode("jcr:content", "nt:resource");
contenido.setProperty("jcr:data",
J_OperacionesSesion.getValueFactory().createBinary(is));
The content is added correctly, because I can see it in the Jackrabbit web browser.
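The earlier message quoted below computes a MIME type with MimetypesFileTypeMap but then stores a hard-coded "application/pdf". Assuming the indexer uses jcr:mimeType to choose a parser (an assumption, not something this thread confirms), deriving the value from the file name may be worth trying. The sketch below is a self-contained illustration: MimeTypes and its three-entry table are hypothetical and cover only the document types mentioned in this thread.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table; extend it with the types your repository actually stores.
final class MimeTypes {

    private static final String FALLBACK = "application/octet-stream";

    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "doc", "application/msword",
            "xml", "application/xml");

    /** Returns the MIME type for the file name's extension, or a generic binary type. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all, or a trailing dot
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(MimeTypes.forFileName("informe.PDF"));
        System.out.println(MimeTypes.forFileName("notas"));
    }
}
```

With a helper like this, the property could be set as contenido.setProperty("jcr:mimeType", MimeTypes.forFileName(nodo.getName())) instead of a fixed string.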
Thanks and regards.
Sergio Rojas Buitrago
Desarrollo Software
Gestión Documental
Ronda de Toledo s/n
13003. Ciudad Real
España
T +34 926 27 08 49
Ext: 237849
[email protected]
www.indra.es
-----Original Message-----
From: [email protected] [mailto:[email protected]] On behalf of
Justin Edelson
Sent: Thursday, 16 December 2010 17:52
To: [email protected]
Subject: Re: FullText Indexing
AFAIK, all of that functionality is now in Apache Tika. So just remove it.
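After removing the dependency, it can help to confirm which JAR, if any, still supplies the classes named in the stack trace quoted further down, which mixes org.apache.pdfbox and org.pdfbox packages. The sketch below uses only standard JDK calls. ClassOrigin is a hypothetical helper name, and inside Tomcat it would need to run within the web application so that the webapp class loader is consulted.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; reports where a class is loaded from.
final class ClassOrigin {

    /** Returns the JAR or directory URL that provides the class, or a short status text. */
    static String locate(String className) {
        try {
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            // 'false' avoids running static initializers of the inspected class.
            Class<?> type = Class.forName(className, false, loader);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "JDK runtime (no code source)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in this thread.
        String[] names = {
            "org.pdfbox.util.operator.ShowTextGlyph",
            "org.apache.pdfbox.util.PDFStreamEngine",
            "org.apache.tika.parser.pdf.PDFParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If org.pdfbox.util.operator.ShowTextGlyph still resolves to a JAR after the cleanup, an old PDFBox library is probably still deployed somewhere on the classpath.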
On Thu, Dec 16, 2010 at 11:45 AM, Rojas Buitrago, Sergio <[email protected]> wrote:
> Which version should I use? 1.6.4 is the newest version of
> jackrabbit-text-extractors that I've found.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On behalf of
> Justin Edelson
> Sent: Thursday, 16 December 2010 17:40
> To: [email protected]
> Subject: Re: FullText Indexing
>
> I would remove that dependency. Using a 1.6.4 library with Jackrabbit 2.1.2
> just seems like a bad idea.
>
> On Thu, Dec 16, 2010 at 11:10 AM, Rojas Buitrago, Sergio <[email protected]> wrote:
>
> > I'm using Jackrabbit 2.1.2 deployed in Tomcat 6.0, managed from Eclipse.
> >
> > For the text extractors, I get the necessary library from the following
> > Maven dependency:
> >
> > <dependency>
> >   <groupId>org.apache.jackrabbit</groupId>
> >   <artifactId>jackrabbit-text-extractors</artifactId>
> >   <version>1.6.4</version>
> > </dependency>
> >
> > Is there any other useful information I should provide?
> >
> > Regards.
> >
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On behalf of
> > Justin Edelson
> > Sent: Thursday, 16 December 2010 16:26
> > To: [email protected]
> > Subject: Re: FullText Indexing
> >
> > Sergio-
> > The ClassCastException and the NoSuchMethodException you posted on
> > d...@ suggest a classpath problem. I would suggest posting the details
> > of your deployment: which JARs you are using, app server details, etc.
> >
> > Justin
> >
> > On Thu, Dec 16, 2010 at 9:31 AM, Rojas Buitrago, Sergio <[email protected]> wrote:
> >
> > > Hello.
> > >
> > > I'm a newbie to Jackrabbit.
> > >
> > > I'm trying to index the content of different types of documents (Word,
> > > PDF, XML, ...).
> > >
> > > I've configured the SearchIndex in my workspace.xml this way:
> > >
> > > <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
> > >   <param name="path" value="${wsp.home}/index"/>
> > >   <param name="supportHighlighting" value="true"/>
> > >   <param name="textFilterClasses"
> > >          value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
> > >                 org.apache.jackrabbit.extractor.MsExcelTextExtractor,
> > >                 org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
> > >                 org.apache.jackrabbit.extractor.PdfTextExtractor,
> > >                 org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
> > >                 org.apache.jackrabbit.extractor.RTFTextExtractor,
> > >                 org.apache.jackrabbit.extractor.HTMLTextExtractor,
> > >                 org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> > > </SearchIndex>
> > >
> > > When I create a document in the repository, I add the content this way:
> > >
> > > contenido = nodo.addNode("jcr:content", "nt:resource");
> > > contenido.setProperty("jcr:data",
> > >     J_OperacionesSesion.getValueFactory().createBinary(is));
> > >
> > > MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
> > > String mime = mimetypes.getContentType(nodo.getName());
> > > contenido.setProperty("jcr:mimeType", "application/pdf");
> > >
> > > After creating the document, this warning is thrown:
> > >
> > > 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
> > > org.apache.tika.exception.TikaException: Unable to extract PDF content
> > >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
> > >     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
> > >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
> > >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
> > >     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
> > >     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
> > >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
> > >     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
> > >     at java.util.concurrent.FutureTask.run(FutureTask.java:123)
> > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
> > >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
> > >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
> > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
> > >     at java.lang.Thread.run(Thread.java:595)
> > > Caused by: org.apache.pdfbox.exceptions.WrappedIOException: OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be instantiated
> > >     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
> > >     at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
> > >     at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
> > >     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
> > >     ... 13 more
> > > Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph
> > >     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
> > >     ... 16 more
> > >
> > > Later, I search for the document, filtering by content, this way:
> > >
> > > String consulta = "SELECT * FROM [arch:documento] AS documento WHERE CONTAINS(documento.*, 'ubicacion')";
> > > (arch:documento extends nt:file)
> > >
> > > No documents are found.
> > >
> > > Can you help me, please?
> > >
> > > Thanks and regards.
This email and any file attached to it (when applicable) contain(s)
confidential information that is exclusively addressed to its recipient(s). If
you are not the indicated recipient, you are informed that reading, using,
disseminating and/or copying it without authorisation is forbidden in
accordance with the legislation in effect. If you have received this email by
mistake, please immediately notify the sender of the situation by resending it
to their email address.
Avoid printing this message if it is not absolutely necessary.