Re: How to set Tika with ManifoldCF and Solr

Karl Wright Thu, 11 Oct 2018 05:21:02 -0700

I confirmed that both the Tika Service transformer and the Tika transformer
check exactly the same mime type:


>>>>>>
  @Override
  public boolean checkMimeTypeIndexable(VersionContext pipelineDescription,
      String mimeType, IOutputCheckActivity checkActivity)
    throws ManifoldCFException, ServiceInterruption
  {
    // We should see what Tika will transform
    // MHL
    // Do a downstream check
    return checkActivity.checkMimeTypeIndexable("text/plain;charset=utf-8");
  }
<<<<<<

So: please verify that your Solr connection is set up correctly and the
"use extracting update handler" box is UNCHECKED.

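To illustrate why the rejection in the Simple History is surprising, here is
a minimal sketch of the check chain. It is not the ManifoldCF code:
DownstreamCheck, plainTextOnlyOutput and transformerCheck are hypothetical
names, and the plain-text-only rule is an assumed simplification of what the
Solr connection does when the box is unchecked.

```java
import java.util.Locale;

/**
 * Simplified model of a pipeline mime-type check.
 * All types here are hypothetical stand-ins, not the real ManifoldCF classes.
 */
class MimeCheckSketch {

  /** Hypothetical stand-in for the role IOutputCheckActivity plays. */
  interface DownstreamCheck {
    boolean checkMimeTypeIndexable(String mimeType);
  }

  /**
   * Assumed behaviour of an output with "use extracting update handler"
   * unchecked: only plain text is accepted, whatever the charset parameter.
   */
  static boolean plainTextOnlyOutput(String mimeType) {
    if (mimeType == null) {
      return false;
    }
    // Drop parameters such as ";charset=utf-8" before comparing.
    String baseType = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    return baseType.equals("text/plain");
  }

  /**
   * Transformer-style check: the incoming mime type is ignored, because the
   * transformer always emits plain text; it only asks whether the next
   * stage accepts that output type.
   */
  static boolean transformerCheck(String incomingMimeType, DownstreamCheck downstream) {
    return downstream.checkMimeTypeIndexable("text/plain;charset=utf-8");
  }

  public static void main(String[] args) {
    String xlsx = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    DownstreamCheck output = MimeCheckSketch::plainTextOnlyOutput;

    // With the transformer in front, the output is asked about text/plain.
    System.out.println("via transformer: " + transformerCheck(xlsx, output));
    // Asked directly, the output sees the spreadsheet mime type itself.
    System.out.println("direct to output: " + output.checkMimeTypeIndexable(xlsx));
  }
}
```

In this model the output only ever sees the spreadsheet mime type when no
transformer check sits in front of it.
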
Thanks,
Karl


On Thu, Oct 11, 2018 at 8:16 AM Karl Wright <[email protected]> wrote:

> When you uncheck the "use extracting update handler" checkbox, the Solr
> connection accepts only text/plain and no binary formats.  The Tika
> extractor, though, should always set the mime type to "text/plain".  Since
> the Simple History says otherwise, I wonder if there's a problem with the
> external Tika extractor.  Perhaps you can try the internal one to get your
> pipeline working first?  If the external one does not send the right mime
> type, then we need to correct that, so please open a ticket.
>
> Thanks,
> Karl
>
>
> On Thu, Oct 11, 2018 at 8:10 AM Bisonti Mario <[email protected]>
> wrote:
>
>> Now the document isn’t ingested by Solr, because I get:
>>
>> Solr connector rejected document due to mime type restrictions:
>> (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
>>
>> But that mime type is listed on the tab.
>>
>> And these settings worked well when I used Tika inside Solr.
>>
>> Could you help me?
>>
>> Thanks
>>
>>
>>
>> *From:* Bisonti Mario <[email protected]>
>> *Sent:* Thursday, 11 October 2018 14:03
>> *To:* [email protected]
>> *Subject:* RE: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>>
>>
>> My mistake…
>>
>> As you wrote, I had to uncheck “use extracting update handler”.
>>
>> Now I have to understand the fields mentioned in the schema, etc.
>>
>>
>>
>> *From:* Bisonti Mario <[email protected]>
>> *Sent:* Thursday, 11 October 2018 13:45
>> *To:* [email protected]
>> *Subject:* RE: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>> I see that the job was processed, but without the document inside.
>>
>> 10-11-2018 13:32:25.649  job end    1539153700219(G_IT_Area_condivisa_Mario_XLSM)  0  1
>> 10-11-2018 13:32:14.211  job start  1539153700219(G_IT_Area_condivisa_Mario_XLSM)  0  1
>>
>>
>> Do I have to uncheck “Use the Extract Update Handler” on my Solr output
>> connection?
>>
>>
>>
>> *From:* Karl Wright <[email protected]>
>> *Sent:* Thursday, 11 October 2018 13:36
>> *To:* [email protected]
>> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>> Please have a look at your "Simple History" report to see why the
>> documents aren't getting indexed.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>> On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario <[email protected]>
>> wrote:
>>
>> Thanks Karl.
>>
>> I tried, but it doesn’t index the documents.
>>
>> It seems that it doesn’t see them.
>>
>> Perhaps the problem is the “Ignore Tika exception” setting, which I don’t
>> know where to set in ManifoldCF?
>>
>>
>>
>> *From:* Karl Wright <[email protected]>
>> *Sent:* Thursday, 11 October 2018 12:24
>> *To:* [email protected]
>> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>> Hi Mario,
>>
>>
>>
>> (1) When you use the Tika server externally, you do not get the
>> boilerpipe HTML extractor available for configuration and use.  That is
>> because it's external now.
>>
>> (2) In your Solr connection, you want to uncheck the box that says "use
>> extracting update handler", and you want to change the output handler from
>> "/update/extract" to just "/update".
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Thu, Oct 11, 2018 at 4:45 AM Bisonti Mario <[email protected]>
>> wrote:
>>
>> Hello.
>>
>> I would like to use a Tika server, started from the command line, with
>> ManifoldCF, so that ManifoldCF uses it as a transformation connector:
>> documents are processed with Tika and then indexed through the Solr
>> output connector.
>>
>>
>>
>> I started Tika server:
>> java -jar /opt/tika/tika-server-1.19.1.jar
>>
>>
>>
>> Then I created a transformation connection with Tika server: localhost
>> and Tika port 998, and the connection works.
>>
>>
>>
>> Then I created a job and, in the Connection tab, I inserted the
>> transformation I had already created, before the Solr output.
>>
>>
>>
>> Note that I don’t see the “Exception” and “Boilerplate” tabs.
>>
>> Why is that?
>>
>>
>>
>> Furthermore, if I start the job, I see that Solr hangs with exception:
>>
>> 2018-10-11 10:03:47.268 WARN  (qtp1223240796-17) [   x:core_share]
>> o.e.j.s.HttpChannel /solr/core_share/update/extract
>>
>> java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
>>
>>         at java.lang.Class.forName0(Native Method) ~[?:?]
>>
>>         at java.lang.Class.forName(Class.java:374) ~[?:?]
>>
>>
>>
>> In fact, I renamed the Tika .jar in the folder
>> solr/contrib/extraction/lib to be sure that Solr doesn’t use Tika, because
>> I would like ManifoldCF to use Tika, but it doesn’t work.
>>
>>
>>
>> I suppose I have to configure Solr not to use Tika.
>>
>>
>>
>> How to do this?
>>
>>
>>
>> I found
>> https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/107708451/Data+Extraction+Tika+Embedded+in+Solr+Deactivation+Configuration
>> but I don’t have Datafari, so, in a standard Solr configuration, how can I
>> deactivate Tika?
>>
>>
>>
>> Thanks a lot
>>
>>
>>
>> Mario
>>
>>
>>
>>
