Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

Markus Schuch Thu, 22 Aug 2019 23:17:56 -0700

Hi Karl,

What do I have to do to make Tika declare the extracted plain text with
mime type text/plain in my setup?


As I said, I have a Tika extractor in place:

    Pipeline:
    1) Webcrawler Connector (Repository Connection)
    2) Tika Extractor (Transformation)
    3) Solr Connector (Output Connection,
                       Extracting Update Handler disabled)

This transformer does not call RepositoryDocument.setMimeType() with the
value "text/plain". It only asks the downstream pipeline whether
text/plain is indexable, and then sends the extracted text along with
the original mime type in my setup.
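
To illustrate what I believe is happening, here is a minimal,
self-contained Java sketch. It is not ManifoldCF code: Doc,
outputAccepts and the two extract* methods are hypothetical names, and
the tag-stripping regex is only a stand-in for a real Tika extraction.
It assumes the output side accepts only mime types starting with
"text/plain", as I understand the HttpPoster check.

```java
import java.util.Locale;

public class MimeTypeMismatchSketch {

    // Hypothetical stand-in for a document travelling through the pipeline.
    static final class Doc {
        final String mimeType;
        final String body;

        Doc(String mimeType, String body) {
            this.mimeType = mimeType;
            this.body = body;
        }
    }

    // Assumed output-side check: only "text/plain*" mime types are accepted.
    static boolean outputAccepts(String mimeType) {
        return mimeType != null
            && mimeType.toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    // Crude placeholder for text extraction (NOT what Tika actually does).
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    // Behaviour I observe: downstream is asked about text/plain, but the
    // forwarded document keeps the original mime type.
    static Doc extractKeepingOriginalMimeType(Doc in) {
        if (!outputAccepts("text/plain;charset=utf-8")) {
            return null; // downstream would not index plain text at all
        }
        return new Doc(in.mimeType, stripTags(in.body));
    }

    // Behaviour I expected: the forwarded document is labelled as plain text.
    static Doc extractSettingPlainText(Doc in) {
        if (!outputAccepts("text/plain;charset=utf-8")) {
            return null;
        }
        return new Doc("text/plain;charset=utf-8", stripTags(in.body));
    }

    public static void main(String[] args) {
        Doc page = new Doc("text/html; charset=UTF-8",
                           "<html><body>Hello</body></html>");
        Doc observed = extractKeepingOriginalMimeType(page);
        Doc expected = extractSettingPlainText(page);
        System.out.println(observed.mimeType + " accepted: "
            + outputAccepts(observed.mimeType));
        System.out.println(expected.mimeType + " accepted: "
            + outputAccepts(expected.mimeType));
    }
}
```

In this model the first variant is rejected by the output check even
though the body is already plain text, which matches the rejection
message I see in the history.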

My output connection:
https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d

My job/pipeline configuration:
https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b

History screenshot attached (hope that works on mailing lists...)

My MCF Version is trunk (r1865689)

Markus


On 23.08.2019 at 01:17, Karl Wright wrote:
> Hi Markus,
>
> If you use the straight update handler, with no Tika filter, then the
> Solr Connector by design restricts input to textual documents.  We can
> perhaps broaden that to web pages but then you will be indexing HTML
> tags as well and I rather doubt that's what you want.
>
> If you run Tika within ManifoldCF, the mime type it presents to the
> update handler is text/plain.
> If you run via the extracting update handler, then there is no content
> type check done by the Solr connector.
>
> Karl
>
>
> On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch <[email protected]
> <mailto:[email protected]>> wrote:
>
>     Hi,
>
>     I am playing around with the SolrJ mode of the Solr output connector to
>     avoid running Tika extraction in Solr.
>
>     My problem is that the ingestion of web pages gets rejected with the
>     message
>
>         "Solr connector rejected document due to mime type restrictions:
>         (text/html; charset=UTF-8)"
>
>     My pipeline looks like this:
>
>         1) Webcrawler Connector (Repository Connection)
>         2) Tika Extractor (Transformation)
>         3) Solr Connector (Output Connection)
>
>     The webserver returns content type "text/html; charset=UTF-8" for
>     the pages.
>
>     The "Use extracting request handler" option is disabled in the Solr
>     output connection.
>
>     The mime type inclusions in the Solr output connector are:
>
>         text/plain;charset=utf-8
>         text/html
>         text/html; charset=UTF-8
>
>     I think the ingestion gets rejected by the HttpPoster, because it
>     performs a hard check that the mime type has to be a "text/plain*" type
>     (see acceptableMimeTypes in HttpPoster).
>
>     The TikaExtractor asks whether the downstream pipeline accepts
>     "text/plain;charset=utf-8", as this is the result of the extraction. But
>     the RepositoryDocument it sends still carries the original mime type
>     from before the extraction.
>
>     Is this a bug or am I missing something?
>
>     Many thanks in advance
>     Markus
>
