Hi Karl,
What do I have to do to make Tika declare the extracted plain text with
MIME type text/plain in my setup?
As I said, I have a Tika extractor in place:
Pipeline:
1) Webcrawler Connector (Repository Connection)
2) Tika Extractor (Transformation)
3) Solr Connector (Output Connection,
Extracting Update Handler disabled)
In my setup, this transformer does not set the MIME type to "text/plain"
via RepositoryDocument.setMimeType(). It only asks the downstream pipeline
whether text/plain is indexable, and then sends the extracted text along
with the original MIME type.
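
To illustrate what I am seeing, here is a minimal Java sketch. All class and
method names in it are hypothetical stand-ins, not the real ManifoldCF API,
and the "text/plain*" acceptance rule is only my assumption about what the
output side checks. It contrasts a transformer that keeps the original MIME
type with one that relabels the extracted text:

```java
import java.util.Locale;

// Hypothetical stand-ins for illustration; these are NOT ManifoldCF classes.
class MimeTypeSketch {

    /** Minimal document model: content plus the MIME type it is labelled with. */
    static final class Doc {
        final String mimeType;
        final String content;

        Doc(String mimeType, String content) {
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /** Assumed output-side rule: only types starting with text/plain pass. */
    static boolean outputAccepts(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        return mimeType.trim().toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    /** What I observe: text is extracted, but the original MIME type is kept. */
    static Doc extractKeepingOriginalType(Doc in) {
        return new Doc(in.mimeType, stripTags(in.content));
    }

    /** What I expected: the extracted text is relabelled as text/plain. */
    static Doc extractRelabelling(Doc in) {
        return new Doc("text/plain; charset=utf-8", stripTags(in.content));
    }

    /** Crude placeholder for real text extraction. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] args) {
        Doc page = new Doc("text/html; charset=UTF-8", "<p>Hello</p>");
        // First document keeps text/html, second is relabelled text/plain.
        System.out.println(outputAccepts(extractKeepingOriginalType(page).mimeType));
        System.out.println(outputAccepts(extractRelabelling(page).mimeType));
    }
}
```

Under these assumptions, the first document would be rejected and the second
accepted, which matches the rejection I see in the history.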
My output connection:
https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d
My job/pipeline configuration:
https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b
History screenshot attached (hope that works on mailing lists...)
My MCF version is trunk (r1865689).
Markus
On 23.08.2019 at 01:17, Karl Wright wrote:
> Hi Markus,
>
> If you use the straight update handler, with no Tika filter, then the
> Solr Connector by design restricts input to textual documents. We can
> perhaps broaden that to web pages but then you will be indexing HTML
> tags as well and I rather doubt that's what you want.
>
> If you run Tika within ManifoldCF, the mime type it presents to the
> update handler is text/plain.
> If you run via the extracting update handler, then there is no content
> type check done by the Solr connector.
>
> Karl
>
>
> On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch wrote:
>
> Hi,
>
> I am playing around with the SolrJ mode of the Solr output connector to
> avoid running Tika extraction in Solr.
>
> My problem is that the ingestion of web pages gets rejected with the
> message
>
> "Solr connector rejected document due to mime type restrictions:
> (text/html; charset=UTF-8)"
>
> My pipeline looks like this:
>
> 1) Webcrawler Connector (Repository Connection)
> 2) Tika Extractor (Transformation)
> 3) Solr Connector (Output Connection)
>
> The webserver returns content type "text/html; charset=UTF-8" for
> the pages.
>
> The "Use extracting request handler" option is disabled in the Solr
> output connection.
>
> The MIME type inclusions in the Solr output connector are:
>
> text/plain;charset=utf-8
> text/html
> text/html; charset=UTF-8
>
> I think the ingestion gets rejected by the HttpPoster, because it
> performs a hard check that the mime type has to be a "text/plain*" type
> (see acceptableMimeTypes in HttpPoster).
>
> The TikaExtractor asks whether the downstream pipeline accepts
> "text/plain;charset=utf-8", as this is the result of the extraction. But
> the RepositoryDocument it sends still carries the original MIME type from
> before the extraction.
>
> Is this a bug or am I missing something?
>
> Many thanks in advance
> Markus
>