Re: Solr Connector rejects Webpages when using TikaExtractor and SolrJ mode

Markus Schuch Fri, 23 Aug 2019 13:10:21 -0700

Hi Karl,

Yes, this helps.

The webpage is now ingested after Tika extraction, and I only have to
include the mime type text/html in the Solr output connection.

Many thanks.

Cheers
Markus

On 23.08.2019 at 13:45, Karl Wright wrote:
> Created a ticket: CONNECTORS-1621.  Added a fix.  Please let me know if
> it resolves the problem for you.
>
> Thanks,
> Karl
>
>
> On Fri, Aug 23, 2019 at 7:33 AM Karl Wright <[email protected]> wrote:
>
>     Hi Markus,
>
>     You are correct. This code was added as part of
>     https://issues.apache.org/jira/browse/CONNECTORS-1482 . The code
>     that was added does look at the content mime type.
>
>     The mime type is not modified in the document being passed to Solr
>     by Tika because we want Solr to receive the original mime type,
>     which may be of interest at indexing time.  So a filter specified
>     in the Solr connector should always be applied against the
>     original mime type and not the modified one.
>
>     Let me make that change.
>
>     Karl
>
>
>     On Fri, Aug 23, 2019 at 6:31 AM Markus Schuch <[email protected]> wrote:
>
>         I already have "update" in the handler field, as one can see in
>         the gist link I posted, and it is not working.
>
>         If Solr Cell mode (extracting request handler mode) is disabled,
>         the HttpPoster of the SolrConnector takes
>         RepositoryDocument.getMimeType() and checks the mime type
>         against the hardcoded plain text mime type list.
>
>         I think the source of my problem might be that
>         org.apache.manifoldcf.agents.transformation.tika.TikaExtractor
>         never calls setMimeType on the duplicated RepositoryDocument to
>         set the MIME type to text/plain.
>
>         Markus
>
>         On 23.08.2019 at 10:30, Karl Wright wrote:
>         > There are two possible ways to configure Tika with Solr.
>         > First way: Tika extractor + Solr update handler
>         > Second way: no Tika extractor + Solr update/extract handler
>         >
>         > For the first way, the Solr Connector completely ignores any
>         > "accepted mime types" you set for it, and only accepts
>         > text/plain.  For the second way, what you set in the "accepted
>         > mime types" is used to filter out what is being crawled.  You
>         > NEVER include the charset, by the way, in the mime type you
>         > specify; that's supposed to get stripped off by anyone who
>         > passes it between connectors.
>         >
>         > Both of these have been extensively used by many others.
>         >
>         > So it sounds to me like what you need to do is change to the
>         > Solr update handler.  That's not just unchecking the box, it is
>         > also entering "update" rather than "update/extract" in the
>         > handler field.
>         >
>         > If you still use the update/extract handler, you are essentially
>         > invoking Tika twice, which is why we don't really support this
>         > option very well.  But you should be able to just have it accept
>         > "text/plain" and it should work.  OR uncheck the box and it
>         > should just default to allowing "text/plain" with no other
>         > options accepted.
>         >
>         > Karl
>         >
>         >
>         > On Fri, Aug 23, 2019 at 2:17 AM Markus Schuch
>         > <[email protected]> wrote:
>         >
>         >     Hi Karl,
>         >
>         >     What do I have to do to make Tika declare the extracted
>         >     plain text with mime type text/plain in my setup?
>         >
>         >     As I said, I have a Tika extractor in place:
>         >
>         >         Pipeline:
>         >         1) Webcrawler Connector (Repository Connection)
>         >         2) Tika Extractor (Transformation)
>         >         3) Solr Connector (Output Connection,
>         >                            Extracting Update Handler disabled)
>         >
>         >     This transformer does not call
>         >     RepositoryDocument.setMimeType() with the value "text/plain".
>         >     It just asks the downstream pipeline if text/plain is
>         >     indexable, but it then sends the extracted text along with
>         >     the original mime type in my setup.
>         >
>         >     My output connection:
>         >     https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d
>         >
>         >     My job/pipeline configuration:
>         >     https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b
>         >
>         >     History screenshot attached (hope that works on mailing
>         >     lists...)
>         >
>         >     My MCF Version is trunk (r1865689)
>         >
>         >     Markus
>         >
>         >
>         >     On 23.08.2019 at 01:17, Karl Wright wrote:
>         >     > Hi Markus,
>         >     >
>         >     > If you use the straight update handler, with no Tika filter,
>         >     > then the Solr Connector by design restricts input to textual
>         >     > documents.  We can perhaps broaden that to web pages, but then
>         >     > you will be indexing HTML tags as well and I rather doubt
>         >     > that's what you want.
>         >     >
>         >     > If you run Tika within ManifoldCF, the mime type it presents
>         >     > to the update handler is text/plain.
>         >     > If you run via the extracting update handler, then there is
>         >     > no content type check done by the Solr connector.
>         >     >
>         >     > Karl
>         >     >
>         >     >
>         >     > On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch
>         >     > <[email protected]> wrote:
>         >     >
>         >     >     Hi,
>         >     >
>         >     >     I am playing around with the SolrJ mode of the Solr output
>         >     >     connector, to avoid running Tika extraction in Solr.
>         >     >
>         >     >     My problem is that the ingestion of web pages gets rejected
>         >     >     with the message
>         >     >
>         >     >         "Solr connector rejected document due to mime type restrictions:
>         >     >         (text/html; charset=UTF-8)"
>         >     >
>         >     >     My pipeline looks like this:
>         >     >
>         >     >         1) Webcrawler Connector (Repository Connection)
>         >     >         2) Tika Extractor (Transformation)
>         >     >         3) Solr Connector (Output Connection)
>         >     >
>         >     >     The webserver returns content type "text/html; charset=UTF-8"
>         >     >     for the pages.
>         >     >
>         >     >     The "Use extracting request handler" option is disabled in
>         >     >     the Solr output connection.
>         >     >
>         >     >     The mimetype inclusions in the Solr output connector are:
>         >     >
>         >     >         text/plain;charset=utf-8
>         >     >         text/html
>         >     >         text/html; charset=UTF-8
>         >     >
>         >     >     I think the ingestion gets rejected by the HttpPoster,
>         >     >     because it performs a hard check that the mime type has to
>         >     >     be a "text/plain*" type (see acceptableMimeTypes in
>         >     >     HttpPoster).
>         >     >
>         >     >     The TikaExtractor asks if the downstream pipeline accepts
>         >     >     "text/plain;charset=utf-8", as this is the result of the
>         >     >     extraction. But the sent RepositoryDocument still carries
>         >     >     the original mimetype from before the extraction.
>         >     >
>         >     >     Is this a bug or am I missing something?
>         >     >
>         >     >     Many thanks in advance
>         >     >     Markus
>         >     >
>         >
>
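
The behavior the thread converges on (strip the charset parameter, then
check the original mime type against the types included in the Solr
output connection) can be sketched in Java roughly as follows. This is
only an illustration under stated assumptions, not the code in
HttpPoster or the CONNECTORS-1621 fix: the class MimeTypeFilterSketch
and its methods baseType and isAccepted are hypothetical names, and the
included types are assumed to be configured in lower case without a
charset.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of ManifoldCF.
final class MimeTypeFilterSketch {
    private MimeTypeFilterSketch() {}

    /** Returns the mime type without parameters (e.g. charset), lower-cased; null stays null. */
    static String baseType(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Checks the ORIGINAL mime type of a document (as reported by the
     * repository, before any extraction) against the configured inclusions.
     * Assumption: includedMimeTypes holds lower-case types without parameters.
     */
    static boolean isAccepted(String originalMimeType, Set<String> includedMimeTypes) {
        String base = baseType(originalMimeType);
        return base != null && includedMimeTypes.contains(base);
    }
}
```

With the inclusions text/plain and text/html, a call such as
MimeTypeFilterSketch.isAccepted("text/html; charset=UTF-8",
Set.of("text/plain", "text/html")) accepts the page from the original
report. In this sketch an inclusion entry like "text/html; charset=UTF-8"
would never match, which is in line with the advice above to leave the
charset out of the configured mime types.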
