Hi Karl, yes, this helps.
The webpage is now ingested after Tika extraction, and I only have to
include the mime type text/html in the Solr output connection.

Many thanks.

Cheers
Markus

On 23.08.2019 at 13:45, Karl Wright wrote:
> Created a ticket: CONNECTORS-1621. Added a fix. Please let me know if
> it resolves the problem for you.
>
> Thanks,
> Karl
>
>
> On Fri, Aug 23, 2019 at 7:33 AM Karl Wright <[email protected]> wrote:
>
> > Hi Markus,
> >
> > You are correct. This code was added as part of
> > https://issues.apache.org/jira/browse/CONNECTORS-1482 . The code
> > that was added does look at the content mime type.
> >
> > The reason that the mime type is not modified in the document being
> > passed to Solr by Tika is that we want Solr to receive the original
> > mime type, because that may be of interest at indexing time. So a
> > filter specified in the Solr connector should always be against the
> > original mime type and not the modified one.
> >
> > Let me make that change.
> >
> > Karl
> >
> >
> > On Fri, Aug 23, 2019 at 6:31 AM Markus Schuch <[email protected]> wrote:
> >
> > > I already have "update" in the handler field. One can see that in
> > > the gist link I posted, and it is not working.
> > >
> > > The HttpPoster of the SolrConnector takes
> > > RepositoryDocument.getMimeType() and checks the mime type against
> > > the hardcoded plain text mime type list, if Solr Cell mode
> > > (extracting request handler mode) is disabled.
> > >
> > > I think
> > > org.apache.manifoldcf.agents.transformation.tika.TikaExtractor
> > > never calling setMimeType on the duplicated RepositoryDocument to
> > > set the MIME type to text/plain might be the source of my problem.
> > >
> > > Markus
> > >
> > > On 23.08.2019 at 10:30, Karl Wright wrote:
> > > > There are two possible ways to configure Tika with Solr.
> > > >
> > > > First way: Tika extractor + Solr update handler
> > > > Second way: no Tika extractor + Solr update/extract handler
> > > >
> > > > For the first way, the Solr Connector completely ignores any
> > > > "accepted mime types" you set for it, and only accepts
> > > > text/plain. For the second way, what you set in the "accepted
> > > > mime types" is used to filter out what is being crawled. You
> > > > NEVER include the charset, by the way, in the mime type you
> > > > specify; that's supposed to get stripped off by anyone who
> > > > passes it between connectors.
> > > >
> > > > Both of these have been extensively used by many others.
> > > >
> > > > So what you need to do, it sounds to me, is change to the Solr
> > > > update handler. That's not just unchecking the box, it is also
> > > > entering "update" rather than "update/extract" in the handler
> > > > field.
> > > >
> > > > If you still use the update/extract handler, you are essentially
> > > > invoking Tika twice, which is why we don't really support this
> > > > option very well. But you should be able to just have it accept
> > > > "text/plain" and it should work. OR uncheck the box and it
> > > > should just default to allowing "text/plain" with no other
> > > > options accepted.
> > > >
> > > > Karl
> > > >
> > > >
> > > > On Fri, Aug 23, 2019 at 2:17 AM Markus Schuch <[email protected]> wrote:
> > > >
> > > > > Hi Karl,
> > > > >
> > > > > what do I have to do to make Tika declare the extracted plain
> > > > > text with mime type text/plain in my setup?
> > > > >
> > > > > As I said, I have a Tika extractor in place:
> > > > >
> > > > > Pipeline:
> > > > > 1) Webcrawler Connector (Repository Connection)
> > > > > 2) Tika Extractor (Transformation)
> > > > > 3) Solr Connector (Output Connection,
> > > > >    Extracting Update Handler disabled)
> > > > >
> > > > > This transformer does not populate the
> > > > > RepositoryDocument.setMimeType() field with the value
> > > > > "text/plain".
> > > > > It just asks the downstream pipeline if text/plain is
> > > > > indexable, but it then sends the extracted text along with
> > > > > the original mime type in my setup.
> > > > >
> > > > > My output connection:
> > > > > https://gist.github.com/schuch/e8e00d22467552bf5b9354946291f15d
> > > > >
> > > > > My job/pipeline configuration:
> > > > > https://gist.github.com/schuch/207e3a9f8de8f8481e9dbcdb69ebca5b
> > > > >
> > > > > History screenshot attached (hope that works on mailing
> > > > > lists...)
> > > > >
> > > > > My MCF version is trunk (r1865689)
> > > > >
> > > > > Markus
> > > > >
> > > > >
> > > > > On 23.08.2019 at 01:17, Karl Wright wrote:
> > > > > > Hi Markus,
> > > > > >
> > > > > > If you use the straight update handler, with no Tika
> > > > > > filter, then the Solr Connector by design restricts input
> > > > > > to textual documents. We can perhaps broaden that to web
> > > > > > pages, but then you will be indexing HTML tags as well and
> > > > > > I rather doubt that's what you want.
> > > > > >
> > > > > > If you run Tika within ManifoldCF, the mime type it
> > > > > > presents to the update handler is text/plain.
> > > > > > If you run via the extracting update handler, then there is
> > > > > > no content type check done by the Solr connector.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 22, 2019 at 5:44 PM Markus Schuch <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am playing around with the solrj mode of the Solr
> > > > > > > output connector, to avoid running Tika extraction in
> > > > > > > Solr.
> > > > > > >
> > > > > > > My problem is that the ingestion of web pages gets
> > > > > > > rejected with the message
> > > > > > >
> > > > > > > "Solr connector rejected document due to mime type restrictions: (text/html; charset=UTF-8)"
> > > > > > >
> > > > > > > My pipeline looks like this:
> > > > > > >
> > > > > > > 1) Webcrawler Connector (Repository Connection)
> > > > > > > 2) Tika Extractor (Transformation)
> > > > > > > 3) Solr Connector (Output Connection)
> > > > > > >
> > > > > > > The webserver returns content type
> > > > > > > "text/html; charset=UTF-8" for the pages.
> > > > > > >
> > > > > > > The "Use extracting request handler" option is disabled
> > > > > > > in the Solr output connection.
> > > > > > >
> > > > > > > The mimetype inclusions in the Solr output connector are:
> > > > > > >
> > > > > > > text/plain;charset=utf-8
> > > > > > > text/html
> > > > > > > text/html; charset=UTF-8
> > > > > > >
> > > > > > > I think the ingestion gets rejected by the HttpPoster,
> > > > > > > because it performs a hard check that the mime type has
> > > > > > > to be a "text/plain*" type (see acceptableMimeTypes in
> > > > > > > HttpPoster).
> > > > > > >
> > > > > > > The TikaExtractor asks if the downstream pipeline accepts
> > > > > > > "text/plain;charset=utf-8", as this is the result of the
> > > > > > > extraction. But the sent RepositoryDocument still carries
> > > > > > > the original mimetype from before the extraction.
> > > > > > >
> > > > > > > Is this a bug or am I missing something?
> > > > > > >
> > > > > > > Many thanks in advance
> > > > > > > Markus
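
The mime type matching discussed in this thread (strip the charset parameter, then compare the bare type against the connector's accepted list) can be illustrated with a small sketch. This is a minimal Python illustration under stated assumptions, not ManifoldCF code: the function names `base_mime_type` and `is_accepted` are hypothetical, and the real check lives in the Java HttpPoster class, whose exact logic is not reproduced here.

```python
from typing import Iterable


def base_mime_type(content_type: str) -> str:
    """Drop parameters such as '; charset=UTF-8' and lowercase the rest.

    Hypothetical helper for illustration only.
    """
    return content_type.split(";", 1)[0].strip().lower()


def is_accepted(content_type: str, accepted_mime_types: Iterable[str]) -> bool:
    """Check the parameter-free mime type against a configured list.

    The configured entries are assumed to be bare types without a charset,
    as recommended in the thread.
    """
    accepted = {entry.strip().lower() for entry in accepted_mime_types}
    return base_mime_type(content_type) in accepted


# Content-Type reported by the web server in the thread.
original = "text/html; charset=UTF-8"

print(base_mime_type(original))                            # text/html
print(is_accepted(original, ["text/plain"]))               # False
print(is_accepted(original, ["text/plain", "text/html"]))  # True
```

With only text/plain in the list, the original type of the web page is rejected. Once the filter is applied to the original mime type and text/html is listed, the same document passes, which matches the outcome reported at the top of the thread.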
