RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
There's also, of course, tika-server. No matter the method, it is always best to isolate Tika in its own JVM, VM, or machine.
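For anyone taking the tika-server route, here is a minimal sketch of a client, assuming a tika-server instance is already running on its default port (9998) on the same host. The class and method names are made up for illustration; the only server behaviour relied on is that a PUT to the /tika resource with an Accept: text/plain header returns the extracted text.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerClient {

    // Assumption: tika-server listens on its default port on this machine.
    static final URI DEFAULT_BASE = URI.create("http://localhost:9998");

    /** Builds a PUT to the /tika resource that asks for plain-text output. */
    static HttpRequest buildExtractRequest(URI base, byte[] document) {
        return HttpRequest.newBuilder(base.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the document and returns the extracted text; any non-200 reply becomes an IOException. */
    static String extractText(HttpClient client, URI base, byte[] document)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildExtractRequest(base, document),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            // A parse failure stays on the Tika side; the caller just sees an error status.
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TikaServerClient.java /path/to/document
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        System.out.println(extractText(HttpClient.newHttpClient(), DEFAULT_BASE, document));
    }
}
```

Because the parsing happens in the server process, a document that blows up Tika costs the caller one failed HTTP request rather than the JVM that hosts Solr.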
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I actually used Solr 5.x, its More Like This feature, and a subset of human-tagged data (about 10%) to apply subject coding with around a 95% accuracy rate to over 2 million documents, so it is definitely doable.
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I know it was a joke, but I've been thinking of something like that. Not a chatbot per se, but perhaps something that uses machine learning/topic clustering on past discussions and matches them to new questions. It would still need to be rechecked by a human for the final response, but could be very helpful. I certainly wished for that many times as I was answering newbies' questions (or my own).

And, I feel, the current version of Solr actually has all the pieces to make such a thing happen. Could be a fun project/demo/service for the next Lucene/Solr Revolution for somebody with time on their hands :-)

Regards,
Alex.
RE: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
Oh this is great! Saves me a whole bunch of manual work. Thanks!
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
As a bonus, here's a Dropwizard Tika wrapper that gives you a Tika web service, written by a colleague of mine at Flax: https://github.com/mattflax/dropwizard-tika-server

Hope this is useful.

Cheers

Charlie
RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
Thank you Charlie, Tim. I will integrate Tika in my Java app and use SolrJ to send data to Solr.
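The second half of that plan, sending already-extracted text to Solr from a separate Java application, might look roughly like the sketch below. Note the swap: the message names SolrJ, whereas this sketch posts to Solr's JSON update handler over plain HTTP so that it has no dependencies beyond the JDK. The collection name "docs", the field name "content", and the host/port are assumptions for illustration only.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrTextIndexer {

    /** Escapes a value so it can sit inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** One-document update body; "content" is a hypothetical field name. */
    static String toUpdateJson(String id, String content) {
        return "[{\"id\":\"" + jsonEscape(id) + "\",\"content\":\"" + jsonEscape(content) + "\"}]";
    }

    /** Builds a POST to /solr/<collection>/update that commits immediately. */
    static HttpRequest buildUpdateRequest(URI solrBase, String collection, String id, String content) {
        URI target = solrBase.resolve("/solr/" + collection + "/update?commit=true");
        return HttpRequest.newBuilder(target)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toUpdateJson(id, content)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumptions: Solr on localhost:8983 with a collection named "docs".
        // args[0] is the document id, args[1] the text that Tika already extracted.
        HttpRequest request = buildUpdateRequest(
                URI.create("http://localhost:8983"), "docs", args[0], args[1]);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Committing on every request keeps the sketch short; a real indexer would normally batch documents and commit less often.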
RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
+1

https://lucidworks.com/2012/02/14/indexing-with-solrj/

We should add a chatbot to the list that includes Charlie's advice and the link to Erick's blog post whenever Tika is used.
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it from bringing down your Solr installation.

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we
> have in our SharePoint system. I have used the tika-app.jar directly to
> extract the document in question, and it does _not_ throw an exception and
> extracts the contents just fine. So it would seem Solr is doing something
> different than a standalone Tika installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper), which passes through all elements in the HTML
> document to Tika. As Tika limits nested elements to 100, this causes Tika
> to throw an exception: Suspected zip bomb: 100 levels of XML element
> nesting. This is mentioned in TIKA-2091
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism, but no
> details have been provided on how to configure this in Solr.
>
> I'm hoping some folks here know how to configure Solr to effectively
> bypass its built-in MostlyPassthroughHtmlMapper and use Tika's
> implementation.
>
> Thank you!
> Harinder
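To make the nesting limit in the quoted question concrete, here is a small stand-alone sketch of that kind of guard, written against the JDK's own SAX parser. It is not Tika's code: the class name, the exception type, and the exact point at which the limit trips are assumptions for illustration, and only the wording of the error message is borrowed from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class NestingDepthGuard extends DefaultHandler {

    /** Dedicated type so callers can tell "too deep" apart from malformed XML. */
    static final class DepthExceededException extends SAXException {
        DepthExceededException(int depth) {
            super("Suspected zip bomb: " + depth + " levels of XML element nesting");
        }
    }

    private final int maxDepth;
    private int depth = 0;

    NestingDepthGuard(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        depth++;
        // In this sketch the guard trips as soon as the depth reaches the limit.
        if (depth >= maxDepth) {
            throw new DepthExceededException(depth);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Returns true if parsing the XML trips the guard, false if it parses cleanly. */
    static boolean exceedsLimit(String xml, int maxDepth) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new NestingDepthGuard(maxDepth));
            return false;
        } catch (DepthExceededException e) {
            return true;
        }
    }

    /** Builds <a><a>...</a></a> nested to the given number of levels. */
    static String nested(int levels) {
        return "<a>".repeat(levels) + "</a>".repeat(levels);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(exceedsLimit(nested(99), 100));
        System.out.println(exceedsLimit(nested(100), 100));
    }
}
```

A mapper that passes every HTML element through means more elements reach a guard like this one, which fits the report that the same file extracts cleanly with tika-app but fails inside Solr.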