RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
There's also, of course, tika-server. No matter the method, it is always best to isolate Tika in its own JVM, VM, or machine.

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 9, 2018 4:15 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service https://github.com/mattflax/dropwizard-tika-server written by a colleague of mine at Flax. Hope this is useful.

Cheers

Charlie
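For readers who want to try the tika-server route, here is a minimal sketch of a client, written in Python with only the standard library. It assumes a tika-server instance is already running on its default port 9998 on localhost; the function names and the timeout are illustrative and not taken from this thread.

```python
import urllib.request

# Assumption: tika-server is listening locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking tika-server for the plain-text body of a document."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send one file to tika-server and return the extracted text.

    A parse failure surfaces here as an HTTP error in the client,
    in a process that is separate from Solr.
    """
    with open(path, "rb") as fh:
        req = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Because the parser lives in another process, a document that makes Tika fail or hang only affects that process, and the caller can decide whether to skip or retry it.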
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I actually used Solr 5.x, the MoreLikeThis feature, and a subset of human-tagged data (about 10%) to apply subject coding to over 2 million documents with around a 95% accuracy rate, so it is definitely doable.

On Tue, Apr 10, 2018 at 10:40 AM, Alexandre Rafalovitch wrote:
> I know it was a joke, but I've been thinking of something like that.
> Not a chatbot per se, but perhaps something that uses Machine
> Learning/topic clustering on the past discussions and matches them to
> the new questions. Still would need to be rechecked by a human for
> final response, but could be very helpful. I certainly wished for that
> many times as I was answering newbies' questions (or my own).
>
> And, I feel, the current version of Solr actually has all the pieces to
> make such a thing happen. Could be a fun project/demo/service for
> the next LuceneSolrRevolution for somebody with time on their hands
> :-)
>
> Regards,
>    Alex.
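The reply above does not say how the MoreLikeThis results were turned into subject codes, so the following is only one plausible reading: a similarity-weighted vote among the human-tagged documents that MoreLikeThis returns for an untagged one. The `subject` field name and the presence of a `score` on each returned document are assumptions made for this sketch.

```python
from collections import defaultdict


def vote_subject(neighbors, field="subject"):
    """Return the label with the highest summed similarity score.

    `neighbors` is a list of dicts, e.g. the documents returned by a
    MoreLikeThis query; untagged neighbours are ignored.
    """
    totals = defaultdict(float)
    for doc in neighbors:
        label = doc.get(field)
        if label is not None:
            # Missing scores count as 1.0, which reduces to a plain majority vote.
            totals[label] += float(doc.get("score", 1.0))
    if not totals:
        return None
    return max(totals, key=totals.get)
```

With roughly 10% of the corpus tagged by hand, each untagged document would be labelled from its nearest tagged neighbours, and a sample of the results checked by a human to estimate accuracy.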
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I know it was a joke, but I've been thinking of something like that. Not a chatbot per se, but perhaps something that uses Machine Learning/topic clustering on the past discussions and matches them to the new questions. Still would need to be rechecked by a human for final response, but could be very helpful. I certainly wished for that many times as I was answering newbies' questions (or my own).

And, I feel, the current version of Solr actually has all the pieces to make such a thing happen. Could be a fun project/demo/service for the next LuceneSolrRevolution for somebody with time on their hands :-)

Regards,
   Alex.

On 9 April 2018 at 13:24, Allison, Timothy B. wrote:
> +1
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> We should add a chatbot to the list that includes Charlie's advice and the
> link to Erick's blog post whenever Tika is used.
RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
Oh this is great! Saves me a whole bunch of manual work. Thanks!

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 09, 2018 2:15 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service https://github.com/mattflax/dropwizard-tika-server written by a colleague of mine at Flax. Hope this is useful.

Cheers

Charlie
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service https://github.com/mattflax/dropwizard-tika-server written by a colleague of mine at Flax. Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder wrote:
> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
Thank you Charlie, Tim.
I will integrate Tika in my Java app and use SolrJ to send data to Solr.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, April 09, 2018 11:24 AM
To: solr-user@lucene.apache.org
Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

+1

https://lucidworks.com/2012/02/14/indexing-with-solrj/

We should add a chatbot to the list that includes Charlie's advice and the link to Erick's blog post whenever Tika is used.
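The plan above uses SolrJ from a Java application. The same "extract first, then send plain fields to Solr" flow can be sketched in Python against Solr's JSON update endpoint; this is a stand-in for SolrJ, not an example of it. The host, the collection name `docs`, and the field names `id` and `content` are placeholders that would need to match a real installation and schema.

```python
import json
import urllib.request

# Assumptions: Solr on localhost:8983, a collection named "docs",
# and schema fields "id" and "content". All are placeholders.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_solr_update(doc_id, text, url=SOLR_UPDATE_URL):
    """Build a POST request that adds one already-extracted document to Solr."""
    payload = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_document(doc_id, text, url=SOLR_UPDATE_URL, timeout=30.0):
    """Send the document to Solr and return the HTTP status code."""
    req = build_solr_update(doc_id, text, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Solr only ever receives text that has already been extracted, so the Tika parser (and any exception it throws) stays on the client side.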
RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
+1

https://lucidworks.com/2012/02/14/indexing-with-solrj/

We should add a chatbot to the list that includes Charlie's advice and the link to Erick's blog post whenever Tika is used.

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 9, 2018 12:44 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it bringing down your Solr installation.

Cheers

Charlie
Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it bringing down your Solr installation.

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder wrote:
> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we
> have in our SharePoint system. I have used the tika-app.jar directly to
> extract the document in question and it does _not_ throw an exception and
> extracts the contents just fine. So it would seem Solr is doing something
> different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the HTML
> document to Tika. As Tika limits nested elements to 100, this causes Tika
> to throw an exception: Suspected zip bomb: 100 levels of XML element
> nesting. This is mentioned in TIKA-2091
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism but no
> details have been provided on how to configure this in Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr to
> effectively bypass its built-in MostlyPassthroughHtmlMapper and use Tika's
> implementation.
>
> Thank you!
> Harinder
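Charlie's advice to run Tika outside Solr can be sketched as a small wrapper that runs the extraction in a child process with a time limit, so that a failing or runaway parse is reported to the caller instead of taking anything else down. The sketch is in Python with only the standard library; the helper names are made up for illustration, and the command assumes `java` is on the PATH and that `tika-app.jar` (the tool mentioned in the original question) is available at the given path.

```python
import subprocess
import sys


def run_isolated(cmd, timeout):
    """Run an extraction command in a child process.

    Returns (True, stdout) on success, or (False, reason) if the child
    exits with a non-zero status or exceeds the time limit.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout


def tika_app_command(jar_path, doc_path):
    """Command line asking tika-app for plain-text output (paths are placeholders)."""
    return ["java", "-jar", jar_path, "--text", doc_path]
```

A caller would invoke `run_isolated(tika_app_command("tika-app.jar", "page.html"), timeout=60)` and only forward the text to Solr when the first element of the result is `True`.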
How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
Hello!

Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we have in our SharePoint system. I have used the tika-app.jar directly to extract the document in question and it does _not_ throw an exception and extracts the contents just fine. So it would seem Solr is doing something different than a Tika standalone installation.

After some Googling, I found out that Solr uses its custom HtmlMapper (MostlyPassthroughHtmlMapper) which passes through all elements in the HTML document to Tika. As Tika limits nested elements to 100, this causes Tika to throw an exception: Suspected zip bomb: 100 levels of XML element nesting. This is mentioned in TIKA-2091 (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The "solution" is to use Tika's default parsing/mapping mechanism but no details have been provided on how to configure this in Solr.

I'm hoping some folks here have the knowledge on how to configure Solr to effectively bypass its built-in MostlyPassthroughHtmlMapper and use Tika's implementation.

Thank you!
Harinder

NOTICE -
This communication is intended ONLY for the use of the person or entity named above and may contain information that is confidential or legally privileged. If you are not the intended recipient named above or a person responsible for delivering messages or communications to the intended recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying of this communication or any of the information contained in it is strictly prohibited. If you have received this communication in error, please notify us immediately by telephone and then destroy or delete this communication, or return it to us by mail if requested by us. The City of Calgary thanks you for your attention and co-operation.
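The effect described in the question, where passing every HTML element through makes a nesting limit of 100 easy to hit, can be illustrated with a rough depth counter. This is a Python sketch using only the standard library; it is not Tika's or Solr's code, and the `keep` set is a made-up stand-in for "a mapper that forwards only some elements".

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so never add nesting.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DepthCounter(HTMLParser):
    """Track the deepest open-element nesting seen while parsing."""

    def __init__(self, keep=None):
        super().__init__()
        self.keep = keep          # None means every element is passed through
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID or (self.keep is not None and tag not in self.keep):
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray or filtered end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def max_depth(html, keep=None):
    parser = DepthCounter(keep)
    parser.feed(html)
    parser.close()
    return parser.max_depth
```

For a page made of 120 nested `<div>` wrappers inside `<html><body>`, `max_depth(page)` counts every wrapper, while `max_depth(page, keep={"html", "body", "p"})` counts only the kept elements. That is the kind of difference that could explain why the same document trips the limit in one setup and not in the other.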