On Tue, Nov 18, 2008 at 2:49 AM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> Although the HTMLStripStandardTokenizerFactory will remove HTML tags, they
> are still stored in the index and need to be removed when searching. In my
> case the HTML tags are not needed at all, so I created an HTMLStripTransformer
> for the DIH to remove the HTML tags and save space in the index. I used the
> HTML parser included with Lucene (org.apache.lucene.demo.html). It performs
> well and worked for me when I was using Lucene, before moving to Solr.
>
> What do you think? Is it worth contributing?
Yes. You can contribute this new transformer as an enhancement.
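
For readers following along, here is a rough sketch of what a row-level HTML-stripping transformer could look like. It is not Ahmed's code: he used the Lucene demo HTML parser, while this sketch swaps in the regex from earlier in the thread to stay dependency-free. The class name and the hard-coded "content" column are assumptions for illustration only. As far as I recall, DIH also accepts a plain class exposing a transformRow(Map) method, but please check that against the DIH documentation for your Solr version.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not the contributed HTMLStripTransformer.
class HtmlStripTransformerSketch {
    // Same pattern discussed in this thread: a reluctant match of anything between '<' and '>'.
    private static final Pattern TAG = Pattern.compile("<(.|\\n)*?>");

    // Assumed column name, taken from the field example in the quoted message.
    static final String COLUMN = "content";

    // Replaces the column value with a tag-free copy and returns the same row.
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get(COLUMN);
        if (value instanceof String) {
            row.put(COLUMN, TAG.matcher((String) value).replaceAll(""));
        }
        return row;
    }
}
```

A regex like this only removes the markup itself. It does not decode entities or skip script and style content, which is one reason a real HTML parser is the better basis for the contribution.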
>
> My best wishes,
>
> Regards,
> Ahmed
>
> On Thu, Nov 6, 2008 at 2:39 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
>
>> There is a nice HTML stripper inside Solr.
>> "solr.HTMLStripStandardTokenizerFactory"
>>
>> -----Original Message-----
>> From: Ahmed Hammad [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, November 05, 2008 10:43 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Regex Transformer Error
>>
>> Hi,
>>
>> It works with the attribute regex="&lt;(.|\n)*?&gt;"
>>
>> Sorry for the disturbance.
>>
>> Regards,
>>
>> ahmd
>>
>>
>> On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
>>
>> > Hi,
>> >
>> > I am using the Solr 1.3 data import handler. One of my table fields
>> > contains HTML tags, which I want to strip from the field text. So
>> > obviously I need the RegexTransformer.
>> >
>> > I added transformer="RegexTransformer" attribute to my entity and a
>> > new field with:
>> >
>> > <field sourceColName="content" column="content" regex="English"
>> > replaceWith="XXXXX"/>
>> >
>> > Everything works fine. The text is replaced without any problem. The
>> > problem happens with my regular expression to strip HTML tags. So I
>> > use regex="<(.|\n)*?>". Of course the characters '<' and '>' are not
>> > allowed in XML. I tried the following regex="&lt;(.|\n)*?&gt;" and
>> > regex="&#3C;(.|\n)*?&#3E;" but I get the following error:
>> >
>> > The value of attribute "regex" associated with an element type "field"
>> > must not contain the '<' character. at
>> > com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown
>> > Source) ...
>> >
>> > The full stack trace follows:
>> >
>> > FATAL: Could not create importer. DataImporter config invalid
>> > org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
>> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>> >     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>> >     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>> >     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>> >     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>> >     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>> >     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>> >     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>> >     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>> >     at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
>> >     at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
>> >     at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
>> >     at java.lang.Thread.run(Unknown Source)
>> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context Processing Document #
>> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
>> >     at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
>> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
>> >     ... 17 more
>> > Caused by: org.xml.sax.SAXParseException: The value of attribute "regex" associated with an element type "field" must not contain the '<' character.
>> >     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>> >     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>> >     at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
>> >     ... 19 more
>> >
>> > *description* *The server encountered an internal error (FATAL: Could
>> > not create importer. DataImporter config invalid; same stack trace as
>> > above) that prevented it from fulfilling this request.*
>> >
>> > I appreciate your help.
>> >
>> > Regards,
>> > ahmd
>> >
>> >
>>
>
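
On the regex question quoted above, a small sketch may help show how the escaping behaves. It uses only the JDK's own XML parser, not DIH. The class and method names are made up for illustration, and the one-element config string is a simplified stand-in for a real data-config.xml. It shows that an attribute written with &lt; and &gt; reaches the application as a plain '<' and '>', and that the resulting pattern strips tags, while a raw '<' in the attribute is rejected by the parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration; not part of Solr or DIH.
class RegexAttributeSketch {
    // Parses a one-element XML string and returns its "regex" attribute value.
    // Any parse failure is rethrown as IllegalArgumentException.
    static String readRegexAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getAttribute("regex");
        } catch (Exception e) {
            throw new IllegalArgumentException("config is not well-formed XML", e);
        }
    }

    // Returns true if the XML string is accepted by the parser.
    static boolean parses(String xml) {
        try {
            readRegexAttribute(xml);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Reads the pattern from the config and removes every match from the input.
    static String strip(String configXml, String input) {
        return input.replaceAll(readRegexAttribute(configXml), "");
    }
}
```

For example, calling RegexAttributeSketch.strip with the config string <field regex="&lt;(.|\n)*?&gt;"/> and an input containing markup should return the input with the tags removed, while RegexAttributeSketch.parses with a raw '<' inside the attribute should return false.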



-- 
--Noble Paul
