Awesome, I'll give it a try. Thanks Jack!

On Tue, May 1, 2012 at 10:23 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> Sorry for the confusion. It is doable. If you feed the raw HTML into a
> field that has the HTMLStripCharFilter, the stored value will retain the
> HTML tags, while the indexed text will be stripped of the tags during
> analysis and be searchable just like a normal text field. Then, search
> will not see "<p>".
>
> -- Jack Krupansky
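For reference, a minimal sketch of the kind of field type Jack describes, written in classic schema.xml syntax; the names text_html and content are illustrative, not taken from the thread:

<!-- Strip HTML at index time; charFilters run before the tokenizer -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- removes tags such as <p> from the character stream before
         tokenization, so they are never indexed and never match a query -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- stored="true" keeps the raw HTML verbatim for display -->
<field name="content" type="text_html" indexed="true" stored="true"/>

Char filters only affect what gets indexed; the stored value is kept verbatim, which is why a search result on this field can still return the original HTML with its tags intact.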
> -----Original Message----- From: okayndc
> Sent: Tuesday, May 01, 2012 10:08 AM
> To: solr-user@lucene.apache.org
> Subject: Re: extracting/indexing HTML via cURL
>
> Thank you Jack.
>
> So, is it doable/possible to search and highlight keywords within a
> field that contains the raw formatted HTML, and strip out the HTML tags
> during analysis, so that a user would get back nothing if they did a
> search for, say, <p>?
>
> On Mon, Apr 30, 2012 at 5:17 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> I was thinking that you wanted to index the actual text from the HTML
>> page, but have the stored field value still have the raw HTML with
>> tags. If you just want to store only the raw HTML, a simple string
>> field is sufficient, but then you can't easily do a text search on it.
>>
>> Or, you can have two fields: one string field for the raw HTML (stored,
>> but not indexed), and then do a copyField to a text field that has the
>> HTMLStripCharFilter to strip the HTML tags and index only the text
>> (indexed, but not stored).
>>
>> -- Jack Krupansky
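A sketch of the two-field arrangement Jack outlines here, again with illustrative names, and reusing the hypothetical text_html type from the earlier sketch:

<!-- raw HTML, returned in results but never searched -->
<field name="content_html" type="string" indexed="false" stored="true"/>

<!-- tag-stripped text, searchable but not returned -->
<field name="content_text" type="text_html" indexed="true" stored="false"/>

<!-- copyField passes the raw incoming value along; content_text's
     analyzer strips the tags at index time -->
<copyField source="content_html" dest="content_text"/>

The tradeoff against the single-field approach is one extra field, in exchange for keeping the display copy and the searchable text explicitly separate.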
>> -----Original Message----- From: okayndc
>> Sent: Monday, April 30, 2012 5:06 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr: extracting/indexing HTML via cURL
>>
>> Great, thank you for the input. My understanding of HTMLStripCharFilter
>> is that it strips HTML tags, which is not what I want ~ is this
>> correct? I want to keep the HTML tags intact.
>>
>> On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>>
>>> If by "extracting HTML content via cURL" you mean using SolrCell to
>>> parse HTML files, this seems to make sense. The sequence is that,
>>> regardless of the file type, each file extraction "parser" will strip
>>> off all formatting and produce a raw text stream. Office, PDF, and
>>> HTML files are all treated the same in that way. Then, the unformatted
>>> text stream is sent through the field type analyzers to be tokenized
>>> into terms that Lucene can index. The input string to the field type
>>> analyzer is what gets stored for the field, but this occurs after the
>>> extraction file parser has already removed the formatting.
>>>
>>> There is no way for the formatting to be preserved in that case, other
>>> than to go back to the original input document before extraction
>>> parsing.
>>>
>>> If you really do want to preserve full HTML formatted text, you would
>>> need to define a field whose field type uses the HTMLStripCharFilter
>>> and then directly add documents that direct the raw HTML to that
>>> field.
>>>
>>> There may be some other way to hook into the update processing chain,
>>> but that may be too much effort compared to the HTML strip filter.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: okayndc
>>> Sent: Monday, April 30, 2012 10:07 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Solr: extracting/indexing HTML via cURL
>>>
>>> Hello,
>>>
>>> Over the weekend I experimented with extracting HTML content via cURL,
>>> and I'm just wondering why the extraction/indexing process does not
>>> include the HTML tags. It seems as though the HTML tags are either
>>> being ignored or stripped somewhere in the pipeline. If this is the
>>> case, is it possible to include the HTML tags, as I would like to keep
>>> the formatted HTML intact?
>>>
>>> Any help is greatly appreciated.
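For context, a SolrCell extraction request of the kind the original post describes would typically look something like this; the URL, document id, and file name are placeholders, not taken from the thread:

# Post an HTML file to the ExtractingRequestHandler (SolrCell)
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@page.html"

Tika parses the file and discards the markup before the text ever reaches the schema analyzers, which is Jack's point above: by the time an HTMLStripCharFilter could run, the tags are already gone.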