Awesome, I'll give it a try.  Thanks, Jack!

On Tue, May 1, 2012 at 10:23 AM, Jack Krupansky <j...@basetechnology.com>wrote:

> Sorry for the confusion. It is doable. If you feed the raw HTML into a
> field that has the HTMLStripCharFilter, the stored value will retain the
> HTML tags, while the indexed text will be stripped of the tags
> during analysis and be searchable just like a normal text field. Then,
> search will not see "<p>".
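>
> As a rough sketch (the type and field names here are just examples, not
> from your schema), such a field type in schema.xml might look like:
>
>   <!-- tags are stripped at analysis time; the stored value keeps the raw HTML -->
>   <fieldType name="text_html" class="solr.TextField">
>     <analyzer>
>       <charFilter class="solr.HTMLStripCharFilterFactory"/>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
>   <field name="content" type="text_html" indexed="true" stored="true"/>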
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: okayndc
> Sent: Tuesday, May 01, 2012 10:08 AM
> To: solr-user@lucene.apache.org
> Subject: Re: extracting/indexing HTML via cURL
>
>
> Thank you Jack.
>
> So, it's not doable/possible to search and highlight keywords within a
> field that contains the raw formatted HTML, and strip out the HTML tags
> during analysis, so that a user would get back nothing if they did a
> search for, e.g., "<p>"?
>
> On Mon, Apr 30, 2012 at 5:17 PM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>
>> I was thinking that you wanted to index the actual text from the HTML
>> page, but have the stored field value still have the raw HTML with tags.
>> If you just want to store only the raw HTML, a simple string field is
>> sufficient, but then you can't easily do a text search on it.
>>
>> Or, you can have two fields: one string field for the raw HTML (stored,
>> but not indexed) and then do a copyField to a text field that has the
>> HTMLStripCharFilter to strip the HTML tags and index only the text
>> (indexed, but not stored).
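>>
>> A minimal sketch of that two-field setup in schema.xml (the field names
>> are placeholders, and "text_html" is assumed to be a text type whose
>> analyzer begins with solr.HTMLStripCharFilterFactory):
>>
>>   <field name="html_raw" type="string" indexed="false" stored="true"/>
>>   <field name="html_text" type="text_html" indexed="true" stored="false"/>
>>   <copyField source="html_raw" dest="html_text"/>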
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: okayndc
>> Sent: Monday, April 30, 2012 5:06 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr: extracting/indexing HTML via cURL
>>
>> Great, thank you for the input.  My understanding of HTMLStripCharFilter
>> is that it strips HTML tags, which is not what I want ~ is this correct?
>> I want to keep the HTML tags intact.
>>
>> On Mon, Apr 30, 2012 at 11:55 AM, Jack Krupansky <j...@basetechnology.com>
>> wrote:
>>
>>
>>  If by "extracting HTML content via cURL" you mean using SolrCell to parse
>>
>>> html files, this seems to make sense. The sequence is that regardless of
>>> the file type, each file extraction "parser" will strip off all
>>> formatting
>>> and produce a raw text stream. Office, PDF, and HTML files are all
>>> treated
>>> the same in that way. Then, the unformatted text stream is sent through
>>> the
>>> field type analyzers to be tokenized into terms that Lucene can index.
>>> The
>>> input string to the field type analyzer is what gets stored for the
>>> field,
>>> but this occurs after the extraction file parser has already removed
>>> formatting.
>>>
>>> There is no way for the formatting to be preserved in that case, other
>>> than to go back to the original input document before extraction parsing.
>>>
>>> If you really do want to preserve the full HTML-formatted text, you
>>> would need to define a field whose field type uses the
>>> HTMLStripCharFilter and then directly add documents that direct the raw
>>> HTML to that field.
>>>
>>> There may be some other way to hook into the update processing chain, but
>>> that may be too much effort compared to the HTML strip filter.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: okayndc
>>> Sent: Monday, April 30, 2012 10:07 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Solr: extracting/indexing HTML via cURL
>>>
>>>
>>> Hello,
>>>
>>> Over the weekend I experimented with extracting HTML content via cURL,
>>> and I'm just wondering why the extraction/indexing process does not
>>> include the HTML tags. It seems as though the HTML tags are either being
>>> ignored or stripped somewhere in the pipeline. If this is the case, is
>>> it possible to include the HTML tags, as I would like to keep the
>>> formatted HTML intact?
>>>
>>> Any help is greatly appreciated.
>>>
>>
>
