Re: how to present html content in browse

2012-05-04 Thread okayndc
Hello,

I'm having a hard time understanding this, and I had this same question.

When using DIH should the HTML field be stored in the raw HTML string field
or the stripped field?
Also what source field(s) need to be copied and to what destination?

Thanks


On Thu, May 3, 2012 at 10:15 PM, Lance Norskog goks...@gmail.com wrote:

 Make two fields, one which stores the stripped HTML and another that
 stores the raw (unstripped) HTML. You can use copyField so that you do not
 have to submit the html page twice.

 You would mark the stripped field 'indexed=true stored=false' and the
 full text field the other way around. The full text field should be a
 String type.

 On Thu, May 3, 2012 at 1:04 PM, srini softtec...@gmail.com wrote:
  I am indexing records from database using DIH. The content of my record is in
  html format. When I use browse
  I would like to show the content in html format, not in text format. Any
  ideas?
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
  Sent from the Solr - User mailing list archive at Nabble.com.



 --
 Lance Norskog
 goks...@gmail.com



Re: how to present html content in browse

2012-05-04 Thread Jack Krupansky
Evidently there was a problem with highlighting of HTML that is supposedly 
fixed in Solr 3.6 and trunk:


https://issues.apache.org/jira/browse/SOLR-42

-- Jack Krupansky

-----Original Message-----
From: okayndc

Sent: Friday, May 04, 2012 4:35 PM
To: solr-user@lucene.apache.org
Subject: Re: how to present html content in browse

Is it possible to return the HTML field highlighted?

On Fri, May 4, 2012 at 1:27 PM, Jack Krupansky j...@basetechnology.com wrote:



1. The raw html field (call it, text_html) would be a string type
field that is stored but not indexed. This is the field you direct DIH
to output to. This is the field you would return in your search results
with the HTML to be displayed.

2. The stripped field (call it, text_stripped) would be a text type
field (where text is a field type you add that uses the HTML strip char
filter as shown below) that is not stored but is indexed. Add a
copyField to your schema that copies from the raw html field to the
stripped field (say, from text_html to text_stripped).
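
As a rough sketch of points 1 and 2, the two fields and the copy rule might look like this in schema.xml. The field names text_html and text_stripped come from the description above; the field type name "text" is an assumption and is expected to be a type that applies HTMLStripCharFilterFactory.

```xml
<!-- Raw HTML: stored so it can be returned and displayed, not indexed. -->
<field name="text_html" type="string" indexed="false" stored="true"/>

<!-- Stripped text: indexed for searching, not stored.
     type="text" is assumed to be a field type using the HTML strip char filter. -->
<field name="text_stripped" type="text" indexed="true" stored="false"/>

<!-- Feed the raw HTML into the stripped field at index time. -->
<copyField source="text_html" dest="text_stripped"/>
```

With this layout, queries run against text_stripped while the result list returns text_html.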

For reference on HTML strip (HTMLStripCharFilterFactory), see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Which has:

<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

You might, though, want to name that field type text_stripped to avoid
confusion with a plain text field type.

You can add HTMLStripCharFilterFactory to some other field type that you
might want to use, but this charFilter needs to be before the
tokenizer. The text field type above is just an example.
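
On the DIH side, only the raw field needs to be populated, since the copyField rule fills the stripped field. A minimal data-config.xml sketch follows. The JDBC driver, connection URL, credentials, table name (records) and column names (id, body_html) are all hypothetical placeholders, not taken from this thread.

```xml
<dataConfig>
  <!-- Hypothetical database connection; substitute your own driver and URL. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="db_user"
              password="db_password"/>
  <document>
    <!-- Hypothetical table and columns. -->
    <entity name="record" query="SELECT id, body_html FROM records">
      <field column="id" name="id"/>
      <!-- Send the HTML column to the raw field only;
           copyField populates text_stripped from it. -->
      <field column="body_html" name="text_html"/>
    </entity>
  </document>
</dataConfig>
```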

-- Jack Krupansky



Re: how to present html content in browse

2012-05-04 Thread Lance Norskog
You need positions and offsets to do highlighting. A CharFilter does
not preserve positions.

I think you have to analyze the raw HTML with a different Analyzer, as
well as the stripper. I think this is how it works: use a new Analyzer
stack that uses the StandardAnalyzer, and the lower case filter and
stemmer/synonym etc. Now, store the HTML field with that text type.
You then search on the stripped field, but highlight from the raw
field with 'hl.fl'.

Here's the cool part: you do not actually need to index the raw HTML,
only store it. If you do not index a field, the Highlighter analyzes
the HTML when it needs the positions and offsets.
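
A rough sketch of that setup, as two separate fragments, might look like the following. The field type name text_general and the handler name /select are assumptions, and whether highlighting of the raw HTML behaves well depends on the Solr version, so treat this as a starting point to test rather than a known-good configuration.

```xml
<!-- schema.xml fragment: raw HTML stored with an analyzed text type
     (assumed name: text_general) rather than string, and not indexed. -->
<field name="text_html" type="text_general" indexed="false" stored="true"/>

<!-- solrconfig.xml fragment: search the stripped field by default,
     highlight from the raw field. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text_stripped</str>
    <str name="hl">true</str>
    <str name="hl.fl">text_html</str>
  </lst>
</requestHandler>
```

The same parameters can also be passed per request (hl=true and hl.fl=text_html) instead of being set as handler defaults.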

On Fri, May 4, 2012 at 2:25 PM, okayndc bodymo...@gmail.com wrote:
 Okay, thanks for the info.

-- 
Lance Norskog
goks...@gmail.com


how to present html content in browse

2012-05-03 Thread srini
I am indexing records from database using DIH. The content of my record is in
html format. When I use browse
I would like to show the content in html format, not in text format. Any
ideas?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to present html content in browse

2012-05-03 Thread Lance Norskog
Make two fields, one which stores the stripped HTML and another that
stores the raw (unstripped) HTML. You can use copyField so that you do not
have to submit the html page twice.

You would mark the stripped field 'indexed=true stored=false' and the
full text field the other way around. The full text field should be a
String type.

On Thu, May 3, 2012 at 1:04 PM, srini softtec...@gmail.com wrote:
 I am indexing records from database using DIH. The content of my record is in
 html format. When I use browse
 I would like to show the content in html format, not in text format. Any
 ideas?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/how-to-present-html-content-in-browse-tp3960327.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Lance Norskog
goks...@gmail.com