Re: Indexing HTML

Ravish Bhagdev Wed, 03 Oct 2007 03:27:20 -0700

Hi Erik, All,

I escaped the HTML text into entities before sending it to Solr, and indexing
went fine.  The problem now is that when I get back a snippet with
highlighted text for the HTML field, it is not well formed, because the
highlighting sometimes cuts off part of a tag.  For
example:


<lst name="0008369D">
        <arr name="document">
        <str>
ound-color: #FFFFFF; text-align: left; text-indent: 0px;
<em>line-heigh</em>t: normal ; margin-top: 0px; margin-ri
</str>
</arr>
</lst>

<lst name="0008369B">
        <arr name="document">
        <str>
/TR&gt;<br />
&lt;TR align=&quot;left<em>&quot;  va</em>lign=&quot;middle&quot;
style=&quot; height: 28.800000px;&q
</str>
</arr>
</lst>
</lst>

Because of this I cannot present the resulting HTML in a web page.  Is
it possible to strip out all HTML tags completely from the result set?
Would you recommend sending stripped-out text to Solr instead?  But
doesn't Solr use HTML features while searching (anchors, titles, etc.)?
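
To make the "send stripped-out text" option concrete, here is a minimal sketch in Python using only the standard library.  The names TextExtractor and strip_html are hypothetical helpers, not part of Solr, and this says nothing about what Solr itself can do with tags at analysis time.  It assumes the text is stripped on the client before the document is posted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and discards all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text):
    """Return the text content of html_text with tags removed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join with spaces so adjacent block elements do not fuse into one word,
    # then collapse the whitespace left behind by the removed tags.
    return " ".join(" ".join(parser.parts).split())


print(strip_html('<a href="foobar"><b><i>linktext</i></b></a>'))
```

One way to use this would be to index the stripped text in a separate field that is searched and highlighted, and keep the escaped original in a stored-only field.  Note that this sketch also keeps the contents of script and style elements, and that joining with spaces will split a word that has inline markup in the middle of it.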

Why is there no documentation specifically about indexing HTML with
Solr?  How does Nutch do it?  Does it strip out HTML in the snippets
it returns?
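
Since the thread has no worked example of the indexing side, here is a minimal sketch of the escaping approach described at the top of this message: only the contents of the field are escaped, not the field element itself.  It uses Python's standard xml.etree.ElementTree, which escapes text content when serializing.  The function name build_add_xml is hypothetical, and the field names "id" and "line" are taken from the quoted example below.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, html_line):
    """Build a Solr <add> message whose 'line' field holds raw HTML as text."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = str(doc_id)
    # Assigning to .text stores the HTML as character data; the serializer
    # escapes &, < and > so the <field> element itself stays real markup.
    ET.SubElement(doc, "field", name="line").text = html_line
    return ET.tostring(add, encoding="unicode")


print(build_add_xml(4, '<a href="foobar"><b><i>linktext</i></b></a>'))
```

Building the message with an XML library rather than string concatenation avoids both failure modes seen in this thread: forgetting to escape the field contents, and escaping the field tags along with them.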

Any help will be appreciated.

Thanks,
Ravi

On 8/27/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote:
> > What's odd about this is that the error seems to indicate that I did.
>
> Actually the error message looks like you escaped too much.  You
> should _not_ escape <field>, only the contents of it.
>
>         Erik
>
>
>
> >
> > The full text (minus the stack trace) was
> >
> > org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT
> > to read text (position: START_TAG seen ...&lt;field name="line"&gt;&lt;a
> > href="foobar"&gt;... @4:37)
> >
> > Or is that just a byproduct of how SOLR reports the errors back -
> > always escaping them?
> >
> > Thanks guys - I'll have another crack at this tonight.
> >
> >
> > On 8/27/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >>
> >> Michael,
> >>
> >> I think the issue is that you're not escaping the <field> values.
> >> Send something like this to Solr instead:
> >>
> >>   <field name="line">&lt;a href="foobar"&gt;&lt;b&gt;&lt;i&gt;linktext&lt;/i&gt;&lt;/b&gt;&lt;/a&gt;</field>
> >>
> >>         Erik
> >>
> >>
> >> On Aug 27, 2007, at 9:29 AM, Michael Kimsal wrote:
> >>
> >>> Hello
> >>>
> >>> I'm trying to index individual lines of an HTML file, and I'm
> >>> hitting this error:
> >>>
> >>> TEXT must be immediately followed by END_TAG and not START_TAG
> >>>
> >>> I've got something that looks like
> >>>
> >>> <add>
> >>> <doc>
> >>> <field name="id">4</field>
> >>> <field name="line"><a href="foobar"><b><i>linktext</i></b></a></field>
> >>> </doc>
> >>> </add>
> >>>
> >>> Actually, that sample code above, as its own data file POSTed to SOLR,
> >>> throws
> >>>
> >>> parser must be on START_TAG or TEXT to read text (position: START_TAG seen
> >>> ...&lt;field name="line"&gt;&lt;a href="foobar"&gt;... @4:37
> >>>
> >>> as an error.
> >>>
> >>> Any clues as to how I can do this?  I'd like to keep the original
> >>> copy of each line intact in the index.
> >>>
> >>> Thanks!
> >>>
> >>> --
> >>> Michael Kimsal
> >>> http://webdevradio.com
> >>
> >>
> >
> >
> > --
> > Michael Kimsal
> > http://webdevradio.com
>
>
