Re: MoreLikeThis: How to get quality terms from html from content stream?

Grant Ingersoll Mon, 10 Aug 2009 06:16:11 -0700

Right, a SearchComponent wrapper around some of the Solr Cell capabilities could make this so.


On Aug 9, 2009, at 11:21 AM, Jay Hill wrote:

Solr Cell definitely sounds like it has a place here. But wouldn't it be
needed as an extracting component earlier in the process for the
MoreLikeThisHandler? The MLT Handler works great when it's directed to a
content stream of plain text. If we could just use Solr Cell to identify the
file type and do the content extraction earlier in the stream, that would do
the trick, I think. Then whether the URL pointed to HTML, a PDF, or whatever,
MLT would be receiving a stream of extracted content.

-Jay
On Sun, Aug 9, 2009 at 7:17 AM, Grant Ingersoll <gsing...@apache.org> wrote:
It's starting to sound like Solr Cell needs a SearchComponent as well, that
can come before the QueryComponent and can be used to map into the other
components. Essentially, take the functionality of the extractOnly option
and have it feed other SearchComponents.
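
The ordering idea above can be sketched in a few lines. This is only a toy model of "an extraction step that runs before the stages that consume the content"; it is not Solr's SearchComponent API. The names ExtractStage, TermStage, run_chain and strip_tags are hypothetical, and the regex tag stripper is a stand-in for real extraction.

```python
import re
from collections import Counter


def strip_tags(raw):
    # Toy stand-in for real content extraction (assumption: not Solr Cell).
    return re.sub(r"<[^>]+>", " ", raw)


class ExtractStage:
    """Hypothetical stage: stores extracted plain text for later stages."""

    def __init__(self, extractor):
        self.extractor = extractor

    def process(self, ctx):
        ctx["content"] = self.extractor(ctx["raw"])


class TermStage:
    """Hypothetical stage: picks the most frequent terms from the content."""

    def process(self, ctx):
        # Falls back to the raw stream when no extraction stage ran first.
        text = ctx.get("content", ctx["raw"])
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        ctx["terms"] = [term for term, _ in counts.most_common(3)]


def run_chain(stages, ctx):
    # Stages run in order and share one context, so earlier output feeds later stages.
    for stage in stages:
        stage.process(ctx)
    return ctx
```

Running the chain with and without the extraction stage on the same markup shows the difference: without it, tag names such as "div" end up among the terms.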



On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote:
On Aug 7, 2009, at 5:23pm, Jay Hill wrote:
I'm using the MoreLikeThisHandler with a content stream to get documents
from my index that match content from an HTML page like this:

http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
But, not surprisingly, the query generated is meaningless because a lot of
the markup is picked out as terms:
<str name="parsedquery_toString">
body:li body:href body:div body:class body:a body:script body:type body:js
body:ul body:text body:javascript body:style body:css body:h body:img
body:var body:articl body:ad body:http body:span body:prop
</str>
Does anyone know a way to transform the HTML so that the content can be
parsed out of the content stream and processed w/o the markup? Or do I need
to write my own HTMLParsingMoreLikeThisHandler?
You'd want to parse the HTML to extract only text first, and use that for
your index data.
Both the Nutch and Tika OSS projects have examples of using HTML parsers
(based on TagSoup or CyberNeko) to generate content suitable for indexing.
-- Ken
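
As a rough illustration of "extract only text first", here is a minimal sketch that swaps in Python's standard-library html.parser instead of the TagSoup or CyberNeko parsers mentioned above. The class name TextExtractor and the function html_to_text are made up for this example, and a real page would likely need more care (boilerplate navigation, encodings, malformed markup).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and the bodies of script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, without markup."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding the result to the MLT handler, rather than the raw page, should keep tag and attribute names such as div, href and script out of the generated query.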
If I parse the content out to a plain text file and point the stream.url
param to file:///parsedfile.txt it works great.

-Jay
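
The workaround described above (write the parsed text to a file, then point stream.url at it) could be scripted roughly as follows. This is a sketch under assumptions: the helper name mlt_url_for_text is hypothetical, the handler address and the mlt.fl, rows and debugQuery values are simply copied from the request earlier in the thread, and the file has to be readable by the Solr server itself for a file:// URL to work.

```python
from pathlib import Path
from urllib.parse import urlencode


def mlt_url_for_text(text, path, solr_mlt="http://localhost:8080/solr/mlt"):
    """Write text to path and build an MLT request URL that streams that file."""
    file_path = Path(path).resolve()
    file_path.write_text(text, encoding="utf-8")
    params = {
        "stream.url": file_path.as_uri(),  # absolute file:/// URL
        "mlt.fl": "body",
        "rows": 4,
        "debugQuery": "true",
    }
    # urlencode percent-encodes the nested URL so it stays a single parameter value.
    return solr_mlt + "?" + urlencode(params)
```

The function only builds the URL; fetching it (for example with urllib.request.urlopen) is left out so the sketch does not depend on a running Solr instance.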
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search


