: it looks to me as if Solr just brings back the URLs. what I want to do is to
: get the actual documents in the answer set, simplify their HTML and remove
: all the javascript, ads, etc., and append them into a single document.
: 
: Now ... does Nutch already have the documents? can I get them from its db?
: or do I have to go get the documents again with something like a wget?

i *think* what you are saying is that:

a) you built your index using nutch
b) when you query Solr, you only get back a "url" field for each matching 
document 
c) what you want is to combine the full text of the webpages corresponding 
to all of those urls into one massive html page

If that's the case, then you should either:

1) ask on the nutch-user mailing list about how to "store" the whole 
content of the web pages that nutch crawls, so you can build up a page like 
this (nutch may already be doing it, i don't know -- depends on the 
schema)

2) write custom client code (probably outside the scope of Velocity) to 
re-fetch these urls "at query time", parse them, and combine them as you 
see fit.
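For what it's worth, a rough sketch of option (2) in Python, using only the 
standard library. This is an illustration, not a finished tool: the 
`strip_scripts`/`combine` names are made up, it only drops <script> blocks 
(real ad removal would need site-specific rules), and in real use you'd 
first fetch each url returned by Solr (e.g. with urllib.request) before 
cleaning it:

```python
# Hypothetical sketch: strip <script>...</script> from fetched pages and
# append the cleaned HTML into one combined document.
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Rebuilds the HTML it is fed, dropping <script> blocks entirely."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            attr_text = "".join(
                f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs
            )
            self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)


def strip_scripts(html: str) -> str:
    """Return the html with all <script> elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    return "".join(parser.out)


def combine(pages):
    """Append the cleaned pages into one big HTML document."""
    body = "\n<hr/>\n".join(strip_scripts(p) for p in pages)
    return f"<html><body>{body}</body></html>"
```

In real use, `pages` would be the bodies you re-fetched for each url field 
Solr returned.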

which approach is right for you all depends on your goals and use case -- 
but solr can only give you back the fields you store in it.
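For reference, if you go the option (1) route, what matters is that the 
field holding the page content is marked stored in your Solr schema.xml. 
A rough fragment -- the field name "content" and type are assumptions, 
check what your nutch setup actually writes:

```xml
<!-- hypothetical schema.xml fragment: stored="true" is what lets Solr
     hand the raw content back at query time -->
<field name="content" type="text" indexed="true" stored="true"/>
```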

-Hoss
