: It looks to me as if Solr just brings back the URLs. What I want to do is to
: get the actual documents in the answer set, simplify their HTML, remove
: all the javascript, ads, etc., and append them into a single document.
:
: Now ... does Nutch already have the documents? Can I get them from its db?
: Or do I have to go get the documents again with something like a wget?
i *think* what you are saying is that:

a) you built your index using nutch
b) when you query Solr, you only get back a "url" field for each matching document
c) what you want is to combine the whole text of the webpages corresponding to all of those urls into one massive html page

If that's the case, then you should either:

1) ask on the nutch-user mailing list about how to "store" the whole content of the web pages that nutch crawls so you can build up a page like this (nutch may already be doing it, i don't know -- depends on the schema)
2) write custom client code (probably outside the scope of velocity) to re-fetch these urls "at query time", parse them, and combine them as you see fit.

Which approach is right for you all depends on your goals and use case -- but Solr can only give you back the fields you store in it.

-Hoss
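For what it's worth, option 2 could be sketched roughly like this. Everything concrete here is an assumption, not something from the thread: it assumes Solr is at localhost:8983 with a core named "nutch", that the stored field holding each page's address is called "url", and it uses a deliberately crude regex cleanup rather than a real HTML sanitizer.

```python
# Hypothetical sketch of option 2: query Solr for URLs, re-fetch each page,
# strip scripts/styles, and append everything into one HTML document.
import json
import re
import urllib.request
from urllib.parse import urlencode

def solr_urls(query, solr="http://localhost:8983/solr/nutch/select"):
    """Ask Solr for just the 'url' field of every matching document."""
    params = urlencode({"q": query, "fl": "url", "wt": "json", "rows": 100})
    with urllib.request.urlopen(solr + "?" + params) as resp:
        docs = json.load(resp)["response"]["docs"]
    return [d["url"] for d in docs if "url" in d]

def simplify(html):
    """Crude cleanup: drop <script>/<style> blocks and inline event handlers."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    html = re.sub(r'(?i)\son\w+\s*=\s*"[^"]*"', "", html)
    return html

def combine(urls):
    """Re-fetch each page, simplify it, and append the results into one doc."""
    parts = []
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        parts.append("<div><!-- %s -->%s</div>" % (url, simplify(page)))
    return "<html><body>" + "\n".join(parts) + "</body></html>"
```

A real implementation would want a proper HTML parser (and some notion of "ads") rather than regexes, but the shape -- Solr gives you stored fields, your own code does the fetching and merging -- stays the same.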