I don't know much about alternative pieces of software. I do know that writing parse plugins in Nutch is quite easy and flexible, and a plugin gets full access to the DOM.
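To make the two-step idea from the thread concrete (a parse filter extracts values from the DOM, an indexing filter passes them on to the Solr index), here is a rough sketch in Python. It is not Nutch code: real Nutch plugins are Java classes implementing the HtmlParseFilter and IndexingFilter extension points. The page markup, the CSS class names, the field names, and the functions `text_by_class`, `parse_filter` and `index_filter` are all made up for illustration.

```python
from xml.dom import minidom

# Hypothetical, well-formed product page; real merchant pages will differ.
PAGE = """<html><body>
<h1 class="title">Acme Digital Camera</h1>
<span class="price">$149.99</span>
<span class="resolution">2 MP</span>
</body></html>"""


def text_by_class(dom, tag, css_class):
    """Return the text of the first <tag> with the given class, or None."""
    for node in dom.getElementsByTagName(tag):
        if node.getAttribute("class") == css_class:
            parts = [c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE]
            return "".join(parts).strip()
    return None


def parse_filter(dom):
    """Parse step (illustrative): walk the DOM, store raw values as metadata."""
    metadata = {}
    price = text_by_class(dom, "span", "price")
    if price is not None:
        metadata["price"] = price.lstrip("$")
    resolution = text_by_class(dom, "span", "resolution")
    if resolution is not None:
        metadata["resolution"] = resolution
    return metadata


def index_filter(metadata):
    """Index step (illustrative): turn metadata into typed index fields."""
    doc = {}
    if "price" in metadata:
        doc["price"] = float(metadata["price"])
    if "resolution" in metadata:
        # "2 MP" -> 2.0; assumes the number comes first.
        doc["resolution_mp"] = float(metadata["resolution"].split()[0])
    return doc


dom = minidom.parseString(PAGE)
solr_doc = index_filter(parse_filter(dom))
print(solr_doc)
```

With numeric fields like `price` and `resolution_mp` in the index, Solr can build range facets such as the price bands mentioned in the original question. Each merchant would likely need its own extraction rules, since the markup differs per site.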

On Monday 12 September 2011 14:15:49 dpt9876 wrote:
> Ok nice. So its possible. Do you think this is a better method than
> scraping using an alternate? It seems to me it is in that it will work
> better with my end state, being Solr faceted search and I can remove
> layers of complexity.
>
> On Sep 12, 2011 8:03 PM, "Markus Jelsma-2 [via Lucene]"
> <[email protected]> wrote:
> > Yes you can. As Ken replied in your Solr thread you must create custom
> > parse and indexing filters. The parse filter is needed to extract the
> > information and store it in the document and the index filter is used
> > to pass that new information to the Solr index.
> >
> > On Monday 12 September 2011 12:55:49 dpt9876 wrote:
> >> Hi, the friendly guys at the Solr user group pointed me here.
> >>
> >> I am wondering if Nutch/Solr will do the following for a project I am
> >> working on.
> >> I want to create a search engine with facets for potentially hundreds of
> >> websites.
> >> Similar to say crawling amazon + buy.com + ebay and someone can search
> >> these 3 sites from my 1 website.
> >> (I realise there are better ways of doing the above example, its for
> >> illustrative purposes).
> >> Eventually I would build that search crawl to index say 200 or 1000
> >> merchants.
> >> Someone would come to my site and search for "digital camera".
> >>
> >> They would get results from all 3 indexes and hopefully dynamic facets
> >> eg Price $100-200
> >>    Price 200-300
> >>    Resolution 1mp-2mp
> >>
> >> etc etc
> >>
> >> Can this be done on the fly?
> >>
> >> I ask this because I am currently developing webscrapers to crawl these
> >> websites, dump that data into a db, then was thinking of tacking on a
> >> solr server to crawl my db.
> >>
> >> Problem with that approach is that crawling the worlds ecommerce sites
> >> will take forever, when it seems solr might do that for me? (I have read
> >> about multiple indexes etc).
> >>
> >> Many thanks
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/Will-Solr-Nutch-crawl-multi-websites-aka-a-mini-google-with-faceted-search-tp3329346p3329346.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

