Re: Solr scraping: Nutch and other alternatives.

2011-10-19 Thread Luis Cappa Banda
Hello Marco, Markus and Óscar. Thank you very much for your answers. What you suggest, Óscar, sounds very interesting. I mean the alternative that covers data mining with any 'popular searcher'. Do you know any tutorial or book that can teach me the first steps? Bye!

Re: Solr scraping: Nutch and other alternatives.

2011-10-19 Thread Igor MILOVANOVIC
Try this if you haven't used Python before: http://gun.io/blog/python-for-the-web/ Keep in mind that scraping a well-known search engine is usually against its ToS, so sooner or later they will block you, at the very least. Be gentle and polite, and you might even make it work... ;)
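To make "gentle and polite" a bit more concrete, here is a minimal Python sketch, not a complete crawler. It checks robots.txt rules with the standard library's `urllib.robotparser` and enforces a minimum delay between requests. The class name `PolitenessGate`, the user agent `example-bot`, and the 2-second default delay are assumptions chosen for illustration; the robots.txt lines are passed in directly so the sketch runs without network access.

```python
import time
import urllib.robotparser


class PolitenessGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_lines, user_agent="example-bot", min_delay=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent      # assumed bot name, not a real crawler
        self.min_delay = min_delay        # assumed minimum seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        # Parse robots.txt content that the caller has already downloaded.
        self._robots = urllib.robotparser.RobotFileParser()
        self._robots.parse(robots_lines)

    def allowed(self, url):
        """Return True if the robots.txt rules permit fetching this URL."""
        return self._robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until min_delay has passed since the last request.

        Returns the number of seconds actually waited.
        """
        waited = 0.0
        if self._last_request is not None:
            remaining = self.min_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request = self._clock()
        return waited


if __name__ == "__main__":
    gate = PolitenessGate(["User-agent: *", "Disallow: /private/"], min_delay=0.1)
    for url in ["http://example.com/index.html", "http://example.com/private/x"]:
        if gate.allowed(url):
            gate.wait_turn()
            print("would fetch", url)   # the real HTTP request would go here
        else:
            print("skipping", url)
```

Call `allowed()` before every request and `wait_turn()` right before the actual fetch. Respecting robots.txt and throttling does not make a ToS violation acceptable; it only reduces the load you put on the site.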

Solr scraping: Nutch and other alternatives.

2011-10-18 Thread Luis Cappa Banda
Hello everyone. I've been thinking about a way to retrieve information from a domain (for example, http://www.ign.com) to process and index. My idea is to use Solr as the search engine. I'm familiar with Apache Nutch and I know that the latest version has a gateway to Solr to retrieve and index

Re: Solr scraping: Nutch and other alternatives.

2011-10-18 Thread Marco Martinez
Hi Luis, Have you tried the copyField function with custom analyzers and tokenizers? bye, Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42 2011/10/18 Luis Cappa Banda luisca...@gmail.com Hello

Re: Solr scraping: Nutch and other alternatives.

2011-10-18 Thread Markus Jelsma
I'm a bit biased, but I would certainly use Nutch, as it seems to be the right tool for the job. Developing custom plugins is actually easier than you might think. Solr, with its extracting request handler, can only help in a very limited way. Hello everyone. I've been thinking about a

Re: Solr scraping: Nutch and other alternatives.

2011-10-18 Thread Óscar Marín Miró
Hi Luis, just an opinion (worked with Nutch intensively, 2005-2008). Web crawling is a bitch, and Nutch won't make it any easier. Some problems you'll find along the way: 1. Spidering tunnels/traps 2. Duplicate and near-duplicate content removal 3. GET parameter explosion in dynamic
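As a rough illustration of two of the problems listed above (duplicate content and GET parameter explosion), here is a small Python sketch using only the standard library. It is an assumption-laden example, not how Nutch handles these cases: the `IGNORED_PARAMS` set, the function names, and the `SeenFilter` class are all hypothetical. The fingerprint only catches exact duplicates after whitespace and case normalization; real near-duplicate detection needs something stronger, such as shingling.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters assumed not to change the page content.
IGNORED_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url):
    """Normalize a URL so trivially different variants compare equal."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so ordering does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Lowercase scheme and host, default the path to "/", drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))


def content_fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers canonical URLs and content hashes already processed."""

    def __init__(self):
        self.urls = set()
        self.fingerprints = set()

    def is_new_url(self, url):
        key = canonical_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_new_content(self, text):
        key = content_fingerprint(text)
        if key in self.fingerprints:
            return False
        self.fingerprints.add(key)
        return True


if __name__ == "__main__":
    seen = SeenFilter()
    print(seen.is_new_url("http://Example.com/list?b=2&a=1&sid=abc#top"))
    print(seen.is_new_url("http://example.com/list?a=1&b=2"))
```

Check `is_new_url()` before queueing a link and `is_new_content()` before indexing a fetched page. Which parameters are safe to ignore depends entirely on the site being crawled.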