Hello! I think that really depends on what you want to achieve and which parts of your current system you would like to reuse. If it is only HTML processing, I would let Nutch and Solr do that. Of course you can extend Nutch (it has a plugin API) and implement the custom logic you need as a Nutch plugin. There is even an example of a Nutch plugin available (http://wiki.apache.org/nutch/WritingPluginExample), but it's for Nutch 1.3.
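
To give a rough idea of what such a plugin can look like, below is a sketch of a custom indexing filter against the Nutch 1.x plugin API. The package, class name, and field name are made up for illustration, and the exact interfaces can differ slightly between Nutch versions, so treat the WritingPluginExample page as the authoritative reference (a plugin also needs a plugin.xml descriptor and an entry in plugin.includes, both covered there):

package com.example.nutch; // hypothetical package name

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Sketch of an indexing filter that adjusts documents before they reach Solr.
public class CustomHtmlIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Plain text extracted by the HTML parser (tags are already stripped by Nutch).
    String text = parse.getText();

    // Custom logic goes here; returning null drops the document from the index.
    if (text == null || text.trim().isEmpty()) {
      return null;
    }

    // Add or override fields before the document is sent to Solr.
    doc.add("cleaned_content", text);
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}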
--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> Thanks Rafał and Markus for your comments.
> I think Droids has a serious problem with URL parameters in the
> current version (0.2.0) from Maven central:
> https://issues.apache.org/jira/browse/DROIDS-144
> I knew about Nutch, but I haven't been able to implement a crawler
> with it. Have you done that or seen an example application?
> It's probably easy to call a Nutch jar and make it index a website, and maybe
> I will have to do that.
> But as we already have a Java implementation to index other
> sources, it would be nice if we could integrate the crawling part too.
> Regards,
> Alexander
> ------------------------------------
> Hello!
> You can implement your own crawler using Droids
> (http://incubator.apache.org/droids/) or use Apache Nutch
> (http://nutch.apache.org/), which is very easy to integrate with
> Solr and is a very powerful crawler.
> --
> Regards,
> Rafał Kuć
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
>> This may be a bit off topic: How do you index an existing website and
>> control the data going into the index?
>> We already have Java code to process the HTML (or XHTML) and turn it
>> into a SolrJ document (removing tags and other things we do not want
>> in the index). We use SolrJ for indexing.
>> So I guess the question is essentially which Java crawler could be useful.
>> We used to use wget on the command line in our publishing process, but we no
>> longer want to do that.
>> Thanks,
>> Alexander
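
P.S. For the SolrJ side described in the quoted message (turning the cleaned HTML into a document and sending it to Solr), the core of it is usually just a few lines like the sketch below. The Solr URL and the field names are placeholders, and it assumes a SolrJ 3.6+/4.x client (HttpSolrServer; older releases use CommonsHttpSolrServer instead):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleSolrjIndexer {

  public static void main(String[] args) throws IOException, SolrServerException {
    // Placeholder URL; point this at the real core/collection.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    // In the real code the values would come from the crawler and the
    // existing HTML-cleaning step; these are placeholders.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://www.example.com/page.html");
    doc.addField("title", "Example page");
    doc.addField("content", "Cleaned text extracted from the HTML.");

    solr.add(doc);
    solr.commit();
  }
}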