A few months ago I tested YaCy, and it works pretty well.
http://yacy.net/en/

You can install it stand-alone and set up your own crawler (single node or 
cluster).
It has a very nice admin and control interface.
After installation, disable the internal database and enable the feed to Solr; 
that's it.
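
If you would rather keep your own extraction code and only need to control 
what ends up in the index, a minimal sketch could look like the following. It 
is not part of YaCy; it only builds the XML "add" message that Solr's update 
handler accepts. The class name SolrAddXml and the field names used with it 
(id, title) are my assumptions and would have to match your schema.

```java
import java.util.Map;

// Hypothetical helper: turns already-cleaned field values into a Solr XML add message.
final class SolrAddXml {

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Build <add><doc>...</doc></add> with one <field> per map entry, in map order.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(e.getKey())).append("\">")
               .append(escape(e.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }
}
```

Only the fields you put into the map reach Solr, so this is the place to 
decide what goes into the index. The resulting string can be posted to the 
update handler, or you can skip the XML and add the same fields to a SolrJ 
document as you already do.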

Regards,
Bernd


On 05.09.2012 17:05, Lochschmied, Alexander wrote:
> This may be a bit off topic: How do you index an existing website and control 
> the data going into the index?
> 
> We already have Java code to process the HTML (or XHTML) and turn it into a 
> SolrJ Document (removing tags and other things we do not want in the index). 
> We use SolrJ for indexing.
> So I guess the question is essentially which Java crawler could be useful.
> 
> We used to use wget on the command line in our publishing process, but we no 
> longer want to do that.
> 
> Thanks,
> Alexander
> 
> 
