You might try remote streaming with Solr (see http://wiki.apache.org/solr/SolrConfigXml).
Otherwise, look into a crawler such as Nutch, Droids, or Heritrix.
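As a rough sketch of the remote streaming route, assuming enableRemoteStreaming="true" is set on the requestParsers element in solrconfig.xml: the stream.url parameter asks Solr to fetch the document itself. The variable and function names below are only illustrative, and the second function shows a fallback that needs no Solr config change, since curl's "@-" reads the request body from stdin.

```shell
# Assumption: remote streaming is enabled in solrconfig.xml, e.g.
#   <requestParsers enableRemoteStreaming="true" ... />
# Host names and paths are the ones from this thread.
SOLR_URL="http://localhost:8983/solr/update/extract"
REMOTE_DOC="http://myweb.com/mylocalfile.htm"

# Build the request URL. stream.url tells Solr to fetch the document.
# Note: a remote URL containing '&' or '?' would need URL-encoding first.
build_extract_url() {
  printf '%s?extractOnly=true&stream.url=%s' "$1" "$2"
}

# Option 1: let Solr pull the remote file via stream.url.
extract_remote() {
  curl "$(build_extract_url "$SOLR_URL" "$REMOTE_DOC")"
}

# Option 2: download with curl and pipe the bytes to Solr.
# "--data-binary @-" reads the POST body from stdin.
extract_piped() {
  curl -s "$REMOTE_DOC" |
    curl "$SOLR_URL?extractOnly=true" --data-binary @- -H "Content-type:text/html"
}
```

Calling extract_remote or extract_piped should return the same kind of extractOnly output as the local-file case, provided Solr (or the machine running curl) can reach the remote host.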
-Grant
On Oct 27, 2009, at 11:14 AM, Insight 49, LLC wrote:
Hi,
If I use the ExtractingRequestHandler <http://wiki.apache.org/solr/ExtractingRequestHandler>
on a local file (as shown in http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput),
all works well, but how do I do this for files located on a server?
e.g. (works)
curl http://localhost:8983/solr/update/extract?extractOnly=true --data-binary @mylocalfile.htm -H "Content-type:text/html"
e.g. (doesn't work)
curl http://localhost:8983/solr/update/extract?extractOnly=true --data-binary @http://myweb.com/mylocalfile.htm -H "Content-type:text/html"
Thanks,
Dan
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search