crawling html in asynchronous service?

sam lee Thu, 24 Feb 2011 12:20:46 -0800

Hey,

I am using the Scheduler to crawl HTML files. The job runs every minute
and needs to crawl /content/foo.html.

If I use Apache Commons HttpClient to GET /content/foo.html, I need to set
up authentication (Basic Auth?).
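
To make the HTTP route concrete, here is a minimal sketch of what it would
involve. It uses the JDK's own HttpURLConnection rather than Commons
HttpClient, only to keep the example free of dependencies. The class name,
method names, and any URL or credentials passed in are placeholders, not
anything Sling provides.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; not part of Sling or Commons HttpClient.
class BasicAuthFetch {

    /** Builds the Authorization header value for HTTP Basic Auth. */
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Sends a GET with the Basic Auth header and returns the body as text. */
    static String fetch(String url, String user, String password) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Authorization", basicAuthHeader(user, password));
        try (InputStream in = conn.getInputStream()) {
            // Assumes the page is served as UTF-8.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

The scheduled job would then call something like
BasicAuthFetch.fetch("http://localhost:8080/content/foo.html", user, password),
which means the credentials have to be stored somewhere the job can read them.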

However, since all the HTML pages that I want to crawl are served within
Sling, is there an API that "resolves" (or "renders") paths like
/content/foo.html, /content/bar.json, and so on?


If I actually have to make an HTTP request, where should I get the
authentication information? The Scheduler does not know about the JCR
Session. Should I explicitly get a ResourceResolver (using
ResourceResolverFactory.getAdministrativeResourceResolver()) every time the
job is fired?
