On 2010-08-06 21:01, Roger Marin wrote:
Hello,

I am new to nutch and I have a requirement to embed the crawler into my
application, however I have been running into some issues that I hope
someone can help me with.

First of all, I understand that Nutch requires a Unix-like environment to
run, but what can I do if I need to embed Nutch in an app that must run on
both Windows and Linux, without any guarantee that Cygwin is installed?

This is currently difficult. The dependency on POSIX utilities can be cut out of Hadoop, but not easily: the Java API doesn't give access to some of the information that Hadoop needs. At one time I used AspectJ to replace calls to these utilities with Java classes that return real data (where it can be obtained, e.g. using the Java 1.6 File API) or fake data otherwise. While this worked for my application, I wouldn't recommend it in the general case.
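
As a rough illustration of the "real data from the Java 1.6 File API" part, here is a minimal sketch of a pure-Java class that reports disk space figures. The class name PureJavaDiskInfo and its methods are hypothetical, and the assumption that disk usage is one of the values being asked of an external utility is mine; the AspectJ wiring that would redirect Hadoop's calls to such a class is not shown.

```java
import java.io.File;

/**
 * Hypothetical stand-in for a shell-based disk usage lookup.
 * Uses only java.io.File methods available since Java 1.6.
 */
public class PureJavaDiskInfo {
    private final File dir;

    public PureJavaDiskInfo(File dir) {
        this.dir = dir;
    }

    /** Total size in bytes of the partition holding dir (0 if unknown). */
    public long capacity() {
        return dir.getTotalSpace();
    }

    /** Bytes available to this JVM on that partition (0 if unknown). */
    public long available() {
        return dir.getUsableSpace();
    }

    /** Bytes in use, derived from total and free space; never negative. */
    public long used() {
        return Math.max(0L, dir.getTotalSpace() - dir.getFreeSpace());
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        PureJavaDiskInfo info = new PureJavaDiskInfo(tmp);
        System.out.println("capacity=" + info.capacity()
                + " used=" + info.used()
                + " available=" + info.available());
    }
}
```

Values that the File API cannot supply would still have to be faked, which is the main reason this approach does not generalize well.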

Another option is to include a small subset of cygwin utils and libs that are needed by Hadoop, and provide a "private" cygwin install with your application.


The other thing I need to figure out is whether it's possible to
programmatically set some of the parameters needed to use the crawler. For
instance, I need to programmatically set the urls instead of having a url
file or a crawl-urlfilter file, as well as the properties in nutch-site.xml,
because these are configured dynamically by the application and are
relative to the application itself, so I cannot hardcode them.

Most, if not all, Nutch tools implement the Tool interface, which means you can execute them through run(String[] args). Most of them also provide specialized methods that accept typed arguments.

Also, tools are configured with an instance of Configuration - before you execute run() you can tweak this Configuration object to your liking by setting properties programmatically.

Re: seed file - this actually needs to be a file so that Hadoop FileInputFormat works. If you absolutely can't create the seed list in a temp file, then you will need to change Injector to use a different InputFormat implementation that reads these values from some other source.
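
If a temp file is acceptable, the seed list can be generated by the application at run time. The following is only a sketch: the class name SeedListWriter, the file name seed.txt and the "nutch-seeds-" prefix are made up for illustration, and it assumes the injector is pointed at a directory holding plain-text files with one URL per line. It uses java.nio.file, so it needs Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that turns an in-memory URL list into a seed directory. */
public class SeedListWriter {

    /**
     * Writes the given URLs, one per line, to seed.txt inside a fresh
     * temporary directory and returns the path of that directory.
     */
    public static Path writeSeedDir(List<String> urls) throws IOException {
        Path dir = Files.createTempDirectory("nutch-seeds-");
        Path seedFile = dir.resolve("seed.txt");
        Files.write(seedFile, urls, StandardCharsets.UTF_8);
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // Example URLs only; a real application would supply its own list.
        Path dir = writeSeedDir(Arrays.asList(
                "http://example.com/",
                "http://example.org/"));
        System.out.println("Seed directory: " + dir);
    }
}
```

The returned directory path can then be passed to the tool as its URL directory argument, and removed once the inject step has finished.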

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
