Hi Vlad,

On Mon, Jun 18, 2012 at 2:58 PM, Vlad Paunescu <[email protected]> wrote:
> - create a local directory structure which resembles the remote structure:
> is there any elegant way of using the existing Nutch API to accomplish
> this, or I need to manually create the structure from the segments content;

As far as I know, Nutch doesn't currently offer this kind of
escaping/transformation from absolute to relative paths in its
API. A suitable option would be to implement something of this nature
within o.a.n.util.URLUtil... there are already some useful methods
in there to get you started with the kind of URL processing that Nutch
offers. I think it would be excellent if this kind of mapping could be
implemented and made configurable... if you start working on it, then
please open an issue if you can.
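
To give a rough idea of the mapping I mean, here is a minimal sketch
using only the JDK. It is not existing Nutch code: the class name
LocalMirrorPaths, the method toLocalPath and the "index.html" default
for directory-style URLs are all assumptions made for illustration.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Nutch): maps an absolute URL to a
// file path below a local mirror directory, keeping the remote layout.
final class LocalMirrorPaths {

    static Path toLocalPath(Path mirrorRoot, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }

        // An empty path means the site root.
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        // Assumption: directory-style URLs are stored as index.html.
        if (path.endsWith("/")) {
            path = path + "index.html";
        }

        // Drop the leading '/' so the path resolves below the host directory.
        Path hostDir = mirrorRoot.resolve(host).normalize();
        Path local = hostDir.resolve(path.substring(1)).normalize();

        // Refuse paths such as "/../../etc/passwd" that escape the mirror.
        if (!local.startsWith(hostDir)) {
            throw new IllegalArgumentException("Path escapes mirror root: " + url);
        }
        return local;
    }

    public static void main(String[] args) {
        Path root = Paths.get("mirror");
        System.out.println(toLocalPath(root, "http://www.mysite.com/resources/foo.txt"));
        System.out.println(toLocalPath(root, "http://www.mysite.com/"));
    }
}
```

The query string is ignored here, so two URLs that differ only in their
query would map to the same file; whether that matters depends on the
site you are mirroring.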

> - convert links inside every page to relative links. For example, if a src
> points to "http://www.mysite.com/resources/foo.txt", I need to change that
> to be "/resources/foo.txt" because I want to point to the local file. My
> question is if I can use the crawl_parse, or parse_data to get the links. I
> am not sure how to do this, using the Nutch API.

The Parse interface gives you access to the outlinks via
Parse.getData().getOutlinks(), which would then let you process them,
provided you have implemented the mapping described above.
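
The rewriting step itself could look roughly like the sketch below. It
is plain JDK and not Nutch API: LinkRewriter and toSiteRelative are
hypothetical names, and I am assuming you feed it the URL string of each
outlink (Outlink.getToUrl()) together with the host you are mirroring.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not part of Nutch): turns absolute links that
// point at the mirrored host into root-relative links.
final class LinkRewriter {

    static String toSiteRelative(String link, String siteHost) {
        URI uri;
        try {
            uri = new URI(link);
        } catch (URISyntaxException e) {
            return link; // leave links we cannot parse untouched
        }

        // Relative links and links to other hosts are returned unchanged.
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(siteHost)) {
            return link;
        }

        // Keep path, query and fragment; drop scheme and authority.
        StringBuilder relative = new StringBuilder();
        String path = uri.getRawPath();
        relative.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            relative.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            relative.append('#').append(uri.getRawFragment());
        }
        return relative.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            toSiteRelative("http://www.mysite.com/resources/foo.txt", "www.mysite.com"));
    }
}
```

You would still need to substitute the rewritten value back into the
stored page content yourself; the sketch only computes the replacement
string.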

Let us know how you get on.

hth



-- 
Lewis
