Because those pages do not actually exist. However, we have all the metadata required to build an index.
This is solved by generating the *required outlinks* for the xml (hub) page.

On Mon, Apr 8, 2013 at 8:45 PM, feng lu <[email protected]> wrote:
> Hi Sourajit
>
> Why do you want to index unfetched webpages? The index processing will
> fail if these pages do not have some fields that are needed by the
> indexer, such as digest.
>
>
> On Mon, Apr 8, 2013 at 7:15 PM, Sourajit Basak <[email protected]> wrote:
>
> > We have a use case where we are generating multiple parse outputs per
> > url. In short, the url hosts a custom xml file which is being parsed to
> > generate several records.
> >
> > However, in reality the discovered or generated urls are not actually
> > fetched. According to NUTCH-514, anything which isn't fetched will be
> > skipped during indexing.
> >
> > We need to override this behavior. Any ideas how it can be accomplished?
>
>
> --
> Don't Grow Old, Grow Up... :-)
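For anyone hitting the same problem later: below is a rough sketch of what "generating the required outlinks" could look like as a custom Nutch 1.x Parser plugin. It is only an illustration, not the original poster's actual code. XmlHubParser, the package name, and extractRecordUrls() are hypothetical placeholders, and the usual plugin wiring (plugin.xml, adding the plugin to plugin.includes, mapping it to the XML content type in parse-plugins.xml) is omitted.

package org.example.nutch.parse; // hypothetical package

import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

/**
 * Hypothetical parser for the custom XML "hub" page. Every record found in
 * the XML is turned into an Outlink, so the generated URLs enter the CrawlDb
 * on the next updatedb, get fetched, and are no longer skipped at index time.
 */
public class XmlHubParser implements Parser {

  private Configuration conf;

  @Override
  public ParseResult getParse(Content content) {
    // extractRecordUrls() stands in for whatever XML parsing already
    // produces one record per entry in the hub file.
    List<String> recordUrls = extractRecordUrls(content.getContent());

    List<Outlink> outlinks = new ArrayList<Outlink>();
    for (String url : recordUrls) {
      try {
        outlinks.add(new Outlink(url, "")); // empty anchor text
      } catch (MalformedURLException e) {
        // skip records whose URL cannot be parsed
      }
    }

    ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
    ParseData data = new ParseData(status, "xml hub",
        outlinks.toArray(new Outlink[outlinks.size()]),
        content.getMetadata(), new Metadata());

    return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl("", data));
  }

  // Placeholder: parse the custom XML and return the per-record URLs.
  private List<String> extractRecordUrls(byte[] xml) {
    return new ArrayList<String>();
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

Once these outlinks flow through the normal updatedb / generate / fetch cycle, the per-record URLs are fetched pages in their own right and are indexed like any other page, which sidesteps the NUTCH-514 behavior of dropping unfetched entries.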

