Yes, but any unparsed or unfetched page will be skipped during indexing.

Maybe you can extend your own IndexingFilter and IndexWriter to accomplish
this. In the IndexingFilter plugin you can populate the NutchDocument from
the ParseData; in the IndexWriter plugin you can then read that
NutchDocument and generate several SolrInputDocuments, where each
SolrInputDocument represents one parsed record.
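A minimal sketch of the fan-out step, without the Nutch/Solr dependencies: here a
NutchDocument carrying parallel multi-valued fields is modeled as a plain map of
field name to value list, and split into one flat map per record (each of which
would become a SolrInputDocument in a real IndexWriter plugin). The field names
`record.id` and `record.title` are hypothetical, not Nutch conventions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecordFanOut {

    /**
     * Split a document whose fields hold parallel per-record values into
     * one flat map per record. Assumes every field has the same number of
     * values (one per parsed record).
     */
    public static List<Map<String, String>> split(Map<String, List<String>> doc) {
        int n = doc.values().iterator().next().size(); // number of records
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Map<String, String> rec = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> e : doc.entrySet()) {
                rec.put(e.getKey(), e.getValue().get(i));
            }
            records.add(rec); // would become one SolrInputDocument each
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("record.id", List.of("1", "2"));
        doc.put("record.title", List.of("first", "second"));
        for (Map<String, String> rec : split(doc)) {
            System.out.println(rec);
            // prints {record.id=1, record.title=first}
            // then   {record.id=2, record.title=second}
        }
    }
}
```

In the real plugin the same loop would build a `SolrInputDocument` per record and
add each one to the Solr update request instead of returning maps.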


On Wed, Apr 10, 2013 at 3:11 PM, Sourajit Basak <[email protected]> wrote:

> Because those pages do not exist in reality. However we have all the
> metadata required to build an index.
>
> This is solved by generating the *required outlinks* for the xml (hub)
> page.
>
>
> On Mon, Apr 8, 2013 at 8:45 PM, feng lu <[email protected]> wrote:
>
> > Hi Sourajit
> >
> > Why do you want to index unfetched webpages? Indexing will fail if
> > these pages are missing fields that the indexer needs, such as the
> > digest.
> >
> >
> > On Mon, Apr 8, 2013 at 7:15 PM, Sourajit Basak <[email protected]
> > >wrote:
> >
> > > We have a use case where we are generating multiple parse outputs per
> > url.
> > > In short the url hosts a custom xml file which is being parsed to
> > generate
> > > several records.
> > >
> > > However, in reality the discovered or generated urls are not actually
> > > fetched. According to NUTCH-514, anything which isn't fetched will be
> > > skipped during indexing.
> > >
> > > We need to override this behavior. Any ideas how it can be
> > > accomplished?
> > >
> >
> >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >
>



-- 
Don't Grow Old, Grow Up... :-)
