Re: Language-focused crawling

Safdar Kureishy Sun, 01 Jul 2012 12:38:58 -0700

Thanks Markus/Alexander.

Alexander -- what "filter" are you suggesting I implement, if not the
scoring filter?


Markus -- the fetch filter filters the content regardless of who pointed to
it. The use case I'm trying to implement applies the filter one hop later,
i.e., pages in other languages that are pointed to by a page in my language
of choice are still crawled, but the pages that they point to are not
(unless another page in the desired language also points to them).

For this, following Alexander's suggestion of an outlink metadata flag (but
contrary to his advice on scoring filters), it seems that the ScoringFilter
is the right way to go:
- During distributeScoreToOutlinks(), I would attach a metadata tag
(e.g. "LangBias=true") to the outlinks of pages in the desired language.
- During updateDbScore(), I would look for "LangBias=true" on the inlinks.
If found, I would set the score to 1. If not found, I would set the score
to 0 (or a negative value).
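
The two steps above can be sketched in plain Java. This only models the
idea and is not written against the real Nutch ScoringFilter interface: the
class LangBiasSketch, its method names and the use of a Map for link
metadata are assumptions made for illustration, and the real methods take
Nutch types with different signatures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LangBiasSketch {
    // Hypothetical metadata key, taken from the "LangBias=true" example.
    public static final String LANG_BIAS = "LangBias";

    /**
     * Step 1 (stands in for distributeScoreToOutlinks): build one metadata
     * map per outlink and tag it only if the linking page is in the
     * desired language.
     */
    public static List<Map<String, String>> tagOutlinks(boolean pageIsTargetLang,
                                                        int outlinkCount) {
        List<Map<String, String>> outlinkMeta = new ArrayList<>();
        for (int i = 0; i < outlinkCount; i++) {
            Map<String, String> meta = new HashMap<>();
            if (pageIsTargetLang) {
                meta.put(LANG_BIAS, "true");
            }
            outlinkMeta.add(meta);
        }
        return outlinkMeta;
    }

    /**
     * Step 2 (stands in for updateDbScore): score 1 if at least one inlink
     * carries the tag, otherwise 0.
     */
    public static float scoreFromInlinks(List<Map<String, String>> inlinkMeta) {
        for (Map<String, String> meta : inlinkMeta) {
            if ("true".equals(meta.get(LANG_BIAS))) {
                return 1.0f;
            }
        }
        return 0.0f;
    }

    public static void main(String[] args) {
        List<Map<String, String>> fromGood = tagOutlinks(true, 1);
        List<Map<String, String>> fromBad = tagOutlinks(false, 1);

        // A page linked from both a wrong-language and a right-language page.
        List<Map<String, String>> mixed = new ArrayList<>(fromBad);
        mixed.addAll(fromGood);

        System.out.println("only bad inlinks: " + scoreFromInlinks(fromBad));
        System.out.println("mixed inlinks:    " + scoreFromInlinks(mixed));
    }
}
```

A single tagged inlink is enough to keep a page, which matches the "unless
another page of the desired language points to it" exception.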

Then, if my understanding is correct, I would be able to filter out the bad
pages using a score threshold for the generate() phase of the crawl cycle.
Alternatively, I could use the generatorSortValue() method to filter out the
pages that don't have LangBias=true by giving them a really low sort value
for the generate() phase.
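
The two generate-phase options can be compared with a small plain-Java
model. Again this is only a sketch under assumptions: GenerateFilterSketch
and its methods are hypothetical names, the URLs are made up, and the real
generator works on CrawlDb entries rather than in-memory maps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenerateFilterSketch {

    /** Option A: keep only URLs whose score reaches the threshold. */
    public static List<String> selectByThreshold(Map<String, Float> scores,
                                                 float minScore) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            if (e.getValue() >= minScore) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    /** Option B: untagged pages get the lowest possible sort value. */
    public static float sortValue(boolean hasLangBias, float initSort) {
        return hasLangBias ? initSort : -Float.MAX_VALUE;
    }

    /** Take the n best URLs by sort value (highest first). */
    public static List<String> topN(Map<String, Boolean> tagged, int n) {
        List<String> urls = new ArrayList<>(tagged.keySet());
        urls.sort((a, b) -> Float.compare(sortValue(tagged.get(b), 1.0f),
                                          sortValue(tagged.get(a), 1.0f)));
        return new ArrayList<>(urls.subList(0, Math.min(n, urls.size())));
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("http://example.org/bad", 0.0f);
        scores.put("http://example.org/good", 1.0f);
        System.out.println(selectByThreshold(scores, 0.5f));

        Map<String, Boolean> tagged = new LinkedHashMap<>();
        tagged.put("http://example.org/bad", false);
        tagged.put("http://example.org/good", true);
        System.out.println(topN(tagged, 1));
        System.out.println(topN(tagged, 2));
    }
}
```

In this model the threshold drops untagged pages outright, while the low
sort value only pushes them to the end of the list: they are excluded only
when the top-N cutoff is smaller than the number of tagged pages.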

Thanks again...
Safdar

On Sun, Jul 1, 2012 at 9:45 PM, Markus Jelsma <[email protected]> wrote:

> It's a use case for a fetch filter:
> https://issues.apache.org/jira/browse/NUTCH-828
>
>
>
> -----Original message-----
> > From:Alexander Aristov <[email protected]>
> > Sent: Sun 01-Jul-2012 20:43
> > To: [email protected]; [email protected]
> > Subject: Re: Language-focused crawling
> >
> > Hi
> >
> > First of all, you understand that in order to detect the page language
> > the page must be crawled and at least sent to the parser. As you
> > admitted, the language-identifier filter adds the lang field and that's it.
> >
> > You will need to modify or write your own filter that would discard
> > unwanted languages (return null).
> >
> > scoring filters are something different and not suitable for the purpose.
> >
> > As for indexing pages referenced by desired pages, the solution might
> > be to add a flag to outlink metadata which then would be used to pass
> > the page through your filter.
> >
> > None of this is really difficult if you have the necessary programming
> > skills and a strong desire. :)
> >
> >
> > Best Regards
> > Alexander Aristov
> >
> >
> > On 1 July 2012 17:00, Safdar Kureishy <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I would like to do a focused web crawl using Nutch, for all pages of a
> > > specific language - let's say "lang". However, the default
> > > language-identifier plugin from Nutch does not support this language.
> > >
> > > The heuristic I'd like to use is that I want all pages pointed to by
> > > pages containing "lang" content to be crawled, but pages that are
> > > pointed to by non-"lang" pages should not be crawled (unless at least
> > > one "lang" page points to it). It appears that I would need to create
> > > a ScoringFilter for this, and exploit the distributeScoreToOutlinks()
> > > and updateDbScore() methods of the filter. However, before I embark on
> > > that journey, I thought I'd ask if there is already a solution to this
> > > problem of a language focused crawl in any Nutch plugin library
> > > somewhere, that supports an extensive list of languages?
> > >
> > > Thanks,
> > > Safdar
> > >
> >
>
