Hi,

The language is determined by HTMLLanguageParser, which is itself a parse
filter. You need to make sure that your own parse filter is called after it
(have a look in nutch-default.xml for the exact name of the parameter that
controls the filter order). As you can see in HTMLLanguageParser, the value
is put in the parse metadata:

    parse.getData().getParseMeta().set(Metadata.LANGUAGE, lang);

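To make the ordering point concrete, here is a minimal sketch of an override
in nutch-site.xml. It assumes the ordering property is named
htmlparsefilter.order and that HTMLLanguageParser lives in the
org.apache.nutch.analysis.lang package; please confirm both against your own
nutch-default.xml and plugin sources. org.example.MyParseFilter is a
hypothetical name standing in for your own filter class.

```xml
<!-- Sketch only: confirm the property name in your nutch-default.xml. -->
<property>
  <name>htmlparsefilter.order</name>
  <!-- Space-separated class names, applied in the order given.
       org.example.MyParseFilter is a hypothetical placeholder. -->
  <value>org.apache.nutch.analysis.lang.HTMLLanguageParser org.example.MyParseFilter</value>
</property>
```

The description of the property in nutch-default.xml should say whether
filters left out of the list are still loaded, so check that before relying
on a partial list.
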
Then simply do something like this in your own filter (Nutch 1.x):

    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();
    String lang = metadata.get(Metadata.LANGUAGE);


Note that HTMLLanguageParser simply uses the language code returned in the
HTTP header or specified in the HTML code. Statistical guessing of the
language only happens at indexing time, so the value can be missing at parse
time if neither of those sources specifies a language.

HTH

Julien


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com

On 27 August 2010 00:41, Savannah Beckett <[email protected]> wrote:

> Hi,
>   How do I determine the language of the document inside a parse filter
> function?  I am writing my own parse filter:
>
>
>  public ParseResult filter(Content content, ParseResult parseResult,
>                               HTMLMetaTags metaTags, DocumentFragment doc)
>
> I am trying to do doc.get("lang"), but the compiler complains that it
> cannot find the symbol get() in the DocumentFragment interface.
>
>
> Thanks.
>
>
>
>
