Thank you for your answer!
I created the new .ngp file for the new language using tika-app-1.0.jar.
A post on Stack Overflow
<http://stackoverflow.com/questions/9044916/how-can-i-detect-farsi-web-pages-by-tika/9045385>
recommends using the tika.language.override.properties file to configure
the new language in Tika, but I can't find any information on how to
configure it in Nutch.
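
From what I understand of that post, the override file would look roughly
like the sketch below. The language code "xx", its display name, and the
classpath location are my own assumptions for illustration, and I have not
confirmed that Nutch picks the file up:

```properties
# tika.language.override.properties (sketch, unverified for Nutch)
# Assumption: the file sits on the classpath under org/apache/tika/language/,
# next to the matching profile file xx.ngp.
# "xx" is a placeholder code for the new language.
languages=en,xx
name.en=English
name.xx=New Language
```

As far as I can tell, this file replaces the default language list rather
than extending it, so any built-in language that should still be detected
would have to be listed as well.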

I would really appreciate some hints on how to do that.

Thanks
Patricio


2012/10/8 Julien Nioche <[email protected]>

> Hello Patricio
>
> The language identification is delegated to Tika since 1.4 (
> https://issues.apache.org/jira/browse/NUTCH-1075) so you should create
> your own models with Tika instead. As for the second part of your
> question, this is more of a SOLR issue; you'd get more help on the SOLR
> list.
>
> Best
>
> Julien
>
> On 8 October 2012 02:11, Patricio Galeas <[email protected]>
> wrote:
>
> > Hi,
> > Two years ago, with Nutch 1.0, I used the following command to create a
> > new language profile:
> >   nutch plugin language-identifier
> >   org.apache.nutch.analysis.lang.NGramProfile -create <profile-name>
> >   <filename> <encoding>
> > Now I am trying to do the same with Nutch 1.5, but
> > org.apache.nutch.analysis.lang.NGramProfile does not exist.
> > I tried the language-identifier and language-detector plugins, but their
> > performance is not good enough for the language that I need to identify.
> >
> > I also tried language detection in Solr, following the hints from
> > http://wiki.apache.org/solr/LanguageDetection
> > with the following configuration:
> >
> >      <updateRequestProcessorChain name="langid">
> >        <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
> >          <bool name="langid">true</bool>
> >          <str name="langid.fl">content,title</str>
> >          <str name="langid.whitelist">sq</str>
> >          <str name="langid.langField">lang</str>
> >          <str name="langid.fallback">en</str>
> >        </processor>
> >        <processor class="solr.LogUpdateProcessorFactory" />
> >        <processor class="solr.RunUpdateProcessorFactory" />
> >      </updateRequestProcessorChain>
> >
> > But after indexing, the field "lang" was always empty.
> >
> > What am I doing wrong?
> >
> > Any help would be appreciated
> >
> > Thanks
> > Pat
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
