I also patched it against the tarball...

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, November 05, 2013 2:30 PM
To: [email protected]
Subject: RE: Language identification

Ah, that patch is for the 2.x branch and won't work on trunk. It can be ported with relative ease, but that will take some time.

-----Original message-----
> From:Ralf R. Kotowski <[email protected]>
> Sent: Tuesday 5th November 2013 14:26
> To: [email protected]
> Subject: RE: Language identification
>
> OK, when I do this on the SVN trunk I get:
>
> blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 < language-filter.patch
> patching file conf/nutch-default.xml
> Hunk #1 succeeded at 941 (offset 19 lines).
> patching file ivy/ivy.xml
> Hunk #1 FAILED at 111.
> 1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
> patching file src/plugin/build.xml
> Hunk #1 succeeded at 30 with fuzz 1.
> Hunk #2 succeeded at 79 with fuzz 1.
> Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
> patching file src/plugin/language-filter/build.xml
> patching file src/plugin/language-filter/ivy.xml
> patching file src/plugin/language-filter/plugin.xml
> patching file src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFilter.java
> patching file src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguageFilter.java
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, November 05, 2013 1:17 PM
> To: [email protected]
> Subject: RE: Language identification
>
> These are git patches and work differently than what we are used to at the ASF (a/ and b/ prefixes).
> In Nutch's root, use patch -p1 < patchfile, or -p0 for the usual SVN-based patches.
>
> -----Original message-----
> > From:Ralf R. Kotowski <[email protected]>
> > Sent: Tuesday 5th November 2013 13:12
> > To: [email protected]
> > Subject: RE: Language identification
> >
> > Thank you,
> >
> > I'm still learning how to patch nutch... not much luck so far...
> >
> > -----Original Message-----
> > From: ilhami Kalkan [mailto:[email protected]]
> > Sent: Tuesday, November 05, 2013 10:36 AM
> > To: [email protected]
> > Subject: Re: Language identification
> >
> > Hi Ralf,
> >
> > I wrote a language-filter plugin patch that filters or accepts pages in the specified languages during the parse phase.
> >
> > NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
> >
> > On 02-11-2013 22:05, Julien Nioche wrote:
> > > Ralf,
> > >
> > > The parameter http.accept.language tells the servers you are hitting that they should provide you the content in the languages you specified, but that does not give you any guarantees, nor does it allow you to filter the content. Look at the languageidentifier plugin as a starting point; then you could add a custom mapreduce job to remove the pages which are not in the languages of interest.
> > >
> > > HTH
> > >
> > > Julien
> > >
> > > On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> What is the correct process to only store documents in a desired language?
> > >>
> > >> I'm currently doing this:
> > >>
> > >> <property>
> > >>   <name>http.accept.language</name>
> > >>   <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> > >>   <description>Value of the "Accept-Language" request header field.
> > >>   This allows selecting non-English language as default one to retrieve.
> > >>   It is a useful setting for search engines build for certain national group.
> > >>   </description>
> > >> </property>
> > >>
> > >> I'm using a seed.txt with URLs I know are in the language I want, but as the crawl grows it seems I'm starting to get more and more docs in other languages.
> > >>
> > >> Thnx in advance
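
For anyone else tripping over the -p0 / -p1 difference discussed above, here is a minimal sketch of what the strip level does to the file names in a patch header. It is not the patch tool's code, just a small Python illustration of the rule described in the patch(1) manual (remove the smallest prefix containing N leading slashes, with adjacent slashes counted as one). The function name strip_components is made up for this example, and the a/ prefix is assumed to be the default one git writes. Raising an error when a name has too few components is a choice made for this sketch, not a claim about how patch itself reacts.

```python
import re


def strip_components(path: str, n: int) -> str:
    """Illustrate how `patch -pN` shortens a file name from a patch header.

    Removes the smallest prefix containing `n` leading slashes; a run of
    adjacent slashes counts as a single slash.
    """
    if n < 0:
        raise ValueError("strip count must be non-negative")
    remainder = path
    for _ in range(n):
        match = re.search(r"/+", remainder)
        if match is None:
            # Assumption for this sketch: fail loudly instead of guessing.
            raise ValueError(f"{path!r} has fewer than {n} leading components")
        remainder = remainder[match.end():]
    return remainder


# File name as a git-style patch header would carry it (a/ prefix assumed).
git_name = "a/conf/nutch-default.xml"
# File name as an SVN-style patch made from the project root would carry it.
svn_name = "conf/nutch-default.xml"

print(strip_components(git_name, 1))  # conf/nutch-default.xml
print(strip_components(svn_name, 0))  # conf/nutch-default.xml
print(strip_components(git_name, 0))  # a/conf/nutch-default.xml (no such file in the tree)
```

With -p1 the git-style name resolves to a path relative to the Nutch root, which is why that level is needed for git patches, while a root-relative SVN-style name already matches at -p0.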

