OK, when I do this on the SVN trunk I get:

blackcie@blackcie-VirtualBox:~/nutch-eclipse/2.x$ patch -p1 < language-filter.patch
patching file conf/nutch-default.xml
Hunk #1 succeeded at 941 (offset 19 lines).
patching file ivy/ivy.xml
Hunk #1 FAILED at 111.
1 out of 1 hunk FAILED -- saving rejects to file ivy/ivy.xml.rej
patching file src/plugin/build.xml
Hunk #1 succeeded at 30 with fuzz 1.
Hunk #2 succeeded at 79 with fuzz 1.
Hunk #3 succeeded at 112 with fuzz 1 (offset 2 lines).
patching file src/plugin/language-filter/build.xml
patching file src/plugin/language-filter/ivy.xml
patching file src/plugin/language-filter/plugin.xml
patching file src/plugin/language-filter/src/java/org/apache/nutch/filter/lang/LanguageFilter.java
patching file src/plugin/language-filter/src/test/org/apache/nutch/filter/lang/TestLanguageFilter.java
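
For anyone following along, a note on the -p1 in the command above: GNU patch's -pN option strips the smallest prefix containing N leading slashes from each file name in the patch, which is what removes the a/ and b/ prefixes that git adds. Below is a minimal Python sketch of that stripping rule. The helper name strip_components is hypothetical, and it only models the path handling, not patch itself; raising an error when there are too few slashes is a choice made for this sketch.

```python
import re


def strip_components(path: str, n: int) -> str:
    """Model how `patch -pN` rewrites a file name from a patch header.

    Strips the smallest prefix containing n slashes; a run of adjacent
    slashes counts as a single slash.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        # -p0 uses the file name exactly as written in the patch.
        return path
    remaining = re.sub(r"/+", "/", path)
    for _ in range(n):
        _head, sep, tail = remaining.partition("/")
        if not sep:
            # Assumption of this sketch: treat too few slashes as an error.
            raise ValueError(f"{path!r} has fewer than {n} slashes")
        remaining = tail
    return remaining


# git-style header path, applied from the Nutch root with -p1
print(strip_components("a/conf/nutch-default.xml", 1))  # conf/nutch-default.xml
# SVN-style header path, already relative to the root, applied with -p0
print(strip_components("conf/nutch-default.xml", 0))    # conf/nutch-default.xml
```

Both calls end up at the same file relative to the Nutch root, which is why the strip level depends on where the patch came from.
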
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, November 05, 2013 1:17 PM
To: [email protected]
Subject: RE: Language identification

These are git patches and work differently than we are used to at the ASF (a/ and b/ prefixes). In Nutch's root, patch -p1 < patchfile or -p0 for the usual SVN based patches.

-----Original message-----
> From:Ralf R. Kotowski <[email protected]>
> Sent: Tuesday 5th November 2013 13:12
> To: [email protected]
> Subject: RE: Language identification
>
> Thank you,
>
> I'm still learning how to patch nutch... not much luck so far...
>
> -----Original Message-----
> From: ilhami Kalkan [mailto:[email protected]]
> Sent: Tuesday, November 05, 2013 10:36 AM
> To: [email protected]
> Subject: Re: Language identification
>
> Hi Ralf,
>
> I patched a language-filter plugin to filter or accept pages in
> specified languages during the parse phase.
>
> NUTCH-1663 <https://issues.apache.org/jira/browse/NUTCH-1663>
>
>
> On 02-11-2013 22:05, Julien Nioche wrote:
> > Ralf,
> >
> > The parameter http.accept.language tells the servers you are hitting that
> > they should provide you the content in the languages you specified but that
> > does not give you any guarantees nor allows you to filter the content. Look
> > at the languageidentifier plugin as a starting point, then you could add a
> > custom mapreduce job to remove the pages which are not in the languages of
> > interest.
> >
> > HTH
> >
> > Julien
> >
> >
> > On 2 November 2013 17:15, Ralf R. Kotowski <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> What is the correct process to only store documents in a desired language?
> >>
> >> I'm currently doing this:
> >>
> >> <property>
> >>   <name>http.accept.language</name>
> >>   <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
> >>   <description>Value of the "Accept-Language" request header field.
> >>   This allows selecting non-English language as default one to retrieve.
> >>   It is a useful setting for search engines build for certain national group.
> >>   </description>
> >> </property>
> >>
> >> Using a seed.txt with URLs I know are in the language I want, but as the
> >> crawl grows it seems I'm starting to get more and more docs in other
> >> languages.
> >>
> >> Thnx in advance
> >>
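
On Julien's point that http.accept.language only states a preference: the header value is a comma-separated list of language ranges, each with an optional q weight that defaults to 1.0, and a server is free to ignore it. Here is a small Python sketch that parses the value from the property quoted above into ranked preferences. parse_accept_language is a hypothetical helper written for illustration, not part of Nutch.

```python
from typing import List, Tuple


def parse_accept_language(value: str) -> List[Tuple[str, float]]:
    """Parse an Accept-Language value into (language-range, q) pairs,
    highest weight first; ties keep their order from the header."""
    prefs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        lang, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # a range without an explicit q parameter has weight 1.0
        for param in params:
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                q = float(val)
        prefs.append((lang.lower(), q))
    # sorted() is stable, so equal weights stay in header order.
    return sorted(prefs, key=lambda pref: -pref[1])


for lang, q in parse_accept_language("ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3"):
    print(lang, q)
```

With this value, ja-jp, en-us and en-gb all get the default weight 1.0, so the header as written does not rank Japanese above English. Either way it only influences what a server chooses to send and does not filter what ends up in the crawl.
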

