Nutch plugins/feed

2012-08-08 Thread Ake Tangkananond
Hi, I see there is an RSS parser under src/plugins, but it wasn't put into the deployment profile in src/plugins/build.xml. Is there a substitute for this parser now? Which one should I use, or should I port the previous RSS parser to Nutch 2.x myself? Thank you. Regards, Ake Tangkananond

Re: Nutch plugins/feed

2012-08-08 Thread Julien Nioche
IIRC the Tika parser should handle RSS feeds. The one in src/plugins probably hasn't been ported to 2.x yet, as it generates X sub-documents from a single source, which parsers in Nutch 2.x can't do at the moment. On 8 August 2012 08:54, Ake Tangkananond iam...@gmail.com wrote: Hi, I see there
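Routing feeds through Tika requires parse-tika to be active in plugin.includes. A minimal sketch of the nutch-site.xml override, assuming an otherwise default plugin set (the value shown is illustrative, not the shipped default):

    <property>
      <name>plugin.includes</name>
      <!-- illustrative plugin set; parse-tika is the part that matters for feeds -->
      <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>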

CHM Files and Tika

2012-08-08 Thread Jan Riewe
Hey there, I try to parse CHM (Microsoft Help Files) with Nutch, but I get: Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp. I've tried version 1.4 (Tika 0.10) and 1.5.1 of Nutch (Tika 1.1), which should be able to parse those files
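That error means Nutch found no parser registered for the mime type. One thing worth checking (a hedged sketch, not a confirmed fix) is the mapping in conf/parse-plugins.xml; whether Tika can then actually extract CHM content depends on the Tika version bundled with the Nutch release:

    <!-- inside <parse-plugins> in conf/parse-plugins.xml -->
    <mimeType name="application/vnd.ms-htmlhelp">
      <plugin id="parse-tika" />
    </mimeType>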

Nutch Encoding on AWS

2012-08-08 Thread Niccolò Becchi
Hi, I have been using Nutch for fetching English sites (UTF-8 and ISO-8859-1). All goes well running in local mode or on a single-node Hadoop cluster installed on my PC. Recently I moved the crawling system to Amazon AWS, and the Fetcher has some encoding problems with special characters; they
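One possible cause, an assumption not confirmed in the thread: the JVMs on the AWS task nodes may default to a non-UTF-8 file.encoding, unlike the local setup. Forcing the encoding for Hadoop child tasks is a cheap thing to try:

    <property>
      <name>mapred.child.java.opts</name>
      <!-- -Dfile.encoding is a generic JVM flag; the heap size here is illustrative -->
      <value>-Xmx1000m -Dfile.encoding=UTF-8</value>
    </property>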

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
Hi, Sebastian. Seems you are right. I have db.ignore.external.links set to true. But how do I configure Nutch to process mobile365.ru and www.mobile365 as a single site? Thanks. On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Alexei, I tried a crawl with

RE: crawling site without www

2012-08-08 Thread Markus Jelsma
-Original message- From: Alexei Korolev alexei.koro...@gmail.com Sent: Wed 08-Aug-2012 15:43 To: user@nutch.apache.org Subject: Re: crawling site without www Hi, Sebastian. Seems you are right. I have db.ignore.external.links set to true. But how do I configure Nutch to

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
You can use the HostURLNormalizer for this task, or just crawl the www OR the non-www, not both. I'm trying to crawl only the version without www. As I see it, I can remove the www. using a properly configured regex-normalize.xml. But will it work if mobile365.ru redirects to www.mobile365.ru (it's very common
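A minimal sketch of such a rule for conf/regex-normalize.xml, assuming the urlnormalizer-regex plugin is enabled in plugin.includes (the host and pattern are specific to this example):

    <!-- inside <regex-normalize>: rewrite www.mobile365.ru to mobile365.ru -->
    <regex>
      <pattern>^(https?://)www\.mobile365\.ru</pattern>
      <substitution>$1mobile365.ru</substitution>
    </regex>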

RE: crawling site without www

2012-08-08 Thread Markus Jelsma
If it starts to redirect and you are on the wrong side of the redirect, you're in trouble. But with the HostNormalizer you can then renormalize all URLs to the host that is being redirected to. -Original message- From: Alexei Korolev alexei.koro...@gmail.com Sent: Wed 08-Aug-2012

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
So I see just one solution for crawling a limited set of sites that behave like mobile365: limit the scope of sites using regex-urlfilter.txt with a list like this: +^www.mobile365.ru +^mobile365.ru Thanks. On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma markus.jel...@openindex.io wrote: If

Re: crawling site without www

2012-08-08 Thread Sebastian Nagel
Hi Alexei, So I see just one solution for crawling a limited set of sites that behave like mobile365: limit the scope of sites using regex-urlfilter.txt with a list like this: +^www.mobile365.ru +^mobile365.ru Better: +^https?://(?:www\.)?mobile365\.ru/ or to catch all of mobile365.ru
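Written out as conf/regex-urlfilter.txt rules, Sebastian's suggestion would look roughly like this (the trailing deny-all line is an assumption about how the rest of the file is set up):

    # accept both the www and the non-www variant of the site
    +^https?://(?:www\.)?mobile365\.ru/
    # reject everything else
    -.

Rules are evaluated top to bottom and the first match wins, so the accept line must come before the catch-all deny.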

Re: crawling site without www

2012-08-08 Thread Alexei Korolev
Ok. Thank you a lot. I'll try later :) On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Alexei, So I see just one solution for crawling a limited set of sites that behave like mobile365: limit the scope of sites using regex-urlfilter.txt with a list

java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-08 Thread Bai Shen
Is this something other people are seeing? I was parsing 10k URLs when I got this exception. I'm running Nutch 2 head as of Aug 6 with the default memory settings (1 GB). Just wondering if anybody else has experienced this on Nutch 2. Thanks.

Re: Nutch Encoding on AWS

2012-08-08 Thread X3C TECH
Not sure if it matters, but what data center are you using? Maybe the data center region uses different characters if the native language isn't English. On Wed, Aug 8, 2012 at 7:25 AM, Niccolò Becchi niccolo.bec...@gmail.com wrote: Hi, I have been using Nutch for fetching English sites (UTF-8

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-08-08 Thread Niccolò Becchi
If you are using Nutch in a Hadoop cluster and you have enough memory, try these parameters:

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1600m -XX:-UseGCOverheadLimit -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/</value>
    </property>

On Wed, Aug 8, 2012 at 9:32 PM, Bai
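For local (non-distributed) runs, the equivalent knob is the heap size hook in the bin/nutch launch script, which reads NUTCH_HEAPSIZE (in MB, default around 1000). The value below is illustrative:

    # set before invoking any bin/nutch command
    export NUTCH_HEAPSIZE=4000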

Nutch script to crawl a whole domain

2012-08-08 Thread aabbcc
Hi, my problem is that I have a domain (e.g. http://*.apache.org) and I want to crawl every document and page on this website and index them with Solr. I was able to do it using the basic nutch crawl command: bin/nutch crawl urls -solr http://localhost:8983/solr/ but the
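For whole-domain crawls the one-shot command usually needs explicit depth and per-round limits; a sketch with illustrative values, using the flags of the Nutch 1.x crawl command:

    bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 10 -topN 1000

Combined with a regex-urlfilter.txt rule restricted to apache.org, this keeps the crawl inside the target domain.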