Hi,
I see there is an RSS parser under src/plugins, but it wasn't put into the
deployment profile in src/plugins/build.xml. Is there a substitute for
this parser now? Which one should I use, or should I port the previous
RSS parser to Nutch 2.x myself?
Thank you.
Regards,
Ake Tangkananond
IIRC the Tika parser should handle RSS feeds. The one in src/plugins
probably hasn't been ported to 2.x yet, as it generates X sub-documents from
a single source, which parsers in Nutch 2.x can't do at the moment.
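If parse-tika isn't already active, it can be enabled through plugin.includes in nutch-site.xml. A minimal sketch, assuming an otherwise default plugin set (the exact list varies between Nutch versions):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>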
On 8 August 2012 08:54, Ake Tangkananond iam...@gmail.com wrote:
Hey there,
I'm trying to parse CHM (Microsoft Help) files with Nutch, but I get a:
Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
I've tried Nutch 1.4 (Tika 0.10) and 1.5.1 (Tika 1.1), which
should be able to parse those files.
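One hedged way to narrow this down is to run the matching standalone tika-app jar directly against a sample file; the jar and file names below are placeholders:

java -jar tika-app-1.1.jar --text sample.chm

If that fails as well, the limitation is in the bundled Tika version itself rather than in Nutch's parser wiring.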
Hi,
I have been using Nutch for fetching English sites (UTF-8 and ISO-8859-1).
All goes well running in local mode or on a single-node Hadoop cluster
installed on my PC.
Recently I moved the crawling system to Amazon AWS, and the Fetcher has
some encoding problems with special characters; they
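One common culprit when the same crawl behaves differently on another cluster is the JVM default encoding. A hedged first check, assuming the Hadoop task JVMs are at fault, is to pin the encoding in mapred-site.xml (the heap size is a placeholder):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Dfile.encoding=UTF-8</value>
</property>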
Hi, Sebastian
It seems you are right. I have db.ignore.external.links set to true.
But how do I configure Nutch to treat mobile365.ru and www.mobile365.ru as a
single site?
Thanks.
On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel wastl.na...@googlemail.com
wrote:
Hi Alexei,
I tried a crawl with
-----Original message-----
From: Alexei Korolev alexei.koro...@gmail.com
Sent: Wed 08-Aug-2012 15:43
To: user@nutch.apache.org
Subject: Re: crawling site without www
You can use the HostURLNormalizer for this task or just crawl the www OR
the non-www, not both.
I'm trying to crawl only the version without www. As I see it, I can remove
www. using a properly configured regex-normalize.xml.
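For illustration, a rule in conf/regex-normalize.xml along these lines would strip the www prefix (a sketch, assuming only this host needs rewriting):

<regex>
  <pattern>^http://www\.mobile365\.ru/</pattern>
  <substitution>http://mobile365.ru/</substitution>
</regex>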
But will it work if mobile365.ru redirects to www.mobile365.ru (it's very
common
If it starts to redirect and you are on the wrong side of the redirect, you're
in trouble. But with the HostNormalizer you can then renormalize all URLs to
the host that is being redirected to.
-----Original message-----
From: Alexei Korolev alexei.koro...@gmail.com
Sent: Wed 08-Aug-2012
So I see just one solution for crawling a limited set of sites with
behaviour like mobile365's: limit the scope of sites using
regex-urlfilter.txt with a list like this:
+^www.mobile365.ru
+^mobile365.ru
Thanks.
On Wed, Aug 8, 2012 at 5:56 PM, Markus Jelsma markus.jel...@openindex.io wrote:
Hi Alexei,
Better:
+^https?://(?:www\.)?mobile365\.ru/
or to catch all of mobile365.ru
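Spelled out as a complete conf/regex-urlfilter.txt for this crawl, the suggestion would look something like this (a sketch; the final catch-all deny is an assumption about the rest of the file):

# accept the site with or without the www prefix
+^https?://(?:www\.)?mobile365\.ru/
# reject everything else
-.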
Ok. Thank you a lot. I'll try later :)
On Wed, Aug 8, 2012 at 9:18 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:
Is this something other people are seeing? I was parsing 10k URLs when I
got this exception. I'm running Nutch 2 head as of Aug 6 with the default
memory settings (1 GB).
Just wondering if anybody else has experienced this on Nutch 2.
Thanks.
Not sure if it matters, but what data center are you using? Maybe the data
center region uses different characters if the native language isn't English.
On Wed, Aug 8, 2012 at 7:25 AM, Niccolò Becchi niccolo.bec...@gmail.com wrote:
If you are using Nutch in a Hadoop cluster and you have enough memory, try
these parameters:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1600m -XX:-UseGCOverheadLimit -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/</value>
</property>
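(That property would normally go in conf/mapred-site.xml on the cluster; for a local run, the equivalent heap increase can be set via NUTCH_HEAPSIZE in bin/nutch. The HeapDumpPath above is just an example location.)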
On Wed, Aug 8, 2012 at 9:32 PM, Bai
Hi,
my problem is that I have a domain (e.g. http://*.apache.org) and I want to
crawl every document and page on this website and index them with Solr.
I was able to do it using the basic crawl command in Nutch:
bin/nutch crawl urls -solr http://localhost:8983/solr/
but the
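For reference, the same one-shot command with crawl depth and per-round size made explicit (the numbers are placeholders to tune):

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 1000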