> I should mention that I'm using Nutch in a web application.
It's possible, though it's hard.
> While debugging I came across the runParser method in ParseUtil class in
> which the task.get(MAX_PARSE_TIME, TimeUnit.SECONDS); returns null.
See http://wiki.apache.org/nutch/RunNutchInEclipse#Debuggi
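On the task.get(MAX_PARSE_TIME, TimeUnit.SECONDS) call returning null: with a plain java.util.concurrent.FutureTask, get() with a timeout returns null only when the callable itself returned null. A timeout or a failure inside the callable surfaces as an exception instead. The following is a minimal standalone sketch of that pattern, not Nutch's actual ParseUtil code; the class ParseTimeoutDemo and the helper runWithTimeout are made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseTimeoutDemo {

    // Hypothetical helper: runs the work on its own thread and reports
    // which way the timed get() ended.
    static String runWithTimeout(Callable<String> work, long timeoutMillis) {
        FutureTask<String> task = new FutureTask<>(work);
        Thread t = new Thread(task);
        t.setDaemon(true); // do not keep the JVM alive for a stuck task
        t.start();
        try {
            String result = task.get(timeoutMillis, TimeUnit.MILLISECONDS);
            // get() hands back exactly what the callable returned.
            return result == null ? "callable returned null" : "ok: " + result;
        } catch (TimeoutException e) {
            task.cancel(true); // interrupt the worker thread
            return "timeout";
        } catch (ExecutionException e) {
            // The callable threw; the original exception is the cause.
            return "callable threw: " + e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(() -> "parsed", 1000));
        System.out.println(runWithTimeout(() -> null, 1000));
        System.out.println(runWithTimeout(() -> {
            throw new IllegalStateException("boom");
        }, 1000));
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5000);
            return "late";
        }, 100));
    }
}
```

If the same distinction holds in your setup, a null from get() would point at the parser itself returning null (or at an exception being caught and swallowed around the call) rather than at the timeout.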
I checked the directory permissions. They should be OK; they are set to
read/write access.
It's just hard to debug, as I can't get the Hadoop logs to work. I only see
warnings and info messages in the console.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Error-parsing-html-tp39946
Sent: Tue, Oct 9, 2012 10:03 am
Subject: Re: Error parsing html
I have now also tried using the source files themselves instead of the
nutch.jar, but nothing changed.
Does anyone have an idea what the reason for this error might be? Or
at least where and what I should look for? Any hint?!
Thanks in
.472066.n3.nabble.com/Error-parsing-html-tp3994699p4012755.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Can you provide a few lines of log or the url that gives the exception?
-Original Message-
From: CarinaBambina
To: user
Sent: Tue, Oct 2, 2012 2:04 pm
Subject: Re: Error parsing html
Thanks for the reply. I'm now using Nutch 1.5.1, but nothing has changed so
far.
While debu
ogram raise
the ParseException.
Right now I have no clue what the problem could be. I also tried using all
default configurations, but nothing changed.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4011495.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Hi,
For starters can you please use 1.5.1.
On Tue, Oct 2, 2012 at 4:32 PM, CarinaBambina wrote:
> Hi,
> I'm curious whether you have come up with a solution yet, as I'm having the
> exact same problem!
> When I start the crawl the entered URL is parsed perfectly, but for all
> 'links' on this site
I'm using Nutch 1.5.
Thanks!
--
View this message in context:
http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4011436.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Please provide the whole log snippet. Is it an HTML file? Can the parser parse
it, is it large?
-Original message-
> From:Sudip Datta
> Sent: Thu 12-Jul-2012 23:47
> To: Markus Jelsma
> Cc: user@nutch.apache.org
> Subject: Re: Error parsing html
>
> In Parse
ma
> > Cc: user@nutch.apache.org
> > Subject: Re: Error parsing html
> >
> > Hi Markus,
> >
> > Yes, they seem to be rightly mapped:
> >
> > parse-plugins.xml reads:
> >
> >
> >
> >
> >
>
Seems correct indeed. Please check the logs, they may tell some more.
-Original message-
> From:Sudip Datta
> Sent: Thu 12-Jul-2012 21:51
> To: Markus Jelsma
> Cc: user@nutch.apache.org
> Subject: Re: Error parsing html
>
> Hi Markus,
>
> Yes, the
a regex of content types.
>
>
> -Original message-
> > From:Sudip Datta
> > Sent: Thu 12-Jul-2012 20:36
> > To: user@nutch.apache.org
> > Subject: Re: Error parsing html
> >
> > Nopes. That didn't help. In fact, I had added that entry minutes before
Nopes. That didn't help. In fact, I had added that entry minutes before
sending a mail to the group, after a couple of hours of frustration
trying to get the parser to work.
On Thu, Jul 12, 2012 at 11:40 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:
For starters, there is no parse-xhtml plugin, unless of course this is a
custom one you've written yourself.
If it isn't, remove it from the plugin.includes property and re-spin it.
hth
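To illustrate why a stray entry in plugin.includes matters: the property is a regular expression over plugin ids. The sketch below assumes an id is included only when the whole id matches the expression. The class name, the helper isIncluded, and the sample values are illustrative and not taken from the poster's configuration.

```java
import java.util.regex.Pattern;

class PluginIncludesDemo {

    // Hypothetical helper: true when the whole plugin id matches the
    // plugin.includes expression (assumed full-match semantics).
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Sample value only; check the real one in your nutch-site.xml.
        String includes = "protocol-http|urlfilter-regex|parse-(html|tika)";

        // An id covered by the expression.
        System.out.println("parse-html:  " + isIncluded(includes, "parse-html"));
        // An id the expression does not cover.
        System.out.println("parse-xhtml: " + isIncluded(includes, "parse-xhtml"));
    }
}
```

Naming a plugin in the expression only selects it if a plugin with that id is actually installed, so an entry for a plugin that does not exist selects nothing.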
On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta wrote:
Hi,
I am using Nutch 1.4 and Solr. My crawls were working perfectly fine before
I made some changes to the SolrWriter (which I believe has nothing to do
with my problem). Since then, I am getting:
WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully parse
content of type text/html
IN