Maybe you can filter JavaScript files (*.js) using the URL filter.
/Jack
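A minimal sketch of the idea, assuming rules written as a leading '+' (accept) or '-' (reject) followed by a regular expression, with the first matching rule deciding. The class name, the rule list, and the example URLs are hypothetical and are not taken from the Nutch code base. Note that this only keeps standalone .js files out of the crawl; it does nothing about script code embedded in an HTML page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class JsUrlFilterSketch {
    // Hypothetical rule: '+' accepts, '-' rejects, the rest of the line is a regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Illustrative rule set: reject URLs ending in .js (optionally followed by
    // a query string), accept everything else.
    static final List<Rule> RULES = Arrays.asList(
        new Rule("-\\.js(\\?.*)?$"),
        new Rule("+."));

    // The first rule whose pattern is found in the URL decides.
    // A URL that matches no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts(RULES, "http://example.com/menu.js"));
        System.out.println(accepts(RULES, "http://example.com/index.html"));
    }
}
```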
On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote:
Hello
Sorry for my poor English.
I use nutch-0.7.1 and have an issue with the HTML parser.
I get JavaScript code in the summary and don't know how to remove it. For
example
This script is present in the HTML page inside <script><!-- code //--></script>
-Original Message-
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 15, 2006 10:58 AM
To: nutch-user@lucene.apache.org
Subject: Re: javascript in summaries [nutch-0.7.1]
Maybe you can filter
Yes, I see that. But in fact I see JavaScript in my summaries too and don't
know how to remove it :)
-Original Message-
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 15, 2006 11:14 AM
To: nutch-user@lucene.apache.org
Subject: Re: javascript in summaries [nutch-0.7.1]
On
Hi there
Can you fetch only one page, say
http://www.pozvonok.ru/shop/vp.php?id=377size=-1idtype=26
and check whether the code works or not?
Good luck!
On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote:
Yes I see that. But in fact I see javascript in my summaries too and don't
know how
Nice one thanks.
-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED]
Sent: 15 March 2006 09:21
To: nutch-user@lucene.apache.org
Subject: Re: Links limit per page?
There is a db.max.outlinks.per.page setting in
On 14.03.2006 at 23:20, ArentJan Banck wrote:
java.lang.NullPointerException
at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:109)
Which index plugins do you have configured in your nutch-default.xml
or nutch-site.xml? Be sure that the index-basic plugin
Hi,
I can reproduce this with nutch-0.8 using the neko HTML parser (it seems that
script tags are not removed).
You can switch the HTML parser implementation to tagsoup. In my tests,
everything works.
(property parser.html.impl)
Regards
Jérôme
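Independently of which parser is configured, script elements can also be stripped from the raw HTML before text is extracted for summaries. The sketch below only illustrates that idea with a regular expression; it is not how neko, tagsoup, or Nutch's parser handle script tags, and the class and method names are hypothetical. A regex cannot cope with every malformed page, so treat it as a stopgap rather than a parser replacement.

```java
import java.util.regex.Pattern;

public class ScriptStripperSketch {
    // Matches a <script ...> element up to its closing tag,
    // case-insensitively and across line breaks.
    private static final Pattern SCRIPT = Pattern.compile(
        "<script\\b[^>]*>.*?</script\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replaces every script element with a single space so that the
    // surrounding text does not run together.
    static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<p>Price list</p>"
            + "<script><!-- document.write('x'); //--></script>"
            + "<p>Contact</p>";
        System.out.println(stripScripts(html));
    }
}
```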
On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote:
Thanks for your help.
-Original Message-
From: Jerome Charron [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 15, 2006 12:00 PM
To: nutch-user@lucene.apache.org
Subject: Re: javascript in summaries [nutch-0.7.1]
Hi,
I reproduce this with nutch-0.8 with neko html parser (it seems that
Thanks, that was it. That plugin definition got lost in the expression
defining plugin.includes.
- Arent-Jan
On 14.03.2006 at 23:20, ArentJan Banck wrote:
java.lang.NullPointerException
at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:109)
Which
On 3/15/06, Jérôme Charron [EMAIL PROTECTED] wrote:
I am not familiar with the Rhino engine, but it is said that JDK 6 adopted
it as its embedded JavaScript engine. Could we build a RhinoInterpreter
first and then evaluate the JavaScript function to get the result, rather
than extracting pure text as we do now?
I want to tweak nutch for a specialized vortal (searching hotels), so I was
wondering if someone could clarify a few questions I have.
After reading the mail archive and the code base,
I have concluded the following:
I could write an index filter. In the implementation, I have access to the
content, from
Just a small correction!!
On 3/15/06, Vertical Search [EMAIL PROTECTED] wrote:
I want to tweak nutch for a specialized vortal (searching hotels), so I was
wondering if someone could clarify a few questions I have.
After reading the mail archive and the code base,
I have concluded the following
Hi everyone,
I am hoping someone can help me with this. I am indexing ~ 2 million URLs on
12 machines
and I found that the job did not scale well. For example:
when mapred.reduce.tasks was set to 12, it took about 20 minutes in total to
complete the job
(11 minutes for reduce);
Hello
Sorry for my poor English.
I use nutch-0.7.1 and have an issue with the HTML parser.
I get JavaScript code in the summary and don't know how to remove it. For
example
. \n'); } if (plugin) { document.write(' '); document.write(' ');
document.write(' '); document.write(' ');
Sorry, I'm a newbie in OS, and I'm not familiar with the way of submitting
patches :D
I'll try to put my solution here first to receive comments from our
community. Since we must differentiate 3 possibilities: must have, may have
and must not have; we need at least 2 boolean variables in
Jérôme Charron wrote:
I reproduce this with nutch-0.8 with neko html parser (it seems that script
tags are not removed).
You can switch the html parser implementation to tagsoup. In my tests, all
is ok.
(property parser.html.impl)
Should we switch the default from neko to tagsoup? Are there
Olive g wrote:
Is hadoop/nutch scalable at all, or can I tune some other parameters?
I'm not sure what you're asking. How long does it take to run this on a
single machine? My guess is that it's much longer. So things are
scaling: they're running faster when more hardware is added. In all
This looks like a good approach. Note also that you will probably need
to change BasicQueryFilter and perhaps other filters to work correctly
with optional terms.
Nguyen Ngoc Giang wrote:
Sorry, I'm a newbie in OS, and I'm not familiar with the way of submitting
patches :D
I'll try to put my
Hi All,
We're merrily proceeding down our route of a country specific search
engine, nutch seems to be working well. However we're finding some
sites creeping in that aren't from our country. Specifically, we
automatically allow in sites that are hosted within the country. We're
finding
on: http://lucene.apache.org/nutch/issue_tracking.html
http://nagoya.apache.org/jira/browse/Nutch no longer works.
Should be: http://issues.apache.org/jira/browse/Nutch
- Arent-Jan
I just fixed this.
Thanks,
Doug
ArentJan Banck wrote:
on: http://lucene.apache.org/nutch/issue_tracking.html
http://nagoya.apache.org/jira/browse/Nutch no longer works.
Should be: http://issues.apache.org/jira/browse/Nutch
- Arent-Jan
I don't think we need to modify the query filters. Looking into the code of
BasicQueryFilter, I found that it takes isRequired and isProhibited flags as
arguments, so as long as we can set the flags correctly, BasicQueryFilter
will take care of the rest.
I've experimented with my approach. Let
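To make the "3 possibilities from 2 booleans" idea from this thread concrete, here is a small standalone sketch. Only the flag names isRequired and isProhibited come from the discussion; the enum, the class, and the choice to reject the combination where both flags are set are assumptions for illustration and do not reflect the actual Nutch query classes.

```java
public class TermOccurrenceSketch {
    // Hypothetical names for the three cases discussed in the thread.
    enum Occurrence { MUST_HAVE, MAY_HAVE, MUST_NOT_HAVE }

    // Maps the two flags onto the three meaningful states.
    // Two booleans give four combinations; the fourth (both set) is
    // contradictory, so this sketch rejects it.
    static Occurrence fromFlags(boolean isRequired, boolean isProhibited) {
        if (isRequired && isProhibited) {
            throw new IllegalArgumentException(
                "a term cannot be both required and prohibited");
        }
        if (isRequired) {
            return Occurrence.MUST_HAVE;
        }
        if (isProhibited) {
            return Occurrence.MUST_NOT_HAVE;
        }
        return Occurrence.MAY_HAVE; // optional term
    }

    public static void main(String[] args) {
        System.out.println(fromFlags(true, false));
        System.out.println(fromFlags(false, false));
        System.out.println(fromFlags(false, true));
    }
}
```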
I followed the nutch-0.8 tutorial to crawl a few pages
and got the following error: file not found index/segment
I looked into my index directory and see a few subdirectories like part-0,
part-1
My question is: how do I set up the searcher correctly?
Thanks