Re: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Jack Tang
Maybe you can filter JavaScript files (*.js) using a URL filter. /Jack On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote: Hello. Sorry for my limited English. I use nutch-0.7.1 and have an issue with the HTML parser: I get JavaScript code in the summary and don't know how to remove it. For example
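
The URL-filter suggestion above can be sketched roughly as follows. This is only an illustration in Python, not Nutch's own filter code: the RULES list and the accept function are hypothetical names, and the first-match-wins ordering is an assumption modeled on ordered accept/reject regex rules.

```python
import re

# Hypothetical rule list (not Nutch code): each entry is (allow, pattern).
# Rules are checked in order and the first matching rule decides.
RULES = [
    (False, re.compile(r"\.js(\?.*)?$", re.IGNORECASE)),  # reject JavaScript files
    (True, re.compile(r".")),                              # accept everything else
]


def accept(url: str) -> bool:
    """Return True if the URL should be kept in the fetch list."""
    for allow, pattern in RULES:
        if pattern.search(url):
            return allow
    # Assumption: a URL that matches no rule at all is rejected.
    return False
```

As the follow-up in this thread notes, a rule like this only keeps separately fetched .js files out of the crawl; it does nothing for scripts embedded in the HTML page itself.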

RE: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Ilia S. Yatsenko
This script is present in the HTML page inside <script>//<!-- code //--></script> -Original Message- From: Jack Tang [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 15, 2006 10:58 AM To: nutch-user@lucene.apache.org Subject: Re: javascript in summaries [nutch-0.7.1] Maybe you can filter

RE: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Ilia S. Yatsenko
Yes, I see that. But in fact I see JavaScript in my summaries too and don't know how to remove it :) -Original Message- From: Jack Tang [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 15, 2006 11:14 AM To: nutch-user@lucene.apache.org Subject: Re: javascript in summaries [nutch-0.7.1] On

Re: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Jack Tang
Hi there. Can you fetch only one page, say http://www.pozvonok.ru/shop/vp.php?id=377size=-1idtype=26, and check whether the code is working or not? Good luck! On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote: Yes I see that. But in fact I see javascript in my summaries too and don't know how

Re: Links limit per page?

2006-03-15 Thread Aled Jones
Nice one, thanks. -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent: 15 March 2006 09:21 To: nutch-user@lucene.apache.org Subject: Re: Links limit per page? There is a db.max.outlinks.per.page setting in

Re: 0.8: NullPointerException Optimizing index when crawling

2006-03-15 Thread Marko Bauhardt
On 14.03.2006 at 23:20, ArentJan Banck wrote: java.lang.NullPointerException at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:109) Which index plugins do you have configured in your nutch-default.xml or nutch-site.xml? Be sure that the index-basic plugin

Re: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Jérôme Charron
Hi, I can reproduce this with nutch-0.8 using the neko HTML parser (it seems that script tags are not removed). You can switch the HTML parser implementation to tagsoup (property parser.html.impl); in my tests, all is OK with it. Regards Jérôme On 3/15/06, Ilia S. Yatsenko [EMAIL PROTECTED] wrote:
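
To illustrate what "script tags are not removed" means for a summary, here is a minimal sketch of text extraction that skips script content. It is written in Python with the standard html.parser module and is not the neko or tagsoup code path; the TextExtractor class and extract_text function are hypothetical names, and skipping style elements as well is an added assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A parser that lacks this kind of skip step passes the script body through as ordinary text, which is how document.write calls can end up in a summary.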

RE: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Ilia S. Yatsenko
Thanks for your help. -Original Message- From: Jerome Charron [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 15, 2006 12:00 PM To: nutch-user@lucene.apache.org Subject: Re: javascript in summaries [nutch-0.7.1] Hi, I reproduce this with nutch-0.8 with neko html parser (it seems that

Re: 0.8: NullPointerException Optimizing index when crawling

2006-03-15 Thread ajbanck
Thanks, that was it. That plugin definition got lost in the expression defining plugin.includes. - Arent-Jan On 14.03.2006 at 23:20, ArentJan Banck wrote: java.lang.NullPointerException at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:109) Which

Re: Buggy fetchlist' urls

2006-03-15 Thread Jack Tang
On 3/15/06, Jérôme Charron [EMAIL PROTECTED] wrote: I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it as its embedded JavaScript engine. Can we build a RhinoInterpreter first and then evaluate the JavaScript function to get the result, rather than extracting pure text as we do now?

Question adding specialized index-search capabilities.!!

2006-03-15 Thread Vertical Search
I want to tweak Nutch for a specialized vortal (searching hotels), so I was wondering if someone can clarify a few questions I have. After reading the mail archive and the code base, I have concluded the following: I could write an index filter. In the implementation, I have access to the content, from

Re: Question adding specialized index-search capabilities.!!

2006-03-15 Thread Vertical Search
Just a small correction!! On 3/15/06, Vertical Search [EMAIL PROTECTED] wrote: I want to tweak Nutch for a specialized vortal (searching hotels), so I was wondering if someone can clarify a few questions I have. After reading the mail archive and the code base, I have concluded the following

Question on scalability

2006-03-15 Thread Olive g
Hi everyone, I am hoping someone can help me with this. I am indexing ~2 million URLs on 12 machines and found that the results did not scale well. For example, when mapred.reduce.tasks was set to 12, it took about 20 minutes in total to complete the job (11 minutes for reduce);

javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Ilia S. Yatsenko
Hello. Sorry for my limited English. I use nutch-0.7.1 and have an issue with the HTML parser: I get JavaScript code in the summary and don't know how to remove it. For example . \n'); } if (plugin) { document.write(' '); document.write(' '); document.write(' '); document.write(' ');

Re: Boolean OR QueryFilter

2006-03-15 Thread Nguyen Ngoc Giang
Sorry, I'm a newbie in OS, and I'm not familiar with the way of updating patches :D I'll put my solution here first to get comments from our community. Since we must differentiate three possibilities (must have, may have, and must not have), we need at least two boolean variables in

Re: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Doug Cutting
Jérôme Charron wrote: I reproduce this with nutch-0.8 with neko html parser (it seems that script tags are not removed). You can switch the html parser implementation to tagsoup. In my tests, all is ok. (property parser.html.impl) Should we switch the default from neko to tagsoup? Are there

Re: Question on scalability

2006-03-15 Thread Doug Cutting
Olive g wrote: Is hadoop/nutch scalable at all or I can tune some other parameters? I'm not sure what you're asking. How long does it take to run this on a single machine? My guess is that it's much longer. So things are scaling: they're running faster when more hardware is added. In all

Re: Boolean OR QueryFilter

2006-03-15 Thread Doug Cutting
This looks like a good approach. Note also that you will probably need to change BasicQueryFilter and perhaps other filters to work correctly with optional terms. Nguyen Ngoc Giang wrote: Sorry, I'm a newbie in OS, and I'm not familiar with the way of updating patches :D I'll put my

Searching only a whitelist (country specific SE)

2006-03-15 Thread Insurance Squared Inc.
Hi All, We're merrily proceeding down our route of a country-specific search engine, and Nutch seems to be working well. However, we're finding some sites creeping in that aren't from our country. Specifically, we automatically allow in sites that are hosted within the country. We're finding

Site: invalid Jira link

2006-03-15 Thread ArentJan Banck
On http://lucene.apache.org/nutch/issue_tracking.html, the link http://nagoya.apache.org/jira/browse/Nutch no longer works. It should be: http://issues.apache.org/jira/browse/Nutch - Arent-Jan

Re: Site: invalid Jira link

2006-03-15 Thread Doug Cutting
I just fixed this. Thanks, Doug ArentJan Banck wrote: on: http://lucene.apache.org/nutch/issue_tracking.html http://nagoya.apache.org/jira/browse/Nutch no longer works. Should be: http://issues.apache.org/jira/browse/Nutch - Arent-Jan

Re: Boolean OR QueryFilter

2006-03-15 Thread Nguyen Ngoc Giang
I don't think we need to modify the query filters. Looking into the code of BasicQueryFilter, I found that it takes isRequired and isProhibited flags as arguments, so as long as we can set the flags correctly, BasicQueryFilter will take care of the rest. I've experimented with my approach. Let
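
The two-flag idea discussed in this thread can be sketched as a small mapping from the isRequired and isProhibited flags to the three clause states (must have, may have, must not have). This is an illustration in Python only, not the Nutch query code: the Occur enum and occur_from_flags function are hypothetical names, and rejecting the required-and-prohibited combination is an assumption.

```python
from enum import Enum


class Occur(Enum):
    """Hypothetical names for the three states mentioned in the thread."""
    MUST = "must have"
    SHOULD = "may have"
    MUST_NOT = "must not have"


def occur_from_flags(is_required: bool, is_prohibited: bool) -> Occur:
    """Map the two boolean flags onto one of the three clause states."""
    if is_required and is_prohibited:
        # Assumption: this fourth combination is contradictory, so reject it.
        raise ValueError("a clause cannot be both required and prohibited")
    if is_prohibited:
        return Occur.MUST_NOT
    if is_required:
        return Occur.MUST
    # Neither flag set: the term is optional (the OR case).
    return Occur.SHOULD
```

Two booleans give four combinations, so one of them is left over; that is why at least two flags are needed for three states, and why the sketch treats the leftover combination as an error.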

newbie question about nutch 0.8

2006-03-15 Thread Ilia S. Yatsenko
I followed the nutch-0.8 tutorial to crawl a few pages and got the following error: file not found index/segment. I looked into my index directory and see a few subdirectories like part-0, part-1. My question is: how do I set up the searcher correctly? Thanks