Re: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-12 Thread Andy Xue
Hi: Which setting should I modify in order to do normalization before filtering? Should I swap the order in the plugin.includes property? Regards On 7 June 2012 21:24, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, On Wed, Jun 6, 2012 at 10:16 AM, Markus Jelsma
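(For reference: within each stage the ordering is configurable. nutch-default.xml exposes urlfilter.order and urlnormalizer.order; a sketch for nutch-site.xml below uses the stock 1.x normalizer class names. Whether normalization runs before filtering is fixed per tool in the code, not by the order of entries in plugin.includes, as far as I can tell.)

    <property>
      <name>urlnormalizer.order</name>
      <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
             org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
    </property>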

Re: nutch-site.xml not robust

2012-06-12 Thread Andy Xue
Hi all: As I suspected, this vulnerability affects more properties than the ones I described in NUTCH-1385. For instance, the property plugin.includes: <value>plugin_1|plugin_2</value> This is fine; it will load both plugins. <value>plugin_1|plugin_2 </value> This is not fine
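(A sketch of the two cases in nutch-site.xml; plugin_1 and plugin_2 stand in for real plugin ids:)

    <property>
      <name>plugin.includes</name>
      <!-- fine: both plugins load -->
      <value>plugin_1|plugin_2</value>
    </property>

    <property>
      <name>plugin.includes</name>
      <!-- not fine: trailing whitespace/newline inside the value breaks the match -->
      <value>plugin_1|plugin_2
      </value>
    </property>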

Re: Merging crawldbs and linkdbs during incremental crawl

2012-06-12 Thread Ali Safdar Kureishy
Hi, Just checking if anyone could comment on my post below. :) Thanks in advance. Safdar On Mon, Jun 11, 2012 at 8:10 AM, Ali Safdar Kureishy safdar.kurei...@gmail.com wrote: Hi, I'm trying to build an incremental crawler, using the various Nutch crawl tools (generate + fetch/parse +
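(For context, one round of the incremental tool chain the post refers to, as a rough sketch; Nutch 1.x commands, directory names are hypothetical:)

    # one generate/fetch/parse/update round
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=`ls -d crawl/segments/2* | tail -1`   # newest segment
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    # merging databases from separate crawls
    bin/nutch mergedb crawl/crawldb_merged crawl/crawldb_a crawl/crawldb_b
    bin/nutch mergelinkdb crawl/linkdb_merged crawl/linkdb_a crawl/linkdb_b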

Re: Nutch hadoop integration

2012-06-12 Thread Bharat Goyal
The wiki information is not complete and doesn't work in all cases. I have done some modifications; should I mail them to your personal ID? Regards, Bharat Goyal On Monday 11 June 2012 11:00 AM, abhishek tiwari wrote: Thanks for your response. I am very new to Nutch and Hadoop. Actually I

Re: How to ensure even distribution of the fetch phase across Hadoop nodes

2012-06-12 Thread Lewis John Mcgibbney
Hi Ali, Please check out this post [0] I found. I have to agree with the response in that thread and state that I don't know how Hadoop ensures even distribution of workload, but we can assume that by explicitly specifying the mappers and reducers we can ensure that all 'will' be used across your
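(A sketch of pinning task counts in nutch-site.xml. These are Hadoop 1.x property names, an assumption that they match the cluster in question; note mapred.map.tasks is only a hint to Hadoop:)

    <property>
      <name>mapred.map.tasks</name>
      <value>8</value>   <!-- hint only; actual input splits decide -->
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>8</value>   <!-- e.g. roughly one per node or core -->
    </property>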

Re: Nutch hadoop integration

2012-06-12 Thread Lewis John Mcgibbney
Hi Bharat, If you are able to, it would be great if you could add this info to the Nutch wiki. As you mention, the current Hadoop tutorial seems awfully convoluted and we could do with simplifying it. If you are able to contribute your efforts, please sign up to the wiki and I will add

Nutch name spyder

2012-06-12 Thread david
Hello, I have changed <name>http.agent.name</name> <value>MyNameSpider</value> and <name>http.robots.agents</name> <value>MyNameSpider,*</value>. When I look at my website stats, I always see Robots/Spiders visitors as Nutch with a link to http://nutch.apache.org/. Do you have a solution for the name of the spider
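(For readability, the settings above as they would sit in nutch-site.xml:)

    <property>
      <name>http.agent.name</name>
      <value>MyNameSpider</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>MyNameSpider,*</value>
    </property>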

Re: Getting seed url

2012-06-12 Thread Julien Nioche
That's the idea indeed. The urlmeta plugin allows you to do that simply by setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for the description etc...) On 11 June 2012 22:45, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Sandeep, tracking the seed(s) for a document could be done
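(A sketch of that setting; the tag name 'seed' is an example, and the urlmeta plugin itself must be listed in plugin.includes:)

    <property>
      <name>urlmeta.tags</name>
      <value>seed</value>
    </property>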

Re: Getting seed url

2012-06-12 Thread Julien Nioche
Forgot to say: this would work by adding a seed metadata key to the URLs in the seed list, the value of which is then propagated by the scoring filter in urlmeta. On 12 June 2012 14:41, Julien Nioche lists.digitalpeb...@gmail.com wrote: That's the idea indeed. The urlmeta plugin allows you to do that
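(A sketch of such a seed list, assuming the Nutch 1.x injector's tab-separated key=value metadata syntax; the separator must be a real tab character:)

    http://www.example.com/	seed=www.example.com
    http://www.example.org/	seed=www.example.org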

Re: How to ensure even distribution of the fetch phase across Hadoop nodes

2012-06-12 Thread Julien Nioche
Guys, This has to do with the way URLs are grouped for politeness and not so much with the number of blocks in the input. Limiting the number of URLs per host name, domain or IP is a way of ensuring an even distribution across the cluster. See nutch-default.xml for details J. On 12 June 2012 13:06,
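(The relevant knobs, as a sketch; property names from Nutch 1.x nutch-default.xml, the values are examples:)

    <property>
      <name>partition.url.mode</name>
      <value>byHost</value>   <!-- byHost | byDomain | byIP -->
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>     <!-- what generate.max.count counts by -->
    </property>
    <property>
      <name>generate.max.count</name>
      <value>100</value>      <!-- cap per host/domain per segment; -1 = unlimited -->
    </property>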

Re: disable filtering and normalization in the crawl-tool

2012-06-12 Thread Matthias Paul
Why? Unnecessary pages are already filtered out in the parse step? On Tue, Jun 12, 2012 at 12:52 AM, remi tassing tassingr...@gmail.com wrote: Certainly, but you might need them to avoid crawling unnecessary pages On Monday, June 11, 2012, Matthias Paul wrote: Hi, wouldn't it be better

Nutch as a crawler

2012-06-12 Thread Vlad Paunescu
Hello, I am currently trying to use Nutch as a website mirroring tool. To be more explicit, I only need to download the pages, not to index them (I do not intend to use it as a search engine). I couldn't figure out a simpler way to accomplish my task, so what I do now is: - crawl the site, using
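(For reference, dumping fetched content out of a segment can be done with the segment reader; a sketch, where the segment path is hypothetical and the flags follow the Nutch 1.x readseg usage:)

    bin/nutch readseg -dump crawl/segments/20120612120000 dump_dir \
      -nofetch -nogenerate -noparse -noparsedata -noparsetext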

Re: Nutch as a crawler

2012-06-12 Thread Emre Çelikten
Hello, Here's a workaround as a last resort: I think you can add simple code to remove all occurrences of the string "http://www.example.com/" from a dump if you are going to use a Java program anyway. Best, Emre On Tue, Jun 12, 2012 at 5:01 PM, Vlad Paunescu vlad.paune...@gmail.com wrote:
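(A minimal sketch of that idea; the file names and base URL are placeholders:)

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StripBaseUrl {
        public static void main(String[] args) throws IOException {
            // read the whole dump, drop every occurrence of the base URL
            String content = new String(
                Files.readAllBytes(Paths.get("dump.txt")), StandardCharsets.UTF_8);
            String cleaned = content.replace("http://www.example.com/", "");
            Files.write(Paths.get("dump-clean.txt"),
                cleaned.getBytes(StandardCharsets.UTF_8));
        }
    }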

Re: Nutch name spyder

2012-06-12 Thread Sebastian Nagel
Hello David, can you specify which version of Nutch you are using? I ran a local test crawl with Nutch 1.5 two weeks ago and just looked into the Apache log file. All seems correct: 127.0.0.1 - - [31/May/2012:22:25:46 +0200] "GET /robots.txt HTTP/1.0" 404 462 "-" "sn-test-crawler/Nutch-1.5"

Re: Getting seed url

2012-06-12 Thread Sebastian Nagel
Thanks Julien, I've missed that urlmeta passes the tags to the outlinks. Sebastian On 06/12/2012 03:42 PM, Julien Nioche wrote: forgot to say : this would work by adding a seed metadata to the urls in the seed list, the value of which is then propagated by the scoring filter in urlmeta On

Inject using custom score and fetchInterval

2012-06-12 Thread mhunter
According to the documentation, nutch inject is supposed to allow for an entry with a custom score and fetchInterval as well as custom metadata values. I have tried injecting a tab-delimited text file with entries like: http://www.domain-one.com/ nutch.score=10 nutch.fetchInterval=172800
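(For reference, such a seed file would look like this; the fields must be separated by real tab characters, which mail clients often mangle into spaces:)

    http://www.domain-one.com/	nutch.score=10	nutch.fetchInterval=172800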

Re: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-12 Thread Sebastian Nagel
My current workaround would be to delete the .com and .au lines from the configuration file. You could also activate the option +P in suffix-urlfilter.txt: # uncomment the line below to filter on url path #+P The patterns are then applied exclusively to the path of the URL and not to the host or
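(That is, in suffix-urlfilter.txt the suggested change looks like this:)

    # uncomment the line below to filter on url path
    +P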

Re: focused crawl extended with user generated content

2012-06-12 Thread Lewis John Mcgibbney
Hi Magnús, Firstly, On Tue, Jun 12, 2012 at 4:56 PM, Magnús Skúlason magg...@gmail.com wrote: However I would like to allow users to edit and extend the content shown on my site, like adding a better description, adding tags and sorting items into categories. I have not built a search engine

Re: Making the crawler follow a regular expression

2012-06-12 Thread Emre Çelikten
Hello again, Thanks. This does not seem very generalizable though. Is there an up-to-date way to achieve focusing using a plugin?

very long fetch reduce task

2012-06-12 Thread kaveh minooie
Hi everybody, I have an unusual issue. When I run Nutch on top of Hadoop, after the map tasks finish, the reduce tasks start to finish very fast; almost all of them finish in less than 2 hours, but there are always one or two that take a lot longer. This is a link to the list of a completed reduce

RE: focused crawl extended with user generated content

2012-06-12 Thread Arkadi.Kosmynin
Hi Magnus -----Original Message----- From: Magnús Skúlason [mailto:magg...@gmail.com] Sent: Wednesday, 13 June 2012 1:57 AM To: nutch-u...@lucene.apache.org Subject: focused crawl extended with user generated content Hi, I am using Nutch for a focused-crawl vertical search engine, so