Re: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-12 Thread Andy Xue
Hi: Which setting should I modify in order to do normalization before filtering? Should I swap the order in the plugin.includes property? Regards On 7 June 2012 21:24, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, On Wed, Jun 6, 2012 at 10:16 AM, Markus Jelsma
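(For reference: within each stage the ordering is configurable. nutch-default.xml exposes urlfilter.order and urlnormalizer.order; a sketch for nutch-site.xml below uses the stock 1.x normalizer class names. Whether normalization runs before filtering is fixed per tool in the code, not by the order of entries in plugin.includes, as far as I can tell.)

    <property>
      <name>urlnormalizer.order</name>
      <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
             org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
    </property>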

Re: nutch-site.xml not robust

2012-06-12 Thread Andy Xue
Hi all: As I suspected, this vulnerability affects more properties than the ones I described in NUTCH-1385. For instance, the property plugin.includes: <value>plugin_1|plugin_2</value> This is fine; it will load both plugins. <value>plugin_1|plugin_2 </value> This is not fine
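(A sketch of the two cases in nutch-site.xml; plugin_1 and plugin_2 stand in for real plugin ids:)

    <property>
      <name>plugin.includes</name>
      <!-- fine: both plugins load -->
      <value>plugin_1|plugin_2</value>
    </property>

    <property>
      <name>plugin.includes</name>
      <!-- not fine: trailing whitespace/newline inside the value breaks the match -->
      <value>plugin_1|plugin_2
      </value>
    </property>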

Re: Merging crawldbs and linkdbs during incremental crawl

2012-06-12 Thread Ali Safdar Kureishy
Hi, Just checking if anyone could comment on my post below. :) Thanks in advance. Safdar On Mon, Jun 11, 2012 at 8:10 AM, Ali Safdar Kureishy safdar.kurei...@gmail.com wrote: Hi, I'm trying to build an incremental crawler, using the various Nutch crawl tools (generate + fetch/parse +
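(For context, one round of the incremental tool chain the post refers to, as a rough sketch; Nutch 1.x commands, directory names are hypothetical:)

    # one generate/fetch/parse/update round
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=`ls -d crawl/segments/2* | tail -1`   # newest segment
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    # merging databases from separate crawls
    bin/nutch mergedb crawl/crawldb_merged crawl/crawldb_a crawl/crawldb_b
    bin/nutch mergelinkdb crawl/linkdb_merged crawl/linkdb_a crawl/linkdb_b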

Re: Nutch hadoop integration

2012-06-12 Thread Bharat Goyal
The wiki information is not complete and doesn't work in all cases. I have done some modifications; should I mail them to your personal ID? Regards, Bharat Goyal On Monday 11 June 2012 11:00 AM, abhishek tiwari wrote: Thanks for your response. I am very new to Nutch and Hadoop. Actually I

Re: How to ensure even distribution of the fetch phase across Hadoop nodes

2012-06-12 Thread Lewis John Mcgibbney
Hi Ali, Please check out this post [0] I found. I have to agree with the response in that thread and state that I don't know how Hadoop ensures even distribution of workload, but we can assume that by explicitly specifying the mappers and reducers we can ensure that all 'will' be used across your
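(A sketch of pinning task counts in nutch-site.xml. These are Hadoop 1.x property names, an assumption that they match the cluster in question; note mapred.map.tasks is only a hint to Hadoop:)

    <property>
      <name>mapred.map.tasks</name>
      <value>8</value>   <!-- hint only; actual input splits decide -->
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>8</value>   <!-- e.g. roughly one per node or core -->
    </property>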

Re: Nutch hadoop integration

2012-06-12 Thread Lewis John Mcgibbney
Hi Bharat, If you are able to, it would be great if you could add this info to the Nutch wiki. As you mention, the current Hadoop tutorial seems awfully convoluted and we could do with simplifying it. If you are able to contribute your efforts, please sign up to the wiki and I will add

Nutch name spyder

2012-06-12 Thread david
Hello, I have changed <name>http.agent.name</name> <value>MyNameSpider</value> and <name>http.robots.agents</name> <value>MyNameSpider,*</value>. When I look at my website stats, I always see Robots/Spiders visitors as Nutch with a link to http://nutch.apache.org/. Do you have a solution for the name of the spider
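(For readability, the settings above as they would sit in nutch-site.xml:)

    <property>
      <name>http.agent.name</name>
      <value>MyNameSpider</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>MyNameSpider,*</value>
    </property>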

Re: Getting seed url

2012-06-12 Thread Julien Nioche
That's the idea indeed. The urlmeta plugin allows you to do that simply by setting urlmeta.tags in nutch-site.xml (see nutch-default.xml for the description etc...) On 11 June 2012 22:45, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Sandeep, tracking the seed(s) for a document could be done
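(A sketch of that setting; the tag name 'seed' is an example, and the urlmeta plugin itself must be listed in plugin.includes:)

    <property>
      <name>urlmeta.tags</name>
      <value>seed</value>
    </property>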

Re: Getting seed url

2012-06-12 Thread Julien Nioche
Forgot to say: this would work by adding a seed metadata key to the URLs in the seed list, the value of which is then propagated by the scoring filter in urlmeta. On 12 June 2012 14:41, Julien Nioche lists.digitalpeb...@gmail.com wrote: That's the idea indeed. The urlmeta plugin allows you to do that
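(A sketch of such a seed list, assuming the Nutch 1.x injector's tab-separated key=value metadata syntax; the separator must be a real tab character:)

    http://www.example.com/	seed=www.example.com
    http://www.example.org/	seed=www.example.org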

Re: How to ensure even distribution of the fetch phase across Hadoop nodes

2012-06-12 Thread Julien Nioche
Guys, This has to do with the way URLs are grouped for politeness and not so much with the number of blocks in the input. Limiting the number of URLs per host name, domain or IP is a way of ensuring an even distribution across the cluster. See nutch-default.xml for details J. On 12 June 2012 13:06,
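(The relevant knobs, as a sketch; property names from Nutch 1.x nutch-default.xml, the values are examples:)

    <property>
      <name>partition.url.mode</name>
      <value>byHost</value>   <!-- byHost | byDomain | byIP -->
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>     <!-- what generate.max.count counts by -->
    </property>
    <property>
      <name>generate.max.count</name>
      <value>100</value>      <!-- cap per host/domain per segment; -1 = unlimited -->
    </property>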

Re: disable filtering and normalization in the crawl-tool

2012-06-12 Thread Matthias Paul
Why? Unnecessary pages are already filtered out in the parse step? On Tue, Jun 12, 2012 at 12:52 AM, remi tassing tassingr...@gmail.com wrote: Certainly, but you might need them to avoid crawling unnecessary pages On Monday, June 11, 2012, Matthias Paul wrote: Hi, wouldn't it be better

Nutch as a crawler

2012-06-12 Thread Vlad Paunescu
Hello, I am currently trying to use Nutch as a website mirroring tool. To be more explicit, I only need to download the pages, not to index them (I do not intend to use it as a search engine). I couldn't figure out a simpler way to accomplish my task, so what I do now is: - crawl the site, using
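(For reference, dumping fetched content out of a segment can be done with the segment reader; a sketch, where the segment path is hypothetical and the flags follow the Nutch 1.x readseg usage:)

    bin/nutch readseg -dump crawl/segments/20120612120000 dump_dir \
      -nofetch -nogenerate -noparse -noparsedata -noparsetext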

Re: Nutch as a crawler

2012-06-12 Thread Emre Çelikten
Hello, Here's a workaround as a last resort: I think you can add simple code to remove all occurrences of the string "http://www.example.com/" from a dump if you are going to use a Java program anyway. Best, Emre On Tue, Jun 12, 2012 at 5:01 PM, Vlad Paunescu vlad.paune...@gmail.com wrote:
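(A minimal sketch of that idea; the file names and base URL are placeholders:)

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StripBaseUrl {
        public static void main(String[] args) throws IOException {
            // read the whole dump, drop every occurrence of the base URL
            String content = new String(
                Files.readAllBytes(Paths.get("dump.txt")), StandardCharsets.UTF_8);
            String cleaned = content.replace("http://www.example.com/", "");
            Files.write(Paths.get("dump-clean.txt"),
                cleaned.getBytes(StandardCharsets.UTF_8));
        }
    }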

Re: Nutch name spyder

2012-06-12 Thread Sebastian Nagel
Hello David, can you specify which version of Nutch you are using? I ran a local test crawl with Nutch 1.5 two weeks ago and just looked into the Apache log file. All seems correct: 127.0.0.1 - - [31/May/2012:22:25:46 +0200] "GET /robots.txt HTTP/1.0" 404 462 "-" "sn-test-crawler/Nutch-1.5"

Re: Getting seed url

2012-06-12 Thread Sebastian Nagel
Thanks Julien, I've missed that urlmeta passes the tags to the outlinks. Sebastian On 06/12/2012 03:42 PM, Julien Nioche wrote: forgot to say : this would work by adding a seed metadata to the urls in the seed list, the value of which is then propagated by the scoring filter in urlmeta On

Inject using custom score and fetchInterval

2012-06-12 Thread mhunter
According to the documentation, nutch inject is supposed to allow for an entry with a custom score and fetchInterval as well as custom metadata values. I have tried injecting a tab-delimited text file with entries like: http://www.domain-one.com/ nutch.score=10 nutch.fetchInterval=172800
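(For reference, such a seed file would look like this; the fields must be separated by real tab characters, which mail clients often mangle into spaces:)

    http://www.domain-one.com/	nutch.score=10	nutch.fetchInterval=172800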

Re: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-12 Thread Sebastian Nagel
My current workaround would be to delete the .com and .au lines from the configuration file. You could also activate the option +P in suffix-urlfilter.txt: # uncomment the line below to filter on url path #+P The patterns are then applied exclusively to the path of the URL and not to the host or
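(That is, in suffix-urlfilter.txt the suggested change looks like this:)

    # uncomment the line below to filter on url path
    +P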

Re: focused crawl extended with user generated content

2012-06-12 Thread Lewis John Mcgibbney
Hi Magnús, Firstly, On Tue, Jun 12, 2012 at 4:56 PM, Magnús Skúlason magg...@gmail.com wrote: However I would like to allow users to edit and extend the content shown on my site, like adding a better description, adding tags and sorting items into categories. I have not built a search engine

Re: Making the crawler follow a regular expression

2012-06-12 Thread Emre Çelikten
Hello again, Thanks. This does not seem very generalizable though. Is there an up-to-date way to achieve focusing using a plugin?

very long fetch reduce task

2012-06-12 Thread kaveh minooie
Hi everybody, I have an unusual issue. When I run Nutch on top of Hadoop, after the map tasks finish, the reduce tasks start to finish very fast; almost all of them finish in less than 2 hours, but there are always one or two that take a lot longer. This is a link to the list of a completed reduce

RE: focused crawl extended with user generated content

2012-06-12 Thread Arkadi.Kosmynin
Hi Magnus -----Original Message----- From: Magnús Skúlason [mailto:magg...@gmail.com] Sent: Wednesday, 13 June 2012 1:57 AM To: nutch-u...@lucene.apache.org Subject: focused crawl extended with user generated content Hi, I am using Nutch for a focused-crawl vertical search engine, so