Hi all
I'm new to Nutch and turned to it to put together a setup along the following
lines: we want a remote machine, running Nutch (?), that we can incrementally
feed URLs to, and from which we can access both the index and the raw content
of the crawled versions of those URLs.
It seems to me that Nutch is what we need, but I
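From the tutorial, the incremental cycle appears to look something like the
commands below (I may be misreading it, and the directory names are just
placeholders I made up):

  bin/nutch inject crawl/crawldb urls            # add newly supplied URLs to the crawldb
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`          # the segment generate just created
  bin/nutch fetch $s                             # fetch (and, if enabled, parse) the pages
  bin/nutch updatedb crawl/crawldb $s            # feed newly discovered links back in
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s   # build the Lucene index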
Did you check robots.txt?
On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote:
> after getting this email, I tried commenting out this line in
> regex-urlfilter.txt =
>
> #-[?*!@=]
>
> but it didn't help... I still get the same message - no urls to fetch
>
>
> regex-urlfilter.txt =
>
> # skip URLs containing certain characters as probable queries, etc.
try bin/nutch on the console.
It will give you a list of commands. You could use them to read segments e.g
bin/nutch readdb ..
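For example, assuming your crawl directory is called "crawl" (adjust the paths
to whatever you used, and pick a real segment name from crawl/segments):

  bin/nutch readdb crawl/crawldb -stats                    # URL counts by status
  bin/nutch readdb crawl/crawldb -dump crawldb-dump        # dump the crawldb as text
  bin/nutch readseg -list crawl/segments/20100421123456    # summary of one segment
  bin/nutch readseg -dump crawl/segments/20100421123456 segdump   # raw content + parse text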
On Mon, Apr 19, 2010 at 11:36 PM, nachonieto3 wrote:
>
> I have a doubt... How are the final results of Nutch stored? I mean, in what
> format is the information contained in the analyzed links stored?
What's the difference between regex-urlfilter.txt and crawl-urlfilter.txt?
What uses what?
I meant the production 1.0 server is still crawling them.
On Tue, Apr 20, 2010 at 7:02 PM, wrote:
> Hi Phil,
>
> > -Original Message-
> > From: Phil Barnett [mailto:ph...@philb.us]
> > Sent: Wednesday, 21 April 2010 8:39 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Question about crawler.
> >
> > Is there some place to tell why the crawler has rejected a page?
after getting this email, I tried commenting out this line in
regex-urlfilter.txt =
#-[?*!@=]
but it didn't help... I still get the same message - no urls to fetch
regex-urlfilter.txt =
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited
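For reference, the stock file is roughly laid out like this (quoting from
memory, so the exact lines may differ slightly); the final +. rule is what
actually accepts URLs, and the - rules above it only exclude things:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):
  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]
  # accept anything else
  +.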
What is in your regex-urlfilter.txt?
> -Original Message-
> From: joshua paul [mailto:jos...@neocodesoftware.com]
> Sent: Wednesday, 21 April 2010 9:44 AM
> To: nutch-user@lucene.apache.org
> Subject: nutch says No URLs to fetch - check your seed list and URL
> filters when trying to index
nutch says No URLs to fetch - check your seed list and URL filters when
trying to index fmforums.com.
I am using this command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
- urls directory contains urls.txt which contains http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([
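The stock crawl-urlfilter.txt ships with a placeholder rule that has to be
edited for the site being crawled, so for this crawl I would expect the
(cut-off) line above to end up as something like:

  # accept hosts in fmforums.com
  +^http://([a-z0-9]*\.)*fmforums.com/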
Hi Phil,
> -Original Message-
> From: Phil Barnett [mailto:ph...@philb.us]
> Sent: Wednesday, 21 April 2010 8:39 AM
> To: nutch-user@lucene.apache.org
> Subject: Question about crawler.
>
> Is there some place to tell why the crawler has rejected a page? I'm
> trying
> to get 1.1 working
Is there some place to tell why the crawler has rejected a page? I'm trying
to get 1.1 working and basically it doesn't seem to crawl the same way that
1.0 does.
I have tika included in the parse- section of conf/nutch-site.xml
I have DEBUG set for all the crawl sections, but it doesn't really say
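By "DEBUG set for all the crawl sections" I mean conf/log4j.properties entries
along these lines (logger names from memory, so double-check them against your
copy of the file):

  log4j.logger.org.apache.nutch.crawl.Injector=DEBUG,cmdstdout
  log4j.logger.org.apache.nutch.crawl.Generator=DEBUG,cmdstdout
  log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
  log4j.logger.org.apache.nutch.parse.ParseSegment=DEBUG,cmdstdout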
1 or even 2 GB is far from impressive. Why don't you switch hadoop.tmp.dir to
a place with, say, 50GB free? Your task may be successful on Windows just
because the temp space limit is different there.
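Something like this in conf/nutch-site.xml should do it (the path is just an
example, point it anywhere with enough room):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/some/big/filesystem/nutch-tmp</value>
    <description>Temporary directory used by Hadoop during Nutch jobs.</description>
  </property>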
From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Wednesday, 21 April 2010 3:40 AM
To: nutch-user@lucene.apache.org
Here is the output, with fetcher parsing enabled:
Command output:
crawl started in: cmrolg-even/crawl
rootUrlDir = /projects/events/search/nutch-1.0/cmrolg-even/urls
threads = 10
depth = 5
Injector: starting
Injector: crawlDb: cmrolg-even/crawl/crawldb.
Injector: urlDir: /projects/events/search/
Yes - how much free space does it need? We ran 0.9 using /tmp, and that
has ~ 1 GB. After I first saw this error, I moved it to another filesystem
where I have 2 GB free (maybe not "gigs and gigs", but more than I think I
need to complete a small test crawl?).
Hi Joshua,
The error message you got definitely indicates that you are running out of
space. Have you changed the value of hadoop.tmp.dir in the config file?
J.
--
DigitalPebble Ltd
http://www.digitalpebble.com
On 20 April 2010 14:00, Joshua J Pavel wrote:
> I am - I changed the location to
Apologies for filling the thread with troubleshooting.
I tried the same configuration on an identical server, and I still have the
same exact errors. I used the same configuration on a Windows system under
Cygwin, and it works successfully. So now I'm wondering if there is some
incompatibility with
I am - I changed the location to a filesystem with lots of free space and
watched disk utilization during a crawl. It'll be a relatively small
crawl, and I have gigs and gigs free.
I have a doubt related to this topic (I guess)... How are the final results of
Nutch stored? I mean, in what format is the information contained in the
analyzed links stored?
I understood that Nutch needs the information in plain text to parse it... but
in what format is it finally stored? I know it is stored in "segments"
I have a doubt... How are the final results of Nutch stored? I mean, in what
format is the information contained in the analyzed links stored?
I understood that Nutch needs the information in plain text to parse it... but
in what format is it finally stored? I know it is stored in "segments" but how
can I
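The earlier suggestion to try bin/nutch applies here: the results are kept as
Hadoop map/sequence files under crawldb, linkdb and segments/<timestamp>/
(content, crawl_fetch, crawl_parse, parse_data, parse_text), and the read
commands turn them back into plain text, e.g. (segment name is just an example):

  bin/nutch readseg -dump crawl/segments/20100421123456 segdump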