Format of the Nutch Results

2010-04-20 Thread nachonieto3
I have a doubt... How are the final results of Nutch stored? I mean, in which format is the information contained in the analyzed links stored? I understood that Nutch needs the information in plain text to parse it... but in which format is it finally stored? I know it is stored in "segments", but how can I

Re: how to parse html files while crawling

2010-04-20 Thread nachonieto3
I have a doubt related to this topic (I guess)... How are the final results of Nutch stored? I mean, in which format is the information contained in the analyzed links stored? I understood that Nutch needs the information in plain text to parse it... but in which format is it finally stored? I know it is s

RE: Hadoop Disk Error

2010-04-20 Thread Joshua J Pavel
I am - I changed the location to a filesystem with lots of free space and watched disk utilization during a crawl. It'll be a relatively small crawl, and I have gigs and gigs free.

RE: Hadoop Disk Error

2010-04-20 Thread Joshua J Pavel
Apologies for filling the thread with troubleshooting. I tried the same configuration on an identical server, and I still have exactly the same errors. I used the same configuration on a Windows system under Cygwin, and it works successfully. So now I'm wondering if there is some incompatibility w

Re: Hadoop Disk Error

2010-04-20 Thread Julien Nioche
Hi Joshua, The error message you got definitely indicates that you are running out of space. Have you changed the value of hadoop.tmp.dir in the config file? J. -- DigitalPebble Ltd http://www.digitalpebble.com On 20 April 2010 14:00, Joshua J Pavel wrote: > I am - I changed the location to

Re: Hadoop Disk Error

2010-04-20 Thread Joshua J Pavel
Yes - how much free space does it need? We ran 0.9 using /tmp, and that has ~1 GB. After I first saw this error, I moved it to another filesystem where I have 2 GB free (maybe not "gigs and gigs", but more than I think I need to complete a small test crawl?).

Re: Hadoop Disk Error

2010-04-20 Thread Joshua J Pavel
Here is the output, with fetcher parsing enabled:

Command output:
  crawl started in: cmrolg-even/crawl
  rootUrlDir = /projects/events/search/nutch-1.0/cmrolg-even/urls
  threads = 10
  depth = 5
  Injector: starting
  Injector: crawlDb: cmrolg-even/crawl/crawldb
  Injector: urlDir: /projects/events/search/

RE: Hadoop Disk Error

2010-04-20 Thread Arkadi.Kosmynin
1 or even 2 GB is far from impressive. Why don't you switch hadoop.tmp.dir to a place with, say, 50 GB free? Your task may be successful on Windows just because the temp space limit is different there.
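For reference, a minimal sketch of overriding hadoop.tmp.dir in conf/nutch-site.xml (or hadoop-site.xml in some setups); the path shown is a placeholder, not something from this thread:

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- placeholder path: point this at a filesystem with tens of GB free -->
    <value>/data/nutch-tmp</value>
  </property>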

Question about crawler.

2010-04-20 Thread Phil Barnett
Is there some place to tell why the crawler has rejected a page? I'm trying to get 1.1 working, and basically it doesn't seem to crawl the same way that 1.0 does. I have tika included in the parse- section of conf/nutch-site.xml. I have DEBUG set for all the crawl sections, but it doesn't really sa
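For anyone trying the same thing, a rough sketch of how DEBUG logging is usually turned up in conf/log4j.properties for Nutch 1.x; the package names below are the usual Nutch ones and are assumptions, not taken from this thread:

  log4j.logger.org.apache.nutch.crawl=DEBUG
  log4j.logger.org.apache.nutch.fetcher=DEBUG
  # URL filter plugins live under org.apache.nutch.urlfilter.* (assumed)
  log4j.logger.org.apache.nutch.urlfilter=DEBUG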

RE: Question about crawler.

2010-04-20 Thread Arkadi.Kosmynin
Hi Phil, > -Original Message- > From: Phil Barnett [mailto:ph...@philb.us] > Sent: Wednesday, 21 April 2010 8:39 AM > To: nutch-user@lucene.apache.org > Subject: Question about crawler. > > Is there some place to tell why the crawler has rejected a page? I'm > trying > to get 1.1 working

nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread joshua paul
nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com. I am using this command:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

- the urls directory contains urls.txt, which contains http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([
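The filter rule above is cut off by the digest; as a hedged guess, the complete line usually follows the stock Nutch pattern with fmforums.com substituted for MY.DOMAIN.NAME:

  # crawl-urlfilter.txt - accept only pages under fmforums.com
  # (assumed completion of the truncated rule; not verbatim from the thread)
  +^http://([a-z0-9]*\.)*fmforums.com/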

RE: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Arkadi.Kosmynin
What is in your regex-urlfilter.txt? > -Original Message- > From: joshua paul [mailto:jos...@neocodesoftware.com] > Sent: Wednesday, 21 April 2010 9:44 AM > To: nutch-user@lucene.apache.org > Subject: nutch says No URLs to fetch - check your seed list and URL > filters when trying to index

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread joshua paul
After getting this email, I tried commenting out this line in regex-urlfilter.txt:

  #-[...@=]

but it didn't help... I still get the same message - no URLs to fetch.

regex-urlfilter.txt contains:

  # skip URLs containing certain characters as probable queries, etc.
  -[...@=]
  # skip URLs with slash-delimited
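The character class in that rule has been obfuscated by the mail archive; for orientation, the corresponding section of a stock Nutch 1.x regex-urlfilter.txt looks roughly like the sketch below (the exact characters and the repeated-segment regex are from memory and should be checked against your own copy):

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]
  # skip URLs with slash-delimited segment that repeats 3+ times (breaks loops)
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/
  # accept anything else
  +.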

Re: Question about crawler.

2010-04-20 Thread Phil Barnett
On Tue, Apr 20, 2010 at 7:02 PM, wrote: > Hi Phil, > > > -Original Message- > > From: Phil Barnett [mailto:ph...@philb.us] > > Sent: Wednesday, 21 April 2010 8:39 AM > > To: nutch-user@lucene.apache.org > > Subject: Question about crawler. > > > > Is there some place to tell why the crawl

Re: Question about crawler.

2010-04-20 Thread Phil Barnett
I meant the production 1.0 server is still crawling them.

conf questions

2010-04-20 Thread Phil Barnett
What's the difference between regex-urlfilter.txt and crawl-urlfilter.txt? What uses what?

Re: Format of the Nutch Results

2010-04-20 Thread Harry Nutch
Try bin/nutch on the console. It will give you a list of commands. You could use them to read segments, e.g. bin/nutch readdb .. On Mon, Apr 19, 2010 at 11:36 PM, nachonieto3 wrote: > > I have a doubt...How are the final results of Nutch stored?I mean, in which > format is stored the information c
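To make that concrete, a minimal sketch of the read commands in Nutch 1.x; the crawl/ directory and the segment timestamp are placeholders, not paths from this thread:

  # summary statistics of the crawl database
  bin/nutch readdb crawl/crawldb -stats
  # dump a fetched segment (content, parse text, etc.) as plain text
  bin/nutch readseg -dump crawl/segments/20100420123456 segdump
  # dump the link database
  bin/nutch readlinkdb crawl/linkdb -dump linkdump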

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Harry Nutch
Did you check robots.txt? On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote: > after getting this email, I tried commenting out this line in > regex-urlfilter.txt = > > #-[...@=] > > but it didn't help... i still get same message - no urls to feth > > > regex-urlfilter.txt = > > # skip URLs conta

incremental nutch crawl on remote machine

2010-04-20 Thread Piet van Remortel
Hi all, I'm new to Nutch, and turned to it to obtain a setup along the following lines: we want a remote machine, running Nutch (?), that we can incrementally feed URLs to, and access the index and raw content of the crawled versions of those URLs. It seems to me that Nutch is what we need, but I
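For what it's worth, incremental feeding is usually done with the step-by-step commands rather than the one-shot crawl command; a rough sketch for Nutch 1.x follows (directory names are placeholders, and the exact indexing step differs between 1.0 and later versions):

  # add newly discovered or submitted URLs to the crawl database
  bin/nutch inject crawl/crawldb new-urls/
  # pick the next batch, fetch and parse it, then fold the results back in
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=crawl/segments/$(ls crawl/segments | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  # rebuild the link database and the Lucene index over all segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*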