Hi all
I'm new to Nutch and turned to it to put together a setup along the following
lines: we want a remote machine, running Nutch (?), that we can incrementally
feed URLs to, and from which we can access both the index and the raw content
of the crawled versions of those URLs.
It seems to me that Nutch is what we need, but I
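From the tutorial, the incremental cycle appears to look something like the
commands below (I may be misreading it, and the directory names are just
placeholders I made up):

  bin/nutch inject crawl/crawldb urls            # add newly supplied URLs to the crawldb
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`          # the segment generate just created
  bin/nutch fetch $s                             # fetch (and, if enabled, parse) the pages
  bin/nutch updatedb crawl/crawldb $s            # feed newly discovered links back in
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s   # build the Lucene index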
Did you check robots.txt?
On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote:
> after getting this email, I tried commenting out this line in
> regex-urlfilter.txt =
>
> #-[?*!@=]
>
> but it didn't help... I still get the same message - no urls to fetch
>
>
> regex-urlfilter.txt =
>
> # skip URLs containing certain characters as probable queries, etc.
try bin/nutch on the console.
It will give you a list of commands. You could use them to read segments e.g
bin/nutch readdb ..
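For example, assuming your crawl directory is called "crawl" (adjust the paths
to whatever you used, and pick a real segment name from crawl/segments):

  bin/nutch readdb crawl/crawldb -stats                    # URL counts by status
  bin/nutch readdb crawl/crawldb -dump crawldb-dump        # dump the crawldb as text
  bin/nutch readseg -list crawl/segments/20100421123456    # summary of one segment
  bin/nutch readseg -dump crawl/segments/20100421123456 segdump   # raw content + parse text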
On Mon, Apr 19, 2010 at 11:36 PM, nachonieto3 wrote:
>
> I have a doubt... How are the final results of Nutch stored? I mean, in what
> format is the information contained in the analyzed links stored?
What's the difference between regex-urlfilter.txt and crawl-urlfilter.txt?
What uses what?
I meant the production 1.0 server is still crawling them.
On Tue, Apr 20, 2010 at 7:02 PM, wrote:
> Hi Phil,
>
> > -Original Message-
> > From: Phil Barnett [mailto:ph...@philb.us]
> > Sent: Wednesday, 21 April 2010 8:39 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Question about crawler.
> >
> > Is there some place to tell why the crawler has rejected a page?
after getting this email, I tried commenting out this line in
regex-urlfilter.txt =
#-[?*!@=]
but it didn't help... I still get the same message - no urls to fetch
regex-urlfilter.txt =
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited
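For reference, the stock file is roughly laid out like this (quoting from
memory, so the exact lines may differ slightly); the final +. rule is what
actually accepts URLs, and the - rules above it only exclude things:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):
  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]
  # accept anything else
  +.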
What is in your regex-urlfilter.txt?
> -Original Message-
> From: joshua paul [mailto:jos...@neocodesoftware.com]
> Sent: Wednesday, 21 April 2010 9:44 AM
> To: nutch-user@lucene.apache.org
> Subject: nutch says No URLs to fetch - check your seed list and URL
> filters when trying to index
nutch says No URLs to fetch - check your seed list and URL filters when
trying to index fmforums.com.
I am using this command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
- urls directory contains urls.txt which contains http://www.fmforums.com/
- crawl-urlfilter.txt contains +^http://([
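The stock crawl-urlfilter.txt ships with a placeholder rule that has to be
edited for the site being crawled, so for this crawl I would expect the
(cut-off) line above to end up as something like:

  # accept hosts in fmforums.com
  +^http://([a-z0-9]*\.)*fmforums.com/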
Hi Phil,
> -Original Message-
> From: Phil Barnett [mailto:ph...@philb.us]
> Sent: Wednesday, 21 April 2010 8:39 AM
> To: nutch-user@lucene.apache.org
> Subject: Question about crawler.
>
> Is there some place to tell why the crawler has rejected a page? I'm
> trying
> to get 1.1 working
Is there some place to tell why the crawler has rejected a page? I'm trying
to get 1.1 working and basically it doesn't seem to crawl the same way that
1.0 does.
I have tika included in the parse- section of conf/nutch-site.xml
I have DEBUG set for all the crawl sections, but it doesn't really say
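By "DEBUG set for all the crawl sections" I mean conf/log4j.properties entries
along these lines (logger names from memory, so double-check them against your
copy of the file):

  log4j.logger.org.apache.nutch.crawl.Injector=DEBUG,cmdstdout
  log4j.logger.org.apache.nutch.crawl.Generator=DEBUG,cmdstdout
  log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
  log4j.logger.org.apache.nutch.parse.ParseSegment=DEBUG,cmdstdout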
1 or even 2 GB is far from impressive. Why don't you switch hadoop.tmp.dir to
a place with, say, 50GB free? Your task may be successful on Windows just
because the temp space limit is different there.
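Something like this in conf/nutch-site.xml should do it (the path is just an
example, point it anywhere with enough room):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/some/big/filesystem/nutch-tmp</value>
    <description>Temporary directory used by Hadoop during Nutch jobs.</description>
  </property>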
From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Wednesday, 21 April 2010 3:40 AM
To: nutch-user@lucene.apache.org
Here is the output, with fetcher parsing enabled:
Command output:
crawl started in: cmrolg-even/crawl
rootUrlDir = /projects/events/search/nutch-1.0/cmrolg-even/urls
threads = 10
depth = 5
Injector: starting
Injector: crawlDb: cmrolg-even/crawl/crawldb.
Injector: urlDir: /projects/events/search/
Yes - how much free space does it need? We ran 0.9 using /tmp, and that
has ~ 1 GB. After I first saw this error, I moved it to another filesystem
where I have 2 GB free (maybe not "gigs and gigs", but more than I think I
need to complete a small test crawl?).
Hi Joshua,
The error message you got definitely indicates that you are running out of
space. Have you changed the value of hadoop.tmp.dir in the config file?
J.
--
DigitalPebble Ltd
http://www.digitalpebble.com
On 20 April 2010 14:00, Joshua J Pavel wrote:
> I am - I changed the location to
Apologies for filling the thread with troubleshooting.
I tried the same configuration on an identical server, and I still have the
same exact errors. I used the same configuration on a Windows system under
Cygwin, and it works successfully. So now I'm wondering if there is some
incompatibility with
I am - I changed the location to a filesystem with lots of free space and
watched disk utilization during a crawl. It'll be a relatively small
crawl, and I have gigs and gigs free.
I have a doubt related to this topic (I guess)... How are the final results of
Nutch stored? I mean, in what format is the information contained in the
analyzed links stored?
I understood that Nutch needs the information in plain text to parse it... but
in what format is it finally stored? I know it is stored in "segments"
I have a doubt... How are the final results of Nutch stored? I mean, in what
format is the information contained in the analyzed links stored?
I understood that Nutch needs the information in plain text to parse it... but
in what format is it finally stored? I know it is stored in "segments" but how
can I
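The earlier suggestion to try bin/nutch applies here: the results are kept as
Hadoop map/sequence files under crawldb, linkdb and segments/<timestamp>/
(content, crawl_fetch, crawl_parse, parse_data, parse_text), and the read
commands turn them back into plain text, e.g. (segment name is just an example):

  bin/nutch readseg -dump crawl/segments/20100421123456 segdump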