Yeah, copy the segments folder to the 'bin' directory of tomcat and it will
work
On 1/31/06, Raghavendra Prabhu <[EMAIL PROTECTED]> wrote:
>
> Then you have to copy the segments folder to the tomcat directory
>
> When I tried it out, I had to copy it into the directory above webapps
>
>
> Do copy the segments folder to the parent directory of webapps
Hi
Thanks. I got the explanation. So with mapreduce we will be able to process
the crawldb efficiently.
Rgds
Prabhu
On 1/31/06, Byron Miller <[EMAIL PROTECTED]> wrote:
>
> Prabhu,
>
> For nutch .7x the upper limit of webdb isn't
> necessarily file size but hardware/computation size.
> You basicall
Hi,
Is it possible to use Nutch to collect web statistics like the ones
Google collected about web html/css usage:
http://code.google.com/webstats/index.html
if not, is there a better way to do it?
Thank you.
Dear Nutch Users,
We've been using Nutch 0.8 (MapReduce) to perform some internet
crawling. Things seemed to be going well on our 11 machines (1 master
with JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes)
until...
060129 222409 Lost tracker 'tracker_56288'
060129 222409 Task 'tas
I have a huge disk too and the /tmp folder was fine, with almost 200G free
space on that partition, but it still fails. I am going to do the same and
look for the bad URL that causes the problem. But how come Nutch is sensitive
to a particular URL and fails? It might be because of the parser plugins.
Okay, I think I have done a good crawl and when I do a stats command I
get this...
[EMAIL PROTECTED] nutch-nightly]# bin/nutch readdb crawl/crawldb -stats
060130 175317 CrawlDb statistics start: crawl/crawldb
060130 175317 parsing
file:/nutch_binaries/nutch-nightly/conf/nutch-default.xml
060130
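As an aside, if memory serves, the same readdb tool accepts a few other options besides -stats; a hedged sketch (exact option names may vary between nightly builds):

  bin/nutch readdb crawl/crawldb -stats              # overall counts by fetch status
  bin/nutch readdb crawl/crawldb -dump dumpdir       # dump all entries as text
  bin/nutch readdb crawl/crawldb -topN 10 topdir     # top-N entries by score
  bin/nutch readdb crawl/crawldb -url http://example.org/   # look up a single URL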
Hi,
I don't think it's a problem of disk capacity, since I am working on a huge disk
and only 10% is used.
What I decided to do is to split the seed into two parts and see if I still
get this problem. One half ended successfully but the second had the same
problem, so I am continuing with the splitting.
Prabhu,
For nutch .7x the upper limit of webdb isn't
necessarily file size but hardware/computation size.
You basically need 210% of your webdb size to do any
processing of it, so if you have 100 million urls and a
1.5 terabyte webdb you need (on the same server) 3.7
terabytes of disk space to
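Rough arithmetic under that rule of thumb, as I read it: working space is about 2.1 x the webdb size, so a 1.5 TB webdb wants roughly 1.5 x 2.1 = 3.15 TB free on the processing machine, in the same ballpark as the figure quoted above.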
If I am not mistaken, I think you can also enter it like this:
+^http://lucene.apache.org/nutch/
Any outgoing link which meets the above condition will work.
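For what it's worth, a minimal sketch of the two files for an intranet crawl of more than one site (the second URL is only a placeholder, since the second URL in the original question was cut off; the trailing -. line rejects everything not matched above):

  # urls/seeds.txt -- one seed URL per line
  http://lucene.apache.org/nutch/
  http://example.org/site2/

  # conf/crawl-urlfilter.txt
  +^http://lucene.apache.org/nutch/
  +^http://example.org/site2/
  # skip everything else
  -.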
On 1/31/06, Lakshman, Madhusudhan <[EMAIL PROTECTED]>
wrote:
>
> Hi,
>
>
>
> I am trying to configure for multiple site indexing using intr
Hi Stefan,
Thanks for your mail.
What I would like to know is (since I am using nutch-0.7): what is the upper
limit on the webdb size, if any such limit exists in nutch-0.7?
Will generate work for a webdb formed from one TB of data (just an example)?
And what is the difference between webdb
Then you have to copy the segments folder to the tomcat directory
When I tried it out, I had to copy it into the directory above webapps.
Do copy the segments folder to the parent directory of webapps
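For what it's worth, a minimal sketch of what that looks like on disk (paths illustrative; as far as I understand, the 0.7 webapp looks for segments and index relative to the directory tomcat was started from):

  cd /usr/local/tomcat              # i.e. the parent of webapps/ in this case
  cp -r /path/to/crawl/segments .
  cp -r /path/to/crawl/index .      # the searcher needs the index too, not just segments
  bin/shutdown.sh && bin/startup.sh # restart so the searcher re-opens the new data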
On 1/30/06, Sameer Tamsekar <[EMAIL PROTECTED]> wrote:
>
> yes, i do have 0.7.1 version but problem persists.
On Unix you can delete the file, but its contents will not be removed
until the application that is keeping the file open closes it.
Just look at the disk usage change after deletion and after shutting
down searchers.
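For example, a quick way to watch this happen (a sketch; lsof availability and options vary by system):

  lsof +L1 | grep -i segments    # files that are deleted but still held open by a process
  df -h /path/to/partition       # compare before and after shutting down the searcher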
Regards
Piotr
Sunnyvale Fl wrote:
My question was buried in another thread so I
Hi,
I am trying to configure for multiple site indexing using intranet
crawling. I need help on how to keep the entries in the "urls" flat
file and the crawl-urlfilter.txt files.
For example, I want to configure for the below mentioned 2 URLs,
1.http://lucene.apache.org/nutch/
2.http
My question was buried in another thread so I'll pull it out here.
Can someone help clarify?
Does the searcher load everything into memory, including segments, on
startup? Because it seems that if I delete segments or replace them
while the
searcher is running, it doesn't affect the search result
Thanks Stefan. It turns out my problem/confusion was because I
was using fields="[my_field_name]" instead of fields="DEFAULT" in the
plugin.xml definition of my query filter. If I understand it correctly
that was causing my filter to only get used when I did a search for
"[my_field_name]:
Yes, I do have the 0.7.1 version but the problem persists.
Is there any difference between the two situations:
1) use several entry urls in flat file and url patterns in
crawl-urlfilter.txt when doing intranet crawl
2) inject only a few urls and use url patterns in regex-urlfilter.txt when
doing whole-web crawl
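To make the two situations concrete, roughly (0.8-style commands from memory; segment names illustrative, 0.7 syntax differs slightly):

  # 1) intranet crawl: one-shot tool, filtered by conf/crawl-urlfilter.txt
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

  # 2) whole-web crawl: step-by-step tools, filtered by conf/regex-urlfilter.txt
  bin/nutch inject crawldb urls
  bin/nutch generate crawldb segments
  bin/nutch fetch segments/2006...          # fetch the newly generated segment
  bin/nutch updatedb crawldb segments/2006...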
--
《盖世豪侠》 received rave reviews and kept TVB's ratings consistently high; even so, TVB still
Thank you.
On 06-1-30, Steve Betts <[EMAIL PROTECTED]> wrote:
>
> Actually, the ^ means start of line. This character is used as a negative
> indicator only within the context of sets, e.g., [^0-9].
>
> Thanks,
>
> Steve Betts
> [EMAIL PROTECTED]
> 937-477-1797
>
> -----Original Message-----
> From: 盖世豪侠
^ is a negation in character classes, if I remember correctly; however, in
this regex it means the beginning of the line, just as $ means the end of the line (input).
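A couple of concrete cases may help (my reading of the pattern):

  +^http://([a-z0-9]*\.)*pilat.free.fr/
  # ^ anchors the match at the start of the URL
  # ([a-z0-9]*\.)* allows zero or more subdomain labels before the host
  # accepted:  http://pilat.free.fr/index.html
  # accepted:  http://www.pilat.free.fr/photos/
  # rejected:  http://example.com/?ref=pilat.free.fr  (the URL does not start with the pattern)
  # and -^(file|ftp|mailto) simply rejects URLs whose scheme is file, ftp or mailto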
On Mon, 2006-01-30 at 21:57 +0800, 盖世豪侠 wrote:
> +^http://([a-z0-9]*\.)*pilat.free.fr/
> As far as I know, '^' means matching the characters not within
Actually, the ^ means start of line. This character is used as a negative
indicator only within the context of sets, e.g., [^0-9].
Thanks,
Steve Betts
[EMAIL PROTECTED]
937-477-1797
-----Original Message-----
From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
Sent: Monday, January 30, 2006 8:58 AM
To: nutch
If you just want to save a copy of the site locally, you'd
probably be better off using wget.
Jake.
-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Sunday, January 29, 2006 7:21 PM
To: nutch-user@lucene.apache.org
Subject: Re: download/mirror
Nutch al
Dominik Friedrich wrote:
in the searcher log it clearly says it opens the linkdb for some reason:
Well, I haven't looked at the searcher code yet, so I might be wrong
and the linkdb is used for searching. Maybe somebody can explain what
data is used for searching.
NutchBean uses linkDB to r
+^http://([a-z0-9]*\.)*pilat.free.fr/
As far as I know, '^' means matching the characters not within a range by
*complementing* the set, so why is it an accepted pattern for crawl urls?
Is it the same with
-^(file|ftp|mailto)
Any difference?
--
《盖世豪侠》 received rave reviews and kept TVB's ratings consistently high, yet TVB still did not make greater use of him. 周星驰 (Stephen Chow) was hardly one to stay in the pond; now that his comedic talent had shown itself, naturally
Gal Nitzan wrote:
1. fetch/parse - ndfs
2. decide how many segments (datasize?) you want on each searcher
machine
3. invert,index,dedup, merge selected indexes to some ndfs folder
4. copy the ndfs folder to searcher machine
Sounds OK to me. When updating the linkdb you should also include
p
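A rough sketch of what steps 1-4 could look like as commands (0.8-style tool names from memory; paths, segment names and the NDFS copy syntax are illustrative):

  # 1. fetch/parse into NDFS
  bin/nutch generate crawldb segments
  bin/nutch fetch segments/2006...
  # 2. how many segments go to each searcher machine is a manual/layout decision
  # 3. invert links, index, dedup, then merge the selected indexes
  bin/nutch invertlinks linkdb segments/*
  bin/nutch index indexes crawldb linkdb segments/*
  bin/nutch dedup indexes
  bin/nutch merge index indexes
  # 4. copy the finished folder out of NDFS onto the searcher machine
  bin/nutch ndfs -get /user/nutch/crawl /local/search/crawl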
Dominik, thank you so much for your answers, you have been very helpful.
Just one more :)
if I understand correctly... the way to go about the whole process:
1. fetch/parse - ndfs
2. decide how many segments (datasize?) you want on each searcher
machine
3. invert,index,dedup, merge selected inde
I looked through the mailing list archive for a Nutch IRC channel and
searched with Google, but didn't find any. So I opened #nutch on EFnet
and will be there whenever I am online. If there is already a Nutch IRC
channel, please let me know where.
best regards,
Dominik
Yes, this is correct. For each job a new job.xml is generated.
best regards,
Dominik
Raghavendra Prabhu wrote:
Hi
Thanks for helping me out, but I have some additional doubts.
So essentially, while running a nutch indexing process, does the jobtracker
distribute this
file job.xml (after gene
Gal Nitzan wrote:
I have copied only the segments directory but the searcher returns 0
hits.
You have to put the index and segments dir into a directory named
"crawl" and start tomcat from the directory that contains crawl. The
nutch.war file contains a nutch-default.xml with
searcher.
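Concretely, something like this layout (a sketch; paths illustrative, and if I recall correctly the webapp's default searcher.dir is "crawl" relative to the directory tomcat is started from):

  crawl/
    index/
    segments/
    linkdb/
  # start tomcat from the directory that contains crawl/
  cd /path/that/contains/crawl && /usr/local/tomcat/bin/startup.sh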
Thanks for your reply.
I have copied only the segments directory but the searcher returns 0
hits.
Do I need to copy the linkdb and the index folders as well?
Thanks.
On Sun, 2006-01-29 at 23:18 +0100, Dominik Friedrich wrote:
> Gal Nitzan wrote:
> > 1. If NDFS is too slow and all data must be
You can already use ndfs in 0.7; however, if the webdb is too
large it takes too much time to generate segments.
So the problem is the webdb size, not the hdd limit.
On 30.01.2006 at 07:31, Raghavendra Prabhu wrote:
Hi Stefan
So can i assume that hard disk space is the only constraint in
Sameer
Can you tell me which version of nutch you have?
If you are using nutch-0.8, the search index has to be in a folder called
crawl under the tomcat directory (assuming that you are using tomcat), e.g.
/usr/bin/tomcat/crawl
The crawl folder should contain the index and segments.
if you are using nutch-0.7 copy
Hi all,
I crawled one website successfully using whole-web crawling, but when
I try to search the index I don't get any results.
When I perform the same search using Luke (on the same index), I get
some results.
Can somebody help me tackle this problem?
Regards,
Sameer Tamsekar
Hi
Thanks for helping me out, but I have some additional doubts.
So essentially, while running a nutch indexing process, does the jobtracker
distribute this
file job.xml (after generating it) to the clients (task trackers)?
And if I run a new indexing process with the mapred-default.xml, will t
When the tasktracker starts a task it reads the config files in the
following order: nutch-default.xml, mapred-default.xml, job.xml,
nutch-site.xml.
Except for job.xml, all files are the local ones in the tasktracker's conf
directory. The job.xml is generated for each job by the tool you use,
e.g. G
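To make the ordering concrete (later files override earlier ones):

  nutch-default.xml  <  mapred-default.xml  <  job.xml  <  nutch-site.xml

so, for example (property name purely illustrative), a value such as mapred.map.tasks set in the submitter's mapred-default.xml gets baked into the generated job.xml and shipped with the job, while a tasktracker whose local nutch-site.xml sets the same property ends up using its own local value instead.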