Yes! I have abandoned the 'crawl' command for even my single site
searches. I wrote shell scripts that accomplish (generally) the same
tasks the crawl does.
The only piece I had to watch out for is: one of the first things the
'crawl' class does is load 'crawl-tool.xml'. So to get the
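For anyone curious, the loop such scripts implement can be sketched roughly like this (a sketch only, assuming the Nutch 0.7-era command-line tools and a db/segments layout; the exact arguments differ between releases, so check the usage output of bin/nutch for yours):

```shell
#!/bin/sh
# Rough sketch of a manual crawl loop replacing the all-in-one 'crawl' command.
# The paths (db, segments) and arguments are assumptions for 0.7-era tools.
DEPTH=3
i=1
while [ "$i" -le "$DEPTH" ]; do
  bin/nutch generate db segments          # select URLs due for fetching
  seg=$(ls -d segments/* | tail -1)       # newest segment, just generated
  bin/nutch fetch "$seg"                  # fetch its pages
  bin/nutch updatedb db "$seg"            # fold newly found links into the db
  i=$((i + 1))
done
```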
Hi,
1. I have disabled speculative tasks by setting it to false in
hadoop-site.xml
2. Now I notice that the fetcher does not complete the whole fetchlist.
3. By adding additional logging in generate I see 2 links being
generated, but the fetcher, without any indication of an error, just
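For reference, the setting in question usually looks something like this in hadoop-site.xml (the property name is an assumption based on Hadoop releases of that era; check the hadoop-default.xml that ships with your version for the exact name):

```xml
<!-- hadoop-site.xml: disable speculative task execution -->
<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>
```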
I like to think of it as a framework. Building blocks
to build what you ultimately need.
If you're after a one-stop shop, plug and play, no
development necessary, then perhaps some other
commercial system may be your best bet.
Mailing list is very active, most people get responses
fairly quickly.
Make sure you have language-identifier enabled in your
web deployment as well.
WEB-INF/classes/nutch-site.xml or nutch-default.xml
and restart your app server.
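Concretely, that means making sure language-identifier appears in the plugin.includes value of the deployed webapp's config, e.g. in WEB-INF/classes/nutch-site.xml (the rest of the plugin list here is illustrative; start from the default value in your own nutch-default.xml):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
</property>
```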
-byron
--- Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
Hello,
I enabled language-identifier plugin and indexed
some documents.
But
Hi,
I think that this is my first post. I follow the mailing list and read as
many of the emails as I can.
I'm going to make a few proposals.
I have obtained some money to spend on them.
I use and get paid for my nutch expertise.
I have some experience.
I don't just speak for myself but also for
Hello all,
I think Nutch is a fantastic product. I used 0.6 initially, then 0.7.
My 0.7 installation is in production, and mostly works really well. I
haven't made the move to 0.8 yet, because the direction that Nutch has
gone for 0.8 is quite different from what my organisation requires from
I think of the Nutch project as a marathon, not a sprint. Nutch's
stated goals include:
* Scale to entire web
- pages on millions of different servers
- billions of pages
* Support high traffic
- thousands of searches per second
* State-of-the-art search quality
(see
Hello again.
OK - first of all I hate mailing lists. I don't consider them to be a valid
form of communication for anything but the people doing the coding and don't
really consider them of much use at all unless there is no other
alternative. Except one - and that is when there needs to be
This is very slow!
You can expect results in less than a second from my experience.
+ check memory settings of tomcat.
+ you do not use ndfs, right?
On 06.03.2006 at 00:23, Insurance Squared Inc. wrote:
Asking again for the patience of the list, we're still working on
speed. I guess what I
In my hacking, I inadvertently lost a segment. What happened? How in
the Heck did I manage to do something so stupid?
Well, when I started another round of fetching I did a generate command
and specified /crawldb/segements/* as the segment, and it made my new
segment under the
directory of the
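For what it's worth, generate creates the new segment itself under the segments directory; you pass it the db and the segments parent, not an existing segment path. A sketch (paths hypothetical, 0.7-era syntax):

```shell
# Let generate create segments/<timestamp> itself; don't pass an
# existing segment path as the output location.
bin/nutch generate db segments
seg=$(ls -d segments/* | tail -1)   # pick up the segment it just created
bin/nutch fetch "$seg"
bin/nutch updatedb db "$seg"
```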
That's correct, we're not using ndfs. As far as I know it's an out of
the box installation of Mandrake 2006, tomcat, and nutch.
Byron's suggestion of merging to one index cut search times by about 1/3
to 1/2. I think we've already looked at the tomcat memory settings but
I'll ask our developer to
Hi,
http or www are very good test queries.
double-check that the nutch-default.xml inside the nutch.war
points to the correct folder in the searcher.dir property.
Stefan
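In other words, the searcher.dir property should name the directory that contains your index and segments, e.g. (the path here is hypothetical):

```xml
<property>
  <name>searcher.dir</name>
  <value>/home/user/nutch/crawl</value>
</property>
```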
On 06.03.2006 at 02:31, Hasan Diwan wrote:
I've followed the nutch tutorial for crawling and started tomcat from
the
If none are being fetched, something is definitely wrong with
your filter or url file.
Yes, since it is a blog it may have dynamic pages like foo.com?entry=23;
these are definitely filtered out by default.
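For context, the stock url filter file (crawl-urlfilter.txt in 0.7) contains a rule that drops any URL with query-string characters; commenting it out, or adding an accept rule for the site, lets dynamic pages through. An illustrative excerpt (the foo.com pattern is hypothetical):

```
# crawl-urlfilter.txt: '-' lines skip matching URLs, '+' lines accept them.
# This default rule is what drops dynamic URLs like foo.com?entry=23:
# -[?*!@=]       <- comment it out (or narrow it) to allow query strings
+^http://([a-z0-9]*\.)*foo\.com/
```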
-
blog: http://www.find23.org
company:
Gentlemen:
On 05/03/06, Richard Braman [EMAIL PROTECTED] wrote:
This sounds like your crawl didn't get anything. I have seen that
happen when the url wasn't added right, or the filter was bad. Pipe the
crawl to crawl.log and look in there. It should show some pages being
fetched. If none
Hi,
for people who found that interesting: I published some
measurement values I took a long time ago.
http://www.find23.net/Web-Site/blog/A712F01B-4BB1-4FC6-AE95-E64988FBCC79.html
All time-related values are in milliseconds.
Don't take the values too seriously; however, at least they
It did fetch some urls:
-Original Message-
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Sunday, March 05, 2006 9:35 PM
To: nutch-user@lucene.apache.org
Subject: Re: NullPointerException
Hey Hasan
Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
When I read pages out of a webdb and printed out the url of each page, I
found two urls are just the same.
Is it possible that two pages with the same url?
--
《盖世豪侠》 drew rave reviews and kept TVB's ratings high,
yet even so, TVB still declined to promote him. Stephen Chow was
hardly one to stay in a small pond; once his comic talent showed,
he was naturally unwilling to be left out in the cold, so he moved
to film to display his flair on the big screen. TVB had gained a
thousand-li horse and lost it, and of course regretted it too late.
Mr Tang:
Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
[your-query-string] in shell/cmd?
server: 7:20pm % ./bin/nutch org.apache.nutch.searcher.NutchBean hasan
060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060305 192042 10 parsing
I'll take part in your forum. Just added first post.
-Original Message-
From: Greg Boulter [mailto:[EMAIL PROTECTED]
Sent: Sunday, March 05, 2006 6:33 PM
To: nutch-user@lucene.apache.org
Subject: Re: [Nutch-general] Re: project vitality?
Hello again.
OK - first of all I hate mailing
Mr Tang:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
Weird! Are you running nutch on the local file system or the distributed file system?
Local file system
And can you find the same query hasan via luke?
Nope
--
Cheers,
Hasan Diwan [EMAIL PROTECTED]
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote:
Mr Tang:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
Weird! Are you running nutch on the local file system or the distributed
file system?
Local file system
And can you find the same query hasan via luke?
Nope
ok. As Stefan said, can you get
Steven, could you share those shell scripts?
-Original Message-
From: Steven Yelton [mailto:[EMAIL PROTECTED]
Sent: Sunday, March 05, 2006 10:22 AM
To: nutch-user@lucene.apache.org
Subject: Re: how can i go deep?
Yes! I have abandoned the 'crawl' command for even my single site
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
ok. As Stefan said, can you get any hit when you try to search http or
www?
No
Hey, can you zip the index and send it to me directly?
--
Cheers,
Hasan Diwan [EMAIL PROTECTED]
--
Keep
Hasan
It seems your index is not complete.
If you got whole (correct) indices, the index dir should include
1. segments file
2. deletable file
3. other files
I am not sure what's wrong in nutch-0.7.1 indexing, but is it now
possible to upgrade to nutch 0.8 (svn version)?
/Jack
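A quick way to check for those files, assuming a 0.7-era Lucene index layout (the crawl/index path is hypothetical; adjust to your own):

```shell
# Hypothetical index path; adjust to your layout.
IDX=crawl/index
for f in segments deletable; do
  [ -f "$IDX/$f" ] || echo "missing $IDX/$f - index looks incomplete"
done
```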
On 3/6/06, Jack
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
I am not sure what's wrong in nutch-0.7.1 indexing, but is it now
possible to upgrade to nutch 0.8 (svn version)?
It is possible, but I was under the assumption that 0.8 required NDFS?
--
Cheers,
Hasan Diwan [EMAIL PROTECTED]
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
You can still build it on local file system:)
Build, yes, but what of deployment? Can I use it in the same way? At
present, I don't have enough resources to run a distributed crawl.
--
Cheers,
Hasan Diwan [EMAIL PROTECTED]
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote:
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote:
You can still build it on local file system:)
Build, yes, but what of deployment? Can I use it in the same way?
Of course yes.
At
present, I don't have enough resources to run a distributed
If you want to narrow down whether it's a Tomcat issue, maybe you
could try running Nutch on another app server like Resin to see if
there's a difference. It's been a while since I used Tomcat, but I
did find the performance to be kind of slow. I think things are
supposed to be better now, but
Right then.. compiled the svn version of nutch. Tried running the
crawl with it and this is the log:
server: 11:32pm % ./bin/nutch crawl ../SpectraSearch/urls -dir
../SpectraSearch/crawl -depth 2 -threads 20
060305 233255 parsing