Okay, I actually just wrote you a long email of what to do, step by step but
when I tried to send it, my web mail session timed out and forced me to
re-login, losing it all... I'm not happy :(
But straight to the point: since you're using the older 0.7 code-base, you can use
partially fetched
On 1/2/07, Sean Dean [EMAIL PROTECTED] wrote:
There actually isn't much of a reason to generate huge multi-million page
fetch lists when you can create lots of smaller ones and merge them together. This allows
for more of a ladder-style approach, and in some cases reduces the risk of errors in
Thank you, Sean Dean. That sounds good, I will try it out.
Tell me if I am right: in case a dmoz index file is injected into the db, I then
generate only a few segments by using -subset and fetch them,
and then go on and generate the next set of segments. I hope I am heading in the
right direction.
You need to delete the old index before you re-index when working within the
same directory structure.
This is the procedure I follow, which is pretty much what you're doing. This
assumes you already have at least one active segment and index. Edit as needed.
bin/nutch generate crawl/crawldb
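The generate command above is only the first step of a round; a rough sketch of one full generate/fetch/update cycle, in 0.8-era syntax (the directory layout under crawl/ and the -topN value are assumptions, not part of the original procedure), might look like:

```shell
# One small round: cap the fetch list with -topN so each segment stays
# manageable, then fetch it and fold the results back into the crawldb.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000

# Pick up the segment that generate just created (newest directory).
segment=$(ls -d crawl/segments/* | tail -1)

bin/nutch fetch "$segment"                   # fetch the pages
bin/nutch updatedb crawl/crawldb "$segment"  # update the db with fetch results
```

Repeating this loop gives the ladder-style approach described earlier: many small segments instead of one multi-million page fetch list.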
I'm glad you got the slowness issue straightened out.
When you import the dmoz urls into your Nutch DB, the -subset command isn't
really meant to limit the size of your fetch lists. This becomes even more true
when you start re-fetching. You can actually skip the subset command and allow
all
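For reference, skipping -subset just means parsing the whole DMOZ dump before injecting; a sketch along the lines of the standard Nutch tutorial (file and directory names here are assumptions, and the inject syntax shown is the 0.8-style form):

```shell
mkdir -p urls
# Parse the full DMOZ RDF dump into a flat list of URLs.
# (Adding e.g. "-subset 5000" would instead take roughly every 5000th URL.)
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > urls/dmoz

# Inject the whole list into the crawl db; the size of each individual
# fetch list is then controlled later, at generate time.
bin/nutch inject crawl/crawldb urls
```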
Oh, I understand it now. Well, thanks again for your help, Sean.
I was wondering if anyone would be interested in making a GUI to set up and run
the crawl, say for novice users.
I don't know if there is any.
I would be glad to help if people are keen on making one.
Thanks & Regards
Shrinivas
There is currently open development on a Nutch administration GUI for version
0.9. I have not tested it, or even really looked at it myself, but apparently most
of the features work. It will not work on your version, but here is the link
to JIRA where you can find the patches and ongoing
On 1/2/07, Sean Dean [EMAIL PROTECTED] wrote:
You need to delete the old index before you re-index when working within the
same directory structure
This is the procedure I follow, which is pretty much what you're doing. This
assumes you already have at least one active segment and index. Edit as
As an interim solution when using the Nutch front end, what we did is
generate the new index in a temporary folder. Then our script (Ant
actually) would turn off the web server (Tomcat in our case) to free the
existing index from the Nutch bean, and do a quick switcheroo using OS
rename commands.
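The switcheroo itself can be as small as two renames; a minimal sketch, where the Tomcat stop/start commands, paths, and the helper's name are all assumptions for illustration:

```shell
# Swap a freshly built index into place using plain renames. Assumes the
# web server has already been stopped, so nothing holds the old index open.
swap_index() {
  local live="$1" incoming="$2"
  mv "$live" "$live.old"    # keep the previous index around as a fallback
  mv "$incoming" "$live"    # rename the new index into the live path
}

# Typical use, wrapped by the server stop/start (commands are assumptions):
# $CATALINA_HOME/bin/shutdown.sh
# swap_index crawl/index crawl/index.new
# $CATALINA_HOME/bin/startup.sh
```

As the follow-up below notes, stopping the web server for every re-index is not ideal for production, but renames are fast, so the window is short.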
Looking at what I wrote, yes, it will not be acceptable for a production
environment.
What I failed to mention is that I copy the completed crawl directory somewhere
else, and point Tomcat to look there instead via nutch-site.xml.
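For anyone following along, the property the Nutch search webapp reads for this is searcher.dir; a sketch of the override (the path below is only an example, not a required location):

```xml
<!-- nutch-site.xml: point the search webapp at the copied crawl
     directory instead of the default "crawl" relative path. -->
<property>
  <name>searcher.dir</name>
  <value>/data/nutch/crawl-live</value>
</property>
```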
When you've completed all those steps and have your new index
I'm getting a similar problem with 0.9, but during the injector:
2007-01-02 21:30:25,797 INFO crawl.Injector - Injector: Merging injected
urls into crawl db.
2007-01-02 21:30:25,906 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java
Alan Tanaman wrote:
I'm getting a similar problem with 0.9, but during the injector:
2007-01-02 21:30:25,797 INFO crawl.Injector - Injector: Merging injected
urls into crawl db.
2007-01-02 21:30:25,906 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform...
Thanks for pointing that out Andrzej. I think that the problem is actually
that our native Java Nutch launcher needs to be rebuilt with the Hadoop 0.9
jar.
Best regards,
Alan
_
Alan Tanaman
iDNA Solutions
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL