I am asking a few websites to allow me to index their sites. What should
they add to robots.txt, and where do I get the exact name of my crawler?
Cheers.
Shane
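
For what it's worth, Nutch takes its crawler name from the http.agent.name property in conf/nutch-site.xml, so that value is what a site owner would put on a User-agent line. Below is a minimal Python sketch (not Nutch code) that checks a candidate robots.txt block with the standard-library urllib.robotparser before it is sent to a site owner. The agent name ShaneNutchCrawler, the helper is_allowed and the test URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agent name: use whatever http.agent.name is set to in nutch-site.xml.
AGENT = "ShaneNutchCrawler"

# Candidate block for the site owner: allow this crawler, keep everyone else out.
ROBOTS_TXT = """\
User-agent: ShaneNutchCrawler
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, AGENT, "http://www.somesite.com/page.html"))
```

The second User-agent block is optional; it is only there so the check shows a difference between the named crawler and everyone else.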
I have a few sites I mirror via FTP and would like to index them with
Nutch without hitting the sites, to save bandwidth.
Let's say one is located in /var/www/www.somesite.com/ on the
same server Nutch is located on. How would
you index it and have the search results point to the actual
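
This is not an answer about Nutch configuration, only a sketch of the path-to-URL mapping the question describes: a file under the local mirror is translated back to the address it has on the live site. It assumes the mirror layout /var/www/<hostname>/... from the message and an http scheme; the function name mirror_path_to_url is hypothetical.

```python
from pathlib import PurePosixPath

# Mirror root taken from the message; each site sits in a folder named after its host.
MIRROR_ROOT = PurePosixPath("/var/www")


def mirror_path_to_url(path: str, scheme: str = "http") -> str:
    """Map a mirrored file path to the URL it would have on the live site.

    Raises ValueError if `path` is not inside a site folder under MIRROR_ROOT.
    """
    relative = PurePosixPath(path).relative_to(MIRROR_ROOT)
    host, *rest = relative.parts  # first component is the host name
    return f"{scheme}://{host}/" + "/".join(rest)


if __name__ == "__main__":
    print(mirror_path_to_url("/var/www/www.somesite.com/docs/index.html"))
```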
I have indexed several sites successfully.
Now I wish to index a new site and not update any other sites already
indexed.
I use Nutch 2.21, MySQL 5.3 and Solr 4.7.0. How would you recommend I go
about indexing a new site only?
If someone can give examples of command lines that would be
Can you choose a custom regex-urlfilter.txt to save editing it each
time you wish to index a different site?
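
As far as I know the filter file name is read from the urlfilter.regex.file property, so pointing that at a per-site file may avoid the hand-editing; treat that as an assumption to check against your own nutch-default.xml. The Python sketch below (not part of Nutch) builds the rules for a single domain in the +regex / -regex line format that regex-urlfilter.txt uses, and emulates "first matching rule decides" so the rules can be sanity-checked before a crawl. The function names are hypothetical, and Python's re module stands in for Java's regex engine.

```python
import re
from typing import List


def rules_for_domain(domain: str) -> List[str]:
    """Build filter lines that accept one domain (and its subdomains) only."""
    host = re.escape(domain)
    return [
        f"+^https?://([a-z0-9-]+\\.)*{host}/",  # accept the domain itself and subdomains
        "-.",                                    # reject everything else
    ]


def accepts(rules: List[str], url: str) -> bool:
    """Emulate the filter: the first rule whose pattern is found in the URL decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    rules = rules_for_domain("test.com")
    print("\n".join(rules))
    print(accepts(rules, "http://www.test.com/a.html"))
```

Writing the printed lines to a separate file per site would give one filter file per crawl target.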
I am surprised you can't enter a URL when generating a fetch list, i.e.
/bin/nutch generate --only someurl.com --job 192833-292837
Then you fetch job 192833-292837, parse job
Thanks for your reply
I made no changes except I edited the db.fetch.schedule.class and
replaced the default value with
org.apache.nutch.crawl.AdaptiveFetchSchedule. Now the modifiedTime
Sorry about that, I copied and pasted the name of the field and ended up
sending the link. Oops.
Shane.
On 31/03/14 13:57, Shane Wood wrote:
Thanks for your reply
I made no changes except I edited the db.fetch.schedule.class and
replaced the default value
I'm using Nutch 2.2 as per this install tutorial. Would this patch
already have been added to the newer version?
http://nlp.solutions.asia/?p=362
Enjoy
Shane.
On 27/03/14 18:54, Talat Uyarer wrote:
Hi Shane,
Which version of Nutch do you use? If you use Nutch 2.2.1, this is a bug.
You should
I set up Nutch as per this http://nlp.solutions.asia/?p=362.
I wiped the data within MySQL and re-indexed several times, and these
fields remain NULL:
modifiedTime prevModifiedTime
MySQL version 5.6.16
Nutch version 2.2
./bin/nutch inject urls
./bin/nutch generate -topN 20
./bin/nutch fetch
How do you use the readdb command when using MySQL, where there is no
crawldb created? Can you list the command to use?
Or does Nutch still create a crawldb that I can't find? Where is it
created? I have a /crawl folder but nothing appears in there.
I use Nutch 2.2 and MySQL version 5.6.16 as per
As generate does not get the URLs not yet fetched, no amount of indexing
now adds more to my index; I've hit some kind of wall.
Can I force Nutch to only generate URLs not yet fetched and not the ones
already fetched?
Cheers
Shane.
On 26/03/14 09:29, Shane Wood wrote:
Yes, the only error/warning I
Yes, the only error/warning I get is
mapred.FileOutputCommitter - Output path is null in cleanup
What does this mean? What would be the command line to index a single
domain, say test.com?
Why does generate give me the same fetch list every time? I thought Nutch
would only re-index the same page
I have set up Nutch, Solr and MySQL as per this how-to
http://nlp.solutions.asia/?p=362
I run Nutch using these commands.
./bin/nutch inject urls
./bin/nutch generate -topN 20
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb
./bin/nutch solrindex http://127.0.0.1:8983/solr/
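
The same sequence can be wrapped in a small script so that a failing step stops the cycle instead of letting the later steps run on stale data. This is only a Python sketch that replays exactly the commands listed above; the function names crawl_cycle and run_cycle are mine, and it defaults to a dry run that just prints each command.

```python
import subprocess
from typing import List


def crawl_cycle(nutch: str = "./bin/nutch",
                seed_dir: str = "urls",
                top_n: int = 20,
                solr_url: str = "http://127.0.0.1:8983/solr/") -> List[List[str]]:
    """Return the commands from the message, in order, as argv lists."""
    return [
        [nutch, "inject", seed_dir],
        [nutch, "generate", "-topN", str(top_n)],
        [nutch, "fetch", "-all"],
        [nutch, "parse", "-all"],
        [nutch, "updatedb"],
        [nutch, "solrindex", solr_url],
    ]


def run_cycle(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; when dry_run is False, also run it and stop on failure."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # raises CalledProcessError on a non-zero exit


if __name__ == "__main__":
    run_cycle(crawl_cycle())  # dry run: prints the six commands
```

Calling run_cycle(crawl_cycle(), dry_run=False) from the Nutch runtime directory would execute the steps for real.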