question about robots.txt

2014-12-13 Thread Shane Wood
I am asking a few websites to allow me to index their sites. What should they add to their robots.txt, and where do I get the exact name of my crawler? Cheers, Shane
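A minimal sketch of a stanza a site owner could add, assuming the crawler name is whatever is set for http.agent.name in conf/nutch-site.xml. The name MyNutchCrawler and the helper robots_stanza are placeholders for illustration, not anything Nutch ships. An empty Disallow line permits the named agent to fetch everything.

```shell
# Print a robots.txt group that lets one named crawler fetch the whole site.
# "$1" is the crawler name; it is assumed to match http.agent.name.
robots_stanza() {
  printf 'User-agent: %s\nDisallow:\n' "$1"
}

# MyNutchCrawler is a hypothetical agent name used only for this example.
robots_stanza "MyNutchCrawler"
```

The site owner would paste the two printed lines into the robots.txt at the root of their site.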

Index web folders.

2014-04-08 Thread Shane Wood
I have a few sites I mirror via FTP and would like to index them with Nutch without hitting the sites, to save bandwidth. Say one is located in /var/www/www.somesite.com/ on the same server Nutch is located on: how would you index it and have the search results point to the actual

One site only index.

2014-04-02 Thread Shane Wood
I have indexed several sites successfully. Now I wish to index a new site and not update any other sites already indexed. I use Nutch 2.2.1, MySQL 5.3 and Solr 4.7.0. How would you recommend I go about indexing a new site only? If someone can give examples of command lines that would be

Re: One site only index.

2014-04-02 Thread Shane Wood
Can you choose a custom regex-urlfilter.txt to save editing it each time you wish to index a different site? I am surprised you can't enter a URL when generating a fetch list, i.e. /bin/nutch generate --only someurl.com --job 192833-292837 Then you fetch job 192833-292837 parse job
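One way to avoid hand-editing the filter for each site might be to generate a per-site filter file. The sketch below prints two rules in the regex-urlfilter.txt format (a leading + accepts, a leading - rejects, and the first matching rule wins). The function name make_filter is made up for this example, and pointing Nutch at the generated file (for instance through the urlfilter.regex.file property) is an assumption to check against the nutch-default.xml of your version.

```shell
# Print regex-urlfilter rules that accept one domain (and its subdomains)
# and reject every other URL. "$1" is the bare domain, e.g. someurl.com.
make_filter() {
  # Escape dots so they match literally inside the regular expression.
  domain_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  printf '%s\n' "+^https?://([a-z0-9-]+\\.)*${domain_re}/" "-."
}

# Example: rules for the domain mentioned in the message above.
make_filter someurl.com
```

Redirecting the output to a separate file per site keeps the stock regex-urlfilter.txt untouched.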

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

2014-03-30 Thread Shane Wood
Thanks for your reply. I made no changes except that I edited the db.fetch.schedule.class and replaced the default value with org.apache.nutch.crawl.AdaptiveFetchSchedule Now the modifiedTime

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

2014-03-30 Thread Shane Wood
Sorry about that, I copied and pasted the name of the field and ended up sending the link. Oops. Shane. On 31/03/14 13:57, Shane Wood wrote: Thanks for your reply. I made no changes except that I edited the db.fetch.schedule.class and replaced the default value

Re: MYSQL field meanings

2014-03-27 Thread Shane Wood
I'm using Nutch 2.2 as per this install tutorial. Would this patch already have been added to the newer version? http://nlp.solutions.asia/?p=362 Enjoy, Shane. On 27/03/14 18:54, Talat Uyarer wrote: Hi Shane, Which version of Nutch do you use? If you use Nutch 2.2.1, this is a bug. You should

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

2014-03-27 Thread Shane Wood
I set up Nutch as per this: http://nlp.solutions.asia/?p=362. I wiped the data within MySQL and re-indexed several times, and these fields remain NULL: modifiedTime, prevModifiedTime. MySQL version 5.6.16, Nutch version 2.2. ./bin/nutch inject urls ./bin/nutch generate -topN 20 ./bin/nutch fetch

Re: user Digest 27 Mar 2014 08:54:48 -0000 Issue 2182

2014-03-27 Thread Shane Wood
How do you use the readdb command when using MySQL and no crawldb is created? Can you list the command to use? Or does Nutch still create a crawldb that I can't find, and if so, where is it created? I have a /crawl folder but nothing appears in there. I use Nutch 2.2 and MySQL version 5.6.16 as per

Re: crawl data

2014-03-26 Thread Shane Wood
As generate does not get the URLs not yet fetched, no amount of indexing now adds more to my index; I've hit some kind of wall. Can I force Nutch to generate only URLs not yet fetched, and not the ones already fetched? Cheers, Shane. On 26/03/14 09:29, Shane Wood wrote: Yes, the only error/warning I

Re: crawl data

2014-03-25 Thread Shane Wood
Yes, the only error/warning I get is mapred.FileOutputCommitter - Output path is null in cleanup. What does this mean? What would be the command line to index a single domain, say test.com? Why does generate give me the same fetch list every time? I thought Nutch would only re-index the same page

crawl data

2014-03-24 Thread Shane Wood
I have set up Nutch, Solr and MySQL as per this how-to: http://nlp.solutions.asia/?p=362 I run Nutch using these commands: ./bin/nutch inject urls ./bin/nutch generate -topN 20 ./bin/nutch fetch -all ./bin/nutch parse -all ./bin/nutch updatedb ./bin/nutch solrindex http://127.0.0.1:8983/solr/
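The command sequence quoted in the message above can be wrapped in a small script so that one cycle runs in order and stops at the first failing step. This is only a sketch: the commands and the Solr URL are copied as they appear in the message, while the names NUTCH, SOLR_URL, DRY_RUN, run_step and crawl_cycle are invented for this example. It defaults to a dry run that just prints each command; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# One crawl cycle, using the commands quoted in the message above.
NUTCH="${NUTCH:-./bin/nutch}"                        # path to the nutch launcher
SOLR_URL="${SOLR_URL:-http://127.0.0.1:8983/solr/}"  # Solr URL as quoted
DRY_RUN="${DRY_RUN:-1}"                              # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise run it and report failure.
run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$NUTCH $*"
  else
    "$NUTCH" "$@" || return 1
  fi
}

# Run the six steps in order; && stops the chain at the first failure.
crawl_cycle() {
  run_step inject urls &&
  run_step generate -topN 20 &&
  run_step fetch -all &&
  run_step parse -all &&
  run_step updatedb &&
  run_step solrindex "$SOLR_URL"
}

crawl_cycle
```

Running the dry run first makes it easy to confirm the exact commands before anything touches the database or the Solr index.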