I am using Nutch 2.x with Cassandra as the storage backend. Currently I am crawling only one website, and the data is being loaded into Cassandra in byte code format. When I use the readdb command in Nutch, I do not get any useful crawl data.
Below are the details of the different files and the output I am getting:

*========== command to run crawler =====================*
bin/crawl urls/ crawlDir/ solr_link 3

*======================== seed.txt data ================*
http://www.ft.com

*=====Output of readdb command to read data from cassandra webpage.f table======*
bin/nutch readdb -dump data -content
http://www.ft.com/ key: com.ft.www:http/
baseUrl: null
status: 4 (status_redir_temp)
fetchTime: 1426888912463
prevFetchTime: 1424296904936
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
batchId: 1424296906-20007
metadata _csh_ :

*===============content of regex-urlfilter.txt ==============*
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.

*=========content of log file which are bothering me ===========*
2015-02-18 13:57:51,253 ERROR store.CassandraStore -
2015-02-18 13:57:51,253 ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@653e3e90
2015-02-18 14:01:45,537 INFO connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s

*=======================================================*

Please let me know if you need more information. Can someone please help me?

Thanks in advance.
-Sumant
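
P.S. To make the readdb dump above easier to read, here is a minimal Python sketch that decodes the row key and the timestamps. It is not part of Nutch. The helper names `unreverse_key` and `ms_to_utc` are made up for illustration, and it assumes that the key is a reversed-host form of the URL with no port, that fetchTime and prevFetchTime are epoch milliseconds, and that fetchInterval is in seconds.

```python
from datetime import datetime, timezone


def unreverse_key(key: str) -> str:
    """Rebuild a URL from a reversed-host key such as 'com.ft.www:http/'.

    Assumption: the key has the shape '<reversed host>:<scheme>/<path>'
    and carries no port number.
    """
    host_part, _, rest = key.partition(":")
    scheme, slash, path = rest.partition("/")
    host = ".".join(reversed(host_part.split(".")))
    return f"{scheme}://{host}{slash}{path}"


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds (assumed unit) to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Values copied from the dump above.
    key = "com.ft.www:http/"
    fetch_time = 1426888912463
    prev_fetch_time = 1424296904936
    fetch_interval = 2592000  # assumed to be seconds

    print("url:          ", unreverse_key(key))
    print("prevFetchTime:", ms_to_utc(prev_fetch_time).isoformat())
    print("fetchTime:    ", ms_to_utc(fetch_time).isoformat())
    print("gap (s):      ", (fetch_time - prev_fetch_time) / 1000)
    print("interval (d): ", fetch_interval / 86400)
```

Under those assumptions, the gap between prevFetchTime and fetchTime is close to fetchInterval (30 days), so fetchTime looks like the next scheduled fetch rather than the time of the last one.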

