I am using Nutch 2.x with Cassandra as the storage backend. Currently I am
crawling only one website, and the data is being loaded into Cassandra in
byte format. When I use the readdb command in Nutch, I do not get any useful
crawl data.

Below are the relevant files and the output I am getting:

*========== command to run crawler =====================*

bin/crawl urls/ crawlDir/ solr_link 3

*======================== seed.txt data ================*

http://www.ft.com

*===== Output of readdb command to read data from Cassandra webpage.f table ======*

bin/nutch readdb -dump data -content

http://www.ft.com/ key: com.ft.www:http/

baseUrl: null

status: 4 (status_redir_temp)

fetchTime: 1426888912463

prevFetchTime: 1424296904936

fetchInterval: 2592000

retriesSinceFetch: 0

modifiedTime: 0

prevModifiedTime: 0

protocolStatus: (null)

parseStatus: (null)

title: null

score: 1.0

marker *injmrk* : y

marker dist : 0

reprUrl: null

batchId: 1424296906-20007

metadata *csh* :
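
For reference, the timestamps in the dump above can be decoded with a short sketch like the one below. It assumes the fetchTime and prevFetchTime values are milliseconds since the Unix epoch and that fetchInterval is in seconds; the helper name `from_millis` is only illustrative. If those assumptions hold, fetchTime looks like prevFetchTime plus one fetchInterval, i.e. the next scheduled fetch rather than the time of a completed fetch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_millis(ms):
    """Convert milliseconds since the Unix epoch to a UTC datetime.

    Assumption: the readdb values are epoch milliseconds.
    """
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the readdb dump above.
fetch_time = from_millis(1426888912463)
prev_fetch_time = from_millis(1424296904936)
# Assumption: fetchInterval is expressed in seconds.
fetch_interval = timedelta(seconds=2592000)

print("prevFetchTime:", prev_fetch_time.isoformat())
print("fetchTime:    ", fetch_time.isoformat())
print("difference:   ", fetch_time - prev_fetch_time)
print("fetchInterval:", fetch_interval)
```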

*===============content of regex-urlfilter.txt ==============*

*# skip file: ftp: and mailto: urls*

-^(file|ftp|mailto):

*# skip image and other suffixes we can't yet parse*

*# for a more extensive coverage use the urlfilter-suffix plugin*

-.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

*# skip URLs containing certain characters as probable queries, etc.*

-[?*!@=]

*# skip URLs with slash-delimited segment that repeats 3+ times, to break loops*

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

*# accept anything else*

+.
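
To rule out the filter as the cause, the rules above can be checked against the seed URL with a rough sketch like this one. It assumes the filter tries the rules in file order, lets the first matching rule decide, and rejects a URL that matches no rule. It also assumes Python's `re` module treats these particular patterns the same way as the Java regex engine Nutch uses. The function name `accepts` and the extra test URLs are hypothetical.

```python
import re

# Rules copied as posted from regex-urlfilter.txt, in file order.
# "-" means reject on match, "+" means accept on match.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    # As posted, the leading "." is unescaped, so it matches any character.
    ("-", r".(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"."),
]


def accepts(url):
    """Return True if the first rule matching the URL is an accept rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False


# The seed URL from seed.txt, plus two made-up URLs for comparison.
for url in ("http://www.ft.com",
            "http://www.ft.com/logo.png",
            "http://www.ft.com/search?q=nutch"):
    print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions the seed URL passes the filter, so the filter rules alone would not explain the empty readdb output.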

*========= log file entries that are bothering me ===========*

2015-02-18 13:57:51,253 ERROR store.CassandraStore -

2015-02-18 13:57:51,253 ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@653e3e90

2015-02-18 14:01:45,537 INFO connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s

*=======================================================*

Please let me know if you need more information.

Can someone please help me?

Thanks in advance.

-Sumant
