Dear all,
I am new to Nutch; I started studying Nutch 1.4 yesterday. I can crawl text with
Nutch and search it with Solr 3.5, but I also want to crawl the images that
appear in the page content. I changed the Nutch configuration as follows:
1) Open 'regex-urlfilter.txt'
change:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
to:
-\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$
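A quick way to sanity-check the edited rule is to test a few URLs against both patterns. The sketch below is only an approximation: Nutch evaluates these rules with Java regular expressions, and Python's re module is assumed to behave the same way for a pattern this simple. The helper name is_rejected and the example.com URLs are made up for illustration.

```python
import re

# The two exclusion rules from regex-urlfilter.txt, without the leading "-"
# (the "-" only marks the rule as "reject" and is not part of the regex).
ORIGINAL_RULE = (
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)
MODIFIED_RULE = (
    r"\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG"
    r"|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$"
)


def is_rejected(rule, url):
    """Return True if the exclusion rule matches somewhere in the URL."""
    return re.search(rule, url) is not None


# Hypothetical URLs, used only to exercise the two rules.
SAMPLE_URLS = [
    "http://example.com/index.html",
    "http://example.com/images/photo.jpg",
    "http://example.com/images/logo.PNG",
    "http://example.com/style.css",
]

for url in SAMPLE_URLS:
    print(url,
          "| original rule rejects:", is_rejected(ORIGINAL_RULE, url),
          "| modified rule rejects:", is_rejected(MODIFIED_RULE, url))
```

This only checks the single exclusion line. Any other rules in regex-urlfilter.txt still apply to the same URLs.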
2) Open 'nutch-site.xml' and add properties that raise the content size limits:
<property>
  <name>file.content.limit</name>
  <value>2097152</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>2097152</value>
</property>
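For reference, the value 2097152 is 2 * 1024 * 1024, i.e. a limit of 2 MiB per fetched document. The sketch below parses the two properties and prints the limits in MiB. It assumes the properties sit inside a <configuration> root element, which the snippet above does not show; the helper name read_limits is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two properties from the message, wrapped in a <configuration> root
# element (an assumption: the root element is not shown in the message).
FRAGMENT = """
<configuration>
<property>
<name>file.content.limit</name>
<value>2097152</value>
</property>
<property>
<name>http.content.limit</name>
<value>2097152</value>
</property>
</configuration>
"""


def read_limits(xml_text):
    """Return {property name: limit in bytes} for every <property> element."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): int(prop.findtext("value"))
        for prop in root.iter("property")
    }


limits = read_limits(FRAGMENT)
for name, size in limits.items():
    # Convert bytes to MiB to make the configured limit easier to read.
    print(f"{name}: {size} bytes = {size / (1024 * 1024)} MiB")
```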
Then I started the crawl with this command in Cygwin:
$ bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 2 -topN 30
However, I cannot find the image paths in the Solr search field 'content'. Is
something wrong with my configuration?
Thank you very much.
Yours.