I am using Nutch 2.3, Solr 4.10.3, and HBase 0.94.26. The command line I used is the one for Nutch 2.3. Thank you for getting back to me.
On Thu, Sep 24, 2015 at 7:29 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> CC'd user@nutch
>
> Which version of Nutch are you using?
> Your command line usage seems to be outdated. Can you please confirm?
> Thank you
> Lewis
>
> On Wed, Sep 23, 2015 at 2:55 AM, Vu Quang Tin <[email protected]>
> wrote:
>
>> Hi Lewis John McGibbney.
>> I'm a vietnam.
>> I'm not very good english.
>> I have problems with the crawler web by Nutch.
>> when i using :
>>
>> ./bin/nutch org.apache.nutch.parse.ParserChecker "http://dantri.com.vn/
>> ">dantri2.txt
>> result:
>> fetching: http://dantri.com.vn/
>> parsing: http://dantri.com.vn/
>> contentType: application/xhtml+xml
>> signature: 5ddaf9394c8b4bd3ce275253e22e7c7e
>> ---------
>> Url
>> ---------------
>>
>> http://dantri.com.vn/
>> ---------
>> Metadata
>> ---------
>> ...
>> Outlinks
>> ---------
>>
>> outlink: toUrl:
>> http://dantri3.vcmedia.vn/App_Themes/Default/Images/favico.ico anchor:
>> outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_224.ads
>> anchor:
>> outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_256.ads
>> anchor:
>> outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_226.ads
>> anchor:
>> outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_227.ads
>> anchor:
>> outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_1087.ads
>> anchor:
>> outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_228.ads
>> anchor:
>> ....
>>
>> ---------
>> Headers
>> ---------
>>
>> Date : Wed, 23 Sep 2015 03:15:27 GMT
>> Content-Length : 33298
>> Content-Encoding : gzip
>> ServerName : 118
>> Connection : close
>> Content-Type : text/html; charset=utf-8
>> Server : Microsoft-IIS/7.5
>> X-Powered-By : ASP.NET
>> Cache-Control : private
>>
>> great number of outlink( >300 link) -->OK
>>
>> but When i using:
>> ./bin/crawl ./dantri/urls_common urls_dantri14
>> http://localhost:8983/solr/ 10 >crawlCommonMotTheGioilogThread.log
>>
>> result
>> http://dantri.com.vn/ key: vn.com.dantri:http/
>> baseUrl: null
>> status: 2 (status_fetched)
>> fetchTime: 1445590410793
>> prevFetchTime: 1442998395854
>> fetchInterval: 2592000
>> retriesSinceFetch: 0
>> modifiedTime: 0
>> prevModifiedTime: 0
>> protocolStatus: SUCCESS, args=[]
>> parseStatus: success/redirect (1/100), args=[
>> http://dantri.com.vn/,1800]
>> title: null
>> score: 1.0
>> marker _injmrk_ : y
>> marker dist : 0
>> reprUrl: http://dantri.com.vn/
>> batchId: 1442998403-21918
>> ...
>> .....
>> metadata meta_generator : VCCorp.vn
>> metadata meta_content-type : text/html; charset=UTF-8
>> metadata meta_resource-type : Document
>> metadata OriginalCharEncoding : utf-8
>> metadata meta_copyright : Công ty Cổ phần VCCorp
>> metadata _rs_ : \00\00\00#
>> outlink: http://dantri.com.vn/
>> inlink: http://dantri.com.vn/
>>
>> ERROR: only 1 outlink and 1 inlink
>> and in log: "parseStatus: success/redirect (1/100), args=[
>> http://dantri.com.vn/,1800]"
>>
>>
>> (while the right result : parseStatus: success/ok (1/0), args=[])
>>
>> I can not configure Nutch in last 2 weeks.
>> Can You help me?
>> thanks verry verry much!
>>
>
>
> --
> *Lewis*

