Hi,

CC'd user@nutch

Which version of Nutch are you using?
Your command line usage seems to be outdated. Can you please confirm?
Thank you
Lewis

On Wed, Sep 23, 2015 at 2:55 AM, Vu Quang Tin <[email protected]> wrote:

> Hi Lewis John McGibbney,
> I'm from Vietnam.
> My English is not very good.
> I have a problem crawling the web with Nutch.
> When I run:
>
> ./bin/nutch org.apache.nutch.parse.ParserChecker "http://dantri.com.vn/" > dantri2.txt
> result:
> fetching: http://dantri.com.vn/
> parsing: http://dantri.com.vn/
> contentType: application/xhtml+xml
> signature: 5ddaf9394c8b4bd3ce275253e22e7c7e
> ---------
> Url
> ---------------
>
> http://dantri.com.vn/
> ---------
> Metadata
> ---------
> ...
> Outlinks
> ---------
>
>   outlink: toUrl: http://dantri3.vcmedia.vn/App_Themes/Default/Images/favico.ico anchor:
>   outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_224.ads anchor:
>   outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_256.ads anchor:
>   outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_226.ads anchor:
>   outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_227.ads anchor:
>   outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_1087.ads anchor:
>   outlink: toUrl: http://admicro1.vcmedia.vn/ads_codes/ads_box_228.ads anchor:
> ....
>
> ---------
> Headers
> ---------
>
> Date :     Wed, 23 Sep 2015 03:15:27 GMT
> Content-Length :     33298
> Content-Encoding :     gzip
> ServerName :     118
> Connection :     close
> Content-Type :     text/html; charset=utf-8
> Server :     Microsoft-IIS/7.5
> X-Powered-By :     ASP.NET
> Cache-Control :     private
>
> A large number of outlinks (more than 300 links), which is OK.
>
> But when I run:
> ./bin/crawl ./dantri/urls_common urls_dantri14 http://localhost:8983/solr/ 10 > crawlCommonMotTheGioilogThread.log
>
> result
> http://dantri.com.vn/    key:    vn.com.dantri:http/
> baseUrl:    null
> status:    2 (status_fetched)
> fetchTime:    1445590410793
> prevFetchTime:    1442998395854
> fetchInterval:    2592000
> retriesSinceFetch:    0
> modifiedTime:    0
> prevModifiedTime:    0
> protocolStatus:    SUCCESS, args=[]
> parseStatus:    success/redirect (1/100), args=[http://dantri.com.vn/,1800]
> title:    null
> score:    1.0
> marker _injmrk_ :     y
> marker dist :     0
> reprUrl:    http://dantri.com.vn/
> batchId:    1442998403-21918
> ...
> .....
> metadata meta_generator :     VCCorp.vn
> metadata meta_content-type :     text/html; charset=UTF-8
> metadata meta_resource-type :     Document
> metadata OriginalCharEncoding :     utf-8
> metadata meta_copyright :     Công ty Cổ phần VCCorp
> metadata _rs_ :     \00\00\00#
> outlink:    http://dantri.com.vn/
> inlink:    http://dantri.com.vn/
>
> ERROR: there is only 1 outlink and 1 inlink,
> and the log shows: "parseStatus:    success/redirect (1/100), args=[http://dantri.com.vn/,1800]"
>
> (whereas the correct result would be: parseStatus:    success/ok (1/0), args=[])
>
> I have not been able to configure Nutch correctly for the last 2 weeks.
> Can you help me?
> Thank you very much!
>
>
>
>


-- 
*Lewis*
