Hi Tejas,

I used the readseg command:

bin/nutch readseg -dump test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test -nogenerate -noparse -nofetch -noparsedata
It generated the dump file, then I used the less/cat command:

/Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt

and viewed the content as a text file (gedit). Below is a brief excerpt of that text file (test459.txt):

Recno:: 0
URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm

ParseText::
 SONY China Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明   首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“� VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“� æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢ 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�· 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘ 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“� 选择产å“�å�类别 选择产å“�系列 /..........................

This is rather large, so I didn't paste everything.

Content::
Version: -1
url: http://service.sony.com.cn/vaio/Announcments/33412.htm
base: http://service.sony.com.cn/vaio/Announcments/33412.htm
contentType: application/xhtml+xml
metadata: cache-control=max-age=14400 Age=0 Content-Length=13187 Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33 nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT Content-Type=text/html Connection=close
Content:

Thanks - David

On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <[email protected]> wrote:

> I don't think so. The tool that you are using to view this must have support
> for the desired languages. I had the same problem while looking at pages
> with Chinese content over PuTTY. Installing language packs and tweaking
> PuTTY settings made this go away. I don't recall the exact steps / details as I
> did that about a year back.
>
> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
> <[email protected]> wrote:
>
> > Hi,
> >
> > For some specific URLs, the content fetched is in the form of special
> > characters. Is it a character encoding issue? Do any settings need to be
> > changed at the Nutch parsing level?
> >
> > *url:*
> > http://service.sony.com.cn/vaio/Announcments/33412.htm
> >
> > *content extracted is something like this:*
> >
> >  SONY China
> > Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明  
> > 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘
> > æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
> > æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> > 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
> > 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“�
> > å®¶åºéŸ³å“�产å“�
> > 其他产å“..................
> >
> > *title:*
> > SONY China Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
> >
> > Thanks - David
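
For anyone hitting the same symptom: the garbled strings above look like UTF-8 bytes displayed as Windows-1252 text, which would fit Tejas's point that the viewing tool matters. Below is a minimal Python sketch of that effect. It is not Nutch code; the helper names garble and ungarble are made up for illustration, and the Windows-1252 guess is inferred from the sample, not confirmed anywhere in this thread.

```python
# Sketch only: assumes the page bytes are UTF-8 and the viewer decoded them
# as Windows-1252 (an assumption inferred from the sample, not confirmed).

def garble(text: str) -> str:
    """Encode as UTF-8, then mis-decode the bytes as Windows-1252.

    Bytes with no Windows-1252 mapping (such as 0x90) become U+FFFD.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def ungarble(text: str) -> str:
    """Reverse garble() when no bytes were lost: cp1252 bytes back to UTF-8."""
    return text.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    original = "\u5173\u4e8e"      # the two Chinese characters that open the title
    mangled = garble(original)
    print(mangled)                 # same garbling as at the start of the title
    print(ungarble(mangled))       # round-trips back to the original text
    print(garble("\u6e90"))        # lossy case: one byte has no cp1252 mapping
```

If the round trip works on a sample, the stored bytes are probably fine and only the display encoding is wrong. Strings that already contain the replacement character cannot be reversed this way, because those bytes were lost when the text was displayed.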

