Hi Tejas,

   I used the readseg command:
bin/nutch readseg -dump
test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
-nogenerate -noparse -nofetch -noparsedata

It generated the dump file, then I used the less/cat command:
/Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump > test459.txt
and viewed the content as a text file (gedit).
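
To take the viewer out of the picture, the raw bytes of the dump can be checked directly. This is a minimal Python sketch, assuming the dump is expected to be UTF-8; the helper names are illustrative and not part of Nutch:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def count_stored_replacement_chars(data: bytes) -> int:
    """Count U+FFFD characters that are already stored in the bytes.

    U+FFFD is encoded in UTF-8 as EF BF BD, so this counts characters
    that were lost before the file was written, not by the viewer.
    """
    return data.count("\ufffd".encode("utf-8"))


def inspect_dump(path: str) -> dict:
    """Read a file in binary mode and summarise both checks."""
    with open(path, "rb") as handle:
        data = handle.read()
    return {
        "valid_utf8": is_valid_utf8(data),
        "stored_replacement_chars": count_stored_replacement_chars(data),
    }
```

Calling `inspect_dump("test459.txt")` would show whether the file is well-formed UTF-8 and whether replacement characters are already present in it. If they are, changing the viewer or its fonts would not be expected to bring the original characters back.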


Below is a brief excerpt of that text file (test459.txt):

Recno:: 0
URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm

ParseText::
 SONY China
Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
首页   新闻与公告   产�支� 个人电脑�周边产�
VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
�影�产� 家庭影�产� 家庭音�产� 其他产�
æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
按照产�型��索 关键字     选择产�系列 / 型�
选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
其他产� 选择产��类别 选择产�系列
/..........................
The output is quite large, so I didn't paste everything.
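
Sequences such as `æ•°ç` and `äº§` in the excerpt above are consistent with UTF-8 bytes being decoded as Windows-1252 somewhere along the way. Here is a minimal Python sketch of that effect, assuming this is the cause (the dump itself does not confirm it); the function names are illustrative only:

```python
def misdecode_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as Windows-1252.

    Bytes that Windows-1252 leaves undefined (such as 0x81) become
    U+FFFD, so part of the original text is lost for good.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def try_repair(garbled: str) -> str:
    """Reverse the mis-decoding when no bytes were lost.

    Raises UnicodeError if the garbled text contains U+FFFD or any
    other character that cannot be mapped back to the original bytes.
    """
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    # repr() keeps the non-breaking space and U+FFFD visible.
    print(repr(misdecode_utf8_as_cp1252("数码")))
    print(try_repair("äº§"))
```

The round trip only works for characters whose UTF-8 bytes all exist in Windows-1252. Once a byte has been turned into a replacement character, the original cannot be recovered from the text alone, so the page would have to be decoded correctly at fetch or parse time.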


Content::
Version: -1
url: http://service.sony.com.cn/vaio/Announcments/33412.htm
base: http://service.sony.com.cn/vaio/Announcments/33412.htm
contentType: application/xhtml+xml
metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
Content-Type=text/html Connection=close
Content:


Thanks - David






On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <[email protected]>wrote:

> I don't think so. The tool that you are using to view this must support
> the desired languages. I had the same problem while looking at pages
> with Chinese content over PuTTY. Installing language packs and tweaking
> PuTTY settings made this go away. I don't recall the exact steps / details,
> as I did that about a year back.
>
>
> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
> <[email protected]>wrote:
>
> > Hi,
> >
> >   For some specific URLs, the content fetched comes out as special
> > characters. Is it a character encoding issue? Do any settings need to be
> > changed at the Nutch parsing level?
> >
> >
> > *url:*
> > http://service.sony.com.cn/vaio/Announcments/33412.htm
> >
> > *content extracted is something like this:*
> >  SONY China
> > Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
> > 首页   新闻与公告   产�支� 个人电脑�周边产�
> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
> > æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“� 其他产å“�
> > æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> > 按照产�型��索 关键字     选择产�系列 / 型�
> > 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
> 家庭音�产�
> > 其他产å“..................
> >
> > *title: *
> > SONY China Service-关于建议使用正宗索尼电�适�器的声明
> >
> >
> > Thanks - David
> >
>
