Hi,

  I also crawled this URL,
<http://service.sony.com.cn/vaio/Announcments/45962.htm>, and
it shows the same issue.

The extracted title looks like this: SONY China
Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�通çŸ

It was supposed to look like this:
<title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
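
To illustrate the difference between the two titles, here is a minimal
Python sketch. It rests on an assumption I have not confirmed: that the page
bytes are UTF-8 and that some step decodes them as windows-1252 (cp1252).
The function name is only for illustration and is not part of Nutch.

```python
# Assumption (not confirmed for this crawl): the page bytes are UTF-8, but
# some step decodes them as windows-1252 (cp1252). Bytes that cp1252 does
# not define are replaced by U+FFFD, the replacement character.

def decode_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    expected = "产品"  # two characters taken from the expected title
    print(decode_as_cp1252(expected))
    # ASCII is identical in both encodings, so it passes through unchanged.
    print(decode_as_cp1252("SONY China"))
```

Under that assumption the ASCII part of the title survives and the Chinese
part turns into the same kind of accented-letter sequences seen in the
extracted title. Once a byte has been replaced by U+FFFD the original
character cannot be recovered from the garbled text alone.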

Only specific URLs like the one above have this special-character problem.
For the rest, the extracted characters are correct. For example, this URL,
<http://service.sony.com.cn/9380.htm>, is parsed properly.
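
The dump in my earlier message below shows Content-Type=text/html with no
charset, so whatever parses the page has to get the encoding from somewhere
else. One way to compare a good page with a bad one is to check what charset
each page declares. The following is only a rough sketch: declared_charset is
a hypothetical helper, not a Nutch API, and it only looks at the header value
and a <meta> tag near the top of the page.

```python
import re
from typing import Optional


def declared_charset(content_type: str, html: bytes) -> Optional[str]:
    """Return the charset named in the header or a <meta> tag, else None."""
    # 1. Look for "charset=..." in the HTTP Content-Type header value.
    m = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type, re.I)
    if m:
        return m.group(1).lower()
    # 2. Fall back to a <meta> tag; only scan the first 4096 bytes,
    #    where the <head> section normally sits.
    m = re.search(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", html[:4096], re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Nothing declared: the parser would have to guess or use a default.
    return None


if __name__ == "__main__":
    # Header as shown in the dump: no charset parameter at all.
    print(declared_charset("text/html", b"<html><head></head></html>"))
    print(declared_charset("text/html", b'<meta charset="utf-8">'))
```

If the good page declares a charset and the bad one does not (or declares a
wrong one), that would point to encoding detection rather than to the viewer.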


Thanks, David


On Thu, Mar 14, 2013 at 12:17 PM, David Philip
<[email protected]>wrote:

> I am attaching the extracted text file. I am not sure whether you can
> receive and view it.
>
> My observation:
> When I compared the extracted text with the source of the page
> <http://service.sony.com.cn/vaio/Announcments/33412.htm>
> (using view source), almost everything looks the same except the data
> in the ParseText:: section of the extracted text.
>
>
> Thanks -David
>
>
>
> On Thu, Mar 14, 2013 at 11:59 AM, David Philip <
> [email protected]> wrote:
>
>> Hi Tejas,
>>
>>    I used the readseg command: bin/nutch readseg -dump
>> test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
>> -nogenerate -noparse -nofetch -noparsedata
>>
>> It generated the dump file, then I used the less/cat command:
>> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt
>> and viewed the content as a text file (gedit).
>>
>>
>> Below is a brief excerpt of that text file (test459.txt):
>>
>> Recno:: 0
>> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
>>
>> ParseText::
>>  SONY China
>> Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
>> 首页   新闻与公告   产�支� 个人电脑�周边产�
>> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
>> �影�产� 家庭影�产� 家庭音�产� 其他产�
>> æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>> 按照产�型��索 关键字     选择产�系列 / 型�
>> 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
>> 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
>> 其他产� 选择产��类别 选择产�系列
>> /..........................
>> The file is quite large, so I didn't paste everything.
>>
>>
>> Content::
>> Version: -1
>> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> contentType: application/xhtml+xml
>> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
>> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
>> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
>> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
>> Content-Type=text/html Connection=close
>> Content:
>>
>>
>> Thanks - David
>>
>>
>>
>>
>>
>>
>> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil 
>> <[email protected]>wrote:
>>
>>> I don't think so. The tool that you are using to view this must have
>>> support for the desired languages. I had the same problem while looking
>>> at pages with Chinese content over PuTTY. Installing language packs and
>>> tweaking the PuTTY settings made it go away. I don't recall the exact
>>> steps / details, as I did that about a year back.
>>>
>>>
>>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
>>> <[email protected]>wrote:
>>>
>>> > Hi,
>>> >
>>> >   For some specific URLs, the fetched content comes out as special
>>> > characters. Is it a character encoding issue? Do any settings need to
>>> > be changed at the Nutch parsing level?
>>> >
>>> >
>>> > *url:*
>>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
>>> >
>>> > *content extracted is something like this:*
>>> >  SONY China
>>> > Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
>>> > 首页   新闻与公告   产�支� 个人电脑�周边产�
>>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
>>> > æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“� 其他产å“�
>>> > æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>>> > 按照产�型��索 关键字     选择产�系列 / 型�
>>> > 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
>>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
>>> 家庭音�产�
>>> > 其他产å“..................
>>> >
>>> > *title: *
>>> > SONY China
>>> Service-关于建议使用正宗索尼电�适�器的声明
>>> >
>>> >
>>> > Thanks - David
>>> >
>>>
>>
>>
>
