Hi David

The problem is that parse-html tries to detect the character encoding of
the HTML it is parsing. For the page
http://service.sony.com.cn/vaio/Announcments/33412.htm the
EncodingDetector class cannot detect the encoding, so the parser falls
back to the default character encoding. As a temporary fix, you can set
the property parser.character.encoding.default to utf-8.

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

I tested it on my computer, and the output looks like this:

gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch plugin
parse-html org.apache.nutch.parse.html.HtmlParser ~/Downloads/45962.htm
data: Version: 5
Status: success(1,0)
Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知

.....
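
To make the failure mode concrete, here is a minimal sketch in Python. It
is an illustration only and not Nutch code. It assumes the page is served
as UTF-8 and that the stock fallback encoding is windows-1252 (that is the
default I recall from nutch-default.xml; please check your own copy). The
helper name decode_with is made up for this example.

```python
# Illustration only, not Nutch code: decode the same UTF-8 bytes with two
# different fallback charsets and compare the results.

def decode_with(raw: bytes, charset: str) -> str:
    """Decode raw bytes with charset, replacing undecodable bytes with U+FFFD."""
    return raw.decode(charset, errors="replace")


title = "SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知"
raw = title.encode("utf-8")  # the bytes a UTF-8 page would deliver

# Assumed stock fallback (windows-1252): every 3-byte Chinese character is
# read as three single-byte characters, some of them undefined in that
# charset and therefore shown as replacement marks.
garbled = decode_with(raw, "windows-1252")

# Fallback set to utf-8: the title survives intact.
fixed = decode_with(raw, "utf-8")

print(garbled)
print(fixed)
```

The ASCII part of the title ("SONY China Service-") comes out the same in
both cases, which matches what you saw: only the Chinese text is damaged.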






On Thu, Mar 14, 2013 at 3:23 PM, David Philip
<[email protected]>wrote:

> Hi,
>
>   I crawled this
> url <http://service.sony.com.cn/vaio/Announcments/45962.htm> and
> it has the same issue.
>
> The extracted title looks like this: SONY China
>
> Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�通çŸ
>
> It was supposed to look like this:
> <title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
>
> Specific URLs like the one above have this special-character problem.
> For the rest, the extracted characters are correct, e.g. this
> url <http://service.sony.com.cn/9380.htm> is parsed properly.
>
>
> Thanks David.
>
>
> On Thu, Mar 14, 2013 at 12:17 PM, David Philip
> <[email protected]>wrote:
>
> > I am attaching the extracted text file. Not sure if you can receive and
> > view it.
> >
> > My observation:
> > When I compared the extracted text with the source of the page
> > <http://service.sony.com.cn/vaio/Announcments/33412.htm> (using view
> > source), almost everything looks the same except the data in the
> > ParseText:: section of the extracted text.
> >
> >
> > Thanks -David
> >
> >
> >
> > On Thu, Mar 14, 2013 at 11:59 AM, David Philip <
> > [email protected]> wrote:
> >
> >> Hi Tejas,
> >>
> >>    I used the readseg command: bin/nutch readseg -dump
> >> test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
> >> -nogenerate -noparse -nofetch -noparsedata
> >>
> >> It generated the dump file, then I used the less/cat command:
> >> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt
> >> and viewed the content as a text file (gedit).
> >>
> >>
> >> Below is brief of that text file(test459.txt):
> >>
> >> Recno:: 0
> >> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >>
> >> ParseText::
> >>  SONY China
> >> Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
> >> 首页   新闻与公告   产�支� 个人电脑�周边产�
> >> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
> >> �影�产� 家庭影�产� 家庭音�产� 其他产�
> >> æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> 按照产�型��索 关键字     选择产�系列 / 型�
> >> 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
> >> 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
> 家庭音�产�
> >> 其他产� 选择产��类别 选择产�系列
> >> /..........................
> >> This is quite long, so I didn't paste everything.
> >>
> >>
> >> Content::
> >> Version: -1
> >> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> contentType: application/xhtml+xml
> >> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
> >> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
> >> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
> >> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
> >> Content-Type=text/html Connection=close
> >> Content:
> >>
> >>
> >> Thanks - David
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <[email protected]
> >wrote:
> >>
> >>> I don't think so. The tool that you are using to view this must have
> >>> support for the desired languages. I had the same problem while
> >>> looking at pages with Chinese content over PuTTY. Installing language
> >>> packs and tweaking PuTTY settings made this go away. I don't recall
> >>> the exact steps / details, as I did that about a year back.
> >>>
> >>>
> >>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
> >>> <[email protected]>wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> >   For some specific URLs, the fetched content comes out as special
> >>> > characters. Is it a character encoding issue? Do any settings need
> >>> > to be changed at the Nutch parsing level?
> >>> >
> >>> >
> >>> > *url:*
> >>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
> >>> >
> >>> > *content extracted is something like this: *
> >>> > *
> >>> > *
> >>> >  SONY China
> >>> > Service-关于建议使用正宗索尼电�适�器的声明
> &nbsp
> >>> > 首页   新闻与公告   产�支� 个人电脑�周边产�
> >>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
> >>> > æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“� 其他产å“�
> >>> > æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >>> > 按照产�型��索 关键字     选择产�系列 / 型�
> >>> > 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
> >>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
> >>> 家庭音�产�
> >>> > 其他产å“..................
> >>> >
> >>> > *title: *
> >>> > SONY China
> >>> Service-关于建议使用正宗索尼电�适�器的声明
> >>> >
> >>> >
> >>> > Thanks - David
> >>> >
> >>>
> >>
> >>
> >
>



-- 
Don't Grow Old, Grow Up... :-)
