Hi

The problem occurs in HtmlParser#sniffCharacterEncoding. It reads the
'charset' parameter of the meta tag from only the first CHUNK_SIZE
bytes of the content.

On the page http://service.sony.com.cn/vaio/Announcments/33412.htm, the
meta tag sits past the first 2000 (CHUNK_SIZE) bytes, so the page encoding
is not detected. And the CHUNK_SIZE value cannot be configured.
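
To illustrate, here is a minimal Java sketch of this kind of sniffing. It is
a simplified stand-in, not the actual Nutch code: the class name
MetaCharsetSniffer, the regex and the padded test page are made up for the
example. Only the 2000-byte CHUNK_SIZE comes from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Fixed look-ahead limit, mirroring the 2000-byte CHUNK_SIZE described above.
    public static final int CHUNK_SIZE = 2000;

    // Simplified pattern (an assumption): charset=... inside a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([a-z0-9_-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a meta tag within the first limit bytes, or null. */
    public static String sniff(byte[] content, int limit) {
        int length = Math.min(content.length, limit);
        // ASCII is enough to read the tag itself, whatever the page encoding is.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        StringBuilder page = new StringBuilder("<html><head>");
        // Pad the head so that the meta tag starts after byte 2000.
        while (page.length() <= CHUNK_SIZE) {
            page.append("<!-- padding -->");
        }
        page.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
        page.append("</head><body></body></html>");
        byte[] bytes = page.toString().getBytes(StandardCharsets.US_ASCII);

        System.out.println(sniff(bytes, CHUNK_SIZE));   // prints null: tag is past the limit
        System.out.println(sniff(bytes, bytes.length)); // prints utf-8: whole page scanned
    }
}
```

With the limit fixed at 2000 bytes the sniffer finds nothing on such a page,
so the parser has to rely on other hints or on the configured default
encoding.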




On Thu, Mar 14, 2013 at 5:18 PM, feng lu <[email protected]> wrote:

> Hi David
>
> The problem is that parseHtml tries to detect the encoding of the HTML it
> is parsing. The encoding of the page
> http://service.sony.com.cn/vaio/Announcments/33412.htm cannot be detected
> by the EncodingDetector class, so it is set to the default character
> encoding. As a temporary workaround, you can set the property
> parser.character.encoding.default to utf-8.
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>utf-8</value>
>   <description>The character encoding to fall back to when no other
>   information is available</description>
> </property>
>
> I tested it on my computer and the output looks like this:
>
> gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch
> plugin parse-html org.apache.nutch.parse.html.HtmlParser
> ~/Downloads/45962.htm
> data: Version: 5
> Status: success(1,0)
> Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知
>
> .....
>
>
>
>
>
>
> On Thu, Mar 14, 2013 at 3:23 PM, David Philip <[email protected]
> > wrote:
>
>> Hi,
>>
>>   I crawled this URL
>> <http://service.sony.com.cn/vaio/Announcments/45962.htm> and
>> it has the same issue.
>>
>> The extracted title looks like this: SONY China
>>
>> Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�通çŸ
>>
>> It is supposed to look like this:
>> <title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
>>
>> Only specific URLs like the one above have this special-character
>> problem. For the rest, the extracted characters are correct. For example,
>> this URL <http://service.sony.com.cn/9380.htm> is parsed properly.
>>
>>
>> Thanks David.
>>
>>
>> On Thu, Mar 14, 2013 at 12:17 PM, David Philip
>> <[email protected]>wrote:
>>
>> > I am attaching the extracted text file. I am not sure whether you can
>> > receive and view it.
>> >
>> > My observation:
>> > When I compared the extracted text with the source of the page
>> > <http://service.sony.com.cn/vaio/Announcments/33412.htm> (using view
>> > source), almost everything looks the same except the data in the
>> > ParseText:: section of the extracted text.
>> >
>> >
>> > Thanks -David
>> >
>> >
>> >
>> > On Thu, Mar 14, 2013 at 11:59 AM, David Philip <
>> > [email protected]> wrote:
>> >
>> >> Hi Tejas,
>> >>
>> >>    I used the readseg command: bin/nutch readseg -dump
>> >> test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
>> >> -nogenerate -noparse -nofetch -noparsedata
>> >>
>> >> It generated the dump file, then I used the less/cat command:
>> >> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt
>> >> and viewed the content as a text file (gedit).
>> >>
>> >>
>> >> Below is a brief excerpt of that text file (test459.txt):
>> >>
>> >> Recno:: 0
>> >> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >>
>> >> ParseText::
>> >>  SONY China
>> >> Service-关于建议使用正宗索尼电�适�器的声明   &nbsp
>> >> 首页   新闻与公告   产�支� 个人电脑�周边产�
>> >> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
>> >> �影�产� 家庭影�产� 家庭音�产� 其他产�
>> >> æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>> >> 按照产�型��索 关键字     选择产�系列 / 型�
>> >> 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
>> >> 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
>> 家庭音�产�
>> >> 其他产� 选择产��类别 选择产�系列
>> >> /..........................
>> >> This is quite large, so I didn't paste everything.
>> >>
>> >>
>> >> Content::
>> >> Version: -1
>> >> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >> contentType: application/xhtml+xml
>> >> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
>> >> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
>> >> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
>> >> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
>> >> Content-Type=text/html Connection=close
>> >> Content:
>> >>
>> >>
>> >> Thanks - David
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <
>> [email protected]>wrote:
>> >>
>> >>> I don't think so. The tool that you are using to view this must
>> >>> support the desired languages. I had the same problem while looking
>> >>> at pages with Chinese content over PuTTY. Installing language packs
>> >>> and tweaking the PuTTY settings made it go away. I don't recall the
>> >>> exact steps or details, as I did that about a year ago.
>> >>>
>> >>>
>> >>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
>> >>> <[email protected]>wrote:
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> >   For some specific URLs, the fetched content comes out as special
>> >>> > characters. Is it a character encoding issue? Do any settings need
>> >>> > to be changed at the Nutch parsing level?
>> >>> >
>> >>> >
>> >>> > *url:*
>> >>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >>> >
>> >>> > *content extracted is something like this: *
>> >>> > *
>> >>> > *
>> >>> >  SONY China
>> >>> > Service-关于建议使用正宗索尼电�适�器的声明
>> &nbsp
>> >>> > 首页   新闻与公告   产�支� 个人电脑�周边产�
>> >>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
>> >>> > æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
>> 其他产�
>> >>> > æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>> >>> > 按照产�型��索 关键字     选择产�系列 / 型�
>> >>> > 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
>> >>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
>> >>> 家庭音�产�
>> >>> > 其他产å“..................
>> >>> >
>> >>> > *title: *
>> >>> > SONY China
>> >>> Service-关于建议使用正宗索尼电�适�器的声明
>> >>> >
>> >>> >
>> >>> > Thanks - David
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>



-- 
Don't Grow Old, Grow Up... :-)
