Hi David,

The problem is that parseHtml tries to detect the character encoding of the HTML it is parsing. For the page http://service.sony.com.cn/vaio/Announcments/33412.htm the EncodingDetector class cannot detect the encoding, so the parser falls back to the default character encoding. As a temporary fix, you can set the property parser.character.encoding.default to utf-8:
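To illustrate why a wrong fallback produces exactly this kind of garbling, here is a minimal Python sketch. It assumes the page is really served as UTF-8 and that the fallback in effect is a single-byte Western encoding such as windows-1252 (I believe that is the value shipped in nutch-default.xml, but treat that as an assumption). The two characters are taken from the title quoted in this thread; the variable names are only for illustration.

```python
# Illustration only: reproduce the garbling by decoding UTF-8 bytes
# with the wrong charset. This is not Nutch code.
title = "产品"  # two characters from the page title quoted in this thread

# What the server actually sends, assuming the page is UTF-8.
utf8_bytes = title.encode("utf-8")

# Assumption: the fallback charset is windows-1252 ("cp1252" in Python).
# Byte 0x81 has no mapping in cp1252, so errors="replace" turns it
# into U+FFFD, the replacement character.
garbled = utf8_bytes.decode("cp1252", errors="replace")

# Decoding with the correct charset gives back the original text.
correct = utf8_bytes.decode("utf-8")

print(garbled)
print(correct)
```

The garbled string has the same shape as the "产å“" sequences in the dump quoted below, which is why changing the fallback to utf-8 is worth trying. The property in question: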
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>

I tested it on my computer and the output looks like this:

gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser ~/Downloads/45962.htm
data: Version: 5
Status: success(1,0)
Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知
.....

On Thu, Mar 14, 2013 at 3:23 PM, David Philip <[email protected]> wrote:
> Hi,
>
> I did crawl through this url <http://service.sony.com.cn/vaio/Announcments/45962.htm> and
> its same issue.
>
> Title extracted is in this format: SONY China
> Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�é€šçŸ
>
> It was supposed to be like this :
> <title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
>
> For specific urls like above it has this special characters problem. For
> rest, characters extracted are proper. ex: this
> url <http://service.sony.com.cn/9380.htm> it is proper parse.
>
> Thanks David.
>
> On Thu, Mar 14, 2013 at 12:17 PM, David Philip <[email protected]> wrote:
>
> > I am attaching the extracted text file. not sure if you can receive and
> > view it.
> >
> > My observation:
> > When I compared the extracted text with url <http://service.sony.com.cn/vaio/Announcments/33412.htm> page
> > (by doing view source). all most everything looks same other than data that
> > is in ParseText:: section of the extracted text.
> >
> > Thanks -David
> >
> > On Thu, Mar 14, 2013 at 11:59 AM, David Philip <[email protected]> wrote:
> >
> >> Hi Tejas,
> >>
> >> I used the redseg command:bin/nutch readseg -dump
> >> test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
> >> -nogenerate -noparse -nofetch -noparsedata
> >>
> >> It generated the dump file,then I used less/cat command:
> >> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt -
> >> viewed the content as text file(gedit).
> >>
> >> Below is brief of that text file(test459.txt):
> >>
> >> Recno:: 0
> >> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >>
> >> ParseText::
> >>  SONY China
> >> Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明  
> >> 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
> >> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
> >> �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
> >> æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
> >> 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
> >> 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“�
> >> 其他产å“� 选择产å“�å�类别 选择产å“�系列
> >> /..........................
> >> this is little huge.. so didn't paste everything.
> >>
> >>
> >> Content::
> >> Version: -1
> >> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> contentType: application/xhtml+xml
> >> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
> >> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
> >> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
> >> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
> >> Content-Type=text/html Connection=close
> >> Content:
> >>
> >>
> >> Thanks - David
> >>
> >> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <[email protected]> wrote:
> >>
> >>> I dont think so.
> >>> The tool that you are using to view this must have support
> >>> for the desired languages. I had same problem while looking at the pages
> >>> having chinese content over putty. Installing language packs and tweaking
> >>> putty settings made this go away. I don't recall exact steps / details as I
> >>> did that about a year back.
> >>>
> >>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip <[email protected]> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > For some specific urls, the content fetched is in the form of special
> >>> > characters, Is it character encoding issue? any settings need to be done at
> >>> > nutch parsing level?
> >>> >
> >>> > *url:*
> >>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
> >>> >
> >>> > *content extracted is something like this: *
> >>> >
> >>> >  SONY China
> >>> > Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明  
> >>> > 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
> >>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘
> >>> > æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
> >>> > æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >>> > 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
> >>> > 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
> >>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“�
> >>> > 其他产å“..................
> >>> >
> >>> > *title: *
> >>> > SONY China Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
> >>> >
> >>> >
> >>> > Thanks - David
> >>> >
> >>>
> >>
> >
>

--
Don't Grow Old, Grow Up... :-)

