Hi,

The problem occurs in HtmlParser#sniffCharacterEncoding. It reads the 'charset' parameter of the meta tag from the first CHUNK_SIZE bytes only.
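To illustrate the effect, here is a minimal sketch of chunk-limited charset sniffing. It is not the actual Nutch code: the class name, the regular expression, and the page() helper are made up for illustration, and CHUNK_SIZE is simply set to 2000. It only shows why a meta tag placed beyond the sniffed chunk goes unnoticed.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffSketch {
    // Assumed value of the fixed chunk size; illustrative only.
    public static final int CHUNK_SIZE = 2000;

    // Simplified, hypothetical pattern: a charset=... value inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([a-zA-Z0-9_-]+)",
            Pattern.CASE_INSENSITIVE);

    // Looks for a meta charset only in the first chunkSize bytes of the content.
    public static String sniff(byte[] content, int chunkSize) {
        int length = Math.min(content.length, chunkSize);
        // Decode as ASCII: enough to find the tag, whatever the real encoding is.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Hypothetical test page: paddingBytes of filler placed before the meta tag.
    public static byte[] page(int paddingBytes) {
        StringBuilder sb = new StringBuilder("<html><head><!--");
        for (int i = 0; i < paddingBytes; i++) {
            sb.append('x');
        }
        sb.append("-->");
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
        sb.append("</head><body></body></html>");
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(sniff(page(100), CHUNK_SIZE));  // meta tag inside the chunk: utf-8
        System.out.println(sniff(page(3000), CHUNK_SIZE)); // meta tag past the chunk: null
        System.out.println(sniff(page(3000), 8000));       // larger chunk finds it: utf-8
    }
}
```

When the sniff returns null, the parser has nothing to go on from the page itself and ends up with the default encoding, which matches the garbled output reported in this thread.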
On this page, http://service.sony.com.cn/vaio/Announcments/33412.htm, the meta tag appears after the first 2000 (CHUNK_SIZE) bytes, so the page encoding is not detected. The CHUNK_SIZE parameter cannot be configured.

On Thu, Mar 14, 2013 at 5:18 PM, feng lu <[email protected]> wrote:

> Hi David
>
> The problem is that parseHtml will detect the encoding of parsing html.
> The page http://service.sony.com.cn/vaio/Announcments/33412.htm can not
> be detected by EncodingDetector class. so it set to the default charactor
> encoding. Maybe you can set this property parser.character.encoding.default
> to utf-8 to fixed this problem temporarily.
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>utf-8</value>
>   <description>The character encoding to fall back to when no other
> information
>   is available</description>
> </property>
>
> i test it in my computer and output is like this:
>
> gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch
> plugin parse-html org.apache.nutch.parse.html.HtmlParser
> ~/Downloads/45962.htm
> data: Version: 5
> Status: success(1,0)
> Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知
>
> .....
>
>
> On Thu, Mar 14, 2013 at 3:23 PM, David Philip <[email protected]
> > wrote:
>
>> Hi,
>>
>> I did crawl through this
>> url<http://service.sony.com.cn/vaio/Announcments/45962.htm> and
>> its same issue.
>>
>> Title extracted is in this format:SONY China
>>
>> Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�é€šçŸ
>>
>> It was supposed to be like this :
>> <title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
>>
>> For specific urls like above it has this special characters problem. For
>> rest, characters extracted are proper. ex: this
>> url<http://service.sony.com.cn/9380.htm>it is proper parse.
>>
>>
>> Thanks David.
>>
>>
>> On Thu, Mar 14, 2013 at 12:17 PM, David Philip
>> <[email protected]>wrote:
>>
>> > I am attaching the extracted text file. not sure if you can receive and
>> > view it.
>> >
>> > My observation:
>> > When I compared the extracted text with url<
>> http://service.sony.com.cn/vaio/Announcments/33412.htm> page
>> > (by doing view source). all most everything looks same other than data
>> that
>> > is in ParseText:: section of the extracted text.
>> >
>> >
>> > Thanks -David
>> >
>> >
>> >
>> > On Thu, Mar 14, 2013 at 11:59 AM, David Philip <
>> > [email protected]> wrote:
>> >
>> >> Hi Tejas,
>> >>
>> >> I used the redseg command:bin/nutch readseg -dump
>> >> test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
>> >> -nogenerate -noparse -nofetch -noparsedata
>> >>
>> >> It generated the dump file,then I used less/cat command:
>> >> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump
>> >test459.txt -
>> >> viewed the content as text file(gedit).
>> >>
>> >>
>> >> Below is brief of that text file(test459.txt):
>> >>
>> >> Recno:: 0
>> >> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >>
>> >> ParseText::
>> >>  SONY China
>> >> Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明  
>> >> 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
>> >> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
>> >> �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
>> >> æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>> >> 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
>> >> 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
>> >> 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“�
>> å®¶åºéŸ³å“�产å“�
>> >> 其他产å“� 选择产å“�å�类别 选择产å“�系列
>> >> /..........................
>> >> this is little huge.. so didn't paste everything.
>> >>
>> >>
>> >> Content::
>> >> Version: -1
>> >> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >> contentType: application/xhtml+xml
>> >> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
>> >> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
>> >> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
>> >> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
>> >> Content-Type=text/html Connection=close
>> >> Content:
>> >>
>> >>
>> >> Thanks - David
>> >>
>> >>
>> >> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <
>> [email protected]>wrote:
>> >>
>> >>> I dont think so. The tool that you are using to view this must have
>> >>> support
>> >>> for the desired languages. I had same problem while looking at the
>> pages
>> >>> having chinese content over putty. Installing language packs and
>> tweaking
>> >>> putty settings made this go away. I don't recall exact steps / details
>> >>> as I
>> >>> did that about a year back.
>> >>>
>> >>>
>> >>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
>> >>> <[email protected]>wrote:
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > For some specific urls, the content fetched is in the form of
>> special
>> >>> > characters, Is it character encoding issue? any settings need to be
>> >>> done at
>> >>> > nutch parsing level?
>> >>> >
>> >>> >
>> >>> > *url:*
>> >>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
>> >>> >
>> >>> > *content extracted is something like this: *
>> >>> > *
>> >>> > *
>> >>> >  SONY China
>> >>> > Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
>>  
>> >>> > 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
>> >>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘
>> >>> > æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“�
>> 其他产å“�
>> >>> > æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>> >>> > 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
>> >>> > 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
>> >>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“�
>> >>> å®¶åºéŸ³å“�产å“�
>> >>> > 其他产å“..................
>> >>> >
>> >>> > *title: *
>> >>> > SONY China
>> >>> Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
>> >>> >
>> >>> >
>> >>> > Thanks - David
>> >>> >
>> >>>
>> >>
>> >
>
> --
> Don't Grow Old, Grow Up... :-)
>

--
Don't Grow Old, Grow Up... :-)

