Hi, I crawled this URL <http://service.sony.com.cn/vaio/Announcments/45962.htm> and it shows the same issue.
The extracted title comes out like this:
SONY China Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�é€šçŸ

It was supposed to be:
<title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>

Only specific URLs like the one above have this special-character problem. For the rest, the extracted characters are correct; for example, this URL <http://service.sony.com.cn/9380.htm> is parsed properly.

Thanks
David

On Thu, Mar 14, 2013 at 12:17 PM, David Philip <[email protected]> wrote:

> I am attaching the extracted text file. Not sure if you can receive and
> view it.
>
> My observation:
> When I compared the extracted text with the page source (view source) of
> <http://service.sony.com.cn/vaio/Announcments/33412.htm>, almost everything
> looks the same except the data in the ParseText:: section of the extracted
> text.
>
> Thanks - David
>
> On Thu, Mar 14, 2013 at 11:59 AM, David Philip <[email protected]> wrote:
>
>> Hi Tejas,
>>
>> I used the readseg command: bin/nutch readseg -dump
>> test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test
>> -nogenerate -noparse -nofetch -noparsedata
>>
>> It generated the dump file, then I used the less/cat command:
>> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump >test459.txt -
>> and viewed the content as a text file (gedit).
>>
>> Below is a brief excerpt of that text file (test459.txt):
>>
>> Recno:: 0
>> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
>>
>> ParseText::
>>  SONY China
>> Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
>> 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
>> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘ æ•°ç
>> �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
>> æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>> 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
>> 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
>> 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“�
>> 其他产å“� 选择产å“�å�类别 选择产å“�系列
>> /..........................
>> This is rather large, so I didn't paste everything.
>>
>> Content::
>> Version: -1
>> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
>> contentType: application/xhtml+xml
>> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
>> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
>> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
>> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
>> Content-Type=text/html Connection=close
>> Content:
>>
>> Thanks - David
>>
>> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <[email protected]> wrote:
>>
>>> I don't think so. The tool that you are using to view this must have
>>> support for the desired languages. I had the same problem while looking
>>> at pages with Chinese content over PuTTY. Installing language packs and
>>> tweaking PuTTY settings made this go away. I don't recall the exact
>>> steps / details as I did that about a year back.
>>>
>>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > For some specific URLs, the fetched content comes out as special
>>> > characters. Is it a character encoding issue? Do any settings need to
>>> > be changed at the Nutch parsing level?
>>> >
>>> > *url:*
>>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
>>> >
>>> > *content extracted is something like this:*
>>> >
>>> >  SONY China
>>> > Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
>>> > 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
>>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘
>>> > æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
>>> > æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
>>> > 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
>>> > 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
>>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“�
>>> > å®¶åºéŸ³å“�产å“�
>>> > 其他产å“..................
>>> >
>>> > *title:*
>>> > SONY China
>>> > Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
>>> >
>>> > Thanks - David
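
For what it's worth, the garbled output in this thread looks like UTF-8 bytes that were decoded as windows-1252. That is only a guess from the character patterns (for example "品" showing up as "å“�"), not something confirmed in the thread. Below is a minimal Python sketch of that effect; the function names `garble` and `ungarble` are made up for illustration and have nothing to do with Nutch itself.

```python
# Sketch only. Assumption: the page is served as UTF-8, but the bytes were
# decoded as windows-1252 somewhere between fetching and viewing.

def garble(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as windows-1252.

    Bytes that windows-1252 leaves undefined (such as 0x81) become U+FFFD.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def ungarble(text: str) -> str:
    """Undo the wrong decoding.

    Raises UnicodeEncodeError if the text contains U+FFFD, because the
    original byte is already lost at that point.
    """
    return text.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    print(garble("品"))            # same pattern as in the extracted title
    print(ungarble(garble("产")))  # round-trips because no byte was lost
```

If the pattern matches, the next thing to check would be which charset the parser assumed for these pages: the dump shows `Content-Type=text/html` with no charset, so a fallback encoding may have been used. As far as I recall, Nutch has a `parser.character.encoding.default` property for that fallback, but please verify against your own `nutch-default.xml`.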

