Hi,

Thank you, Rajani Maski and feng lu. It worked for me. I had done the Tomcat setting but had missed the Nutch setting. Thank you very much.
Thanks - David

On Thu, Mar 14, 2013 at 3:16 PM, feng lu <[email protected]> wrote:

> Hi
>
> The problem happens in HtmlParser#sniffCharacterEncoding. It reads the
> 'charset' parameter of the meta tag from the first CHUNK_SIZE bytes.
>
> In this page http://service.sony.com.cn/vaio/Announcments/33412.htm, the
> meta tag lies past the first 2000 (CHUNK_SIZE) bytes, so the page encoding
> will not be detected. But this CHUNK_SIZE param cannot be configured.
>
> On Thu, Mar 14, 2013 at 5:18 PM, feng lu <[email protected]> wrote:
>
> > Hi David
> >
> > The problem is that parseHtml tries to detect the encoding of the HTML
> > it is parsing. The encoding of the page
> > http://service.sony.com.cn/vaio/Announcments/33412.htm cannot be
> > detected by the EncodingDetector class, so it is set to the default
> > character encoding. Maybe you can set the property
> > parser.character.encoding.default to utf-8 to fix this problem
> > temporarily.
> >
> > <property>
> >   <name>parser.character.encoding.default</name>
> >   <value>utf-8</value>
> >   <description>The character encoding to fall back to when no other
> >   information is available</description>
> > </property>
> >
> > I tested it on my computer and the output looks like this:
> >
> > gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser ~/Downloads/45962.htm
> > data: Version: 5
> > Status: success(1,0)
> > Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知
> >
> > .....
> >
> > On Thu, Mar 14, 2013 at 3:23 PM, David Philip <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I crawled this URL
> >> <http://service.sony.com.cn/vaio/Announcments/45962.htm> and it has
> >> the same issue.
> >>
> >> The title extracted is in this format:
> >> SONY China Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�é€šçŸ
> >>
> >> It was supposed to be like this:
> >> <title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
> >>
> >> Specific URLs like the one above have this special-character problem.
> >> For the rest, the extracted characters are correct, e.g. this URL
> >> <http://service.sony.com.cn/9380.htm> is parsed properly.
> >>
> >> Thanks David.
> >>
> >> On Thu, Mar 14, 2013 at 12:17 PM, David Philip <[email protected]> wrote:
> >>
> >> > I am attaching the extracted text file. Not sure if you can receive
> >> > and view it.
> >> >
> >> > My observation:
> >> > When I compared the extracted text with the page at
> >> > <http://service.sony.com.cn/vaio/Announcments/33412.htm> (by doing
> >> > view source), almost everything looks the same, other than the data
> >> > in the ParseText:: section of the extracted text.
> >> >
> >> > Thanks -David
> >> >
> >> > On Thu, Mar 14, 2013 at 11:59 AM, David Philip <[email protected]> wrote:
> >> >
> >> >> Hi Tejas,
> >> >>
> >> >> I used the readseg command:
> >> >> bin/nutch readseg -dump test_serviceCnSite/segments/20130314114459/ extracttestcrawl/test -nogenerate -noparse -nofetch -noparsedata
> >> >>
> >> >> It generated the dump file, then I used the less/cat command:
> >> >> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump > test459.txt
> >> >> and viewed the content as a text file (gedit).
> >> >>
> >> >> Below is a brief extract of that text file (test459.txt):
> >> >>
> >> >> Recno:: 0
> >> >> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >>
> >> >> ParseText::
> >> >>  SONY China
> >> >> Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明  
> >> >> 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
> >> >> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘
> >> >> æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
> >> >> æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> >> 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
> >> >> 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
> >> >> 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“�
> >> >> 其他产å“� 选择产å“�å�类别 选择产å“�系列
> >> >> /..........................
> >> >> This is rather long, so I didn't paste everything.
> >> >>
> >> >> Content::
> >> >> Version: -1
> >> >> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >> contentType: application/xhtml+xml
> >> >> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
> >> >> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
> >> >> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
> >> >> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
> >> >> Content-Type=text/html Connection=close
> >> >> Content:
> >> >>
> >> >> Thanks - David
> >> >>
> >> >> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <[email protected]> wrote:
> >> >>
> >> >>> I don't think so. The tool that you are using to view this must
> >> >>> have support for the desired languages. I had the same problem
> >> >>> while looking at pages with Chinese content over PuTTY. Installing
> >> >>> language packs and tweaking the PuTTY settings made this go away.
> >> >>> I don't recall the exact steps / details as I did that about a
> >> >>> year back.
> >> >>>
> >> >>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip <[email protected]> wrote:
> >> >>>
> >> >>> > Hi,
> >> >>> >
> >> >>> > For some specific URLs, the content fetched is in the form of
> >> >>> > special characters. Is it a character encoding issue? Do any
> >> >>> > settings need to be made at the Nutch parsing level?
> >> >>> >
> >> >>> > *url:*
> >> >>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >>> >
> >> >>> > *content extracted is something like this:*
> >> >>> >
> >> >>> >  SONY China
> >> >>> > Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明  
> >> >>> > 首页 新闻与公告 产å“�支æŒ� 个人电脑å�Šå‘¨è¾¹äº§å“�
> >> >>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平æ�¿ç”µè„‘
> >> >>> > æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“� 其他产å“�
> >> >>> > æœ�务网络 è�”系我们 æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> >>> > 按照产å“�åž‹å�·æ�œç´¢ å…³é”®å— é€‰æ‹©äº§å“�系列 / åž‹å�·
> >> >>> > 选择产å“�类别 VAIO个人电脑 Sony Tablet 索尼平æ�¿ç”µè„‘
> >> >>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� å®¶åºå½±åƒ�产å“� å®¶åºéŸ³å“�产å“�
> >> >>> > 其他产å“..................
> >> >>> >
> >> >>> > *title:*
> >> >>> > SONY China Service-关于建议使用æ£å®—索尼电æº�适é…�器的声明
> >> >>> >
> >> >>> > Thanks - David
> >
> > --
> > Don't Grow Old, Grow Up... :-)
>
> --
> Don't Grow Old, Grow Up... :-)
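For anyone who finds this thread later, the behaviour feng lu describes can be reproduced with a small Python sketch. This is not Nutch code: sniff_charset and decode_page are hypothetical helper names, the regular expression is a simplification of real meta-tag parsing, and the windows-1252 fallback is an assumption chosen because it is consistent with the garbled output quoted above. The sketch only looks for a charset declaration inside the first CHUNK_SIZE (2000) bytes and otherwise decodes with the configured default.

```python
import re

CHUNK_SIZE = 2000  # size of the sniffing window mentioned in the thread

# Simplified pattern for a charset declaration inside a <meta> tag (assumption).
META_CHARSET = re.compile(rb'''<meta[^>]*charset\s*=\s*["']?([\w.:-]+)''', re.IGNORECASE)


def sniff_charset(content: bytes, chunk_size: int = CHUNK_SIZE):
    """Return the charset declared in the first chunk_size bytes, or None."""
    match = META_CHARSET.search(content[:chunk_size])
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(content: bytes, default_encoding: str) -> str:
    """Decode with the sniffed charset, falling back to default_encoding."""
    charset = sniff_charset(content) or default_encoding
    # Undecodable bytes become U+FFFD, like the replacement characters in the dump.
    return content.decode(charset, errors="replace")


TITLE = "SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知"
META = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
PADDING = b"<!-- " + b"x" * CHUNK_SIZE + b" -->"  # pushes the meta tag past the window
TITLE_TAG = b"<title>" + TITLE.encode("utf-8") + b"</title>"

# Meta tag inside the sniffing window: the charset is found.
early_meta_page = b"<html><head>" + META + PADDING + TITLE_TAG + b"</head></html>"
# Meta tag beyond the sniffing window: the charset is missed.
late_meta_page = b"<html><head>" + PADDING + META + TITLE_TAG + b"</head></html>"

if __name__ == "__main__":
    print(sniff_charset(early_meta_page))
    print(sniff_charset(late_meta_page))
    print(decode_page(late_meta_page, "windows-1252")[-60:])
    print(decode_page(late_meta_page, "utf-8")[-60:])
```

With the meta tag pushed past the window, the result depends entirely on the fallback encoding, which is why setting parser.character.encoding.default to utf-8 works around the problem for UTF-8 pages like this one.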

