Hi,

   Thank you Rajani Maski and feng lu. It worked for me. I had done the
Tomcat setting but had missed the Nutch setting.
Thank you very much.

Thanks - David



On Thu, Mar 14, 2013 at 3:16 PM, feng lu <[email protected]> wrote:

> Hi
>
> The problem happens in HtmlParser#sniffCharacterEncoding. It reads the
> 'charset' parameter of the meta tag from the first CHUNK_SIZE bytes only.
>
> In this page http://service.sony.com.cn/vaio/Announcments/33412.htm, the
> meta tag is past the first 2000 (CHUNK_SIZE) bytes, so the page encoding
> is not detected. And this CHUNK_SIZE parameter cannot be configured.
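
To make the point above concrete, here is a minimal Python sketch of the
idea. It is not Nutch's actual Java code: the function name sniff_charset and
the regular expression are made up for illustration, and only the 2000-byte
CHUNK_SIZE value comes from this thread.

```python
import re

CHUNK_SIZE = 2000  # value quoted in this thread

# Hypothetical pattern: grabs the charset value from a <meta ...> tag.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_charset(content: bytes, chunk_size: int = CHUNK_SIZE):
    """Return the meta charset found in the first chunk_size bytes, else None."""
    head = content[:chunk_size]  # everything after this offset is ignored
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


# Meta tag near the top of the page: found.
early = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"></head></html>'
)

# Meta tag pushed past the first 2000 bytes by 3200 bytes of padding: missed.
late = (
    b"<html><head>" + b"<!-- padding -->" * 200
    + b'<meta charset="utf-8"></head></html>'
)

print(sniff_charset(early))
print(sniff_charset(late))
```

With the default chunk size the second page yields no charset, which is the
situation described for 33412.htm. Passing a larger chunk_size (for example
4000) to this sketch finds it again.
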
>
>
>
>
> On Thu, Mar 14, 2013 at 5:18 PM, feng lu <[email protected]> wrote:
>
> > Hi David
> >
> > The problem is that parseHtml detects the encoding of the HTML being
> > parsed. The encoding of the page
> > http://service.sony.com.cn/vaio/Announcments/33412.htm cannot be detected
> > by the EncodingDetector class, so it is set to the default character
> > encoding. Maybe you can set the property parser.character.encoding.default
> > to utf-8 to work around this problem temporarily.
> >
> > <property>
> >   <name>parser.character.encoding.default</name>
> >   <value>utf-8</value>
> >   <description>The character encoding to fall back to when no other
> >   information is available</description>
> > </property>
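
The effect of that fallback can be sketched in a few lines of Python. This is
only an illustration, not Nutch code: decode_page is a hypothetical helper,
and windows-1252 is used as the "wrong" default on the assumption that it is
the stock value of parser.character.encoding.default.

```python
def decode_page(content: bytes, detected, default_encoding: str) -> str:
    """Decode with the detected charset, or fall back to the default one."""
    encoding = detected or default_encoding
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return content.decode(encoding, errors="replace")


title = "SONY China Service-关于对部分索尼VAIO个人电脑产品"
raw = title.encode("utf-8")  # the page itself is served as UTF-8

# Detection failed (None), so the default decides how the bytes are read.
garbled = decode_page(raw, None, "windows-1252")  # assumed stock default
fixed = decode_page(raw, None, "utf-8")           # default set as suggested

print(garbled)
print(fixed)
```

With the wrong default, every multi-byte UTF-8 character is read as two or
three single-byte characters, which gives the kind of output seen in the
extracted title. With utf-8 as the default the title comes back intact.
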
> >
> > I tested it on my computer and the output looks like this:
> >
> > gxl@gxl-desktop:~/workspace/java/nutch-svn/runtime/local$ bin/nutch
> > plugin parse-html org.apache.nutch.parse.html.HtmlParser
> > ~/Downloads/45962.htm
> > data: Version: 5
> > Status: success(1,0)
> > Title: SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知
> >
> > .....
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 14, 2013 at 3:23 PM, David Philip <[email protected]> wrote:
> >
> >> Hi,
> >>
> >>   I crawled this URL
> >> <http://service.sony.com.cn/vaio/Announcments/45962.htm> and it has the
> >> same issue.
> >>
> >> Title extracted is in this format: SONY China
> >>
> >>
> Service-关于对部分索尼VAIO个人电脑产å“�å�‘布安全更新程åº�çš„é‡�è¦�通çŸ
> >>
> >> It was supposed to look like this:
> >> <title>SONY China Service-关于对部分索尼VAIO个人电脑产品发布安全更新程序的重要通知</title>
> >>
> >> Specific URLs like the one above have this special-character problem.
> >> For the rest, the extracted characters are correct, e.g. this URL
> >> <http://service.sony.com.cn/9380.htm> is parsed properly.
> >>
> >>
> >> Thanks David.
> >>
> >>
> >> On Thu, Mar 14, 2013 at 12:17 PM, David Philip
> >> <[email protected]>wrote:
> >>
> >> > I am attaching the extracted text file. Not sure if you can receive
> >> > and view it.
> >> >
> >> > My observation:
> >> > When I compared the extracted text with the page
> >> > <http://service.sony.com.cn/vaio/Announcments/33412.htm> (by doing view
> >> > source), almost everything looks the same other than the data that is
> >> > in the ParseText:: section of the extracted text.
> >> >
> >> >
> >> > Thanks -David
> >> >
> >> >
> >> >
> >> > On Thu, Mar 14, 2013 at 11:59 AM, David Philip <
> >> > [email protected]> wrote:
> >> >
> >> >> Hi Tejas,
> >> >>
> >> >>    I used the readseg command:
> >> >> bin/nutch readseg -dump test_serviceCnSite/segments/20130314114459/
> >> >> extracttestcrawl/test -nogenerate -noparse -nofetch -noparsedata
> >> >>
> >> >> It generated the dump file, then I used the less/cat command:
> >> >> /Downloads/apache-nutch-1.6/extracttestcrawl/test$ cat dump > test459.txt
> >> >> and viewed the content as a text file (gedit).
> >> >>
> >> >>
> >> >> Below is a brief excerpt of that text file (test459.txt):
> >> >>
> >> >> Recno:: 0
> >> >> URL:: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >>
> >> >> ParseText::
> >> >>  SONY China
> >> >> Service-关于建议使用正宗索尼电�适�器的声明
> &nbsp
> >> >> 首页   新闻与公告   产�支� 个人电脑�周边产�
> >> >> VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
> æ•°ç
> >> >> �影�产� 家庭影�产� 家庭音�产� 其他产�
> >> >> æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> >> 按照产�型��索 关键字     选择产�系列 / 型�
> >> >> 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
> >> >> 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
> >> 家庭音�产�
> >> >> 其他产� 选择产��类别 选择产�系列
> >> >> /..........................
> >> >> This is rather large, so I didn't paste everything.
> >> >>
> >> >>
> >> >> Content::
> >> >> Version: -1
> >> >> url: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >> base: http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >> contentType: application/xhtml+xml
> >> >> metadata: cache-control=max-age=14400 Age=0 Content-Length=13187
> >> >> Last-Modified=Sun, 09 Dec 2012 15:14:54 GMT Content-Encoding=gzip
> >> >> nutch.crawl.score=1.0 server=SAP J2EE Engine/7.00 _fst_=33
> >> >> nutch.segment.name=20130314114459 date=Thu, 14 Mar 2013 06:15:22 GMT
> >> >> Content-Type=text/html Connection=close
> >> >> Content:
> >> >>
> >> >>
> >> >> Thanks - David
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Mar 14, 2013 at 10:37 AM, Tejas Patil <[email protected]> wrote:
> >> >>
> >> >>> I don't think so. The tool that you are using to view this must have
> >> >>> support for the desired languages. I had the same problem while
> >> >>> looking at pages with Chinese content over PuTTY. Installing language
> >> >>> packs and tweaking the PuTTY settings made this go away. I don't
> >> >>> recall the exact steps / details, as I did that about a year back.
> >> >>>
> >> >>>
> >> >>> On Wed, Mar 13, 2013 at 9:58 PM, David Philip
> >> >>> <[email protected]>wrote:
> >> >>>
> >> >>> > Hi,
> >> >>> >
> >> >>> >   For some specific URLs, the content fetched is in the form of
> >> >>> > special characters. Is it a character encoding issue? Do any
> >> >>> > settings need to be done at the Nutch parsing level?
> >> >>> >
> >> >>> >
> >> >>> > *url:*
> >> >>> > http://service.sony.com.cn/vaio/Announcments/33412.htm
> >> >>> >
> >> >>> > *content extracted is something like this: *
> >> >>> >  SONY China
> >> >>> > Service-关于建议使用正宗索尼电�适�器的声明
> >> &nbsp
> >> >>> > 首页   新闻与公告   产�支�
> 个人电脑�周边产�
> >> >>> > VAIO个人电脑 电脑周边外设 Sony Tablet 索尼平�电脑
> >> >>> > æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“� 家庭音å“�产å“�
> >> 其他产�
> >> >>> > æœ�务网络   è�”系我们   æ ¹æ�®äº§å“�åž‹å�·æ�œç´¢
> >> >>> > 按照产�型��索 关键字     选择产�系列 / 型�
> >> >>> > 选择产�类别 VAIO个人电脑 Sony Tablet 索尼平�电脑
> >> >>> > 电脑周边外设 æ•°ç �å½±åƒ�产å“� 家庭影åƒ�产å“�
> >> >>> 家庭音�产�
> >> >>> > 其他产å“..................
> >> >>> >
> >> >>> > *title: *
> >> >>> > SONY China
> >> >>> Service-关于建议使用正宗索尼电�适�器的声明
> >> >>> >
> >> >>> >
> >> >>> > Thanks - David
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >
> >
> >
> > --
> > Don't Grow Old, Grow Up... :-)
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
