Here are the details.
Try dumping the content of the URL below with http.content.limit set to -1 (unlimited):
http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
This page is gzip encoded. The fetcher does not download any content for
it; you can confirm this by inspecting the content-length.
Initially I thought this was a problem with the parse-html plugin, but it
now looks like the fetcher itself returns null content.
This seems related to NUTCH-374.
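To make the suspicion concrete, here is a minimal standalone sketch of how a
size limit and gzip decoding can interact. It is not Nutch code: the class
GzipLimitSketch and its gunzip/gzip helpers are hypothetical names made up
for this illustration, and I am only assuming that the fetcher passes the
configured limit into a similar decompression step. The point is that a
negative limit has to be treated as "no limit"; a helper that takes -1
literally would keep no bytes at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLimitSketch {

    // Hypothetical helper, not taken from Nutch: inflate a gzip body and keep
    // at most sizeLimit bytes. A negative sizeLimit means "no limit".
    static byte[] gunzip(byte[] compressed, int sizeLimit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                // Only enforce the cap when the limit is non-negative.
                if (sizeLimit >= 0 && out.size() + n > sizeLimit) {
                    out.write(buf, 0, sizeLimit - out.size());
                    break;
                }
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Builds a gzip-compressed body so the sketch needs no network access.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><body>hello</body></html>"
            .getBytes(StandardCharsets.UTF_8);
        byte[] body = gzip(page);
        // With -1 the whole page should come back; with 10 it is truncated.
        System.out.println("limit -1: " + gunzip(body, -1).length + " bytes");
        System.out.println("limit 10: " + gunzip(body, 10).length + " bytes");
    }
}
```

If the real code path lacks the equivalent of the "sizeLimit >= 0" check, that
would match what I am seeing, but I have not confirmed this against the source.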
Let me know if you need further info.
On Fri, Feb 1, 2013 at 1:54 AM, Lewis John Mcgibbney <
[email protected]> wrote:
> Can you briefly describe the problem here Sourajit?
>
> On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak
> <[email protected]> wrote:
> > Seems to be related to NUTCH-374 but that shows as fixed.
> >
> > I have set Nutch to accept unlimited content size & this page is gzip
> > encoded.
> >
> >
> >
> > On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak <[email protected]> wrote:
> >
> >> Re-opening this thread.
> >>
> >> Using Nutch v1.5 try to get the parseText from this NYTimes url (Use
> >> parse-html)
> >>
> >> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> >>
> >> I do not get any content from the fetcher. This is my fetcher accept
> >> params.
> >> <property>
> >> <name>http.accept</name>
> >> <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
> >> </property>
> >>
> >>
> >>
> >>
> >> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]> wrote:
> >>
> >>> Ignore my last post. Tika isn't slowing down, neither is this property.
> >>>
> >>>
> >>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <
> >>> [email protected]> wrote:
> >>>
> >>>> Enabling this property slows down the parse phase drastically when
> >>>> encountered with mime-type image/jpeg.
> >>>>
> >>>>
> >>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> Thanks Julien.
> >>>>>
> >>>>> I can get the outlinks now, let me check if I can get the raw
> >>>>> content. I will update this thread.
> >>>>>
> >>>>>
> >>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> The parameter
> >>>>>>
> >>>>>> <property>
> >>>>>> <name>mime.type.magic</name>
> >>>>>> <value>true</value>
> >>>>>> <description>Defines if the mime content type detector uses magic
> >>>>>> resolution.
> >>>>>> </description>
> >>>>>> </property>
> >>>>>>
> >>>>>> should trigger the mime type detection based on the content and not
> >>>>>> on what the server returns. It is not a Tika issue as such as the
> >>>>>> selection of what parser to use is based on the mimetype that Nutch
> >>>>>> uses.
> >>>>>>
> >>>>>> The param above should be set to true by default. I thought we had
> >>>>>> more options but am probably confusing with the language
> >>>>>> identification
> >>>>>>
> >>>>>> Julien
> >>>>>>
> >>>>>>
> >>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
> >>>>>>
> >>>>>> > DEBUG tika.TikaParser - Using Tika parser
> >>>>>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
> >>>>>> >
> >>>>>> > The above indicates Tika is fired. But somehow I need to tell Tika
> >>>>>> > to use HtmlParser for mime-type text/plain. Have to dig into Tika
> >>>>>> > docs.
> >>>>>> >
> >>>>>> > Is it possible to do anything in Nutch ?
> >>>>>> >
> >>>>>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
> >>>>>> >
> >>>>>> > > Some of my target webpages return a mime type of text/plain
> >>>>>> > > though they are htmls. I changed "http.accept" to include
> >>>>>> > > text/plain and configured both tika & parse-html to see if those
> >>>>>> > > can be parsed. However, both seem to produce no content.
> >>>>>> > >
> >>>>>> > > I changed parse-plugins.xml & the corresponding plugin.xml's to
> >>>>>> > > match this mime type.
> >>>>>> > >
> >>>>>> > > Has anyone encountered this problem ?
> >>>>>> > >
> >>>>>> > >
> >>>>>> > >
> >>>>>> >
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> *
> >>>>>> *Open Source Solutions for Text Engineering
> >>>>>>
> >>>>>> http://digitalpebble.blogspot.com/
> >>>>>> http://www.digitalpebble.com
> >>>>>> http://twitter.com/digitalpebble
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
>
> --
> Lewis
>