Re: mime type text/plain

Sourajit Basak Sat, 02 Feb 2013 21:32:30 -0800

Did you configure the fetcher to follow redirects?
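
For reference, a minimal sketch of the setting in question, assuming the standard http.redirect.max property from nutch-default.xml is what governs this in your version. The value 3 is only an illustrative choice, not a recommendation. As far as I know the default is 0, in which case the fetcher does not follow the redirect immediately but records the target URL for a later fetch round:

```xml
<!-- nutch-site.xml override; the value 3 is an assumed example -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```

Even with this set, the NYTimes redirect target is a login URL (glogin), so the fetched content may still not be the article itself.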

On Sat, Feb 2, 2013 at 8:43 PM, Sebastian Nagel
<[email protected]> wrote:


> Hi,
>
> the given URL is a redirect (HTTP 303, at least, when I try) with no
> content (only the HTTP header).
> Tried with curl and Nutch's parsechecker tool:
>
> % bin/nutch parsechecker "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home"
> fetching: http://www.nytimes.com/...
> ...
> Content Metadata: Vary=Host Date=Sat, 02 Feb 2013 15:01:18 GMT
> Content-Length=0
> Location=
> http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=548bb88dQ2FRtezRXVzRQ3DQ3DQ3DRfzDQ2AR(tgrHttzEREQ26Q271RQ26Q27R1Q27Rz5gfXtQ2At.VRgf_X5r5hfQ51gG5Hrh_X!_Q2AzHQ51z5hX5Q3DhVtHGhz_D5rhgtDeQ2Bz5HrQ5EfzDQ2A
> Set-Cookie=RMID=007f0100777d510d2a3e0045; Expires=Sun, 02 Feb 2014
> 15:01:18 GMT; Path=/;
> Domain=.nytimes.com; Content-Type=text/plain Connection=close
> Server=Apache
> Parse Metadata: Content-Encoding=UTF-8 Content-Type=text/plain;
> charset=UTF-8
>
> % curl -v "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home" >/dev/null
> ...
> > GET /2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home HTTP/1.1
> ...
> >
> < HTTP/1.1 303 See Other
> < Date: Sat, 02 Feb 2013 14:59:03 GMT
> < Server: Apache
> < Set-Cookie: RMID=007f01000e9f510d29b70033; Expires=Sun, 02 Feb 2014 14:59:03 GMT; Path=/; Domain=.nytimes.com;
> < Vary: Host
> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=f39d9b3aQ2FQ2AmQ51dQ2AKSdQ2A(((Q2AQ7Ddg_Q2ANm46JmmdUQ2AUCVMQ2ACVQ2AMVQ2AdQ274Q7DKm_mrSQ2A4Q7DtKQ276Q27!Q7DQ7E42Q27J6!tKyt_dJQ7EdQ27!KQ27(!SmJ2!dtgQ276!4mgQ51ndQ27J6GQ7Ddg_
> < Content-Length: 0
> < Connection: close
> < Content-Type: text/plain
>
> Sebastian
>
>
> On 02/01/2013 05:47 AM, Sourajit Basak wrote:
> > Here it goes.
> >
> > Try to dump the content from this url with the following settings.
> >
> > http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> >
> >   <property>
> >     <name>http.content.limit</name>
> >     <value>-1</value>
> >   </property>
> >
> > This page is gzip encoded. You will see that the fetcher is unable to
> > download any content. Check by inspecting the content-length.
> > Initially I thought it was a problem with the parse-html plugin, but
> > now it seems that the fetcher returns null content.
> >
> > This seemed related to NUTCH-374
> >
> > Let me know if you need further info.
> >
> > On Fri, Feb 1, 2013 at 1:54 AM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Can you briefly describe the problem here Sourajit?
> >>
> >> On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak
> >> <[email protected]> wrote:
> >>> Seems to be related to NUTCH-374 but that shows as fixed.
> >>>
> >>> I have set Nutch to accept unlimited content size & this page is gzip
> >>> encoded.
> >>>
> >>>
> >>>
> >>> On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak <[email protected]> wrote:
> >>>
> >>>> Re-opening this thread.
> >>>>
> >>>> Using Nutch v1.5 try to get the parseText from this NYTimes url (Use
> >>>> parse-html)
> >>>>
> >>>> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> >>>>
> >>>> I do not get any content from the fetcher. This is my fetcher accept
> >>>> params.
> >>>>   <property>
> >>>>     <name>http.accept</name>
> >>>>     <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
> >>>>   </property>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]> wrote:
> >>>>
> >>>>> Ignore my last post. Tika isn't slowing down, and neither is this
> >>>>> property.
> >>>>>
> >>>>>
> >>>>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Enabling this property slows down the parse phase drastically when
> >>>>>> it encounters the mime-type image/jpeg.
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>> Thanks Julien.
> >>>>>>>
> >>>>>>> I can get the outlinks now, let me check if I can get the raw
> >>>>>>> content.
> >>>>>>> I will update this thread.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> The parameter
> >>>>>>>>
> >>>>>>>> <property>
> >>>>>>>>   <name>mime.type.magic</name>
> >>>>>>>>   <value>true</value>
> >>>>>>>>   <description>Defines if the mime content type detector uses magic
> >>>>>>>>   resolution.
> >>>>>>>>   </description>
> >>>>>>>> </property>
> >>>>>>>>
> >>>>>>>> should trigger mime type detection based on the content and not on
> >>>>>>>> what the server returns. It is not a Tika issue as such, as the
> >>>>>>>> selection of which parser to use is based on the mimetype that Nutch
> >>>>>>>> uses.
> >>>>>>>>
> >>>>>>>> The param above should be set to true by default. I thought we had
> >>>>>>>> more options but am probably confusing it with the language
> >>>>>>>> identification.
> >>>>>>>>
> >>>>>>>> Julien
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> DEBUG tika.TikaParser - Using Tika parser
> >>>>>>>>> org.apache.tika.parser.txt.TXTParser for mime-type text/plain
> >>>>>>>>>
> >>>>>>>>> The above indicates Tika is fired. But somehow I need to tell Tika
> >>>>>>>>> to use HtmlParser for mime-type text/plain. Have to dig into Tika docs.
> >>>>>>>>>
> >>>>>>>>> Is it possible to do anything in Nutch ?
> >>>>>>>>>
> >>>>>>>>> On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Some of my target webpages return a mime type of text/plain though
> >>>>>>>>>> they are HTML. I changed "http.accept" to include text/plain and
> >>>>>>>>>> configured both tika & parse-html to see if those can be parsed.
> >>>>>>>>>> However, both seem to produce no content.
> >>>>>>>>>>
> >>>>>>>>>> I changed parse-plugins.xml & the corresponding plugin.xml's to match
> >>>>>>>>>> this mime type.
> >>>>>>>>>>
> >>>>>>>>>> Has anyone encountered this problem ?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> *
> >>>>>>>> *Open Source Solutions for Text Engineering
> >>>>>>>>
> >>>>>>>> http://digitalpebble.blogspot.com/
> >>>>>>>> http://www.digitalpebble.com
> >>>>>>>> http://twitter.com/digitalpebble
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >>
> >>
> >> --
> >> Lewis
> >>
> >
>
>
