Did you set Nutch to follow redirects?

On Sat, Feb 2, 2013 at 8:43 PM, Sebastian Nagel <[email protected]> wrote:
> Hi,
>
> the given URL is a redirect (HTTP 303, at least, when I try) with no
> content (only the HTTP header).
> Tried with curl and Nutch's parsechecker tool:
>
> % bin/nutch parsechecker "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home"
> fetching: http://www.nytimes.com/...
> ...
> Content Metadata: Vary=Host Date=Sat, 02 Feb 2013 15:01:18 GMT
> Content-Length=0
> Location=http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=548bb88dQ2FRtezRXVzRQ3DQ3DQ3DRfzDQ2AR(tgrHttzEREQ26Q271RQ26Q27R1Q27Rz5gfXtQ2At.VRgf_X5r5hfQ51gG5Hrh_X!_Q2AzHQ51z5hX5Q3DhVtHGhz_D5rhgtDeQ2Bz5HrQ5EfzDQ2A
> Set-Cookie=RMID=007f0100777d510d2a3e0045; Expires=Sun, 02 Feb 2014 15:01:18 GMT; Path=/;
> Domain=.nytimes.com; Content-Type=text/plain Connection=close
> Server=Apache
> Parse Metadata: Content-Encoding=UTF-8 Content-Type=text/plain; charset=UTF-8
>
> % curl -v "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home" >/dev/null
> ...
> > GET /2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home HTTP/1.1
> ...
> >
> < HTTP/1.1 303 See Other
> < Date: Sat, 02 Feb 2013 14:59:03 GMT
> < Server: Apache
> < Set-Cookie: RMID=007f01000e9f510d29b70033; Expires=Sun, 02 Feb 2014 14:59:03 GMT; Path=/; Domain=.nytimes.com;
> < Vary: Host
> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=f39d9b3aQ2FQ2AmQ51dQ2AKSdQ2A(((Q2AQ7Ddg_Q2ANm46JmmdUQ2AUCVMQ2ACVQ2AMVQ2AdQ274Q7DKm_mrSQ2A4Q7DtKQ276Q27!Q7DQ7E42Q27J6!tKyt_dJQ7EdQ27!KQ27(!SmJ2!dtgQ276!4mgQ51ndQ27J6GQ7Ddg_
> < Content-Length: 0
> < Connection: close
> < Content-Type: text/plain
>
> Sebastian
>
>
> On 02/01/2013 05:47 AM, Sourajit Basak wrote:
> > Here it goes.
> >
> > Try to dump the content from this url with the following settings.
> >
> > http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> > </property>
> >
> > This page is gzip encoded. You will see that the fetcher is unable to
> > download any content. Check by inspecting the content-length.
> > Initially I was thinking it to be a problem with the parse-html plugin but
> > now it seems that the fetcher returns null content.
> >
> > This seemed related to NUTCH-374
> >
> > Let me know if you need further info.
> >
> > On Fri, Feb 1, 2013 at 1:54 AM, Lewis John Mcgibbney <[email protected]> wrote:
> >
> >> Can you briefly describe the problem here Sourajit?
> >>
> >> On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak <[email protected]> wrote:
> >>> Seems to be related to NUTCH-374 but that shows as fixed.
> >>>
> >>> I have set Nutch to accept unlimited content size & this page is gzip
> >>> encoded.
> >>>
> >>>
> >>> On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak <[email protected]> wrote:
> >>>
> >>>> Re-opening this thread.
> >>>>
> >>>> Using Nutch v1.5 try to get the parseText from this NYTimes url (Use
> >>>> parse-html)
> >>>>
> >>>> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> >>>>
> >>>> I do not get any content from the fetcher. This is my fetcher accept
> >>>> params.
> >>>> <property>
> >>>>   <name>http.accept</name>
> >>>>   <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
> >>>> </property>
> >>>>
> >>>>
> >>>> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]> wrote:
> >>>>
> >>>>> Ignore my last post. Tika isn't slowing down, neither is this property.
> >>>>>
> >>>>>
> >>>>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <[email protected]> wrote:
> >>>>>
> >>>>>> Enabling this property slows down the parse phase drastically when
> >>>>>> encountered with mime-type image/jpeg.
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <[email protected]> wrote:
> >>>>>>
> >>>>>>> Thanks Julien.
> >>>>>>>
> >>>>>>> I can get the outlinks now, let me check if I can get the raw content.
> >>>>>>> I will update this thread.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> The parameter
> >>>>>>>>
> >>>>>>>> <property>
> >>>>>>>>   <name>mime.type.magic</name>
> >>>>>>>>   <value>true</value>
> >>>>>>>>   <description>Defines if the mime content type detector uses magic
> >>>>>>>>   resolution.
> >>>>>>>>   </description>
> >>>>>>>> </property>
> >>>>>>>>
> >>>>>>>> should trigger the mime type detection based on the content and not on what
> >>>>>>>> the server returns. It is not a Tika issue as such as the selection of what
> >>>>>>>> parser to use is based on the mimetype that Nutch uses.
> >>>>>>>>
> >>>>>>>> The param above should be set to true by default. I thought we had more
> >>>>>>>> options but am probably confusing with the language identification
> >>>>>>>>
> >>>>>>>> Julien
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> DEBUG tika.TikaParser - Using Tika parser
> >>>>>>>>> org.apache.tika.parser.txt.TXTParser for mime-type text/plain
> >>>>>>>>>
> >>>>>>>>> The above indicates Tika is fired. But somehow I need to tell Tika to use
> >>>>>>>>> HtmlParser for mime-type text/plain. Have to dig into Tika docs.
> >>>>>>>>>
> >>>>>>>>> Is it possible to do anything in Nutch ?
> >>>>>>>>>
> >>>>>>>>> On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Some of my target webpages return a mime type of text/plain though they
> >>>>>>>>>> are htmls. I changed "http.accept" to include text/plain and configured
> >>>>>>>>>> both tika & parse-html to see if those can be parsed. However, both seem to
> >>>>>>>>>> produce no content.
> >>>>>>>>>>
> >>>>>>>>>> I changed parse-plugins.xml & the corresponding plugin.xml's to match this
> >>>>>>>>>> mime type.
> >>>>>>>>>>
> >>>>>>>>>> Has anyone encountered this problem ?
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Open Source Solutions for Text Engineering
> >>>>>>>>
> >>>>>>>> http://digitalpebble.blogspot.com/
> >>>>>>>> http://www.digitalpebble.com
> >>>>>>>> http://twitter.com/digitalpebble
> >>
> >> --
> >> Lewis
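
On the redirect question at the top of this message: a minimal sketch of the setting I have in mind, assuming the Nutch 1.x `http.redirect.max` property placed in conf/nutch-site.xml. The value 3 is only an illustrative choice, not a recommendation.

```xml
<!-- conf/nutch-site.xml: illustrative override; the value 3 is an assumption -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>Maximum number of redirects the fetcher follows immediately
  when fetching a page. With 0 or a negative value the redirect target is
  only recorded and fetched in a later round.</description>
</property>
```

Even with this set, the 303 quoted above points at a /glogin URL, so the article text may still not come back without the cookies that page expects.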

