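Hi,

the given URL answers with a redirect (HTTP 303, at least when I try it) and
no content, only the HTTP headers. The redirect target is a login page
(/glogin), so the fetcher has no body to return for the original URL.
I checked with curl and with Nutch's parsechecker tool: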

% bin/nutch parsechecker "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home"
fetching: http://www.nytimes.com/...
...
Content Metadata: Vary=Host Date=Sat, 02 Feb 2013 15:01:18 GMT Content-Length=0
Location=http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=548bb88dQ2FRtezRXVzRQ3DQ3DQ3DRfzDQ2AR(tgrHttzEREQ26Q271RQ26Q27R1Q27Rz5gfXtQ2At.VRgf_X5r5hfQ51gG5Hrh_X!_Q2AzHQ51z5hX5Q3DhVtHGhz_D5rhgtDeQ2Bz5HrQ5EfzDQ2A
Set-Cookie=RMID=007f0100777d510d2a3e0045; Expires=Sun, 02 Feb 2014 15:01:18 GMT; Path=/; Domain=.nytimes.com;
Content-Type=text/plain Connection=close Server=Apache
Parse Metadata: Content-Encoding=UTF-8 Content-Type=text/plain; charset=UTF-8

% curl -v "http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home" >/dev/null
...
> GET /2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home HTTP/1.1
...
>
< HTTP/1.1 303 See Other
< Date: Sat, 02 Feb 2013 14:59:03 GMT
< Server: Apache
< Set-Cookie: RMID=007f01000e9f510d29b70033; Expires=Sun, 02 Feb 2014 14:59:03 GMT; Path=/; Domain=.nytimes.com;
< Vary: Host
< Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html&OQ=pagewantedQ3D2Q26_rQ3D1Q26refQ3Dglobal-homeQ26&OP=f39d9b3aQ2FQ2AmQ51dQ2AKSdQ2A(((Q2AQ7Ddg_Q2ANm46JmmdUQ2AUCVMQ2ACVQ2AMVQ2AdQ274Q7DKm_mrSQ2A4Q7DtKQ276Q27!Q7DQ7E42Q27J6!tKyt_dJQ7EdQ27!KQ27(!SmJ2!dtgQ276!4mgQ51ndQ27J6GQ7Ddg_
< Content-Length: 0
< Connection: close
< Content-Type: text/plain
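
If you want to repeat this check for other URLs without reading the verbose
curl output, something like the following might do. It is only a rough sketch:
the function names classify() and probe() are made up for this mail, and it
relies on Python's http.client not following redirects on its own, so the
first answer of the server is what gets reported.

```python
import http.client
from urllib.parse import urlsplit

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify(status, headers):
    """Describe a response from its status code and a dict of headers."""
    lowered = {k.lower(): v for k, v in headers.items()}
    location = lowered.get("location")
    length = lowered.get("content-length")
    if status in REDIRECT_CODES and location:
        # Content-Length: 0 means there is nothing for a parser to work on.
        empty = length is not None and length.strip() == "0"
        kind = "redirect without body" if empty else "redirect"
        return f"{status} {kind} -> {location}"
    return f"{status} no redirect"


def probe(url):
    """Send one GET request, do not follow redirects, classify the answer."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return classify(resp.status, dict(resp.getheaders()))
    finally:
        conn.close()
```

Calling print(probe(url)) with the URL from this thread should report the 303
and its Location as long as the server still behaves as shown above; what it
returns today may of course differ.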

Sebastian


On 02/01/2013 05:47 AM, Sourajit Basak wrote:
> Here it goes.
> 
> Try to dump the content from this url with the following settings.
> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
> 
>   <property>
>     <name>http.content.limit</name>
>     <value>-1</value>
>   </property>
> 
> This page is gzip encoded. You will see that the fetcher is unable to
> download any content. Check by inspecting the content-length.
> Initially I was thinking it to be a problem with the parse-html plugin but
> now it seems that the fetcher returns null content.
> 
> This seemed related to NUTCH-374
> 
> Let me know if you need further info.
> 
> On Fri, Feb 1, 2013 at 1:54 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
> 
>> Can you briefly describe the problem here Sourajit?
>>
>> On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak
>> <[email protected]> wrote:
>>> Seems to be related to NUTCH-374 but that shows as fixed.
>>>
>>> I have set Nutch to accept unlimited content size & this page is gzip
>>> encoded.
>>>
>>>
>>>
>>> On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak <[email protected]> wrote:
>>>
>>>> Re-opening this thread.
>>>>
>>>> Using Nutch v1.5 try to get the parseText from this NYTimes url (Use
>>>> parse-html)
>>>>
>>>>
>>>> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
>>>>
>>>> I do not get any content from the fetcher. This is my fetcher accept
>>>> params.
>>>>   <property>
>>>>     <name>http.accept</name>
>>>>     <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>>>   </property>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]> wrote:
>>>>
>>>>> Ignore my last post. Tika isn't slowing down, neither is this property.
>>>>>
>>>>>
>>>>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Enabling this property slows down the parse phase drastically when
>>>>>> encountered with mime-type image/jpeg.
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks Julien.
>>>>>>>
>>>>>>> I can get the outlinks now, let me check if I can get the raw content.
>>>>>>> I will update this thread.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> The parameter
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>mime.type.magic</name>
>>>>>>>>   <value>true</value>
>>>>>>>>   <description>Defines if the mime content type detector uses magic
>>>>>>>> resolution.
>>>>>>>>   </description>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> should trigger the mime type detection based on the content and not on
>>>>>>>> what the server returns. It is not a Tika issue as such as the selection of
>>>>>>>> what parser to use is based on the mimetype that Nutch uses.
>>>>>>>>
>>>>>>>> The param above should be set to true by default. I thought we had more
>>>>>>>> options but am probably confusing with the language identification
>>>>>>>>
>>>>>>>> Julien
>>>>>>>>
>>>>>>>>
>>>>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> DEBUG tika.TikaParser - Using Tika parser
>>>>>>>>> org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>>>>>>>>>
>>>>>>>>> The above indicates Tika is fired. But somehow I need to tell Tika to use
>>>>>>>>> HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>>>>>>>>>
>>>>>>>>> Is it possible to do anything in Nutch ?
>>>>>>>>>
>>>>>>>>> On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Some of my target webpages return a mime type of text/plain though they
>>>>>>>>>> are htmls. I changed "http.accept" to include text/plain and configured
>>>>>>>>>> both tika & parse-html to see if those can be parsed. However, both seem to
>>>>>>>>>> produce no content.
>>>>>>>>>>
>>>>>>>>>> I changed parse-plugins.xml & the corresponding plugin.xml's to match this
>>>>>>>>>> mime type.
>>>>>>>>>>
>>>>>>>>>> Has anyone encountered this problem ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *
>>>>>>>> *Open Source Solutions for Text Engineering
>>>>>>>>
>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>> http://www.digitalpebble.com
>>>>>>>> http://twitter.com/digitalpebble
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>>
>> --
>> Lewis
>>
> 
