Can you briefly describe the problem here, Sourajit?

On Thu, Jan 31, 2013 at 9:01 AM, Sourajit Basak
<[email protected]> wrote:
> Seems to be related to NUTCH-374 but that shows as fixed.
>
> I have set Nutch to accept unlimited content size & this page is gzip
> encoded.
>
>
>
> On Thu, Jan 31, 2013 at 9:38 PM, Sourajit Basak 
> <[email protected]> wrote:
>
>> Re-opening this thread.
>>
>> Using Nutch v1.5, try to get the parseText from this NYTimes URL (use
>> parse-html):
>>
>> http://www.nytimes.com/2013/01/31/technology/chinese-hackers-infiltrate-new-york-times-computers.html?pagewanted=2&_r=0&ref=global-home
>>
>> I do not get any content from the fetcher. These are my fetcher accept
>> params:
>>   <property>
>>     <name>http.accept</name>
>>     <value>text/plain,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
>>   </property>
>>
>>
>>
>>
>> On Mon, Nov 26, 2012 at 11:03 PM, Sourajit Basak <[email protected]>
>> wrote:
>>
>>> Ignore my last post. Tika isn't slowing down, neither is this property.
>>>
>>>
>>> On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak <
>>> [email protected]> wrote:
>>>
>>>> Enabling this property slows down the parse phase drastically when it
>>>> encounters content with mime-type image/jpeg.
>>>>
>>>>
>>>> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <
>>>> [email protected]> wrote:
>>>>
>>>>> Thanks Julien.
>>>>>
>>>>> I can get the outlinks now, let me check if I can get the raw content.
>>>>> I will update this thread.
>>>>>
>>>>>
>>>>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> The parameter
>>>>>>
>>>>>> <property>
>>>>>>   <name>mime.type.magic</name>
>>>>>>   <value>true</value>
>>>>>>   <description>Defines if the mime content type detector uses magic resolution.
>>>>>>   </description>
>>>>>> </property>
>>>>>>
>>>>>> should trigger the mime type detection based on the content and not on
>>>>>> what the server returns. It is not a Tika issue as such, as the selection
>>>>>> of which parser to use is based on the mimetype that Nutch uses.
>>>>>>
>>>>>> The param above should be set to true by default. I thought we had more
>>>>>> options but I am probably confusing it with the language identification.
>>>>>>
>>>>>> Julien
>>>>>>
>>>>>>
>>>>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> > DEBUG tika.TikaParser - Using Tika parser
>>>>>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>>>>>> >
>>>>>> > The above indicates Tika is fired. But somehow I need to tell Tika to use
>>>>>> > HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>>>>>> >
>>>>>> > Is it possible to do anything in Nutch ?
>>>>>> >
>>>>>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > Some of my target webpages return a mime type of text/plain though they
>>>>>> > > are HTML. I changed "http.accept" to include text/plain and configured
>>>>>> > > both tika & parse-html to see if those can be parsed. However, both seem
>>>>>> > > to produce no content.
>>>>>> > >
>>>>>> > > I changed parse-plugins.xml & the corresponding plugin.xml's to match
>>>>>> > > this mime type.
>>>>>> > >
>>>>>> > > Has anyone encountered this problem ?
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Open Source Solutions for Text Engineering
>>>>>>
>>>>>> http://digitalpebble.blogspot.com/
>>>>>> http://www.digitalpebble.com
>>>>>> http://twitter.com/digitalpebble
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
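
For anyone following the thread: the idea behind magic-based detection, as Julien describes it, is to look at the first bytes of the fetched content rather than trust the Content-Type the server sends. Below is a rough Python sketch of that idea only. It is not Nutch or Tika code; `sniff_mime` and `HTML_HINT` are hypothetical names made up for illustration, and the byte patterns are a small hand-picked subset, nowhere near what a real detector checks.

```python
import re

# Hypothetical illustration only: this is NOT how Nutch or Tika implement
# detection. It just shows "decide from the bytes, not from the header".
HTML_HINT = re.compile(
    rb"<!doctype\s+html|<html[\s>]|<head[\s>]|<body[\s>]", re.IGNORECASE
)


def sniff_mime(content: bytes, declared: str = "text/plain") -> str:
    """Guess a MIME type from leading bytes, falling back to the declared one."""
    head = content[:1024]  # only inspect the start of the body
    if head.startswith(b"\xff\xd8\xff"):  # JPEG magic number
        return "image/jpeg"
    if head.startswith(b"\x1f\x8b"):  # gzip magic number
        # Assumption: seeing this in a fetched page body would suggest the
        # body is still compressed when it reaches the parser.
        return "application/gzip"
    if HTML_HINT.search(head):  # markup that looks like HTML
        return "text/html"
    return declared  # nothing recognised: keep what the server said


if __name__ == "__main__":
    page = b"<html><head><title>t</title></head><body>hi</body></html>"
    print(sniff_mime(page, declared="text/plain"))
```

With a check along these lines, a page served as text/plain but containing HTML markup would be routed to an HTML parser instead of a plain-text one, which is the behaviour the mime.type.magic property is meant to enable.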



-- 
Lewis
