Ignore my last post. Neither Tika nor this property is slowing things down.

On Mon, Nov 26, 2012 at 10:50 PM, Sourajit Basak
<[email protected]> wrote:

> Enabling this property drastically slows down the parse phase when it
> encounters content of mime-type image/jpeg.
>
>
> On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak 
> <[email protected]> wrote:
>
>> Thanks Julien.
>>
>> I can get the outlinks now, let me check if I can get the raw content. I
>> will update this thread.
>>
>>
>> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> The parameter
>>>
>>> <property>
>>>   <name>mime.type.magic</name>
>>>   <value>true</value>
>>>   <description>Defines if the mime content type detector uses magic
>>> resolution.
>>>   </description>
>>> </property>
>>>
>>> should trigger the mime type detection based on the content and not on what
>>> the server returns. It is not a Tika issue as such, since the selection of
>>> which parser to use is based on the mimetype that Nutch uses.
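
To make "detection based on the content" concrete, here is a minimal Python sketch. It is a toy sniffer under stated assumptions, not what Nutch or Tika actually run: the function name `sniff_mime_type`, the `use_magic` flag (standing in for mime.type.magic), and the marker list are all made up for illustration.

```python
# Toy content sniffer: illustrative only, NOT Nutch's or Tika's detector.
# All names below are hypothetical.

JPEG_MAGIC = b"\xff\xd8\xff"  # leading bytes of a JPEG file
HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body")


def sniff_mime_type(content: bytes, server_type: str, use_magic: bool = True) -> str:
    """Return the mime type that parser selection would be based on."""
    if not use_magic:
        # Trust whatever the server declared in Content-Type.
        return server_type
    if content.startswith(JPEG_MAGIC):
        return "image/jpeg"
    # Look only at the start of the document, ignoring leading whitespace and case.
    head = content[:1024].lstrip().lower()
    if head.startswith(HTML_MARKERS):
        return "text/html"
    # Nothing recognised: fall back to the server's declaration.
    return server_type


page = b"  <!DOCTYPE html><html><body>hello</body></html>"
print(sniff_mime_type(page, "text/plain"))                   # text/html
print(sniff_mime_type(page, "text/plain", use_magic=False))  # text/plain
```

With the flag on, a page served as text/plain but containing HTML is reported as text/html, so an HTML parser would be chosen for it; with the flag off, the server's text/plain is taken at face value.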
>>>
>>> The param above should be set to true by default. I thought we had more
>>> options but I am probably confusing it with the language identification.
>>>
>>> Julien
>>>
>>>
>>> On 25 November 2012 14:16, Sourajit Basak <[email protected]>
>>> wrote:
>>>
>>> > DEBUG tika.TikaParser - Using Tika parser
>>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>>> >
>>> > The above indicates that Tika is being invoked. But somehow I need to
>>> > tell Tika to use HtmlParser for mime-type text/plain. I will have to
>>> > dig into the Tika docs.
>>> >
>>> > Is it possible to do anything about this in Nutch?
>>> >
>>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]>
>>> > wrote:
>>> >
>>> > > Some of my target webpages return a mime type of text/plain though they
>>> > > are HTML. I changed "http.accept" to include text/plain and configured
>>> > > both tika & parse-html to see if those pages can be parsed. However,
>>> > > both seem to produce no content.
>>> > >
>>> > > I changed parse-plugins.xml & the corresponding plugin.xml files to
>>> > > match this mime type.
>>> > >
>>> > > Has anyone encountered this problem?
>>> > >
>>> > >
>>> > >
>>> >
>>>
>>>
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>
>>
>
