Enabling this property (mime.type.magic) drastically slows down the parse
phase when it encounters content with the mime type image/jpeg.
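
For anyone following along, here is a rough sketch of what "magic" detection means in this context: look at the leading bytes of the fetched content and only fall back to the server-supplied type when nothing matches. This is not Tika's or Nutch's actual implementation; the class name, the method name, the 512-byte window, and the two signatures checked are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch or Tika.
final class MagicSniffSketch {

    // Returns the mime type to use for parser selection: a type inferred
    // from the leading bytes if one is recognised, else the server's type.
    static String detect(String serverType, byte[] content) {
        // Only inspect a small prefix (512 bytes is an arbitrary choice).
        int n = Math.min(content.length, 512);
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);

        // Very crude HTML check: document starts with a doctype or <html>.
        if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
            return "text/html";
        }

        // JPEG data starts with the bytes FF D8 FF.
        if (content.length >= 3
                && (content[0] & 0xFF) == 0xFF
                && (content[1] & 0xFF) == 0xD8
                && (content[2] & 0xFF) == 0xFF) {
            return "image/jpeg";
        }

        // Nothing recognised: trust what the server returned.
        return serverType;
    }

    public static void main(String[] args) {
        byte[] page = "<html><body>hello</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // The server claims text/plain, but the content looks like HTML.
        System.out.println(detect("text/plain", page)); // prints "text/html"
    }
}
```

With detection along these lines, a page served as text/plain but containing HTML would be routed to an HTML parser, which is the behaviour discussed in the quoted messages below.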


On Mon, Nov 26, 2012 at 8:07 PM, Sourajit Basak <[email protected]> wrote:

> Thanks Julien.
>
> I can get the outlinks now, let me check if I can get the raw content. I
> will update this thread.
>
>
> On Mon, Nov 26, 2012 at 2:37 PM, Julien Nioche <
> [email protected]> wrote:
>
>> The parameter
>>
>> <property>
>>   <name>mime.type.magic</name>
>>   <value>true</value>
>>   <description>Defines if the mime content type detector uses magic
>>   resolution.
>>   </description>
>> </property>
>>
>> should trigger mime type detection based on the content and not on what
>> the server returns. It is not a Tika issue as such, as the selection of
>> which parser to use is based on the mimetype that Nutch uses.
>>
>> The param above should be set to true by default. I thought we had more
>> options, but I am probably confusing this with the language identification.
>>
>> Julien
>>
>>
>> On 25 November 2012 14:16, Sourajit Basak <[email protected]>
>> wrote:
>>
>> > DEBUG tika.TikaParser - Using Tika parser
>> > org.apache.tika.parser.txt.TXTParser for mime-type text/plain
>> >
>> > The above indicates Tika is fired. But somehow I need to tell Tika to
>> > use HtmlParser for mime-type text/plain. Have to dig into Tika docs.
>> >
>> > Is it possible to do anything in Nutch?
>> >
>> > On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak <[email protected]> wrote:
>> >
>> > > Some of my target webpages return a mime type of text/plain though
>> > > they are HTML. I changed "http.accept" to include text/plain and
>> > > configured both Tika & parse-html to see if those can be parsed.
>> > > However, both seem to produce no content.
>> > >
>> > > I changed parse-plugins.xml & the corresponding plugin.xml's to match
>> > > this mime type.
>> > >
>> > > Has anyone encountered this problem?
>> > >
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
