Thank you again, Lewis,
but do you think the strange content field is caused by my own setup?
I ask because I disabled indexing for almost every field.
This is my schema:
<fields>
<field name="id" type="string" stored="true" indexed="true"/>
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<!-- fields for index-basic plugin -->
<field name="host" type="url" stored="false" indexed="false"/>
<field name="site" type="string" stored="true" indexed="false"/>
<field name="url" type="url" stored="true" indexed="false"
required="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="title" type="text" stored="true" indexed="false"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="false"
multiValued="true"/>
<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="false"
multiValued="true"/>
<field name="contentLength" type="long" stored="true"
indexed="false"/>
<field name="lastModified" type="date" stored="false"
indexed="false"/>
<field name="date" type="date" stored="true" indexed="false"/>
<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="false"/>
<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true"
indexed="false" multiValued="true"/>
<!-- fields for feed plugin (tag is also used by
microformats-reltag)-->
<field name="author" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true"
multiValued="false"/>
<field name="feed" type="string" stored="true" indexed="false"/>
<field name="publishedDate" type="date" stored="true"
indexed="false"/>
<field name="updatedDate" type="date" stored="true"
indexed="false"/>
<!-- fields for creativecommons plugin -->
<field name="cc" type="string" stored="true" indexed="true"
multiValued="true"/>
</fields>
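
To double-check which fields are really indexed, something like the following could be used. It is only a minimal sketch: the SCHEMA string is a shortened excerpt of the block above, and the helper name indexed_fields is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the <fields> block above (assumption: the real
# schema.xml would be read from disk instead of being pasted inline).
SCHEMA = """<fields>
<field name="id" type="string" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="title" type="text" stored="true" indexed="false"/>
<field name="author" type="string" stored="true" indexed="true"/>
</fields>"""


def indexed_fields(schema_xml):
    """Return the names of all <field> elements with indexed="true"."""
    root = ET.fromstring(schema_xml)
    return [f.get("name") for f in root.iter("field")
            if f.get("indexed") == "true"]


print(indexed_fields(SCHEMA))
```

With this excerpt, title does not appear in the result, because it is declared with indexed="false".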
What do you think?
Alessio
On 7 April 2012 at 21:57, Lewis John Mcgibbney <
[email protected]> wrote:
> From the limited HTML that I've seen I can only assume that the offending
> xhtml is in the content field.
>
> If this is the case then you will need to write a custom plugin
> implementation that removes this. There is loads of info allowing you to
> get up to speed with plugins on our wiki.[0]
>
> Once you have something that requires help get on to the list and let us
> know.
>
> Lewis
>
> [0] http://wiki.apache.org/nutch/PluginCentral
>
> On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi <
> [email protected]> wrote:
>
> > Maybe the cause is my schema?
> > I chose to index only title, author and content.
> >
> > Can you help me set up a parse filter?
> > Thank you,
> > Alessio
> >
> >
>