From the limited HTML that I've seen, I can only assume that the offending XHTML is in the content field. If this is the case, then you will need to write a custom plugin implementation that removes it. There is plenty of information on our wiki to get you up to speed with plugins. [0] Once you have something that requires help, get on to the list and let us know.

Lewis

[0] http://wiki.apache.org/nutch/PluginCentral

On Sat, Apr 7, 2012 at 2:33 PM, alessio crisantemi <[email protected]> wrote:
> Maybe it's caused by my schema?
> I chose to index only title, author and content.
>
> Can you help me set up a parse filter?
> Thank you
> alessio

