Re: Extracting title from XHTML pages
Michael Wechner wrote:
> Sami Siren wrote:
>> Michael Wechner wrote:
>>> Hi
>>>
>>> It seems to me that Nutch 0.8.x cannot extract the title from an
>>> XHTML page, e.g.
>>
>> Try changing the following in your parse-plugins.xml
>>
>>   <mimeType name="application/xhtml+xml">
>>     <plugin id="parse-html" />
>>   </mimeType>
>>
>> This was changed in trunk and it _should_ fix that problem.
>
> thanks :-) this seems to work. Shall I send a patch for nutch-0.8.x?
> Or is nutch 0.8.x unmaintained?

I have added a patch

https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202

Thanks

Michi

> Cheers
>
> Michi
>
>> --
>> Sami Siren

--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com
http://lenya.apache.org
+41 44 272 91 61
Re: Extracting title from XHTML pages
Michael Wechner wrote:
> I have added a patch
>
> https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202

sorry, I actually meant

https://issues.apache.org/jira/browse/NUTCH-418

Cheers

Michi
Re: Extracting title from XHTML pages
Michael Wechner wrote:
> Hi
>
> It seems to me that Nutch 0.8.x cannot extract the title from an
> XHTML page, e.g.

Try changing the following in your parse-plugins.xml

  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

This was changed in trunk and it _should_ fix that problem.

--
Sami Siren
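
For anyone who wants to script the mapping change suggested in this thread instead of editing the file by hand, here is a minimal sketch in Python. It is an illustration only, not part of Nutch: the SAMPLE document is a made-up, trimmed-down stand-in for parse-plugins.xml (the "parse-text" entry is a placeholder, not necessarily what 0.8.x ships), and the helper names map_mime_to_plugin and plugins_for are hypothetical. It assumes the mimeType elements sit directly under the root element and that no XML namespace is in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for parse-plugins.xml. The initial
# "parse-text" mapping is a placeholder chosen for illustration only.
SAMPLE = """<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-text" />
  </mimeType>
</parse-plugins>"""


def map_mime_to_plugin(xml_text, mime_type, plugin_id):
    """Return xml_text with mime_type mapped to plugin_id only."""
    root = ET.fromstring(xml_text)

    # Find the existing <mimeType name="..."> entry, if there is one.
    entry = None
    for candidate in root.findall("mimeType"):
        if candidate.get("name") == mime_type:
            entry = candidate
            break

    # Create the entry when the MIME type is not listed yet.
    if entry is None:
        entry = ET.SubElement(root, "mimeType", {"name": mime_type})

    # Drop the plugins currently listed and keep only the requested one.
    for old in list(entry.findall("plugin")):
        entry.remove(old)
    ET.SubElement(entry, "plugin", {"id": plugin_id})

    return ET.tostring(root, encoding="unicode")


def plugins_for(xml_text, mime_type):
    """List the plugin ids configured for mime_type, in document order."""
    root = ET.fromstring(xml_text)
    return [
        plugin.get("id")
        for entry in root.findall("mimeType")
        if entry.get("name") == mime_type
        for plugin in entry.findall("plugin")
    ]


updated = map_mime_to_plugin(SAMPLE, "application/xhtml+xml", "parse-html")
print(plugins_for(updated, "application/xhtml+xml"))
```

Note that ElementTree does not preserve comments or the original indentation, so for a real configuration file a hand edit as described above may be the simpler route.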