Re: Extracting title from XHTML pages

2006-12-21 Thread Michael Wechner

Michael Wechner wrote:

> Sami Siren wrote:
>
>> Michael Wechner wrote:
>>
>>> Hi
>>>
>>> It seems to me that Nutch 0.8.x cannot extract the title from an XHTML
>>> page, e.g.
>>
>> Try changing the following in your parse-plugins.xml
>>
>> <mimeType name="application/xhtml+xml">
>>   <plugin id="parse-html" />
>> </mimeType>
>>
>> This was changed in trunk and it _should_ fix that problem.
>
> thanks :-) this seems to work.
>
> Shall I send a patch for nutch-0.8.x? Or is nutch 0.8.x unmaintained?

I have added a patch

https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202

Thanks

Michi

> Cheers
>
> Michi
>
>> --
>> Sami Siren

--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com  http://lenya.apache.org
+41 44 272 91 61



Re: Extracting title from XHTML pages

2006-12-21 Thread Michael Wechner

Michael Wechner wrote:

> I have added a patch
>
> https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12359202

sorry, I actually meant

https://issues.apache.org/jira/browse/NUTCH-418

Cheers

Michi

> Thanks
>
> Michi
>
>> Cheers
>>
>> Michi
>>
>>> --
>>> Sami Siren




Re: Extracting title from XHTML pages

2006-12-20 Thread Sami Siren
Michael Wechner wrote:
> Hi
>
> It seems to me that Nutch 0.8.x cannot extract the title from an XHTML
> page, e.g.

Try changing the following in your parse-plugins.xml

<mimeType name="application/xhtml+xml">
  <plugin id="parse-html" />
</mimeType>

This was changed in trunk and it _should_ fix that problem.
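
For orientation, a rough sketch of where that mapping would sit in parse-plugins.xml. It assumes the stock file layout with a parse-plugins root element; the text/html entry and the comments are illustrative assumptions, not something quoted from this thread.

```xml
<parse-plugins>

  <!-- Existing mapping, shown only for context (assumed stock entry) -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Mapping suggested above: hand XHTML pages to the HTML parser -->
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Remaining mimeType entries and the aliases section stay unchanged -->

</parse-plugins>
```

If an application/xhtml+xml entry is already present, it is probably safer to edit it than to add a second one. The mapping can only take effect if parse-html is among the plugins enabled through plugin.includes, which is assumed here.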

--
 Sami Siren