Re: [xml] Extract title from html file

Liam R E Quin Thu, 09 Aug 2018 19:22:24 -0700

On Fri, 2018-08-10 at 02:46 +0100, James Read via xml wrote:
> I have a bunch of html files on disk and want to open them and
> extract the contents of the title tag using libxml2.


By this do you mean the title element in the head?

You can use XPath on an XML document to extract /html/head/title, but
you may need to use the HTML parser, as most HTML files are not
well-formed XML. You can experiment first with
xmllint --xpath /html/head/title foo.xml (adding --html if the file is
not well-formed XML) and see what happens.

If "a bunch" means tens of thousands of HTML files and you do this
often, consider a tree store such as dbxml or (much easier to get
started with, I think) BaseX, so that there's an element index (or
btree) and retrieval might be orders of magnitude faster.

Liam


-- 
Liam Quin, https://www.holoweb.net/liam/cv/
Web slave for vintage clipart http://www.fromoldbooks.org/
Available for XML/Document/Information Architecture/
XSL/XQuery/Web/Text Processing/A11Y work & consulting.

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
https://mail.gnome.org/mailman/listinfo/xml
