date:20180809

Re: [xml] Extract title from html file

2018-08-09 Thread Liam R E Quin

On Fri, 2018-08-10 at 02:46 +0100, James Read via xml wrote:
> I have a bunch of html files on disk and want to open them and
> extract the contents of the title tag using libxml2. 

By this do you mean the title element in the head?

You can use XPath to extract /html/head/title, but you will probably
need libxml2's HTML parser rather than the XML one, as most HTML files
are not well-formed XML. You can experiment first with
xmllint --html --xpath /html/head/title foo.html and see what happens.

If "a bunch" means tens of thousands of HTML files and you do this
often, consider a tree store such as dbxml or (much easier to get
started with, I think) BaseX, so that there's an element index (or
btree) and retrieval might be orders of magnitude faster.

Liam

-- 
Liam Quin, https://www.holoweb.net/liam/cv/
Web slave for vintage clipart http://www.fromoldbooks.org/
Available for XML/Document/Information Architecture/
XSL/XQuery/Web/Text Processing/A11Y work & consulting.

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml

[xml] Extract title from html file

2018-08-09 Thread James Read via xml

I have a bunch of html files on disk and want to open them and extract the
contents of the title tag using libxml2. Any ideas how to do this? Which
functions to use?
