Re: [xml] Extract title from html file

2018-08-14 Thread Eric S Eberhard
If all you need is the <title> tag then I'd just get the file size (or
enough bytes to make sure the title is read) and then calloc that memory,
read it all in as a single string, use strstr() to find <title> and </title>,
and take out what is between the pointers returned by each strstr().

For something that simple this is much easier, you don't need to link
libxml2, etc.

Yes, I am the contrary one that is always looking for the quick and easy way
-- the above will require no changes to the link Makefile and 10 lines of
code in your C program.
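
A minimal sketch of the strstr() approach described above, not a definitive implementation. It assumes the whole file fits in memory, that the tags appear exactly as lowercase <title> and </title> with no attributes, and that the file contains no embedded NUL bytes. The helper names extract_title and read_file are hypothetical and do not come from the original message.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of the text between "<title>" and
 * "</title>", or NULL if either tag is missing or allocation fails.
 * Assumption: exact, lowercase tags without attributes.
 * The caller frees the result with free(). */
char *extract_title(const char *html)
{
    static const char open_tag[] = "<title>";
    const char *start = strstr(html, open_tag);
    if (start == NULL)
        return NULL;
    start += sizeof open_tag - 1;          /* skip past "<title>" */

    const char *end = strstr(start, "</title>");
    if (end == NULL)
        return NULL;

    size_t len = (size_t)(end - start);
    char *title = calloc(len + 1, 1);      /* zero-filled, so NUL-terminated */
    if (title != NULL)
        memcpy(title, start, len);
    return title;
}

/* Read a whole file into a NUL-terminated buffer sized from the file
 * length. Returns NULL on any error. The caller frees the result. */
char *read_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;
    if (fseek(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return NULL;
    }
    long size = ftell(fp);
    if (size < 0) {
        fclose(fp);
        return NULL;
    }
    rewind(fp);

    char *buf = calloc((size_t)size + 1, 1);
    if (buf != NULL) {
        size_t n = fread(buf, 1, (size_t)size, fp);
        buf[n] = '\0';
    }
    fclose(fp);
    return buf;
}
```

A caller would pass the result of read_file() to extract_title() and free both buffers afterwards. Because the search is case-sensitive, a page that writes <TITLE> or <title lang="en"> would not match under these assumptions.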

E


Eric S Eberhard
VICS (Vertical Integrated Computer Systems)
Voice: 928 567 3529
Cell: 928 301 7537  (not reliable except for text or if not home)
2933 W Middle Verde Rd
Camp Verde, AZ  86322

-----Original Message-----
From: xml [mailto:xml-boun...@gnome.org] On Behalf Of Liam R E Quin
Sent: Thursday, August 09, 2018 7:23 PM
To: James Read ; xml@gnome.org
Subject: Re: [xml] Extract title from html file

On Fri, 2018-08-10 at 02:46 +0100, James Read via xml wrote:
> I have a bunch of html files on disk and want to open them and extract 
> the contents of the title tag using libxml2.

By this do you mean the title element in the head?

You can use XPath on an XML document to extract /html/head/title but you may
need to use the HTML reader, as most HTML files are not well-formed XML
syntactically. You can experiment first with xmllint --xpath
/html/head/title foo.xml and see what happens.

If "a bunch" means tens of thousands of HTML files and you do this often,
consider a tree store such as dbxml or (much easier to get started with, I
think) BaseX, so that there's an element index (or btree) and retrieval
might be orders of magnitude faster.

Liam


--
Liam Quin, https://www.holoweb.net/liam/cv/ Web slave for vintage clipart
http://www.fromoldbooks.org/ Available for XML/Document/Information
Architecture/ XSL/XQuery/Web/Text Processing/A11Y work & consulting.

___
xml mailing list, project page  http://xmlsoft.org/ xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml




Re: [xml] Extract title from html file

2018-08-09 Thread Liam R E Quin
On Fri, 2018-08-10 at 02:46 +0100, James Read via xml wrote:
> I have a bunch of html files on disk and want to open them and
> extract the contents of the title tag using libxml2. 

By this do you mean the title element in the head?

You can use XPath on an XML document to extract /html/head/title but
you may need to use the HTML reader, as most HTML files are not
well-formed XML syntactically. You can experiment first with xmllint --xpath
/html/head/title foo.xml and see what happens.

If "a bunch" means tens of thousands of HTML files and you do this
often, consider a tree store such as dbxml or (much easier to get
started with, I think) BaseX, so that there's an element index (or
btree) and retrieval might be orders of magnitude faster.

Liam


-- 
Liam Quin, https://www.holoweb.net/liam/cv/
Web slave for vintage clipart http://www.fromoldbooks.org/
Available for XML/Document/Information Architecture/
XSL/XQuery/Web/Text Processing/A11Y work & consulting.
