Re: html fetcher/parser

2017-08-14 Thread Adam D. Ruppe via Digitalmars-d-learn

On Monday, 14 August 2017 at 23:15:13 UTC, Faux Amis wrote:
(Although following the spec would be the first step to a D 
html layout engine :D )


Oh, I've actually done some of that before too.
https://github.com/adamdruppe/arsd/blob/master/htmlwidget.d


It is pretty horrible... but it managed to render my old homepage, 
which used css float, boxes, and basic tables. I don't know if it 
still compiles; I haven't even tried it in years.


Re: html fetcher/parser

2017-08-14 Thread Faux Amis via Digitalmars-d-learn

On 2017-08-13 19:51, Adam D. Ruppe wrote:

On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
Just curious, but is there a spec of sorts which defines which errors 
should be fixed and such?


The HTML5 spec describes how you are supposed to parse various things, 
including the recovery paths for broken markup.


My module, however, isn't so formal. I just used it for a web scraping 
thing at work that hit a few hundred sites and fixed bugs as they came 
up to give good enough results for me (one thing I found is a lot of 
sites claiming to be UTF-8 are actually latin-1, so it validates and 
falls back to handle that. My http thing, while buggier, is similar - I 
hit a server once that ignored the accept gzip header and always sent it 
anyway, so I had to handle that... and I noticed curl actually didn't!)


So on the one hand, there's surely still bugs and weird cases, but on 
the other hand, it did get a fair chunk of real-world use so I am fairly 
confident it will be ok for most things.




Sounds good!
(Although following the spec would be the first step to a D html layout 
engine :D )


Re: html fetcher/parser

2017-08-13 Thread Adam D. Ruppe via Digitalmars-d-learn

On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
Just curious, but is there a spec of sorts which defines which 
errors should be fixed and such?


The HTML5 spec describes how you are supposed to parse various 
things, including the recovery paths for broken markup.


My module, however, isn't so formal. I just used it for a web 
scraping thing at work that hit a few hundred sites and fixed 
bugs as they came up to give good enough results for me (one 
thing I found is a lot of sites claiming to be UTF-8 are actually 
latin-1, so it validates and falls back to handle that. My http 
thing, while buggier, is similar - I hit a server once that 
ignored the accept gzip header and always sent it anyway, so I 
had to handle that... and I noticed curl actually didn't!)
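
To illustrate the "validates and falls back" idea, here is a minimal sketch using only Phobos. It is not the code from my modules; the `decodeBody` name is made up for this example, and it assumes the only fallback you care about is Latin-1.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Hypothetical helper: trust the body as UTF-8 only if it validates,
/// otherwise reinterpret every byte as a Latin-1 code point.
string decodeBody(const(ubyte)[] raw)
{
    auto candidate = cast(const(char)[]) raw;
    try
    {
        validate(candidate);          // throws UTFException on malformed UTF-8
        return candidate.idup;
    }
    catch (UTFException e)
    {
        string result;
        foreach (b; raw)
            result ~= cast(dchar) b;  // Latin-1 byte N is code point U+00NN
        return result;
    }
}

void main()
{
    // "café" as a Latin-1 server would send it, whatever its headers claim
    immutable(ubyte)[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    writeln(decodeBody(latin1));
}
```

The lone 0xE9 byte is not valid UTF-8, so validation fails and the bytes get re-read as Latin-1. A real scraper would probably want more fallbacks than this one.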


So on the one hand, there's surely still bugs and weird cases, 
but on the other hand, it did get a fair chunk of real-world use 
so I am fairly confident it will be ok for most things.




Re: html fetcher/parser

2017-08-13 Thread Faux Amis via Digitalmars-d-learn

On 2017-08-13 01:49, Soulsbane wrote:

On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
I would like to get into D again by making a small program which 
fetches a website every X-time and keeps track of all changes within 
specified dom elements.


fetching: should I go for std curl, vibe.d or something else?
parsing: I could only find these dub packages: htmld & libdominator.
And they don't seem overly active, any recommendations?

As I haven't been using D for some time I just don't want to get off 
with a bad start :)

thx


I've found the requests module nice to work with: 
http://code.dlang.org/packages/requests

Thanks, looks nice! I'll try it if Adam's modules fail me :)


Re: html fetcher/parser

2017-08-13 Thread Faux Amis via Digitalmars-d-learn

On 2017-08-12 22:22, Adam D. Ruppe wrote:

On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:

[...]


[...]
---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}

import std.stdio;
import arsd.dom;

void main() {
 auto document = Document.fromUrl("https://dlang.org/");
 writeln(document.optionSelector("p").innerText);
}
---

Nice!


[...]
Document.fromUrl uses the http lib to fetch it, then automatically parses 
the contents as a dom document. It will correct for common errors in 
webpage markup, character sets, etc.


Just curious, but is there a spec of sorts which defines which errors 
should be fixed and such?


[...] 
Bonus fact: 
http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html 
that function from the standard library makes doing a diff display of 
before and after pretty simple

Thanks for the pointer!


Re: html fetcher/parser

2017-08-12 Thread Soulsbane via Digitalmars-d-learn

On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
I would like to get into D again by making a small program 
which fetches a website every X-time and keeps track of all 
changes within specified dom elements.


fetching: should I go for std curl, vibe.d or something else?
parsing: I could only find these dub packages: htmld & 
libdominator.
And they don't seem overly active, any recommendations?

As I haven't been using D for some time I just don't want to 
get off with a bad start :)

thx


I've found the requests module nice to work with: 
http://code.dlang.org/packages/requests


Re: html fetcher/parser

2017-08-12 Thread Michael via Digitalmars-d-learn

On Saturday, 12 August 2017 at 20:22:44 UTC, Adam D. Ruppe wrote:

On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:

[...]


My dom.d and http2.d combine to make this easy:

https://github.com/adamdruppe/arsd/blob/master/dom.d
https://github.com/adamdruppe/arsd/blob/master/http2.d

[...]


Sometimes it feels like there's the standard D library, Phobos, 
and then for everything else you have already developed a 
suitable library to supplement it haha!


Re: html fetcher/parser

2017-08-12 Thread Adam D. Ruppe via Digitalmars-d-learn

On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
I would like to get into D again by making a small program 
which fetches a website every X-time and keeps track of all 
changes within specified dom elements.


My dom.d and http2.d combine to make this easy:

https://github.com/adamdruppe/arsd/blob/master/dom.d
https://github.com/adamdruppe/arsd/blob/master/http2.d

and support file for random encodings:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d


Or via dub:

http://code.dlang.org/packages/arsd-official

the dom and http subpackages are the ones you want.


Docs: http://dpldocs.info/arsd.dom


Sample program:

---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}

import std.stdio;
import arsd.dom;

void main() {
    auto document = Document.fromUrl("https://dlang.org/");
    writeln(document.optionSelector("p").innerText);
}
---

Output:

D is a general-purpose programming language with
static typing, systems-level access, and C-like syntax.
It combines efficiency, control and modeling power with safety
and programmer productivity.




Note that the https support requires OpenSSL available on your 
system. Works best on Linux with it installed as a devel lib (so 
like openssl-devel or whatever, just like you would if using it 
from C).




How it works:


Document.fromUrl uses the http lib to fetch it, then 
automatically parses the contents as a dom document. It will 
correct for common errors in webpage markup, character sets, etc.


Document and Element both have various methods for navigating, 
modifying, and accessing the DOM tree. Here, I used 
`optionSelector`, which works like `querySelector` in JavaScript 
(and the same syntax is used for CSS), returning the first 
matching element.


querySelector, however, returns null if there is nothing found. 
optionSelector returns a dummy object instead, so you don't have 
to explicitly test it for null and instead just access its 
methods.


`innerText` returns the text inside, stripped of markup. You 
might also want `innerHTML`, or `toString` to get the whole 
thing, markup and all.




There's a lot more you can do too, but I think just these few 
functions will be enough for your task.
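
For the "keep track of all changes" half of the task, a rough sketch of the bookkeeping could look like this. `ChangeTracker` is a made-up helper, not part of dom.d, and the hard-coded snapshots stand in for what you would get from `document.optionSelector(selector).innerText` on each fetch.

```d
import std.stdio : writeln;

/// Hypothetical helper: remembers the last text seen for each CSS selector.
struct ChangeTracker
{
    private string[string] lastSeen;

    /// Records `text` for `selector`. Returns true only if an earlier
    /// value was recorded for that selector and it differs from `text`.
    bool update(string selector, string text)
    {
        auto previous = selector in lastSeen;
        immutable changed = previous !is null && *previous != text;
        lastSeen[selector] = text;
        return changed;
    }
}

void main()
{
    ChangeTracker tracker;

    // Stand-ins for the text of the same element on three successive fetches.
    auto snapshots = ["Version 2.075", "Version 2.075", "Version 2.076"];

    foreach (text; snapshots)
        if (tracker.update("p", text))
            writeln("changed: ", text);
}
```

In a real program the loop would fetch the page again on each pass and sleep in between, for example with `Thread.sleep` from core.thread. The first observation of a selector is treated as "no change" here, which is a design choice you may want to flip.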



Bonus fact: 
http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html 
that function from the standard library makes doing a diff display 
of before and after pretty simple
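
A minimal sketch of such a diff display, assuming you compare the old and new text line by line. The `diffLines` name and the "+"/"-" prefixes are just choices made for this example.

```d
import std.algorithm.comparison : EditOp, levenshteinDistanceAndPath;
import std.stdio : writeln;

/// Hypothetical helper: turns the edit path between two line arrays into
/// diff-style output ("  " unchanged, "- " removed, "+ " added).
string[] diffLines(string[] before, string[] after)
{
    // result[0] is the edit distance, result[1] the list of edit operations
    auto result = levenshteinDistanceAndPath(before, after);

    string[] output;
    size_t i = 0, j = 0;   // current positions in `before` and `after`
    foreach (op; result[1])
    {
        final switch (op)
        {
            case EditOp.none:
                output ~= "  " ~ before[i];
                i++; j++;
                break;
            case EditOp.substitute:
                output ~= "- " ~ before[i];
                output ~= "+ " ~ after[j];
                i++; j++;
                break;
            case EditOp.insert:
                output ~= "+ " ~ after[j];
                j++;
                break;
            case EditOp.remove:
                output ~= "- " ~ before[i];
                i++;
                break;
        }
    }
    return output;
}

void main()
{
    auto before = ["a", "b", "c"];
    auto after  = ["a", "x", "c"];
    foreach (line; diffLines(before, after))
        writeln(line);
}
```

The function builds a full matrix of the two inputs, so it suits the text of a few elements better than whole large pages.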


html fetcher/parser

2017-08-12 Thread Faux Amis via Digitalmars-d-learn
I would like to get into D again by making a small program which fetches 
a website every X-time and keeps track of all changes within specified 
dom elements.


fetching: should I go for std curl, vibe.d or something else?
parsing: I could only find these dub packages: htmld & libdominator.
And they don't seem overly active, any recommendations?

As I haven't been using D for some time I just don't want to get off 
with a bad start :)

thx