There is a WIP collection of HTML retrieval and processing UDFs available at:
- GitLab: https://gitlab.com/hrbrmstr/drill-html-tools
- GitHub: https://github.com/hrbrmstr/drill-html-tools

The UDFs use the lightweight jsoup Java library (https://github.com/jhy/jsoup/), which is included with the project configuration. There are, at present, four main functions:

- soup_read_html(): fetches the contents of a URL (i.e. makes an HTTP GET call, so the Drill nodes that use this need to be able to reach the intended resource). NOTE: older Java installations may not be able to reach some newfangled TLS sites without an update to their certificate stores and crypto libraries.

- soup_html_to_plaintext(): converts strings of HTML to plaintext.

- soup_select_text(): given strings of HTML and a valid CSS selector, returns a list with the text of each matching node (with or without child nodes included).

- soup_select_attr(): similar to its _text() counterpart; given strings of HTML, a valid CSS selector, and a DOM node attribute key (e.g. 'href' is a node attribute key of most '<a>' tags), returns a list with the value of the specified attribute for each matching node (if any).

No blog post is available yet, but the README has some examples. I'll refrain from a large text posting here, but this is the type of query you can perform with it (this is a mostly complete example):

SELECT
  a.url AS url,
  substr(a.doc, 1, 100) AS origDoc,
  substr(soup_html_to_plaintext(a.doc), 1, 100) AS docTxt,
  soup_select_text(a.doc, 'title')[0] AS title,
  soup_select_attr(a.doc, 'a', 'href') AS links,
  soup_select_attr(a.doc, 'img', 'src') AS imgSrc,
  soup_select_attr(a.doc, 'script', 'src') AS scriptSrc
FROM (
  SELECT url, soup_read_html(url, 5000) AS doc
  FROM (SELECT 'https://community.apache.org/' AS url FROM (VALUES((1))))
) a
WHERE doc IS NOT NULL

The output for that and another example is in the README.

One long-term goal is to facilitate the use of Apache Drill as an ersatz web-scraping platform (i.e. feed it a CSV or JSON or whatever of URLs and have Drill create a Parquet file of scraped content and any associated extractions). As such, additional UDFs are planned to facilitate retries, return HTTP response metadata, provide a means of access to HTTP headers, etc.

Tested under a very recent 1.14.0 snapshot.

Kick the tyres & file issues or PRs on either repository as needed.

-Bob
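P.S. Since the _select_ UDFs return lists, they should pair naturally with Drill's built-in FLATTEN() when you want one row per extracted value rather than a list column. A minimal sketch (the URL is just the same placeholder used above; this assumes the UDFs' list output behaves like any other Drill repeated type):

-- one row per <a href> value; FLATTEN() expands the returned list
SELECT a.url, FLATTEN(soup_select_attr(a.doc, 'a', 'href')) AS link
FROM (
  SELECT url, soup_read_html(url, 5000) AS doc
  FROM (SELECT 'https://community.apache.org/' AS url FROM (VALUES((1))))
) a
WHERE a.doc IS NOT NULL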
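Relatedly, here is a rough sketch of the longer-term CSV-to-Parquet workflow described above, assuming a headerless one-column CSV of URLs at a hypothetical path (/tmp/urls.csv) and a writable dfs.tmp workspace; plain CTAS, not a feature of these UDFs themselves:

-- CTAS writes Parquet when store.format is 'parquet' (the default)
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.scraped AS
SELECT
  b.url,
  soup_html_to_plaintext(b.doc) AS docTxt,
  soup_select_text(b.doc, 'title')[0] AS title,
  soup_select_attr(b.doc, 'a', 'href') AS links
FROM (
  -- columns[0] is Drill's positional column for headerless CSV files
  SELECT columns[0] AS url, soup_read_html(columns[0], 5000) AS doc
  FROM dfs.`/tmp/urls.csv`
) b
WHERE b.doc IS NOT NULL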
