There is a WIP collection of HTML retrieval and processing UDFs available at

- GitLab: https://gitlab.com/hrbrmstr/drill-html-tools
- GitHub: https://github.com/hrbrmstr/drill-html-tools

The UDFs use the lightweight jsoup Java library (https://github.com/jhy/jsoup/) 
included with the project configuration.

There are, at present, four main functions:

- soup_read_html() : fetches the contents of a URL (i.e. makes an HTTP GET
  call, so the Drill nodes that use this need to be able to reach the
  intended resource). NOTE: Older Java installations may not be able to
  reach some newfangled TLS sites without an update to their certificate
  stores and crypto libraries.

- soup_html_to_plaintext(): converts strings of HTML to plaintext

- soup_select_text(): given strings of HTML and a valid CSS selector, this
  UDF returns a list with the text of each matching node (with or without
  child nodes included)

- soup_select_attr(): similar to its _text() counterpart; given strings of
  HTML, a valid CSS selector and a DOM node attribute key (e.g. 'href' is a
  node attribute key of most '<a>' tags), this returns a list with the
  text of the specified key/node selection (if any). A minimal sketch of
  these UDFs against literal HTML follows this list.
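
Here's that promised minimal sketch of the processing UDFs against literal
HTML strings (no fetching involved); the constant HTML and the
FROM (VALUES((1))) row-generator scaffolding are purely illustrative:

    SELECT
      soup_html_to_plaintext('<p>Hello, <b>Drill</b>!</p>') AS txt,
      soup_select_text('<ul><li>a</li><li>b</li></ul>', 'li') AS items,
      soup_select_attr('<a href="https://drill.apache.org/">Drill</a>',
                       'a', 'href') AS hrefs
    FROM (VALUES((1)))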

No blog post is available yet, but the README has some examples. I'll refrain
from a large text posting here, but this is the type of query you can perform
with it (a mostly complete example):

    SELECT
      a.url AS url,
      substr(a.doc, 1, 100) AS origDoc,                        -- first 100 chars of raw HTML
      substr(soup_html_to_plaintext(a.doc), 1, 100) AS docTxt, -- first 100 chars as plaintext
      soup_select_text(a.doc, 'title')[0] AS title,            -- <title> text
      soup_select_attr(a.doc, 'a', 'href') AS links,           -- all link targets
      soup_select_attr(a.doc, 'img', 'src') AS imgSrc,         -- all image sources
      soup_select_attr(a.doc, 'script', 'src') AS scriptSrc    -- all script sources
    FROM (
        SELECT
           url,
           soup_read_html(url, 5000) doc  -- GET the page; 5000 is the timeout (ms)
        FROM
          (SELECT
               'https://community.apache.org/' AS url
             FROM (VALUES((1))))
    ) a
    WHERE doc IS NOT NULL  -- skip URLs that failed to fetch

The output for that and another example is in the README.
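
Since the _select_ UDFs return lists, they pair naturally with Drill's
FLATTEN to get one row per match. A hedged sketch building on the query
above (same illustrative scaffolding, not something from the repo itself):

    SELECT FLATTEN(soup_select_attr(a.doc, 'a', 'href')) AS link
    FROM (
        SELECT soup_read_html('https://community.apache.org/', 5000) AS doc
        FROM (VALUES((1)))
    ) a
    WHERE a.doc IS NOT NULL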

One long-term goal is to facilitate the use of Apache Drill as an ersatz
web-scraping platform (i.e. feed it a CSV or JSON (or whatever) file of URLs
and have Drill create a Parquet file of scraped content and any associated
extractions); a sketch of that flow follows. As such, additional UDFs are
planned to facilitate retries, return HTTP response metadata, provide access
to HTTP headers, etc.
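
That URLs-in, Parquet-out flow is already expressible with Drill's CTAS. A
hedged sketch, assuming a headerless urls.csv (one URL per line) readable
via the dfs.tmp workspace; the file path and table name are illustrative,
not part of the repo:

    ALTER SESSION SET `store.format` = 'parquet';

    CREATE TABLE dfs.tmp.`scraped` AS
    SELECT
      u.url,
      soup_select_text(u.doc, 'title')[0] AS title,
      soup_select_attr(u.doc, 'a', 'href') AS links
    FROM (
        SELECT
          columns[0] AS url,                    -- headerless CSV: columns[n] access
          soup_read_html(columns[0], 5000) doc  -- fetch each URL (5s timeout)
        FROM dfs.tmp.`urls.csv`
    ) u
    WHERE u.doc IS NOT NULL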

Tested under a very recent Apache Drill 1.14.0 snapshot.

Kick the tyres & file issues or PRs on either repository as needed.

-Bob
