Hi Seb,
Really interesting, thanks for the response. My comments are inline below.

On 2021/05/05 11:42:04, Sebastian Nagel <wastl.na...@googlemail.com.INVALID> 
wrote: 
> 
> Yes, but not directly - it's a multi-step process. 

As I expected ;)

> 
> This Parquet index is optimized by sorting the rows by a special form of the 
> URL [1] which
> - drops the protocol or scheme
> - reverses the host name and
> - puts it in front of the remaining URL parts (path and query)
> - with some additional normalization of path and query (e.g. sorting of query
> params)
> 
> One example:
>    https://example.com/path/search?q=foo&l=en
>    com,example)/path/search?l=en&q=foo
> 
> The SURT URL is similar to the URL format used by Nutch2
>    com.example/https/path/search?q=foo&l=en
> to address rows in the WebPage table [2]. This format is inspired by the
> BigTable paper [3]. The point is that rows for the same host and domain end
> up next to each other, cf. [4].

OK, I recognize this data model. Seems logical. 
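
Just to check my understanding, here is a minimal sketch of that key
transformation (Scala; the method name is mine and it only covers the steps you
listed - real SURT canonicalization does more, e.g. lowercasing and dropping
default ports):

  import java.net.URL

  // Build a SURT-style sort key: reversed host, then path, then sorted query.
  // Minimal sketch for illustration only, not the actual Nutch/CC code.
  def surtKey(url: String): String = {
    val u = new URL(url)
    val reversedHost = u.getHost.split('.').reverse.mkString(",")
    val sortedQuery = Option(u.getQuery)
      .map(q => "?" + q.split('&').sorted.mkString("&"))
      .getOrElse("")
    reversedHost + ")" + u.getPath + sortedQuery
  }

  // surtKey("https://example.com/path/search?q=foo&l=en")
  //   => "com,example)/path/search?l=en&q=foo"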

> Ok, back to the question: both 1) and 2) are trivial if you do not care about
> writing optimal Parquet files: just define a schema following the methods
> implementing the Writable interface. Parquet is easier to feed into various
> data processing systems because it integrates the schema. The Sequence file
> format requires that the Writable formats are provided - although Spark and
> other big data tools support Sequence files, this requirement is sometimes a
> blocker, also because Nutch does not ship a small "nutch-formats" jar.

In my case, the purpose of writing Nutch (Hadoop sequence file) data to Parquet
was to facilitate improved analytics within the Databricks platform, which we
are currently evaluating.
I'm hesitant to reuse the word 'optimal' because I have not yet benchmarked any
retrievals, but I hope to start optimizing the way Nutch data is written so
that it can be analyzed with relative ease within, for example, Databricks.
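
To make that concrete, my rough plan looks something like the Spark sketch
below (just an assumption of what a first version could be; the input path, the
selected CrawlDatum fields and the output location are illustrative, and it
assumes the Nutch job jar is on the classpath so the Writables deserialize):

  import org.apache.hadoop.io.Text
  import org.apache.nutch.crawl.CrawlDatum
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("crawldb-to-parquet").getOrCreate()
  import spark.implicits._

  // Read the CrawlDb (sequence data of Text -> CrawlDatum) and flatten each
  // record to plain column values before turning it into a DataFrame.
  val crawldb = spark.sparkContext
    .sequenceFile("crawldb/current/part-r-00000/data",
                  classOf[Text], classOf[CrawlDatum])
    .map { case (url, datum) =>
      (url.toString, datum.getStatus.toInt, datum.getFetchTime, datum.getScore)
    }
    .toDF("url", "status", "fetch_time", "score")

  // Write once, read many times from Databricks.
  crawldb.write.parquet("crawldb-parquet")

Whether that counts as 'optimal' probably depends on how the output is sorted
and partitioned (e.g. by the reversed-host key you describe) - that is exactly
what I would like to benchmark.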

> 
> Nevertheless, the price for Parquet is slower writing - which is ok for 
> write-once-read-many
> use cases. 

Yes, this is our use case.

> But the typical use case for Nutch is "write-once-read-twice":
> - segment: read for CrawlDb update and indexing
> - CrawlDb: read during update then replace, in some cycles read for 
> deduplication, statistics, etc.

So sequence files are optimal for use within the Nutch system, but for
additional analytics (on outside platforms such as Databricks) I suspect that
Parquet would be preferred.

Maybe we can share more ideas. I wonder if a utility tool to write segments as 
Parquet data would be useful?
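
As a starting point for such a tool, something along these lines, perhaps
(again only a sketch; the segment path and the columns are illustrative
assumptions):

  import org.apache.hadoop.io.Text
  import org.apache.nutch.protocol.Content
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("segment-to-parquet").getOrCreate()
  import spark.implicits._

  // Read a segment's content directory (Text -> Content) and keep the raw
  // payload plus a couple of columns that are handy for analytics.
  spark.sparkContext
    .sequenceFile("segments/20210505123456/content/part-r-00000/data",
                  classOf[Text], classOf[Content])
    .map { case (url, content) =>
      (url.toString, content.getContentType, content.getContent)
    }
    .toDF("url", "content_type", "raw_content")
    .write.parquet("segment-content-parquet")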

Thanks Seb
