Hi Seb,

Really interesting, thanks for the response. Replies inline below.

On 2021/05/05 11:42:04, Sebastian Nagel <wastl.na...@googlemail.com.INVALID> wrote:

> Yes, but not directly - it's a multi-step process.

As I expected ;)

> This Parquet index is optimized by sorting the rows by a special form of
> the URL [1] which
> - drops the protocol or scheme
> - reverses the host name and
> - puts it in front of the remaining URL parts (path and query)
> - with some additional normalization of path and query (e.g. sorting of
>   query params)
>
> One example:
>   https://example.com/path/search?q=foo&l=en
>   com,example)/path/search?l=en&q=foo
>
> The SURT URL is similar to the URL format used by Nutch2
>   com.example/https/path/search?q=foo&l=en
> to address rows in the WebPage table [2]. This format is inspired by the
> BigTable paper [3]. The point is that cf. [4].

OK, I recognize this data model. Seems logical.
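Just to check my understanding of the key format, here is a quick sketch of that normalization in plain Java. This is only my own reading of the steps you describe, not the actual implementation behind [1], and SurtSketch/toSurt are hypothetical names:

    import java.net.URI;
    import java.util.Arrays;
    import java.util.Collections;

    public class SurtSketch {

        // Build the SURT-style sort key described above: drop the scheme,
        // reverse the host labels, then append path and sorted query.
        static String toSurt(String url) throws Exception {
            URI u = new URI(url);
            // example.com -> com,example
            String[] labels = u.getHost().split("\\.");
            Collections.reverse(Arrays.asList(labels));
            String host = String.join(",", labels);
            // q=foo&l=en -> l=en&q=foo (lexicographic sort of params)
            String query = "";
            if (u.getQuery() != null) {
                String[] params = u.getQuery().split("&");
                Arrays.sort(params);
                query = "?" + String.join("&", params);
            }
            return host + ")" + u.getPath() + query;
        }

        public static void main(String[] args) throws Exception {
            // prints: com,example)/path/search?l=en&q=foo
            System.out.println(toSurt("https://example.com/path/search?q=foo&l=en"));
        }
    }

(I'm aware real SURT handling also covers ports, percent-encoding, session IDs etc. - this only reproduces your example.)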
> Ok, back to the question: both 1) and 2) are trivial if you do not care
> about writing optimal Parquet files: just define a schema following the
> methods implementing the Writable interface. Parquet is easier to feed
> into various data processing systems because it integrates the schema.
> The SequenceFile format requires that the Writable implementations are
> provided - although Spark and other big data tools support SequenceFiles,
> this requirement is sometimes a blocker, also because Nutch does not ship
> a small "nutch-formats" jar.

In my case, the purpose of writing Nutch (Hadoop SequenceFile) data to Parquet is to enable better analytics within the Databricks platform, which we are currently evaluating. I'm hesitant to re-use the word "optimal" because I have not yet benchmarked any retrievals, but I hope to start optimizing the way Nutch data is written so that it can be analyzed with relative ease in, for example, Databricks.

> Nevertheless, the price for Parquet is slower writing - which is ok for
> write-once-read-many use cases.

Yes, this is our use case.

> But the typical use case for Nutch is "write-once-read-twice":
> - segment: read for CrawlDb update and indexing
> - CrawlDb: read during update then replace, in some cycles read for
>   deduplication, statistics, etc.

So SequenceFiles are optimal within the Nutch system itself, but for additional analytics on outside platforms such as Databricks I suspect Parquet would be preferred. Maybe we can share more ideas here. I wonder if a utility tool to write segments as Parquet data would be useful? (See the rough sketch in the P.S. below.)

Thanks, Seb
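P.S. To make the utility idea a bit more concrete, below is a minimal, untested sketch of what I have in mind, using the Spark Java API to dump one segment part (crawl_fetch, a SequenceFile of Text/CrawlDatum pairs) to Parquet. It assumes the Nutch job jar is on the classpath so that CrawlDatum resolves; the class name and the column selection are mine, purely for illustration:

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class SegmentToParquet {

        // args[0]: SequenceFile data of a segment's crawl_fetch part,
        //          e.g. <segment>/crawl_fetch/part-*/data
        // args[1]: Parquet output directory
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("SegmentToParquet").getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            // Hadoop reuses Writable instances while reading, so convert
            // key/value to plain Java values straight away.
            JavaRDD<Row> rows = jsc
                .sequenceFile(args[0], Text.class, CrawlDatum.class)
                .map(t -> RowFactory.create(
                    t._1().toString(),         // url
                    (int) t._2().getStatus(),  // CrawlDatum status code
                    t._2().getFetchTime(),     // fetch time in ms
                    t._2().getScore()));       // page score

            StructType schema = new StructType()
                .add("url", DataTypes.StringType)
                .add("status", DataTypes.IntegerType)
                .add("fetch_time", DataTypes.LongType)
                .add("score", DataTypes.FloatType);

            spark.createDataFrame(rows, schema).write().parquet(args[1]);
            spark.stop();
        }
    }

The same pattern should extend to the other segment parts (content, parse_data, parse_text) and to the CrawlDb - only the value class and the schema change. Happy to work this up into a proper patch if there is interest.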