Re: Writing Nutch data in Parquet format
Hi Seb,

Really interesting, thanks for the response. Replies below.

On 2021/05/05 11:42:04, Sebastian Nagel wrote:

> Yes, but not directly - it's a multi-step process.

As I expected ;)

> This Parquet index is optimized by sorting the rows by a special form of the
> URL [1] which
> - drops the protocol or scheme
> - reverses the host name and
> - puts it in front of the remaining URL parts (path and query)
> - with some additional normalization of path and query (e.g. sorting of query
>   params)
>
> One example:
>   https://example.com/path/search?q=foo&l=en
>   com,example)/path/search?l=en&q=foo
>
> The SURT URL is similar to the URL format used by Nutch2
>   com.example/https/path/search?q=foo&l=en
> to address rows in the WebPage table [2]. This format is inspired by the
> BigTable paper [3]. The point is that rows for pages of the same host and
> domain end up stored next to each other, cf. [4].

OK, I recognize this data model. Seems logical.

> Ok, back to the question: both 1) and 2) are trivial if you do not care about
> writing optimal Parquet files: just define a schema following the methods
> implementing the Writable interface. Parquet is easier to feed into various
> data processing systems because it integrates the schema. The Sequence file
> format requires that the Writable formats are provided - although Spark and
> other big data tools support Sequence files, this requirement is sometimes a
> blocker, also because Nutch does not ship a small "nutch-formats" jar.

In my case, the purpose of writing Nutch (Hadoop sequence file) data to Parquet format was to facilitate (improved) analytics within the Databricks platform, which we are currently evaluating. I'm hesitant to re-use the word 'optimal' because I have not yet benchmarked any retrievals, but I 'hope' that I can begin to work on 'optimizing' the way that Nutch data is written such that it can be analyzed with relative ease within, for example, Databricks.

> Nevertheless, the price for Parquet is slower writing - which is ok for
> write-once-read-many use cases.

Yes, this is our use case.

> But the typical use case for Nutch is "write-once-read-twice":
> - segment: read for CrawlDb update and indexing
> - CrawlDb: read during update then replaced, in some cycles read for
>   deduplication, statistics, etc.

So sequence files are optimal for use within the Nutch system, but for additional analytics (on outside platforms such as Databricks) I suspect that Parquet would be preferred. Maybe we can share more ideas. I wonder if a utility tool to write segments as Parquet data would be useful?

Thanks Seb
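To make the idea of such a conversion utility concrete, here is a minimal Spark (Scala) sketch of dumping a CrawlDb to Parquet. It is only an illustration, not an existing Nutch tool: the HDFS paths and output columns are placeholders, and it assumes the Nutch job jar is on the classpath so the CrawlDatum Writable can be deserialized.

```scala
import org.apache.hadoop.io.Text
import org.apache.nutch.crawl.CrawlDatum
import org.apache.spark.sql.SparkSession

object CrawlDbToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CrawlDbToParquet").getOrCreate()
    import spark.implicits._

    // CrawlDb part files are SequenceFiles of (Text url, CrawlDatum).
    // Path is a placeholder for this example.
    val rdd = spark.sparkContext.sequenceFile(
      "hdfs:///nutch/crawldb/current", classOf[Text], classOf[CrawlDatum])

    // Copy each Writable into plain Scala values right away, because Hadoop
    // record readers reuse Writable instances between records.
    val df = rdd.map { case (url, datum) =>
      (url.toString, datum.getStatus.toInt, datum.getFetchTime, datum.getScore)
    }.toDF("url", "status", "fetch_time", "score")

    // Parquet carries the schema with the data, so downstream tools can
    // read the output without the Nutch classes.
    df.write.parquet("hdfs:///nutch/crawldb-parquet")
    spark.stop()
  }
}
```

Flattening each Writable into plain columns is what lets the Parquet output describe itself; a segment-to-Parquet variant would do the same for the segment's Writables (e.g. Content and ParseData).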
Re: Writing Nutch data in Parquet format
Hi Lewis,

> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?

Yes, but not directly - it's a multi-step process. The outcome:
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

This Parquet index is optimized by sorting the rows by a special form of the URL [1] which
- drops the protocol or scheme
- reverses the host name and
- puts it in front of the remaining URL parts (path and query)
- with some additional normalization of path and query (e.g. sorting of query params)

One example:
  https://example.com/path/search?q=foo&l=en
  com,example)/path/search?l=en&q=foo

The SURT URL is similar to the URL format used by Nutch2
  com.example/https/path/search?q=foo&l=en
to address rows in the WebPage table [2]. This format is inspired by the BigTable paper [3]. The point is that rows for pages of the same host and domain end up stored next to each other, cf. [4].

Ok, back to the question: both 1) and 2) are trivial if you do not care about writing optimal Parquet files: just define a schema following the methods implementing the Writable interface. Parquet is easier to feed into various data processing systems because it integrates the schema. The Sequence file format requires that the Writable formats are provided - although Spark and other big data tools support Sequence files, this requirement is sometimes a blocker, also because Nutch does not ship a small "nutch-formats" jar.

Nevertheless, the price for Parquet is slower writing - which is ok for write-once-read-many use cases. But the typical use case for Nutch is "write-once-read-twice":
- segment: read for CrawlDb update and indexing
- CrawlDb: read during update then replaced, in some cycles read for deduplication, statistics, etc.

Lewis, I'd be really interested in what your particular use case is - also because at Common Crawl we plan to provide more data in the Parquet format: page metadata, links and text dumps. Storing URLs and web page metadata efficiently was part of the motivation for Dremel [5], which in turn inspired Parquet [6].

Best,
Sebastian

[1] https://github.com/internetarchive/surt
[2] https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling#Nutch2Crawling-Introduction
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
[4] https://cloud.google.com/bigtable/docs/schema-design#domain-names
[5] https://research.google/pubs/pub36632/
[6] https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html

On 5/4/21 11:14 PM, Lewis John McGibbney wrote:
> Hi user@,
> Has anyone experimented/accomplished either
> 1) writing Nutch data directly as Parquet format, or
> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?
> Thank you
> lewismc
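To illustrate the URL transformation described in the message above, here is a rough Scala sketch of such a sort key. It is only an approximation for this one example and not the full canonicalization implemented by the surt library referenced in [1]; the function and object names are made up for the illustration.

```scala
import java.net.URI

object SurtKeySketch {
  // Rough sketch of the sort key described above: drop the scheme,
  // reverse the host labels, keep the path, and sort the query parameters.
  def surtKey(url: String): String = {
    val u = new URI(url)
    val reversedHost = u.getHost.split('.').reverse.mkString(",")
    val sortedQuery = Option(u.getQuery)
      .map(q => "?" + q.split('&').sorted.mkString("&"))
      .getOrElse("")
    reversedHost + ")" + u.getPath + sortedQuery
  }

  def main(args: Array[String]): Unit = {
    // Prints: com,example)/path/search?l=en&q=foo
    println(surtKey("https://example.com/path/search?q=foo&l=en"))
  }
}
```

Sorting rows by such a key keeps all pages of one host and domain close together, which is the row-key property recommended in [4].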
Writing Nutch data in Parquet format
Hi user@,

Has anyone experimented/accomplished either
1) writing Nutch data directly as Parquet format, or
2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?

Thank you
lewismc