Re: Writing Nutch data in Parquet format

2021-05-06 Thread Lewis John McGibbney
Hi Seb,
Really interesting. Thanks for the response. Below

On 2021/05/05 11:42:04, Sebastian Nagel wrote:
> 
> Yes, but not directly - it's a multi-step process. 

As I expected ;)

> 
> This Parquet index is optimized by sorting the rows by a special form of the 
> URL [1] which
> - drops the protocol or scheme
> - reverses the host name and
> - puts it in front of the remaining URL parts (path and query)
> - with some additional normalization of path and query (e.g. sorting of query
>   params)
> 
> One example:
>   https://example.com/path/search?q=foo&l=en
>   com,example)/path/search?l=en&q=foo
> 
> The SURT URL is similar to the URL format used by Nutch2
>   com.example/https/path/search?q=foo&l=en
> to address rows in the WebPage table [2]. This format is inspired by the
> BigTable paper [3]. The point is that pages of the same host and domain end
> up in adjacent rows, cf. [4].

OK, I recognize this data model. Seems logical. 

> Ok, back to the question: both 1) and 2) are trivial if you do not care about
> writing optimal Parquet files: just define a schema following the methods
> implementing the Writable interface. Parquet is easier to feed into various
> data processing systems because it integrates the schema. The Sequence file
> format requires that the Writable implementations are provided - although
> Spark and other big data tools support Sequence files, this requirement is
> sometimes a blocker, also because Nutch does not ship a small "nutch-formats"
> jar.

In my case, the purpose of writing Nutch (Hadoop sequence file) data to Parquet
format was to facilitate (improved) analytics within the Databricks platform,
which we are currently evaluating.
I'm hesitant to re-use the word 'optimal' because I have not yet benchmarked
any retrievals, but I 'hope' that I can begin to work on 'optimizing' the way
that Nutch data is written such that it can be analyzed with relative ease
within, for example, Databricks.
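
To give a sense of what that could look like once the data is in Parquet, here
is a hypothetical Spark (Java) snippet - the dbfs: path and the url/status/score
columns are assumptions about whatever schema a conversion step ends up writing,
not an existing layout:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CrawlDbQueries {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("CrawlDbQueries").getOrCreate();

    // Path and column names (url, status, score) are placeholders for whatever
    // schema the Parquet conversion writes - adjust to the real layout.
    Dataset<Row> crawldb = spark.read().parquet("dbfs:/mnt/nutch/crawldb-parquet");

    // Pages per crawl status - no Nutch classes on the classpath required.
    crawldb.groupBy("status").count().orderBy("status").show();

    // Or plain SQL once the data is registered as a view.
    crawldb.createOrReplaceTempView("crawldb");
    spark.sql("SELECT url, score FROM crawldb ORDER BY score DESC LIMIT 10").show(false);

    spark.stop();
  }
}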

> 
> Nevertheless, the price for Parquet is slower writing - which is ok for 
> write-once-read-many
> use cases. 

Yes, this is our use case.

> But the typical use case for Nutch is "write-once-read-twice":
> - segment: read for CrawlDb update and indexing
> - CrawlDb: read during update then replace, in some cycles read for 
> deduplication, statistics, etc.

So sequence files are optimal for use within the Nutch system, but for
additional analytics (on outside platforms such as Databricks) I suspect that
Parquet would be preferred.

Maybe we can share more ideas. I wonder if a utility tool to write segments as 
Parquet data would be useful?

Thanks Seb


Re: Writing Nutch data in Parquet format

2021-05-05 Thread Sebastian Nagel

Hi Lewis,

> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet
>    format?

Yes, but not directly - it's a multi-step process. The outcome:
  https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

This Parquet index is optimized by sorting the rows by a special form of the 
URL [1] which
- drops the protocol or scheme
- reverses the host name and
- puts it in front of the remaining URL parts (path and query)
- with some additional normalization of path and query (e.g. sorting of query
  params)

One example:
  https://example.com/path/search?q=foo&l=en
  com,example)/path/search?l=en&q=foo

The SURT URL is similar to the URL format used by Nutch2
  com.example/https/path/search?q=foo&l=en
to address rows in the WebPage table [2]. This format is inspired by the
BigTable paper [3]. The point is that pages of the same host and domain end up
in adjacent rows, cf. [4].
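
For illustration, here is a minimal Java sketch of that key transformation
(host reversal plus query-parameter sorting). The class and method names are
made up for the example; the real surt library [1] implements many more
normalization rules (ports, www prefixes, session parameters, IDNs, ...):

import java.net.URI;
import java.util.Arrays;

public class SurtSketch {

  // Illustrative only - see the surt library [1] for the full rules.
  static String toSurtKey(String url) throws Exception {
    URI u = new URI(url);

    // Reverse the host labels: example.com -> com,example
    String[] labels = u.getHost().split("\\.");
    StringBuilder key = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      if (key.length() > 0) key.append(',');
      key.append(labels[i]);
    }
    key.append(')');

    // Scheme and host are dropped from this part; append the path ...
    String path = u.getRawPath();
    key.append(path == null || path.isEmpty() ? "/" : path);

    // ... and the query with its parameters sorted
    if (u.getRawQuery() != null) {
      String[] params = u.getRawQuery().split("&");
      Arrays.sort(params);
      key.append('?').append(String.join("&", params));
    }
    return key.toString();
  }

  public static void main(String[] args) throws Exception {
    // prints: com,example)/path/search?l=en&q=foo
    System.out.println(toSurtKey("https://example.com/path/search?q=foo&l=en"));
  }
}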


Ok, back to the question: both 1) and 2) are trivial if you do not care about
writing optimal Parquet files: just define a schema following the methods
implementing the Writable interface. Parquet is easier to feed into various
data processing systems because it integrates the schema. The Sequence file
format requires that the Writable implementations are provided - although Spark
and other big data tools support Sequence files, this requirement is sometimes
a blocker, also because Nutch does not ship a small "nutch-formats" jar.
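
To make that concrete, below is a rough sketch (not an existing Nutch tool) that
reads a CrawlDb with the Spark Java API and writes it out as Parquet. It assumes
the Nutch job jar is on the classpath so that CrawlDatum can be deserialized,
and the paths and column names are placeholders:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CrawlDbToParquet {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("CrawlDbToParquet").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Columns derived from a few CrawlDatum getters - extend as needed.
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("url", DataTypes.StringType, false),
        DataTypes.createStructField("status", DataTypes.IntegerType, false),
        DataTypes.createStructField("fetch_time", DataTypes.LongType, false),
        DataTypes.createStructField("score", DataTypes.FloatType, false),
        DataTypes.createStructField("fetch_interval", DataTypes.IntegerType, false)
    });

    // A CrawlDb is a set of SequenceFile<Text, CrawlDatum> part files.
    // Extract plain values right away: Hadoop reuses the Writable objects.
    JavaRDD<Row> rows = jsc
        .sequenceFile("hdfs:///nutch/crawldb/current", Text.class, CrawlDatum.class)
        .map(kv -> RowFactory.create(
            kv._1().toString(),
            (int) kv._2().getStatus(),
            kv._2().getFetchTime(),
            kv._2().getScore(),
            kv._2().getFetchInterval()));

    spark.createDataFrame(rows, schema)
        .write()
        .parquet("hdfs:///nutch/crawldb-parquet");

    spark.stop();
  }
}

The same pattern should apply to the segment subdirectories (crawl_fetch,
crawl_parse, parse_data, parse_text), each with its own value Writable.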

Nevertheless, the price for Parquet is slower writing - which is ok for 
write-once-read-many
use cases. But the typical use case for Nutch is "write-once-read-twice":
- segment: read for CrawlDb update and indexing
- CrawlDb: read during update then replace, in some cycles read for 
deduplication, statistics, etc.


Lewis, I'd be really interested to hear what your particular use case is.

I ask also because at Common Crawl we plan to provide more data in the Parquet
format: page metadata, links and text dumps. Storing URLs and web page metadata
efficiently was part of the motivation for Dremel [5], which in turn inspired
Parquet [6].


Best,
Sebastian


[1] https://github.com/internetarchive/surt
[2] https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling#Nutch2Crawling-Introduction
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
[4] https://cloud.google.com/bigtable/docs/schema-design#domain-names
[5] https://research.google/pubs/pub36632/
[6] https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html


On 5/4/21 11:14 PM, Lewis John McGibbney wrote:

> Hi user@,
> Has anyone experimented/accomplished either
> 1) writing Nutch data directly as Parquet format, or
> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet
>    format?
> Thank you
> lewismc





Writing Nutch data in Parquet format

2021-05-04 Thread Lewis John McGibbney
Hi user@,
Has anyone experimented/accomplished either
1) writing Nutch data directly as Parquet format, or
2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet 
format?
Thank you
lewismc