Re: Writing Nutch data in Parquet format

2021-05-06 Thread Lewis John McGibbney
Hi Seb,
Really interesting. Thanks for the response. Below

On 2021/05/05 11:42:04, Sebastian Nagel wrote:
> 
> Yes, but not directly - it's a multi-step process. 

As I expected ;)

> 
> This Parquet index is optimized by sorting the rows by a special form of the 
> URL [1] which
> - drops the protocol or scheme
> - reverses the host name and
> - puts it in front of the remaining URL parts (path and query)
> - with some additional normalization of path and query (e.g. sorting of query 
> params)
> 
> One example:
>    https://example.com/path/search?q=foo&l=en
>    com,example)/path/search?l=en&q=foo
> 
> The SURT URL is similar to the URL format used by Nutch2
>    com.example/https/path/search?q=foo&l=en
> to address rows in the WebPage table [2]. This format is inspired by the 
> BigTable
> paper [3]. The point is that rows of the same host and domain end up stored 
> next to each other, cf. [4].

OK, I recognize this data model. Seems logical. 
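
For illustration, a minimal Java sketch of the SURT transformation described
above, assuming only the steps listed (drop the scheme, reverse the host,
sort the query parameters); real implementations, e.g. in webarchive-commons,
handle many more normalization cases:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Arrays;

    public class SurtSketch {

      public static String toSurt(String url) throws MalformedURLException {
        URL u = new URL(url);
        // reverse the host name: example.com -> com,example
        String[] labels = u.getHost().split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
          if (sb.length() > 0) sb.append(',');
          sb.append(labels[i]);
        }
        sb.append(')').append(u.getPath());
        if (u.getQuery() != null) {
          // normalize the query by sorting its parameters
          String[] params = u.getQuery().split("&");
          Arrays.sort(params);
          sb.append('?').append(String.join("&", params));
        }
        return sb.toString();
      }

      public static void main(String[] args) throws MalformedURLException {
        // prints: com,example)/path/search?l=en&q=foo
        System.out.println(toSurt("https://example.com/path/search?q=foo&l=en"));
      }
    }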

> Ok, back to the question: both 1) and 2) are trivial if you do not care about
> writing optimal Parquet files: just define a schema following the methods
> implementing the Writable interface. Parquet is easier to feed into various
> data processing systems because it integrates the schema. The Sequence file
> format requires that the Writable implementations are provided - although
> Spark and other big data tools support Sequence files, this requirement is
> sometimes a blocker, also because Nutch does not ship a small "nutch-formats" jar.

In my case, the purpose of writing Nutch (Hadoop sequence file) data to Parquet 
format was to facilitate (improved) analytics within the Databricks platform, 
which we are currently evaluating.
I'm hesitant to re-use the word 'optimal' because I have not yet benchmarked 
any retrievals, but I 'hope' that I can begin to work on 'optimizing' the way 
that Nutch data is written such that it can be analyzed with relative ease 
within, for example, Databricks.
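
As a concrete starting point, here is a hedged sketch (Spark's Java API) of
the multi-step conversion discussed above: read a CrawlDb sequence file (Text
key, CrawlDatum value) and write a few of its fields as Parquet. The paths and
selected fields are illustrative, and the Nutch job jar has to be on the
classpath so that the CrawlDatum Writable can be deserialized:

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class CrawlDbToParquet {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("CrawlDbToParquet").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // CrawlDb part files map Text (URL) -> CrawlDatum; extract plain
        // values immediately because Hadoop reuses the Writable objects
        JavaRDD<Row> rows = jsc
            .sequenceFile(args[0], Text.class, CrawlDatum.class)
            .map(t -> RowFactory.create(t._1().toString(),
                (int) t._2().getStatus(), t._2().getFetchTime(),
                t._2().getScore()));

        StructType schema = new StructType()
            .add("url", DataTypes.StringType)
            .add("status", DataTypes.IntegerType)
            .add("fetchTime", DataTypes.LongType)
            .add("score", DataTypes.FloatType);

        spark.createDataFrame(rows, schema).write().parquet(args[1]);
        spark.stop();
      }
    }

On Databricks the result can then be queried directly with
spark.read().parquet(...); the field selection would of course have to be
extended for real analytics.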

> 
> Nevertheless, the price for Parquet is slower writing - which is ok for 
> write-once-read-many
> use cases. 

Yes, this is our use case.

> But the typical use case for Nutch is "write-once-read-twice":
> - segment: read for CrawlDb update and indexing
> - CrawlDb: read during update then replace, in some cycles read for 
> deduplication, statistics, etc.

So sequence files are optimal for use within the Nutch system, but for 
additional analytics (on outside platforms such as Databricks) I suspect that 
Parquet would be preferred.

Maybe we can share more ideas. I wonder if a utility tool to write segments as 
Parquet data would be useful?

Thanks Seb


Re: Redirection behavior

2021-05-06 Thread prateek
Thanks. I am using a custom HTTP plugin, so I will debug with 1.16 to see
what's causing it. Thanks for your help.

Regards
Prateek

On Thu, May 6, 2021 at 11:26 AM Sebastian Nagel wrote:

> Hi Prateek,
>
> (sorry, I pressed the wrong reply button, so redirecting the discussion
> back to user@nutch)
>
>
>  > I am not sure what I am missing.
>
> Well, URL filters?  Robots.txt?  Don't know...
>
>
>  > I am currently using Nutch 1.16
>
> Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1])
> which caused Fetcher
> not to follow redirects. But it was fixed already in Nutch 1.15.
>
> I've retried using Nutch 1.16:
> - using -Dplugin.includes='protocol-okhttp|parse-html'
> FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
> FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
> FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>
> Note: there might be an issue using protocol-http
> (-Dplugin.includes='protocol-http|parse-html')
> together with Nutch 1.16:
> FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
> FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
> Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
> FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
> FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
> Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
> Failed to get protocol output java.net.SocketException: Socket is closed
>      at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
>      at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
>      at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
>      at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
>      at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
> FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed
>
> But it's not reproducible using Nutch master / 1.18 - as it relates to
> HTTPS/SSL it's likely fixed by NUTCH-2794 [2].
>
> In any case, could you try to reproduce the problem using Nutch 1.18?
>
> Best,
> Sebastian
>
> [1] https://issues.apache.org/jira/browse/NUTCH-2550
> [2] https://issues.apache.org/jira/browse/NUTCH-2794
>
>
> On 5/6/21 11:54 AM, prateek wrote:
> > Thanks for your reply Sebastian.
> >
> > I am using http.redirect.max=5 for my setup.
> > In the seed URL, I am only passing http://wikipedia.com/ and
> > https://zyfro.com/. The CrawlDatum and ParseData shared in my earlier
> > email are from the http://wikipedia.com/ URL.
> > I don't see the other redirected URLs in the logs or segments. Here is
> > my log -
> >
> > 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread:
> > FetcherThread 1 Using queue mode : byHost
> > 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher:
> > Fetcher: throughput threshold: -1
> > 2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher:
> > Fetcher: throughput threshold retries: 5
> > 2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread:
> > FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
> >
> > 2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread:
> > FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
> > 2021-05-05 17:35:29,095 INFO [main] com.linkedin.nutchplugin.http.Http:
> > fetching https://zyfro.com/robots.txt
> > 2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher:
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0,
> > fetchQueues.getQueueCount=1
> > 2021-05-05 17:35:30,189 INFO [main] com.linkedin.nutchplugin.http.Http:
> > fetching https://zyfro.com/
> > 2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread:
> > FetcherThread 50 has no more work available
> >
> > I am not sure what I am missing.
> >
> > Regards
> > Prateek
> >
> >
> > On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel
> > <wastl.na...@googlemail.com> wrote:
> >
> > Hi Prateek,
> >
> > could you share information about all pages/URLs in the redirect
> chain?
> >
> > http://wikipedia.com/ 
> > https://wikipedia.com/ 
> > https://www.wikipedia.org/ 
> >
> > If I'm not wrong, the shown CrawlDatum and ParseData stem from
> > https://www.wikipedia.org/ and have _http_status_code_=200.

Re: Redirection behavior

2021-05-06 Thread Sebastian Nagel

Hi Prateek,

(sorry, I pressed the wrong reply button, so redirecting the discussion back to 
user@nutch)


> I am not sure what I am missing.

Well, URL filters?  Robots.txt?  Don't know...


> I am currently using Nutch 1.16

Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1]) which 
caused Fetcher
not to follow redirects. But it was fixed already in Nutch 1.15.

I've retried using Nutch 1.16:
- using -Dplugin.includes='protocol-okhttp|parse-html'
   FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)

Note: there might be an issue using protocol-http 
(-Dplugin.includes='protocol-http|parse-html')
together with Nutch 1.16:
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
   Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
   Failed to get protocol output java.net.SocketException: Socket is closed
        at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
   FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed

But it's not reproducible using Nutch master / 1.18 - as it relates to 
HTTPS/SSL it's likely fixed by NUTCH-2794 [2].

In any case, could you try to reproduce the problem using Nutch 1.18?

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2550
[2] https://issues.apache.org/jira/browse/NUTCH-2794


On 5/6/21 11:54 AM, prateek wrote:

Thanks for your reply Sebastian.

I am using http.redirect.max=5 for my setup.
In the seed URL, I am only passing http://wikipedia.com/ and https://zyfro.com/. 
The CrawlDatum and ParseData shared in my earlier email are from the 
http://wikipedia.com/ URL.

I don't see the other redirected URLs in the logs or segments. Here is my log -

2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: 
FetcherThread 1 Using queue mode : byHost
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: 
throughput threshold: -1
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: 
throughput threshold retries: 5
2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: 
FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)

2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: 
FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
2021-05-05 17:35:29,095 INFO [main] com.linkedin.nutchplugin.http.Http: fetching 
https://zyfro.com/robots.txt
2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: 
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
2021-05-05 17:35:30,189 INFO [main] com.linkedin.nutchplugin.http.Http: fetching 
https://zyfro.com/
2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: 
FetcherThread 50 has no more work available

I am not sure what I am missing.

Regards
Prateek


On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:

Hi Prateek,

could you share information about all pages/URLs in the redirect chain?

http://wikipedia.com/ 
https://wikipedia.com/ 
https://www.wikipedia.org/ 

If I'm not wrong, the shown CrawlDatum and ParseData stem from
https://www.wikipedia.org/ and have _http_status_code_=200.
So it looks like the redirects have been followed.

Note: all 3 URLs should have records in the segment and the CrawlDb.
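
One way to verify this (the crawldb path below is illustrative) is to print
the CrawlDb record of each URL with the readdb tool:

    bin/nutch readdb crawl/crawldb -url https://www.wikipedia.org/

which shows the stored status and metadata for that record.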

I've also verified that the above redirect chain is followed by Fetcher
with the following settings (passed on the command-line via -D) using
Nutch master (1.18):
   -Dhttp.redirect.max=3
   -Ddb.ignore.external.links=true
   -Ddb.ignore.external.links.mode=byDomain
   -Ddb.ignore.also.redirects=false
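
For completeness, a minimal sketch of the same settings applied
programmatically via the configuration API (assuming you embed Nutch classes,
e.g. in a test); normally these properties would simply go into
conf/nutch-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class RedirectSettings {
      public static Configuration redirectFollowingConf() {
        // start from the defaults in nutch-default.xml / nutch-site.xml
        Configuration conf = NutchConfiguration.create();
        // follow up to 3 redirects within the same fetcher run
        conf.setInt("http.redirect.max", 3);
        conf.setBoolean("db.ignore.external.links", true);
        conf.set("db.ignore.external.links.mode", "byDomain");
        conf.setBoolean("db.ignore.also.redirects", false);
        return conf;
      }
    }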

Fetcher log