[jira] [Assigned] (NUTCH-2696) Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

2019-03-06 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2696:
--

Assignee: Sebastian Nagel

> Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x
> --
>
> Key: NUTCH-2696
> URL: https://issues.apache.org/jira/browse/NUTCH-2696
> Project: Nutch
> Issue Type: Bug
> Components: segment
> Affects Versions: 1.15
> Environment: Hadoop version : 3.0.0 (CDH 6.1)
> Nutch : 1.15
> Mode : distributed mode
> Reporter: Laurent Hervaud
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> All Nutch tasks work properly with Hadoop 3.x, except SegmentReader.
>  SegmentReader with the -get option works fine.
>  SegmentReader with the -dump option replaces non-ASCII characters with ?
> Example URL: [http://www.wikipedia.fr/index.php]
>  
> {code:java}
> command : ./runtime/deploy/bin/nutch readseg -dump 
> /user/nutch/crawl1.15/segments/20190221093756 /tmp/dump1.15 -nocontent 
> -nogenerate -noparse -noparsedata
> ParseText::
>  Wikipedia.fr - Portail de recherche sur les projets Wikim?dia
>  Chercher sur Wikip?dia en fran?ais
>  L?encyclop?die librement r?utilisable que chacun peut am?liorer.
> {code}
>  
>  
> {code:java}
> command : ./runtime/deploy/bin/nutch readseg -get 
> /user/nutch/crawl1.15/segments/20190221093756 
> http://www.wikipedia.fr/index.php -nocontent -nogenerate -noparse -noparsedata
> ParseText::
>  Wikipedia.fr - Portail de recherche sur les projets Wikimédia
>  Chercher sur Wikipédia en français
>  L’encyclopédie librement réutilisable que chacun peut améliorer.
> {code}
>  
> I tried building with the Hadoop 3.0.0 dependencies in ivy.xml, but I get the 
> same result.
> It works fine in local mode.
>  
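The `?` substitution in the -dump output is the usual symptom of text passing through an encoder that cannot represent the characters, for example when the default charset of the process writing the dump is ASCII rather than UTF-8. Whether that is the actual cause here is an assumption; the report does not confirm it. A minimal Python sketch of the effect, where the helper name `dump_line` is hypothetical and is not Nutch code:

```python
def dump_line(text: str, charset: str) -> bytes:
    """Encode text like a lossy writer: unmappable characters become '?'."""
    return text.encode(charset, errors="replace")


line = "Chercher sur Wikipédia en français"

# Assumed failure mode: the writer falls back to an ASCII-only charset.
ascii_dump = dump_line(line, "ascii")
# Expected behaviour: the writer uses UTF-8 and keeps every character.
utf8_dump = dump_line(line, "utf-8")

print(ascii_dump.decode("ascii"))
print(utf8_dump.decode("utf-8"))
```

The first printed line shows the same kind of damage as the -dump output above, while the second round-trips unchanged, as in the -get output.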



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

2019-03-06 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785836#comment-16785836
 ] 

Sebastian Nagel commented on NUTCH-2683:


Any comments or objections? Thanks! Otherwise I'll commit.

> DeduplicationJob: add option to prefer https:// over http://
> 
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> The deduplication job allows keeping the shortest URL as the "best" URL of a 
> set of duplicates, marking all longer ones as duplicates. Recently, search 
> engines started to penalize non-https pages by [giving https pages a higher 
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
> and [marking http as 
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be 
> able to prefer https:// over http:// URLs, although the latter are 
> shorter by one character. Of course, this should be configurable and apply in 
> addition to the existing preferences (length, score and fetch time) used to 
> select the "best" URL among duplicates.
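
The preference described above could be sketched as follows. This is only an illustration in Python, not the DeduplicationJob code and not the proposed patch: the function name `pick_best` and the `prefer_https` flag are hypothetical, and score and fetch time are left out for brevity. Length is compared without the scheme so that the one extra character of https:// does not count against it.

```python
def pick_best(duplicates: list, prefer_https: bool = True) -> str:
    """Return the URL to keep from a set of duplicate URLs.

    With prefer_https disabled, the shortest URL wins (ties broken
    alphabetically). With it enabled, length is measured without the
    scheme and https:// wins among URLs of equal remaining length.
    """
    def key(url: str):
        if not prefer_https:
            return (len(url), 0, url)
        scheme, _, rest = url.partition("://")
        https_rank = 0 if scheme == "https" else 1
        return (len(rest), https_rank, url)

    return min(duplicates, key=key)


pair = ["http://example.com/", "https://example.com/"]
print(pick_best(pair))                      # https variant is kept
print(pick_best(pair, prefer_https=False))  # shortest URL is kept
```

A genuinely shorter http:// URL still wins over a longer https:// one in this sketch; a real implementation would need to decide how the https preference ranks against length, score and fetch time.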





[jira] [Commented] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit

2019-03-06 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785819#comment-16785819
 ] 

Sebastian Nagel commented on NUTCH-2666:


Any objections? It's a huge jump, but it may be sufficient as a default for 
the next few years.

> Increase default value for http.content.limit / ftp.content.limit / 
> file.content.limit
> --
>
> Key: NUTCH-2666
> URL: https://issues.apache.org/jira/browse/NUTCH-2666
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.15
> Reporter: Marco Ebbinghaus
> Priority: Minor
> Fix For: 1.16
>
>
> The default value for http.content.limit in nutch-default.xml (The length 
> limit for downloaded content using the http://
>  protocol, in bytes. If this value is nonnegative (>=0), content longer
>  than it will be truncated; otherwise, no truncation at all. Do not
>  confuse this setting with the file.content.limit setting.) is set to 64 kB. 
> Maybe this default value should be increased, as many pages today are larger 
> than 64 kB.
> This hit me when trying to crawl a single website whose pages are much 
> larger than 64 kB: with every crawl cycle the count of 
> db_unfetched URLs decreased until it hit zero and the crawler became inactive, 
> because the first 64 kB always contained the same set of navigation links.
> The description might also be updated, as this applies not only to the 
> http protocol but also to https.
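
The effect described above can be reproduced with a small Python sketch. It is not Nutch code: the helper names, the regex-based link extraction and the sample page are all hypothetical, and 64 kB is taken here as 65536 bytes. Only the truncation rule (a nonnegative limit truncates, a negative one does not) follows the property description quoted in the issue.

```python
import re

DEFAULT_LIMIT = 64 * 1024  # 64 kB default named in the report, assumed to be 65536 bytes


def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Truncate content to `limit` bytes; a negative limit disables truncation."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


def extract_links(html: bytes) -> list:
    """Very rough href extraction, sufficient for this illustration only."""
    return re.findall(rb'href="([^"]+)"', html)


# Hypothetical page: navigation links first, a large body, then an article link.
nav = b'<a href="/home"></a><a href="/about"></a>'
padding = b"x" * (100 * 1024)
page = nav + padding + b'<a href="/article/42"></a>'

truncated = apply_content_limit(page, DEFAULT_LIMIT)
links_truncated = extract_links(truncated)
links_full = extract_links(apply_content_limit(page, -1))

print(links_truncated)  # only the navigation links survive truncation
print(links_full)       # the article link appears once truncation is disabled
```

With the limit in place, every page of such a site yields the same navigation links, so no new URLs are discovered after the first cycles.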





4 Apache Events in 2019: DC Roadshow soon; next up Chicago, Las Vegas, and Berlin!

2019-03-06 Thread Rich Bowen
Dear Apache Enthusiast,

(You’re receiving this because you are subscribed to one or more user
mailing lists for an Apache Software Foundation project.)

TL;DR:
 * Apache Roadshow DC is in 3 weeks. Register now at
https://apachecon.com/usroadshowdc19/
 * Registration for Apache Roadshow Chicago is open.
http://apachecon.com/chiroadshow19
 * The CFP for ApacheCon North America is now open.
https://apachecon.com/acna19
 * Save the date: ApacheCon Europe will be held in Berlin, October 22nd
through 24th.  https://apachecon.com/aceu19


Registration is open for two Apache Roadshows; these are smaller events
with a more focused program and regional community engagement:

Our Roadshow event in Washington DC takes place in under three weeks, on
March 25th. We’ll be hosting a day-long event at the Fairfax campus of
George Mason University. The roadshow is a full day of technical talks
(two tracks) and an open source job fair featuring AWS, Bloomberg, dito,
GridGain, Linode, and Security University. For more details about the
program and the job fair, and to register, visit
https://apachecon.com/usroadshowdc19/

Apache Roadshow Chicago will be held May 13-14th at a number of venues
in Chicago’s Logan Square neighborhood. This event will feature sessions
in AdTech, FinTech and Insurance, startups, “Made in Chicago”, Project
Shark Tank (innovations from the Apache Incubator), community diversity,
and more. It’s a great way to learn about various Apache projects “at
work” while playing at a brewery, a beercade, and a neighborhood bar.
Sign up today at https://www.apachecon.com/chiroadshow19/

We’re delighted to announce that the Call for Presentations (CFP) is now
open for ApacheCon North America in Las Vegas, September 9-13th! As the
official conference series of the ASF, ApacheCon North America will
feature over a dozen Apache project summits, including Cassandra,
Cloudstack, Tomcat, Traffic Control, and more. We’re looking for talks
in a wide variety of categories -- anything related to ASF projects and
the Apache development process. The CFP closes at midnight on May 26th.
In addition, the ASF will be celebrating its 20th Anniversary during the
event. For more details and to submit a proposal for the CFP, visit
https://apachecon.com/acna19/ . Registration will be opening soon.

Be sure to mark your calendars for ApacheCon Europe, which will be held
in Berlin, October 22-24th at the KulturBrauerei, a landmark of Berlin's
industrial history. In addition to innovative content from our projects,
we are collaborating with the Open Source Design community
(https://opensourcedesign.net/) to offer a track on design this year.
The CFP and registration will open soon at https://apachecon.com/aceu19/ .

Sponsorship opportunities are available for all events, with details
listed on each event’s site at http://apachecon.com/.

We look forward to seeing you!

Rich, for the ApacheCon Planners
@apachecon