[GitHub] nutch pull request: NUTCH-2218 - Update CrawlComplete util with Co...

2016-02-18 Thread MJJoyce
GitHub user MJJoyce closed the pull request at: https://github.com/apache/nutch/pull/91

[GitHub] nutch pull request: NUTCH-2213 : do not store the headers verbatim...

2016-02-17 Thread MJJoyce
GitHub user MJJoyce commented on a diff in the pull request: https://github.com/apache/nutch/pull/88#discussion_r53254383 --- Diff: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java --- @@ -256,6 +252,11 @@ public HttpResponse(HttpBase http, URL

[GitHub] nutch pull request: NUTCH-2218 - Update CrawlComplete util with Co...

2016-02-12 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/91 NUTCH-2218 - Update CrawlComplete util with Commons CLI arg parsing - Switch all argument parsing and checking to commons CLI. - Update input directory processing such that the 'crawldb' folder

[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/83 NUTCH-2155 - Add crawl completion utility - Add simple crawl completion utility that reports count of fetched and unfetched pages per domain or host. - Update "nutch" helper scrip

[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread MJJoyce
GitHub user MJJoyce commented on a diff in the pull request: https://github.com/apache/nutch/pull/83#discussion_r43324656 --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java --- @@ -0,0 +1,189 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread MJJoyce
GitHub user MJJoyce commented on a diff in the pull request: https://github.com/apache/nutch/pull/83#discussion_r43325287 --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java --- @@ -0,0 +1,189 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] nutch pull request: NUTCH-2150 - Add protocolstats utility

2015-10-27 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/82 NUTCH-2150 - Add protocolstats utility - Add ProtocolStatusStatistics utility that will aggregate protocol status code information from a crawl database. - Update nutch helper script

[GitHub] nutch pull request: NUTCH-2150 - Add protocolstats utility

2015-10-27 Thread MJJoyce
GitHub user MJJoyce commented on a diff in the pull request: https://github.com/apache/nutch/pull/82#discussion_r43189357 --- Diff: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java --- @@ -0,0 +1,161 @@ +/** + * Licensed to the Apache Software Foundation (ASF

[GitHub] nutch pull request: NUTCH-2129 - Add protocol status tracking to c...

2015-09-30 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/68 NUTCH-2129 - Add protocol status tracking to crawl datum - Add protocol status information to CrawlDatum. - Bump CrawlDatum version number. - Save HTTP Protocol status in CrawlDatum in lib

[GitHub] nutch pull request: NUTCH-2115 - Add total counts to mimetype stat...

2015-09-23 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/65 NUTCH-2115 - Add total counts to mimetype stats You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-2115 Alternatively you can

[GitHub] nutch pull request: NUTCH-2077 - Update to Tika 1.10

2015-08-28 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/52 NUTCH-2077 - Update to Tika 1.10 You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-2077 Alternatively you can review and apply

[GitHub] nutch pull request: NUTCH-2088 - Add URL Processing Check to Inter...

2015-08-28 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/53 NUTCH-2088 - Add URL Processing Check to Interactive Selenium Handlers - Add shouldProcessURL to Handler interface. Handlers may now check URLs to determine if they should interact with them prior

[GitHub] nutch pull request: NUTCH-2062 - Interactive Selenium Plugin

2015-07-21 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/46 NUTCH-2062 - Interactive Selenium Plugin - Extend lib-selenium to allow for external interaction with the WebDriver. - Add Interactive Selenium plugin so users can create a Selenium Handler

[GitHub] nutch pull request: NUTCH-1906 - Remove duplicate stats flag listi...

2015-04-17 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/20 NUTCH-1906 - Remove duplicate stats flag listing in readdb help You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-1906

[GitHub] nutch pull request: NUTCH-1911 - Make domainstatics help info a sm...

2015-04-17 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/21 NUTCH-1911 - Make domainstatics help info a smidge more helpful You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-1911

[GitHub] nutch pull request: NUTCH-1986 - Update and clarify default Elasti...

2015-04-17 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/17 NUTCH-1986 - Update and clarify default Elasticsearch conf values - Host value is now defaulted to 'localhost'. - Update port description to make it apparent that 9300 is more likely

[GitHub] nutch pull request: NUTCH-1987 - Make bin/crawl indexer agnostic

2015-04-17 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/18 NUTCH-1987 - Make bin/crawl indexer agnostic - Add solr.server.url property to nutch-default and set to value consistent with URL used in the Nutch Tutorial. - Change SOLRURL references

[GitHub] nutch pull request: NUTCH-1988 - Add optional flat directory flag ...

2015-04-17 Thread MJJoyce
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/19 NUTCH-1988 - Add optional flat directory flag to dump command - Add optional flatdir flag to dump command so that a user can dump their crawl data to a flat directory instead of the nested