GitHub user MJJoyce closed the pull request at:
https://github.com/apache/nutch/pull/91
GitHub user MJJoyce commented on a diff in the pull request:
https://github.com/apache/nutch/pull/88#discussion_r53254383
--- Diff: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java ---
@@ -256,6 +252,11 @@ public HttpResponse(HttpBase http, URL
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/91
NUTCH-2218 - Update CrawlComplete util with Commons CLI arg parsing
- Switch all argument parsing and checking to commons CLI.
- Update input directory processing such that the 'crawldb' folder
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/83
NUTCH-2155 - Add crawl completion utility
- Add simple crawl completion utility that reports count of fetched and
unfetched pages per domain or host.
- Update "nutch" helper scrip
GitHub user MJJoyce commented on a diff in the pull request:
https://github.com/apache/nutch/pull/83#discussion_r43324656
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under
GitHub user MJJoyce commented on a diff in the pull request:
https://github.com/apache/nutch/pull/83#discussion_r43325287
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/82
NUTCH-2150 - Add protocolstats utility
- Add ProtocolStatusStatistics utility that will aggregate protocol
status code information from a crawl database.
- Update nutch helper script
GitHub user MJJoyce commented on a diff in the pull request:
https://github.com/apache/nutch/pull/82#discussion_r43189357
--- Diff: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java ---
@@ -0,0 +1,161 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/68
NUTCH-2129 - Add protocol status tracking to crawl datum
- Add protocol status information to CrawlDatum.
- Bump CrawlDatum version number.
- Save HTTP Protocol status in CrawlDatum in lib
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/65
NUTCH-2115 - Add total counts to mimetype stats
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MJJoyce/nutch NUTCH-2115
Alternatively you can
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/52
NUTCH-2077 - Update to Tika 1.10
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MJJoyce/nutch NUTCH-2077
Alternatively you can review and apply
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/53
NUTCH-2088 - Add URL Processing Check to Interactive Selenium Handlers
- Add shouldProcessURL to Handler interface. Handlers may now check URLs to
determine if they should interact with them prior
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/46
NUTCH-2062 - Interactive Selenium Plugin
- Extend lib-selenium to allow for external interaction with the WebDriver.
- Add Interactive Selenium plugin so users can create a Selenium Handler
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/20
NUTCH-1906 - Remove duplicate stats flag listing in readdb help
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MJJoyce/nutch NUTCH-1906
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/21
NUTCH-1911 - Make domainstatics help info a smidge more helpful
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MJJoyce/nutch NUTCH-1911
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/17
NUTCH-1986 - Update and clarify default Elasticsearch conf values
- Host value is now defaulted to 'localhost'.
- Update port description to make it apparent that 9300 is more likely
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/18
NUTCH-1987 - Make bin/crawl indexer agnostic
- Add solr.server.url property to nutch-default and set to value
consistent with URL used in the Nutch Tutorial.
- Change SOLRURL references
GitHub user MJJoyce opened a pull request:
https://github.com/apache/nutch/pull/19
NUTCH-1988 - Add optional flat directory flag to dump command
- Add optional flatdir flag to dump command so that a user can dump
their crawl data to a flat directory instead of the nested