[jira] [Commented] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-17 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151475#comment-15151475 ] ASF GitHub Bot commented on NUTCH-2213: --- Github user MJJoyce commented on a diff in the pull

[GitHub] nutch pull request: NUTCH-2213 : do not store the headers verbatim...

2016-02-17 Thread MJJoyce
Github user MJJoyce commented on a diff in the pull request: https://github.com/apache/nutch/pull/88#discussion_r53254383 --- Diff: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java --- @@ -256,6 +252,11 @@ public HttpResponse(HttpBase http, URL

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Karanjeet Singh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151183#comment-15151183 ] Karanjeet Singh commented on NUTCH-2191: Sure, [~markus17]. I tried to integrate your patch and

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151154#comment-15151154 ] Markus Jelsma commented on NUTCH-2191: -- Hi Karanjeet - looks like the only changes you made are in

[jira] [Issue Comment Deleted] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Karanjeet Singh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karanjeet Singh updated NUTCH-2191: --- Comment: was deleted (was: Updated patch to include HtmlUnit from Selenium library. This

[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Karanjeet Singh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karanjeet Singh updated NUTCH-2191: --- Attachment: (was: NUTCH-2191.patch) > Add protocol-htmlunit > - > >

[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Karanjeet Singh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karanjeet Singh updated NUTCH-2191: --- Attachment: NUTCH-2191.patch > Add protocol-htmlunit > - > >

[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Karanjeet Singh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karanjeet Singh updated NUTCH-2191: --- Attachment: NUTCH-2191.patch Updated patch to include HtmlUnit from Selenium library. This

[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-17 Thread Karanjeet Singh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151136#comment-15151136 ] Karanjeet Singh commented on NUTCH-2191: Updated patch to include HtmlUnit from Selenium library.

How to extract only body

2016-02-17 Thread Zara Parst
Hi everybody, I am trying to build search for my own website, using Nutch and Solr. The problem with Nutch is that the HTML parser seems to be a flat parser that concatenates everything (title, meta tags, body) into one single content field, which is not the search result I want. Is it

[jira] [Commented] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150306#comment-15150306 ] Hudson commented on NUTCH-2223: --- FAILURE: Integrated in Nutch-trunk #3348 (See

[jira] [Commented] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150307#comment-15150307 ] Hudson commented on NUTCH-2224: --- FAILURE: Integrated in Nutch-trunk #3348 (See

[jira] [Commented] (NUTCH-2225) Parsed time calculated incorrectly

2016-02-17 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150308#comment-15150308 ] Hudson commented on NUTCH-2225: --- FAILURE: Integrated in Nutch-trunk #3348 (See

Build failed in Jenkins: Nutch-trunk #3348

2016-02-17 Thread Apache Jenkins Server
See Changes: [markus] NUTCH-2223 Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection [markus] NUTCH-2224 Average bytes/second calculated incorrectly in fetcher [markus] NUTCH-2225 Parsed time calculated

[jira] [Resolved] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2223. -- Resolution: Fixed Committed to trunk in revision 1730808. > Upgrade xercesImpl to 2.11.0 to

[jira] [Commented] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150264#comment-15150264 ] Markus Jelsma commented on NUTCH-2223: -- Thanks Tien Nguyen Manh! > Upgrade xercesImpl to 2.11.0 to

[jira] [Commented] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150248#comment-15150248 ] Markus Jelsma commented on NUTCH-2223: -- Incredible, i tried the tika-breaker.html file in the linked

[jira] [Assigned] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2223: Assignee: Markus Jelsma > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Priority: Major (was: Minor) > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Description: Stack trace for the hang seems to be: {code} at

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Fix Version/s: 1.12 > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2223: - Description: {code}Stack trace for the hang seems to be: at

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Component/s: fetcher > Average bytes/second calculated incorrectly in fetcher >

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Affects Version/s: 1.11 > Average bytes/second calculated incorrectly in fetcher >

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Fix Version/s: 1.12 > Average bytes/second calculated incorrectly in fetcher >

[jira] [Resolved] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2224. -- Resolution: Fixed Committed to trunk in revision 1730803. Thanks Tien Nguyen Manh! > Average

[jira] [Updated] (NUTCH-2224) Average bytes/second calculated incorrectly in fetcher

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2224: - Summary: Average bytes/second calculated incorrectly in fetcher (was: Wrong metric compute in

[jira] [Assigned] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2224: Assignee: Markus Jelsma > Wrong metric compute in Fetcher status report >

[jira] [Resolved] (NUTCH-2225) Parsed time calculated incorrectly

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2225. -- Resolution: Fixed Committed to trunk in revision 1730802. Thanks Tien Nguyen Manh! > Parsed

[jira] [Updated] (NUTCH-2225) Parsed time calculated incorrectly

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2225: - Summary: Parsed time calculated incorrectly (was: Parsed time not include time to parse) >

[jira] [Assigned] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2225: Assignee: Markus Jelsma > Parsed time not include time to parse >

[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2225: - Affects Version/s: 1.11 > Parsed time not include time to parse >

[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2225: - Fix Version/s: 1.12 > Parsed time not include time to parse >

[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2225: Attachment: NUTCH-2225.patch > Parsed time not include time to parse >

[jira] [Created] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2225: --- Summary: Parsed time not include time to parse Key: NUTCH-2225 URL: https://issues.apache.org/jira/browse/NUTCH-2225 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2224: Attachment: NUTCH-2224.patch > Wrong metric compute in Fetcher status report >

[jira] [Created] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2224: --- Summary: Wrong metric compute in Fetcher status report Key: NUTCH-2224 URL: https://issues.apache.org/jira/browse/NUTCH-2224 Project: Nutch Issue

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Attachment: NUTCH-2223.patch Patch for nutch 1.11 > Upgrade xercesImpl to 2.11.0 to fix

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Fix Version/s: 1.13 > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype

[jira] [Created] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2223: --- Summary: Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection Key: NUTCH-2223 URL: https://issues.apache.org/jira/browse/NUTCH-2223

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Fix Version/s: (was: 1.13) > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika