[jira] [Created] (NUTCH-2383) Wrong FS exception in Fetcher

2017-05-02 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2383: --- Summary: Wrong FS exception in Fetcher Key: NUTCH-2383 URL: https://issues.apache.org/jira/browse/NUTCH-2383 Project: Nutch Issue Type: Bug Component

[jira] [Comment Edited] (NUTCH-2383) Wrong FS exception in Fetcher

2017-05-02 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992887#comment-15992887 ] Yossi Tamari edited comment on NUTCH-2383 at 5/2/17 1:28 PM: -

[jira] [Updated] (NUTCH-2383) Wrong FS exception in Fetcher

2017-05-02 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2383: Attachment: crawl output.txt The output of the crawl job, with {code}set -x{code}. > Wrong FS excep

[jira] [Updated] (NUTCH-2383) Wrong FS exception in Fetcher

2017-05-03 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2383: Environment: Hadoop 2.8 and Hadoop 2.7.2 (was: Hadoop 2.8 (Not tested yet in 2.7.2)) Descriptio

[jira] [Commented] (NUTCH-2383) Wrong FS exception in Fetcher

2017-05-03 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994709#comment-15994709 ] Yossi Tamari commented on NUTCH-2383: - Setting the MapReduce framework to YARN solved

[jira] [Commented] (NUTCH-2383) Wrong FS exception in Fetcher

2017-05-04 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996954#comment-15996954 ] Yossi Tamari commented on NUTCH-2383: - Yes. > Wrong FS exception in Fetcher > ---

[jira] [Created] (NUTCH-2399) indexer-elastic does not index multi-value fields (only the first value is indexed)

2017-07-09 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2399: --- Summary: indexer-elastic does not index multi-value fields (only the first value is indexed) Key: NUTCH-2399 URL: https://issues.apache.org/jira/browse/NUTCH-2399 Proje

[jira] [Created] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2414: --- Summary: Allow LanguageIndexingFilter to actually filter documents by language. Key: NUTCH-2414 URL: https://issues.apache.org/jira/browse/NUTCH-2414 Project: Nutch

[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144330#comment-16144330 ] Yossi Tamari commented on NUTCH-2414: - Markus, if I understand correctly, there are tw

[jira] [Created] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-29 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2415: --- Summary: Create a JEXL based IndexingFilter Key: NUTCH-2415 URL: https://issues.apache.org/jira/browse/NUTCH-2415 Project: Nutch Issue Type: New Feature

[jira] [Created] (NUTCH-2448) Allow Sending an empty http.agent.version

2017-10-23 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2448: --- Summary: Allow Sending an empty http.agent.version Key: NUTCH-2448 URL: https://issues.apache.org/jira/browse/NUTCH-2448 Project: Nutch Issue Type: Bug

[jira] [Created] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2449: --- Summary: Usage of Tika LanguageIdentifier in language-identifier plugin Key: NUTCH-2449 URL: https://issues.apache.org/jira/browse/NUTCH-2449 Project: Nutch I

[jira] [Commented] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217004#comment-16217004 ] Yossi Tamari commented on NUTCH-2449: - Since in Tika LanguageIdentifier was superseded

[jira] [Created] (NUTCH-2456) Redirected documents are not indexed

2017-11-06 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2456: --- Summary: Redirected documents are not indexed Key: NUTCH-2456 URL: https://issues.apache.org/jira/browse/NUTCH-2456 Project: Nutch Issue Type: Bug Co

[jira] [Updated] (NUTCH-2456) Redirected documents are not indexed

2017-11-06 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2456: Description: If http.redirect.max is set to a positive value, the Fetcher will follow redirects, cr

[jira] [Commented] (NUTCH-2456) Redirected documents are not indexed

2017-11-07 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242295#comment-16242295 ] Yossi Tamari commented on NUTCH-2456: - db.update.additions.allowed is set to false, wh

[jira] [Commented] (NUTCH-2456) Redirected documents are not indexed

2017-11-07 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242304#comment-16242304 ] Yossi Tamari commented on NUTCH-2456: - BTW, I submitted a PR that tries to be a minima

[jira] [Updated] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb

2017-11-07 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2456: Summary: Allow to index pages/URLs not contained in CrawlDb (was: Redirected documents are not inde

[jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb

2017-11-07 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242491#comment-16242491 ] Yossi Tamari commented on NUTCH-2456: - Updated the title as you suggested. It seems to

[jira] [Created] (NUTCH-2463) Enable sampling CrawlDB

2017-11-20 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2463: --- Summary: Enable sampling CrawlDB Key: NUTCH-2463 URL: https://issues.apache.org/jira/browse/NUTCH-2463 Project: Nutch Issue Type: Improvement Compone

[jira] [Created] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin

2017-12-28 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2489: --- Summary: Dependency collision with lucene-analyzers-common in scoring-similarity plugin Key: NUTCH-2489 URL: https://issues.apache.org/jira/browse/NUTCH-2489 Project: N

[jira] [Updated] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin

2017-12-28 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2489: Attachment: ivy.xml.patch This can be fixed by removing in ivy.xml: {code:java} conf="*->default" {

[jira] [Comment Edited] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin

2017-12-28 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305594#comment-16305594 ] Yossi Tamari edited comment on NUTCH-2489 at 12/28/17 5:28 PM: -

[jira] [Commented] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin

2017-12-28 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305636#comment-16305636 ] Yossi Tamari commented on NUTCH-2489: - Upgrading the dependency to lucene-analyzers-co

[jira] [Commented] (NUTCH-2489) Dependency collision with lucene-analyzers-common in scoring-similarity plugin

2018-02-07 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355507#comment-16355507 ] Yossi Tamari commented on NUTCH-2489: - Since nobody has suggested a better fix in over

[jira] [Created] (NUTCH-2509) Inconsistent behavior in SitemapProcessor

2018-02-13 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2509: --- Summary: Inconsistent behavior in SitemapProcessor Key: NUTCH-2509 URL: https://issues.apache.org/jira/browse/NUTCH-2509 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2509) Inconsistent behavior in SitemapProcessor

2018-02-13 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2509: Attachment: SitemapProcessor.patch > Inconsistent behavior in SitemapProcessor > ---

[jira] [Created] (NUTCH-2511) SitemapProcessor limited by http.content.limit

2018-02-19 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2511: --- Summary: SitemapProcessor limited by http.content.limit Key: NUTCH-2511 URL: https://issues.apache.org/jira/browse/NUTCH-2511 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-2511) SitemapProcessor limited by http.content.limit

2018-02-19 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369299#comment-16369299 ] Yossi Tamari commented on NUTCH-2511: - The best solution I can see for this is to create

[jira] [Created] (NUTCH-2523) UpdateHostDB blocks plugins unintenionally

2018-03-05 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2523: --- Summary: UpdateHostDB blocks plugins unintenionally Key: NUTCH-2523 URL: https://issues.apache.org/jira/browse/NUTCH-2523 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2523) UpdateHostDB blocks plugins unintenionally

2018-03-05 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2523: Attachment: NUTCH-2523.tamari.180305.patch.txt > UpdateHostDB blocks plugins unintenionally > --

[jira] [Commented] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409906#comment-16409906 ] Yossi Tamari commented on NUTCH-1741: - Just wanted to add to Sebastian's comment above

[jira] [Comment Edited] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2018-03-22 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409906#comment-16409906 ] Yossi Tamari edited comment on NUTCH-1741 at 3/22/18 5:08 PM: --

[jira] [Commented] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-17 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479006#comment-16479006 ] Yossi Tamari commented on NUTCH-2578: - That would require MimeUtil to stay thread-safe

[jira] [Commented] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

2018-05-17 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479176#comment-16479176 ] Yossi Tamari commented on NUTCH-2578: - Hi [~wastl-nagel], maybe I'm missing something,

[jira] [Commented] (NUTCH-2594) Documentation for indexer plugins

2018-06-19 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517182#comment-16517182 ] Yossi Tamari commented on NUTCH-2594: - Hi [~roannel], I think the Mapping section nee

[jira] [Created] (NUTCH-2611) Add line-breaks when parsing HTML block-level elements

2018-06-25 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2611: --- Summary: Add line-breaks when parsing HTML block-level elements Key: NUTCH-2611 URL: https://issues.apache.org/jira/browse/NUTCH-2611 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-2611) Add line-breaks when parsing HTML block-level elements

2018-06-25 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522186#comment-16522186 ] Yossi Tamari commented on NUTCH-2611: - I will submit a PR that optionally adds a new-

[jira] [Commented] (NUTCH-2611) Add line-breaks when parsing HTML block-level elements

2018-06-25 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522236#comment-16522236 ] Yossi Tamari commented on NUTCH-2611: - [~wastl-nagel] I forgot it too :) at least I'm

[jira] [Commented] (NUTCH-2624) protocol-okhttp resource leak

2018-07-23 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553005#comment-16553005 ] Yossi Tamari commented on NUTCH-2624: - Hi [~wastl-nagel], I suspect it's the response

[jira] [Comment Edited] (NUTCH-2624) protocol-okhttp resource leak

2018-07-23 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553005#comment-16553005 ] Yossi Tamari edited comment on NUTCH-2624 at 7/23/18 3:37 PM: -

[jira] [Commented] (NUTCH-2624) protocol-okhttp resource leak

2018-07-24 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554026#comment-16554026 ] Yossi Tamari commented on NUTCH-2624: - (y) > protocol-okhttp resource leak > ---

[jira] [Commented] (NUTCH-1861) Implement POP3 Protocol

2018-08-27 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593916#comment-16593916 ] Yossi Tamari commented on NUTCH-1861: - Hi Lewis, I have some questions: # Isn't S

[jira] [Commented] (NUTCH-2644) CrawlDbReader -dump ignores filter options

2018-09-12 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612132#comment-16612132 ] Yossi Tamari commented on NUTCH-2644: - [~wastl-nagel], Isn't this a much wider issue?

[jira] [Issue Comment Deleted] (NUTCH-2644) CrawlDbReader -dump ignores filter options

2018-09-12 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2644: Comment: was deleted (was: [~wastl-nagel], Isn't this a much wider issue? For example, I think it

[jira] [Commented] (NUTCH-2644) CrawlDbReader -dump ignores filter options

2018-09-13 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613205#comment-16613205 ] Yossi Tamari commented on NUTCH-2644: - Hi [~wastl-nagel], the reason I deleted the co

[jira] [Commented] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly

2018-10-14 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649315#comment-16649315 ] Yossi Tamari commented on NUTCH-1842: - While I agree it is a minor issue for 2.X, I t

[jira] [Comment Edited] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly

2018-10-14 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649315#comment-16649315 ] Yossi Tamari edited comment on NUTCH-1842 at 10/14/18 9:42 AM:

[jira] [Commented] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly

2018-10-14 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649550#comment-16649550 ] Yossi Tamari commented on NUTCH-1842: - [~wastl-nagel], I agree. The reason I chose to

[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653368#comment-16653368 ] Yossi Tamari commented on NUTCH-2658: - I disagree regarding putting the documentation

[jira] [Commented] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured

2018-10-18 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655082#comment-16655082 ] Yossi Tamari commented on NUTCH-2662: - [~jorgelbg], this goes against your own review

[jira] [Commented] (NUTCH-2670) org.apache.nutch.indexer.IndexerMapReduce does not read the value of "indexer.delete" from nutch-site.xml

2018-10-29 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667107#comment-16667107 ] Yossi Tamari commented on NUTCH-2670: - This is reproducible. nutch-site and nutch-def

[jira] [Commented] (NUTCH-2679) "ant eclipse" failed as eclipse binary is moved

2018-12-12 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719008#comment-16719008 ] Yossi Tamari commented on NUTCH-2679: - I'm not sure what the problem here is, but Ant

[jira] [Created] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-22 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2691: --- Summary: Improve logging from scoring-depth plugin Key: NUTCH-2691 URL: https://issues.apache.org/jira/browse/NUTCH-2691 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2691) Improve logging from scoring-depth plugin

2019-01-22 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2691: Description: Currently the scoring-depth plugin emits a "Missing depth, removing all outlinks from

[jira] [Created] (NUTCH-2715) WARCExporter fails on large records

2019-05-06 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2715: --- Summary: WARCExporter fails on large records Key: NUTCH-2715 URL: https://issues.apache.org/jira/browse/NUTCH-2715 Project: Nutch Issue Type: Bug Affects V

[jira] [Created] (NUTCH-2716) Response headers are not stored for a compressed response

2019-05-06 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2716: --- Summary: Response headers are not stored for a compressed response Key: NUTCH-2716 URL: https://issues.apache.org/jira/browse/NUTCH-2716 Project: Nutch Issue T

[jira] [Commented] (NUTCH-2716) protocol-http: Response headers are not stored for a compressed response

2019-05-06 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833888#comment-16833888 ] Yossi Tamari commented on NUTCH-2716: - Actually, I meant to remove those two headers

[jira] [Commented] (NUTCH-2716) protocol-http: Response headers are not stored for a compressed response

2019-05-06 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833917#comment-16833917 ] Yossi Tamari commented on NUTCH-2716: - OK, I'll try to submit a patch tomorrow along

[jira] [Updated] (NUTCH-2716) protocol-http: Response headers are not stored for a compressed response

2019-05-07 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yossi Tamari updated NUTCH-2716: Description: Even when store.http.headers=true, the HTTP headers are not saved for a gzipped or de

[jira] [Commented] (NUTCH-2715) WARCExporter fails on large records

2019-05-07 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834751#comment-16834751 ] Yossi Tamari commented on NUTCH-2715: - It seems to me like the commoncrawldump plugin

[jira] [Commented] (NUTCH-2716) protocol-http: Response headers are not stored for a compressed response

2019-05-08 Thread Yossi Tamari (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835441#comment-16835441 ] Yossi Tamari commented on NUTCH-2716: - I've submitted a pull request for this, it see

[jira] [Commented] (NUTCH-2742) Unable to parse specific pdf file

2019-10-06 Thread Yossi Tamari (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945355#comment-16945355 ] Yossi Tamari commented on NUTCH-2742: - [~Mark A] The important line in crawl.log is:

[jira] [Commented] (NUTCH-2511) SitemapProcessor limited by http.content.limit

2019-10-10 Thread Yossi Tamari (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948710#comment-16948710 ] Yossi Tamari commented on NUTCH-2511: - Opened a PR for this: [https://github.com/apac