Yossi Tamari created NUTCH-2383:
---
Summary: Wrong FS exception in Fetcher
Key: NUTCH-2383
URL: https://issues.apache.org/jira/browse/NUTCH-2383
Project: Nutch
Issue Type: Bug
Component
[
https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992887#comment-15992887
]
Yossi Tamari edited comment on NUTCH-2383 at 5/2/17 1:28 PM:
-
[
https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2383:
Attachment: crawl output.txt
The output of the crawl job, with {code}set -x{code}.
> Wrong FS excep
[
https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2383:
Environment: Hadoop 2.8 and Hadoop 2.7.2 (was: Hadoop 2.8 (Not tested yet
in 2.7.2))
Descriptio
[
https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994709#comment-15994709
]
Yossi Tamari commented on NUTCH-2383:
-
Setting the MapReduce framework to YARN solved
[
https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996954#comment-15996954
]
Yossi Tamari commented on NUTCH-2383:
-
Yes.
> Wrong FS exception in Fetcher
> ---
Yossi Tamari created NUTCH-2399:
---
Summary: indexer-elastic does not index multi-value fields (only
the first value is indexed)
Key: NUTCH-2399
URL: https://issues.apache.org/jira/browse/NUTCH-2399
Proje
Yossi Tamari created NUTCH-2414:
---
Summary: Allow LanguageIndexingFilter to actually filter documents
by language.
Key: NUTCH-2414
URL: https://issues.apache.org/jira/browse/NUTCH-2414
Project: Nutch
[
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144330#comment-16144330
]
Yossi Tamari commented on NUTCH-2414:
-
Markus, if I understand correctly, there are tw
Yossi Tamari created NUTCH-2415:
---
Summary: Create a JEXL based IndexingFilter
Key: NUTCH-2415
URL: https://issues.apache.org/jira/browse/NUTCH-2415
Project: Nutch
Issue Type: New Feature
Yossi Tamari created NUTCH-2448:
---
Summary: Allow Sending an empty http.agent.version
Key: NUTCH-2448
URL: https://issues.apache.org/jira/browse/NUTCH-2448
Project: Nutch
Issue Type: Bug
Yossi Tamari created NUTCH-2449:
---
Summary: Usage of Tika LanguageIdentifier in language-identifier
plugin
Key: NUTCH-2449
URL: https://issues.apache.org/jira/browse/NUTCH-2449
Project: Nutch
I
[
https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217004#comment-16217004
]
Yossi Tamari commented on NUTCH-2449:
-
Since in Tika LanguageIdentifier was superseded
Yossi Tamari created NUTCH-2456:
---
Summary: Redirected documents are not indexed
Key: NUTCH-2456
URL: https://issues.apache.org/jira/browse/NUTCH-2456
Project: Nutch
Issue Type: Bug
Co
[
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2456:
Description:
If http.redirect.max is set to a positive value, the Fetcher will follow
redirects, cr
[
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242295#comment-16242295
]
Yossi Tamari commented on NUTCH-2456:
-
db.update.additions.allowed is set to false, wh
[
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242304#comment-16242304
]
Yossi Tamari commented on NUTCH-2456:
-
BTW, I submitted a PR that tries to be a minima
[
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2456:
Summary: Allow to index pages/URLs not contained in CrawlDb (was:
Redirected documents are not inde
[
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242491#comment-16242491
]
Yossi Tamari commented on NUTCH-2456:
-
Updated the title as you suggested.
It seems to
Yossi Tamari created NUTCH-2463:
---
Summary: Enable sampling CrawlDB
Key: NUTCH-2463
URL: https://issues.apache.org/jira/browse/NUTCH-2463
Project: Nutch
Issue Type: Improvement
Compone
Yossi Tamari created NUTCH-2489:
---
Summary: Dependency collision with lucene-analyzers-common in
scoring-similarity plugin
Key: NUTCH-2489
URL: https://issues.apache.org/jira/browse/NUTCH-2489
Project: N
[
https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2489:
Attachment: ivy.xml.patch
This can be fixed by removing in ivy.xml:
{code:java}
conf="*->default"
{
[
https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305594#comment-16305594
]
Yossi Tamari edited comment on NUTCH-2489 at 12/28/17 5:28 PM:
-
[
https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305636#comment-16305636
]
Yossi Tamari commented on NUTCH-2489:
-
Upgrading the dependency to lucene-analyzers-co
[
https://issues.apache.org/jira/browse/NUTCH-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355507#comment-16355507
]
Yossi Tamari commented on NUTCH-2489:
-
Since nobody has suggested a better fix in over
Yossi Tamari created NUTCH-2509:
---
Summary: Inconsistent behavior in SitemapProcessor
Key: NUTCH-2509
URL: https://issues.apache.org/jira/browse/NUTCH-2509
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2509:
Attachment: SitemapProcessor.patch
> Inconsistent behavior in SitemapProcessor
> ---
Yossi Tamari created NUTCH-2511:
---
Summary: SitemapProcessor limited by http.content.limit
Key: NUTCH-2511
URL: https://issues.apache.org/jira/browse/NUTCH-2511
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369299#comment-16369299
]
Yossi Tamari commented on NUTCH-2511:
-
The best solution I can see for this is to create
Yossi Tamari created NUTCH-2523:
---
Summary: UpdateHostDB blocks plugins unintenionally
Key: NUTCH-2523
URL: https://issues.apache.org/jira/browse/NUTCH-2523
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2523:
Attachment: NUTCH-2523.tamari.180305.patch.txt
> UpdateHostDB blocks plugins unintenionally
> --
[
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409906#comment-16409906
]
Yossi Tamari commented on NUTCH-1741:
-
Just wanted to add to Sebastian's comment above
[
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409906#comment-16409906
]
Yossi Tamari edited comment on NUTCH-1741 at 3/22/18 5:08 PM:
--
[
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479006#comment-16479006
]
Yossi Tamari commented on NUTCH-2578:
-
That would require MimeUtil to stay thread-safe
[
https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479176#comment-16479176
]
Yossi Tamari commented on NUTCH-2578:
-
Hi [~wastl-nagel], maybe I'm missing something,
[
https://issues.apache.org/jira/browse/NUTCH-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517182#comment-16517182
]
Yossi Tamari commented on NUTCH-2594:
-
Hi [~roannel], I think the Mapping section nee
Yossi Tamari created NUTCH-2611:
---
Summary: Add line-breaks when parsing HTML block-level elements
Key: NUTCH-2611
URL: https://issues.apache.org/jira/browse/NUTCH-2611
Project: Nutch
Issue Type
[
https://issues.apache.org/jira/browse/NUTCH-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522186#comment-16522186
]
Yossi Tamari commented on NUTCH-2611:
-
I will submit a PR that optionally adds a new-
[
https://issues.apache.org/jira/browse/NUTCH-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522236#comment-16522236
]
Yossi Tamari commented on NUTCH-2611:
-
[~wastl-nagel] I forgot it too :) at least I'm
[
https://issues.apache.org/jira/browse/NUTCH-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553005#comment-16553005
]
Yossi Tamari commented on NUTCH-2624:
-
Hi [~wastl-nagel], I suspect it's the response
[
https://issues.apache.org/jira/browse/NUTCH-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553005#comment-16553005
]
Yossi Tamari edited comment on NUTCH-2624 at 7/23/18 3:37 PM:
-
[
https://issues.apache.org/jira/browse/NUTCH-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554026#comment-16554026
]
Yossi Tamari commented on NUTCH-2624:
-
(y)
> protocol-okhttp resource leak
> ---
[
https://issues.apache.org/jira/browse/NUTCH-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593916#comment-16593916
]
Yossi Tamari commented on NUTCH-1861:
-
Hi Lewis,
I have some questions:
# Isn't S
[
https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612132#comment-16612132
]
Yossi Tamari commented on NUTCH-2644:
-
[~wastl-nagel], Isn't this a much wider issue?
[
https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2644:
Comment: was deleted
(was: [~wastl-nagel], Isn't this a much wider issue?
For example, I think it
[
https://issues.apache.org/jira/browse/NUTCH-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613205#comment-16613205
]
Yossi Tamari commented on NUTCH-2644:
-
Hi [~wastl-nagel], the reason I deleted the co
[
https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649315#comment-16649315
]
Yossi Tamari commented on NUTCH-1842:
-
While I agree it is a minor issue for 2.X, I t
[
https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649315#comment-16649315
]
Yossi Tamari edited comment on NUTCH-1842 at 10/14/18 9:42 AM:
[
https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649550#comment-16649550
]
Yossi Tamari commented on NUTCH-1842:
-
[~wastl-nagel], I agree. The reason I chose to
[
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653368#comment-16653368
]
Yossi Tamari commented on NUTCH-2658:
-
I disagree regarding putting the documentation
[
https://issues.apache.org/jira/browse/NUTCH-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655082#comment-16655082
]
Yossi Tamari commented on NUTCH-2662:
-
[~jorgelbg], this goes against your own review
[
https://issues.apache.org/jira/browse/NUTCH-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667107#comment-16667107
]
Yossi Tamari commented on NUTCH-2670:
-
This is reproducible. nutch-site and nutch-def
[
https://issues.apache.org/jira/browse/NUTCH-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719008#comment-16719008
]
Yossi Tamari commented on NUTCH-2679:
-
I'm not sure what the problem here is, but Ant
Yossi Tamari created NUTCH-2691:
---
Summary: Improve logging from scoring-depth plugin
Key: NUTCH-2691
URL: https://issues.apache.org/jira/browse/NUTCH-2691
Project: Nutch
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2691:
Description:
Currently the scoring-depth plugin emits a "Missing depth, removing all
outlinks from
Yossi Tamari created NUTCH-2715:
---
Summary: WARCExporter fails on large records
Key: NUTCH-2715
URL: https://issues.apache.org/jira/browse/NUTCH-2715
Project: Nutch
Issue Type: Bug
Affects V
Yossi Tamari created NUTCH-2716:
---
Summary: Response headers are not stored for a compressed response
Key: NUTCH-2716
URL: https://issues.apache.org/jira/browse/NUTCH-2716
Project: Nutch
Issue T
[
https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833888#comment-16833888
]
Yossi Tamari commented on NUTCH-2716:
-
Actually, I meant to remove those two headers
[
https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833917#comment-16833917
]
Yossi Tamari commented on NUTCH-2716:
-
OK, I'll try to submit a patch tomorrow along
[
https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yossi Tamari updated NUTCH-2716:
Description:
Even when store.http.headers=true, the HTTP headers are not saved for a gzipped
or de
[
https://issues.apache.org/jira/browse/NUTCH-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834751#comment-16834751
]
Yossi Tamari commented on NUTCH-2715:
-
It seems to me like the commoncrawldump plugin
[
https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835441#comment-16835441
]
Yossi Tamari commented on NUTCH-2716:
-
I've submitted a pull request for this, it see
[
https://issues.apache.org/jira/browse/NUTCH-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945355#comment-16945355
]
Yossi Tamari commented on NUTCH-2742:
-
[~Mark A] The important line in crawl.log is:
[
https://issues.apache.org/jira/browse/NUTCH-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948710#comment-16948710
]
Yossi Tamari commented on NUTCH-2511:
-
Opened a PR for this: [https://github.com/apac
64 matches