Update jakarta poi jars to the most relevant version
Key: NUTCH-691
URL: https://issues.apache.org/jira/browse/NUTCH-691
Project: Nutch
Issue Type: Improvement
Components:
[
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-691:
--
Attachment: NUTCH-691-v1-test.patch
cd nutch;
Update jakarta poi jars to the most relevant
[
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-691:
--
Attachment: NUTCH-691-v1-test.patch
Update jakarta poi jars to the most relevant version
[
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674468#action_12674468
]
Dmitry Lihachev commented on NUTCH-691:
---
Steps to reproduce NUTCH-591 (you must have
[
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674468#action_12674468
]
dmitry.lihachev edited comment on NUTCH-691 at 2/17/09 9:39 PM:
[
https://issues.apache.org/jira/browse/NUTCH-591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674478#action_12674478
]
Dmitry Lihachev commented on NUTCH-591:
---
can be resolved via NUTCH-691
[
https://issues.apache.org/jira/browse/NUTCH-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-691:
--
Remaining Estimate: 0.25h
Original Estimate: 0.25h
Update jakarta poi jars to the most
incorrect mime type detection by MoreIndexingFilter plugin
--
Key: NUTCH-695
URL: https://issues.apache.org/jira/browse/NUTCH-695
Project: Nutch
Issue Type: Bug
Components:
[
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-695:
--
Attachment: NUTCH-695_MoreIndexingFilter.patch
Test case for this bug
incorrect mime type
[
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-695:
--
Attachment: NUTCH-695_MoreIndexingFilter.patch
This patch fixes this bug
incorrect mime type
[
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-695:
--
Attachment: (was: NUTCH-695_MoreIndexingFilter.patch)
incorrect mime type detection by
[
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-695:
--
Attachment: NUTCH-695_TestMoreIndexingFilter.patch
incorrect mime type detection by
[
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674952#action_12674952
]
dmitry.lihachev edited comment on NUTCH-695 at 2/19/09 2:15 AM:
[
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674953#action_12674953
]
dmitry.lihachev edited comment on NUTCH-695 at 2/19/09 2:16 AM:
[
https://issues.apache.org/jira/browse/NUTCH-695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674956#action_12674956
]
Dmitry Lihachev commented on NUTCH-695:
---
thank you, Sami
incorrect mime type
[
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675232#action_12675232
]
Dmitry Lihachev commented on NUTCH-684:
---
This patch works for me too.
Dedup support
[
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-684:
--
Attachment: NUTCH-684_bin_nutch.patch
patch for bin/nutch
so we can write
{{bin/nutch
[
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-684:
--
Attachment: NUTCH-684_solrdedup_v2.patch
Produce a little more log output
Dedup support for
[
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675260#action_12675260
]
dmitry.lihachev edited comment on NUTCH-684 at 2/19/09 10:40 PM:
[
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-684:
--
Attachment: (was: NUTCH-684_solrdedup_v2.patch)
Dedup support for Solr
[
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-684:
--
Attachment: NUTCH-684_solrdedup_v2.patch
Dedup support for Solr
--
Generate log output for solr indexer and dedup
--
Key: NUTCH-697
URL: https://issues.apache.org/jira/browse/NUTCH-697
Project: Nutch
Issue Type: Improvement
Components: indexer
[
https://issues.apache.org/jira/browse/NUTCH-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-697:
--
Attachment: NUTCH-697_solr_logs.patch
Generate log output for solr indexer and dedup
[
https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675311#action_12675311
]
Dmitry Lihachev commented on NUTCH-684:
---
bq. there is a silent assumption that Solr
[
https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675324#action_12675324
]
Dmitry Lihachev commented on NUTCH-699:
---
I think we must extend the field set for each
[
https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676227#action_12676227
]
Dmitry Lihachev commented on NUTCH-644:
---
I found sources of RTFParser.jj (ASF) and
[
https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-644:
--
Attachment: NUTCH-644_v3.patch
RTF parser doesn't compile anymore
parse-rtf plugin
Key: NUTCH-705
URL: https://issues.apache.org/jira/browse/NUTCH-705
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
[
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677242#action_12677242
]
Dmitry Lihachev commented on NUTCH-705:
---
This parser correctly handles non-ASCII input
[
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-705:
--
Attachment: NUTCH-705.patch
parse-rtf plugin
Key:
[
https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677244#action_12677244
]
Dmitry Lihachev commented on NUTCH-644:
---
this parser incorrectly handles non-ascii
[
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677878#action_12677878
]
Dmitry Lihachev commented on NUTCH-705:
---
Yes, it looks a bit like a problem... How can
Subcollection plugin doesn't work with default subcollections.xml file
--
Key: NUTCH-715
URL: https://issues.apache.org/jira/browse/NUTCH-715
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-715:
--
Attachment: NUTCH-715-testcase.patch
Subcollection plugin doesn't work with default
[
https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-715:
--
Attachment: NUTCH-715-fix.patch
Subcollection plugin doesn't work with default
[
https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-715:
--
Attachment: (was: NUTCH-715-fix.patch)
Subcollection plugin doesn't work with default
[
https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-715:
--
Attachment: NUTCH-715_subcollections_fix.patch
Subcollection plugin doesn't work with default
Make subcollection index filed multivalued
--
Key: NUTCH-716
URL: https://issues.apache.org/jira/browse/NUTCH-716
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects
[
https://issues.apache.org/jira/browse/NUTCH-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-716:
--
Attachment: NUTCH-716_multivalued_subcollection.patch
Make subcollection index filed
urlfilter-subnets plugin
Key: NUTCH-718
URL: https://issues.apache.org/jira/browse/NUTCH-718
Project: Nutch
Issue Type: New Feature
Reporter: Dmitry Lihachev
Priority: Minor
This plugin
[
https://issues.apache.org/jira/browse/NUTCH-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-718:
--
Attachment: NUTCH-718_urlfilter_subnets.patch
{code}
cd nutch-trunk
patch -p0
[
https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682223#action_12682223
]
Dmitry Lihachev commented on NUTCH-699:
---
In some cases (eg. when using
[
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689385#action_12689385
]
Dmitry Lihachev commented on NUTCH-706:
---
I think this must be changed to
{code:xml}
[
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-578:
--
Attachment: NUTCH-578_v3.patch
The changes in CrawlDbReducer are already applied in trunk, so patch
[
https://issues.apache.org/jira/browse/NUTCH-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-716:
--
Fix Version/s: 1.1
Make subcollection index filed multivalued
urlnormalizer-unalias plugin
Key: NUTCH-737
URL: https://issues.apache.org/jira/browse/NUTCH-737
Project: Nutch
Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
I tried
[
https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-737:
--
Priority: Minor (was: Major)
urlnormalizer-unalias plugin
[
https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-737:
--
Attachment: NUTCH-737_urlfilter_unalias.patch
urlnormalizer-unalias plugin
[
https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-737:
--
Attachment: (was: NUTCH-737_urlfilter_unalias.patch)
urlnormalizer-unalias plugin
[
https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713383#action_12713383
]
Dmitry Lihachev commented on NUTCH-702:
---
I caught an NPE when using this patch
{code}
SolrDeleteDuplications too slow when using hadoop
-
Key: NUTCH-739
URL: https://issues.apache.org/jira/browse/NUTCH-739
Project: Nutch
Issue Type: Bug
Components: indexer
Affects
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-739:
--
Description:
In my environment I always get many warnings like this during the dedup step
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lihachev updated NUTCH-739:
--
Attachment: NUTCH-739_remove_optimize_on_solr_dedup.patch
This simple patch decreases dedup time
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714264#action_12714264
]
Dmitry Lihachev commented on NUTCH-739:
---
In my recrawl script I have the following lines
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714287#action_12714287
]
Dmitry Lihachev commented on NUTCH-739:
---
with this approach we still have few optimize
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714288#action_12714288
]
Dmitry Lihachev commented on NUTCH-739:
---
am I wrong?
SolrDeleteDuplications too slow
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714290#action_12714290
]
Dmitry Lihachev commented on NUTCH-739:
---
I think that optimizing solr - is not hadoop
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714346#action_12714346
]
Dmitry Lihachev commented on NUTCH-739:
---
Doğacan, I agree with you about curl usage.
[
https://issues.apache.org/jira/browse/NUTCH-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714349#action_12714349
]
Dmitry Lihachev commented on NUTCH-739:
---
Ooops, sorry... Tool is Map/Reduce
[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851710#action_12851710
]
Dmitry Lihachev commented on NUTCH-570:
---
Yeah, Otis. It's just an update so it applies
60 matches