[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Fix Version/s: 1.12 > Rely on Tika for outlink extract

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Component/s: parser > Rely on Tika for outlink extract

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148590#comment-15148590 ] Markus Jelsma commented on NUTCH-1233: -- Awesome! Everything works as expected s

[jira] [Resolved] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2210. -- Resolution: Fixed Committed to trunk in revision 1730686. > Upgrade to Tika 1

[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148572#comment-15148572 ] Markus Jelsma commented on NUTCH-2210: -- Test passes, will commit shortly. >

[jira] [Updated] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2210: - Attachment: NUTCH-2210.patch Patch for trunk. > Upgrade to Tika 1

[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-16 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148489#comment-15148489 ] Markus Jelsma commented on NUTCH-2197: -- Hello Arun - no, this is not applie

[jira] [Commented] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147735#comment-15147735 ] Markus Jelsma commented on NUTCH-2210: -- Apache Tika 1.12 is available. Will upg

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Attachment: NUTCH-2216-NUTCH-2220-NUTCH-2221.patch Patch for trunk. This includes all three

[jira] [Updated] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2216: - Attachment: NUTCH-2216.patch Patch for trunk, introducing db.ignore.treat.redirects.as.links

[jira] [Updated] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2216: - Summary: db.ignore.*.links to optionally follow internal redirects (was: ignore.internal.links

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Attachment: NUTCH-2221.patch Patch for trunk. This includes the modified config of NUTCH-2220

[jira] [Updated] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2220: - Patch Info: Patch Available > Rename db.* options used only by the linkdb to lin

[jira] [Updated] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2220: - Attachment: NUTCH-2220.patch Patch for trunk > Rename db.* options used only by the linkdb

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Description: FetcherThread has support for db.ignore.external.links. In config you can find

[jira] [Updated] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2221: - Summary: Introduce db.ignore.internal.links to FetcherThread (was: Introduce

[jira] [Created] (NUTCH-2221) Introduce db.ignore.external.links to FetcherThread

2016-02-15 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2221: Summary: Introduce db.ignore.external.links to FetcherThread Key: NUTCH-2221 URL: https://issues.apache.org/jira/browse/NUTCH-2221 Project: Nutch Issue Type

[jira] [Created] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-15 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2220: Summary: Rename db.* options used only by the linkdb to linkdb.* Key: NUTCH-2220 URL: https://issues.apache.org/jira/browse/NUTCH-2220 Project: Nutch Issue

[jira] [Resolved] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2189. -- Resolution: Fixed > Domain filter must deactivate if no rules are pres

[jira] [Updated] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2189: - Fix Version/s: 1.12 > Domain filter must deactivate if no rules are pres

[jira] [Updated] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2189: - Affects Version/s: 1.11 > Domain filter must deactivate if no rules are pres

[jira] [Reopened] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-2189: -- Fix version missing > Domain filter must deactivate if no rules are pres

[jira] [Closed] (NUTCH-2189) Domain filter must deactivate if no rules are present

2016-02-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2189. > Domain filter must deactivate if no rules are pres

[jira] [Commented] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144518#comment-15144518 ] Markus Jelsma commented on NUTCH-2216: -- An option is to change the default

[jira] [Commented] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144497#comment-15144497 ] Markus Jelsma commented on NUTCH-2216: -- Additionally, it probably should no

[jira] [Commented] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144463#comment-15144463 ] Markus Jelsma commented on NUTCH-2216: -- Apparently db.ignore.internal.links is

[jira] [Created] (NUTCH-2216) ignore.internal.links to optionally follow internal redirects

2016-02-12 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2216: Summary: ignore.internal.links to optionally follow internal redirects Key: NUTCH-2216 URL: https://issues.apache.org/jira/browse/NUTCH-2216 Project: Nutch

[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2215: - Attachment: NUTCH-2215.patch Tiny error in nutch-default description. > Generator to restr

[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-11 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2215: - Attachment: NUTCH-2215.patch Patch for trunk. Unit test passes! > Generator to restrict crawl

[jira] [Created] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-11 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2215: Summary: Generator to restrict crawl to mime type Key: NUTCH-2215 URL: https://issues.apache.org/jira/browse/NUTCH-2215 Project: Nutch Issue Type

[jira] [Created] (NUTCH-2214) Index clean to be flexible on what it deletes

2016-02-10 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2214: Summary: Index clean to be flexible on what it deletes Key: NUTCH-2214 URL: https://issues.apache.org/jira/browse/NUTCH-2214 Project: Nutch Issue Type

[jira] [Created] (NUTCH-2212) Decrease memory consumption by tuning stack size

2016-02-03 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2212: Summary: Decrease memory consumption by tuning stack size Key: NUTCH-2212 URL: https://issues.apache.org/jira/browse/NUTCH-2212 Project: Nutch Issue Type

[jira] [Closed] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2211. > Filter and normalizer checkers missing in bin/nu

[jira] [Resolved] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2211. -- Resolution: Fixed Committed to trunk in revision 1728339. > Filter and normalizer check

[jira] [Updated] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2211: - Attachment: NUTCH-2211.patch Patch for trunk. > Filter and normalizer checkers missing in

[jira] [Created] (NUTCH-2211) Filter and normalizer checkers missing in bin/nutch

2016-02-03 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2211: Summary: Filter and normalizer checkers missing in bin/nutch Key: NUTCH-2211 URL: https://issues.apache.org/jira/browse/NUTCH-2211 Project: Nutch Issue Type

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Fix Version/s: 1.12 > Add solr5 solrcloud indexer supp

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Affects Version/s: (was: 1.12) 1.11 > Add solr5 solrcloud inde

[jira] [Resolved] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2197. -- Resolution: Fixed Committed to trunk in revision 1728313. Thanks Jurian Broertjes! >

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Attachment: NUTCH-2197.patch Previous patch was missing a proper version in plugin.xml. Will

[jira] [Created] (NUTCH-2210) Upgrade to Tika 1.12

2016-02-02 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2210: Summary: Upgrade to Tika 1.12 Key: NUTCH-2210 URL: https://issues.apache.org/jira/browse/NUTCH-2210 Project: Nutch Issue Type: Task Reporter

[jira] [Updated] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2197: - Attachment: NUTCH-2197.patch Here's the updated patch with Solr 5.4.1 > Add solr5 s

[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-02-02 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128366#comment-15128366 ] Markus Jelsma commented on NUTCH-2197: -- I am going to commit this soon un

[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2016-01-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1465: - Fix Version/s: 1.13 > Support sitemaps in Nu

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117024#comment-15117024 ] Markus Jelsma commented on NUTCH-961: - Yes! :) > Expose Tika's boiler

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-26 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116975#comment-15116975 ] Markus Jelsma commented on NUTCH-961: - With boilerpipe, you get only a very

[jira] [Commented] (NUTCH-2205) Nutch solrdedup error in solrcloud for larger docs

2016-01-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114991#comment-15114991 ] Markus Jelsma commented on NUTCH-2205: -- This looks like your cluster was down, n

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114989#comment-15114989 ] Markus Jelsma commented on NUTCH-961: - That is probably due to the patch parsing t

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111292#comment-15111292 ] Markus Jelsma commented on NUTCH-961: - Some news, the upstream Tika issue has

[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110947#comment-15110947 ] Markus Jelsma commented on NUTCH-2202: -- Yes, a patch would be a good place to s

[jira] [Commented] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110797#comment-15110797 ] Markus Jelsma commented on NUTCH-2197: -- This Solr 5 plugin is capable of indexin

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Patch Info: Patch Available > Remove loops program from webgraph pack

[jira] [Resolved] (NUTCH-2201) Remove loops program from webgraph package

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2201. -- Resolution: Fixed Committed to trunk revision 1725981. Thanks Dennis! > Remove loops prog

[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110729#comment-15110729 ] Markus Jelsma commented on NUTCH-1325: -- Yes, they are very useful for fin

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Attachment: NUTCH-2201.patch Patch for trunk which removed the loops program and all references

[jira] [Resolved] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1325. -- Resolution: Fixed Committed to trunk in revision 1725952. Many thanks to all contributors

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Component/s: hostdb > HostDB for Nutch > > > Key

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Fix Version/s: 1.12 > HostDB for Nutch > > > Key

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325.patch Updated patch for trunk contains more thorough config descriptions

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Patch Info: Patch Available Description: h1. HostDB for Apache Nutch 1.x * automatically

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325.patch TDigest is awesome! Here's with support for user configurable

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325.patch Updated patch to use TDigest for streaming percentiles. But because

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110375#comment-15110375 ] Markus Jelsma commented on NUTCH-1233: -- Yes, we'll get this support with

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-21 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110373#comment-15110373 ] Markus Jelsma commented on NUTCH-961: - Hello - that doesn't seem related to t

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Fix Version/s: 1.12 > Remove loops program from webgraph pack

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Affects Version/s: 1.11 > Remove loops program from webgraph pack

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325.patch Updated patch for trunk, i think it's fairly complete now, incl

[jira] [Assigned] (NUTCH-1325) HostDB for Nutch

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-1325: Assignee: Markus Jelsma > HostDB for Nutch > > >

[jira] [Assigned] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2203: Assignee: Markus Jelsma > Suffix URL filter can't handle trailing/leading whi

[jira] [Resolved] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2203. -- Resolution: Fixed Committed to trunk in revision 1725538. Thanks Jurian Broertjes. > Suf

[jira] [Updated] (NUTCH-2203) Suffix URL filter can't handle trailing/leading whitespaces

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2203: - Fix Version/s: 1.12 > Suffix URL filter can't handle trailing/leading whi

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106783#comment-15106783 ] Markus Jelsma commented on NUTCH-961: - Update, i've updated NUTCH-1233 fo

[jira] [Comment Edited] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106633#comment-15106633 ] Markus Jelsma edited comment on NUTCH-1233 at 1/19/16 11:5

[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106633#comment-15106633 ] Markus Jelsma commented on NUTCH-1233: -- It seems Tika's link extraction

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: pre-1233-2.txt post-1233-2.txt Here's another set to compare

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: NUTCH-1233.patch Updated patch. Patch now contains the old link extraction commented

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: pre-1233.txt post-1233.txt Two lists of extracted URL's, befor

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: NUTCH-1233.patch Updated patch for trunk > Rely on Tika for outlink extract

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-19 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106570#comment-15106570 ] Markus Jelsma commented on NUTCH-961: - Yes but it requires NUTCH-1233. >

[jira] [Closed] (NUTCH-1107) Log slow parse entries

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1107. Resolution: Won't Fix > Log slow parse entries > -- > >

[jira] [Closed] (NUTCH-1149) DomainStats should process numeric CrawlDB metadata

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1149. Resolution: Won't Fix Will upload proper patch for NUTCH-1325 soon which already contains nu

[jira] [Resolved] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1838. -- Resolution: Fixed > Host and domain based regex and automaton filter

[jira] [Assigned] (NUTCH-2201) Remove loops program from webgraph package

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2201: Assignee: Markus Jelsma > Remove loops program from webgraph pack

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
communication problem I am using solr 5.4 and nutch 1.11 On Tue, Jan 19, 2016 at 1:46 AM, Markus Jelsma <markus.jel...@openindex.io> wrote: Hi - it was an answer to your question whether i have ever used it. Yes, i patched and committed it. And therefore i asked if you're using Solr 5 or no

[jira] [Assigned] (NUTCH-2197) Add solr5 solrcloud indexer support

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2197: Assignee: Markus Jelsma > Add solr5 solrcloud indexer supp

[jira] [Updated] (NUTCH-2201) Remove loops program from webgraph package

2016-01-18 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2201: - Summary: Remove loops program from webgraph package (was: Remove loops program from webgrapg

[jira] [Created] (NUTCH-2201) Remove loops program from webgrapg package

2016-01-18 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2201: Summary: Remove loops program from webgrapg package Key: NUTCH-2201 URL: https://issues.apache.org/jira/browse/NUTCH-2201 Project: Nutch Issue Type: Task

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
: dev@nutch.apache.org Subject: Re: Nutch/Solr communication problem Mind to share that patch? On Mon, Jan 18, 2016 at 8:28 PM, Markus Jelsma <markus.jel...@openindex.io> wrote: Yes i have used it, i made the damn patch myself years ago, and i used the same configuration. Command line or conf

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
. thanks On Mon, Jan 18, 2016 at 4:50 PM, Markus Jelsma <markus.jel...@openindex.io> wrote: Hi - This doesn't look like an HTTP basic authentication problem. Are you running Solr 5.x? Markus -Original message- From: Zara Parst <edotserv...@gmail.com> Sent:

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
xingJob.run(IndexingJob.java:228) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) On Mon, Jan 18, 2016 at 4:15 PM, Markus Jelsma <markus.jel...@openindex.io> wrote: Hi - can you post the lo

RE: Nutch/Solr communication problem

2016-01-18 Thread Markus Jelsma
Hi - can you post the log output? Markus -Original message- From: Zara Parst Sent: Monday 18th January 2016 2:06 To: dev@nutch.apache.org Subject: Nutch/Solr communication problem Hi everyone, I have situation here, I am using nutch 1.11 and solr 5.4 Solr is protected by user name and

[jira] [Resolved] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2194. -- Resolution: Fixed Committed to trunk in revision 1724771. > Run IndexingFilterChecker

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-14 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Attachment: NUTCH-2194.patch Updated patch. Signature is now also added to CrawlDatum, in case an

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Patch Info: Patch Available > Run IndexingFilterChecker as simple Telnet ser

[jira] [Commented] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096263#comment-15096263 ] Markus Jelsma commented on NUTCH-2194: -- Please check it out :)

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Description: We have used a customized IndexingFilterChecker running as server to be able to

[jira] [Updated] (NUTCH-2194) Run IndexingFilterChecker as simple Telnet server

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2194: - Attachment: NUTCH-2194.patch Patch for trunk. With default settings this server needs just about

[jira] [Commented] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096156#comment-15096156 ] Markus Jelsma commented on NUTCH-2196: -- Committed to trunk in revision 172

[jira] [Resolved] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2196. -- Resolution: Fixed > IndexingFilterChecker to optionally normal

[jira] [Updated] (NUTCH-2196) IndexingFilterChecker to optionally normalize

2016-01-13 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2196: - Attachment: NUTCH-2196.patch Patch for trunk introducing the -normalize flag. If enabled, input
