[jira] Assigned: (NUTCH-817) parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1

2010-05-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-817: --- Assignee: Julien Nioche parse-(html)does follow links of full html page, parse-(tika) does

[jira] Commented: (NUTCH-710) Support for rel=canonical attribute

2010-04-21 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286 ] Julien Nioche commented on NUTCH-710: - As suggested previously we could either treat

[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349 ] Julien Nioche commented on NUTCH-808: - Hi Enis, {quote} On the other hand, current

[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-808: Fix Version/s: 2.0 Evaluate ORM Frameworks which support non-relational column-oriented

[jira] Created: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
Upgrade to Tika 0.7 --- Key: NUTCH-810 URL: https://issues.apache.org/jira/browse/NUTCH-810 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche

[jira] Updated: (NUTCH-789) Improvements to Tika parser

2010-04-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-789: Component/s: (was: fetcher) parser Fix Version/s: (was: 1.1) Have

[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-810. --- Resolution: Fixed Committed in rev 931098. http://issues.apache.org/jira/browse/TIKA-317 changed the

[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853251#action_12853251 ] Julien Nioche commented on NUTCH-789: - Will upgrade as soon as 0.7 is available from

[jira] Created: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee:

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: NUTCH-809.patch Parse-metatags plugin - Key:

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: (was: NUTCH-809.patch) Parse-metatags plugin -

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: NUTCH-809.patch Modified version of the plugin which is compatible with parse-tika

[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Description: h2. Parse-metatags plugin The parse-metatags plugin consists of a HTMLParserFilter

[jira] Updated: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-706: Fix Version/s: (was: 1.1) Both variants of the substitution rule above break existing tests.

[jira] Resolved: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-779. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 929038. Thanks Andrzej for your

[jira] Closed: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-785. --- Resolution: Fixed Committed revision 929039 Thanks Andrzej for reviewing it Fetcher : copy

[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851316#action_12851316 ] Julien Nioche commented on NUTCH-789: - Shall we postpone the work on this issue to after

[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851545#action_12851545 ] Julien Nioche commented on NUTCH-570: - {quote}Julien, want to take this?{quote} Not

[jira] Closed: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-784. --- Resolution: Fixed Committed revision 928746 CrawlDBScanner --- Key:

[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-784: Fix Version/s: 1.1 CrawlDBScanner --- Key: NUTCH-784

[jira] Created: (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

2010-03-29 Thread Julien Nioche (JIRA)
Merge CrawlDBScanner with CrawlDBReader --- Key: NUTCH-806 URL: https://issues.apache.org/jira/browse/NUTCH-806 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche

[jira] Updated: (NUTCH-783) IndexerChecker Utilty

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-783: Fix Version/s: (was: 1.1) Removed tag 1.1 Will rename to IndexingPluginsChecker later

[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850912#action_12850912 ] Julien Nioche commented on NUTCH-785: - Could anyone please review this issue? I would

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-29 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850915#action_12850915 ] Julien Nioche commented on NUTCH-779: - Could anyone please review this issue? I would

[jira] Updated: (NUTCH-776) Configurable queue depth

2010-03-23 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-776: Fix Version/s: (was: 1.1) Moving this issue post 1.1 Needs a patch file, some description of

[jira] Closed: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-740. --- Resolution: Fixed Assignee: Julien Nioche Committed in rev 926003 Thanks Marcin for

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Fix Version/s: 1.1 Alternative Generator which can generate several segments in one parse of the

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: NUTCH-762-v3.patch new patch which reintroduces the 'generator.update.crawldb'

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848095#action_12848095 ] Julien Nioche commented on NUTCH-762: - {quote} I just noticed that the new Generator

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848140#action_12848140 ] Julien Nioche commented on NUTCH-762: - The change of prefix also reflected that we now

[jira] Closed: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-762. --- Resolution: Fixed Committed revision 926155 Have reverted the prefix for params to 'generate.' +

[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-740: Attachment: NUTCH-740.patch Slightly modified version of the patch with modifs for protocol-http.

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846910#action_12846910 ] Julien Nioche commented on NUTCH-762: - OK, there was indeed an assumption that the

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846930#action_12846930 ] Julien Nioche commented on NUTCH-762: - Yes, I came across that situation too on a large

[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2010-03-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-469: Fix Version/s: (was: 1.1) There has not been any changes to this issue since February 09 and it

[jira] Commented: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845886#action_12845886 ] Julien Nioche commented on NUTCH-740: - A nice contribution but should not this be

[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846141#action_12846141 ] Julien Nioche commented on NUTCH-762: - If I am not mistaken the point of having

[jira] Updated: (NUTCH-710) Support for rel=canonical attribute

2010-03-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-710: Fix Version/s: (was: 1.1) Great idea. Won't be included in 1.1 though so moving to *fix :

[jira] Resolved: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2010-03-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-692. - Resolution: Cannot Reproduce Fix Version/s: 1.1 I cannot reproduce the issue since we

[jira] Resolved: (NUTCH-798) Upgrade to SOLR1.4

2010-03-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-798. - Resolution: Fixed Updated SOLRJ's dependencies at the same time : Deleting

[jira] Resolved: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-801. - Resolution: Fixed Committed revision 921840. Remove RTF and MP3 parse plugins

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: (was: NUTCH-762-MultiGenerator.patch) Alternative Generator which can generate

[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-762: Attachment: NUTCH-762-v2.patch Improved version of the patch : - fixed a few minor bugs - renamed

[jira] Created: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)
SOLRIndexer to commit once all reducers have finished - Key: NUTCH-799 URL: https://issues.apache.org/jira/browse/NUTCH-799 Project: Nutch Issue Type: Improvement Components:

[jira] Updated: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-799: Attachment: NUTCH-799.patch SOLRIndexer to commit once all reducers have finished

[jira] Closed: (NUTCH-782) Ability to order htmlparsefilters

2010-03-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-782. --- Resolution: Fixed Committed revision 917557 Ability to order htmlparsefilters

[jira] Created: (NUTCH-798) Upgrade to SOLR1.4

2010-02-26 Thread Julien Nioche (JIRA)
Upgrade to SOLR1.4 -- Key: NUTCH-798 URL: https://issues.apache.org/jira/browse/NUTCH-798 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Julien Nioche Fix For: 1.1

[jira] Resolved: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-719. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 911905. Thanks to S. Dennis for

[jira] Closed: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-719. --- fetchQueues.totalSize incorrect in Fetcher2 ---

[jira] Resolved: (NUTCH-705) parse-rtf plugin

2010-02-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-705. - Resolution: Fixed RTF parsing is now handled by the TikaPlugin (NUTCH-766). Please open an issue

[jira] Resolved: (NUTCH-644) RTF parser doesn't compile anymore

2010-02-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-644. - Resolution: Fixed RTF parsing is now handled by the TikaPlugin (NUTCH-766) which solves the issue

[jira] Created: (NUTCH-794) Tika parser does not keep attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
Tika parser does not keep attributes on html tag Key: NUTCH-794 URL: https://issues.apache.org/jira/browse/NUTCH-794 Project: Nutch Issue Type: Bug Reporter: Julien Nioche

[jira] Updated: (NUTCH-794) Tika parser does identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Description: The following HTML document : html lang=fiheaddocument 1 title/headbodyjotain

[jira] Updated: (NUTCH-794) Tika parser does identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Attachment: NUTCH-794.patch Tika parser does identify lang attributes on html tag

[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834147#action_12834147 ] Julien Nioche commented on NUTCH-794: - Committed patch in revision 910454 Waiting for

[jira] Updated: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Summary: Language Identification must use check the parse metadata for language values (was: Tika

[jira] Work started: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-794 started by Julien Nioche. Language Identification must use check the parse metadata for language values

[jira] Updated: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-794: Component/s: parser Language Identification must use check the parse metadata for language values

[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-782: Component/s: parser Ability to order htmlparsefilters -

[jira] Closed: (NUTCH-766) Tika parser

2010-02-15 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-766. --- Have added small improvement in revision 910187 (Prioritise default Tika parser when discovering plugins

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454 ] Julien Nioche commented on NUTCH-766: - @Chris : I just did a fresh co from svn, applied

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] Julien Nioche commented on NUTCH-766: - I had a closer look at the HTML parsing issue.

[jira] Issue Comment Edited: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] Julien Nioche edited comment on NUTCH-766 at 2/11/10 5:22 PM: --

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832583#action_12832583 ] Julien Nioche commented on NUTCH-766: - @Chris : did you do ant -f

[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-787: Fix Version/s: 1.1 Upgrade Lucene to 3.0.0. Key:

[jira] Created: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
Better list of suffix domains - Key: NUTCH-786 URL: https://issues.apache.org/jira/browse/NUTCH-786 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Julien Nioche

[jira] Updated: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-786: Attachment: NUTCH-786.patch Small improvement to the content of domain-suffixes.xml : added

[jira] Closed: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-786. --- Resolution: Fixed Committed revision 906907 Better list of suffix domains

[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-02 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828548#action_12828548 ] Julien Nioche commented on NUTCH-781: - did you forgot to update conf/tika-mimetypes.xml

[jira] Created: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)
Update Tika to v0.6 for the MimeType detection --- Key: NUTCH-781 URL: https://issues.apache.org/jira/browse/NUTCH-781 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche

[jira] Resolved: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-781. - Resolution: Fixed Committed revision 905228 Update Tika to v0.6 for the MimeType detection

[jira] Closed: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-781. --- Update Tika to v0.6 for the MimeType detection ---

[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: NUTCH-766-v3.patch Updated version of the plugin : uses Tika 0.6 Tika parser

[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: (was: Nutch-766.ParserFactory.patch) Tika parser ---

[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: (was: NUTCH-766.tika.patch) Tika parser --- Key:

[jira] Created: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)
Ability to order htmlparsefilters - Key: NUTCH-782 URL: https://issues.apache.org/jira/browse/NUTCH-782 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien

[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-782: Attachment: NUTCH-782.patch Ability to order htmlparsefilters -

[jira] Created: (NUTCH-783) IndexerChecker Utilty

2010-02-01 Thread Julien Nioche (JIRA)
IndexerChecker Utilty - Key: NUTCH-783 URL: https://issues.apache.org/jira/browse/NUTCH-783 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Fix For:

[jira] Assigned: (NUTCH-783) IndexerChecker Utilty

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-783: --- Assignee: Julien Nioche IndexerChecker Utilty - Key:

[jira] Updated: (NUTCH-783) IndexerChecker Utilty

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-783: Attachment: NUTCH-783.patch IndexerChecker Utilty - Key:

[jira] Assigned: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-779: --- Assignee: Julien Nioche Mechanism for passing metadata from parse to crawldb

[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-779: Attachment: NUTCH-779-v2.patch Improved version of the patch. Followed AB's recommendations and

[jira] Created: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)
CrawlDBScanner --- Key: NUTCH-784 URL: https://issues.apache.org/jira/browse/NUTCH-784 Project: Nutch Issue Type: New Feature Reporter: Julien Nioche Assignee: Julien Nioche Attachments:

[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-784: Attachment: NUTCH-784.patch CrawlDBScanner --- Key: NUTCH-784

[jira] Created: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)
Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL --- Key: NUTCH-785 URL:

[jira] Updated: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-785: Attachment: NUTCH-785.patch Fetcher : copy metadata from origin URL when redirecting + call

[jira] Commented: (NUTCH-766) Tika parser

2010-01-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805892#action_12805892 ] Julien Nioche commented on NUTCH-766: - Here is a slightly better version of the patch

[jira] Updated: (NUTCH-766) Tika parser

2010-01-28 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-766: Attachment: NUTCH-766.v2 sample.tar.gz new version of the patch + archive

[jira] Resolved: (NUTCH-778) Running Nutch On linux having whoami exception?

2010-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-778. - Resolution: Invalid Fix Version/s: (was: 1.0.0) This is likely to be a problem with

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803670#action_12803670 ] Julien Nioche commented on NUTCH-766: - I think the end result of this plugin should be

[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-19 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802172#action_12802172 ] Julien Nioche commented on NUTCH-779: - The property needs some documentation in

[jira] Created: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-18 Thread Julien Nioche (JIRA)
Mechanism for passing metadata from parse to crawldb Key: NUTCH-779 URL: https://issues.apache.org/jira/browse/NUTCH-779 Project: Nutch Issue Type: New Feature Reporter:

[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-01-18 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-779: Attachment: NUTCH-779 Mechanism for passing metadata from parse to crawldb

[jira] Closed: (NUTCH-767) Update Tika to v0.5 for the MimeType detection

2010-01-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-767. --- Resolution: Fixed Committed revision 897825 Update Tika to v0.5 for the MimeType detection

[jira] Resolved: (NUTCH-751) Upgrade version of HttpClient

2010-01-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-751. - Resolution: Later The changes in the underlying API are quite substantial and this would need a

[jira] Commented: (NUTCH-766) Tika parser

2010-01-11 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798727#action_12798727 ] Julien Nioche commented on NUTCH-766: - Hi Chris, No worries, I'd rather wait for you

[jira] Assigned: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche reassigned NUTCH-269: --- Assignee: Julien Nioche CrawlDbReducer: OOME because no upper-bound on inlinks count

[jira] Commented: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797990#action_12797990 ] Julien Nioche commented on NUTCH-269: - I will shortly commit a variant of this approach

[jira] Resolved: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2010-01-08 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-269. - Resolution: Fixed Fix Version/s: 1.1 Committed revision 897180 CrawlDbReducer: OOME

[jira] Commented: (NUTCH-776) Configurable queue depth

2010-01-07 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797653#action_12797653 ] Julien Nioche commented on NUTCH-776: - Did you notice any improvement in the fetch rate
