[jira] Commented: (NUTCH-525) DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment

2007-07-24 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514914 ] Andrzej Bialecki commented on NUTCH-525: - +1 for adding undeleteAll(). When DDRecordReader was created

[jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513853 ] Andrzej Bialecki commented on NUTCH-518: - IMHO this change is not helpful. It takes away too much control

Re: Looking to fix relative path issue in linkdb

2007-07-19 Thread Andrzej Bialecki
trigger a costly copy operation. -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com

[jira] Reopened: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reopened NUTCH-518: - This one was too quick, I think ... I wanted to discuss the issue whether the chaining

[jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513704 ] Andrzej Bialecki commented on NUTCH-518: - Right, I was too quick too ... ;) Leave it in for now. Let's agree

[jira] Commented: (NUTCH-515) Next fetch time is set incorrectly

2007-07-16 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513019 ] Andrzej Bialecki commented on NUTCH-515: - +1 - sorry for the mess up ... Next fetch time is set

[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512139 ] Andrzej Bialecki commented on NUTCH-505: - Please test Java 1.5 and Java 1.6 - IIRC there are some

[jira] Closed: (NUTCH-511) Recrawling

2007-07-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-511. --- Resolution: Invalid Assignee: Andrzej Bialecki Please use mailing lists

[jira] Closed: (NUTCH-512) Search on date range

2007-07-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-512. --- Resolution: Invalid Please use mailing lists for such questions. Search on date range

Re: OPIC scoring differences

2007-07-11 Thread Andrzej Bialecki
process to consider the complete webgraph, i.e. all link information collected so far - but the main attractiveness of OPIC is that it's incremental, so that you don't have to consider the whole webgraph with small incremental updates. -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-439) Top Level Domains Indexing / Scoring

2007-07-10 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511362 ] Andrzej Bialecki commented on NUTCH-439: - Very nice patch! A couple comments: * the fix

[jira] Commented: (NUTCH-505) Outlink urls should be validated

2007-07-10 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511447 ] Andrzej Bialecki commented on NUTCH-505: - * In ParseOutputFormat, the calculation of outlinksToStore should

Re: OPIC scoring differences

2007-07-09 Thread Andrzej Bialecki
/nutch/FixingOpicScoring bottom of the page). -- Best regards, Andrzej Bialecki

Re: Not renewing CrawlDatum on Inject

2007-07-09 Thread Andrzej Bialecki
in Injector.InjectReducer. -- Best regards, Andrzej Bialecki

Re: Plans on releasing another bug fix release?

2007-07-04 Thread Andrzej Bialecki
Doug Cutting wrote: Will the next release really be 1.0 or will it be 0.10? Really 1.0. -- Best regards, Andrzej Bialecki

Re: Plans on releasing another bug fix release?

2007-07-04 Thread Andrzej Bialecki
not be as full of features as one might have wished, but that's what the 1.0 release implies - it's usable, with some limitations. Will it be soon? :) I'm pretty sure it will be some time after the vacation period is over, not earlier ;) -- Best regards, Andrzej Bialecki

Re: Plans on releasing another bug fix release?

2007-07-03 Thread Andrzej Bialecki
(that is, supported by developer resources ;) ) to be able to do this. -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508816 ] Andrzej Bialecki commented on NUTCH-392: - Re: Content versioning - we can use negative int values as version

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508900 ] Andrzej Bialecki commented on NUTCH-392: - Excellent work, Doğacan - thank you. The numbers for RECORD

[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-27 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508506 ] Andrzej Bialecki commented on NUTCH-498: - +1. Use Combiner in LinkDb to increase speed of linkdb

[jira] Commented: (NUTCH-501) Implement a different caching mechanism for objects cached in configuration

2007-06-23 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507609 ] Andrzej Bialecki commented on NUTCH-501: - +1 - looks good. An idea: perhaps we could add a LOG.debug

[jira] Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching

2007-06-22 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507168 ] Andrzej Bialecki commented on NUTCH-504: - +1 - we should skip documents that failed to parse properly

[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-21 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506775 ] Andrzej Bialecki commented on NUTCH-497: - The patch looks good to me as it is now - however, I've seen

Re: upgrade to hadoop-0.13?

2007-06-18 Thread Andrzej Bialecki
for more details. This change will affect a lot of places in our code, so it would be best to do it long before the next Nutch release. -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-501) implementing a different caching mechanism for objects

2007-06-18 Thread Andrzej Bialecki (JIRA)
.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505807 ] Andrzej Bialecki commented on NUTCH-501: - ObjectCache should support caching objects that fall under the same key, but are differently configured. This situation occurs when running in local

[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-06-17 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505598 ] Andrzej Bialecki commented on NUTCH-485: - Whitespace changes should be committed as a separate patch

[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505302 ] Andrzej Bialecki commented on NUTCH-498: - Currently there is no difference, indeed. The version

Welcome Doğacan as Nutch committer

2007-06-11 Thread Andrzej Bialecki
Hi all, I'm glad to announce that the Lucene PMC has voted to add Doğacan Güney as Nutch committer. Welcome, Doğacan! There are 192 open issues in Nutch JIRA waiting to be solved ... just dive in! ;) -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500951 ] Andrzej Bialecki commented on NUTCH-392: - I don't think it's a good idea, it's creating too many cryptic

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635 ] Andrzej Bialecki commented on NUTCH-392: - Good point. We can change it to use the following pattern

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500728 ] Andrzej Bialecki commented on NUTCH-392: - I think it is okay to allow BLOCK compression for linkdb, crawldb

Re: Plugins and Thread Safety

2007-06-01 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: Plugins and Thread Safety

2007-06-01 Thread Andrzej Bialecki
Briggs wrote: Oh, you want me to change the getSorted method to be synchronized? I'll put a lock in there and see what happens, if that is what you are referring to. Yes, please try this change. -- Best regards, Andrzej Bialecki

Re: [PATCH] Moving HitDetails construction to a HitDetails constructor (v2).

2007-06-01 Thread Andrzej Bialecki
, in my opinion it should not be applied. -- Best regards, Andrzej Bialecki

Re: [jira] Resolved: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-05-31 Thread Andrzej Bialecki
rubdabadub wrote: MANY MANY Super Thanks! I can't thank you enough for this Patch :-) This is so cool!!! You're welcome :) I would appreciate it if you could give it some testing and provide feedback ... -- Best regards, Andrzej Bialecki

[jira] Updated: (NUTCH-466) Flexible segment format

2007-05-31 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-466: Attachment: segmentparts.patch This patch contains the following modifications

[jira] Resolved: (NUTCH-486) Break searcher dependency on commons-cli

2007-05-31 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-486. - Resolution: Won't Fix Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki

[jira] Updated: (NUTCH-466) Flexible segment format

2007-05-31 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-466: Attachment: ParseFilters.java Add missing file. Flexible segment format

[jira] Resolved: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-05-31 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-392. - Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki

Re: Plugins initialized all the time!

2007-05-30 Thread Andrzej Bialecki
). In other words, it seems to me that there is no such situation in which we have to reload plugins within the same JVM, but with different parameters. -- Best regards, Andrzej Bialecki

[jira] Resolved: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2007-05-30 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-61. Resolution: Fixed Fix Version/s: 1.0.0 Applied with some modifications in rev. 542903

[jira] Work started: (NUTCH-466) Flexible segment format

2007-05-28 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-466 started by Andrzej Bialecki . Flexible segment format --- Key: NUTCH-466 URL: https

Re: Get meta name=description and other meta tags from Content

2007-05-23 Thread Andrzej Bialecki
need to use HtmlParseFilter. -- Best regards, Andrzej Bialecki

Re: Issues pending before 0.9 release

2007-05-18 Thread Andrzej Bialecki
rubdabadub wrote: On 3/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: rubdabadub wrote: Hi: Just wondering about NUTCH-61 http://issues.apache.org/jira/browse/Nutch-61 Will it make the 0.9 cut? It would be nice if it did. It's probably too late. This was discussed before

[jira] Commented: (NUTCH-486) Break searcher dependency on commons-cli

2007-05-15 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495867 ] Andrzej Bialecki commented on NUTCH-486: - -1 This is a side-effect of using LinkDbReader to read the LinkDb

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-443: Attachment: patch.txt I'm not too happy with the direction you took in the latest patch

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495797 ] Andrzej Bialecki commented on NUTCH-443: - Indeed... I forgot that we need crawl_parse to collect new sub

[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-12 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495319 ] Andrzej Bialecki commented on NUTCH-485: - I think a more natural change would be this: ParseResult filter

[jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-443. - Resolution: Fixed Committed in rev. 536606. Big thanks to all who contributed

[jira] Resolved: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-467. - Resolution: Fixed Assignee: Andrzej Bialecki Patch applied in rev. 532105

[jira] Closed: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-418. --- Resolution: Fixed Fix Version/s: 0.9.0 Already applied. Fixes parsing of XHTML (e.g

[jira] Closed: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-417. --- Resolution: Fixed Fix Version/s: 0.9.0 Assignee: Andrzej Bialecki Fixed

[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494552 ] Andrzej Bialecki commented on NUTCH-393: - I agree with that - either all filters should run or the document

Re: svn commit: r536606 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/util/ src/plugin/

2007-05-09 Thread Andrzej Bialecki
? Indeed. Thanks for spotting this - it's fixed. -- Best regards, Andrzej Bialecki

[jira] Resolved: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-393. - Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki Both

[jira] Commented: (NUTCH-479) Support for OR queries

2007-05-09 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494582 ] Andrzej Bialecki commented on NUTCH-479: - Correct - the only syntax element added in this patch

[jira] Created: (NUTCH-479) Support for OR queries

2007-05-07 Thread Andrzej Bialecki (JIRA)
: Andrzej Bialecki Assigned To: Andrzej Bialecki Fix For: 1.0.0 There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. -- This message is automatically generated by JIRA

[jira] Updated: (NUTCH-479) Support for OR queries

2007-05-07 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-479: Attachment: or.patch Patch based on the discussion on the mailing list, and a description

[jira] Created: (NUTCH-477) Extend URLFilters to support different filtering chains

2007-05-03 Thread Andrzej Bialecki (JIRA)
: 1.0.0 Reporter: Andrzej Bialecki Assigned To: Andrzej Bialecki Fix For: 1.0.0 I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed

[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains

2007-05-03 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-477: Attachment: urlfilters.patch This patch implements suggested changes. Extend URLFilters

[jira] Commented: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-27 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492386 ] Andrzej Bialecki commented on NUTCH-468: - +1. I'm writing a scoring plugin now where it's impossible

Re: Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki
? See the ASCII-art graphs and comments in NUTCH-385 - this is likely not what is expected. Although this JIRA issue is still open, the Fetcher2 code tries to implement this middle ground solution. -- Best regards, Andrzej Bialecki

Re: Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki
the protocol/http difference) to false to indicate lib-http shouldn't handle blocking internally. Because of this, when you use Fetcher2, lib-http still tries to block them which makes Fetcher2 much less useful. This is definitely a bug. -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-04-24 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491290 ] Andrzej Bialecki commented on NUTCH-471: - +1. Nice trick with the unsynchronized check. :) Fix

Re: Fetcher2's delay between successive requests

2007-04-24 Thread Andrzej Bialecki
, you're right - it's a bug. However, the reasoning that I presented still holds, it's just the implementation that doesn't get it ;) -- Best regards, Andrzej Bialecki

[jira] Closed: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-474. --- Resolution: Fixed Assignee: Andrzej Bialecki Fixed in rev. 532088. Thanks! Fetcher2

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Andrzej Bialecki
be a good option to have, especially for smaller setups - but it would require extensive modifications to many tools in Nutch. Unless you are willing to provide patches that implement it without breaking the large-scale case, I think we should let the matter rest ... -- Best regards, Andrzej

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Andrzej Bialecki
in the DB is used not standalone, but as one of many inputs to a map-reduce job. To summarize - I think it would be very difficult to do this with the current codebase. -- Best regards, Andrzej Bialecki

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Andrzej Bialecki
this issue. :) -- Best regards, Andrzej Bialecki

Re: [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Andrzej Bialecki
should move forward. -- Best regards, Andrzej Bialecki Index

Re: [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Andrzej Bialecki
Chris Mattmann wrote: [..] [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... +1. -- Best regards, Andrzej Bialecki

Re: [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Andrzej Bialecki
and discovering new issues, and patching them, we will never make a release ... I think for issues that are not critical or blocker we should press forward, otherwise we will have to wait another 72 hours, and another, and another ... -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-466) Flexible segment format

2007-04-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485986 ] Andrzej Bialecki commented on NUTCH-466: - Minor nit: MapFile requires that the key is a WritableComparable

[jira] Commented: (NUTCH-466) Flexible segment format

2007-04-02 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486003 ] Andrzej Bialecki commented on NUTCH-466: - I thought that the map will be from class names to directory

[jira] Created: (NUTCH-466) Flexible segment format

2007-04-01 Thread Andrzej Bialecki (JIRA)
: Andrzej Bialecki Assigned To: Andrzej Bialecki In many situations it is necessary to store more data associated with pages than is possible with the current segment format. Quite often it's binary data. There are two common workarounds for this: one is to use per-page metadata

Re: [VOTE] Release Apache Nutch 0.9

2007-03-30 Thread Andrzej Bialecki
(in a separate thread) before rewriting the how to release page in wiki. I agree - the current release process didn't fare too well in this particular situation ... -- Best regards, Andrzej Bialecki

Re: [VOTE] Release Apache Nutch 0.9

2007-03-29 Thread Andrzej Bialecki
withhold other development while waiting for rc1, rc2, rcN, ... - other patches, including disruptive ones and those that introduce new features, can be applied in the meantime to trunk/ . As for bugfixes, they can be merged up or down between the branch and trunk as needed. -- Best regards, Andrzej

Re: [VOTE] Release Apache Nutch 0.9

2007-03-29 Thread Andrzej Bialecki
Sami Siren wrote: 2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]: Sami Siren wrote: IMO we should have had a 0.9-rc1 tag, apply patch to trunk, have 0.9-rc2 tag and so on until we are satisfied. Then when we're actually satisfied create tag for 0.9 (copy from rc that got promoted). What

Re: [VOTE] Release Apache Nutch 0.9

2007-03-29 Thread Andrzej Bialecki
Sami Siren wrote: 2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]: Sami Siren wrote: 2007/3/29, Andrzej Bialecki [EMAIL PROTECTED]: Sami Siren wrote: IMO we should have had a 0.9-rc1 tag, apply patch to trunk, have 0.9-rc2 tag and so on until we are satisfied. Then when we're

Re: Sequence File Question

2007-03-29 Thread Andrzej Bialecki
. CrawlDbReader knows about the Nutch naming convention and always appends "current" to the db name. But if you were to use MapFileOutputFormat.getReaders() directly, this Hadoop class of course doesn't know about the convention, so you need to provide a full path that includes "current". -- Best regards, Andrzej Bialecki

Re: [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Andrzej Bialecki
are finally happy with the codebase then take a snapshot into tags/release-0.9, and keep it read-only. Another solution is to bend the rules and apply the patch to trunk/ and then merge from the trunk to tags/release-0.9 . What do you think? -- Best regards, Andrzej Bialecki

Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Andrzej Bialecki
between 0-9. What do you think? -- Best regards, Andrzej Bialecki

Re: Sequence File Question

2007-03-28 Thread Andrzej Bialecki
step. -- Best regards, Andrzej Bialecki

Re: Image Search Engine Input

2007-03-27 Thread Andrzej Bialecki
in the process. Let me know what you all think. I think we should work together on proposed API changes to this extensible part interface, plus probably some changes to the Parse API. I can create a JIRA issue and provide some initial patches. -- Best regards, Andrzej Bialecki

Re: Issues pending before 0.9 release

2007-03-24 Thread Andrzej Bialecki
-size.html Yes, I saw this - great stuff :) -- Best regards, Andrzej Bialecki

Re: Issues pending before 0.9 release

2007-03-23 Thread Andrzej Bialecki
release ever! :) -- Best regards, Andrzej Bialecki

[jira] Closed: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2007-03-22 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-246. --- Resolution: Fixed Assignee: Andrzej Bialecki Thanks for reminding us about

Re: Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki
Sami Siren wrote: for me it works: ... BUILD SUCCESSFUL Total time: 4 minutes 3 seconds I did a fresh checkout to an empty dir, rebuilt and it's still failing - perhaps you have some uncommitted changes in your working copy ... ? -- Best regards, Andrzej Bialecki

Re: Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki
rubdabadub wrote: Hi: Just wondering about NUTCH-61 http://issues.apache.org/jira/browse/Nutch-61 Will it make the 0.9 cut? It would be nice if it did. It's probably too late. This was discussed before - it will be applied right after the release. -- Best regards, Andrzej Bialecki

Re: Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki
is the reason - it seems that the results of text extraction are completely different under 1.6 ... -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-462) Noarchive urls are available via the cache link

2007-03-20 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482332 ] Andrzej Bialecki commented on NUTCH-462: - Is this happening with the latest trunk? See NUTCH-167, which

Re: Issues pending before 0.9 release

2007-03-20 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki wrote: Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status: Any other stuff we need to fix before the release? I am satisfied except the broken

Re: Launching custom classes

2007-03-19 Thread Andrzej Bialecki
is in the classpath. I think that What needs to be on your classpath is the *.job jar. The bin/nutch script takes care of that if you built your Nutch using the command-line version of ant. -- Best regards, Andrzej Bialecki

[jira] Commented: (NUTCH-381) Ignore external link not work as expected

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482266 ] Andrzej Bialecki commented on NUTCH-381: - Your last comment confirms my suspicions. After analysis

[jira] Closed: (NUTCH-381) Ignore external link not work as expected

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-381. --- Resolution: Won't Fix Fix Version/s: 0.9.0 Assignee: Andrzej Bialecki

[jira] Closed: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-459. --- Resolution: Fixed Upgraded to 0.12.1 release. Upgrade Nutch to Hadoop 0.12.1

[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-353: Priority: Major (was: Blocker) This is partially fixed so that page status is consistent

[jira] Updated: (NUTCH-451) Tool to recover partial fetcher output

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-451: Priority: Minor (was: Major) Tool to recover partial fetcher output

[jira] Closed: (NUTCH-450) How to set up nutch

2007-03-19 Thread Andrzej Bialecki (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-450. --- Resolution: Invalid Assignee: Andrzej Bialecki This belongs in nutch-user mailing list

Re: Issues pending before 0.9 release

2007-03-19 Thread Andrzej Bialecki
-427 Moved to Major, fix after release. NUTCH-381 Won't fix - this is a configuration issue. NUTCH-277 Cannot reproduce NUTCH-167 Fixed. Any other stuff we need to fix before the release? -- Best regards, Andrzej Bialecki
