Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Using the original index, it was possible for pages with high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with high "boost" but lower tf/idf of a term. This phenomenon leads quite often

Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
LuceneQueryOptimizer.LimitedCollector constructor, instead of super(maxHits) it should be super(numHits) - this was actually the bug, which was causing that mysterious slowdown for higher values of MAX_HITS. -- Best regards, Andrzej Bialecki

Re: NullPointerException (new as of Dec 31st)

2006-01-03 Thread Andrzej Bialecki
Rod Taylor wrote: During a fetch I have recently started getting these (pretty consistently). Fixed. Thanks! -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Sem

Re: mapred crawling exception - Job failed!

2006-01-03 Thread Andrzej Bialecki
should be fixed :) in the revision r365576. Please report if it doesn't fix it for you. -- Best regards, Andrzej Bialecki

Re: mapred crawling exception - Job failed!

2006-01-04 Thread Andrzej Bialecki
new version invokes Float.parseFloat() on line 88. -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki
se-cases? I would love to do this job, can I get a go from the other developers? +1 from me. -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki
Jérôme Charron wrote: Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values? Running many different tasks in parallel, each using different config, inside the same JVM. -- Best regards, Andrzej Bialecki

Re: IndexSorter optimizer

2006-01-04 Thread Andrzej Bialecki
e already sort of use with CachingFilters, only they propose to store them on-disk instead of limiting the cache to relatively small number of filters kept in RAM... -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki
and used locally by tasktrackers to instantiate local tasks using copies of the original NutchConf instance. -- Best regards, Andrzej Bialecki

Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-04 Thread Andrzej Bialecki
didn't see any problems, I think you can go ahead. -- Best regards, Andrzej Bialecki

Re: mapred crawling exception - Job failed!

2006-01-04 Thread Andrzej Bialecki
? -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: mapred crawling exception - Job failed!

2006-01-05 Thread Andrzej Bialecki
g the content). Is it easy to reproduce this if I knew the seed urls? If that's the case, please send me the seed urls (contact me off the list, if it's sensitive). -- Best regards, Andrzej Bialecki

Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and retrieve these policies by ID; and then instantiate it and call appropriate methods whenever we use today the URLFilters and do the score calculations. Any comments? -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-05 Thread Andrzej Bialecki
the performance somehow, since we do not need to scan the plugin folder and time. Yes, I agree on both accounts. :-) -- Best regards, Andrzej Bialecki

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
ies too... -- Best regards, Andrzej Bialecki

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
operation... OTOH, perhaps it's a premature micro-optimization. We can move it to metadata for now, but I see it as a strong candidate to be moved back... -- Best regards, Andrzej Bialecki

Re: problems http-client

2006-01-05 Thread Andrzej Bialecki
r https, cookies and authentication. A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html Yes. -- Best regar

Re: problems http-client

2006-01-06 Thread Andrzej Bialecki
? Please do go on! -- Best regards, Andrzej Bialecki

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
: java.lang.ClassCastException: java.util.ArrayList -Matt Zytaruk Could you please add a call to printStackTrace() in that catch{} statement, so that we know where the exception is thrown? -- Best regards, Andrzej Bialecki

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
bump the ParseData.VERSION, and leave this code to handle older versions... -- Best regards, Andrzej Bialecki

Re: Per-page crawling policy

2006-01-06 Thread Andrzej Bialecki
processed differently if needs be. -- Best regards, Andrzej Bialecki

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
Hi, I attached the patch. Please test. -- Best regards, Andrzej Bialecki

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
old segments. -- Best regards, Andrzej Bialecki

Re: Nutch Deployment

2006-01-07 Thread Andrzej Bialecki
ave a self-contained deployment package that you can simply copy around. However, this does NOT by any means solve the problem of static NutchConf, that problem is on the level of API usage and not the fi

Re: NPE in Indexer.java line 184

2006-01-08 Thread Andrzej Bialecki
. Stacktrace? -- Best regards, Andrzej Bialecki

Re: NPE in Indexer.java line 184

2006-01-09 Thread Andrzej Bialecki
Gal Nitzan wrote: Hi Andrzej, The value cannot be null is my message :) :) I'm guessing that you are using Fetcher in non-parsing mode, and then you run ParseSegment as a separate step, right? Please try the attached patch. -- Best regards, Andrzej Bia

Re: NPE in Indexer.java line 184

2006-01-09 Thread Andrzej Bialecki
ns no segment name nor score in parseData.metadata. Please test and report if it helps. -- Best regards, Andrzej Bialecki

Re: OpenOffice and Excel parsers

2006-01-10 Thread Andrzej Bialecki
arser will be added today or tomorrow. -- Best regards, Andrzej Bialecki

Re: Problem with latest SVN during reduce phase

2006-01-11 Thread Andrzej Bialecki
to parseData.metadata. I was waiting for someone to test it... but this could as well be you ;-) Anyway to recover the crawl/finish the reduce job from where it failed? I don't think so... although it would be a nice feature. -- Best regards, Andrzej Bia

Re: MapReduce and segment merging

2006-01-12 Thread Andrzej Bialecki
Mike Alulin wrote: Is it possible to merge segments in the map reduce version of Nutch? Not yet. -- Best regards, Andrzej Bialecki

Re: MapReduce and segment merging

2006-01-12 Thread Andrzej Bialecki
not a good option as i have millions of documents and I DO know which of them were updated without requesting them. This is a development version, nobody said it's feature complete. Patience, my friend... or spend some effort to improve it. ;-) -- Best regards, Andrze

Generating multiple fetchlists between updates

2006-01-13 Thread Andrzej Bialecki
is a cost to modify the CrawlDB, but there is also a cost to not be able to generate multiple different fetchlists and fetch them in parallel... -- Best regards, Andrzej Bialecki

Re: Per-page crawling policy

2006-01-16 Thread Andrzej Bialecki
to process it... but overall these operations scale much better in 0.8 than before. -- Best regards, Andrzej Bialecki

Re: tool to mount nutch filesystem

2006-01-21 Thread Andrzej Bialecki
John X wrote: Hi, Otis, On Fri, Jan 20, 2006 at 09:31:16PM -0800, [EMAIL PROTECTED] wrote: Hi John, NDFS + MapReduce will soon become a separate Lucene sub-project. In one sub-project or two separately? In one. They are closely related anyway. -- Best regards, Andrzej Bialecki

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Andrzej Bialecki
ncountered some identification problems with some specific sites (with blogger for instance), and I plan to investigate on this point. * Another pending task : the analysis (and coding) of multilingual querying support. -- Best regards, Andrze

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: I would like to decouple Lang Id from Nutch and move it in Lucene contrib/ in the near future. Does that sound ok? +1 from me. -- Best regards, Andrzej Bialecki

Re: xml-parser plugin contribution

2006-01-24 Thread Andrzej Bialecki
it is considered as a spam). How can I send the source code ? Best regards. Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a new issue and attach the file. -- Best regards, Andrzej Bia

Re: lang identifier and nutch analyzer in trunk

2006-01-24 Thread Andrzej Bialecki
d always try to guess the language if we have enough text, unless we can be sure that we deal with properly marked documents (not such uncommon case in Intranets). -- Best regards, Andrzej Bialecki

Re: Two possible extensions

2006-01-24 Thread Andrzej Bialecki
ty and general usefulness that this should be coordinated with the existing efforts, and discussed on the mailing lists. -- Best regards, Andrzej Bialecki

Re: lang identifier and nutch analyzer in trunk

2006-01-24 Thread Andrzej Bialecki
meta data is found, then checks that it is the correct value regarding the score of this language (statistical analysis). If the score is too low or no meta data is found, then we perform a full statistical analysis. No? Yes :-) -- Best regards, Andrze

Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-26 Thread Andrzej Bialecki
. Either way is fine with me. Perhaps splitting this into two commits would make it easier to fix potential breakage... -- Best regards, Andrzej Bialecki

Re: [Nutch-cvs] svn commit: r372810 - /lucene/nutch/trunk/bin/nutch

2006-01-27 Thread Andrzej Bialecki
's better to avoid bash-isms, if we easily can. Not all the world looks like Linux. ;-) -- Best regards, Andrzej Bialecki

Re: [Nutch-cvs] svn commit: r372810 - /lucene/nutch/trunk/bin/nutch

2006-01-27 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Namely? I didn't notice any ... I think it's better to avoid bash-isms, if we easily can. Not all the world looks like Linux. ;-) IFS, at least. I tried running this on Solaris, where /bin/sh is not bash, and it didn't work. It c

Re: svn commit: r372810 - /lucene/nutch/trunk/bin/nutch

2006-01-27 Thread Andrzej Bialecki
't matter where it's installed. -- Best regards, Andrzej Bialecki

Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src

2006-01-29 Thread Andrzej Bialecki
Sami Siren wrote: should there be a conf.setObject(clazz,impl); inside that try ? Yes, of course, thanks for catching it! -- Best regards, Andrzej Bialecki

Re: where we need meta data?

2006-01-29 Thread Andrzej Bialecki
just that. -- Best regards, Andrzej Bialecki

Re: indexSorter - applied to SVN or patch in Jira?

2006-01-31 Thread Andrzej Bialecki
have good differentiation of values across page scores. Performance gains are significant, in certain situations dramatic (e.g. 10x faster). -- Best regards, Andrzej Bialecki

Re: [jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-31 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12364544 ] Andrzej Bialecki commented on NUTCH-169: - This patch looks good! If there are no further objections, I'll tes

Lucene's VInt for lengths/counts/sizes

2006-01-31 Thread Andrzej Bialecki
4-byte ints for the size of list, e.g. ParseData.outlinks Overall I think the size savings could be considerable, at the cost of some CPU. -- Best regards, Andrzej Bialecki
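
For readers unfamiliar with the encoding this thread refers to, here is a minimal sketch of a variable-length integer in the style commonly documented for Lucene's VInt: seven data bits per byte, with the high bit set when more bytes follow. The class and method names (VIntSketch, encode, decode) are hypothetical and chosen for illustration only; this is not the Nutch or Lucene implementation, and it assumes non-negative values such as lengths, counts, and sizes.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of Nutch or Lucene.
public class VIntSketch {

    // Encode an int using 7 data bits per byte, low-order bits first.
    // The high bit of each byte is set when another byte follows.
    public static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a value produced by encode().
    public static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384, 1000000};
        for (int n : samples) {
            byte[] enc = encode(n);
            System.out.println(n + " takes " + enc.length
                    + " byte(s), decodes to " + decode(enc));
        }
    }
}
```

Under this scheme a list size below 128 occupies one byte instead of the four bytes of a fixed-width int, which is the kind of saving the message weighs against the extra CPU spent on encoding and decoding.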

Cmd line for running plugins

2006-02-01 Thread Andrzej Bialecki
useful, I can add this to PluginRepository. -- Best regards, Andrzej Bialecki

Re: Cmd line for running plugins

2006-02-01 Thread Andrzej Bialecki
just the ones declared as extensions in plugin.xml. -- Best regards, Andrzej Bialecki

Re: Cmd line for running plugins

2006-02-02 Thread Andrzej Bialecki
Andrzej Bialecki wrote: It works rather nicely. If other people find it useful, I can add this to PluginRepository. Committed. -- Best regards, Andrzej Bialecki

Re: Carrot2 v. 1.0.1. [clustering plugin]

2006-02-03 Thread Andrzej Bialecki
maintenance line or both? Most efforts go to the mapred version (in trunk/ now). If it's not much work, or if there are compelling reasons, we try to update the maintenance branches, but they are diverging more and more from the

[jira] Closed: (NUTCH-198) SWF parser

2006-02-03 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-198?page=all ] Andrzej Bialecki closed NUTCH-198: --- Resolution: Fixed Added. > SWF parser > -- > > Key: NUTCH-198 > URL: http://issues.apache.org/jira/b

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-07 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365413 ] Andrzej Bialecki commented on NUTCH-192: - I have a different opinion on this (I think MapWritable is a sufficiently general-purpose data structure that would be

[jira] Commented: (NUTCH-205) Wrong 'fetch date' for non available pages

2006-02-07 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-205?page=comments#action_12365434 ] Andrzej Bialecki commented on NUTCH-205: - This is a design choice, not a bug. The errors you see are due to improper configuration - some threads cannot access the

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365536 ] Andrzej Bialecki commented on NUTCH-192: - Yes, that's an issue - due to the way WritableName is initialized it's difficult to add more mappings l

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-02-08 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365623 ] Andrzej Bialecki commented on NUTCH-139: - I like this patch, the split of Metadata names into interfaces looks right. +1. > Standard metadata property names in

[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365648 ] Andrzej Bialecki commented on NUTCH-192: - Looks good to me, too. If there are no further objections, I can commit this latest patch, modulo some minor whitespace

[jira] Commented: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-209?page=comments#action_12365782 ] Andrzej Bialecki commented on NUTCH-209: - All Nutch classes + plugins weigh about 16MB. It feels a bit heavy to distribute this to every node on every task request

[jira] Commented: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-209?page=comments#action_12365800 ] Andrzej Bialecki commented on NUTCH-209: - No problem. Re: plugin loading: well, when we are done building the binary distribution we already know for sure what

[jira] Closed: (NUTCH-192) meta data support for CrawlDatum

2006-02-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ] Andrzej Bialecki closed NUTCH-192: --- Resolution: Fixed Applied. Thank you! > meta data support for CrawlDatum > > > Ke

[jira] Commented: (NUTCH-198) SWF parser

2006-02-11 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-198?page=comments#action_12366068 ] Andrzej Bialecki commented on NUTCH-198: - This parser is already added in 0.8. You should be able to add it to 0.7.x with little changes. > SWF par

[jira] Updated: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2006-02-27 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-61?page=all ] Andrzej Bialecki updated NUTCH-61: --- Attachment: 20060227.txt This patch is updated to the current trunk/ . The default configuration works as before, and uses DefaultFetchSchedule. If there

[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

2006-02-27 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12368051 ] Andrzej Bialecki commented on NUTCH-61: I contemplated this for a while, and then decided against it. The main reason was that currently most of the "plug

[jira] Commented: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-227?page=comments#action_12369660 ] Andrzej Bialecki commented on NUTCH-227: - Isn't it so that QueryFilter (which is an interface) already extends Configurable? What seems to be missi

[jira] Closed: (NUTCH-229) improved handling of plugin folder configuration

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-229?page=all ] Andrzej Bialecki closed NUTCH-229: --- Resolution: Fixed Applied. Thanks! > improved handling of plugin folder configurat

[jira] Closed: (NUTCH-206) search server throws InstantiationException

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-206?page=all ] Andrzej Bialecki closed NUTCH-206: --- Fix Version: 0.8-dev Resolution: Fixed Fixed in r 384011. > search server throws InstantiationExcept

[jira] Closed: (NUTCH-203) ParseSegment throws InstantiationException

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-203?page=all ] Andrzej Bialecki closed NUTCH-203: --- Fix Version: 0.8-dev Resolution: Fixed Fixed in r 376315. Thank you! > ParseSegment throws InstantiationExcept

[jira] Closed: (NUTCH-218) need DOAP file for Nutch

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-218?page=all ] Andrzej Bialecki closed NUTCH-218: --- Resolution: Fixed Applied by Doug. > need DOAP file for Nutch > > > Key: NUTCH-218 >

[jira] Closed: (NUTCH-3) multi values of header discarded

2006-03-13 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Andrzej Bialecki closed NUTCH-3: - Resolution: Fixed Fixed in r 376089. > multi values of header discarded > > > Key: NUTCH-3 >

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370356 ] Andrzej Bialecki commented on NUTCH-230: - Hmmm, this is a deeply philosophical question... Should you spread out the OPIC score to all links that a page sports, or

[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370426 ] Andrzej Bialecki commented on NUTCH-230: - Yes, these are good examples - I'll prepare a patch to make this a boolean setting; if false (default) the calculation

[jira] Updated: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-17 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=all ] Andrzej Bialecki updated NUTCH-230: Attachment: patch.txt Please review this patch, if it's ok I'll commit it. > OPIC score for outlinks should be based on # of valid lin

[jira] Created: (NUTCH-235) Duplicate Inlink values

2006-03-18 Thread Andrzej Bialecki (JIRA)
Duplicate Inlink values --- Key: NUTCH-235 URL: http://issues.apache.org/jira/browse/NUTCH-235 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Andrzej Bialecki Assigned to: Andrzej Bialecki Reading the code for

[jira] Updated: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=all ] Andrzej Bialecki updated NUTCH-235: Attachment: patch.txt Proposed fix for this issue. If there are no objections I'll commit this shortly. > Duplicate Inlin

[jira] Commented: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371141 ] Andrzej Bialecki commented on NUTCH-235: - No problem, I can change this. However, going through every link will then require creation of an Iterator. We do this when

[jira] Updated: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=all ] Andrzej Bialecki updated NUTCH-235: Attachment: set-patch.txt Same functionality, but using a HashSet. > Duplicate Inlink values > --- > > Ke

[jira] Closed: (NUTCH-235) Duplicate Inlink values

2006-03-20 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-235?page=all ] Andrzej Bialecki closed NUTCH-235: --- Fix Version: 0.8-dev Resolution: Fixed HashSet-based version of the patch applied. > Duplicate Inlink val

[jira] Closed: (NUTCH-234) Clustering extension code cleanups and a real JUnit test case for the current implementation.

2006-03-21 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-234?page=all ] Andrzej Bialecki closed NUTCH-234: --- Fix Version: 0.8-dev Resolution: Fixed Applied. Thanks! > Clustering extension code cleanups and a real JUnit test case for the curr

[jira] Commented: (NUTCH-237) Carrot2 clustering plugin upgrade.

2006-03-23 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-237?page=comments#action_12371606 ] Andrzej Bialecki commented on NUTCH-237: - Hmm, I'm not sure I like this patch. It removes support for other languages than English. While I can agree wit

[jira] Created: (NUTCH-238) NDFSck - fsck utility for NDFS (pre-Hadoop)

2006-03-23 Thread Andrzej Bialecki (JIRA)
Bialecki Assigned to: Andrzej Bialecki Attachments: NDFSck.java This is a utility to check health status of NDFS. NOTE: this is compatible ONLY with pre-Hadoop Nutch versions! (Another version has been submitted for Hadoop volumes). -- This message is automatically generated by JIRA. - If you

[jira] Updated: (NUTCH-238) NDFSck - fsck utility for NDFS (pre-Hadoop)

2006-03-23 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-238?page=all ] Andrzej Bialecki updated NUTCH-238: Attachment: NDFSck.java > NDFSck - fsck utility for NDFS (pre-Hadoop) > --- > > Ke

[jira] Created: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-28 Thread Andrzej Bialecki (JIRA)
Reporter: Andrzej Bialecki This patch refactors all places where Nutch manipulates page scores, into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works. Multiple scoring plugins can be run in sequenc

[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-28 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=all ] Andrzej Bialecki updated NUTCH-240: Attachment: patch.txt > Scoring API: extension point, scoring filters and an OPIC plu

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-29 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372379 ] Andrzej Bialecki commented on NUTCH-240: - Yes, one of the reasons I wanted to discuss these patches is that they uncovered some of the underlying ugliness... ;) The

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-30 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372580 ] Andrzej Bialecki commented on NUTCH-240: - > First, I hope my critical remarks were not taken personally. I am thankful > for this and all of your contributions.

[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=all ] Andrzej Bialecki updated NUTCH-240: Attachment: Generator.patch.txt This patch is an intermediate step towards the simplification of the scoring API. It changes Generator to use an

[jira] Closed: (NUTCH-238) NDFSck - fsck utility for NDFS (pre-Hadoop)

2006-04-03 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-238?page=all ] Andrzej Bialecki closed NUTCH-238: --- Resolution: Fixed I'm closing this issue - DFSck has been committed to Hadoop, and anyone wishing to use this version can get it here. >

[jira] Closed: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-04-03 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-230?page=all ] Andrzej Bialecki closed NUTCH-230: --- Resolution: Fixed Patch applied. > OPIC score for outlinks should be based on # of valid links, not total # of >

[jira] Assigned: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=all ] Andrzej Bialecki reassigned NUTCH-240: --- Assign To: Andrzej Bialecki > Scoring API: extension point, scoring filters and an OPIC plu

[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=all ] Andrzej Bialecki updated NUTCH-240: Attachment: patch1.txt Updated patch, includes the Generator.patch.txt. Changes: * reduce creation of new Objects in CrawlDbReducer * simplify API by

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-05 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373264 ] Andrzej Bialecki commented on NUTCH-240: - Oops, sorry, that was a last moment change ... I fixed it now, thanks for spotting this. > Scoring API: extension po

[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite

2006-04-05 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373396 ] Andrzej Bialecki commented on NUTCH-244: - We don't pass the Configuration object to the constructor, so we have no way to read the value of this. Configuration i

[jira] Updated: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-07 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=all ] Andrzej Bialecki updated NUTCH-240: Attachment: patch2.txt Minor refactoring: passScore* methods now allow access to more data. I found this useful when implementing a different scoring

[jira] Closed: (NUTCH-254) Fetcher throws NullPointer if redirect URL is filtered

2006-04-24 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-254?page=all ] Andrzej Bialecki closed NUTCH-254: --- Resolution: Fixed Fixed - actually there were two places which needed fixing, I also somewhat simplified the logic Thank you! > Fetcher thr

[jira] Closed: (NUTCH-125) OpenOffice Parser plugin

2006-04-25 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-125?page=all ] Andrzej Bialecki closed NUTCH-125: --- Fix Version: 0.8-dev Resolution: Fixed Applied, with some changes (due to Nutch API changes, and also it uses lib-xml plugin now

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-30 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12377200 ] Andrzej Bialecki commented on NUTCH-240: - If there are no further suggestions or objections, I'd like to move forward on this patch. I know the passScore* method

[jira] Created: (NUTCH-263) MapWritable.equals() doesn't work properly

2006-05-03 Thread Andrzej Bialecki (JIRA)
MapWritable.equals() doesn't work properly -- Key: NUTCH-263 URL: http://issues.apache.org/jira/browse/NUTCH-263 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Andrzej Bialecki MapWritable.equals

[jira] Updated: (NUTCH-263) MapWritable.equals() doesn't work properly

2006-05-03 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-263?page=all ] Andrzej Bialecki updated NUTCH-263: Attachment: patch1.txt This patch fixes the issue, but at the cost of creating new objects... improvements are welcome. > MapWritable.equals() does
