Re: Iterating spidered pages

2005-07-05 Thread Andrzej Bialecki
) and the other at parse level (ParseData.metadata). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com

SVN repo, Where Art Thou? (Re: [jira] Closed: (NUTCH-66) Cookies are not being read properly)

2005-07-20 Thread Andrzej Bialecki
Andrzej Bialecki (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-66?page=all ] Andrzej Bialecki closed NUTCH-66: -- Resolution: Fixed Just tried to commit the fixes, and svn said it could not find the repository. I went

Vacation...

2005-07-26 Thread Andrzej Bialecki
Hi folks, I'm away on a 2 weeks vacation, so I won't be able to follow the discussions. If you think I should be involved in something, please mail me directly, so I can pick it up more easily when I'm back. Thanks! -- Best regards, Andrzej Bialecki

Detecting unmodified content patches (Re: near-term plan)

2005-08-04 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: So, I would propose a deadline of Aug 8 for the last commits, and then perhaps Aug 15 for the release? Sounds good to me. Thanks for helping with this! Unfortunately, the patches related to detecting the unmodified content will have to wait

Re: Tutorial

2005-08-08 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: I can commit such changes for 0.7 release (it means today) if I got positive feedback from other committers. +1 -- Best regards, Andrzej Bialecki

Re: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Andrzej Bialecki
set them to 0.07 now, so that we have the right values in the release ... -- Best regards, Andrzej Bialecki

Re: [Nutch-cvs] svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-09 Thread Andrzej Bialecki
. Considering that this is the last release before merging the map-reduce, doing a branch seems very appropriate. -- Best regards, Andrzej Bialecki

Re: Site Content not indexed ? Nutch 0.7

2005-08-12 Thread Andrzej Bialecki
is NOT stored in Lucene index, it's just indexed there - the text itself is stored in the segment parse_text. -- Best regards, Andrzej Bialecki

VOTE: clustering plugin update for Rel 0.7

2005-08-15 Thread Andrzej Bialecki
, which works with recent code, your help in testing would be appreciated. Please vote +1 for and -1 against. -- Best regards, Andrzej Bialecki

Re: 0.7 branch

2005-08-23 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: Nutch crawler is breadth-first ?

2005-09-07 Thread Andrzej Bialecki
such sites, a depth-first crawler would be better. It's not too difficult to build one, using the tools already present in Nutch. Contributions are welcome... ;-) -- Best regards, Andrzej Bialecki

Re: bug in bin/nutch?

2005-09-09 Thread Andrzej Bialecki
a general tutorial would be difficult... unless it would be simply you need to run ./nutch crawl ... -- Best regards, Andrzej Bialecki

Re: fetch performance

2005-09-09 Thread Andrzej Bialecki
% of your CPU. :-) Solution: upgrade PDFBox to the yet unreleased 0.7.2. -- Best regards, Andrzej Bialecki

Re: fetch performance

2005-09-10 Thread Andrzej Bialecki
to just replace the old JAR with the new one, but keeping the same name as the old JAR. -- Best regards, Andrzej Bialecki

Re: fetch performance

2005-09-10 Thread Andrzej Bialecki
, through the protocol-httpclient plugin. -- Best regards, Andrzej Bialecki

Re: Nutch 6.1 running issu

2005-09-10 Thread Andrzej Bialecki
, where Nutch cannot be forced to re-fetch the page because every time you try it remains unmodified - but you need refetching the actual data because e.g. you lost that segment data... -- Best regards, Andrzej Bialecki

Re: fetch performance

2005-09-10 Thread Andrzej Bialecki
... -- Best regards, Andrzej Bialecki

Re: [Nutch-cvs] svn commit: r280368 - /lucene/nutch/branches/mapred/src/java/org/apache/nutch/fs/TestClient.java

2005-09-12 Thread Andrzej Bialecki
to something like NDFSShell or something like this? Tests are placed somewhere else, so the name of this class doesn't fit here (and IMHO it should stay here, or perhaps in tools/). -- Best regards, Andrzej Bialecki

Re: svn commit: r280396 - /lucene/nutch/tags/Release-0.7/

2005-09-12 Thread Andrzej Bialecki
to wait for these changes as it was the main reason to prepare 0.7.1 release). I have some time tomorrow or the day after, I'll do it then. -- Best regards, Andrzej Bialecki

Re: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-16 Thread Andrzej Bialecki
that. no? Yes, it comes from another package so I need to wrap it around in the plugin interfaces, give me a day or two... -- Best regards, Andrzej Bialecki

Re: svn commit: r290163 - in /lucene/nutch/branches/Release-0.7/src/plugin/clustering-carrot2: ./ lib/

2005-09-19 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: Hi Andrzej, Is anything related to clustering commits left? Or should we proceed with 0.7.1 release? I will commit the PDFBox update today, and then I don't have anything more... -- Best regards, Andrzej Bialecki

Re: hyperbolic browser api (I missed)

2005-09-22 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-11 Thread Andrzej Bialecki
This is interesting. Could you please check what is the difference in this benchmark, if you set HttpVersion.HTTP_1_1 in protocol-httpclient/HttpResponse.java:92 ? Unfortunately, Nutch cannot use that library because it's LGPL. -- Best regards, Andrzej Bialecki

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
httpclient is L-GPL, and hence not acceptable for apache.org. -- Best regards, Andrzej Bialecki

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: 100k regexps is still a lot, so I'm not totally sure it would be much faster, but perhaps worth checking. I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of strings expressions

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Andrzej Bialecki
, and there's no other way to solve the problem without losing text, then your patch has my +1. We should not drop the offending characters, but escape them. Either the Unicode entity (#nn;) or CDATA way is ok (and CDATA way is simpler). So, this is -1 for the patch. -- Best regards, Andrzej

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Andrzej Bialecki
characters like ' . Then we should take the best of both worlds - escape valid characters, and replace invalid ones with '?' or space, or nothing. I know a place where we could find some inspiration (Carrot2 XMLSerializerHelper.java ... ;-) ) -- Best regards, Andrzej Bialecki
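
A minimal sketch of the approach discussed in the two NUTCH-110 replies above: escape the characters that are valid in XML but have special meaning, and replace the ones that are not allowed at all. The class and method names are hypothetical, and this is neither the Nutch patch nor the Carrot2 helper mentioned in the message; the validity check follows the XML 1.0 Char production, and replacing invalid characters with a space is just one of the options named above.

```java
// Hypothetical helper for illustration; not the actual Nutch or Carrot2 code.
final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes markup characters and replaces invalid code points with a space. */
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isValidXmlChar(cp)) {
                out.append(' ');          // invalid in XML 1.0 (also catches lone surrogates)
                continue;
            }
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Calling `XmlTextSanitizer.sanitize(summary)` on each text field before it is written into the response would keep the output well-formed without silently dropping text.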

Re: Enter Chinese in search box, returns messy results

2005-10-13 Thread Andrzej Bialecki
Connector (server.xml). This could result in such output... -- Best regards, Andrzej Bialecki

Re: OPIC

2005-10-20 Thread Andrzej Bialecki
could also use an n-gram profile (either word-level or character level) with coarse quantization. -- Best regards, Andrzej Bialecki

Re: OPIC

2005-10-21 Thread Andrzej Bialecki
numerous problems with quantization noise). -- Best regards, Andrzej Bialecki

Re: NekoHTML 0.9.5

2005-11-01 Thread Andrzej Bialecki
in the config file). I found that in many cases TagSoup gives much better results, especially for pages with multiple html or body elements, where neko would give up... -- Best regards, Andrzej Bialecki

Re: Halloween Joke at Google

2005-11-02 Thread Andrzej Bialecki
true subject ... ;-) -- Best regards, Andrzej Bialecki

Re: Index update and Google Dance

2005-11-08 Thread Andrzej Bialecki
it would be possible to deploy replicas of segments across the set of DS$Server-s without getting duplicate results. -- Best regards, Andrzej Bialecki

Re: Index update and Google Dance

2005-11-09 Thread Andrzej Bialecki
segments to update their internal lists. -- Best regards, Andrzej Bialecki

Re: Lucene or Nutch

2005-11-09 Thread Andrzej Bialecki
make sense to move it to a separate project on its own (or maybe as a part of Jakarta Commons), but moving it into a catch-all purely optional category like Lucene contrib would increase risks that it slides into oblivion... -- Best regards, Andrzej Bialecki

Re: Lucene or Nutch

2005-11-10 Thread Andrzej Bialecki
in design, and also prepare these parts to be separated into their own projects. -- Best regards, Andrzej Bialecki

Re: lucene write.lock error

2005-11-12 Thread Andrzej Bialecki
- please see Lucene docs for details how to do this. -- Best regards, Andrzej Bialecki

Re: Nutch WebDb storage alternatives: Revisited

2005-11-18 Thread Andrzej Bialecki
it sound suitable for the new web database (I'm not familiar with the mapred branch of nutch)? You will find the mapred version much much more responsive. -- Best regards, Andrzej Bialecki

Problem with CRC files on NDFS

2005-11-19 Thread Andrzej Bialecki
in -put operation only if I first deleted all .*.crc files. -- Best regards, Andrzej Bialecki

Performance issues with ConjunctionScorer

2005-11-22 Thread Andrzej Bialecki
is to index it and then build the summaries. Please see the profiles here: http://www.getopt.org/nutch/profile/index.html -- Best regards, Andrzej Bialecki

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: On 11/22/05, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I've been profiling a Nutch installation, and to my surprise the largest amount of throwaway allocations and the most time spent was not in Nutch specific code, or IPC, but in Lucene

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Andrzej Bialecki
Sami Siren wrote: + if (k.contains(score)) { Since: 1.5 Ah, indeed. Fixed - thanks! -- Best regards, Andrzej Bialecki

Re: [jira] Created: (NUTCH-128) second configuration nodes overwrites first node

2005-11-24 Thread Andrzej Bialecki
this on purpose, so if it's not too complicated we should warn the user. -- Best regards, Andrzej Bialecki

Re: Nutch WebDb storage alternatives: Revisited

2005-11-29 Thread Andrzej Bialecki
for or against... Please use standard quoting rules... please. -- Best regards, Andrzej Bialecki

Re: Lucene performance bottlenecks

2005-12-08 Thread Andrzej Bialecki
(Moving the discussion to nutch-dev, please drop the cc: when responding) Doug Cutting wrote: Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only

Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Andrzej Bialecki
need to evaluate this and determine, for query log, how different the results are. Then a HitCollector can simply stop searching once a given number of hits are found. Doug -- Best regards, Andrzej Bialecki
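
As a rough illustration of the idea quoted above (a collector that stops the search once a given number of hits has been found), here is a self-contained sketch. The `HitCallback` interface is a stand-in for a search library's hit callback, and `LimitReached`, `LimitedCollector` and `scan` are hypothetical names, not the Lucene or Nutch classes. Stopping early only returns the best hits if documents are visited in a useful order, for example an index sorted by a static score.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical.
final class EarlyStopDemo {
    /** Stand-in for a search library's per-hit callback. */
    interface HitCallback {
        void collect(int doc, float score);
    }

    /** Unchecked exception used to abort the scan from inside the callback. */
    static final class LimitReached extends RuntimeException {
    }

    /** Collects at most maxHits documents, then aborts the scan. */
    static final class LimitedCollector implements HitCallback {
        private final int maxHits;
        final List<Integer> docs = new ArrayList<>();

        LimitedCollector(int maxHits) {
            this.maxHits = maxHits;
        }

        @Override
        public void collect(int doc, float score) {
            if (docs.size() >= maxHits) {
                throw new LimitReached();   // enough hits: stop scanning
            }
            docs.add(doc);
        }
    }

    /** Feeds matching docs to the callback; returns false if the scan was cut short. */
    static boolean scan(int[] matchingDocs, HitCallback callback) {
        try {
            for (int doc : matchingDocs) {
                callback.collect(doc, 1.0f);
            }
            return true;
        } catch (LimitReached e) {
            return false;
        }
    }
}
```

Because the scan reports whether it ran to completion, a caller can tell an exact hit count from a truncated one and estimate the total in the latter case.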

Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki
even with the group of servers that answered this particular query... My guess is that there could be different estimated indexes prepared for different values of the main boolean parameters, like filter=0... -- Best regards, Andrzej Bialecki

IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki
results, and their order was completely at odds with the original hit list. This is probably due to the scoring of sloppy phrases - I need to modify the test scripts to compare the explanations from matching results... -- Best regards, Andrzej Bialecki

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-12 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: By all means please start, this is still near the limits of my knowledge of Lucene... ;-) Attached is a class which sorts a Nutch index by boost. I have only tested it on a ~100 page index, where it appears to work correctly. Please tell me

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-13 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list... Yes. I was just posting the work-in-progress. Ok, I just tested IndexSorter for now. It appears

Re: Hard-coded Content-type checks

2005-12-13 Thread Andrzej Bialecki
Jérôme Charron wrote: If there is no objection, I will commit these changes in the next hours. +1. Great stuff! Finally we will be able to predict which parser works on which content... -- Best regards, Andrzej Bialecki

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Andrzej Bialecki
- in order to avoid name-clashes with other properties (e.g. blindly copied from the protocol headers). -- Best regards, Andrzej Bialecki

Re: best file system for NDFS?

2005-12-13 Thread Andrzej Bialecki
think these requirements point rather to a fairly primitive FS (not FAT - a real FS ;-) ), perhaps reiserfs is too complex. When in doubt, test. -- Best regards, Andrzej Bialecki

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Andrzej Bialecki
/ will soon be invaded by the code from mapred, I guess some time around the middle of January (Doug?) ... -- Best regards, Andrzej Bialecki

Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Andrzej Bialecki
Zaheed Haque wrote: what about the following: http://issues.apache.org/jira/browse/NUTCH-125 On its way ... ;-) I'll add it during this week. -- Best regards, Andrzej Bialecki

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly, at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. Here's a more

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: I tested it on a 5 mln index. Thanks, this is great data! Can you please tell a bit more about the experiments? In particular: . How were scores assigned to pages? Link analysis? log(number of incoming links) or OPIC? log() . How were

Re: vote results.

2005-12-15 Thread Andrzej Bialecki
-- Best regards, Andrzej Bialecki

Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: . How were the queries generated? From a log or randomly? Queries have been picked up manually, to test the worst performing cases from a real query log. So, for example, the 50% error rate might not be typical, but could be worst-case

Re: [Fwd: Crawler submits forms?]

2005-12-15 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: I agree. I just thought that we would prepare the release based on the code in trunk/ , and in that case we would like to wait with the merge before we do the release. My definition of trunk is that it should be where the majority of development

Re: version branches / two products

2005-12-15 Thread Andrzej Bialecki
for a prolonged period. * 0.7.x data formats are incompatible with the mapred branch. If we maintain both versions, those who want to migrate will have to convert their data. * the mapred version can be run in a local mode, which requires just a single machine. -- Best regards, Andrzej Bialecki

[VOTE] Committer access for Stefan Groschupf

2005-12-16 Thread Andrzej Bialecki
, and it's best to catch him now, before he realizes that there are other ways of spending time than hacking Nutch code... ;-) So, I'd like to call for a vote on adding Stefan as a committer. -- Best regards, Andrzej Bialecki

Re: problems http-client

2005-12-19 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: problems http-client

2005-12-19 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: GNU Getopt

2005-12-20 Thread Andrzej Bialecki
in Apache projects. I believe there is a similar library in Jakarta Commons, I don't know if it provides similar functionality...? -- Best regards, Andrzej Bialecki

Static initializers

2005-12-20 Thread Andrzej Bialecki
NutchConf per JVM it doesn't change anything. In case you want to run several different configs in a single JVM this approach provides the solution. We could follow this strategy for other plugin registry facades. Comments? -- Best regards, Andrzej Bialecki

Re: Static initializers

2005-12-20 Thread Andrzej Bialecki
Andrzej Bialecki wrote: URLFilters: private URLFilters(NutchConf) { // initialize plugins based on this instance of NutchConf } public static URLFilters get(NutchConf conf) { URLFilters res = (URLFilters)conf.get(urlfilters.key); if (res == null) { res
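
The quoted snippet above is cut off, so the following is only a sketch of the pattern it appears to describe: a facade with a private constructor whose single instance is cached inside the configuration object, giving one instance per configuration rather than one per JVM. `Conf` is a stand-in for NutchConf, and `getObject`/`setObject`, `UrlFilterRegistry` and the key name are assumptions made for this example, not the actual Nutch API.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for NutchConf; the object cache methods are assumed for this sketch.
final class Conf {
    private final Map<String, Object> objects = new HashMap<>();

    synchronized Object getObject(String key) {
        return objects.get(key);
    }

    synchronized void setObject(String key, Object value) {
        objects.put(key, value);
    }
}

// Hypothetical facade following the get(conf) idiom from the message above.
final class UrlFilterRegistry {
    private static final String KEY = "urlfilters.instance"; // hypothetical key

    private final Conf conf;

    private UrlFilterRegistry(Conf conf) {
        // A real facade would initialize its plugins from this configuration.
        this.conf = conf;
    }

    /** Returns the instance bound to this configuration, creating it on first use. */
    static UrlFilterRegistry get(Conf conf) {
        synchronized (conf) {                 // one creation per configuration
            UrlFilterRegistry res = (UrlFilterRegistry) conf.getObject(KEY);
            if (res == null) {
                res = new UrlFilterRegistry(conf);
                conf.setObject(KEY, res);
            }
            return res;
        }
    }
}
```

With a single configuration per JVM this behaves like the old static singleton; with several configurations in one JVM, each gets its own independently initialized instance.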

Re: Static initializers

2005-12-20 Thread Andrzej Bialecki
, but then you have to remember to call setConf() before you do anything else... I'll work on this to see where it leads. -- Best regards, Andrzej Bialecki

Re: [Nutch-dev] distributed search

2005-12-20 Thread Andrzej Bialecki
the segment data. So, after you slice the segments you need to re-index them. Sorry. I believe it's possible to do the index slicing, it's just not implemented yet... -- Best regards, Andrzej Bialecki

IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki
is not just a plain precision, recall, tf/idf and other tangible measures, it's also a sort of political statement of the engine's operator. ;-) To conclude, I will add the IndexSorter.java to the core classes, and I suggest to continue the experiments ... -- Best regards, Andrzej Bialecki

Re: IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki
American Jeff Bowden wrote: Andrzej Bialecki wrote: Hi, I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems

Re: Commons HttpClient 3.0 released

2005-12-22 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi, Since we know that our httpclient plugin has some problems, maybe it is sensible to update to the new library. I guess this is some work, but maybe someone is interested in taking the job. :) I'll take it, thanks for the heads-up. -- Best regards, Andrzej Bialecki

Removing old classes from trunk/

2005-12-22 Thread Andrzej Bialecki
interested in Nutch history can always retrieve them from SVN or from the past releases. If I don't hear any objections, I'll do it some time during Christmas. -- Best regards, Andrzej Bialecki

Mega-cleanup in trunk/

2005-12-28 Thread Andrzej Bialecki
Hi, I just committed a large patch to clean up the trunk/ of obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as they should ... -- Best regards, Andrzej Bialecki

Re: Trunk is broken

2005-12-29 Thread Andrzej Bialecki
, threads, parsing); // fetch it Also the Javadoc build has a million errors. Fixed. Thanks for spotting this! -- Best regards, Andrzej Bialecki

Re: Bug in DeleteDuplicates.java ?

2005-12-30 Thread Andrzej Bialecki
maxDoc is zero? Ka-boom! ;-) You're right, this should be wrapped in an IOException and rethrown. -- Best regards, Andrzej Bialecki

Re: Trunk is broken

2005-12-30 Thread Andrzej Bialecki
Thomas Jaeger wrote: Hi Andrzej, Gal Nitzan wrote: It seems that Trunk is now broken... DmozParser seems to be broken, too. Its package declaration is still org.apache.nutch.crawl instead of org.apache.nutch.tools. Fixed. Thanks! -- Best regards, Andrzej Bialecki

Adaptive fetch interval unmodified content detection, episode II

2005-12-30 Thread Andrzej Bialecki
around with them - they work properly even now. -- Best regards, Andrzej Bialecki

Re: Mega-cleanup in trunk/

2006-01-02 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: Andrzej Bialecki wrote: Hi, I just committed a large patch to clean up the trunk/ of obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as they should ... Hi, I am not sure what is wrong but a lot of JUnit

Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src

2006-01-02 Thread Andrzej Bialecki
produce an MD5 digest, just differently. I'll fix it. -- Best regards, Andrzej Bialecki

Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Using the original index, it was possible for pages with high tf/idf of a term, but with a low boost value (the OPIC score), to outrank pages with high boost but lower tf/idf of a term. This phenomenon leads quite often to results

Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki
of the LuceneQueryOptimizer.LimitedCollector constructor, instead of super(maxHits) it should be super(numHits) - this was actually the bug, which was causing that mysterious slowdown for higher values of MAX_HITS. -- Best regards, Andrzej Bialecki

Re: NullPointerException (new as of Dec 31st)

2006-01-03 Thread Andrzej Bialecki
Rod Taylor wrote: During a fetch I have recently started getting these (pretty consistently). Fixed. Thanks! -- Best regards, Andrzej Bialecki

Re: mapred crawling exception - Job failed!

2006-01-03 Thread Andrzej Bialecki
:) in the revision r365576. Please report if it doesn't fix it for you. -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki
this job, can I get a go from the other developers? +1 from me. -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki
Jérôme Charron wrote: Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values? Running many different tasks in parallel, each using different config, inside the same JVM. -- Best regards, Andrzej Bialecki

Re: IndexSorter optimizer

2006-01-04 Thread Andrzej Bialecki
sort of use with CachingFilters, only they propose to store them on-disk instead of limiting the cache to relatively small number of filters kept in RAM... -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki
by tasktrackers to instantiate local tasks using copies of the original NutchConf instance. -- Best regards, Andrzej Bialecki

Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-04 Thread Andrzej Bialecki
didn't see any problems, I think you can go ahead. -- Best regards, Andrzej Bialecki

Re: mapred crawling exception - Job failed!

2006-01-04 Thread Andrzej Bialecki
? -- Best regards, Andrzej Bialecki

Re: mapred crawling exception - Job failed!

2006-01-05 Thread Andrzej Bialecki
the content). Is it easy to reproduce this if I knew the seed urls? If that's the case, please send me the seed urls (contact me off the list, if it's sensitive). -- Best regards, Andrzej Bialecki

Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
the URLFilters and do the score calculations. Any comments? -- Best regards, Andrzej Bialecki

Re: no static NutchConf

2006-01-05 Thread Andrzej Bialecki
the performance somehow, since we do not need to scan the plugin folder and time. Yes, I agree on both accounts. :-) -- Best regards, Andrzej Bialecki

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
... OTOH, perhaps it's a premature micro-optimization. We can move it to metadata for now, but I see it as a strong candidate to be moved back... -- Best regards, Andrzej Bialecki

Re: problems http-client

2006-01-06 Thread Andrzej Bialecki
on? Please do go on! -- Best regards, Andrzej Bialecki

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
with: java.lang.ClassCastException: java.util.ArrayList -Matt Zytaruk Could you please add a call to printStackTrace() in that catch{} statement, so that we know where the exception is thrown? -- Best regards, Andrzej Bialecki

Re: Class Cast exception

2006-01-06 Thread Andrzej Bialecki
, and leave this code to handle older versions... -- Best regards, Andrzej Bialecki
