NutchQuery adding non required Terms

2006-01-12 Thread Stefan Groschupf
Hi, I would love to build a nutch Query object via API and not using the Queryparser. In my case I need the complete set of boolean operators in the query, so required (AND) and non required (OR) terms and prohibited (NOT). I notice that in general this would be possible to add a clause in

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-11 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362447 ] Stefan Groschupf commented on NUTCH-169: I wonder what is the performance impact of this patch - in many places, where previously we used the static methods on classes

[jira] Created: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Removing the static NutchConf.get

[jira] Updated: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Stefan Groschupf updated NUTCH-169: --- Attachment: nutchConf.patch The patch was created by Marko Bauhardt with some help from me, so full credits to Marko! It removes any access of nutchConf

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362334 ] Stefan Groschupf commented on NUTCH-169: I forgot to mention that this is the first version, just for discussion and to provide Jerome the changed API; it is not the final

ParserFactory test fail

2006-01-10 Thread Stefan Groschupf
Hi Jerome, I'm not sure, but could it be that with your new html protocol plugin the ParserFactory fails, since a component requires log4j? Maybe we should then add log4j to the core classpath, since I had added log4j to NUTCH_HOME/lib and then the test ran successfully.

Re: ParserFactory test fail

2006-01-10 Thread Stefan Groschupf
Sure, my mistake. On 10.01.2006 at 18:24, Jérôme Charron wrote: Hi Stefan, No, in fact I have refactored the code of the protocol-http plugins, not the html parser. So I don't think the log4j error comes from this code. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

[jira] Commented: (NUTCH-169) remove static NutchConf

2006-01-10 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-169?page=comments#action_12362393 ] Stefan Groschupf commented on NUTCH-169: Great! Thanks a lot Jerome!!! We will continue to fix some smaller bugs we introduced and JobConf related issue and hopefully

why index not in segment anymore

2006-01-09 Thread Stefan Groschupf
Hi Doug, in nutch 0.8 the index is not in the segment folder any more. What was the reason for that? In the context of a web gui it would maybe be better to have the index in the segment folder as well, since the segment folder would then be the single item whose life-cycle has to be managed. Thanks for a

test suite fails?

2006-01-08 Thread Stefan Groschupf
Hi, is anyone able to run the test suite without any problems? Stefan --- company:http://www.media-style.com forum:http://www.text-mining.org blog:http://www.find23.net

Re: Reporter interface

2006-01-07 Thread Stefan Groschupf
On 07.01.2006 at 00:43, Andrew McNabb wrote: I'm looking at the Reporter interface, and I would like to verify my understanding of what it is. It appears to me that Reporter.setStatus() is called periodically during an operation to give a human-readable description of how far the progress

[jira] Created: (NUTCH-166) secure jobtracker info pages with a password

2006-01-07 Thread Stefan Groschupf (JIRA)
secure jobtracker info pages with a password Key: NUTCH-166 URL: http://issues.apache.org/jira/browse/NUTCH-166 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8

[jira] Updated: (NUTCH-166) secure jobtracker info pages with a password

2006-01-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-166?page=all ] Stefan Groschupf updated NUTCH-166: --- Attachment: passwordPatch.txt secure jobtracker info pages with a password Key: NUTCH-166

Re: [bug?] PRC called emthod require parameter

2006-01-06 Thread Stefan Groschupf
What bug was that? What is your one-line fix? http://www.nabble.com/RCP-known-limitation-or-bug--t688207.html something like: Object[] values; method.getReturnType()!=null ? values = (Object[])Array.newInstance (method.getReturnType(),wrappedValues.length) : values = new Object[0];
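As a hedged illustration of the fix sketched in this message: `Method.getReturnType()` does not return null for a void method; it returns `void.class` (`Void.TYPE`), and `Array.newInstance` rejects that type with an `IllegalArgumentException`. The sketch below is a standalone example and not the Nutch RPC code; the class `ReturnArrays`, the interface `Demo`, and the helpers `method` and `allocate` are hypothetical names chosen for illustration.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Method;

// Hypothetical standalone sketch; this is not the actual Nutch RPC class.
class ReturnArrays {

    // Example interface with one void method and one method returning a value.
    interface Demo {
        void ping();
        String echo(String s);
    }

    // Look up a method on Demo, wrapping the checked exception for brevity.
    static Method method(String name, Class<?>... params) {
        try {
            return Demo.class.getMethod(name, params);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Allocate an array able to hold `count` return values of `method`.
    static Object[] allocate(Method method, int count) {
        Class<?> type = method.getReturnType();
        if (type == Void.TYPE) {
            // A void method produces no values, so there is nothing to collect.
            return new Object[0];
        }
        if (type.isPrimitive()) {
            // Array.newInstance(int.class, n) yields int[], which is not an Object[].
            return new Object[count];
        }
        return (Object[]) Array.newInstance(type, count);
    }

    public static void main(String[] args) {
        System.out.println(allocate(method("ping"), 3).length);
        System.out.println(allocate(method("echo", String.class), 3).getClass().getSimpleName());
    }
}
```

Testing against `Void.TYPE` rather than `null` is the design choice here, since the null branch of the quoted one-liner would never be taken.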

Re: no static NutchConf

2006-01-05 Thread Stefan Groschupf
I have two more ideas: 1) create NutchConf as interface (not class) 2) make it work as plugin I like the idea to make the conf as a singleton and understand the need to be able to integrate nutch. However I would love to do one first step and later on we can make this second step. I made

Re: no static NutchConf

2006-01-05 Thread Stefan Groschupf
(2) What I'd REALLY like to see is if NutchConf were an interface, As mentioned, give us some time to get the first step done and then I'm sure such community contributions are always welcome. Maybe people can work together on this. Stefan

Re: Per-page crawling policy

2006-01-05 Thread Stefan Groschupf
I like the idea and it is another step in the direction of vertical search, where I personally see the biggest chance for nutch. How to implement it? Surprisingly, I think that it's very simple - just adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and

no static NutchConf

2006-01-04 Thread Stefan Groschupf
Hi, to move forward in the direction of having a nutch gui, I would love to start removing the static access of NutchConf. Based on experience, I would first love to get a kind of general agreement and a 'go' before wasting too much time on an unaccepted solution. I suggest: + removing

Re: no static NutchConf

2006-01-04 Thread Stefan Groschupf
I don't fully agree with this. In most such cases, you already have a NutchConf instance in the method or class context, so it makes sense to use it in the constructor. You could add these constructors with all parameters iterated, but I'd expect that the constructors using NutchConf

Re: LogFormatter

2006-01-03 Thread Stefan Groschupf
Hi, I also agree and would love to see things changed. In general I would love to be able to write log files to custom storage types as well. For example, it would be great if it were possible to write log files into the ndfs or into a database. Especially for smaller scaled

Re: [bug?] PRC called emthod require parameter

2006-01-03 Thread Stefan Groschupf
Different parameters are sent to each address. So params.length should equal addresses.length, and if params.length==0 then addresses.length==0 and there's no call to be made. Make sense? It might be clearer if the test were changed to addresses.length==0. Yes, this would be better,

[jira] Closed: (NUTCH-154) Unable to add/update new files to fetchlist/fetcher and thus index, when u rerun crawl tool on same db.

2005-12-28 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-154?page=all ] Stefan Groschupf closed NUTCH-154: -- Resolution: Won't Fix Please ask questions on the mailing lists; this is a bug tracking tool. Unable to add/update new files to fetchlist/fetcher

[jira] Closed: (NUTCH-55) Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available

2005-12-28 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-55?page=all ] Stefan Groschupf closed NUTCH-55: - Resolution: Duplicate Duplicate of NUTCH-59 Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available

Fwd: bug in Nutch wiki - FAQ

2005-12-26 Thread Stefan Groschupf
I'm sending this to you because you are active on the nutch-users list and I am too lazy to subscribe at this particular moment. Please pass on / act as you see fit. Wiki itself seems immutable at least to the likes of me. -Jeff = currently By default the [WWW] file plugin is

Re: severe error in fetch

2005-12-25 Thread Stefan Groschupf
Hi, Can you provide a detailed stacktrace from the log file? Stefan On 25.12.2005 at 23:38, AJ Chen wrote: I have seen repeatedly the following severe errors during fetching 400,000 pages with 200 threads. What may cause Host connection pool not found? This type of error must be avoided,

Re: Removing old classes from trunk/

2005-12-23 Thread Stefan Groschupf
It's time to do some cleanup of the trunk/ after the mapred merge. +1

Commons HttpClient 3.0 released

2005-12-22 Thread Stefan Groschupf
Hi, Since we know that our httpclient plugin has some problems, maybe it is sensible to update to the new library. I guess this is some work, but maybe someone is interested in taking the job. :) http://www.theserverside.com/news/thread.tss?thread_id=38189 HttpClient 3.0 provides the following

Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Stefan Groschupf
Lukas, the input folders are normally set by the tools, so you cannot change that. However, in case you use a unix box, check that the user that runs nutch has read and write access to all the folders defined in the nutch-site/default.xml. (I guess that can be the problem, nutch uses e.g.

Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Stefan Groschupf
nutch-0.8-dev which I get from nutch-trunk. Regards, Lukas On 12/21/05, Stefan Groschupf [EMAIL PROTECTED] wrote: Lukas, the input folders are normally set by the tools, so you cannot change that. However, in case you use a unix box, check that the user that runs nutch has read and write access

Re: IndexSorter optimizer

2005-12-21 Thread Stefan Groschupf
Hi Andrzej, wow, these are really great news! Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the junk pages with high tf/idf but low boost. Since we collect up to N hits, going from higher to

Re: Static initializers

2005-12-21 Thread Stefan Groschupf
Andrzej, well, I'm not done digging into the problem yet but want to ask some more questions. BTW I counted 195 places that use NutchConf.get(), so this will be a bigger patch. :) As I mentioned I would love to go the inversion of control way, so not using nutchConf in the constructor

Re: Static initializers

2005-12-20 Thread Stefan Groschupf
Hi, right, this is a known problem that has been discussed several times; we should start solving this. :-) I suggest that we make the Plugin class implement the Configurable interface. In case a plugin needs any configuration value it will request it from the plugin instance. The next step would

[jira] Created: (NUTCH-146) mapred.job.tracker.info.port is defined 2 times in the nutch-default.xml

2005-12-20 Thread Stefan Groschupf (JIRA)
mapred.job.tracker.info.port is defined 2 times in the nutch-default.xml Key: NUTCH-146 URL: http://issues.apache.org/jira/browse/NUTCH-146 Project: Nutch Type: Bug Reporter: Stefan

Re: [bug] overwriting job properties until runtime is not possible

2005-12-20 Thread Stefan Groschupf
to the ContentProperties mechanism. I think using an array list is maybe easier than using properties that are hosted in properties. Stefan On 21.12.2005 at 01:36, Paul Baclace wrote: Stefan Groschupf wrote: My suggestion is that we change NutchConf in the following way: resourceNames.add

Re: Latest version of Mapred

2005-12-19 Thread Stefan Groschupf
mapred is now trunk... On 19.12.2005 at 18:46, Rafi Iz wrote: Hi all, I am currently working with Nutch 0.7.1, I want to start using the mapred, any ideas where I can find the latest version. B.T.W I looked at the path: http://svn.apache.org/repos/asf/lucene/nutch/branches/ but the only

Re: problems http-client

2005-12-19 Thread Stefan Groschupf
at 19:47, Andrzej Bialecki wrote: Stefan Groschupf wrote: Anyway, today we noticed that when fetching with http-client the sum of errors and fetched pages is much less than the size defined when generating the segment. Changing to protocol-http solves the problem. Has anyone also note

Re: [Nutch-dev] distributed search

2005-12-19 Thread Stefan Groschupf
By the way, is there an easy way to split the index I already have? I would hate to recrawl all of the 1.9MM URLs again and waste bandwidth. Well, I do not know of any tool that comes with nutch, or another tool, that does it; maybe there is one. But to write a java class that creates two

[bug] overwriting job properties until runtime is not possible

2005-12-18 Thread Stefan Groschupf
Hi, while writing these tests that made the generation bug reproducible I discovered another strange behavior. The following test fails: public void testConf() throws Exception { NutchConf conf = NutchConf.get(); conf.setInt("mapred.reduce.tasks", 2);

[jira] Commented: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=comments#action_12360658 ] Stefan Groschupf commented on NUTCH-3: -- Thanks. :) multi values of header discarded Key: NUTCH-3 URL: http

[jira] Commented: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=comments#action_12360666 ] Stefan Groschupf commented on NUTCH-3: -- No problem, I can easily change this, but this will affect a lot of code. Just give me some hours. I will do it against the svn since

[jira] Commented: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=comments#action_12360667 ] Stefan Groschupf commented on NUTCH-3: -- ... the idea was to leave the api as it is and just add a new getProperties method. Should we now in general replace setProperty

[jira] Reopened: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Stefan Groschupf reopened NUTCH-3: -- improvement Doug suggested multi values of header discarded Key: NUTCH-3 URL: http

[jira] Updated: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Stefan Groschupf updated NUTCH-3: - Attachment: contentPropertiesAddpatch.txt Better? multi values of header discarded Key: NUTCH-3 URL

[jira] Commented: (NUTCH-143) Improper error numbers returned on exit

2005-12-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-143?page=comments#action_12360571 ] Stefan Groschupf commented on NUTCH-143: Would be great in case you can provide a patch. Improper error numbers returned on exit

Re: [Nutch-dev] distributed seach

2005-12-16 Thread Stefan Groschupf
Hi Ledio, the current nutch version is 0.7, or you can also use the 0.8 branch code. Also, you are using old mailing lists and I suggest you use the apache nutch user mailing list. http://lucene.apache.org/nutch/mailing_lists.html To answer your question, nutch does forward the query to all search

Something is Wrong with Google’s Mathematical Model

2005-12-16 Thread Stefan Groschupf
Hi, I found this link on a news site; maybe some will find this interesting. An Israeli mathematician, Hillel Tal-Ezer, of the Academic College of Tel Aviv in Yaffo has written a paper on the faults of google's mathematical algorithms for page ranking

[jira] Updated: (NUTCH-3) multi values of header discarded

2005-12-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Stefan Groschupf updated NUTCH-3: - Attachment: multiValuesPropertyPatch.txt Attached a patch that adds a getProperties method to the ContentProperties class to receive a string array of values
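The attached patch itself is not reproduced in this listing. As a rough sketch only, a multi-valued property store with a `getProperties` accessor returning a string array could look like the following; the class name `MultiValueProperties`, the `add` method, and the internal layout are assumptions made for illustration and are not taken from the Nutch `ContentProperties` source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a multi-valued property store; not the actual patch.
class MultiValueProperties {

    // Each key maps to every value that was added for it, in insertion order.
    private final Map<String, List<String>> values = new HashMap<>();

    // Append a value instead of overwriting the previous one.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Return all values stored for the key, or an empty array if there are none.
    String[] getProperties(String key) {
        List<String> list = values.get(key);
        return list == null ? new String[0] : list.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MultiValueProperties headers = new MultiValueProperties();
        headers.add("Set-Cookie", "a=1");
        headers.add("Set-Cookie", "b=2");
        System.out.println(headers.getProperties("Set-Cookie").length);
    }
}
```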

[jira] Commented: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping

2005-12-14 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12360409 ] Stefan Groschupf commented on NUTCH-140: From my point of view this makes things more complicated, why not just use the extension id, where would be the advantage

vote for issues to fix in 0.7.2

2005-12-14 Thread Stefan Groschupf
Full list of open issues complete description can be found here : http://issues.apache.org/jira/secure/IssueNavigator.jspa? view=fulltempMax=30 Please add a +1 in case you vote for the issue under this issue. Please keep in mind that this will be more a maintenance release. NUTCH-141

Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Stefan Groschupf
My personal fav. list In a day or so I will count all votes and post them. NUTCH-141 jobdetails.jsp doesnt work on webbrowser safari +1 NUTCH-140 Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping NUTCH-139 Standard metadata property names in

Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Stefan Groschupf
- job.setPartitionerClass(PartitionUrlByHost.class); in the generate method yes, this line is the one you need to change. The other stuff can be as it is for now. Do I only need to change the last line to using HashPartitioner.class, or do I need to modify the other 2 references as well?

Re: Hard-coded Content-type checks

2005-12-13 Thread Stefan Groschupf
If there is no objection, I will commit these changes in the next hours. + 1!!! :-)

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Stefan Groschupf
+1! BTW, did you notice that Jerome committed a patch that makes Content meta data now case insensitive? Stefan On 13.12.2005 at 18:07, Chris Mattmann wrote: Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and thinking about the way that we store

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Stefan Groschupf
This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. Maybe we can also get some more parser selection related issues fixed in the next days and get this into a 0.7.2 release. I would be happy to see some more parser

best file system for NDFS?

2005-12-13 Thread Stefan Groschupf
Hi geeks, I do not have much deep knowledge about unix file systems, so my question is: what would be the best file system for the data nodes of the nutch distributed file system? Does it make any difference using one or the other file system? Would reiserFS be a good choice? Thanks for any

[jira] Created: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

2005-12-12 Thread Stefan Groschupf (JIRA)
Reporter: Stefan Groschupf Priority: Critical We noticed that segments generated with the map reduce segment generator contain only 50 % of the expected urls. We had a crawldb with 40 000 urls and the generate command only created a 20 000 pages segment. This also happened with the topN

[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-10 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Stefan Groschupf updated NUTCH-135: --- Attachment: contentProperties_patch_WithContentProperties.txt I forgot to add the contentproperties itself to the version control... thanks Jack! http

Re: [jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-10 Thread Stefan Groschupf
Jack, sorry there are now 3kb more in the patch :), please give it another try. Stefan On 10.12.2005 at 15:30, Jack Tang wrote: Stefan It seems your patch is missing the org.apache.nutch.protocol.ContentProperties class, right? /Jack On 12/10/05, Stefan Groschupf (JIRA) [EMAIL PROTECTED

Re: nutch questions

2005-12-09 Thread Stefan Groschupf
Ken, maybe the user mailing list would be a better place for such questions. The size of your index depends on your configuration (what kind of index filter plugins you use). You can say a document in the index needs 10KB plus the meta data like date, content type or category of the page.

Re: parse.getData().getMetadata().get(propName) is NULL?

2005-12-09 Thread Stefan Groschupf
Jack, discussed here in detail: http://issues.apache.org/jira/browse/NUTCH-133 I will provide a patch just fixing this issue very soon. Stefan On 09.12.2005 at 20:04, Jack Tang wrote: Hi I am going to standardize some fields which I stored in my parser plugin. But I found that sometimes

[jira] Created: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
: Nutch Type: Bug Components: fetcher Versions: 0.7.1, 0.7 Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev, 0.7.2-dev As described in issue nutch-133, some webservers return http header meta data not standard conform case insensitive. This provides many

[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Stefan Groschupf updated NUTCH-135: --- Attachment: contentProperties_patch.txt As Doug suggested, a patch using a TreeMap with String.CASE_INSENSITIVE_ORDER that solves the problem of case insensitive
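The technique named in this comment can be shown in isolation. The following is a minimal sketch, assuming nothing about the real `ContentProperties` class beyond what the comment says; the class name `CaseInsensitiveHeaders` and the helper `newHeaderMap` are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a case-insensitive header map; not the attached patch.
class CaseInsensitiveHeaders {

    // A TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats keys that
    // differ only in case as the same key.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();
        headers.put("Content-Type", "text/html");
        // Lookups succeed regardless of how the server or the caller spells the name.
        System.out.println(headers.get("content-type"));
        System.out.println(headers.containsKey("CONTENT-TYPE"));
    }
}
```

One consequence worth keeping in mind: putting a second value under a differently cased spelling replaces the first value rather than adding a new entry.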

[jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ] Stefan Groschupf commented on NUTCH-135: Andrzej, that is easy to add to the ContentProperties object and sure I can do that. However first I would love to get an OK

[jira] Assigned: (NUTCH-3) multi values of header discarded

2005-12-09 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Stefan Groschupf reassigned NUTCH-3: Assign To: Stefan Groschupf multi values of header discarded Key: NUTCH-3 URL: http

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359725 ] Stefan Groschupf commented on NUTCH-133: Doug, ok, I will split things in different patches and open a set of new bugs. Jerome: If you take a careful look at my

[jira] Closed: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ] Stefan Groschupf closed NUTCH-133: -- Resolution: Won't Fix We will split the problems described here into a set of bugs to fix things step by step. ParserFactory does not work

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359610 ] Stefan Groschupf commented on NUTCH-133: Jerome: For 3 months or so, url extensions and also magic content type detection have never been used. I suggest to assign

Re: RCP known limitation or bug?

2005-12-07 Thread Stefan Groschupf
)? Stefan On 07.12.2005 at 20:29, Doug Cutting wrote: This should work. TestRPC.java has a case which returns void (ping). Can you send a simple test case that fails? Doug Stefan Groschupf wrote: Hi, I never used the RPC that intensively, so I was surprised to find this limitation

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359627 ] Stefan Groschupf commented on NUTCH-133: Doug, I already attached a unit test that calls ParseUtil.parse(Content) and simulates the different scenarios. I can extend

Re: submitting a patch?

2005-12-06 Thread Stefan Groschupf
Hi, put the patch into jira. Actually, for the most important packages except map reduce, 0.7 and 0.8 are identical, and as far as I know Doug is synchronizing things frequently. Stefan On 06.12.2005 at 17:44, James Nelson wrote: Hello, hope this is the right place to ask this. I'm

RCP known limitation or bug?

2005-12-06 Thread Stefan Groschupf
Hi, I never used the RPC that intensively, so I was surprised to find this limitation. Is it known that the RPC.call method can only call methods that have a return type? RPC.java line 152 Object[] values = (Object[])Array.newInstance(method.getReturnType(), wrappedValues.length);

[jira] Created: (NUTCH-133) ParserFactory does not work as expected

2005-12-06 Thread Stefan Groschupf (JIRA)
ParserFactory does not work as expected --- Key: NUTCH-133 URL: http://issues.apache.org/jira/browse/NUTCH-133 Project: Nutch Type: Bug Versions: 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Stefan Groschupf Priority

[jira] Updated: (NUTCH-133) ParserFactory does not work as expected

2005-12-06 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ] Stefan Groschupf updated NUTCH-133: --- Attachment: Parserutil_test_patch.txt A test that reproduces most problems, see a real world sample url in the conclusion above. ParserFactory does

[jira] Updated: (NUTCH-133) ParserFactory does not work as expected

2005-12-06 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ] Stefan Groschupf updated NUTCH-133: --- Attachment: ParserFactoryPatch_nutch.0.7_patch.txt A patch that solves the described problems for nutch 0.7. MimeTypes detection is now REALLY used

Re: incremental crawling

2005-12-02 Thread Stefan Groschupf
On 02.12.2005 at 10:15, Andrzej Bialecki wrote: Yes, this is required to detect unmodified content. A small note: plain MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages with a counter, or with ads. It would be good to provide a framework for other implementations
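The point about plain MD5 hashing can be illustrated with a small, self-contained sketch. It uses the JDK's `MessageDigest` rather than Nutch's `MD5Hash`, and the digit-stripping normalization is purely an illustrative assumption, not a proposed or existing Nutch implementation; `ContentSignature`, `md5`, and `normalizedMd5` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch; uses the JDK MessageDigest, not Nutch's MD5Hash class.
class ContentSignature {

    // Plain MD5 over the raw content: any changed byte changes the digest.
    static byte[] md5(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            return digest.digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required by the Java platform", e);
        }
    }

    // Illustrative alternative signature: drop digit runs before hashing,
    // so a page that differs only in a visitor counter hashes the same.
    static byte[] normalizedMd5(String content) {
        return md5(content.replaceAll("\\d+", ""));
    }

    public static void main(String[] args) {
        String monday = "Welcome! Visitors: 1041";
        String tuesday = "Welcome! Visitors: 1042";
        System.out.println(Arrays.equals(md5(monday), md5(tuesday)));
        System.out.println(Arrays.equals(normalizedMd5(monday), normalizedMd5(tuesday)));
    }
}
```

A real signature implementation would need a far more careful normalization; the sketch only shows why a pluggable signature is useful.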

Re: NDFS/MapReduce?

2005-12-01 Thread Stefan Groschupf
Check out the latest source from svn, use the branch called mapred. This url gives you a kick start to install a map reduce system on several boxes: http://wiki.media-style.com/display/nutchDocu/setup+a+map+reduce+multi+box+system The 0.8 branch works very well for me, but for sure there are some

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Stefan Groschupf
On 25.11.2005 at 11:30, Erik Hatcher wrote: On 24 Nov 2005, at 23:49, Chris Mattmann wrote: Dublin core may be good for the semantic web, but not for content storage. I completely disagree with that. Me too. Are we talking about parsing rdf, or are we discussing storing parsed html text in rdf

Re: problem with ndfs

2005-11-24 Thread Stefan Groschupf
Sounds like a problem with the hostnames of your datanodes. Check that you are able to ping all the datanodes using the hostnames they sent to the namenode. Run bin/nutch ndfs -report to see the hostnames. Stefan On 24.11.2005 at 16:04, Anton Potehin wrote: When we start namenode

Re: [jira] Created: (NUTCH-128) second configuration nodes overwrites first node

2005-11-24 Thread Stefan Groschupf
definition overwrites the first. So sure, multiple values for one key across multiple files, but we should warn in case a key is defined two times in the same file. Could I clarify my suggestion? Stefan On 24.11.2005 at 18:30, Andrzej Bialecki wrote: Stefan Groschupf (JIRA) wrote: second

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
Jérôme, A mail archive is an amazing source of information, isn't it?! :-) To answer your question, just ask yourself how many pages per second you plan to fetch and parse and how many queries per second a lucene index is able to handle - and you can deliver in the ui. I have here

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-) No, nutch uses java logging; only some plugins use jars that depend on log4j. Stefan

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread Stefan Groschupf
Andrzej, very interesting!!! Nutch Summarizer also needlessly re-tokenizes the text over and over again - perhaps it would be better to save already tokenized text in parse_text, instead of the raw plain text? After all, the only use for that text is to index it and then build the

[jira] Created: (NUTCH-127) uncorrect values using -du, or ls does not return items

2005-11-18 Thread Stefan Groschupf (JIRA)
Reporter: Stefan Groschupf Priority: Blocker The ndfs client returns incorrect values when using du, or ls does not return items. It looks like there is a problem with the virtual file structure, since -du only reads the meta data, doesn't it? We had moved some data from folder to folder and after

[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-11-14 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12357616 ] Stefan Groschupf commented on NUTCH-99: --- SURE! That is absolutely ok for me! Thanks a lot Piotr ports are hardcoded or random

mapper Exceptions

2005-11-13 Thread Stefan Groschupf
Hi Doug, a very small improvement suggestion. Currently the method map in the Mapper interface can throw an IOException. I would find it better if it just threw a general Exception, since a map task can fail for other reasons as well, e.g. in the map search server scenario you

Re: threading versus nio

2005-11-12 Thread Stefan Groschupf
Hi Johannes, right, but suppose you have 200 boxes and each box needs to open 4 different connections to the master. Then the master has 200 * 4 connections = 800 threads = the limit of the 2.4 kernel. In case you open only one connection per box you are also limited to run 800 boxes per

Re: [Nutch Wiki] Update of OverviewDeploymentConfigs by PaulBaclace

2005-11-11 Thread Stefan Groschupf
On 11.11.2005 at 11:48, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by PaulBaclace: http://wiki.apache.org/nutch/OverviewDeploymentConfigs New page: == Overview of

Re: [Nutch Wiki] Update of OverviewDeploymentConfigs by PaulBaclace

2005-11-11 Thread Stefan Groschupf
oops, sorry... Paul, maybe you should mention that these scripts require ssh in a version higher than 3.8. A great page! Stefan On 11.11.2005 at 13:45, Stefan Groschupf wrote: On 11.11.2005 at 11:48, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category

mapSearcher was Re: Index update and Google Dance

2005-11-11 Thread Stefan Groschupf
Hi Doug, In the future I would like to implement a more automated distributed search system than Nutch currently has. One way to do this might be to use MapReduce. Each map task's input could be an index and some segment data. The map method would serve queries, i.e., run a Nutch

[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-11-11 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12357409 ] Stefan Groschupf commented on NUTCH-99: --- I'm not sure what you mean by catching Exception is overkill. In case the attempt to open a server on this port fails

Re: What is suitable environment?

2005-11-10 Thread Stefan Groschupf
Hi, see http://wiki.apache.org/nutch/GettingNutchRunningWithWindows HTH Stefan On 10.11.2005 at 06:44, KAAS INFOTECH wrote: Hi All, I am new to nutch. I have downloaded the latest nutch-0.7.1. I have Microsoft Windows installed on my PC with Java home set. I came to know that Cygwin is require

Re: Problem about method Query#query()

2005-11-10 Thread Stefan Groschupf
Hi, Do you have any query filter installed? Stefan On 10.11.2005 at 09:37, Game Now wrote: Hi all, I pass some strings to org.apache.nutch.searcher.Query#parse() method, but I got different results like below: parameter string: area:XX, returnedQuery.toString() is: area:XX. parameter

Re: Max Per Host and topN

2005-11-10 Thread Stefan Groschupf
+1 On 10.11.2005 at 19:03, Rod Taylor wrote: Generator.java.patch --- company:http://www.media-style.com forum:http://www.text-mining.org blog:http://www.find23.net

Re: [Nutch Wiki] Update of PluginCentral by JakeVanderdray

2005-11-10 Thread Stefan Groschupf
Hi Jake, take a look here http://wiki.media-style.com/display/nutchDocu/Why+nutch+has+a+plugin+system This short text already explains why nutch has a plugin system :) Stefan On 10.11.2005 at 20:04, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on

Re: Index update and Google Dance

2005-11-09 Thread Stefan Groschupf
and three copies of chunks are distributed on the slaves. If slave 1 is 90% busy, and 2 is 80% busy, 3 is idle. How does NFS do in this case? Currently you have to do that manually, but there will be an automatic solution later. Or could you tell me where should I start learning? The

Re: rank system

2005-11-08 Thread Stefan Groschupf
Pre score calculation is done in the indexer. Yes, it works with complete webcrawls as well, and it works very well for that. :-) Stefan On 08.11.2005 at 11:22, Anton Potehin wrote: What about scoring in mapred? I have looked at crawl/crawl.java but I did not find anything concerned with

Re: Index update and Google Dance

2005-11-08 Thread Stefan Groschupf
nutch uses the concept of segments, and yes, you are able to update part of the index by just deleting older segments and generating / fetching new segments. Stefan On 08.11.2005 at 18:38, Jack Tang wrote: Hi I read GFS document and NFS document on the wiki. One interesting question here:

Re: standard version of log4j

2005-11-07 Thread Stefan Groschupf
That is the point of the plugin system: each plugin can have its own libraries and either share them with other plugins or not. Stefan On 07.11.2005 at 16:08, Byron Miller wrote: Is there any way to make sure all plugins/modules reference a standard version of log4j? seems to me there are

Re: mapred bug -- bad part calculation?

2005-11-05 Thread Stefan Groschupf
I tried running one datanode per machine connecting back to the same SAN but it seemed pretty clunky. SAN in general is a bad idea. A SAN is too slow for a serious setup. ... and it is the single point of failure... Better use many local hdd. Stefan

[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-11-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12356853 ] Stefan Groschupf commented on NUTCH-99: --- Is there anything I can improve so that one of the developers commits this patch into the svn? Thanks in case one of the people
