Re: new configuration proposal in nutch-site.xml (maximum url length)

2006-08-25 Thread Stefan Groschupf
I think that is the property for the anchor text length, not the length of a URL. On 25.08.2006 at 04:28, Lourival Júnior wrote: Try this one: <property> <name>db.max.anchor.length</name> <value>800</value> <description>The maximum number of characters permitted in an anchor.</description>

Re: Problem with logging of Fetcher output in 0.8-dev

2006-08-23 Thread Stefan Groschupf
I don't know if Chris Schneider's patch for HADOOP-406 will prove to be the long-term solution, but it certainly works for me. If you like, please vote for this issue! I also use it in several projects and wonder why it is not yet part of hadoop. Thanks. Stefan

Re: 0.8 much slower than 0.7

2006-07-31 Thread Stefan Groschupf
Hi, I have some code using a queue-based mechanism and java nio. In my tests it is 4 times faster than the existing fetcher. But: + I need to fix some more bugs + we need to refactor the robots.txt part since it is not usable outside the http protocol yet. + the fetcher does not support plug

Re: 0.8 much slower than 0.7

2006-07-31 Thread Stefan Groschupf
Check: http://issues.apache.org/jira/browse/NUTCH-233 and let us know if it helps. Stefan On 31.07.2006 at 07:46, Matthew Holt wrote: Fetcher for one, and the mapreduce takes forever... I.e. the mapreduce is kind of annoying... is it possible to disable it if I'm not running on a DFS? Matt

Volunteers requested for Web Spam Classification

2006-07-16 Thread Stefan Groschupf
Dear Nutch Users, web spam is a serious issue for nutch as well, but at the moment we know only a little about the problem and how to work around it. Please invest some time to help the research community by building a collection for future research work. Details below. Thank you.

Re: Extending scoring plugin

2006-07-13 Thread Stefan Groschupf
I'm only a moderately experienced java programmer, so I was hoping I could get a few pointers about where to begin on a particular problem. I want to increase the score of a search result if the title contains the search query and the page is from a particular site. Take a look at the

Re: Eclipse IDE

2006-07-11 Thread Stefan Groschupf
http://find23.net/Web-Site/blog/66A7676A-8C9C-4A93-8B59-A6A100EF8C1B.html You may need to update that to the latest sources. On 11.07.2006 at 15:29, Matthew Holt wrote: Can someone that has Nutch development configured for Eclipse please paste their .project and .classpath files?

Re: Index algorithm

2006-07-10 Thread Stefan Groschupf
Hi, nutch uses lucene, so you will find this interesting: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html Besides that, nutch uses a kind of OPIC: http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/scoring/opic/OPICScoringFilter.html Also

Re: why I can't crawl all the linked pages in the specified page to crawl.

2006-07-07 Thread Stefan Groschupf
Hi, maybe you can try a much higher depth, something like 20? However, in general check: + the regex url filter file + the robots.txt + nofollow tags in the pages + the number of outlinks to extract in nutch-default.xml (see the example below) Stefan On 06.07.2006, at 19:12, kevin pang wrote: i set up the nutch
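For the last point, the property Stefan presumably means is db.max.outlinks.per.page from the 0.8-era nutch-default.xml (default 100); raising it would look roughly like this in conf/nutch-site.xml (the value is only an example):

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>500</value>
      <description>The maximum number of outlinks that will be processed for a page
      (example value; the shipped default is lower).</description>
    </property>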

Re: Link db (traversal + modification)

2006-07-06 Thread Stefan Groschupf
Hi Otis, the link graph lives in the linkdb. I suggest writing a small map reduce tool that reads the existing linkdb, filters out the pages you want to remove, and writes the result back to disk. This will be just a couple of lines of code. The hadoop package comes with some nice map reduce examples.
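A rough sketch of such a tool against the 0.8-era Hadoop "mapred" API (class and method names shifted between releases, the linkdb key/value types are assumed to be UTF8/Inlinks as in that line, and the host test is only a placeholder):

    // Hypothetical map-only job that copies the linkdb, dropping entries for one host.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.*;
    import org.apache.nutch.crawl.Inlinks;

    public class LinkDbFilterTool implements Mapper {

      public void configure(JobConf job) {}
      public void close() {}

      // Keep every entry whose URL does not belong to the unwanted host.
      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter) throws IOException {
        if (key.toString().indexOf("unwanted-host.example") == -1) {
          output.collect(key, value);
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf job = new JobConf(new Configuration(), LinkDbFilterTool.class);
        job.setJobName("linkdb-filter");
        job.addInputPath(new Path(args[0]));        // existing linkdb part files
        job.setOutputPath(new Path(args[1]));       // filtered copy
        job.setInputFormat(SequenceFileInputFormat.class);
        job.setOutputFormat(MapFileOutputFormat.class);
        job.setMapperClass(LinkDbFilterTool.class);
        job.setOutputKeyClass(UTF8.class);
        job.setOutputValueClass(Inlinks.class);
        JobClient.runJob(job);
      }
    }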

Re: Alternatives

2006-07-05 Thread Stefan Groschupf
Hi, It would be nice to use the features of Nutch instead of my own hacky stuff. How bound is Nutch to the J2EE container? Would it be a big job to make it run on an alternative GUI? Or is the container used for more than the GUI? I.e. do all services (crawler, etc.) run within the container? Do

Re: Input and Output Value Class Types

2006-06-29 Thread Stefan Groschupf
Hi, maybe have a look at the nutch indexer; it uses a kind of wrapper, maybe this can help you. Also please browse the hadoop developer list archive since there was some related discussion. HTH Stefan On 29.06.2006 at 14:41, Dennis Kubes wrote: All, Is there a way to get around having to

Re: Input and Output Value Class Types

2006-06-29 Thread Stefan Groschupf
Kubes wrote: The indexer uses an ObjectWritable and I am using that trick. Problem is I need to input an ObjectWritable but output a different object. I will take a look at the hadoop list. Dennis Stefan Groschupf wrote: Hi, maybe have a look at the nutch indexer; it uses a kind of wrapper

Re: Large Scale Searching

2006-06-12 Thread Stefan Groschupf
On 12.06.2006 at 19:46, Dennis Kubes wrote: Is anyone doing large scale searching and if so what kind of architecture is good? I have a 25G index now (merged) and the searches are failing due to memory constraints. Is it better to have multiple smaller indexes across machines? Yes.

Re: parsing and using xml-data

2006-06-08 Thread Stefan Groschupf
Hi Karsten, nutch has the limitation of one URL, one document (in the crawldb or index). The content and metadata for this document are normally available 'behind' the URL. The only exception is the anchor text. Anchor texts are data from the mother URL that are passed and indexed within the child

Re: Removing or reindexing a URL?

2006-06-08 Thread Stefan Groschupf
Just recrawl and reindex every day. That was the simple answer. The more complex answer is that you need to write custom code that deletes documents from your index and crawldb. If you do not want to completely learn the internals of nutch, just recrawl and reindex. :) Stefan On 06.06.2006 at 19:42

Re: Jprofiler compile options

2006-06-08 Thread Stefan Groschupf
Do you use java 1.4 or 1.5? In general have a look at the hadoop code base: TaskRunner.java, line 145. Stefan On 05.06.2006 at 10:51, Murat Ali Bayir wrote: Hi everybody, I have a problem running Jprofiler on the remote side. I am using DFS and submitting a crawl job. I configure the LD library

Re: [Moved from Nutch-Dev] Re: how to turn on logging, excersize analyzer, tips on debugging plugins?

2006-06-08 Thread Stefan Groschupf
I just found this http://wiki.media-style.com/display/nutchDocu/use+eclipse+to+debug+nutch It's from Dec. 2005, so I am not sure if it will still work. It still works, you only need to add more plugins. :) Stefan

Re: Removing or reindexing a URL?

2006-06-08 Thread Stefan Groschupf
desperately want to be able to give Nutch a list of documents. Ben On 6/8/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Just recrawl and reindex every day. That was the simple answer. The more complex answer is that you need to write custom code that deletes documents from your index and crawldb

Re: new Plugins in 0.7

2006-06-06 Thread Stefan Groschupf
If the extension point was already available in 0.7 - yes. On 06.06.2006 at 11:07, Peter Swoboda wrote: hi, Is it possible to integrate 0.8 plugins (ms powerpoint..) in nutch 0.7? Thanx Peter -- The GMX SmartSurfer helps you save up to 70% of your online costs! Ideal for modem

Re: Image Search

2006-06-03 Thread Stefan Groschupf
the hadoop map reduce job. We could even contribute this back and base a small tutorial on this work. What do you think? Rgrds, Thomas On 6/2/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, using search http is a bad idea, since you get many but not all pages. Just write a hadoop map

Re: Re[2]: Image Search

2006-06-03 Thread Stefan Groschupf
? Rgrds, Thomas On 6/2/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, using search http is a bad idea, since you get many but not all pages. Just write a hadoop map reduce job that processes the fetched content in your segments; that should be easy. Storing images in a file system will be very slow

Re: getting exact number of matches

2006-05-30 Thread Stefan Groschupf
Hi, why not dedup your complete index beforehand instead of at runtime? There is a dedup tool for that. Stefan On 29.05.2006 at 21:20, Stefan Neufeind wrote: Hi Eugen, what I've found (and if I'm right) is that the page calculation is done in Lucene. As it is quite expensive (time-consuming)

Re: Re-parsing document

2006-05-30 Thread Stefan Groschupf
You can just delete the parse output folders and start the parsing tool. Parsing a given page again only makes sense for debugging reasons since the hadoop IO system cannot update entries. If you need to debug I suggest writing yourself a junit test. HTH Stefan On 29.05.2006 at 01:01, Stefan

Re: BBS Crawl Possible?

2006-05-30 Thread Stefan Groschupf
Hi, did you check the regex url filter? By default dynamic URLs are not allowed: all URLs containing a question mark are excluded (see the filter snippet below). If you configure your url filter properly you should be able to fetch your dynamic pages. Stefan On 27.05.2006 at 05:30, Jackey Yang wrote: Hey Guys,
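For reference, the default url filter file (conf/crawl-urlfilter.txt or conf/regex-urlfilter.txt, depending on version) contains a rule along these lines; removing the '?' from the character class lets query-string URLs through:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # to allow dynamic pages, use something like this instead (still skipping the rest):
    -[*!@=]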

Re: Nutch meeting 2006 -San Francisco

2006-05-30 Thread Stefan Groschupf
Great pictures. :) Thanks for bringing a camera with you; it was a really nice event from my point of view. We should repeat that. :-) Cheers, Stefan On 23.05.2006 at 02:14, Michael Plax wrote: Hello, It was great to see everybody. You can find photos from the meeting on flickr

Re: nutch compressing huge content data

2006-05-30 Thread Stefan Groschupf
Hi Fabian, wow, nutch 0.6 is really old school.. :-) However, the simplest thing you can do is just write a class that reads the data from a segment (parsed text and data) and writes it into an index of your own. Should be simple if you know how to write into a lucene index. HTH Stefan On

Re: Multiple indexes on a single server instance.

2006-05-30 Thread Stefan Groschupf
I'm not sure what you are planning to do, but you can just switch a symbolic link on your hdd, driven by a cronjob, to switch between indexes at a given time. Maybe you need to touch the web.xml to restart the searcher. If you try to search in different kinds of indexes at the same time, I

Re: getting exact number of matches

2006-05-30 Thread Stefan Groschupf
general (there are variables like dedupField etc.). Regards, Stefan Stefan Groschupf wrote: Hi, why not dedup your complete index beforehand instead of at runtime? There is a dedup tool for that. Stefan On 29.05.2006 at 21:20, Stefan Neufeind wrote: Hi Eugen, what I've found (and if I'm right

Re: [Nutch-general] RE: new location! nutch user meeting San Francisco

2006-05-16 Thread Stefan Groschupf
Hi, no agenda; see: http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1 Stefan

new location! nutch user meeting San Francisco

2006-05-11 Thread Stefan Groschupf
Hi there, since there is such big interest in the nutch user meeting, we decided to move to another location. We will now meet at: Rite-Spot Cafe (415) 552-6066 2099 Folsom St San Francisco, CA 94110 It's in a good location for parking too, and it's even reachable by public transport -- 2 blocks

Extending Nutch talk, May 11th, Palo Alto, CA

2006-05-09 Thread Stefan Groschupf
Hi Nutch Users, Doug already mentioned it on the developers list (thanks!), but for those of you who do not subscribe to the developer list... The next CommerceNet Thursday Tech Talk will be about Extending Nutch. I'll present a few slides about the plugin system and meta data 'flow' in

Re: GUI

2006-05-04 Thread Stefan Groschupf
I tried to upload some screenshots to the jira but wasn't able to do so. :( But installing it means downloading it, decompressing it and starting bin/nutch gui /aFolder.. Stefan On 04.05.2006 at 10:07, Jérôme Charron wrote: is there any url to see the gui without installing the bundle? This

Admin Gui beta test (was Re: ATB: Heritrix)

2006-04-28 Thread Stefan Groschupf
Hi there, since building the gui is somehow complicated I was thinking about providing a ready-to-use binary. This maybe would help to get some more beta testers, which we are currently looking for. Any thoughts? However, I am afraid that this would hit my server too hard and I would have to pay for

yes, a European nutch meeting is also planned :)

2006-04-21 Thread Stefan Groschupf
Hi Sami, Hi Dawid, Hi All, yes, if there are enough people interested I would love to get a European user meeting organized as well. A nice time would be the Wizards of Open Source conference this year in September. http://wizards-of-os.org/index.php?id=36L=3 If people are interested to

nutch user meeting in San Francisco: May 18th

2006-04-20 Thread Stefan Groschupf
(with apologies for multiple postings) Dear Nutch users, Dear Nutch developers, Dear Hadoop developers, we would love to invite you to the Nutch user meeting in San Francisco. Date: Thursday, May 18th, 2006 Time: 7 PM. Location: Cafe Du Soleil, 200 Fillmore St, San Francisco, CA 94117.

Re: Saving Metadata to Mysql

2006-04-12 Thread Stefan Groschupf
Depends what you are planning to do; nutch 0.8 supports meta data that is very flexible (key-value tuples) and fast. Also you can store information in parseData.getMetaData; this will be available up to indexing as well. On 12.04.2006 at 04:31, sudhendra seshachala wrote: Sorry to just

Re: details: stackoverflow error

2006-04-12 Thread Stefan Groschupf
Doug Cutting wrote: Perhaps we could enhance the logic of the loop at Fetcher.java: 320. Currently this exits the fetcher when all threads exceed a timeout. Instead it could kill any thread that exceeds the timeout, and restart a new thread to replace it. So instead of just keeping a

Re: Nutch administration web interface?

2006-04-11 Thread Stefan Groschupf
... a beta will be available soon. On 11.04.2006 at 22:22, Rida Benjelloun wrote: Hi Robert, You can see this page http://wiki.apache.org/nutch/NutchAdministrationUserInterface. But I don't have any idea about the progress of this project. Best regards. On 4/10/06, Robert Douglass

Re: Nutch administration web interface?

2006-04-11 Thread Stefan Groschupf
Just 0.8. On 11.04.2006 at 23:08, carmmello wrote: Will this interface also cope with Nutch 0.7 or just the new 0.8? - Original Message - From: Stefan Groschupf [EMAIL PROTECTED] style.com To: nutch-user@lucene.apache.org Sent: Tuesday, April 11, 2006 5:53 PM Subject: Re: Nutch

Re: details: stackoverflow error

2006-04-07 Thread Stefan Groschupf
I already suggested adding a kind of timeout mechanism here and had done this for my installation; however, the patch suggestion was rejected since it was a 'non reproducible' problem. :-/ On 07.04.2006 at 21:55, Rajesh Munavalli wrote: Hi Piotr, Thanks for the help. I think I

Re: details: stackoverflow error

2006-04-07 Thread Stefan Groschupf
On 07.04.2006 at 22:13, Jérôme Charron wrote: I already suggested adding a kind of timeout mechanism here and had done this for my installation; however, the patch suggestion was rejected since it was a 'non reproducible' problem. Stefan, do you refer to NUTCH-233? No:

Re: extension point: org.apache.nutch.parse.Parser does not exist.

2006-03-10 Thread Stefan Groschupf
Hi, the extension point plugin needs to be included in the plugin includes as well. Please note that nutch-site.xml does not extend parameters but overwrites them, and it is not a good idea to have just the parser plugins installed; at least you need one protocol plugin, a query filter and an index filter as well (see the example below). Stefan
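A typical plugin.includes override in conf/nutch-site.xml, shown here with roughly the 0.8 default set (remember that this value replaces, rather than extends, the one in nutch-default.xml):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
      <description>Regular expression naming the plugin directories to include.
      Keep at least one protocol, parse, index and query plugin.</description>
    </property>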

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread Stefan Groschupf
Better to use nutch 0.8 to run a crawl using several machines. There is some documentation in the wiki now. On 08.03.2006 at 17:49, Olive g wrote: Hi, I am new here. Could someone please let me know the step-by-step instructions to set up a distributed crawl in 0.7.1? Thank you.

Re: why TOTAL urls: 1

2006-03-08 Thread Stefan Groschupf
I guess yahoo.com has a robots.txt that blocks crawling the complete site. Also check the depth level you use. On 08.03.2006 at 17:53, Olive g wrote: Hello everyone, I am also running a distributed crawl on 0.8.0 (some dev version) and somehow the stats always return TOTAL urls as 1 while I

Re: how to search data on DSF (0.8)

2006-03-08 Thread Stefan Groschupf
please! From: Stefan Groschupf [EMAIL PROTECTED] Reply-To: nutch-user@lucene.apache.org To: nutch-user@lucene.apache.org Subject: Re: how to search data on DSF (0.8) Date: Wed, 8 Mar 2006 18:46:25 +0100

Re: how to search data on DSF (0.8)

2006-03-08 Thread Stefan Groschupf
I just have no login and IP for the box any more. In case you send me a login, IP and the path where the sources are, I can have someone take a look tomorrow. Stefan On 08.03.2006 at 19:03, Stefan Groschupf wrote: Storing the index on a dfs works; just change the conf to use dfs in nutch.war

Re: Offline search (Vicaya 0.1)

2006-03-06 Thread Stefan Groschupf
Hi, storing the index on the hdd would be a good idea. Take a look at the NutchBean init method to get an idea of what you need to change. Should be simple: just allow providing a location for the index that is different from the segments folder. Stefan On 06.03.2006 at 12:53

Re: project vitality?

2006-03-06 Thread Stefan Groschupf
Hi Thomas, for this crawl setup we have a test environment of nutch 0.8, 10x AMDs, a custom linux build, 100Mbit eth1, 1Gb eth0; each box has a 'caching' dns server. Stefan On 06.03.2006 at 15:59, TDLN wrote: Stefan, I know people having a 500 million page index and I personally run crawls

Re: Normal search speeds

2006-03-05 Thread Stefan Groschupf
This is very slow! You can expect results in less than a second in my experience. + check the memory settings of tomcat. + you do not use ndfs, right? On 06.03.2006 at 00:23, Insurance Squared Inc. wrote: Asking again for the patience of the list, we're still working on speed. I guess what I

Re: NullPointerException

2006-03-05 Thread Stefan Groschupf
Hi, http or www are very good test queries. Double check that the nutch-default.xml inside the nutch.war points to the correct folder in <name>searcher.dir</name> (see the example below). Stefan On 06.03.2006 at 02:31, Hasan Diwan wrote: I've followed the nutch tutorial for crawling and started tomcat from the
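The property Stefan refers to looks like this; in the deployed webapp it needs to point at the directory that holds your index and segments (the path below is only an example):

    <property>
      <name>searcher.dir</name>
      <value>/home/nutch/crawl</value>
      <description>Path to the crawl directory (containing index, segments, etc.)
      that the search webapp should use.</description>
    </property>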

Re: NullPointerException

2006-03-05 Thread Stefan Groschupf
If none are being fetched, something is definitely wrong with your filter or url file. Yes, since it is a blog it may have dynamic pages like foo.com?entry=23; these are definitely filtered by default. - blog: http://www.find23.org company:

nutch 0.7.0 search performance measurement

2006-03-05 Thread Stefan Groschupf
is not with nutch, but instead with something at the OS or tomcat level, or with another system process that nutch is using). Stefan Groschupf wrote: This is very slow! You can expect results in less than a second in my experience. + check the memory settings of tomcat. + you do not use ndfs

Re: project vitality?

2006-03-04 Thread Stefan Groschupf
Hi Richard, I told you I was more than willing to help, and I think many users feel the same way, but I for one feel that there is a lack of documentation and support. This isn't meant to offend anyone; if you are offended you need to toughen up your skin a little bit. Here you can find

Re: how can i go deep?

2006-03-04 Thread Stefan Groschupf
The crawl command creates a crawldb for each call. So, as Richard mentioned, try a higher depth. In case you would like nutch to go deeper with each iteration, try the whole-web tutorial but change the url filter so that it only crawls your website (see the filter line below). This will go as deep as the number of iterations
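Restricting the crawl to one site is done in the url filter file with an include rule of roughly this shape (the domain is a placeholder; everything else is then excluded by the final '-.' rule):

    # accept hosts in the target domain only (placeholder domain)
    +^http://([a-z0-9]*\.)*mysite\.example/

    # skip everything else
    -.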

Re: project vitality?

2006-03-04 Thread Stefan Groschupf
Maybe we should organize ourselves a little bit better on this point. What do you think? Just a general note: jira has voting functionality. This allows everybody to vote for an issue and can show in a very compressed way what the community is looking for. However it is not used that often

Re: Hadoop MapReduce: using NFS as the filesystem

2006-02-27 Thread Stefan Groschupf
Jon, there is also a hadoop user mailing list. It is not clear to me what you are planning to do, but in general hadoop's tasktrackers and jobtrackers require running with dfs switched on. What you can do is write a map task that reads from the local disk you mentioned, but you will no

Re: question to stefan

2006-02-24 Thread Stefan Groschupf
I had noticed that you work for a German company. Is it possible to get some nutch support from you or your company? Sure. Please note that here you can find a list of all people providing support: http://wiki.apache.org/nutch/Support I have some problems getting nutch running the way I want. If

Re: Nutch and HTTrack Crawler

2006-02-23 Thread Stefan Groschupf
On 23.02.2006 at 01:55, sudhendra seshachala wrote: Is there a way I could use HTTrack for crawling and nutch just for searching? Has anybody done this before and a comparison between crawlers? I suggest taking a look at lucene, since I guess it is more work to change nutch to your

Re: retrieve data from index file

2006-02-23 Thread Stefan Groschupf
23.02.2006 at 02:37, Wong Ting Kiong wrote: hi, is there any example of java code with which I can read the data from the index file in segments? I have tried segmentReader, ArrayfileReader, and SequenceReader and feel confused. Thanks. Wong On 1/24/06, Stefan Groschupf [EMAIL PROTECTED] wrote: you can

Re: Admin GUI

2006-02-23 Thread Stefan Groschupf
Hi Daniel, thanks, we are still working on it. Actually we have to finish something behind the scenes and then we will publish a kind of plugin extension point that will allow other people to contribute. Thanks for the offer; maybe the only thing you can do is vote for this issue since this

Re: Nutch on Windows

2006-02-23 Thread Stefan Groschupf
P.S. Now finally I could test nutch...:) Puhh, that was a pain! :-) Welcome!

Re: Nutch on Windows

2006-02-23 Thread Stefan Groschupf
Puhh, that was a pain! :-) Welcome! Oops, I hit the send button too fast. :-/ Before people misunderstand that: 'welcome' means 'welcome to nutch'. 'Welcome' in German means in any case 'someone is welcome to something', sorry.

Re: Manage severals NutchConf in one webapp

2006-02-23 Thread Stefan Groschupf
This should be possible with the latest version, nutch 0.8; you may need to build from sources. There NutchConf is not static anymore and you can pass it down the stack. Besides that, you may need to store the NutchBean not as a context attribute but in a hashmap that is stored as a context attribute.

Re: Stop Indexing

2006-02-23 Thread Stefan Groschupf
No. On 22.02.2006 at 22:29, Saravanaraj Duraisamy wrote: Hi, in nutch 0.7.1 is there a way to stop indexing without corrupting the index files in the middle of indexing??? thanks d.saravanaraj - blog: http://www.find23.org company:

Re: Intranet search - some questions

2006-02-23 Thread Stefan Groschupf
Hi, - Is there any way to perform form based authentication? I know that this is a common request but I haven’t found a “good-enough” answer to it. The only references I’ve found are about basic auth, which I’d prefer to avoid. I ask this because I’ve noticed that SearchBlox,

Re: Nutch on Windows

2006-02-23 Thread Stefan Groschupf
Don't worry, I understood what you meant :) Sorry, my English is too often just terrible; I'm trying to improve it, I feel people too often misunderstand me. Anyway, I guess and hope my java is much better. :-) But what is the reason for this kind of problem? Why is nutch not capable of select

Re: Nutch 0.8 version required..

2006-02-23 Thread Stefan Groschupf
http://cvs.apache.org/dist/lucene/nutch/nightly/ On 24.02.2006 at 01:44, sudhendra seshachala wrote: The latest version I could see in the SVN is 0.7.1. Where can I get 0.8? Source code is even better. Could I just grab it from the nightly builds? Please let me know.. Thanks Sudhi

Re: switch off caching

2006-02-22 Thread Stefan Groschupf
Hi, maybe this is what you are looking for: <property> <name>fetcher.store.content</name> <value>true</value> <description>If true, fetcher will store content.</description> </property> On 22.02.2006 at 15:21, Martin Gutbrod wrote: Hi, I'd like to use nutch to index a large number of pdf files in a

Re: out of memory error

2006-02-22 Thread Stefan Groschupf
Have you done some customizing, e.g. storing the NutchBean more than once? I personally never had such problems. How many segments / indexes do you have? On 22.02.2006 at 15:21, Insurance Squared Inc. wrote: We're getting an out of memory error when running a search using nutch 0.71 on a

Re: Why Perl5 regular expressions?

2006-02-22 Thread Stefan Groschupf
I guess it is a historical reason. I remember a discussion about replacing it but don't remember the details; maybe you can find something in the mail archive (developer list). On 22.02.2006 at 16:09, Elwin wrote: Why does the url filter of nutch use Perl5 regular expressions? Any benefits? --

Re: does nutch support wild card

2006-02-22 Thread Stefan Groschupf
Does the NutchAnalyser support wildcard queries, ? and * (characters)? I don't think so. What are the modifications needed to support this? A set of things like the Nutch QueryParser, Nutch Query object, basic Query filter etc.

Re: does nutch support wild card

2006-02-22 Thread Stefan Groschupf
? or am I missing something here? Rgds Prabhu On 2/22/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Does the NutchAnalyser support wildcard queries, ? and * (characters)? I don't think so. What are the modifications needed to support this? A set of things like the Nutch QueryParser, Nutch Query

Re: Out of Memory while fetching

2006-02-18 Thread Stefan Groschupf
didn't create the segment file myself. It was created via nutch generate. Please let me know what you mean by 'you have one key two times'. Best regards, Keren Stefan Groschupf [EMAIL PROTECTED] wrote: I'm not sure the key problem is the real source of the problem. In general I suggest

Re: Link problems with Nutch Web-GUI

2006-02-18 Thread Stefan Groschupf
Maybe this is your problem? Entities.encode(url) On 17.02.2006 at 15:13, Fankhauser, Alain wrote: Hello, I use Nutch 0.8-dev and I'm trying to index a local file system. After indexing I start tomcat and search. If I do this, I find the expected results but the links aren't correct. It's

Re: search inside lucene-fields

2006-02-18 Thread Stefan Groschupf
This depends on the query filter plugins you are using. As far as I know only the score of a document increases if the word occurs in the title, but there is no title query filter. However, writing your own is very easy; check the query-site plugin (a sketch follows below). Stefan On 17.02.2006 at 16:36, Nutch
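A minimal sketch of such a plugin class, modeled on the field query filter base classes shipped with Nutch (the package, class name, boost and plugin.xml wiring are assumptions, the constructor details may differ slightly by version, and it presumes a "title" field is already in the index):

    // Hypothetical query filter that lets users type title:foo and searches the
    // indexed title field with an extra boost. Register it in the plugin's
    // plugin.xml under the extension point org.apache.nutch.searcher.QueryFilter,
    // declaring the handled field the same way the shipped query plugins do.
    package org.example.nutch.query.title;

    import org.apache.nutch.searcher.FieldQueryFilter;

    public class TitleQueryFilter extends FieldQueryFilter {
      public TitleQueryFilter() {
        super("title", 2.0f);   // field name and boost
      }
    }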

Re: Out of Memory while fetching

2006-02-16 Thread Stefan Groschupf
I'm not sure the key problem is the real source of the problem. In general I suggest using nutch 0.8, which fixes a set of issues. E.g. it writes syncs to the files and creates checksums, since people had problems with hdds. At least a nutch map file requires ordered keys, and in your

Re: impossible situation error

2006-02-16 Thread Stefan Groschupf
Can you provide a full stack trace, maybe by setting the log level higher? I guess this is a problem of the new OPIC score calculation, but you are the first to report such a problem. In general - sorry to repeat that so often - it is a good idea to run the latest nightly builds of an open source

Re: clustering carrot plugin

2006-02-16 Thread Stefan Groschupf
No, it is a plugin that runs at search time. Find some more documentation here: http://www.carrot2.org/website/xml/index.xml On 13.02.2006 at 20:02, Raghavendra Prabhu wrote: Hi, what is the exact use of the clustering carrot plugin? Say I have java the coffee as well as java the language, will

Re: segments prune

2006-02-16 Thread Stefan Groschupf
You can remove a document from the index, which is at least the storage that makes sense to manipulate. You can also block a url in general from coming into the segment by using a url filter. Stefan On 13.02.2006 at 18:18, Raghavendra Prabhu wrote: Hi, is there a way, if we give a url, can

Re: file parser

2006-02-16 Thread Stefan Groschupf
You can easily add new file formats by writing new content-type parser plugins. Just browse the code of one of the existing parsers, like pdf or the new swf parser, to get an idea of what you need to do. In the end you only need to write a parser for the content and return some values (see the plugin.xml sketch below). ... and
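The plugin.xml that registers such a parser looks roughly like this (plugin id, class and content type are placeholders; compare with the plugin.xml of parse-pdf or parse-text in your source tree):

    <plugin id="parse-foo" name="Foo Parse Plug-in" version="1.0.0" provider-name="example.org">
       <runtime>
          <library name="parse-foo.jar">
             <export name="*"/>
          </library>
       </runtime>
       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>
       <extension id="org.example.parse.foo" name="FooParse"
                  point="org.apache.nutch.parse.Parser">
          <implementation id="org.example.parse.foo.FooParser"
                          class="org.example.parse.foo.FooParser">
             <parameter name="contentType" value="application/x-foo"/>
             <parameter name="pathSuffix" value="foo"/>
          </implementation>
       </extension>
    </plugin>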

Re: Using search refining with carrot2?

2006-02-16 Thread Stefan Groschupf
Hey, there is an interesting discussion on the lucene mailing list about a similar topic: http://www.gossamer-threads.com/lists/lucene/java-user/32629 I'm not sure if Dawid has subscribed to the nutch user list as well, so maybe you can catch him in the carrot mailing list. Stefan On

Re: Injecting into existing DB

2006-02-16 Thread Stefan Groschupf
I normally use a simple trick in such situations. I create a new empty db, inject the urls, create my segment and fetch the segment. Then I inject the urls a second time into my original db and update the db with the segment. Stefan On 12.02.2006 at 18:11, Chris Schneider wrote: Nutch

Re: nutch configuration

2006-02-10 Thread Stefan Groschupf
In the apache hadoop configuration; this is inside the hadoop jar (an example override is shown below). On 10.02.2006 at 17:33, carmmello wrote: After some time, I downloaded the latest nightly version of Nutch (2006-02-10). Going through nutch-default.xml I could not find the fs.default or mapred.reduce properties anymore.
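The defaults live in the hadoop-default.xml packaged in the hadoop jar and are typically overridden via a hadoop-site.xml in the conf directory; a minimal single-box example (host names, ports and task counts are placeholders):

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
      <property>
        <name>mapred.map.tasks</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>2</value>
      </property>
    </configuration>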

Re: How to add only new urls to DB

2006-02-08 Thread Stefan Groschupf
Hi Scott, yes, this makes sense. I would also create a temp web db, create the segment, and crawl the segment. If you don't want to add the pages below the new urls then just index the segment and add this segment to the other searchable segments; do not update the db. In general, if you

Re: Installing nutch

2006-02-06 Thread Stefan Groschupf
Bernd Fehling wrote: Hi list, I came across nutch while looking for search engines. Nutch with its NDFS is very interesting to me. A basic question: is it possible to install nutch with NDFS on a single machine or do I need at least two machines? I followed the instructions from Stefan Groschupf

Re: Installing nutch

2006-02-06 Thread Stefan Groschupf
Stefan Groschupf wrote: Hi, running ndfs on a single box installation does not make much sense, except if you plan to use it for research. However it is possible to run a namenode and a datanode on the same box; also you can run several datanodes on the same box. Regarding your second question

Re: Asp pages again

2006-02-06 Thread Stefan Groschupf
I guess this is more a question of the configuration than of the version. In any case I suggest using the latest nightly build, since - well - this is an active open source project. :-) Carefully check your url regex; also check what your webserver returns as content type, there is a known

Re: How should I call to the class Injector from hadoop/trunk

2006-02-06 Thread Stefan Groschupf
The hadoop.jar is in nutch/lib. So you may better run under nutch as it was before; also, some minutes ago Doug moved some scripts back to nutch/bin, so as far as I know it should work as before. On 05.02.2006 at 20:40, Rafit Izhak_Ratzin wrote: Hi, I updated my environment to the newest

Re: How deep to go

2006-02-06 Thread Stefan Groschupf
Instead of using the crawl command I personally prefer the manual commands. Then I use a small script that runs http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling in a never-ending loop where I wait for a day for each iteration. This will make sure that you have all links that

Re: Exact Match Query?

2006-02-06 Thread Stefan Groschupf
It is more a question of query filters than query parsing. You can somehow remove the standard behavior and add your own query filter plugin. Just take a look at the query-basic and query-more query filter plugins. On 04.02.2006 at 16:04, Albert Chern wrote: Hello, I want to search for a

Re: sockettimeout exception

2006-02-05 Thread Stefan Groschupf
Is the host available in your web browser? Does this host block your IP because it interprets nutch as a DOS attack? Is your bandwidth limited? On 05.02.2006 at 18:17, Raghavendra Prabhu wrote: Hi, I am running a crawl using protocol-httpclient. I get a java.io.IOException:

Re: sockettimeout exception

2006-02-05 Thread Stefan Groschupf
On 2/5/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Is the host available in your web browser? Does this host block your IP because it interprets nutch as a DOS attack? Is your bandwidth limited? On 05.02.2006 at 18:17, Raghavendra Prabhu wrote: Hi, I am running a crawl using protocol

Re: sockettimeout exception

2006-02-05 Thread Stefan Groschupf
not be due to protocol-http, but is there a chance that this may also be due to the same reason? Thanks for the answer. Rgds Prabhu On 2/5/06, Stefan Groschupf [EMAIL PROTECTED] wrote: I personally prefer protocol-http. On 05.02.2006 at 18:26, Raghavendra Prabhu wrote: Hi Stefan, My

Re: Hosting segments in NDFS

2006-02-04 Thread Stefan Groschupf
Yes, I had already done this once, but it does not conform to the API any more; when porting ndfs to hadoop is done I may be able to bring things in line with the api again and provide a patch. However there is a list of other issues on my todo list already, so it will not happen in the next days. Stefan On

Re: crawler

2006-02-03 Thread Stefan Groschupf
Check the regex url filter! Your page contains symbols that are filtered. On 03.02.2006 at 14:46, [EMAIL PROTECTED] wrote: Hello, I have problems indexing a particular internet site: http://www.gildemeister.com Nutch only fetches 14 pages but not the complete site. I'm using the default

Re: crawler

2006-02-03 Thread Stefan Groschupf
There is already a javascript parser; you only need to switch it on (add parse-js to plugin.includes). On 03.02.2006 at 15:55, mos wrote: The problem at www.gildemeister.com is the use of JavaScript for link generation. That's the reason why nutch can't find the other pages (the links are invisible). Two ideas: - You need

Re: takes too long to remove a page from WEBDB

2006-02-03 Thread Stefan Groschupf
And it also makes no sense, since it will come back as soon as the link is found on a page. Use a url filter instead and remove it from the index. Removing it from the webdb makes no sense. On 03.02.2006 at 21:27, Keren Yu wrote: Hi everyone, It took about 10 minutes to remove a page from WEBDB

Re: Problem with plugins

2006-01-31 Thread Stefan Groschupf
What happens if you change your query string to: String query = "quiewId:a3d32ce0cae0da47677f30cc6182d421 HTTP"; (adding space http)? Does this return any hits? Stefan On 31.01.2006 at 12:05, Enrico Triolo wrote: Hi all, I developed a couple of plugins to add and search a custom field. The

Re: adding meta to domain

2006-01-31 Thread Stefan Groschupf
Meta data support is actually under development and will come soon. See jira for the latest discussion. In any case you can already write an index filter plugin; see the cool fresh wiki documentation for that (and the sketch below). On 31.01.2006 at 23:25, Sunnyvale Fl wrote: I need to add some meta data to the index
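A minimal sketch of an index filter against the released 0.8 API (package, class, field name and the metadata key are assumptions; in the 0.8 line the url key was a UTF8 writable, later versions use Text, and the metadata accessor names differ slightly between versions):

    // Hypothetical indexing filter that copies one value from the parse metadata
    // into a Lucene field. Register it in plugin.xml under the extension point
    // org.apache.nutch.indexer.IndexingFilter.
    package org.example.nutch.index.meta;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.UTF8;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class MetaIndexingFilter implements IndexingFilter {

      private Configuration conf;

      public Document filter(Document doc, Parse parse, UTF8 url,
                             CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        // "x-my-meta" is a placeholder key set earlier, e.g. during parsing;
        // getMeta is the 0.8 convenience accessor over the parse/content metadata.
        String value = parse.getData().getMeta("x-my-meta");
        if (value != null) {
          doc.add(new Field("mymeta", value, Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        return doc;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }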
