Re: Why does nutch only handle åäö sometimes?

2009-06-09 Thread Matthias Jaekle
Different encodings? Larsson85 wrote: When I did a dump from one of my fetched segments I found out that nutch doesn't always handle all characters in the right way. The example I encountered was http://0862.bizweb.se/Default.aspx; when I look in the nutch dump it looks like this: As you can
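
A minimal sketch of how an encoding mismatch can garble åäö. This is only an illustration of the "different encodings" idea; whether it is what happened on the page in question is not known from the thread. The variable names are hypothetical.

```python
# Illustration only: the same bytes interpreted with two different charsets.
text = "åäö"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as ISO-8859-1 turns each character into two wrong ones.
garbled = utf8_bytes.decode("iso-8859-1")

# Reading them with the charset that was actually used restores the text.
restored = utf8_bytes.decode("utf-8")

print(garbled)
print(restored)
```

If some pages declare their charset correctly and others do not, this kind of mismatch would show up only "sometimes".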

Re: Pornfilter

2006-07-12 Thread Matthias Jaekle
Hi, we once downloaded a very flexible regex from squidguard to ignore most of the porn URLs. Matthias NG-Marketing, M.Schneider wrote: Hello List, does anyone of you have a pornfilter not to fetch those URLs and therefore save bandwidth and storage space? I could do that with

Re: GUI

2006-05-04 Thread Matthias Jaekle
Hi, is there any URL to see the GUI without installing the bundle? Matthias

Re: Multi dimensional searches

2006-03-07 Thread Matthias Jaekle
Hi, I noticed there is a GeoPosition plugin. Has anyone used this plugin in the US? Never heard of anybody. It is used in Germany and maybe Brazil. Furthermore, has anyone built a two-dimensional search? The first versions we built of the GeoPosition plugin used a 2D system. We

Re: ö ü ä! German language

2006-03-07 Thread Matthias Jaekle
Hi, I believe you do not have to change anything: http://www.ankertexte.de:8080/umkreisfinder/search.jsp?query=M%C3%BCnchen Matthias
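
For reference, the `M%C3%BCnchen` in the URL above is the UTF-8 percent-encoding of "München". A small sketch using Python's standard `urllib.parse` (the variable names are only for illustration):

```python
from urllib.parse import quote, unquote

# Percent-encode the query term as UTF-8 (the default for quote()).
encoded_utf8 = quote("München")

# The same term encoded as ISO-8859-1 yields a different, shorter escape.
encoded_latin1 = quote("München", encoding="iso-8859-1")

# Decoding the UTF-8 form gives the original term back.
decoded = unquote(encoded_utf8)

print(encoded_utf8, encoded_latin1, decoded)
```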

Re: Moving tutorial link to wiki

2006-03-04 Thread Matthias Jaekle
Maybe we should move the tutorial to the wiki so it can be commented on. +1

Re: project vitality?

2006-03-04 Thread Matthias Jaekle
I am sorry if you don't like my opinion or the way it is expressed. Hi Richard, I think most of your opinion is the same as mine. I have been using nutch since spring 2004 for our page http://www.umkreisfinder.de It was a big effort to learn how nutch works and also a big effort to learn how

Re: New Version of GeoPosition Plugin for local searches released

2005-10-18 Thread Matthias Jaekle
Hi, I would like to integrate your plugin into my search system. My question is how do I implement it for Sweden? First you need a file with domains and their locations, e.g.: http://www.stockholm.com 62.123 15.123 ... I mean where do I get the positioning code from? There might be databases
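
A rough sketch of reading one line of such a domain/location file. It assumes whitespace-separated fields and that the two numbers are latitude then longitude; the message does not confirm either, and `SiteLocation` and `parse_location_line` are hypothetical names, not part of the plugin.

```python
from typing import NamedTuple


class SiteLocation(NamedTuple):
    url: str
    latitude: float   # assumption: first number is the latitude
    longitude: float  # assumption: second number is the longitude


def parse_location_line(line: str) -> SiteLocation:
    """Split one 'url number number' line into its three fields."""
    url, lat, lon = line.split()
    return SiteLocation(url, float(lat), float(lon))


example = parse_location_line("http://www.stockholm.com 62.123 15.123")
print(example)
```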

New Version of GeoPosition Plugin for local searches released

2005-10-14 Thread Matthias Jaekle
Hi all, we have expanded the GeoPosition Plugin. The plugin is now able to accept zip codes as the center of a local search. For example, searching for hotel de:70174 should bring you all the pages containing the word hotel, sorted relative to the area of Stuttgart, Germany. 70174 is one of the zip
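
To make the query shape concrete, here is a hedged sketch of splitting a query such as `hotel de:70174` into search terms and a zip-code center. It assumes the center token is a two-letter country code, a colon, and digits; the plugin's real parsing is not shown in the announcement, and `split_local_query` is a hypothetical helper.

```python
import re

# Assumption: a center token looks like "de:70174" (country code, colon, digits).
ZIP_TOKEN = re.compile(r"^([a-z]{2}):(\d+)$")


def split_local_query(query: str):
    """Return (search terms, (country, zip) or None) for a local-search query."""
    terms = []
    center = None
    for token in query.split():
        match = ZIP_TOKEN.match(token)
        if match and center is None:
            center = (match.group(1), match.group(2))
        else:
            terms.append(token)
    return terms, center


terms, center = split_local_query("hotel de:70174")
print(terms, center)
```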

Re: [nutch] - http.max.delays: retry later issue?

2005-09-21 Thread Matthias Jaekle
In this case, a host is an IP address. I've thought about this more, and wonder if perhaps this should be switched so that host names are blocked from simultaneous fetching rather than IP addresses. I recently spoke with Carlos Castillo, author of the WIRE crawler

Re: [nutch] - http.max.delays: retry later issue?

2005-09-21 Thread Matthias Jaekle
So most other crawlers use the hostname, not the IP. That's good to know. Google and Yahoo, yes. About the others I am not sure. Perhaps a dynamic property would help. If the elapsed time of the previous request is some fraction of the delay then we might lessen the delay. Similarly, if it is

Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

2005-09-20 Thread Matthias Jaekle
You missed my point - I proposed that we change the API. On the surface, command-line tools would behave like now, with the benefit that segment corruption would be fixed automatically by those tools that require clean segments - unless _prevented_ by a cmd-line switch. So, this is just to

Re: RangQuery problem.

2005-09-03 Thread Matthias Jaekle
Hi Benny, I cannot tell you anything about your failure, but maybe there is another one. Did you consider that Lucene uses text comparisons? So maybe you should always compare 001000 with 20. Strings with the same length. Matthias Benny wrote: Hi, I hit a problem when using
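
The point about text comparisons can be illustrated without Lucene at all: plain string ordering puts "1000" before "20", while zero-padded strings of equal length sort in numeric order. A small sketch, where `pad` is a hypothetical helper and the width of 6 is an arbitrary choice:

```python
def pad(number: int, width: int = 6) -> str:
    """Zero-pad a non-negative integer so all values have the same length."""
    return str(number).zfill(width)


values = [1000, 20, 300]

# Compared as plain text, "1000" sorts before "20" because "1" < "2".
as_text = sorted(str(v) for v in values)

# Compared as equal-length padded text, the order matches the numeric order.
as_padded = sorted(pad(v) for v in values)

print(as_text)
print(as_padded)
```

The same padding would have to be applied both when indexing the field and when building the range bounds.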

Re: Information extraction

2005-07-26 Thread Matthias Jaekle
In the list of public nutch servers you find the following, which might be interesting: http://www.betherebesquare.com/ Matthias

Re: Information extraction

2005-07-26 Thread Matthias Jaekle
Hi, the author of this system announced he would like to contribute some of his modifications. Here is his post to the list from 2005-06-10: Hello, I'd like to announce the launch of a new search engine that uses the Nutch engine. http://betherebesquare.com is an Event Search Engine for the San

Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle
Hi Andrzej, thanks for your response. I am not really familiar with the Lucene internals. I am just running nutch with the default parameters on a Debian sarge system with ext3 file system, maximum 1024 files opened, and 1 GB RAM. So is ext3 a bad file system for millions of files? I could

Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle
You probably don't want to touch indexer.termIndexInterval and indexer.maxMergeDocs (determines the max size of an individual segment). Why is maxMergeDocs 50 by default? Shouldn't this value be much higher? I found how to calculate the number of opened files. But how could I calculate the

Re: Crawling question

2005-07-11 Thread Matthias Jaekle
I have another newbie crawling question. If I am running fetch, and I interrupt it (kill the process), is the segment corrupted, or can I restart the fetch where it left off? You cannot restart the process. Next time send a STOP to the process, so you can restart it. You can use the stuff you

Re: Question about injecting, generating fetch segment and refetching

2005-07-11 Thread Matthias Jaekle
I managed to extract URLs from segments which the fetcher failed to fetch for some reason. I'm now thinking what's the best way to refetch those URLs again? I was first thinking of creating another db/segments pair, injecting these URLs into the new webdb, fetching them and then merging the results back