Different encodings?
Larsson85 wrote:
When I did a dump from one of my fetched segments I found out that nutch
doesn't always handle all characters in the right way.
The example I encountered was
http://0862.bizweb.se/Default.aspx
when I look in the nutch dump it looks like this: As you can
Hi,
We once downloaded a very flexible regex from squidguard to ignore most
of the porn URLs.
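For anyone who wants to do the same without that list, such rules usually go into conf/regex-urlfilter.txt; a minimal sketch (the pattern below is only an illustration, not the squidguard list):

# reject URLs that match an obvious adult-content pattern (illustration only)
-.*(porn|xxx).*
# accept everything else
+.

As far as I know the filter takes the first matching rule, so the reject pattern has to come before the final catch-all.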
Matthias
NG-Marketing, M.Schneider wrote:
Hello List,
does any of you have a porn filter to avoid fetching those URLs and thereby
save bandwidth and storage space?
I could do that with
Hi,
is there any URL to see the GUI without installing the Bundle?
Matthias
Hi,
I noticed there is a GeoPosition plugin. Has anyone used this plugin in the US?
I have never heard of anybody. It is used in Germany and maybe Brazil.
Furthermore, has anyone built a two-dimensional search?
The first versions we built of the GeoPosition plugin used a 2D
system. We
Hi,
I believe you do not have to change anything:
http://www.ankertexte.de:8080/umkreisfinder/search.jsp?query=M%C3%BCnchen
Matthias
Maybe we should move the tutorial to the wiki so it can be commented on.
+1
I am sorry if you don't like my opinion or the way it is expressed.
Hi Richard,
most of your opinion, I think, is the same as mine. I have been using nutch since
spring 2004 for our page http://www.umkreisfinder.de
It was a big effort to learn how nutch works, and also a big effort
to learn how
Hi,
I would like to use your plugin in my search system. My question
is how do I implement it for Sweden?
First you need a file with domains and their locations, e.g.:
http://www.stockholm.com 62.123 15.123
...
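To make the format concrete, here is a minimal sketch of reading such a file into memory; the class name and the assumption that the first number is the latitude and the second the longitude are mine, not necessarily the plugin's:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LocationFileReader {
    /** Reads lines of the form "http://host lat lon" into a host-to-coordinates map. */
    public static Map<String, double[]> read(String path) throws IOException {
        Map<String, double[]> locations = new HashMap<String, double[]>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 3) continue;           // skip blank or malformed lines
                double lat = Double.parseDouble(parts[1]); // assumed: latitude first
                double lon = Double.parseDouble(parts[2]); // assumed: longitude second
                locations.put(parts[0], new double[] { lat, lon });
            }
        } finally {
            in.close();
        }
        return locations;
    }
}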
I mean, where do I get the
positioning code from?
There might be databases
Hi all,
we have expanded the GeoPosition Plugin.
The plugin is now able to accept zip codes as the center of a local search.
For example, searching for hotel de:70174 should bring you all the
pages containing the word hotel, sorted by distance to the area of
Stuttgart, Germany. 70174 is one of the zip
In this case, a host is an IP address.
I've thought about this more, and wonder if perhaps this should be
switched so that host names are blocked from simultaneous fetching rather
than IP addresses. I recently spoke with Carlos Castillo, author of the
WIRE crawler
So most other crawlers use the hostname, not the IP. That's good to
know.
Google and Yahoo, yes. About the others I am not sure.
Perhaps a dynamic property would help. If the elapsed time of the
previous request is some fraction of the delay then we might lessen the
delay. Similarly, if it is
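For what it's worth, a rough sketch of that idea in Java; the thresholds, factors and bounds are made up for illustration and are not existing Nutch properties:

public class AdaptiveDelay {
    // Illustrative bounds only, not actual Nutch settings.
    private static final long MIN_DELAY_MS = 1000;
    private static final long MAX_DELAY_MS = 10000;

    /**
     * Adjust the politeness delay based on how long the previous request took:
     * a fast response lets us lessen the delay, a slow one increases it.
     */
    public static long nextDelay(long currentDelayMs, long lastRequestMs) {
        long delay = currentDelayMs;
        if (lastRequestMs < currentDelayMs / 4) {
            delay = currentDelayMs / 2;   // server answered quickly: shorten the delay
        } else if (lastRequestMs > currentDelayMs) {
            delay = currentDelayMs * 2;   // server was slow: back off more
        }
        return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, delay));
    }
}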
You missed my point - I proposed that we change the API. On the surface,
command-line tools would behave as they do now, with the benefit that segment
corruption would be fixed automatically by those tools that require
clean segments - unless _prevented_ by a cmd-line switch. So, this is
just to
Hi Benny,
I could not tell you anything about your failure, but maybe there is
another issue. Did you consider that Lucene uses text comparisons?
So maybe you should not compare 001000 with 20, but always compare strings with the
same length.
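A quick sketch of what I mean; the field width of six digits is just an example, you have to pick one that fits your largest value:

public class PadCompare {
    public static void main(String[] args) {
        // Unpadded: "1000" sorts before "20" in a plain text comparison,
        // which is the wrong order numerically.
        System.out.println("1000".compareTo("20") < 0);   // prints true

        // Padded to the same length, the text order matches the numeric order.
        String a = String.format("%06d", 20);    // "000020"
        String b = String.format("%06d", 1000);  // "001000"
        System.out.println(a.compareTo(b) < 0);  // prints true: 000020 sorts before 001000
    }
}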
Matthias
Benny wrote:
Hi,
I hit a problem when using
In the list of public nutch servers you find the following, which might
be interesting:
http://www.betherebesquare.com/
Matthias
Hi,
the author of this system announced he would like to contribute some of
his modifications. Here is his post to the list from 2005-06-10:
Hello,
I'd like to announce the launch of a new search engine that uses the
Nutch engine.
http://betherebesquare.com is an Event Search Engine for the San
Hi Andrzej,
thanks for your response. I am not really familiar with the Lucene internals.
I am just running nutch with the default parameters on a Debian sarge
system with an ext3 file system, a maximum of 1024 open files, and 1 GB RAM.
So is ext3 a bad file system for millions of files?
I could
You probably don't want to touch indexer.termIndexInterval and
indexer.maxMergeDocs (determines the max size of an individual
segment).
Why is maxMergeDocs 50 by default? Shouldn't this value be much higher?
I found out how to calculate the number of open files,
but how could I calculate the
I have another newbie crawling question. If I am
running fetch, and I interrupt it (kill the process),
is the segment corrupted, or can I restart the fetch
where it left off?
You cannot restart the process.
Next time send a STOP signal to the process instead, so you can resume it later.
You can use the stuff you
I managed to extract URLs from segments which the fetcher failed
to fetch for some reason. I'm now wondering what's the best
way to refetch those URLs. I was first thinking of
creating another db/segments pair, injecting these URLs into the new webdb,
fetching them, and then merging the results back
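Roughly the steps I would try; I am quoting the 0.7-style commands from memory and the directory names are just placeholders, so please check the exact flags against the tutorial for your version:

bin/nutch admin retrydb -create
bin/nutch inject retrydb -urlfile failed-urls.txt
bin/nutch generate retrydb retrysegments
bin/nutch fetch retrysegments/<segment>
bin/nutch updatedb retrydb retrysegments/<segment>

How to merge the results back into the main db is the part I am not sure about.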