problem with skipped urls

2006-06-21 Thread david . wojciechowski
hi,
I'm trying to run Nutch at our clinical center and I have a little
problem. We have a few intranet servers, and I want Nutch to skip a few
directories.
For example:

http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/

I put these URLs into crawl-urlfilter.txt, for example:

-^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus

But nothing happens: Nutch doesn't skip these URLs, and I don't know why...

:( Can anyone help me?

I'm crawling with this command:
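As a quick sanity check (plain Python, outside Nutch, purely illustrative), the regex part of that rule does match the URL in question, so the pattern itself looks fine and the problem is likely elsewhere, e.g. filter configuration or rule ordering:

```python
import re

# The leading "-" in crawl-urlfilter.txt just marks a deny rule;
# the remainder of the line is the regex itself.
pattern = r"^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus"
url = "http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/"

# The dots in the host part are unescaped (they match any character),
# but the URL still matches the pattern.
print(re.match(pattern, url) is not None)  # True
```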

bin/nutch crawl urls -dir crawl060621 -depth 15  crawl060621.log 

I'm using release 0.7.1.
Greetings, David

==

David Wojciechowski
Universitätsklinikum Freiburg
Klinikrechenzentrum
Agnesenstrasse 6-8
D-79106 Freiburg

Phone: 0761 / 270 - 1842
Fax: 0761 / 270 - 2276
E-Mail: [EMAIL PROTECTED]

==



Re: stemming

2006-06-21 Thread bb300


Thanks!

To Jérôme:

> Check that these words are not in the stopword list of your analyzer.

Those words aren't in the stopword list. It couldn't find them at all.
When I disable stemming (the index is the same) it can find those words
(of course, it finds only the form of the words that appears in the
queries).

> No: it only highlights the main form in the summaries. It is a known
> problem.

For me the main thing is not the highlighting. The summaries contain
only documents with the main form of the words. For example, without
stemming I find about 450 documents with different forms of the word
(counting only simple forms, where only one or two letters change at
the end of the root). With stemming enabled, it finds about 120
documents, which contain only the main form of the word.


About analysis-xx: should I make any changes in the trunk version or not
(I mean in the code, as described in the wiki on the MultiLingual
support page)?


---

Regards

Alexey



Re: problem with skipped urls

2006-06-21 Thread Jayant Kumar Gandhi

You can also stop Nutch from crawling those pages by modifying
robots.txt, if you have set Nutch to respect those rules; by default it
will respect them. If you haven't modified the setting
http.robots.agents in nutch-default.xml/nutch-site.xml, the following
robots.txt rule should work:

User-agent: NutchCVS
Disallow: /abteilung/pvs/dokus/
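To verify the rule, Python's standard urllib.robotparser implements the same basic robots.txt semantics; this quick check (outside Nutch, purely illustrative) confirms the rule above blocks the directory for a NutchCVS agent:

```python
from urllib.robotparser import RobotFileParser

# Parse the same two-line robots.txt rule suggested above.
rp = RobotFileParser()
rp.parse([
    "User-agent: NutchCVS",
    "Disallow: /abteilung/pvs/dokus/",
])

# can_fetch() returns False for URLs under the disallowed directory.
allowed = rp.can_fetch(
    "NutchCVS",
    "http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/",
)
print(allowed)  # False
```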

Cheers,
Jayant

On 6/21/06, Stefan Neufeind [EMAIL PROTECTED] wrote:

> [EMAIL PROTECTED] wrote:
> > hi,
> > I'm trying to run Nutch at our clinical center and I have a little
> > problem. We have a few intranet servers, and I want Nutch to skip a
> > few directories.
> > For example:
> >
> > http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/
> >
> > I put these URLs into crawl-urlfilter.txt, for example:
> >
> > -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
> >
> > But nothing happens: Nutch doesn't skip these URLs, and I don't know why...
> >
> > :( Can anyone help me?
> >
> > I'm crawling with this command:
> >
> > bin/nutch crawl urls -dir crawl060621 -depth 15  crawl060621.log
> >
> > I'm using release 0.7.1

> Hi David,
>
> do you have regex-urlfilter in your crawler-site config file or
> nutch-site config file? I suspect that the plugin might not be loaded
> yet. Also, do you maybe have another "allow all URLs" line above the
> one you mentioned?
> I don't think the ([a-z0-9]*\.)* should lead to problems (it is * and
> not +, so I guess that should be fine). But if your URL never has
> anything in front of sapdoku, maybe try dropping that part.
>
> Good luck,
>  Stefan
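Rule ordering is worth checking here: crawl-urlfilter.txt applies the first rule whose regex matches, so a broad accept line above the deny line will swallow these URLs before the deny rule is ever consulted. A sketch of the intended ordering (the accept rule below is hypothetical, standing in for whatever site-wide "+" line the file already contains):

```
# Deny rules must come BEFORE any broader accept rule,
# because the first matching rule wins.
-^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus

# Hypothetical site-wide accept -- must stay below the deny line:
+^http://([a-z0-9]*\.)*ukl.uni-freiburg.de/
```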




--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi | +91-9871412929
M.Tech. Computer Tech. Class of 2007,
D-38, Aravali Hostel, IIT Delhi,
Hauz Khas, Delhi-110016


Re: Does nutch allow an advanced search?

2006-06-21 Thread Scott McCammon
The index-more plugin indexes each document's last-modified date, which
is searchable via a range like: date:20060521-20060621. Note that a date
search does not work by itself; at least one keyword or phrase is
required.


Scott

John john wrote:

> Hello
>
> I'm new to the nutch world and I'm wondering whether it's possible to
> search with a date range, or to specify a date and have nutch retrieve
> pages updated after this date?
>
> Thanks


Re: Does nutch allow an advanced search?

2006-06-21 Thread Stefan Neufeind
Scott McCammon wrote:
> The index-more plugin indexes each document's last-modified date,
> which is searchable via a range like: date:20060521-20060621. Note
> that a date search does not work by itself; at least one keyword or
> phrase is required.

Hi Scott,

requiring a keyword/phrase has been mentioned in several places before.
Is there a technical reason for it, or could that limitation maybe be
removed (and should we file a JIRA issue for that)?


Regards,
 Stefan

> John john wrote:
> > Hello
> >
> > I'm new to the nutch world and I'm wondering whether it's possible
> > to search with a date range, or to specify a date and have nutch
> > retrieve pages updated after this date?
> >
> > Thanks


NEWBIE help: java.lang.IllegalAccessError

2006-06-21 Thread Mike Blackstock

Hi folks.

I installed Nutch these past days, and I kept getting blank results
pages at http://localhost:8080/search.jsp... Tomcat was properly
installed, I think, so I looked further and found that every time I run
a crawl, I get this toward the end:


060621 125637 indexing segment: crawl.test/segments/20060620163405
Exception in thread "main" java.lang.IllegalAccessError: tried to access
field org.apache.lucene.index.IndexWriter.mergeFactor from class
org.apache.nutch.indexer.IndexSegment
    at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:102)
    at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:263)

It's crawling OK, but not indexing the segment. I copied the
lucene-core-2.0.0.jar file to the current directory from which I'm
running bin/nutch (just in case it wasn't finding the
lucene.index.IndexWriter classes), but no luck.

Anyone seen this before or have any ideas? I hope so.

Cheers and thanks,
Mike

Re: NEWBIE help: java.lang.IllegalAccessError

2006-06-21 Thread Mike Blackstock

Mike Blackstock wrote:

> Exception in thread "main" java.lang.IllegalAccessError: tried to
> access field org.apache.lucene.index.IndexWriter.mergeFactor from
> class org.apache.nutch.indexer.IndexSegment


I'm using Nutch 0.7.2; I spent the better part of yesterday searching
the net for possible answers before subscribing here, but no luck.

Cheers,
Mike



using a test web site

2006-06-21 Thread ndemir

Hi,

I am trying to develop a web crawler for my master's thesis, and I need
a dummy test site to test it. If I had the tree structure of the test
web site, I could compare its list of pages with the output of my
crawler after running it.


Can anybody help me?

Thanks,
Nildem


This message was sent using IMP, the Internet Messaging Program.



Re: stemming

2006-06-21 Thread bb300


To Jérôme:

> Check that these words are not in the stopword list of your analyzer.

Actually, I could not find the stopwords file. Could you help me with
this? I am sure that words such as "mission", "sea", "ocean",
"building", "electricity", etc. couldn't be in a stopwords file. (In my
previous question I meant the Carrot stopword file, because I can't
find Lucene's stopword files.)
---

Regards

Alexey



Re: stemming

2006-06-21 Thread Jérôme Charron

> When I disable stemming (the index is the same) it can find those
> words (of course, it finds only the form of the words that appears in
> the queries).

Just a silly question: do you build your index with the analyzers turned
on? (Was the document's language correctly guessed and the corresponding
analyzer called?)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Re: stemming

2006-06-21 Thread Jérôme Charron

> Actually, I could not find the stopwords file. Could you help me with
> this?

If you have simply wrapped one of Lucene's analyzers (like the fr and
de analyzers), the default stop word list is inside the analyzer code
(take a look at the analyzer source).

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


RE: stemming

2006-06-21 Thread Teruhiko Kurosaka
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>
> Actually, I could not find the stopwords file. Could you help me with
> this? I am sure that words such as "mission", "sea", "ocean",
> "building", "electricity", etc. couldn't be in a stopwords file. (In
> my previous question I meant the Carrot stopword file, because I
> can't find Lucene's stopword files.)

The current implementations of the language Analyzers use
the default constructors of the analyzers of the same name in 
the Lucene package.  When instantiated this way,
the analyzers use the hard-coded stop word lists.  For German,
the stop words are:

  private String[] GERMAN_STOP_WORDS = {
    "einer", "eine", "eines", "einem", "einen",
    "der", "die", "das", "dass", "daß",
    "du", "er", "sie", "es",
    "was", "wer", "wie", "wir",
    "und", "oder", "ohne", "mit",
    "am", "im", "in", "aus", "auf",
    "ist", "sein", "war", "wird",
    "ihr", "ihre", "ihres",
    "als", "für", "von", "mit",
    "dich", "dir", "mich", "mir",
    "mein", "sein", "kein",
    "durch", "wegen", "wird"
  };
// From src/java/org/apache/lucene/analysis/de/GermanAnalyzer.java
// of Lucene 1.4.3 distribution.  This could be slightly out of date.

You'd have to modify the source code in
src/plugin/analysis-de/src/java/org/apache/nutch/analysis/de/GermanAnalyzer.java
to use the constructor that takes a word list or the file name of a
word list, I think.
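For a feel of what a hard-coded stop list does in practice, here is a language-neutral sketch (plain Python, purely illustrative, not Nutch or Lucene code) of the filtering step a stop filter performs with a list like the one above:

```python
# Illustrative only: mimic a Lucene-style stop filter with a subset
# of the hard-coded German stop word list shown above.
GERMAN_STOP_WORDS = {
    "einer", "eine", "eines", "einem", "einen",
    "der", "die", "das", "dass", "daß",
    "und", "oder", "ohne", "mit",
}

def strip_stop_words(tokens):
    # Drop any token found in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in GERMAN_STOP_WORDS]

print(strip_stop_words(["die", "elektrizität", "und", "das", "meer"]))
# ['elektrizität', 'meer']
```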


> when I use the trunk version, should I change some code as shown in
> the wiki on the MultiLingual support page? Because, as I understand
> it, everything in the trunk version has been done so that stemming
> plugins integrate without code changes.

I believe Jérôme has implemented these code changes in the trunk.

-kuro


Re: Add Wyona to the wiki support page?

2006-06-21 Thread Andrzej Bialecki

Renaud Richardet wrote:

> Hello Nutch,
>
> My name is Renaud Richardet and I am the COO of Wyona LLC. We are
> offering Nutch and Lucene support (http://wyona.com/lucene.html), and
> I was wondering if I could add our company to
> http://wiki.apache.org/nutch/Support. That would be great.


Certainly, you can add a short note about your company on the support 
page. It's a Wiki, so you can just create an account, log in, and edit 
this page (please use the preview button to check the changes before 
saving).


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Add Wyona to the wiki support page?

2006-06-21 Thread Insurance Squared Inc.
The funny thing about that wiki page (and some others in that area) is
that it apparently uses nofollow tags. Given the topic of that wiki,
isn't that a bit odd? I personally dislike the nofollow tag and think
it should be used only in extreme circumstances (i.e. "here's a link to
a site you absolutely don't want to visit"). I believe in this case,
however, it's simply being used so that the sites listed don't get any
pagerank/weight/whatever passed to them from an authority site. A
really bizarre policy for a search-related site, IMO.


Swinging back on topic, does nutch obey the nofollow tags?

g.




Andrzej Bialecki wrote:

> Renaud Richardet wrote:
>
> > Hello Nutch,
> >
> > My name is Renaud Richardet and I am the COO of Wyona LLC. We are
> > offering Nutch and Lucene support (http://wyona.com/lucene.html), and
> > I was wondering if I could add our company to
> > http://wiki.apache.org/nutch/Support. That would be great.
>
> Certainly, you can add a short note about your company on the support
> page. It's a Wiki, so you can just create an account, log in, and edit
> this page (please use the preview button to check the changes before
> saving).




Re: Add Wyona to the wiki support page?

2006-06-21 Thread Andrzej Bialecki

Insurance Squared Inc. wrote:
> The funny thing about that wiki page (and some others in that area) is
> that it apparently uses nofollow tags. Given the topic of that wiki,
> isn't that a bit odd? I personally dislike the nofollow tag and think
> it should be used only in extreme circumstances (i.e. "here's a link
> to a site you absolutely don't want to visit"). I believe in this
> case, however, it's simply being used so that the sites listed don't
> get any pagerank/weight/whatever passed to them from an authority
> site. A really bizarre policy for a search-related site, IMO.




I think it's a default setting for the Wiki, which nobody bothered to 
change...



> Swinging back on topic, does nutch obey nofollow tags?


Yes. Please see the HtmlParser and HTMLMetaTags classes for details.
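For a rough picture of the page-level variant of nofollow that HTMLMetaTags deals with, here is a minimal sketch (plain Python, not Nutch's actual parser) that detects a robots meta directive:

```python
from html.parser import HTMLParser

# Minimal sketch: flag a page whose <meta name="robots"> content
# contains the "nofollow" directive.
class NofollowDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            if "nofollow" in a.get("content", "").lower():
                self.nofollow = True

d = NofollowDetector()
d.feed('<html><head><meta name="robots" content="noindex,nofollow"></head></html>')
print(d.nofollow)  # True
```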

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Add Wyona to the wiki support page?

2006-06-21 Thread Insurance Squared Inc.
Well, so much for knee-jerk suspicions as to intent. No need to look for
conspiracy theories when default settings are the more likely cause.
That should probably be a corollary to Occam's razor or something :).



Andrzej Bialecki wrote:

> Insurance Squared Inc. wrote:
>
> > The funny thing about that wiki page (and some others in that area)
> > is that it apparently uses nofollow tags. Given the topic of that
> > wiki, isn't that a bit odd? I personally dislike the nofollow tag
> > and think it should be used only in extreme circumstances (i.e.
> > "here's a link to a site you absolutely don't want to visit"). I
> > believe in this case, however, it's simply being used so that the
> > sites listed don't get any pagerank/weight/whatever passed to them
> > from an authority site. A really bizarre policy for a search-related
> > site, IMO.
>
> I think it's a default setting for the Wiki, which nobody bothered to
> change...
>
> > Swinging back on topic, does nutch obey nofollow tags?
>
> Yes. Please see the HtmlParser and HTMLMetaTags classes for details.