Re: Use and configuration of RegexUrlNormalize

2006-11-03 Thread Stefan Neufeind
Javier P. L. wrote:
> Hi, 
> 
> I am using Nutch for crawling news sites, and I have a problem with one of
> them that publishes its urls with &amp; instead of &. I discovered the
> use of the url normalizer and the regex-normalize.xml configuration
> file. Unfortunately I did not find many examples of how to use
> the regular expressions and substitutions, so I was trying different
> combinations to make the transformation, but it did not work.
> 
> Basically what I want is to convert
> 
> noticia.jsp?CAT=126&amp;TEXTO=10109668
> 
> into
> 
> noticia.jsp?CAT=126&TEXTO=10109668
> 
> because otherwise Nutch is not able to crawl those pages.

A more basic question first: how do your URLs with &amp; end up in
Nutch? It's okay/correct that they are written as &amp; in the
HTML source to be "clean", but shouldn't Nutch itself already convert
them back to & when storing/fetching the URLs?
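
(If you do want to handle it via regex-normalize.xml anyway, a rule along
these lines should do it - untested, so please check the element names
against the regex-normalize.xml shipped with your Nutch version. Note the
double escaping: the pattern has to match the literal string &amp; inside
the URL, and both pattern and substitution additionally have to be
XML-escaped inside the config file:)

   <regex>
     <pattern>&amp;amp;</pattern>
     <substitution>&amp;</substitution>
   </regex>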


Regards,
 Stefan


Re: I can not query myplugin in field category:test

2006-10-13 Thread Stefan Neufeind
Please do share it. I'd appreciate it, and I guess a lot of others as
well. And I bet it could even be enhanced by the community. :-)


Regards,
 Stefan

Ernesto De Santis wrote:
> I did a url-category-indexer.
> 
> It works with a .properties file that maps urls, written as regexps, to
> categories.
> example:
> 
> http://www.misite.com/videos/.*=videos
> 
> If it seems useful, I can share it.
> 
> Maybe it would be better to configure it in an .xml file.
> 
> Regards,
> Ernesto.
> 
> Stefan Neufeind wrote:
>> Alvaro Cabrerizo wrote:
>>  
>>> Have you included a node to describe your new searcher filter into
>>> plugin.xml?
>>>
>>> 2006/10/11, xu nutch <[EMAIL PROTECTED]>:
>>>
>>>> I have a question about myplugin for indexfilter and queryfilter.
>>>> Can you help me!
>>>> -
>>>> In MoreIndexingFilter.java I add:
>>>> doc.add(new Field("category", "test", false, true, false));
>>>> -
>>>>
>>>> --
>>>>
>>>>
>>>> package org.apache.nutch.searcher.more;
>>>>
>>>> import org.apache.nutch.searcher.RawFieldQueryFilter;
>>>>
>>>> /** Handles "category:" query clauses, causing them to search the
>>>> field indexed by
>>>>  * BasicIndexingFilter. */
>>>> public class CategoryQueryFilter extends RawFieldQueryFilter {
>>>>  public CategoryQueryFilter() {
>>>>    super("category");
>>>>  }
>>>> }
>>>> ---
>>>> ---
>>>>
>>>> <property>
>>>>  <name>plugin.includes</name>
>>>>  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>>  include.  Any plugin not matching this expression is excluded.
>>>>  In any case you need at least include the nutch-extensionpoints plugin. By
>>>>  default Nutch includes crawling just HTML and plain text via HTTP,
>>>>  and basic indexing and search plugins.
>>>>  </description>
>>>> </property>
>>>>
>>>> <property>
>>>>  <name>plugin.includes</name>
>>>>  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>>>>  <description>Regular expression naming plugin directory names to
>>>>  include.  Any plugin not matching this expression is excluded.
>>>>  In any case you need at least include the nutch-extensionpoints plugin. By
>>>>  default Nutch includes crawling just HTML and plain text via HTTP,
>>>>  and basic indexing and search plugins.
>>>>  </description>
>>>> </property>
>>>> ---
>>>>
>>>> Using Luke to query "category:test" works fine,
>>>> but when I query "category:test" through the Tomcat website,
>>>> no results are returned.
>>>>   
>>
>> In case you get the search working:
>> How do you plan to categorize URLs/sites? I'm looking for a solution
>> there, since I didn't yet manage to implement something
>> URL-prefix-filter based to map categories to URLs or so.
>>
>>
>> Regards,
>>  Stefan


Re: I can not query myplugin in field category:test

2006-10-13 Thread Stefan Neufeind
Alvaro Cabrerizo wrote:
> Have you included a node to describe your new searcher filter into
> plugin.xml?
> 
> 2006/10/11, xu nutch <[EMAIL PROTECTED]>:
>> I have a question about myplugin for indexfilter and queryfilter.
>> Can you help me!
>> -
>> In MoreIndexingFilter.java I add:
>> doc.add(new Field("category", "test", false, true, false));
>> -
>>
>> --
>>
>>
>> package org.apache.nutch.searcher.more;
>>
>> import org.apache.nutch.searcher.RawFieldQueryFilter;
>>
>> /** Handles "category:" query clauses, causing them to search the
>> field indexed by
>>  * BasicIndexingFilter. */
>> public class CategoryQueryFilter extends RawFieldQueryFilter {
>>  public CategoryQueryFilter() {
>>    super("category");
>>  }
>> }
>> ---
>> ---
>>
>> <property>
>>  <name>plugin.includes</name>
>>  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>>  <description>Regular expression naming plugin directory names to
>>  include.  Any plugin not matching this expression is excluded.
>>  In any case you need at least include the nutch-extensionpoints plugin. By
>>  default Nutch includes crawling just HTML and plain text via HTTP,
>>  and basic indexing and search plugins.
>>  </description>
>> </property>
>>
>> <property>
>>  <name>plugin.includes</name>
>>  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
>>  <description>Regular expression naming plugin directory names to
>>  include.  Any plugin not matching this expression is excluded.
>>  In any case you need at least include the nutch-extensionpoints plugin. By
>>  default Nutch includes crawling just HTML and plain text via HTTP,
>>  and basic indexing and search plugins.
>>  </description>
>> </property>
>> ---
>>
>> Using Luke to query "category:test" works fine,
>> but when I query "category:test" through the Tomcat website,
>> no results are returned.
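
Regarding Alvaro's question: the "node" he refers to is a QueryFilter
extension in the plugin.xml of the plugin that contains your class
(query-more in your case, judging from the package name). An untested
sketch, modelled on the query-site descriptor - please double-check the
attribute and parameter names against that plugin's own plugin.xml:

   <extension id="org.apache.nutch.searcher.category"
              name="Nutch Category Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="CategoryQueryFilter"
                      class="org.apache.nutch.searcher.more.CategoryQueryFilter">
         <parameter name="fields" value="category"/>
      </implementation>
   </extension>

Without such a declaration the query filter is never asked to handle the
"category:" clause, which would explain why Luke sees the field but the
web search returns nothing.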

In case you get the search working:
How do you plan to categorize URLs/sites? I'm looking for a solution
there, since I didn't yet manage to implement something
URL-prefix-filter based to map categories to URLs or so.


Regards,
 Stefan


Re: Lucene query support in Nutch

2006-10-10 Thread Stefan Neufeind
Cristina Belderrain wrote:
> On 10/9/06, Tomi NA <[EMAIL PROTECTED]> wrote:
> 
>> This is *exactly* what I was thinking. Like Stefan, I believe the
>> nutch analyzer is a good foundation and should therefore be extended
>> to support the "or" operator, and possibly additional capabilities
>> when the need arises.
>>
>> t.n.a.
> 
> Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
> which does exactly what you want, is already there?

From what I understand so far in this thread, the Nutch
analyzer/query layer seems to be more targeted and provides additional
features for distributed search, as well as possibly some speed
improvements due to its nature. (Correct me if I'm wrong.)

One idea that has come up was to offer both as alternatives, so you could
use Lucene-based queries if you need their features on the one hand, but
can live with the restrictions on the other.

However, from what has been mentioned so far it seems that Lucene
queries by default can only run against the document content (is that
right?), not e.g. site:www.example.org. Hmm ...


PS: Thank you all for the help offered so far in this thread on how to
get Lucene queries going. Unfortunately I couldn't make much use of
"just simply extend it here and there ..." :-(


Regards,
 Stefan


Re: Lucene query support in Nutch

2006-10-07 Thread Stefan Neufeind
Björn Wilmsmann wrote:
> 
> Am 07.10.2006 um 17:40 schrieb Cristina Belderrain:
> 
>> Let me remind you that all this must be done just to provide something
>> that's already there: Nutch is built on top of Lucene, after all. If
>> it's hard to understand why Lucene's capabilities were simply
>> neutralized in Nutch, it's even harder to figure out why no choice was
>> left to users by means of some configuration file.
> 
> I think this issue is rooted in the underlying philosophy of Nutch:
> Nutch was designed with the idea of a possible Google(and the
> likes)-sized crawler and indexer in mind. Regular expressions and
> wildcard queries do not seem to fit into this philosophy, as such
> queries would be way less efficient on a huge data set than simple
> boolean queries.
> 
> Nevertheless, I agree that there should be an option to choose the
> Lucene query engine instead of the Nutch flavour one because Nutch has
> been proven to be equally suitable for areas which do not require as
> efficient queries (like intranet crawling for instance) as an all-out
> web indexing application.

Hi,

if it's not the full feature set, maybe most people could live with
that. But basic boolean queries were, I think, the root of this topic.
Is there an "easier" way to allow those in Nutch as well, instead of
throwing quite a bit away and using the Lucene syntax? As has just been
pointed out, it seems quite a few things need to be "changed" to use a
Lucene search instead of a Nutch search. I don't think that's needed in
most cases, but I see several cases where a boolean query would make sense.

(Currently I fetch up to 10,000 or so results using OpenSearch and
filter them in a script myself, since no "AND (site:... OR site:...)" is
yet possible.)


Regards,
 Stefan


Re: Lucene query support in Nutch

2006-10-04 Thread Stefan Neufeind
Hi,

yes, I guess having the full strength of Lucene-based queries would be
nice. That would also solve the boolean-queries question I had a few
days ago :-)

Ravi, doesn't Lucene also allow querying of other fields? Is there any
possibility to add that feature to your proposal?
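
For example, I imagine something along these lines could replace the
single-field parser in Ravi's snippet - completely untested, the field
names are just guesses, and the exact MultiFieldQueryParser signature may
differ between Lucene versions:

   // untested sketch: parse the raw query string against several index
   // fields instead of only "content"
   String[] fields = { "content", "title", "url", "anchor" };
   org.apache.lucene.search.Query luceneQuery =
       org.apache.lucene.queryParser.MultiFieldQueryParser.parse(
           queryString, fields,
           new org.apache.lucene.analysis.standard.StandardAnalyzer());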


In general: what is the advantage of the current Nutch query parser over
going with the Lucene-based one?


Regards,
 Stefan

Ravi Chintakunta wrote:
> Hi Cristina,
> 
> You can achieve this by modifying the IndexSearcher to take the query
> String as an argument and then use
> 
> org.apache.lucene.queryParser.QueryParser's parse(String ) method to
> parse the query string. The modified method in IndexSearcher would
> look as below:
> 
> public Hits search(String queryString, int numHits,
>                    String dedupField, String sortField, boolean reverse)
>     throws IOException {
> 
>   org.apache.lucene.queryParser.QueryParser parser =
>       new org.apache.lucene.queryParser.QueryParser("content",
>           new org.apache.lucene.analysis.standard.StandardAnalyzer());
> 
>   org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
> 
>   return translateHits
>       (optimizer.optimize(luceneQuery, luceneSearcher, numHits,
>                           sortField, reverse),
>        dedupField, sortField);
> }
> 
> For this you have to modify the code in search.jsp and NutchBean too,
> so that you are passing on the raw query string to IndexSearcher.
> 
> Note that with this approach, you are limiting the search to the content
> field.
> 
> 
> - Ravi Chintakunta
> 
> 
> 
> On 10/4/06, Cristina Belderrain <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> we all know that Lucene supports, among others, boolean queries. Even
>> though Nutch is built on Lucene, boolean clauses are removed by Nutch
>> filters so boolean queries end up as "flat" queries where terms are
>> implicitly connected by an OR operator, as far as I can see.
>>
>> Is there any simple way to turn off the filtering so a boolean query
>> remains as such after it is submitted to Nutch?
>>
>> Just in case a simple way doesn't exist, Ravi Chintakunta suggests the
>> following workaround:
>>
>> "We have to modify the analyzer and add more plugins to Nutch
>> to use the Lucene's query syntax. Or we have to directly use
>> Lucene's Query Parser. I tried the second approach by modifying
>> org.apache.nutch.searcher.IndexSearcher and that seems to work."
>>
>> Can anyone please elaborate on what Ravi actually means by "modifying
>> org.apache.nutch.searcher.IndexSearcher"? Which methods are supposed
>> to be modified and how?
>>
>> It would be really nice to know how to do this. I believe many other
>> Nutch users would also benefit from an answer to this question.
>>
>> Thanks so much,
>>
>> Cristina


Searching with "and" and "or?

2006-09-28 Thread Stefan Neufeind
Hi,

I'm trying to build a search like

   searchword AND (site:www.example.com OR site:www.foobar.org)

But no syntax I tried worked. Is it possible somehow?



Regards,
 Stefan



Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

2006-07-26 Thread Stefan Neufeind
Andrzej Bialecki wrote:
> Stefan Neufeind wrote:
>> Sami Siren wrote:
>>  
>>> Stefan Neufeind wrote:
>>>
>>>> Sami Siren wrote:
>>>>  
>>>>> redirecting to nutch-user...
>>>>>
>>>>>> What I currently have is that max. 2 matches are shown per website -
>>>>>> but
>>>>>> that also from the summary-website only 2 matches are shown.
>>>>>> Either I'd
>>>>>> need to be able to show only 2 matches per website but _all_ matches
>>>>>> from the summary-website (would be okay in this case) or give
>>>>>> website 1
>>>>>> to 4 individual "IDs per website" and also assign each URL from the
>>>>>> summary-website the corresponding ID of the website it belongs to.
>>>>>>   
>>>>> You can add whatever (meta-)data to index with indexing filter. You
>>>>> could
>>>>> for example assign tag "A" to site A, tag "B" to B etc...
>>>>> then assign unique tags for pages from summary site.
>>>>>
>>>>> In searching phase you then use that new field as dedupfield
>>>>> (instead of
>>>>> site)
>>>>>
>>>>> This should give you max (for example 2) hits per website and
>>>>> unlimited
>>>>> hits
>>>>> from summary web site.
>>>>>
>>>>> Does that fullfill your requirements?
>>>>> 
>>>> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
>>>> what "filter"?
>>>>   
>>> Write a plugin that provides implementation of
>>> http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html
>>
>> That was (part of) my question - how to do that "cleanly", and if
>> somebody could give a hint. I'm not sure what would be the elegant way
>> of having a "match URL against ... and set tags ABC"-patternfile, how to
>> use a hash-map or something for that and how to do it in Java. (Sorry,
>> I'm not that familiar with Java as with other languages, and neither
>> with nutch-internals).
> 
> If it's a relatively short list of urls (let's say less than 50,000
> entries) then you can use org.apache.nutch.util.PrefixStringMatcher,
> which builds a compact trie structure. I would then strongly advise you
> to keep just the urls (or whatever it is that you need to match) in that
> structure, and all other data in an external DB or a special-purpose
> Lucene index. You can implement this as an indexing plugin - if the
> pattern matches, then you get additional metadata from some external
> source, and you add additional fields to the index that contain this data.

Hmm, I'm still not sure how this would work. (Sorry for that!) I know
that for every URL in my index some prefix matches. I just need to
find out which one. E.g.

http://www.example.com/test1/            as the prefix
and
http://www.example.com/test1/page1.htm   as the page-URL

Now I would want to do a lookup and, based on the prefix, assign the ID
"test1". Do I conclude correctly that in this case I could leave out the
PrefixStringMatcher, since I know that some string will match for all
the URLs?

Do you maybe have a small example for a plugin to match against an
external database?
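
What I picture is roughly the lookup below - completely untested, the
prefixes and tag names are placeholders, the constructor/method names of
PrefixStringMatcher are from a quick look at the class and may need
adjusting, and the actual IndexingFilter wiring (plus reading the mapping
from a file or DB) would still be missing:

   import java.util.HashMap;
   import java.util.Map;
   import org.apache.nutch.util.PrefixStringMatcher;

   public class PrefixTagger {
     private final Map prefixToTag = new HashMap();
     private final PrefixStringMatcher matcher;

     public PrefixTagger() {
       // placeholder mapping; in reality this would come from a
       // pattern file or an external database
       prefixToTag.put("http://www.example.com/test1/", "test1");
       prefixToTag.put("http://www.example.com/test2/", "test2");
       matcher = new PrefixStringMatcher(prefixToTag.keySet());
     }

     /** Returns the tag of the longest matching prefix, or null. */
     public String tagFor(String url) {
       String prefix = matcher.longestMatch(url);
       return prefix == null ? null : (String) prefixToTag.get(prefix);
     }
   }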

PS: Your help is very much appreciated. Sorry for asking dumb questions :-)


Regards,
 Stefan


Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

2006-07-26 Thread Stefan Neufeind
Sami Siren wrote:
> Stefan Neufeind wrote:
>> Sami Siren wrote:
>>
>>> redirecting to nutch-user...
>>>
>>>
>>>> What I currently have is that max. 2 matches are shown per website -
>>>> but
>>>> that also from the summary-website only 2 matches are shown. Either I'd
>>>> need to be able to show only 2 matches per website but _all_ matches
>>>> from the summary-website (would be okay in this case) or give website 1
>>>> to 4 individual "IDs per website" and also assign each URL from the
>>>> summary-website the corresponding ID of the website it belongs to.
>>>
>>> You can add whatever (meta-)data to index with indexing filter. You
>>> could
>>> for example assign tag "A" to site A, tag "B" to B etc...
>>> then assign unique tags for pages from summary site.
>>>
>>> In searching phase you then use that new field as dedupfield (instead of
>>> site)
>>>
>>> This should give you max (for example 2) hits per website and unlimited
>>> hits
>>> from summary web site.
>>>
>>> Does that fullfill your requirements?
>>
>>
>> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
>> what "filter"?
>>
> 
> Write a plugin that provides implementation of
> http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html

That was (part of) my question - how to do that "cleanly", and if
somebody could give a hint. I'm not sure what would be the elegant way
of having a "match URL against ... and set tags ABC"-patternfile, how to
use a hash-map or something for that and how to do it in Java. (Sorry,
I'm not that familiar with Java as with other languages, and neither
with nutch-internals).

  Stefan


Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]

2006-07-26 Thread Stefan Neufeind
Sami Siren wrote:
> redirecting to nutch-user...
> 
>> What I currently have is that max. 2 matches are shown per website - but
>> that also from the summary-website only 2 matches are shown. Either I'd
>> need to be able to show only 2 matches per website but _all_ matches
>> from the summary-website (would be okay in this case) or give website 1
>> to 4 individual "IDs per website" and also assign each URL from the
>> summary-website the corresponding ID of the website it belongs to.
> 
> You can add whatever (meta-)data to index with indexing filter. You could
> for example assign tag "A" to site A, tag "B" to B etc...
> then assign unique tags for pages from summary site.
> 
> In searching phase you then use that new field as dedupfield (instead of
> site)
> 
> This should give you max (for example 2) hits per website and unlimited
> hits
> from summary web site.
> 
> Does that fullfill your requirements?

That would perfectly fit, yes. But how do I "tag" the pages/URLs? With
what "filter"?

  Stefan


Re: Please Help - Patch not working - external links still crawled

2006-07-25 Thread Stefan Neufeind
Ronny wrote:
> Hi all,
> 
> after installing the patch http://issues.apache.org/jira/browse/NUTCH-173 and 
> doing a whole-web crawl, external links are still being crawled.
> 
>  I modified the nutch-site.xml as follows:
> 
> <property>
>   <name>crawl.ignore.external.links</name>
>   <value>true</value>
>   <description>not crawling external links</description>
> </property>
> 
> What did I do wrong?

You did not rebuild nutch, did you?


Regards,
 Stefan


Re: Please Help - Patch install

2006-07-25 Thread Stefan Neufeind
You'd use the "patch" utility, which is generally available on every
Linux installation I know of. It's nothing Java-specific. Various
development IDEs offer patch/merge functionality as well.
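
Roughly, from the top of your nutch source directory (the patch file
name is only an example, you may need -p1 instead of -p0 depending on
how the diff was created, and the ant target names are from memory):

   patch -p0 < NUTCH-173.patch
   ant jar        # rebuild the nutch jar
   ant war        # if you deploy the search webapp

Afterwards redeploy the rebuilt jar/war and add the
crawl.ignore.external.links property to your configuration, as Philippe
already mentioned.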



Regards,
 Stefan

Ronny wrote:
> Hi Stefan,
> 
> which utility do I need, and after installing it how do I apply the patch?
> 
> Sorry for these questions, but I am a beginner in Java and nutch...
> 
> Thanks for your help
> Ronny
> - Original Message - From: "Stefan Neufeind"
> <[EMAIL PROTECTED]>
> To: 
> Sent: Tuesday, July 25, 2006 12:14 PM
> Subject: Re: Please Help - Patch install
> 
> 
>> You should use the patch utility to apply the patch, not do it by
>> hand.
>>
>> That line you mention is sort of "meta-data" interpreted by the
>> patch utility. It's nothing you need to add to the source files!
>>
>>
>> Good luck,
>> Stefan
>>
>> Ronny wrote:
>>> Hello,
>>>
>>> thanks for your reply. Now I tried it and it is not working.
>>>
>>> I just put the lines with + into the source code. The lines are as
>>> follows:
>>>
>>> +public static final boolean CRAWL_IGNORE_EXTERNAL_LINKS =
>>> +NutchConf.get().getBoolean("crawl.ignore.external.links",
>>> false);
>>>
>>> and
>>>
>>> +if (!internal && CRAWL_IGNORE_EXTERNAL_LINKS) {
>>> +continue;  // External links are forbidden : skip it !
>>> +   }
>>>
>>> Of course they are in the right place in the source code. But I don't know
>>> what to do with this: @@ -198,6 +200,9 @@ .
>>>
>>> Please help me
>>> Kind regards
>>> Ronny
>>>
>>>
>>>
>>>
>>> - Original Message - From: "Philippe EUGENE"
>>> <[EMAIL PROTECTED]>
>>> To: 
>>> Sent: Monday, July 24, 2006 10:21 AM
>>> Subject: Re: Please Help - Patch install
>>>
>>>
>>>> Ronny wrote:
>>>>> Hello List,
>>>>>
>>>>> I have a patch for Nutch
>>>>> http://issues.apache.org/jira/browse/NUTCH-173 and I want to install
>>>>> it. But I don't know how to do that.
>>>>> Which file do I have to edit so that I can install and run the patch?
>>>>> I am currently working with nutch 0.7.2.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> Kind regards
>>>>> Ronny
>>>>>
>>>> Hi,
>>>> You must use a tool like svn to apply this patch to your source code.
>>>> It seems to work.
>>>> If you are not familiar with this, you can manually edit this file
>>>> in your IDE:
>>>> tools/UpdateDatabaseTool.java
>>>> The patch.txt file indicates which lines you must edit or replace.
>>>> After that, you must add the option crawl.ignore.external.links to your
>>>> configuration file.
>>>> -- 
>>>> Philippr


Re: Please Help - Patch install

2006-07-25 Thread Stefan Neufeind
You should use the patch utility to apply the patch, not do it by hand.

That line you mention is sort of "meta-data" interpreted by the
patch utility. It's nothing you need to add to the source files!


Good luck,
 Stefan

Ronny wrote:
> Hello,
> 
> thanks for your reply. Now I tried it and it is not working.
> 
> I just put the lines with + into the source code. The lines are as follows:
> 
> +public static final boolean CRAWL_IGNORE_EXTERNAL_LINKS =
> +NutchConf.get().getBoolean("crawl.ignore.external.links", false);
> 
> and
> 
> +if (!internal && CRAWL_IGNORE_EXTERNAL_LINKS) {
> +continue;  // External links are forbidden : skip it !
> +   }
> 
> Of course they are in the right place in the source code. But I don't know
> what to do with this: @@ -198,6 +200,9 @@ .
> 
> Please help me
> Kind regards
> Ronny
> 
> 
> 
> 
> - Original Message - From: "Philippe EUGENE"
> <[EMAIL PROTECTED]>
> To: 
> Sent: Monday, July 24, 2006 10:21 AM
> Subject: Re: Please Help - Patch install
> 
> 
>> Ronny wrote:
>>> Hello List,
>>>
>>> I have a patch for Nutch
>>> http://issues.apache.org/jira/browse/NUTCH-173 and I want to install
>>> it. But I don't know how to do that.
>>> Which file do I have to edit so that I can install and run the patch?
>>> I am currently working with nutch 0.7.2.
>>>
>>> Thanks for your help.
>>>
>>> Kind regards
>>> Ronny
>>>
>> Hi,
>> You must use a tool like svn to apply this patch to your source code.
>> It seems to work.
>> If you are not familiar with this, you can manually edit this file
>> in your IDE:
>> tools/UpdateDatabaseTool.java
>> The patch.txt file indicates which lines you must edit or replace.
>> After that, you must add the option crawl.ignore.external.links to your
>> configuration file.
>> -- 
>> Philippr


Re: any success with php-java-bridge and Nutch?

2006-07-12 Thread Stefan Neufeind

Chris Stephens wrote:
Has anyone had success getting Nutch to work with the php-java-bridge?  
I've been playing around with this for about a day and a half and have 
not been able to get past the error:


java stack trace: java.lang.Exception: CreateInstance failed: new
org.apache.nutch.searcher.NutchBean. Cause: java.lang.NullPointerException
  at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
  at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:84)
  at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:73)
  at java.lang.reflect.Constructor.newInstance(libgcj.so.7)
  at php.java.bridge.JavaBridge.CreateObject(JavaBridge.java:547)
  at php.java.bridge.Request.handleRequest(Request.java:503)
  at php.java.bridge.Request.handleRequests(Request.java:533)
  at php.java.bridge.JavaBridge.run(JavaBridge.java:192)
  at php.java.bridge.BaseThreadPool$Delegate.run(BaseThreadPool.java:37)
Caused by: java.lang.NullPointerException
  at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
  ...8 more


I do have a proper searcher.dir entry in my nutch-site.xml, and my index 
does have data.  My class path currently looks like:
:/usr/java/jdk1.5.0_06:/usr/java/jdk1.5.0_06/lib:/usr/local/nutch/lib:/usr/local/nutch:/usr/local/nutch/conf/nutch-default.xml:/usr/local/nutch/conf/nutch-site.xml 



I would appreciate any reports on nutch working with php-java-bridge and 
information about stability.


Do you really need a real php-java-bridge for that? We're using the 
OpenSearch XML-output from nutch in a php-application and locking down 
access to nutch to localhost only. Works fine ...

(Though if someone gets the php-java-bridge to work that would be cool! :-)


Regards,
 Stefan


Re: Do nutch allow an advanced search?

2006-06-21 Thread Stefan Neufeind
Scott McCammon wrote:
> The index-more plugin indexes each document's last modified date and is
> searchable via a range like: "date:20060521-20060621"  Note that a date
> search does not work by itself. At least one keyword or phrase is required.

Hi Scott,

requiring a keyword/phrase has been mentioned in several places before.
Is there a technical reason for it, or could that limitation maybe be
removed (and should we file a JIRA issue for that)?


Regards,
 Stefan

> John john wrote:
>> Hello
>>  
>>  I'm new in the nutch world and i'm wondering whether it's possible to
>> search with date range? or specify a date and then nutch retrieves
>> pages updated after this date?
>>  
>>  thanks


Re: problem with skiped urls

2006-06-21 Thread Stefan Neufeind
[EMAIL PROTECTED] wrote:
> hi,
> i'm trying to run nutch in our clinical center and i have a little problem.
> we have a few intranet servers and i want nutch to skip a few
> directories.
> for example:
> 
> http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/
> 
> i wrote this urls in the crawl-urlfilter.txt. for example:
> 
> -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
> 
> but nothing happens. nutch doesn't skip these urls, and i don't know why...
> 
> :( can anyone help me?
> 
> i'm cwaling with this command:
> 
> bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &
> 
> i'm using the release 0.7.1

Hi David,

do you have regex-urlfilter in your crawler-site-configfile or
nutch-site-configfile? I suspect that the plugin might not yet be
loaded. Also, do you have another "allow all URLs"-line above the one
you mentioned, maybe?
I don't think the ([a-z0-9]*\.)* should lead to problems (it is * and
not +, so I guess that should be fine). But if your URL does not have
anything in front of sapdoku, maybe try dropping that part.
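
Just to illustrate the ordering point: the first matching rule wins, so
the skip-rule has to come before any broader allow-rule. Something like
this (the allow-line is only an example - adjust it to your setup):

   # skip the docs area first
   -^http://([a-z0-9]*\.)*sapdoku\.ukl\.uni-freiburg\.de/abteilung/pvs/dokus
   # then allow the rest of the intranet
   +^http://([a-z0-9]*\.)*ukl\.uni-freiburg\.de/
   # and skip everything else
   -.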


Good luck,
 Stefan


Re: Restricting query to a domain

2006-06-18 Thread Stefan Neufeind
Bogdan Kecman wrote:
> Use plugin "query-site". It supports the site field. 
> Also if you look at the the 
> 
> NutchBean.search(query, start + hitsToRetrieve, hitsPerSite, "site", sort,
> reverse);
> 
> You will notice that you can get results grouped 
> by site field, actually to get only hitsPerSite 
> number of results per site. 
> 
> Now, this works with 0.7.1, donno about 0.7.2 and 0.8 as 
> I had no time to check them out, but there should not be 
> much difference
> 
> Pay notice that this is a filter, so query like
> 
>  findme andme site:"www.aaa.com"
> 
> Will limit resultset to www.aaa.com only but query
> 
>  site:"www.aaa.com"
> 
> Is empty query and will not return anything.

Why won't that return anything?

And is grouping with "brackets" somehow possible? I know the thing
mentioned below does not work - but would be nice if it could, wouldn't it?

abc && (site:"www.aaa.com" || site:"www.bbb.com")



Regards,
 Stefan

>> -Original Message-
>> From: Bill de hÓra [mailto:[EMAIL PROTECTED] 
>> Sent: Sunday, June 18, 2006 6:33 PM
>> To: nutch-user@lucene.apache.org
>> Subject: Restricting query to a domain
>>
>> Hi,
>>
>> I'll need to provide a search that allow a person to restrict 
>> search to a specific domain (and probably a group of them). 
>> Afaict that's not supported (apologies if I'm wrong). Before 
>> I go rolling my own are they plans to support anything like "site:"?
>>
>> cheers
>> Bill


Re: Removing or reindexing a URL?

2006-06-09 Thread Stefan Neufeind

Andrzej Bialecki wrote:

Stefan Neufeind wrote:
How about making this a commandline-option to inject? Could you create 
an improvement-patch?


FWIW, a patch with similar functionality is in my work-in-progress 
queue,  however it's for 0.8 - there is no point in backporting my patch 
because the architecture is very different...


Here's a snippet:



[...]

I'm fine with 0.8(-dev). I have been using it successfully in production 
myself now :-)


  Stefan


Re: Removing or reindexing a URL?

2006-06-09 Thread Stefan Neufeind
Hi,

it just came to my mind, just to make sure (I don't have the code at
hand): updatedb uses a different portion of code, right? Otherwise we
might immediately re-crawl the URLs we just fetched, because links
pointing to them are found :-)


Regards,
 Stefan

Howie Wang wrote:
> If you don't mind changing the source a little, I would change
> the org.apache.nutch.db.WebDBInjector.java file so that
> when you try to inject a url that is already there, it will update
> it's next fetch date so that it will get fetched during the next
> crawl.
> 
> In WebDBInjector.java in the addPage method, change:
> 
>  dbWriter.addPageIfNotPresent(page);
> 
> to:
> 
>  dbWriter.addPageWithScore(page);
> 
> Every day you can take your list of changed/deleted urls and do:
> 
>bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt
> 
> Then do your crawl as usual. The updated pages will be refetched.
> The deleted pages will attempt to be refetched, but will error out,
> and be removed from the index.
> 
> You could also set your db.default.fetch.interval parameter to
> longer than 30 days if you are sure you know what pages are changing.
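
(For reference, that is this property - it lives in nutch-default.xml
and can be overridden in nutch-site.xml; the value is in days, and 60
below is just an example:)

   <property>
     <name>db.default.fetch.interval</name>
     <value>60</value>
     <description>The default number of days between re-fetches of a page.
     </description>
   </property>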
> 
> Howie
> 
>> With my tests, I index ~60k documents.  This process takes several
>> hours.  I
>> plan on having about half a million documents indexed eventually, and I
>> suspect it'll take more than 24 hours to recrawl and reindex with my
>> hardware, so I'm concerned.
>>
>> I *know* which documents I want to reindex or remove.  It's going to be a
>> very small subset compared to the whole group (I imagine around 1000
>> pages).  That's why I desperately want to be able to give Nutch a list of
>> documents.
>>
>> Ben
>>
>> On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>>
>>> Just recrawl and reindex every day. That was the simple answer.
>>> The more complex answer is you need to do write custom code that
>>> deletes documents from your index and crawld.
>>> If you not want to complete learn the internals of nutch, just
>>> recrawl and reindex. :)
>>>
>>> Stefan
>>> Am 06.06.2006 um 19:42 schrieb Benjamin Higgins:
>>>
>>> > Hello,
>>> >
>>> > I'm trying to get Nutch suitable to use for our (extensive)
>>> > intranet.  One
>>> > problem I'm trying to solve is how best to tell Nutch to either
>>> > reindex or
>>> > remove a URL from the index.  I have a lot of pages that get
>>> > changed, added
>>> > and removed daily, and I'd prefer to have the changes reflected in
>>> > Nutch's
>>> > index immediately.
>>> >
>>> > I am able to generate a list of URLs that have changed or have been
>>> > removed,
>>> > so I definitely do not need to reindex everything, I just need a
>>> > way to pass
>>> > this list on to Nutch.
>>> >
>>> > How can I do this?
>>> >
>>> > Ben


Re: Removing or reindexing a URL?

2006-06-08 Thread Stefan Neufeind
How about making this a commandline-option to inject? Could you create 
an improvement-patch?



Regards,
 Stefan

Howie Wang wrote:

If you don't mind changing the source a little, I would change
the org.apache.nutch.db.WebDBInjector.java file so that
when you try to inject a url that is already there, it will update
its next fetch date so that it will get fetched during the next
crawl.

In WebDBInjector.java in the addPage method, change:

 dbWriter.addPageIfNotPresent(page);

to:

 dbWriter.addPageWithScore(page);

Every day you can take your list of changed/deleted urls and do:

   bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt

Then do your crawl as usual. The updated pages will be refetched.
The deleted pages will attempt to be refetched, but will error out,
and be removed from the index.

You could also set your db.default.fetch.interval parameter to
longer than 30 days if you are sure you know what pages are changing.

Howie

With my tests, I index ~60k documents.  This process takes several 
hours.  I

plan on having about half a million documents indexed eventually, and I
suspect it'll take more than 24 hours to recrawl and reindex with my
hardware, so I'm concerned.

I *know* which documents I want to reindex or remove.  It's going to be a
very small subset compared to the whole group (I imagine around 1000
pages).  That's why I desperately want to be able to give Nutch a list of
documents.

Ben

On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:


Just recrawl and reindex every day. That was the simple answer.
The more complex answer is you need to do write custom code that
deletes documents from your index and crawld.
If you not want to complete learn the internals of nutch, just
recrawl and reindex. :)

Stefan
Am 06.06.2006 um 19:42 schrieb Benjamin Higgins:

> Hello,
>
> I'm trying to get Nutch suitable to use for our (extensive)
> intranet.  One
> problem I'm trying to solve is how best to tell Nutch to either
> reindex or
> remove a URL from the index.  I have a lot of pages that get
> changed, added
> and removed daily, and I'd prefer to have the changes reflected in
> Nutch's
> index immediately.
>
> I am able to generate a list of URLs that have changed or have been
> removed,
> so I definitely do not need to reindex everything, I just need a
> way to pass
> this list on to Nutch.
>
> How can I do this?
>
> Ben


Re: intranet crawl issue

2006-06-08 Thread Stefan Neufeind

Matthew Holt wrote:
Just fyi, both of the sites I am trying to crawl are under the same 
domain; the sub-domains just differ. It works for one, but for the other it 
only appears to fetch 6 or so pages and then doesn't fetch any more. Do you 
need any more information to solve the problem? I've tried everything and 
haven't had any luck. Thanks.


What does your crawl-urlfilter.txt look like?

 Stefan


Re: Filtering webpages based on words / Fetch progress

2006-06-08 Thread Stefan Neufeind

Lukas Vlcek wrote:

Hi again,

On 6/8/06, Mehdi Hemani <[EMAIL PROTECTED]> wrote:

1. I want to filter out webpages based on a list of words. I have
tried filtering webpages based on url, but how to do it based on
words?


As for this question check the following link:
http://wiki.apache.org/nutch/CommandLineOptions

As far as I know this prune tool should be available for nutch 0.8 as
well (at least I can see the class included in the source code, so you
should be able to call it).


Pruning with 0.8-dev works fine here. You give it a file with your 
"queries" and all matching pages will be pruned from the index. There is 
also a dryrun-option available - use that when building your queries :-)


Note that documents are only pruned from the index, not from segments or 
the crawldb! So upon re-indexing or running another crawler-round be 
sure to apply pruning again.
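
For the record, I invoke it roughly like this (class name and options
from memory - run the class without arguments to see the exact usage;
paths and the query file are just examples):

   bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/index \
       -dryrun -queries prune-queries.txt

where prune-queries.txt simply contains one Lucene query per line, e.g.

   url:print
   content:casino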


  Stefan


Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
Oh sorry, I didn't look up the script again from your earlier mail. Hmm,
I guess you can live fine without the invertlinks (if I'm right). Are
you sure that your indexing works fine? I think if an index exists nutch
complains. See if there is any error with indexing. Also maybe try to
delete your current index before indexing again.

Still doesn't work?


Regards,
 Stefan

Matthew Holt wrote:
> Sorry to be asking so many questions.. Below is the current script I'm
> using. It's indexing the segments.. so do I use invertlinks directly
> after the fetch? I'm kind of confused.. thanks.
> matt

[...]

> ---
> 
> Stefan Neufeind wrote:
> 
>> You miss actually indexing the pages :-) This is done inside the
>> "crawl"-command which does everything in one. After you fetched
>> everything use:
>>
>> nutch invertlinks ...
>> nutch index ...
>>
>> Hope that helps. Otherwise let me know and I'll dig  out the complete
>> commandlines for you.
>>
>>
>> Regards,
>> Stefan
>>
>> Matthew Holt wrote:
>>  
>>
>>> Just FYI.. After I do the recrawl, I do stop and start tomcat, and still
>>> the newly created page can not be found.
>>>
>>> Matthew Holt wrote:
>>>
>>>   
>>>> The recrawl worked this time, and I recrawled the entire db using the
>>>> -adddays argument (in my case ./recrawl crawl 10 31). However, it
>>>> didn't find a newly created page.
>>>>
>>>> If I delete the database and do the initial crawl over again, the new
>>>> page is found. Any idea what I'm doing wrong or why it isn't finding
>>>> it?
>>>>
>>>> Thanks!
>>>> Matt
>>>>
>>>> Matthew Holt wrote:
>>>>
>>>> 
>>>>> Stefan,
>>>>> Thanks a bunch! I see what you mean..
>>>>> matt
>>>>>
>>>>> Stefan Neufeind wrote:
>>>>>
>>>>>   
>>>>>> Matthew Holt wrote:
>>>>>>
>>>>>>
>>>>>> 
>>>>>>> Hi all,
>>>>>>> I have already successfuly indexed all the files on my domain only
>>>>>>> (as
>>>>>>> specified in the conf/crawl-urlfilter.txt file).
>>>>>>>
>>>>>>> Now when I use the below script (./recrawl crawl 10 31) to
>>>>>>> recrawl the
>>>>>>> domain, it begins indexing pages off of my domain (such as
>>>>>>> wikipedia,
>>>>>>> etc). How do I prevent this? Thanks!
>>>>>>>  
>>>>>>>   
>>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> have a look at regex-urlfilter. "crawl" is special in some ways.
>>>>>> Actually it's "shortcut" for several steps. And it has a special
>>>>>> urlfilter-file. But if you do it in several steps that
>>>>>> urlfilter-file is
>>>>>> no longer used.


Re: Multiple indexes on a single server instance.

2006-06-06 Thread Stefan Neufeind
Sounds like others might have use for that as well. Can you
provide a clean patchset, maybe? How about a "multi-index" plugin which
parses an xml-file to find the paths of the allowed indexes, like


  
<indexes>
  <index>
    <name>index1</name>
    <path>/data/something/index1</path>
  </index>
  <index>
    <name>index2</name>
    <path>/data/somethingelse</path>
  </index>
</indexes>



Regards,
 Stefan

Ravi Chintakunta wrote:
> See my thread
> 
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03014.html
> 
> where I have modified NutchBean to dynamically pickup the indexes that
> have to be searched. The web page has checkboxes for each index, and
> thus these indexes can be searched in any combination.
> 
> - Ravi Chintakunta
> 
> 
> 
> On 5/31/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> sudhendra seshachala wrote:
>> > Yes. That is what I am trying. But for some reason it is not working.
>> >   Should these fields be lower case only?
>> >
>>
>>
>> Preferably. If you use the default NutchDocumentAnalyzer it will
>> lowercase field names, so you won't get any match.


Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
You miss actually indexing the pages :-) This is done inside the
"crawl"-command which does everything in one. After you fetched
everything use:

nutch invertlinks ...
nutch index ...

Hope that helps. Otherwise let me know and I'll dig  out the complete
commandlines for you.
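
From memory, a full recrawl round on 0.8 looks roughly like this (paths
and -topN are placeholders, <segment> stands for the newly generated
segment directory, and "bin/nutch" without arguments lists the exact
tool names):

   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
   bin/nutch fetch crawl/segments/<segment>
   bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
   bin/nutch invertlinks crawl/linkdb crawl/segments/<segment>
   # delete or move the old crawl/indexes first - indexing complains
   # if the index already exists
   bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
   bin/nutch dedup crawl/indexes
   bin/nutch merge crawl/index crawl/indexes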


Regards,
 Stefan

Matthew Holt wrote:
> Just FYI.. After I do the recrawl, I do stop and start tomcat, and still
> the newly created page can not be found.
> 
> Matthew Holt wrote:
> 
>> The recrawl worked this time, and I recrawled the entire db using the
>> -adddays argument (in my case ./recrawl crawl 10 31). However, it
>> didn't find a newly created page.
>>
>> If I delete the database and do the initial crawl over again, the new
>> page is found. Any idea what I'm doing wrong or why it isn't finding it?
>>
>> Thanks!
>> Matt
>>
>> Matthew Holt wrote:
>>
>>> Stefan,
>>>  Thanks a bunch! I see what you mean..
>>> matt
>>>
>>> Stefan Neufeind wrote:
>>>
>>>> Matthew Holt wrote:
>>>>  
>>>>
>>>>> Hi all,
>>>>>  I have already successfuly indexed all the files on my domain only
>>>>> (as
>>>>> specified in the conf/crawl-urlfilter.txt file).
>>>>>
>>>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>>>> domain, it begins indexing pages off of my domain (such as wikipedia,
>>>>> etc). How do I prevent this? Thanks!
>>>>>   
>>>>
>>>>
>>>>
>>>> Hi Matt,
>>>>
>>>> have a look at regex-urlfilter. "crawl" is special in some ways.
>>>> Actually it's "shortcut" for several steps. And it has a special
>>>> urlfilter-file. But if you do it in several steps that
>>>> urlfilter-file is
>>>> no longer used.


Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
Matthew Holt wrote:
> Hi all,
>   I have already successfuly indexed all the files on my domain only (as
> specified in the conf/crawl-urlfilter.txt file).
> 
> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
> domain, it begins indexing pages off of my domain (such as wikipedia,
> etc). How do I prevent this? Thanks!

Hi Matt,

have a look at regex-urlfilter. "crawl" is special in some ways.
Actually it's "shortcut" for several steps. And it has a special
urlfilter-file. But if you do it in several steps that urlfilter-file is
no longer used.


Regards,
 Stefan


Re: Intranet Crawling

2006-06-05 Thread Stefan Neufeind
Just use a depth of 10 or whatever. If there are no more pages to crawl
one depth more or less does no harm. For normal websites anything in the
range from 5 to 10 for depth imho should be reasonable.

topN: This allows you to work on only the highest ranked URLs not yet
fetched. It functions as a max. pages limit per each run (depth).
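
So for example (directory name and numbers are just placeholders):

   bin/nutch crawl urls -dir crawl.intranet -depth 10 -topN 5000

If I remember correctly, leaving -topN off means "no limit per round".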


Regards,
 Stefan

Matthew Holt wrote:
> Ok thanks.. as far as crawling the entire subdomain.. what exact command
> would I use?
> 
> Because depth says how many pages deep to go.. is there any way to hit
> every single page, without specifying depth? Or should I just say
> depth=10? Also, topN is no longer used, correct?
> 
> Stefan Neufeind wrote:
> 
>> Matthew Holt wrote:
>>  
>>
>>> Question,
>>>   I'm trying to index a subdomain of my intranet. How do I make it
>>> index the entire subdomain, but not index any pages off of the
>>> subdomain? Thanks!
>>>   
>>
>> Have a look at crawl-urlfilter.txt in the conf/ directory.
>>
>> # accept hosts in MY.DOMAIN.NAME
>> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>
>> # skip everything else
>> -.
>>
>>
>> Regards,
>> Stefan


Re: Intranet Crawling

2006-06-05 Thread Stefan Neufeind
Matthew Holt wrote:
> Question,
>I'm trying to index a subdomain of my intranet. How do I make it
> index the entire subdomain, but not index any pages off of the
> subdomain? Thanks!

Have a look at crawl-urlfilter.txt in the conf/ directory.

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.


Regards,
 Stefan


Re: Sorting results by "url"

2006-06-02 Thread Stefan Neufeind
Marco Pereira wrote:
> Hi,
> 
> Please, everybody.
> 
>  I'm indexing a website that makes new scripts with fresh news content
> almost every hour.
> The urls are this way: http://.website.com/1.php
> http://.website.com/2.php http://.website.com/3.php etc...
> 
> Is there a way to modify the results page search.jsp so that it can show
> 1.php first then 2.php then 3.php ... I mean, is there a way to sort
> results
> by url?

For the OpenSearch-interface (RSS-interface) you can supply &sort=url -
or also combine that with &reverse=true if you need it the other way
round. Please note however that those are lexically sorted. In case you
want them to be ordered by fetch-date you can also use &sort=date.
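
For example (host, port and the query term are placeholders):

   http://localhost:8080/opensearch?query=news&sort=url
   http://localhost:8080/opensearch?query=news&sort=url&reverse=true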


Hope that helps,
 Stefan


Re: getting exact number of matches

2006-05-30 Thread Stefan Neufeind
I know it's possible to switch it off. But I need it, and the question
was how to get the exact number of hits after "grouping". The unclean
workaround is the only thing I have found so far:
- one hit per page
- going to page 9
- see where we end up
- cache that number

Works but is ugly :-)

  Stefan

Stefan Groschupf wrote:
> I see you mean grouping by host.
> Yes that works different and is difficult.
> If you like you can switch off grouping by host.
> Stefan
> 
> 
> Am 31.05.2006 um 00:10 schrieb Stefan Neufeind:
> 
>> Hi Stefan,
>>
>> I didn't mean duplicate in the sense of "two times the same result" -
>> but in the sense of "show only XX results per website", e.g. only to
>> show max two pages of a website that might match. And you can't dedup
>> that before the search (runtime) because you don't know what was
>> actually searched. I'm refering to the hitsPerSite-parameter of the
>> webinterface - while in the source it's called a bit more general (there
>> are variables like dedupField etc.).
>>
>>
>> Regards,
>>  Stefan
>>
>> Stefan Groschupf wrote:
>>> Hi,
>>> why not dedub your complete index before and not until runtime?
>>> There is a dedub tool for that.
>>>
>>> Stefan
>>>
>>> Am 29.05.2006 um 21:20 schrieb Stefan Neufeind:
>>>
>>>> Hi Eugen,
>>>>
>>>> what I've found (and if I'm right) is that the page-calculation is done
>>>> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
>>>> results when you only need the first page, I guess currently this is
>>>> not
>>>> done at the moment. However, since I also needed the exact number, I
>>>> did
>>>> find out the "dirty hack" at least. That helps for the moment.
>>>> But as it might take quite a while to find out the exact number of
>>>> pages
>>>> I suggest that e.g. you compose a "hash" or the words searched for, and
>>>> maybe to be sure the number of non-dedupped searchresults, so you don't
>>>> have to search the exact number again and again when moving between
>>>> pages.
>>>>
>>>>
>>>> Hope that helps,
>>>>  Stefan
>>>>
>>>> Eugen Kochuev wrote:
>>>>>
>>>>> And did you manage to locate the place where the filtering on per
>>>>> site basis is done? Is it possible to tweak nutch to make it telling
>>>>> the exact number of pages after filtering or is there a problem?
>>>>>
>>>>>> I've got a pending nutch-issue on this
>>>>>> http://issues.apache.org/jira/browse/NUTCH-288
>>>>>
>>>>>> A dirty workaround (though working) is to do a search with one hit
>>>>>> per
>>>>>> page and start-index as 9. That will give you the actual
>>>>>> start-index
>>>>>> of the last item, which +1 is the number of results you are looking
>>>>>> for.
>>>>>> Since requesting the last page takes a bit resources, you might
>>>>>> want to
>>>>>> cache that result actually - so users searching again or navigating
>>>>>> through pages get the number of pages faster.
>>>>>
>>>>>> PS: For the OpenSearch-connector to not throw an exception but to
>>>>>> return the last page, please apply the patch I attached to the bug.


Re: Multiple indexes on a single server instance.

2006-05-30 Thread Stefan Neufeind
sudhendra seshachala wrote:
> I am experiencing a similar problem.
>   What I have done is as follows.
>   I have a different parse-plugin for each site (I have 3 sites to crawl and
> fetch data from). But I capture the data into the same format, which I call a
> data repository.
>   I have one index-plugin which indexes the data repository and one
> query-plugin on the data repository.
>   I don't have to run multiple instances. I just run one instance of the
> search engine.
>   However, the parse configuration is different for each site, so I run a
> different crawler for each site.
>   Then I index and merge all of them. So far the results are good, if not
> "WOW".
>   I still have to figure out a way of ranking the pages. For example I would
> like to be able to apply ranking on the data repository. Let me know if I was
> clear...

Hi,

not sure if I got you right with your last point, but it just came to my
mind:
It would be nice to be able to have something like
"If it's from indexA, give it 100 extra-points - if from indexB give it
50 extra-points". Or some "if indexA give it 20% extra-weight" or so.
But I don't believe this is easily doable. Or is it?

I got a similar problem with languages: give priority to documents in
German and English. But somewhere after those results also list
documents in other languages. So I'd need to be able to give
"extra-points" on a "per-language"-basis, based on the indexed
language-field, right?


Regards,
 Stefan

> Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>   I'm not sure what you are planning to do, but you can just switch a 
> symbolic link on your hdd driven by a cronjob to switch between index 
> on a given time.
> May be you need to touch the web.xml to restart the searcher.
> If you try to search in different kind of indexes at the same time, I 
> suggest to merge the indexes and have a kind keyfield for each of the 
> indexes.
> For example add a field to each of your indexes names "indexName" and 
> put A, B and C as value into it.
> Then you can merge your index. During runtime you just need to have a 
> queryfilter that extend a indexName:A or indexName:B to the query 
> string.
> 
> Does this somehow help to solve your problem?
> Stefan
> 
> Am 23.05.2006 um 15:26 schrieb TJ Roberts:
> 
>> I have five different indexes each with their own special 
>> configuration. I would like to be able to switch between the 
>> different indexes dynamically on a single instance of nutch running 
>> on jakarta-tomcat. Is this possible, or do I have to run five 
>> instances of nutch, one for each index?


Re: Multiple indexes on a single server instance.

2006-05-30 Thread Stefan Neufeind
I ran into a similar question myself a while ago. What I
could imagine are company A, company B and company C. All want to be
able to have "their own" search-engine. At the same time there might be
a "special" search-engine needed that crawls content from both company A
and B but not C. I think that's where your suggestion comes into play,
right? With the indexname.
a) How would you "extend" your indexes by one field before merging them?
Is there a small tool to add a field to an index?
b) Do you always have to merge the indexes, or could you use some
feature from the "distributed" nutch to search in multiple indexes? I
just think about that because it would allow you to use multiple maybe
huge indexes that could all be updated separately and without having to
merge them again.

Another point I have understood from the original question:
How would it be possible to have an OpenSearch-interface for multiple
indexes running on one single Tomcat-instance. I think the author asked
whether you could/would install separate copies at the same time with
different searcher.dir-settings in their nutch-site.xml.
With your suggestion: I understand that a plugin similar to "query-more"
could be written to allow providing a search for "indexName" (as you
suggested) as well, right? With this, would it also be possible to ask
for "indexName=A or B but not C"?

  Stefan

Stefan Groschupf wrote:
> I'm not sure what you are planning to do, but you can just switch a
> symbolic link on your hdd driven by a cronjob to switch between index on
> a given time.
> May be you need to touch the web.xml to restart the searcher.
> If you try to search in different kind of indexes at the same time, I
> suggest to merge the indexes and have a kind keyfield for each of the
> indexes.
> For example add a field to each of your indexes names "indexName" and
> put A, B and C as value into it.
> Then you can merge your index. During runtime you just need to have a
> queryfilter that extend a indexName:A or indexName:B to the query string.
> 
> Does this somehow help to solve your problem?
> Stefan
> 
> Am 23.05.2006 um 15:26 schrieb TJ Roberts:
> 
>> I have five different indexes each with their own special
>> configuration.  I would like to be able to switch between the
>> different indexes dynamically on a single instance of nutch running on
>> jakarta-tomcat.  Is this possible, or do I have to run five instances
>> of nutch, one for each index?


Re: Re-parsing document

2006-05-30 Thread Stefan Neufeind
Hi Stefan,

that seems to have worked. And I tried out that my patch to the
PDF-parser actually prevented "unclean" IO-exceptions (see
http://issues.apache.org/jira/browse/NUTCH-290   ).
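
(For the record, what I did boils down to the following - the segment
name is just an example, and "bin/nutch parse" is the ParseSegment tool:)

   rm -r crawl/segments/20060528124255/crawl_parse \
         crawl/segments/20060528124255/parse_data \
         crawl/segments/20060528124255/parse_text
   bin/nutch parse crawl/segments/20060528124255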

The strange thing, however, is that I still see "garbage" (undecoded
binary data from the PDF-file) in search-summaries. Could it be that
possibly since my plugin returns empty content (and by that preventing
an exception) some other place in the source still thinks "no summary?
I'll grab the raw content instead then"?

My problem is that for unparseable files I get binary data in the
summaries. The special case in my eyes are PDF-files, where the patch
now prevents an exception which leads to a "parse failed". Now the parse
is fine, but I still get binary summaries :-(


Could you maybe have a look at the issue? There is a test-PDF mentioned
as well. And I can offer more :-)


Regards,
 Stefan

Stefan Groschupf wrote:
> You can just delete the parse output folders and start the parsing tool.
> Parsing a given page again only makes sense for debugging, since the
> hadoop io system cannot update entries.
> If you need to debug, I suggest you write a junit test.
> 
> HTH
> Stefan
> 
> 
> Am 29.05.2006 um 01:01 schrieb Stefan Neufeind:
> 
>> Hi,
>>
>> what is needed to re-parse documents that were already fetched into a
>> segment? Is another "nutch index ..."-run sufficient, or how could I
>> send the documents through the parse-plugins again?
>>
>>
>> Regards,
>>  Stefan


Re: getting exact number of matches

2006-05-30 Thread Stefan Neufeind
Hi Stefan,

I didn't mean duplicate in the sense of "two times the same result" -
but in the sense of "show only XX results per website", e.g. only to
show max two pages of a website that might match. And you can't dedup
that before the search (runtime) because you don't know what was
actually searched. I'm refering to the hitsPerSite-parameter of the
webinterface - while in the source it's called a bit more general (there
are variables like dedupField etc.).


Regards,
 Stefan

Stefan Groschupf wrote:
> Hi,
> why not dedub your complete index before and not until runtime?
> There is a dedub tool for that.
> 
> Stefan
> 
> Am 29.05.2006 um 21:20 schrieb Stefan Neufeind:
> 
>> Hi Eugen,
>>
>> what I've found (and if I'm right) is that the page-calculation is done
>> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
>> results when you only need the first page, I guess currently this is not
>> done at the moment. However, since I also needed the exact number, I did
>> find out the "dirty hack" at least. That helps for the moment.
>> But as it might take quite a while to find out the exact number of pages
>> I suggest that e.g. you compose a "hash" or the words searched for, and
>> maybe to be sure the number of non-dedupped searchresults, so you don't
>> have to search the exact number again and again when moving between
>> pages.
>>
>>
>> Hope that helps,
>>  Stefan
>>
>> Eugen Kochuev wrote:
>>>
>>> And did you manage to locate the place where the filtering on per
>>> site basis is done? Is it possible to tweak nutch to make it telling
>>> the exact number of pages after filtering or is there a problem?
>>>
>>>> I've got a pending nutch-issue on this
>>>> http://issues.apache.org/jira/browse/NUTCH-288
>>>
>>>> A dirty workaround (though working) is to do a search with one hit per
>>>> page and start-index as 9. That will give you the actual
>>>> start-index
>>>> of the last item, which +1 is the number of results you are looking
>>>> for.
>>>> Since requesting the last page takes a bit resources, you might want to
>>>> cache that result actually - so users searching again or navigating
>>>> through pages get the number of pages faster.
>>>
>>>> PS: For the OpenSearch-connector to not throw an exception but to
>>>> return the last page, please apply the patch I attached to the bug.


Re: getting exact number of matches

2006-05-29 Thread Stefan Neufeind
Hi Eugen,

what I've found (and if I'm right) is that the page-calculation is done
in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
results when you only need the first page, I guess this is simply not
done at the moment. However, since I also needed the exact number, I did
at least find the "dirty hack". That helps for the moment.
But as it might take quite a while to find out the exact number of pages,
I suggest that you e.g. compose a "hash" of the words searched for (and,
to be sure, maybe the number of non-dedupped search results), so you don't
have to compute the exact number again and again when moving between pages.


Hope that helps,
 Stefan

Eugen Kochuev wrote:
> 
> And did you manage to locate the place where the filtering on a per-
> site basis is done? Is it possible to tweak nutch to make it tell
> the exact number of pages after filtering, or is there a problem?
> 
>> I've got a pending nutch-issue on this
>> http://issues.apache.org/jira/browse/NUTCH-288
> 
>> A dirty workaround (though working) is to do a search with one hit per
>> page and start-index as 9. That will give you the actual start-index
>> of the last item, which, plus one, is the number of results you are
>> looking for.
>> Since requesting the last page takes a bit of resources, you might want to
>> cache that result actually - so users searching again or navigating
>> through pages get the number of pages faster.
> 
>> PS: For the OpenSearch-connector to not throw an exception but to return
>> the last page, please apply the patch I attached to the bug.



Re: content-type crawling problem

2006-05-29 Thread Stefan Neufeind
Heiko Dietze wrote:
> Hello,
> 
> Eugen Kochuev wrote:
>> Btw, do I need to uncomment this? It's more logical to comment this
>> out. Right?
>>
>>
>>> [quoted snippet from parse-plugins.xml - the wildcard mimeType match -
>>> stripped by the mail archive]
>>
>>
>>> Just uncomment this wildcard match. You might also check
>>> the other rules for further unwanted content.
> 
> Sorry for the typo, I meant that you should leave it out, yes.
> 
> Unfortunately this is not a solution for the fetching of the pages, but
> the index will then be based only on the proper content. I think the
> index is created from the parsed content.

Maybe have a look at urlfilter-suffix and only fetch those files with
suffixes you want.


Regards,
 Stefan


Re: getting exact number of matches

2006-05-29 Thread Stefan Neufeind
Eugen Kochuev wrote:
> Hello nutch-user,
> 
>   I'm rewriting the JSP front-end to add a pager (currently there's only
>   a "Next page" button) and I have run into a difficulty: I cannot get
>   the number of matches if the hits are filtered to show
>   only 2 results per domain. How can this be resolved? Where is this
>   filtering done, and where is the exact number of pages lost? Please
>   advise.

Hi,

I've got a pending nutch-issue on this
http://issues.apache.org/jira/browse/NUTCH-288

A dirty workaround (though working) is to do a search with one hit per
page and start-index as 9. That will give you the actual start-index
of the last item, which, plus one, is the number of results you are
looking for.
Since requesting the last page takes a bit of resources, you might want to
cache that result actually - so users searching again or navigating
through pages get the number of pages faster.
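
As a concrete request this could look like the following (a sketch only -
the servlet path and parameter names are assumptions based on my local
0.8-dev webapp, so check your web.xml and search.jsp):

# one hit per page, start offset far beyond the real end of the result
# list; the start index reported for the last hit, plus one, is the
# deduplicated total
curl "http://localhost:8080/opensearch?query=nutch&hitsPerPage=1&hitsPerSite=2&start=999999"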

PS: For the OpenSearch-connector to not throw an exception but to return
the last page, please apply the patch I attached to the bug.


Regards,
 Stefan



Re-parsing document

2006-05-28 Thread Stefan Neufeind
Hi,

what is needed to re-parse documents that were already fetched into a
segment? Is another "nutch index ..."-run sufficient, or how could I
send the documents through the parse-plugins again?


Regards,
 Stefan


How to copy compiled files to correct dirs?

2006-05-28 Thread Stefan Neufeind
Hi,

after doing an "ant compile", how are the files (e.g. all plugins)
supposed to be copied from the build/-directory to the normal
plugins-directory that is shipped when downloading a nightly?
I've been re-compiling a plugin and wondered why "ant compile" leaves
the file in build/ and does not overwrite the actual plugin I was using.
Okay, a simple copy or symlink helped in this case ... but did I miss
any script that is supposed to be called to copy the files "where they
belong"?


Regards,
 Stefan


Re: 0.8 release soon?

2006-05-27 Thread Stefan Neufeind
Andrzej Bialecki wrote:
> Doug Cutting wrote:
>> Andrzej Bialecki wrote:
>>> 0.8 is pretty stable now, I think we should start considering a
>>> release soon, within the next month's time frame.
>>
>> +1
>>
>> Are there substantial features still missing from 0.8 that were
>> supported in 0.7?
> 
> Next week I'll be working on NUTCH-61 to bring it to a state where it
> could be committed. It's a new feature, so the question is: should we
> play it safe and hold it until after the release, or should we go with it
> in the hope that it will get a wider testing audience? ;)

+1 for being "safe" and instead focusing on some of the already
mentioned patches that might need attention more urgently.

  Stefan


Re: 0.8 release soon?

2006-05-26 Thread Stefan Neufeind
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> 0.8 is pretty stable now, I think we should start considering a
>> release soon, within the next month's time frame.
> 
> +1
> 
> Are there substantial features still missing from 0.8 that were
> supported in 0.7?
> 
> Are there any showstopping bugs, things that worked in 0.7 that are
> broken in 0.8?

+1 as well, though I'm still new to the topic.

During the setup I've come across a few patches that I think might be
worth including in the 0.8 release. Those are:

fixes:
NUTCH-110-fixIllegalXmlChars08.patch
NUTCH-254-fetcher_filter_url_patch.txt

new features, that I tested and work fine here:
NUTCH-48-did-you-mean-combined08.patch
NUTCH-173-patch08-new.patch
NUTCH-279-regex-normalize.patch
NUTCH-288-OpenSearch-fix.patch


!! open issues, from my side:
NUTCH-277 (seems to affect httpclient, changing to http helped)


Feedback welcome.


Regards,
 Stefan


Re: Sorting in nutch-webinterface - how?

2006-05-26 Thread Stefan Neufeind
Marko Bauhardt wrote:
> 
> On 26.05.2006, at 01:57, Stefan Neufeind wrote:
>>> Modified. If not, date=FetchTime.
>>
>> Hi Marko,
>>
> 
> Hi Stefan,
> 
>> that hint really helped. Can you maybe also help me out with sort=title?
>> See also:
>> http://issues.apache.org/jira/browse/NUTCH-287
>>
>> The problem is that it works on some searches - but not always. Could it
>> be that maybe some plugins don't write a title or write title as
>> null/empty and that leads to problems? What could I do:
> 
> If a html page begins with "<?xml ...>" then maybe the TextParser is used
> instead of the html parser (i am not sure). If the TextParser is used to
> parse this page, then no title will be extracted. So in this case the
> title is empty and the summary is xml-code.
> 
> Please verify your pages that have no title and look whether "<?xml"
> exists at the beginning of those pages.

I could understand that those documents are "problematic" in sorting -
e.g. they would all end up at the front or at the end of the sorted list.
But why does this actually lead to no output / an exception / ...?

Maybe in case no title is present at least _something_ could be used -
e.g. the URL instead or so?


Regards,
 Stefan


Re: Incremental crawl again ... (Please explain)

2006-05-26 Thread Stefan Neufeind
I haven't yet tried - but could you maybe:
- move the new segments somewhere independent of the existing ones
- create a separate linkdb for it (to my understanding the linkdb is
only needed when indexing)
- create a separate index on that
- then move segment into segments-dir and new index into indexes-dir as
"part-"
- just merge indexes (should work relatively fast)

In the long term your segments, indexes etc. add up - so at some point
you'd need to think about merging segments etc.

Also, this is "only" my current understanding of the topic. It would be
nice to get feedback and maybe easier solutions from others as well.
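
Roughly, in commands, what I have in mind (a sketch only, with example
paths and index names; I haven't run this end to end, so check each
tool's usage output first):

# fetch and index one new round on its own
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
seg=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $seg
bin/nutch updatedb crawl/crawldb $seg
bin/nutch invertlinks crawl/linkdb $seg
bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb $seg

# move the new part index next to the existing ones (hadoop dfs -mv if
# your data lives in DFS), then merge everything into one index
mv crawl/indexes-new/part-00000 crawl/indexes/part-00001
bin/nutch merge crawl/index crawl/indexes/*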



Regards,
 Stefan

Jacob Brunson wrote:
> Yes, I see what you mean about re-indexing again over all the
> segments.  However, indexing takes a lot of time and I was hoping that
> merging many smaller indexes would be a much more efficient method.
> Besides, deleting the index and re-indexing just doesn't seem like
> *The Right Thing(tm)*.
> 
> On 5/26/06, zzcgiacomini <[EMAIL PROTECTED]> wrote:
>> I am not at all a Nutch expert, I am just experimenting a little bit,
>> but as far as I understood it
>> you can remove the indexes directory and re-index the segments again:
>> In my case after step 8 (see below) I have only one segment:
>> test/segments/20060522144050
>> after step 9 I will have a second segment
>> test/segments/20060522151957
>> Now what we can do is to remove the test/indexes directory and
>> re-index the two segments:
>> this is what I did:
>>
>> hadoop dfs -rm test/indexes
>> nutch index test/indexes test/crawldb linkdb
>> test/segments/20060522144050 test/segments/20060522151957
>>
>> Hope it helps
>>
>> -Corrado
>>
>>
>>
>> Jacob Brunson wrote:
>> > I looked at the referenced messaged at
>> > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
>> > but I am still having problems.
>> >
>> > I am running the latest checkout from subversion.
>> >
>> > These are the commands which I've run:
>> > bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 1
>> > bin/nutch generate crawl/crawldb crawl/segments -topN 500
>> > lastsegment=`ls -d crawl/segments/2* | tail -1`
>> > bin/nutch fetch $lastsegment
>> > bin/nutch updatedb crawl/crawldb $lastsegment
>> > bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
>> >
>> > This last command fails with a java.io.IOException saying: "Output
>> > directory /home/nutch/nutch/crawl/indexes already exists"
>> >
>> > So I'm confused because it seems like I did exactly what was described
>> > in the referenced email, but it didn't work for me.  Can someone help
>> > me figure out what I'm doing wrong or what I need to do instead?
>> > Thanks,
>> > Jacob
>> >
>> >
>> > On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
>> >> Please do follow the link below..
>> >>  
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
>> >>
>> >>   I have been able to follow the threads as explained and merge
>> >> multiple crawls. It works like a champ.
>> >>
>> >>   Thanks
>> >>   Sudhi
>> >>
>> >> zzcgiacomini <[EMAIL PROTECTED]> wrote:
>> >>   I am currently using the last nightly nutch-0.8-dev build and
>> >> I am really confused about how to proceed after I have done two
>> >> different "whole web" incremental crawl
>> >>
>> >> The tutorial is not clear to me on how to merge the results of the
>> >> two crawls in order to be able to
>> >> perform a search.
>> >>
>> >> Could someone please give me a hint on what the right
>> procedure is?!
>> >> here is what I am doing:
>> >>
>> >> 1. create an initial urls file /tmp/dmoz/urls.txt
>> >> 2. hadoop dfs -put /tmp/urls/ url
>> >> 3. nutch inject test/crawldb dmoz
>> >> 4. nutch generate test/crawldb test/segments
>> >> 5. nutch fetch test/segments/20060522144050
>> >> 6. nutch updatedb test/crawldb test/segments/20060522144050
>> >> 7. nutch invertlinks linkdb test/segments/20060522144050
>> >> 8. nutch index test/indexes test/crawldb linkdb
>> >> test/segments/20060522144050
>> >>
>> >> ..and now I am able to search...
>> >>
>> >> Now I run
>> >>
>> >> 9. nutch generate test/crawldb test/segments -topN 1000
>> >>
>> >> and I will end up to have a new segment : test/segments/20060522151957
>> >>
>> >> 10. nutch fetch test/segments/20060522151957
>> >> 11. nutch updatedb test/crawldb test/segments/20060522151957
>> >>
>> >>
>> >> From this point on I cannot make much progress
>> >>
>> >> A) I have tried to merge the two segments into a new one with the
>> >> idea to rerun an invertlinks and index on it but:
>> >>
>> >> nutch mergesegs test/segments -dir test/segments
>> >>
>> >> whatever I specify as outputdir or outputsegment I get errors
>> >>
>> >> B) I have also tried to make invertlinks on all test/segments with
>> >> the goal to run nutch index command to produce a second
>> >> indexes directory, let's say test/indexes1, and finally run the merge
>> >> index on index2
>> >>
>> >> nutch in

Re: Sorting in nutch-webinterface - how?

2006-05-25 Thread Stefan Neufeind
Marko Bauhardt wrote:
> 
>> Hmm, that works. But why - since I think the field is named lastModified.
> 
> LastModified is only used if lastModified is available via the html
> meta tags. If that is true, lastModified is stored but not indexed.
> However the date field is always indexed. If lastModified is available
> as a metatag, then date=lastModified. If not, date=FetchTime.

Hi Marko,

that hint really helped. Can you maybe also help me out with sort=title?
See also:
http://issues.apache.org/jira/browse/NUTCH-287

The problem is that it works on some searches - but not always. Could it
be that maybe some plugins don't write a title or write title as
null/empty and that leads to problems? What could I do:
a) as a quickfix to prevent the exception, and
b) to track down further which result(s) actually cause the
problem and why.

I've taken a look at the javadoc of the lucene interface. It looks
like if you sort by something, fields[0] should always be set with
the field you searched for - but afaik it is actually null, or maybe
fields is even empty or so.


Regards,
 Stefan


Re: Sorting in nutch-webinterface - how?

2006-05-25 Thread Stefan Neufeind
Marko Bauhardt wrote:
> 
> On 25.05.2006, at 13:21, Stefan Neufeind wrote:
> 
>> Hi,
>>
>> I did use index-basic and index-more. I see lastModified in the
>> RSS-output. Now I want to &sort=lastModified - does not work.
> 
> Try sort=date.

Hmm, that works. But why - since I think the field is named lastModified.
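
For the record, the request that works here looks like this (the
"reverse" parameter name is an assumption from my reading of search.jsp,
so double-check it):

# sort by the indexed date field (fetch time or lastModified), newest first
curl "http://localhost:8080/search.jsp?query=nutch&sort=date&reverse=true"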


Thank you very much for your help,
 Stefan


Sorting in nutch-webinterface - how?

2006-05-25 Thread Stefan Neufeind
Hi,

I did use index-basic and index-more. I see lastModified in the
RSS-output. Now I want to &sort=lastModified - does not work. Same for
&sort=title. However &sort=url does work.

What am I doing wrong here?


Regards,
 Stefan


Re: using nutch to detect broken pages

2006-05-24 Thread Stefan Neufeind
Jorg Heymans wrote:
> Hi,
> 
> I was wondering if it's possible to get crawl to go through a website and
> only report links that return a specific http response code (eg 404) ? I'm
> looking to somehow automate basic site testing of rather huge websites,
> inevitably one ends up in the world of crawlers (and being a java guy
> myself
> this means nutch).
> 
> I'm still going through the faq and first basic steps, so apologies if what
> i'm asking is the most basic nutch-thing ever :)

I haven't used it yet - but I guess that's what the "store" setting for
the fetcher in the nutch config might be for. To my understanding this
would allow you not to store the fetched content but only crawl the links.
From the crawldb I guess you should (somehow) be able to see for which
URLs retries were conducted unsuccessfully etc.
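
One thing you could try after a round is dumping the crawldb and looking
at the status recorded for each URL (a sketch - I'm going from the
0.8-dev CrawlDbReader, so check the options with a bare "bin/nutch readdb"):

# dump the crawldb to text and look for entries the fetcher gave up on
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -i gone crawldb-dump/part-*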

Maybe you could instead also just monitor the output of the fetcher
while running?

Would be nice to hear if you manage to set up a working solution imho.


Regards,
 Stefan


Re: how to

2006-05-24 Thread Stefan Neufeind
Daniel wrote:
> Dear friends:
> 
> I'm new here.
> I want to know
> 1. how to put patches to Nutch

Simply use the patch utility available e.g. under Linux. Go into the
nutch application root dir (from where you can see "conf" and "src" and
the like). There do:

patch -p1 <../pathtopath/mypatch.patch

The -p1 here depends on the path names used inside the patch - sorry. So
in case the patch contains paths like

nutch/trunk/src/...

then, since src is directly reachable from the directory you are in, you
would want to strip the first two parts of the paths used in the patch.
So in this case you would use "patch -p2" (p followed by the number
of path components to strip).
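
A concrete run might look like this (GNU patch assumed; the directory and
patch file names are just examples, and --dry-run lets you test the -p
level without changing anything):

cd nutch-0.8-dev
patch -p1 --dry-run < ../NUTCH-288-OpenSearch-fix.patch
patch -p1 < ../NUTCH-288-OpenSearch-fix.patch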

> 2. how to establish a Chinese character lexicon to the Nutch

Which lexicon do you mean? Nutch is using UTF-8, so I guess there should
be no problem with Chinese characters in the index in general. But maybe
I got you wrong. Also, I haven't had to deal with Chinese so far ... sorry.



Regards,
 Stefan


Re: When will we see 0.8?

2006-05-23 Thread Stefan Neufeind
Benjamin Higgins wrote:
> I've heard it's pretty stable.  Should I use 0.8 CVS now or wait for it to
> be officially released?

We're using it, since we wanted some of the new features. During testing
some problems turned up that we were luckily able to fix easily with
available patches. For those cases where e.g. a new feature existed but
the patch had not yet been rewritten from 0.7 to 0.8, we've submitted an
updated patch. So I guess a basic 0.8 should easily work for you. If you
get problems in one area, see jira or ask here.

But as always: if you want something stable (read: without using at
least one finger to touch source), go for 0.7 imho.

> Is there a friendly changelog for 0.8?

Not sure, sorry.

> Also, will 0.8 require its own Tomcat instance like 0.7 did, or will it
> play nice not being the ROOT?

Works as non-ROOT. There is also a very simple patch in jira to
implement that for 0.7 as well (1 line if I remember right).


Regards,
 Stefan


Re: Setting query.host.boost etc. in nutch-site.xml does not work?

2006-05-22 Thread Stefan Neufeind
Wow Marko, that was damn quick. I didn't recognise the error, though I
looked into the sources briefly.

Thanks for finding the bug - and for finding it in so little time. You
made my day!

And also thanks to Andrzej for putting a fix in the trunk already:
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java?r1=383304&r2=408767&pathrev=408767


Thank you,
 Stefan

Marko Bauhardt wrote:
> This is a bug in the query-basic plugin. The boosting values in the
> nutch-default.xml are not used.
> We should open a bug in jira. This simple patch should work.
> 
> Index:
> src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java
> 

[...]

> On 22.05.2006, at 22:07, Stefan Neufeind wrote:
> 
>> Hi,
>>
>> I was experiencing a "strange" selection of search results here. The
>> first idea was to rank results with the search word in the hostname
>> higher. So I set query.host.boost to quite a high value (50, later 200).
>> But nothing in the results changed.
>>
>> Searching for the full hostname (www.example.com) does not give me any
>> results at all. Could it be that the hostname is not taken into account
>> during a search?
>>
>> What could be wrong here? Please help. *sigh*


Setting query.host.boost etc. in nutch-site.xml does not work?

2006-05-22 Thread Stefan Neufeind
Hi,

I was experiencing a "strange" selection of search results here. The
first idea was to rank results with the search word in the hostname
higher. So I set query.host.boost to quite a high value (50, later 200).
But nothing in the results changed.

Searching for the full hostname (www.example.com) does not give me any
results at all. Could it be that the hostname is not taken into account
during a search?

What could be wrong here? Please help. *sigh*



Thanks a lot,
 Stefan


Applying new regex-normalizer-rules to indexed pages

2006-05-22 Thread Stefan Neufeind
Hi,

during a long fetch-run I experienced session-IDs in URLs, which was a
bit problematic. So I figured out how to write and test proper
regex-normalizer-rules (see NUTCH-279).

Now I wonder if on the next fetch round URLs will get properly
normalized, or if they now sit un-normalized in the crawldb and are
selected from there during generate without the "duplicate"
(after normalization) URLs being recognized.

Also, is there a way to "clean" the page index before actually indexing?
Or would this automatically be taken care of (does the normalizer run
again?) when performing the actual invertlinks/index/dedup?


Regards,
 Stefan


Re: Debugging rules for RegexUrlNormalizer

2006-05-22 Thread Stefan Neufeind
Thought I just missed something. Okay, I just added a few patterns as
well as a commandline-checker. See

http://issues.apache.org/jira/browse/NUTCH-279

for the patch.
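
With that checker applied, testing a rule against a single URL looks
roughly like this (an assumption on my side - the exact invocation
depends on the NUTCH-279 patch, so see the issue for details):

# feed one URL through the normalizer and compare input and output
echo "http://www.example.com/page.jsp?id=1;jsessionid=ABC123" | \
  bin/nutch org.apache.nutch.net.RegexUrlNormalizer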


Regards,
 Stefan

TDLN wrote:
> Sorry, I was a bit too fast there, the answer applies to the
> RegexURLFilter not the RegexUrlNormalizer. I don't think there is a
> similar facility for the RegexUrlNormalizer, but let me know if you
> find it :)
> 
> Rgrds, Thomas
> 
> On 5/22/06, TDLN <[EMAIL PROTECTED]> wrote:
>> Hi Stefan
>>
>> try running bin/nutch org.apache.nutch.net.URLFilterChecker
>>
>> Rgrds, Thomas
>>
>> On 5/22/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > is there a way to debug rules for RegexUrlNormalizer, e.g. test the
>> > substitution from commandline?
>> >
>> >
>> > bin/nutch org.apache.nutch.net.RegexUrlNormalizer
>> >
>> > does print out the rules it uses. But afaik there is no such thing
>> > possible as
>> >
>> > echo "http://www.example.com"; | bin/nutch
>> > org.apache.nutch.net.RegexUrlNormalizer
>> >
>> > is there? So how do you debug rules when writing new ones and testing
>> > them against a set of URLs that should match / should not match?


Debugging rules for RegexUrlNormalizer

2006-05-22 Thread Stefan Neufeind
Hi,

is there a way to debug rules for RegexUrlNormalizer, e.g. test the
substitution from commandline?


bin/nutch org.apache.nutch.net.RegexUrlNormalizer

does print out the rules it uses. But afaik there is no such thing
possible as

echo "http://www.example.com"; | bin/nutch
org.apache.nutch.net.RegexUrlNormalizer

is there? So how do you debug rules when writing new ones and testing
them against a set of URLs that should match / should not match?



Regards,
 Stefan


Re: Nutch fetcher "waiting" inbetween fetch

2006-05-22 Thread Stefan Neufeind
Here we're using one machine for fetching (exclusively, at the moment)
with about 50 fetchers and a local Bind resolver in a caching-nameserver
setup. Bandwidth of the fetchers is roughly 5 to 10 Mbit inbound.

What I see is that during fetching java is taking 99.9% cpu (all
userland). At the point where the server "stalls" this changes to 99.9%
system usage (writing something to disk?). It stalls for about 30
seconds or a bit more.

Hmm - I don't know where to look for the cause of these stalls, since
you don't see what it really does at that point (in logs or so).


PS: Your thoughts on this are very much appreciated.


Regards,
 Stefan

Dennis Kubes wrote:
> What we were seeing is the dns server cached the addresses in memory
> (bind 9x..) and because we were caching so many addresses on a single
> dns server it would eat up memory and eventually begin swapping to
> disk.  When this occurred the server load got up to 1.5 and the iowait
> was near 100%.  Basically it stalled the box.  Requests were still
> getting through but it was very slow.  Our solution (at least
> temporarily) was to restart the bind service (not the box, just the
> daemon) every couple of hours to flush the memory.
> As for load on the boxes we are seeing very minimal loads (like .08
> loads and no iowait times).  We have about 55 fetchers running (5 on
> each box with 11 nodes) and right now we are bandwidth bound on a 2Mbps
> pipe.  So maybe it is just that we don't have enough load on each
> machine to see the kind of waits that you are seeing.  Is your system
> distributed or on a single machine ?
> 
> Dennis
> 
> Stefan Neufeind wrote:
>> Hi Dennis,
>>
>> thank you for the answer. Hmm, could theoretically be. But to prevent
>> this the server already does resolving completely on its local machine.
>> Also I wonder about the CPU-load moving to "system" - I suspected heavy
>> disk-access or so ... but I don't know how/when the fetcher writes data
>> to disk etc.
>>
>>
>>
>> Regards,
>>  Stefan
>>
>> Dennis Kubes wrote:
>>  
>>> Is this possibly a dns issue.  We are running a 5M page crawl and are
>>> seeing very heavy DNS load.  Just a thought.
>>>
>>> Dennis
>>>
>>> Stefan Neufeind wrote:
>>>
>>>> Hi,
>>>>
>>>> I've encountered that here nutch is fetching quite a number of URLs from a
>>>> long list (about 25.000). But from time to time nutch is "waiting" for
>>>> 10 seconds or so. Nothing is locked, but system-load is 99,9% then. Is
>>>> nutch writing fetched data or index to disk at that stage? Is there any
>>>> way to optimize this step, e.g. by writing more often and performing
>>>> the
>>>> write in "background" or caching even more in mem instead of
>>>> flushing to disk?


dedup after building indexed? (0.8-dev)

2006-05-22 Thread Stefan Neufeind
Hi,

up to 0.7.x, dedup was done before indexing, right?

And for 0.8-dev I read from Crawl.java that the order to use is
- invertlinks
- index
- dedup
- merge (merging segment indexes)

Is that right? I wonder why indexing is done before removing duplicates.
Could somebody please explain?


Also, am I right that merge is not needed if run on only one node? I
already got a "complete" index from the "index"-phase. Or what is that
about?
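
For reference, the order above would map to roughly these commands (a
sketch with example paths; argument order taken from the tools' usage
output as far as I understand it):

bin/nutch invertlinks crawl/linkdb crawl/segments/*
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes/*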



Regards,
 Stefan


Re: Nutch fetcher "waiting" inbetween fetch

2006-05-21 Thread Stefan Neufeind
Hi Dennis,

thank you for the answer. Hmm, could theoretically be. But to prevent
this the server already does resolving completely on its local machine.
Also I wonder about the CPU-load moving to "system" - I suspected heavy
disk-access or so ... but I don't know how/when the fetcher writes data
to disk etc.



Regards,
 Stefan

Dennis Kubes wrote:
> Is this possibly a dns issue.  We are running a 5M page crawl and are
> seeing very heavy DNS load.  Just a thought.
> 
> Dennis
> 
> Stefan Neufeind wrote:
>> Hi,
>>
>> I've encountered that here nutch is fetching quite a number of URLs from a
>> long list (about 25.000). But from time to time nutch is "waiting" for
>> 10 seconds or so. Nothing is locked, but system-load is 99,9% then. Is
>> nutch writing fetched data or index to disk at that stage? Is there any
>> way to optimize this step, e.g. by writing more often and performing the
>> write in "background" or caching even more in mem instead of flushing to
>> disk?


Nutch fetcher "waiting" inbetween fetch

2006-05-21 Thread Stefan Neufeind
Hi,

I've encountered that here nutch is fetching quite a number of URLs from a
long list (about 25.000). But from time to time nutch is "waiting" for
10 seconds or so. Nothing is locked, but system-load is 99,9% then. Is
nutch writing fetched data or index to disk at that stage? Is there any
way to optimize this step, e.g. by writing more often and performing the
write in "background" or caching even more in mem instead of flushing to
disk?


Regards,
 Stefan