Re: Use and configuration of RegexUrlNormalize
Javier P. L. wrote: > Hi, > > I am using Nutch for news sites crawling, and I have a problem with one of > them that publishes the urls with &amp; instead of &. I discovered the > use of the url normalizer and the regex-normalize.xml configuration > file. Unfortunately I did not find many examples about how to use the > regular expressions and substitutions, so I was trying different > combinations to make the transformation, but it did not work. > > Basically what I want is to convert > > noticia.jsp?CAT=126&amp;TEXTO=10109668 > > into > > noticia.jsp?CAT=126&TEXTO=10109668 > > because otherwise Nutch is not able to crawl those pages. A more basic question to this: How do your URLs with &amp; end up in Nutch? It's okay/right that they are written with &amp; in the HTML source to be "clean", but shouldn't Nutch itself already convert them to & when storing/fetching? Regards, Stefan
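For reference, a rule along these lines in conf/regex-normalize.xml should do that conversion (an untested sketch; note that the ampersands have to be XML-escaped inside the file, and the regex URL normalizer has to be enabled, which is done differently depending on the Nutch version):

  <?xml version="1.0"?>
  <regex-normalize>
    <!-- rewrite HTML-escaped ampersands ("&amp;") in URLs back to a plain "&" -->
    <regex>
      <pattern>&amp;amp;</pattern>
      <substitution>&amp;</substitution>
    </regex>
  </regex-normalize>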
Re: I can not query myplugin in field category:test
Please do share it. I'd appreciate it, and I guess a lot of others as well. And I bet it could even be enhanced by the community. :-) Regards, Stefan Ernesto De Santis wrote: > I did a url-category-indexer. > > It works with a .properties file that map urls writed as regexp and > categories. > example: > > http://www.misite.com/videos/.*=videos > > If it seems useful, I can share it. > > Maybe, it could be better config it in a .xml file. > > Regards, > Ernesto. > > Stefan Neufeind escribió: >> Alvaro Cabrerizo wrote: >> >>> Have you included a node to describe your new searcher filter into >>> plugin.xml? >>> >>> 2006/10/11, xu nutch <[EMAIL PROTECTED]>: >>> >>>> I have a question about myplugin for indexfilter and queryfilter. >>>> Can u Help me ! >>>> - >>>> MoreIndexingFilter.java in add >>>> doc.add(new Field("category", "test", false, true, false)); >>>> - >>>> >>>> -- >>>> >>>> >>>> package org.apache.nutch.searcher.more; >>>> >>>> import org.apache.nutch.searcher.RawFieldQueryFilter; >>>> >>>> /** Handles "category:" query clauses, causing them to search the >>>> field indexed by >>>> * BasicIndexingFilter. */ >>>> public class CategoryQueryFilter extends RawFieldQueryFilter { >>>> public CategoryQueryFilter() { >>>>super("category"); >>>> } >>>> } >>>> --- >>>> --- >>>> >>>> >>>> plugin.includes >>>> nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more) >>>> >>>> >>>> Regular expression naming plugin directory names to >>>> include. Any plugin not matching this expression is excluded. >>>> In any case you need at least include the nutch-extensionpoints >>>> plugin. By >>>> default Nutch includes crawling just HTML and plain text via HTTP, >>>> and basic indexing and search plugins. >>>> >>>> >>>> >>>> >>>> plugin.includes >>>> nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more) >>>> >>>> >>>> Regular expression naming plugin directory names to >>>> include. Any plugin not matching this expression is excluded. >>>> In any case you need at least include the nutch-extensionpoints >>>> plugin. By >>>> default Nutch includes crawling just HTML and plain text via HTTP, >>>> and basic indexing and search plugins. >>>> >>>> >>>> --- >>>> >>>> I use luke to query "category:test" is ok! >>>> but I use tomcat webstie to query "category:test" , >>>> no return result. >>>> >> >> In case you get the search working: >> How do you plan to categorize URLs/sites? I'm looking for a solution >> there, since I didn't yet manage to implement something >> URL-prefix-filter based to map categories to URLs or so. >> >> >> Regards, >> Stefan
Re: I can not query myplugin in field category:test
Alvaro Cabrerizo wrote: > Have you included a node to describe your new searcher filter into > plugin.xml? > > 2006/10/11, xu nutch <[EMAIL PROTECTED]>: >> I have a question about myplugin for indexfilter and queryfilter. >> Can u Help me ! >> - >> MoreIndexingFilter.java in add >> doc.add(new Field("category", "test", false, true, false)); >> - >> >> -- >> >> >> package org.apache.nutch.searcher.more; >> >> import org.apache.nutch.searcher.RawFieldQueryFilter; >> >> /** Handles "category:" query clauses, causing them to search the >> field indexed by >> * BasicIndexingFilter. */ >> public class CategoryQueryFilter extends RawFieldQueryFilter { >> public CategoryQueryFilter() { >>super("category"); >> } >> } >> --- >> --- >> >> >> plugin.includes >> nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more) >> >> Regular expression naming plugin directory names to >> include. Any plugin not matching this expression is excluded. >> In any case you need at least include the nutch-extensionpoints >> plugin. By >> default Nutch includes crawling just HTML and plain text via HTTP, >> and basic indexing and search plugins. >> >> >> >> >> plugin.includes >> nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more) >> >> Regular expression naming plugin directory names to >> include. Any plugin not matching this expression is excluded. >> In any case you need at least include the nutch-extensionpoints >> plugin. By >> default Nutch includes crawling just HTML and plain text via HTTP, >> and basic indexing and search plugins. >> >> >> --- >> >> I use luke to query "category:test" is ok! >> but I use tomcat webstie to query "category:test" , >> no return result. In case you get the search working: How do you plan to categorize URLs/sites? I'm looking for a solution there, since I didn't yet manage to implement something URL-prefix-filter based to map categories to URLs or so. Regards, Stefan
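For anyone hitting the same problem: the new query filter has to be declared in the plugin's plugin.xml against the QueryFilter extension point, with the "fields" parameter naming the field it handles, otherwise Nutch does not know what to do with a "category:" clause at search time. A sketch modeled on the stock query-site/query-more plugins (ids and file names are only examples; compare with a query-* plugin shipped with your Nutch version for the exact attributes):

  <plugin id="query-category" name="Category Query Filter"
          version="1.0.0" provider-name="nutch.org">
    <runtime>
      <library name="query-category.jar">
        <export name="*"/>
      </library>
    </runtime>
    <requires>
      <import plugin="nutch-extensionpoints"/>
    </requires>
    <extension id="org.apache.nutch.searcher.category"
               name="Category Query Filter"
               point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="CategoryQueryFilter"
                      class="org.apache.nutch.searcher.more.CategoryQueryFilter">
        <!-- declares that "category:" clauses are handled by this filter -->
        <parameter name="fields" value="category"/>
      </implementation>
    </extension>
  </plugin>

The plugin directory name then also has to match the plugin.includes expression quoted above.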
Re: Lucene query support in Nutch
Cristina Belderrain wrote: > On 10/9/06, Tomi NA <[EMAIL PROTECTED]> wrote: > >> This is *exactly* what I was thinking. Like Stefan, I believe the >> nutch analyzer is a good foundation and should therefore be extended >> to support the "or" operator, and possibly additional capabilities >> when the need arises. >> >> t.n.a. > > Tomi, why would you extend Nutch's analyzer when Lucene's analyzer, > which does exactly what you want, is already there? From what I understood so far in this thread, the Nutch analyzer/query handling seems to be more targeted and provides additional features regarding distributed search, as well as maybe speed improvements due to its nature etc. (Correct me if I'm wrong.) One idea that has come up was to offer both as alternatives, so you could use Lucene-based queries if you need their features on the one hand but can live with the restrictions on the other. However, from what has been mentioned so far, it seems that Lucene queries by default can only search the document content (is that right?), not e.g. site:www.example.org. Hmm ... PS: Thank you all for the help offered so far in this thread on how to get Lucene queries going. Unfortunately I couldn't make much use of "just simply extend it here and there ..." :-( Regards, Stefan
Re: Lucene query support in Nutch
Björn Wilmsmann wrote: > > Am 07.10.2006 um 17:40 schrieb Cristina Belderrain: > >> Let me remind you that all this must be done just to provide something >> that's already there: Nutch is built on top of Lucene, after all. If >> it's hard to understand why Lucene's capabilities were simply >> neutralized in Nutch, it's even harder to figure out why no choice was >> left to users by means of some configuration file. > > I think this issue is rooted in the underlying philosophy of Nutch: > Nutch was designed with the idea of a possible Google(and the > likes)-sized crawler and indexer in mind. Regular expressions and > wildcard queries do not seem to fit into this philosophy, as such > queries would be way less efficient on a huge data set than simple > boolean queries. > > Nevertheless, I agree that there should be an option to choose the > Lucene query engine instead of the Nutch flavour one because Nutch has > been proven to be equally suitable for areas which do not require as > efficient queries (like intranet crawling for instance) as an all-out > web indexing application. Hi, if it's not the full feature-set, maybe most people could live with it. But basic boolean queries I think were the root for this topic. Is there an "easier" way to allow this in Nutch as well instead of throwing quite a bit away and using the Lucene-syntax? As has just been pointed out: It seems quite a few things need to be "changed" to use Lucene-search instead of a Nutch-search. I don't think that it's needed in most cases. But I see several reasons where a boolean query would make sense. (Currently I do fetch up to 10.000 or so results using opensearch and filter them in a script myself, since no "AND (site:... or site:...)" is yet possible.) Regards, Stefan
Re: Lucene query support in Nutch
Hi, yes, I guess having the full strength of Lucene-based queries would be nice. That would as well solve the boolean queries-question I had a few days ago :-) Ravi, doesn't Lucene also allow querying of other fields? Is there any possibility to add that feature to your proposal? In general: What is the advantage of the current nutch-parser instead of going with the Lucene-based one? Regards, Stefan Ravi Chintakunta wrote: > Hi Cristina, > > You can achieve this by modifying the IndexSearcher to take the query > String as an argument and then use > > org.apache.lucene.queryParser.QueryParser's parse(String ) method to > parse the query string. The modified method in IndexSearcher would > look as below: > > public Hits search(String queryString, int numHits, > String dedupField, String sortField, boolean > reverse) throws IOException { > >org.apache.lucene.queryParser.QueryParser parser = new > org.apache.lucene.queryParser.QueryParser("content", new > org.apache.lucene.analysis.standard.StandardAnalyzer()); > > org.apache.lucene.search.Query luceneQuery = parser.parse(queryString); > > return translateHits > (optimizer.optimize(luceneQuery, luceneSearcher, numHits, > sortField, reverse), > dedupField, sortField); > } > > For this you have to modify the code in search.jsp and NutchBean too, > so that you are passing on the raw query string to IndexSearcher. > > Note that with this approach, you are limiting the search to the content > field. > > > - Ravi Chintakunta > > > > On 10/4/06, Cristina Belderrain <[EMAIL PROTECTED]> wrote: >> Hello, >> >> we all know that Lucene supports, among others, boolean queries. Even >> though Nutch is built on Lucene, boolean clauses are removed by Nutch >> filters so boolean queries end up as "flat" queries where terms are >> implicitly connected by an OR operator, as far as I can see. >> >> Is there any simple way to turn off the filtering so a boolean query >> remains as such after it is submitted to Nutch? >> >> Just in case a simple way doesn't exist, Ravi Chintakunta suggests the >> following workaround: >> >> "We have to modify the analyzer and add more plugins to Nutch >> to use the Lucene's query syntax. Or we have to directly use >> Lucene's Query Parser. I tried the second approach by modifying >> org.apache.nutch.searcher.IndexSearcher and that seems to work." >> >> Can anyone please elaborate on what Ravi actually means by "modifying >> org.apache.nutch.searcher.IndexSearcher"? Which methods are supposed >> to be modified and how? >> >> It would be really nice to know how to do this. I believe many other >> Nutch users would also benefit from an answer to this question. >> >> Thanks so much, >> >> Cristina
Searching with "and" and "or"?
Hi, I'm trying to build a search like searchword AND (site:www.example.com OR site:www.foobar.org) But no syntax variant I tried worked. Is it possible somehow? Regards, Stefan
Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]
Andrzej Bialecki wrote: > Stefan Neufeind wrote: >> Sami Siren wrote: >> >>> Stefan Neufeind wrote: >>> >>>> Sami Siren wrote: >>>> >>>>> redirecting to nutch-user... >>>>> >>>>>> What I currently have is that max. 2 matches are shown per website - >>>>>> but >>>>>> that also from the summary-website only 2 matches are shown. >>>>>> Either I'd >>>>>> need to be able to show only 2 matches per website but _all_ matches >>>>>> from the summary-website (would be okay in this case) or give >>>>>> website 1 >>>>>> to 4 individual "IDs per website" and also assign each URL from the >>>>>> summary-website the corresponding ID of the website it belongs to. >>>>>> >>>>> You can add whatever (meta-)data to index with indexing filter. You >>>>> could >>>>> for example assign tag "A" to site A, tag "B" to B etc... >>>>> then assign unique tags for pages from summary site. >>>>> >>>>> In searching phase you then use that new field as dedupfield >>>>> (instead of >>>>> site) >>>>> >>>>> This should give you max (for example 2) hits per website and >>>>> unlimited >>>>> hits >>>>> from summary web site. >>>>> >>>>> Does that fulfill your requirements? >>>>> >>>> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With >>>> what "filter"? >>>> >>> Write a plugin that provides implementation of >>> http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html >> >> That was (part of) my question - how to do that "cleanly", and if >> somebody could give a hint. I'm not sure what would be the elegant way >> of having a "match URL against ... and set tags ABC" pattern file, how to >> use a hash-map or something for that, and how to do it in Java. (Sorry, >> I'm not as familiar with Java as with other languages, nor with >> nutch-internals). > > If it's a relatively short list of urls (let's say less than 50,000 > entries) then you can use org.apache.nutch.util.PrefixStringMatcher, > which builds a compact trie structure. I would then strongly advise you > to keep just the urls (or whatever it is that you need to match) in that > structure, and all other data in an external DB or a special-purpose > Lucene index. You can implement this as an indexing plugin - if the > pattern matches, then you get additional metadata from some external > source, and you add additional fields to the index that contain this data. Hmm, I'm still not sure how this would work. (Sorry for that!) I know that for every URL in my index some prefix matches; I would just need to find out which one. E.g. http://www.example.com/test1/ as the prefix and http://www.example.com/test1/page1.htm as the page URL. Now I would want to do a lookup and, based on the prefix, assign the ID "test1". Am I right to conclude that in this case I could leave out the PrefixStringMatcher, since I know that some string will match for all the URLs? Do you maybe have a small example for a plugin to match against an external database? PS: Your help is very much appreciated. Sorry for asking dumb questions :-) Regards, Stefan
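To make the idea a bit more concrete, here is a rough sketch of the prefix-to-tag lookup such an indexing plugin could perform. All names are hypothetical, the PrefixStringMatcher method names are from memory, and the exact IndexingFilter.filter() signature differs between Nutch 0.7 and 0.8, so only the lookup logic is shown; it would be called from your own filter implementation:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.nutch.util.PrefixStringMatcher;

  /** Maps URL prefixes to site tags and adds the tag as an extra index field. */
  public class PrefixTagger {
    private final Map prefixToTag = new HashMap();   // e.g. "http://www.example.com/test1/" -> "test1"
    private final PrefixStringMatcher matcher;

    public PrefixTagger(Map prefixToTag) {
      this.prefixToTag.putAll(prefixToTag);
      String[] prefixes = (String[]) prefixToTag.keySet().toArray(new String[0]);
      this.matcher = new PrefixStringMatcher(prefixes);
    }

    /** Call this from IndexingFilter.filter() for every document. */
    public void addTag(Document doc, String url) {
      String prefix = matcher.longestMatch(url);     // null if no prefix matches
      if (prefix != null) {
        String tag = (String) prefixToTag.get(prefix);
        // stored, indexed, untokenized - so it can later serve as the dedup field
        doc.add(new Field("sitetag", tag, true, true, false));
      }
    }
  }

The external-DB variant would simply replace the HashMap lookup with a database or secondary-index lookup keyed by the matched prefix.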
Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]
Sami Siren wrote: > Stefan Neufeind wrote: >> Sami Siren wrote: >> >>> redirecting to nutch-user... >>> >>> >>>> What I currently have is that max. 2 matches are shown per website - >>>> but >>>> that also from the summary-website only 2 matches are shown. Either I'd >>>> need to be able to show only 2 matches per website but _all_ matches >>>> from the summary-website (would be okay in this case) or give website 1 >>>> to 4 individual "IDs per website" and also assign each URL from the >>>> summary-website the corresponding ID of the website it belongs to. >>> >>> You can add whatever (meta-)data to index with indexing filter. You >>> could >>> for example assign tag "A" to site A, tag "B" to B etc... >>> then assign unique tags for pages from summary site. >>> >>> In searching phase you then use that new field as dedupfield (instead of >>> site) >>> >>> This should give you max (for example 2) hits per website and unlimited >>> hits >>> from summary web site. >>> >>> Does that fullfill your requirements? >> >> >> That would perfectly fit, yes. But how do I "tag" the pages/URLs? With >> what "filter"? >> > > Write a plugin that provides implementation of > http://lucene.apache.org/nutch/nutch-nightly/docs/api/org/apache/nutch/indexer/IndexingFilter.html That was (part of) my question - how to do that "cleanly", and if somebody could give a hint. I'm not sure what would be the elegant way of having a "match URL against ... and set tags ABC"-patternfile, how to use a hash-map or something for that and how to do it in Java. (Sorry, I'm not that familiar with Java as with other languages, and neither with nutch-internals). Stefan
Re: [Fwd: [Fwd: Re: [jira] Commented: (NUTCH-271) Meta-data per URL/site/section]]
Sami Siren wrote: > redirecting to nutch-user... > >> What I currently have is that max. 2 matches are shown per website - but >> that also from the summary-website only 2 matches are shown. Either I'd >> need to be able to show only 2 matches per website but _all_ matches >> from the summary-website (would be okay in this case) or give website 1 >> to 4 individual "IDs per website" and also assign each URL from the >> summary-website the corresponding ID of the website it belongs to. > > You can add whatever (meta-)data to index with indexing filter. You could > for example assign tag "A" to site A, tag "B" to B etc... > then assign unique tags for pages from summary site. > > In searching phase you then use that new field as dedupfield (instead of > site) > > This should give you max (for example 2) hits per website and unlimited > hits > from summary web site. > > Does that fullfill your requirements? That would perfectly fit, yes. But how do I "tag" the pages/URLs? With what "filter"? Stefan
Re: Please Help - Patch not working - external links still crawled
Ronny wrote: > Hi all, > > after installing the patch http://issues.apache.org/jira/browse/NUTCH-173 and > doing a whole-web crawl, external links are still being crawled. > > I modified the nutch-site.xml as follows: > > <property> > <name>crawl.ignore.external.links</name> > <value>true</value> > <description>not crawling external links</description> > </property> > > What did I do wrong? You did not rebuild nutch, did you? Regards, Stefan
Re: Please Help - Patch install
You'd use the "patch"-utility, which is generally available on every Linux-installation I know. It's nothing Java-specific or so. Also various development-IDEs feature patch-/merge-functionality as well. Regards, Stefan Ronny wrote: > Hi Stefan, > > which utility I need and after installing how do I install the patch? > > Sorry for this questions but I am a beginner in Java and nutch... > > Thanks for your help > Ronny > - Original Message - From: "Stefan Neufeind" > <[EMAIL PROTECTED]> > To: > Sent: Tuesday, July 25, 2006 12:14 PM > Subject: Re: Please Help - Patch install > > >> You should use the patch-utility to integrate the patch, not be doing it >> by hand. >> >> That line you mention is sort of "meta-data" and interpreted by the >> patch-utility. It's nothing you need to add to the sourcefiles! >> >> >> Good luck, >> Stefan >> >> Ronny wrote: >>> Hello, >>> >>> thanks for your reply. Now I tried it and it is not working. >>> >>> I just put the lines with + into the source code. The lines are as >>> follows: >>> >>> +public static final boolean CRAWL_IGNORE_EXTERNAL_LINKS = >>> +NutchConf.get().getBoolean("crawl.ignore.external.links", >>> false); >>> >>> and >>> >>> +if (!internal && CRAWL_IGNORE_EXTERNAL_LINKS) { >>> +continue; // External links are forbidden : skip it ! >>> + } >>> >>> Of course they are on the right place in the script. But I don´t know >>> what to do with this: @@ -198,6 +200,9 @@ . >>> >>> Please help me >>> Kind regards >>> Ronny >>> >>> >>> >>> >>> - Original Message - From: "Philippe EUGENE" >>> <[EMAIL PROTECTED]> >>> To: >>> Sent: Monday, July 24, 2006 10:21 AM >>> Subject: Re: Please Help - Patch install >>> >>> >>>> Ronny a écrit : >>>>> Hello List, >>>>> >>>>> I have a patch for Nutch >>>>> http://issues.apache.org/jira/browse/NUTCH-173 and I want to install >>>>> it. But I don´t know how to do that. >>>>> Which file I have to edit that I can install and run the patch. I am >>>>> working currently with nutch 0.7.2 >>>>> >>>>> Thanks for your help. >>>>> >>>>> Kind regards >>>>> Ronny >>>>> >>>> Hi, >>>> You must use a tools like svn to apply this patch on your source code. >>>> It seems working. >>>> If you are not familar with this, you can edit manualy in your IDE >>>> this file : >>>> tools/UpdateDatabaseTool.java >>>> The patch.txt file indicate witch line you must edit or replace. >>>> After, you must add this option crawl.ignore.external.links in your >>>> configuration file. >>>> -- >>>> Philippr
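For completeness, applying such a patch from the top of the Nutch source tree and rebuilding usually boils down to something like this (the -p level depends on how the diff was created, so try -p0 first; patch.txt is the file attached to the JIRA issue):

  cd nutch-0.7.2
  patch -p0 < patch.txt      # add --dry-run first if you want to check before touching files
  ant                        # rebuild so the patched classes end up in the jars actually used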
Re: Please Help - Patch install
You should use the patch-utility to integrate the patch, not do it by hand. That line you mention is sort of "meta-data" and is interpreted by the patch-utility. It's nothing you need to add to the source files! Good luck, Stefan Ronny wrote: > Hello, > > thanks for your reply. Now I tried it and it is not working. > > I just put the lines with + into the source code. The lines are as > follows: > > +public static final boolean CRAWL_IGNORE_EXTERNAL_LINKS = > +NutchConf.get().getBoolean("crawl.ignore.external.links", false); > > and > > +if (!internal && CRAWL_IGNORE_EXTERNAL_LINKS) { > +continue; // External links are forbidden : skip it ! > + } > > Of course they are in the right place in the source. But I don´t know > what to do with this: @@ -198,6 +200,9 @@ . > > Please help me > Kind regards > Ronny > > > > > - Original Message - From: "Philippe EUGENE" > <[EMAIL PROTECTED]> > To: > Sent: Monday, July 24, 2006 10:21 AM > Subject: Re: Please Help - Patch install > > >> Ronny wrote: >>> Hello List, >>> >>> I have a patch for Nutch >>> http://issues.apache.org/jira/browse/NUTCH-173 and I want to install >>> it. But I don´t know how to do that. >>> Which file do I have to edit so that I can install and run the patch? I am >>> currently working with nutch 0.7.2 >>> >>> Thanks for your help. >>> >>> Kind regards >>> Ronny >>> >> Hi, >> You must use a tool like svn to apply this patch to your source code. >> It seems to work. >> If you are not familiar with this, you can edit this file manually in >> your IDE: >> tools/UpdateDatabaseTool.java >> The patch.txt file indicates which lines you must edit or replace. >> Afterwards, you must add the option crawl.ignore.external.links to your >> configuration file. >> -- >> Philippe
Re: any success with php-java-bridge and Nutch?
Chris Stephens wrote: Has anyone had success getting Nutch to work with the php-java-bridge? I've been playing around with this for about a day and a half and have not been able to get past the error: java stack trace: java.lang.Exception: CreateInstance failed: new org.apache.nutch.searcher.NutchBean. Cause: java.lang.NullPointerException at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96) at org.apache.nutch.searcher.NutchBean.(NutchBean.java:84) at org.apache.nutch.searcher.NutchBean.(NutchBean.java:73) at java.lang.reflect.Constructor.newInstance(libgcj.so.7) at php.java.bridge.JavaBridge.CreateObject(JavaBridge.java:547) at php.java.bridge.Request.handleRequest(Request.java:503) at php.java.bridge.Request.handleRequests(Request.java:533) at php.java.bridge.JavaBridge.run(JavaBridge.java:192) at php.java.bridge.BaseThreadPool$Delegate.run(BaseThreadPool.java:37) Caused by: java.lang.NullPointerException at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96) ...8 more I do have a proper searcher.dir entry in my nutch-site.xml, and my index does have data. My class path currently looks like: :/usr/java/jdk1.5.0_06:/usr/java/jdk1.5.0_06/lib:/usr/local/nutch/lib:/usr/local/nutch:/usr/local/nutch/conf/nutch-default.xml:/usr/local/nutch/conf/nutch-site.xml I would appreciate any reports on nutch working with php-java-bridge and information about stability. Do you really need a real php-java-bridge for that? We're using the OpenSearch XML-output from nutch in a php-application and locking down access to nutch to localhost only. Works fine ... (Though if someone gets the php-java-bridge to work that would be cool! :-) Regards, Stefan
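For reference, the OpenSearch output mentioned above is plain XML over HTTP, so it can be fetched from PHP without any bridge; a request looks roughly like this (host, port, webapp path and the exact parameter names should be checked against your deployment):

  # ask the Nutch webapp for the first 10 hits as OpenSearch/RSS XML
  curl "http://localhost:8080/nutch/opensearch?query=nutch&hitsPerPage=10"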
Re: Do nutch allow an advanced search?
Scott McCammon wrote: > The index-more plugin indexes each document's last modified date and is > searchable via a range like: "date:20060521-20060621" Note that a date > search does not work by itself. At least one keyword or phrase is required. Hi Scott, requiring a keyword/phrase has been mentioned at several places before already. Is there a technical background for it, or could that limitation maybe be removed (and should we file a JIRA for that)? Regards, Stefan > John john wrote: >> Hello >> >> I'm new in the nutch world and i'm wondering whether it's possible to >> search with date range? or specify a date and then nutch retrieves >> pages updated after this date? >> >> thanks
Re: problem with skiped urls
[EMAIL PROTECTED] wrote: > hi, > i'm trying to run nutch in our clinical center and i have a little problem. > we have a few intranet servers and i want nutch to skip a few > directories. > for example: > > http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/ > > i put these urls in the crawl-urlfilter.txt. for example: > > -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus > > but nothing happens. nutch doesn't skip these urls. and i don't know why... > > :( can anyone help me? > > i'm crawling with this command: > > bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log & > > i'm using the release 0.7.1 Hi David, do you have the regex-urlfilter plugin included in your crawl config file or nutch-site config file? I suspect that the plugin might not yet be loaded. Also, do you have another "allow all URLs" line above the one you mentioned, maybe? I don't think the ([a-z0-9]*\.)* should lead to problems (it is * and not +, so I guess that should be fine). But if your URL does not have anything in front of sapdoku, maybe try dropping that part. Good luck, Stefan
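If the pattern itself is fine, ordering is the usual culprit: the URL filter walks the rules from top to bottom and the first matching rule decides, so the skip rule has to appear before any broad accept rule, roughly like this (hostnames taken from the example above):

  # skip the documentation directory first
  -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
  # then accept the rest of the intranet hosts
  +^http://([a-z0-9]*\.)*ukl.uni-freiburg.de/
  # skip everything else
  -.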
Re: Restricting query to a domain
Bogdan Kecman wrote: > Use plugin "query-site". It supports the site field. > Also if you look at the the > > NutchBean.search(query, start + hitsToRetrieve, hitsPerSite, "site", sort, > reverse); > > You will notice that you can get results grouped > by site field, actually to get only hitsPerSite > number of results per site. > > Now, this works with 0.7.1, donno about 0.7.2 and 0.8 as > I had no time to check them out, but there should not be > much difference > > Pay notice that this is a filter, so query like > > findme andme site:"www.aaa.com" > > Will limit resultset to www.aaa.com only but query > > site:"www.aaa.com" > > Is empty query and will not return anything. Why won't that return anything? And is grouping with "brackets" somehow possible? I know the thing mentioned below does not work - but would be nice if it could, wouldn't it? abc && (site:"www.aaa.com" || site:"www.bbb.com") Regards, Stefan >> -Original Message- >> From: Bill de hÓra [mailto:[EMAIL PROTECTED] >> Sent: Sunday, June 18, 2006 6:33 PM >> To: nutch-user@lucene.apache.org >> Subject: Restricting query to a domain >> >> Hi, >> >> I'll need to provide a search that allow a person to restrict >> search to a specific domain (and probably a group of them). >> Afaict that's not supported (apologies if I'm wrong). Before >> I go rolling my own are they plans to support anything like "site:"? >> >> cheers >> Bill
Re: Removing or reindexing a URL?
Andrzej Bialecki wrote: Stefan Neufeind wrote: How about making this a commandline-option to inject? Could you create an improvement-patch? FWIW, a patch with similar functionality is in my work-in-progress queue, however it's for 0.8 - there is no point in backporting my patch because the architecture is very different... Here's a snippet: [...] I'm fine with 0.8(-dev). I have been using it successfully in production myself now :-) Stefan
Re: Removing or reindexing a URL?
Hi, it just came to my mind, just to make sure (don't have the code at hand): updatedb uses a different portion of code, right? Otherwise we might re-crawl URLs we just fetched because links are found to URLs we just fetched :-) Regards, Stefan Howie Wang wrote: > If you don't mind changing the source a little, I would change > the org.apache.nutch.db.WebDBInjector.java file so that > when you try to inject a url that is already there, it will update > it's next fetch date so that it will get fetched during the next > crawl. > > In WebDBInjector.java in the addPage method, change: > > dbWriter.addPageIfNotPresent(page); > > to: > > dbWriter.addPageWithScore(page); > > Every day you can take your list of changed/deleted urls and do: > >bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt > > Then do your crawl as usual. The updated pages will be refetched. > The deleted pages will attempt to be refetched, but will error out, > and be removed from the index. > > You could also set your db.default.fetch.interval parameter to > longer than 30 days if you are sure you know what pages are changing. > > Howie > >> With my tests, I index ~60k documents. This process takes several >> hours. I >> plan on having about a half million documents index eventually, and I >> suspect it'll take more than 24 hours to recrawl and reindex with my >> hardware, so I'm concerned. >> >> I *know* which documents I want to reindex or remove. It's going to be a >> very small subset compared to the whole group (I imagine around 1000 >> pages). That's why I desperately want to be able to give Nutch a list of >> documents. >> >> Ben >> >> On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote: >>> >>> Just recrawl and reindex every day. That was the simple answer. >>> The more complex answer is you need to do write custom code that >>> deletes documents from your index and crawld. >>> If you not want to complete learn the internals of nutch, just >>> recrawl and reindex. :) >>> >>> Stefan >>> Am 06.06.2006 um 19:42 schrieb Benjamin Higgins: >>> >>> > Hello, >>> > >>> > I'm trying to get Nutch suitable to use for our (extensive) >>> > intranet. One >>> > problem I'm trying to solve is how best to tell Nutch to either >>> > reindex or >>> > remove a URL from the index. I have a lot of pages that get >>> > changed, added >>> > and removed daily, and I'd prefer to have the changes reflected in >>> > Nutch's >>> > index immediately. >>> > >>> > I am able to generate a list of URLs that have changed or have been >>> > removed, >>> > so I definately do not need to reindex everything, I just need a >>> > way to pass >>> > this list on to Nutch. >>> > >>> > How can I do this? >>> > >>> > Ben
Re: Removing or reindexing a URL?
How about making this a commandline-option to inject? Could you create an improvement-patch? Regards, Stefan Howie Wang wrote: If you don't mind changing the source a little, I would change the org.apache.nutch.db.WebDBInjector.java file so that when you try to inject a url that is already there, it will update it's next fetch date so that it will get fetched during the next crawl. In WebDBInjector.java in the addPage method, change: dbWriter.addPageIfNotPresent(page); to: dbWriter.addPageWithScore(page); Every day you can take your list of changed/deleted urls and do: bin/nutch inject mynutchdb/db -urlfile my_changed_urls.txt Then do your crawl as usual. The updated pages will be refetched. The deleted pages will attempt to be refetched, but will error out, and be removed from the index. You could also set your db.default.fetch.interval parameter to longer than 30 days if you are sure you know what pages are changing. Howie With my tests, I index ~60k documents. This process takes several hours. I plan on having about a half million documents index eventually, and I suspect it'll take more than 24 hours to recrawl and reindex with my hardware, so I'm concerned. I *know* which documents I want to reindex or remove. It's going to be a very small subset compared to the whole group (I imagine around 1000 pages). That's why I desperately want to be able to give Nutch a list of documents. Ben On 6/8/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote: Just recrawl and reindex every day. That was the simple answer. The more complex answer is you need to do write custom code that deletes documents from your index and crawld. If you not want to complete learn the internals of nutch, just recrawl and reindex. :) Stefan Am 06.06.2006 um 19:42 schrieb Benjamin Higgins: > Hello, > > I'm trying to get Nutch suitable to use for our (extensive) > intranet. One > problem I'm trying to solve is how best to tell Nutch to either > reindex or > remove a URL from the index. I have a lot of pages that get > changed, added > and removed daily, and I'd prefer to have the changes reflected in > Nutch's > index immediately. > > I am able to generate a list of URLs that have changed or have been > removed, > so I definately do not need to reindex everything, I just need a > way to pass > this list on to Nutch. > > How can I do this? > > Ben
Re: intranet crawl issue
Matthew Holt wrote: Just fyi, both of the sites I am trying to crawl are under the same domain. The sub-domains just differ. It works for one; for the other it only appears to fetch 6 or so pages and then doesn't fetch any more. Do you need any more information to solve the problem? I've tried everything and haven't had any luck.. Thanks. What does your crawl-urlfilter.txt look like? Stefan
Re: Filtering webpages based on words / Fetch progress
Lukas Vlcek wrote: Hi again, On 6/8/06, Mehdi Hemani <[EMAIL PROTECTED]> wrote: 1. I want to filter out webpages based on a list of words. I have tried filtering webpages based on url, but how to do it based on words? As for this question check the following link: http://wiki.apache.org/nutch/CommandLineOptions As far as I know this prune tool should be available for nutch 0.8 as well (at least I can see the class to be included in source code so you should be able to call it). Pruning with 0.8-dev works fine here. You give it a file with your "queries" and all matching pages will be pruned from the index. There is also a dryrun-option available - use that when building your queries :-) Note that documents are only pruned from the index, not from segments or the crawldb! So upon re-indexing or running another crawler-round be sure to apply pruning again. Stefan
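For 0.8-dev the pruning call looks roughly like this (class name and options quoted from memory, so run the tool without arguments to see the exact usage for your build):

  # dry run: reports what would be removed but leaves the index untouched
  bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/index -queries prune-queries.txt -dryrun
  # the same call without -dryrun actually deletes the matching documents
  bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/index -queries prune-queries.txt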
Re: Recrawling question
Oh sorry, I didn't look up the script again from your earlier mail. Hmm, I guess you can live fine without the invertlinks (if I'm right). Are you sure that your indexing works fine? I think if an index exists nutch complains. See if there is any error with indexing. Also maybe try to delete your current index before indexing again. Still doesn't work? Regards, Stefan Matthew Holt wrote: > Sorry to be asking so many questions.. Below is the current script I'm > using. It's indexing the segments.. so do I use invertlinks directly > after the fetch? I'm kind of confused.. thanks. > matt [...] > --- > > Stefan Neufeind wrote: > >> You miss actually indexing the pages :-) This is done inside the >> "crawl"-command which does everything in one. After you fetched >> everything use: >> >> nutch invertlinks ... >> nutch index ... >> >> Hope that helps. Otherwise let me know and I'll dig out the complete >> commandlines for you. >> >> >> Regards, >> Stefan >> >> Matthew Holt wrote: >> >> >>> Just FYI.. After I do the recrawl, I do stop and start tomcat, and still >>> the newly created page can not be found. >>> >>> Matthew Holt wrote: >>> >>> >>>> The recrawl worked this time, and I recrawled the entire db using the >>>> -adddays argument (in my case ./recrawl crawl 10 31). However, it >>>> didn't find a newly created page. >>>> >>>> If I delete the database and do the initial crawl over again, the new >>>> page is found. Any idea what I'm doing wrong or why it isn't finding >>>> it? >>>> >>>> Thanks! >>>> Matt >>>> >>>> Matthew Holt wrote: >>>> >>>> >>>>> Stefan, >>>>> Thanks a bunch! I see what you mean.. >>>>> matt >>>>> >>>>> Stefan Neufeind wrote: >>>>> >>>>> >>>>>> Matthew Holt wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Hi all, >>>>>>> I have already successfuly indexed all the files on my domain only >>>>>>> (as >>>>>>> specified in the conf/crawl-urlfilter.txt file). >>>>>>> >>>>>>> Now when I use the below script (./recrawl crawl 10 31) to >>>>>>> recrawl the >>>>>>> domain, it begins indexing pages off of my domain (such as >>>>>>> wikipedia, >>>>>>> etc). How do I prevent this? Thanks! >>>>>>> >>>>>>> >>>>>> >>>>>> Hi Matt, >>>>>> >>>>>> have a look at regex-urlfilter. "crawl" is special in some ways. >>>>>> Actually it's "shortcut" for several steps. And it has a special >>>>>> urlfilter-file. But if you do it in several steps that >>>>>> urlfilter-file is >>>>>> no longer used.
Re: Multiple indexes on a single server instance.
Sounds like others might have use for that as well. Can you provide a clean patchset, maybe? How about a "multi-index" plugin which parses an xml-file to find the paths of the allowed indexes, e.g. mapping a name like "index1" to /data/something/index1 and "index2" to /data/somethingelse? Regards, Stefan Ravi Chintakunta wrote: > See my thread > > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03014.html > > where I have modified NutchBean to dynamically pick up the indexes that > have to be searched. The web page has checkboxes for each index, and > thus these indexes can be searched in any combination. > > - Ravi Chintakunta > > > > On 5/31/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: >> sudhendra seshachala wrote: >> > Yes. That is what I am trying. But for some reason it is not working.. >> > Do these fields have to be lower case only? >> > >> >> >> Preferably. If you use the default NutchDocumentAnalyzer it will >> lowercase field names, so you won't get any match.
Re: Recrawling question
You miss actually indexing the pages :-) This is done inside the "crawl"-command which does everything in one. After you fetched everything use: nutch invertlinks ... nutch index ... Hope that helps. Otherwise let me know and I'll dig out the complete commandlines for you. Regards, Stefan Matthew Holt wrote: > Just FYI.. After I do the recrawl, I do stop and start tomcat, and still > the newly created page can not be found. > > Matthew Holt wrote: > >> The recrawl worked this time, and I recrawled the entire db using the >> -adddays argument (in my case ./recrawl crawl 10 31). However, it >> didn't find a newly created page. >> >> If I delete the database and do the initial crawl over again, the new >> page is found. Any idea what I'm doing wrong or why it isn't finding it? >> >> Thanks! >> Matt >> >> Matthew Holt wrote: >> >>> Stefan, >>> Thanks a bunch! I see what you mean.. >>> matt >>> >>> Stefan Neufeind wrote: >>> >>>> Matthew Holt wrote: >>>> >>>> >>>>> Hi all, >>>>> I have already successfuly indexed all the files on my domain only >>>>> (as >>>>> specified in the conf/crawl-urlfilter.txt file). >>>>> >>>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the >>>>> domain, it begins indexing pages off of my domain (such as wikipedia, >>>>> etc). How do I prevent this? Thanks! >>>>> >>>> >>>> >>>> >>>> Hi Matt, >>>> >>>> have a look at regex-urlfilter. "crawl" is special in some ways. >>>> Actually it's "shortcut" for several steps. And it has a special >>>> urlfilter-file. But if you do it in several steps that >>>> urlfilter-file is >>>> no longer used.
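In case it helps, the step-by-step sequence for 0.8-dev looks roughly like this (directory names are only examples, and the old index should be removed first because indexing refuses to write into an existing one):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  rm -rf crawl/indexes
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  bin/nutch dedup crawl/indexes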
Re: Recrawling question
Matthew Holt wrote: > Hi all, > I have already successfuly indexed all the files on my domain only (as > specified in the conf/crawl-urlfilter.txt file). > > Now when I use the below script (./recrawl crawl 10 31) to recrawl the > domain, it begins indexing pages off of my domain (such as wikipedia, > etc). How do I prevent this? Thanks! Hi Matt, have a look at regex-urlfilter. "crawl" is special in some ways. Actually it's "shortcut" for several steps. And it has a special urlfilter-file. But if you do it in several steps that urlfilter-file is no longer used. Regards, Stefan
Re: Intranet Crawling
Just use a depth of 10 or whatever. If there are no more pages to crawl one depth more or less does no harm. For normal websites anything in the range from 5 to 10 for depth imho should be reasonable. topN: This allows you to work on only the highest ranked URLs not yet fetched. It functions as a max. pages limit per each run (depth). Regards, Stefan Matthew Holt wrote: > Ok thanks.. as far as crawling the entire subdomain.. what exact command > would I use? > > Because depth says how many pages deep to go.. is there anyway to hit > every single page, without specifying depth? Or should I just say > depth=10? Also, topN is no longer used, correct? > > Stefan Neufeind wrote: > >> Matthew Holt wrote: >> >> >>> Question, >>> I'm trying to index a subdomain of my intranet. How do I make it >>> index the entire subdomain, but not index any pages off of the >>> subdomain? Thanks! >>> >> >> Have a look at crawl-urlfilter.txt in the conf/ directory. >> >> # accept hosts in MY.DOMAIN.NAME >> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ >> >> # skip everything else >> -. >> >> >> Regards, >> Stefan
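So a one-shot run over the seed list could look like this (the numbers are only reasonable defaults; urls/ is the directory with the seed file):

  # depth bounds the number of generate/fetch rounds, topN caps the URLs fetched per round
  bin/nutch crawl urls -dir crawl.intranet -depth 10 -topN 1000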
Re: Intranet Crawling
Matthew Holt wrote: > Question, >I'm trying to index a subdomain of my intranet. How do I make it > index the entire subdomain, but not index any pages off of the > subdomain? Thanks! Have a look at crawl-urlfilter.txt in the conf/ directory. # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ # skip everything else -. Regards, Stefan
Re: Sorting results by "url"
Marco Pereira wrote: > Hi, > > Please, everybody. > > I'm indexing a website that creates new scripts with fresh news content > almost every hour. > The urls look like this: http://.website.com/1.php > http://.website.com/2.php http://.website.com/3.php etc... > > Is there a way to modify the results page search.jsp so that it can show > 1.php first, then 2.php, then 3.php ... I mean, is there a way to sort > results > by url? For the OpenSearch-interface (RSS-interface) you can supply &sort=url - or also combine that with &reverse=true if you need it the other way round. Please note however that those are sorted lexically. In case you want them to be ordered by fetch-date you can also use &sort=date. Hope that helps, Stefan
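An example request against the OpenSearch servlet (host, port and webapp path are placeholders):

  # lexical sort by URL; add &reverse=true to flip the order
  http://localhost:8080/nutch/opensearch?query=news&sort=url
  # or order by fetch date instead
  http://localhost:8080/nutch/opensearch?query=news&sort=date&reverse=true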
Re: getting exact number of matches
I know it's possible to switch it off. But I need it, and the question was how to get the exact number of hits after "grouping". The unclean workaround was the only thing I did find yet: - one hit per page - going to page 9 - see where we end up - cache that number Works but is ugly :-) Stefan Stefan Groschupf wrote: > I see you mean grouping by host. > Yes that works different and is difficult. > If you like you can switch off grouping by host. > Stefan > > > Am 31.05.2006 um 00:10 schrieb Stefan Neufeind: > >> Hi Stefan, >> >> I didn't mean duplicate in the sense of "two times the same result" - >> but in the sense of "show only XX results per website", e.g. only to >> shoow max two pages of a website that might match. And you can't dedup >> that before the search (runtime) because you don't know what was >> actually searched. I'm refering to the hitsPerSite-parameter of the >> webinterface - while in the source it's called a bit more general (there >> are variables like dedupField etc.). >> >> >> Regards, >> Stefan >> >> Stefan Groschupf wrote: >>> Hi, >>> why not dedub your complete index before and not until runtime? >>> There is a dedub tool for that. >>> >>> Stefan >>> >>> Am 29.05.2006 um 21:20 schrieb Stefan Neufeind: >>> >>>> Hi Eugen, >>>> >>>> what I've found (and if I'm right) is that the page-calculation is done >>>> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_ >>>> results when you only need the first page, I guess currently this is >>>> not >>>> done at the moment. However, since I also needed the exact number, I >>>> did >>>> find out the "dirty hack" at least. That helps for the moment. >>>> But as it might take quite a while to find out the exact number of >>>> pages >>>> I suggest that e.g. you compose a "hash" or the words searched for, and >>>> maybe to be sure the number of non-dedupped searchresults, so you don't >>>> have to search the exact number again and again when moving between >>>> pages. >>>> >>>> >>>> Hope that helps, >>>> Stefan >>>> >>>> Eugen Kochuev wrote: >>>>> >>>>> And did you manage to locate the place where the filtering on per >>>>> site basis is done? Is it possible to tweak nutch to make it telling >>>>> the exact number of pages after filtering or is there a problem? >>>>> >>>>>> I've got a pending nutch-issue on this >>>>>> http://issues.apache.org/jira/browse/NUTCH-288 >>>>> >>>>>> A dirty workaround (though working) is to do a search with one hit >>>>>> per >>>>>> page and start-index as 9. That will give you the actual >>>>>> start-index >>>>>> of the last item, which +1 is the number of results you are looking >>>>>> for. >>>>>> Since requesting the last page takes a bit resources, you might >>>>>> want to >>>>>> cache that result actually - so users searching again or navigating >>>>>> through pages get the number of pages faster. >>>>> >>>>>> PS: For the OpenSearch-connector to not throw an exception but to >>>>>> return the last page, please apply the patch I attached to the bug.
Re: Multiple indexes on a single server instance.
sudhendra seshachala wrote: > I am experiencing a similar problem. > What I have done is as follows. > I have different parse-plugin for each site ( I have 3 sites to crawl and > fetch data). But I capture data into same format I call it datarepository. > I have one index-plugin which indexes on data repository and one > query-plugin on the data repository, > I dont have to run multiple instances. I just run one instance of search > engine. > However the parse configuration is different for each site so I run > different crawler for each site > Then I index and merge all of them. So far the results are good if not > "WOW". > I still have to figure a way of ranking the page. For example I would like > to be able to apply ranking on the data repository. Let me know If I was > clear... Hi, not sure if I got you right with your last point, but it just came to my mind: It would be nice to be able to have something like "If it's from indexA, give it 100 extra-points - if from indexB give it 50 extra-points". Or some "if indexA give it 20% extra-weight" or so. But I don't believe this is easily doable. Or is it? I got a similar problem with languages: give priority to documents in German and English. But somewhere after those results also list documents in other languages. So I'd need to be able to give "extra-points" on a "per-language"-basis, based on the indexed language-field, right? Regards, Stefan > Stefan Groschupf <[EMAIL PROTECTED]> wrote: > I'm not sure what you are planing to do, but you can just switch a > symbolic link on your hdd driven by a cronjob to switch between index > on a given time. > May be you need to touch the web.xml to restart the searcher. > If you try to search in different kind of indexes at the same time, I > suggest to merge the indexes and have a kind keyfield for each of the > indexes. > For example add a field to each of your indexes names "indexName" and > put A, B and C as value into it. > Than you can merge your index. During runtime you just need to have a > queryfilter that extend a indexName:A or indexName:B to the query > string. > > Does this somehow help to solve your problem? > Stefan > > Am 23.05.2006 um 15:26 schrieb TJ Roberts: > >> I have five different indexes each with their own special >> configuration. I would like to be able to switch between the >> different indexes dynamically on a single instance of nutch running >> on jakarta-tomcat. Is this possible, or do I have to run five >> instances of nutch, one for each index?
Re: Multiple indexes on a single server instance.
I ran into a similar question myself a while ago. What I could imagine is company A, company B and company C, all wanting to have "their own" search-engine. At the same time there might be a "special" search-engine needed that crawls content from both company A and B but not C. I think that's where your suggestion comes into play, right? With the indexname. a) How would you "extend" your indexes by one field before merging them? Is there a small tool to add a field to an index? b) Do you always have to merge the indexes, or could you use some feature from the "distributed" nutch to search in multiple indexes? I'm just thinking about that because it would allow you to use multiple, maybe huge, indexes that could all be updated separately and without having to merge them again. Another point I have understood from the original question: How would it be possible to have an OpenSearch-interface for multiple indexes running on one single Tomcat-instance? I think the author asked whether you could/would install separate copies at the same time with different searcher.dir-settings in their nutch-site.xml. With your suggestion: I understand that a plugin similar to "query-more" could be written to allow providing a search for "indexName" (as you suggested) as well, right? With this, would it also be possible to ask for "indexName=A or B but not C"? Stefan Stefan Groschupf wrote: > I'm not sure what you are planning to do, but you can just switch a > symbolic link on your hdd driven by a cronjob to switch between indexes at > a given time. > Maybe you need to touch the web.xml to restart the searcher. > If you try to search in different kinds of indexes at the same time, I > suggest to merge the indexes and have a kind of keyfield for each of the > indexes. > For example add a field named "indexName" to each of your indexes and > put A, B and C as value into it. > Then you can merge your indexes. During runtime you just need to have a > queryfilter that adds an indexName:A or indexName:B to the query string. > > Does this somehow help to solve your problem? > Stefan > > On 23.05.2006 at 15:26, TJ Roberts wrote: > >> I have five different indexes each with their own special >> configuration. I would like to be able to switch between the >> different indexes dynamically on a single instance of nutch running on >> jakarta-tomcat. Is this possible, or do I have to run five instances >> of nutch, one for each index?
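The query-filter half of that suggestion could look just like the CategoryQueryFilter quoted earlier in this digest - a sketch (the package name is made up, and the field is kept lowercase because, as noted above, the default analyzer lowercases field names):

  package org.example.nutch.search;   // hypothetical package

  import org.apache.nutch.searcher.RawFieldQueryFilter;

  /** Handles "indexname:" query clauses, e.g. indexname:A, so a merged
   *  index can be restricted to the sub-index a document came from. */
  public class IndexNameQueryFilter extends RawFieldQueryFilter {
    public IndexNameQueryFilter() {
      super("indexname");
    }
  }

It would still have to be registered in a plugin.xml and listed in plugin.includes, and since the Nutch query language has no OR, "A or B but not C" would still need custom query handling.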
Re: Re-parsing document
Hi Stefan, that seems to have worked. And I tried out that my patch to the PDF-parser actually prevented "unclean" IO-exceptions (see http://issues.apache.org/jira/browse/NUTCH-290 ). The strange thing, however, is that I still see "garbage" (undecoded binary data from the PDF-file) in search-summaries. Could it be that possibly since my plugin returns empty content (and by that preventing an exception) some other place in the source still thinks "no summary? I'll grab the raw content instead then"? My problem is that for unparseable files I get binary data in the summaries. The special case in my eyes are PDF-files, where the patch now prevents an exception which leads to a "parse failed". Now the parse is fine, but I still get binary summaries :-( Could you maybe have a look at the issue? There is a test-PDF mentioned as well. And I can offer more :-) Regards, Stefan Stefan Groschupf wrote: > You can just delete the parse output folders and start the parsing tool. > Parsing a given page again makes only sense for debug reasons since > hadoop io system can not update entries. > If you need to debug I suggest to write you a junit test. > > HTH > Stefan > > > Am 29.05.2006 um 01:01 schrieb Stefan Neufeind: > >> Hi, >> >> was is needed to re-parse documents that were already fetched into a >> segment? Is another "nutch index ..."-run sufficient, or how could I >> send the documents through the parse-plugins again? >> >> >> Regards, >> Stefan
Re: getting exact number of matches
Hi Stefan, I didn't mean duplicate in the sense of "two times the same result" - but in the sense of "show only XX results per website", e.g. only to show max two pages of a website that might match. And you can't dedup that before the search (runtime) because you don't know what was actually searched. I'm referring to the hitsPerSite-parameter of the webinterface - while in the source it's named a bit more generally (there are variables like dedupField etc.). Regards, Stefan Stefan Groschupf wrote: > Hi, > why not dedup your complete index before and not until runtime? > There is a dedup tool for that. > > Stefan > > On 29.05.2006 at 21:20, Stefan Neufeind wrote: > >> Hi Eugen, >> >> what I've found (and if I'm right) is that the page-calculation is done >> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_ >> results when you only need the first page, I guess this is currently not >> done at the moment. However, since I also needed the exact number, I >> did >> find out the "dirty hack" at least. That helps for the moment. >> But as it might take quite a while to find out the exact number of >> pages, >> I suggest that e.g. you compose a "hash" of the words searched for, and >> maybe, to be safe, the number of non-dedupped search results, so you don't >> have to search the exact number again and again when moving between >> pages. >> >> >> Hope that helps, >> Stefan >> >> Eugen Kochuev wrote: >>> >>> And did you manage to locate the place where the filtering on per >>> site basis is done? Is it possible to tweak nutch to make it tell >>> the exact number of pages after filtering or is there a problem? >>> >>>> I've got a pending nutch-issue on this >>>> http://issues.apache.org/jira/browse/NUTCH-288 >>> >>>> A dirty workaround (though working) is to do a search with one hit >>>> per >>>> page and start-index as 9. That will give you the actual >>>> start-index >>>> of the last item, which +1 is the number of results you are looking >>>> for. >>>> Since requesting the last page takes a bit of resources, you might >>>> want to >>>> cache that result actually - so users searching again or navigating >>>> through pages get the number of pages faster. >>> >>>> PS: For the OpenSearch-connector to not throw an exception but to >>>> return the last page, please apply the patch I attached to the bug.
Re: getting exact number of matches
Hi Eugen, what I've found (and if I'm right) is that the page-calculation is done in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_ results when you only need the first page, I guess this is currently not done at the moment. However, since I also needed the exact number, I did find out the "dirty hack" at least. That helps for the moment. But as it might take quite a while to find out the exact number of pages, I suggest that e.g. you compose a "hash" of the words searched for, and maybe, to be safe, the number of non-dedupped search results, so you don't have to search the exact number again and again when moving between pages. Hope that helps, Stefan Eugen Kochuev wrote: > > And did you manage to locate the place where the filtering on per > site basis is done? Is it possible to tweak nutch to make it tell > the exact number of pages after filtering or is there a problem? > >> I've got a pending nutch-issue on this >> http://issues.apache.org/jira/browse/NUTCH-288 > >> A dirty workaround (though working) is to do a search with one hit per >> page and start-index as 9. That will give you the actual start-index >> of the last item, which +1 is the number of results you are looking for. >> Since requesting the last page takes a bit of resources, you might want to >> cache that result actually - so users searching again or navigating >> through pages get the number of pages faster. > >> PS: For the OpenSearch-connector to not throw an exception but to return >> the last page, please apply the patch I attached to the bug.
Re: content-type crawling problem
Heiko Dietze wrote: > Hello, > > Eugen Kochuev wrote: >> Btw, do I need to uncomment this? It's more logical to comment this >> out. Right? >> >> [...] >> >>> Just uncomment this wildcard match. You might also check >>> the other rules for further unwanted content. > > Sorry for the typo, I meant that you should leave it out, yes. > > Unfortunately for the fetching of the pages this is not the solution, but > the index will be based only on the proper content. I think the index is > created from the parsed content. Maybe have a look at urlfilter-suffix and only fetch those files with suffixes you want. Regards, Stefan
Re: getting exact number of matches
Eugen Kochuev wrote: > Hello nutch-user, > > I'm rewriting JSP front-end to add pager (currently there's only > "Next page" button) and I have faced a difficulty, that actually I > cannot get the number of matches if the hits are filtered to show > only 2 results by domain. How could this be resolved? Where this > filtering is done and the exact number of pages is lost? Please > advise. Hi, I've got a pending nutch-issue on this http://issues.apache.org/jira/browse/NUTCH-288 A dirty workaround (though working) is to do a search with one hit per page and start-index as 9. That will give you the actual start-index of the last item, which +1 is the number of results you are looking for. Since requesting the last page takes a bit resources, you might want to cache that result actually - so users searching again or navigating through pages get the number of pages faster. PS: For the OpenSearch-connector to not throw an exception but to return the last page, please apply the patch I attached to the bug. Regards, Stefan
Re-parsing document
Hi, what is needed to re-parse documents that were already fetched into a segment? Is another "nutch index ..."-run sufficient, or how could I send the documents through the parse-plugins again? Regards, Stefan
How to copy compiled files to correct dirs?
Hi, after doing an "ant compile", how are the files (e.g. all plugins) supposed to be copied from the build/-directory to the normal plugins-directory that is shipped when downloading a nightly? I've been re-compiling a plugin and wondered why "ant compile" leaves the file in build/ and does not overwrite the actual plugin I was using. Okay, a simple copy or symlink helped in this case ... but did I miss any script that is supposed to be called to copy the files "where they belong"? Regards, Stefan
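In case it helps: what I ended up doing by hand looks roughly like this (only a sketch - the plugin name is just an example and the exact sub-directory under build/ may differ on your checkout, so adjust the paths):

  # from the nutch source root, after "ant compile"
  cp -r build/parse-html/* plugins/parse-html/
  # or symlink it once, so future rebuilds are picked up automatically
  # ln -s `pwd`/build/parse-html plugins/parse-html

If there is an official ant target for this (a full "ant package" building a fresh dist, or something similar), I'd of course prefer that.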
Re: 0.8 release soon?
Andrzej Bialecki wrote: > Doug Cutting wrote: >> Andrzej Bialecki wrote: >>> 0.8 is pretty stable now, I think we should start considering a >>> release soon, within the next month's time frame. >> >> +1 >> >> Are there substantial features still missing from 0.8 that were >> supported in 0.7? > > Next week I'll be working on NUTCH-61 to bring it to a state where it > could be committed. It's a new feature, so the question is: should we > play safe, and wait with it after the release, or should we go with it > in the hope that it will get a wider testing audience? ;) +1 for being "safe" and instead focusing on some of the already mentioned patches that might need attention more urgently. Stefan
Re: 0.8 release soon?
Doug Cutting wrote: > Andrzej Bialecki wrote: >> 0.8 is pretty stable now, I think we should start considering a >> release soon, within the next month's time frame. > > +1 > > Are there substantial features still missing from 0.8 that were > supported in 0.7? > > Are there any showstopping bugs, things that worked in 0.7 that are > broken in 0.8? +1 as well, though I'm still new to the topic. During the setup I've come across a few patches that I think might be useful to go into the 0.8 release. Those are - fixes: NUTCH-110-fixIllegalXmlChars08.patch, NUTCH-254-fetcher_filter_url_patch.txt; new features that I tested and work fine here: NUTCH-48-did-you-mean-combined08.patch, NUTCH-173-patch08-new.patch, NUTCH-279-regex-normalize.patch, NUTCH-288-OpenSearch-fix.patch; still open issues from my side: NUTCH-277 (seems to affect httpclient, changing to http helped). Feedback welcome. Regards, Stefan
Re: Sorting in nutch-webinterface - how?
Marko Bauhardt wrote: > > Am 26.05.2006 um 01:57 schrieb Stefan Neufeind: >>> Modified. If not, date=FetchTime. >> >> Hi Marko, >> > > Hi Stefan, > >> that hint really helped. Can you maybe also help me out with sort=title? >> See also: >> http://issues.apache.org/jira/browse/NUTCH-287 >> >> The problem is that it works on some searches - but not always. Could it >> be that maybe some plugins don't write a title or write the title as >> null/empty and that leads to problems? What could I do: > > If a html page begins with an "<?xml ...>" declaration, maybe it is not parsed by > the html parser (i am not sure). If the TextParser is used to parse this > page, then no title will be extracted. So in this case the title is empty > and the summary is xml-code. > > Please verify your pages that have no title and look whether such an "<?xml ...>" > declaration exists at the beginning of the page. I could understand that those documents are "problematic" in sorting - e.g. they would all be in front or at the end of the sorted list. But why does this actually lead to no output/an exception/...? Maybe in case no title is present at least _something_ could be used - e.g. the URL instead or so? Regards, Stefan
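Just to make the fallback idea concrete, something along these lines inside an indexing filter is what I have in mind - an untested sketch, the class and method names are made up by me, and the Field flags assume the Lucene version bundled with 0.8-dev:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  // hypothetical helper: make sure the "title" field is never empty,
  // so that sorting on it always has a value to work with
  public class TitleFallback {
    public static void addTitle(Document doc, String title, String url) {
      if (title == null || title.trim().length() == 0) {
        title = url;   // fall back to the page URL
      }
      doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
    }
  }

Of course the cleaner fix would probably be to handle a missing sort field gracefully on the searcher side instead of patching every indexing plugin.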
Re: Incremental crawl again ... (Please explain)
I haven't yet tried - but could you maybe: - move the new segments somewhere independent of the existing ones - create a separate linkdb for it (to my understanding the linkdb is only needed when indexing) - create a separate index on that - then move the segment into the segments-dir and the new index into the indexes-dir as "part-" - just merge indexes (should work relatively fast; a rough command sketch follows after the quoted steps below) In the long term your segments, indexes etc. add up - so in this case you'd maybe need to think about merging segments etc. Also, this is "only" my current understanding of the topic. It would be nice to get feedback and maybe easier solutions from others as well. Regards, Stefan Jacob Brunson wrote: > Yes, I see what you mean about re-indexing again over all the > segments. However, indexing takes a lot of time and I was hoping that > merging many smaller indexes would be a much more efficient method. > Besides, deleting the index and re-indexing just doesn't seem like > *The Right Thing(tm)*. > > On 5/26/06, zzcgiacomini <[EMAIL PROTECTED]> wrote: >> I am not at all a Nutch expert, I am just experimenting a little bit, >> but as far as I understood it >> you can remove the indexes directory and re-index the segments again: >> In my case after step 8 (see below) I have only one segment : >> test/segments/20060522144050 >> after step 9 I will have a second segment >> test/segments/20060522151957 >> Now what we can do is to remove the test/indexes directory and >> re-index the two segments: >> this is what I did : >> >> hadoop dfs -rm test/indexes >> nutch index test/indexes test/crawldb linkdb >> test/segments/20060522144050 test/segments/20060522151957 >> >> Hope it helps >> >> -Corrado >> >> >> >> Jacob Brunson wrote: >> > I looked at the referenced message at >> > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html >> > but I am still having problems. >> > >> > I am running the latest checkout from subversion. >> > >> > These are the commands which I've run: >> > bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 1 >> > bin/nutch generate crawl/crawldb crawl/segments -topN 500 >> > lastsegment=`ls -d crawl/segments/2* | tail -1` >> > bin/nutch fetch $lastsegment >> > bin/nutch updatedb crawl/crawldb $lastsegment >> > bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment >> > >> > This last command fails with a java.io.IOException saying: "Output >> > directory /home/nutch/nutch/crawl/indexes already exists" >> > >> > So I'm confused because it seems like I did exactly what was described >> > in the referenced email, but it didn't work for me. Can someone help >> > me figure out what I'm doing wrong or what I need to do instead? >> > Thanks, >> > Jacob >> > >> > >> > On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote: >> >> Please do follow the link below.. >> >> >> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html >> >> >> >> I have been able to follow the threads as explained and merge >> >> multiple crawls. It works like a champ. >> >> >> >> Thanks >> >> Sudhi >> >> >> >> zzcgiacomini <[EMAIL PROTECTED]> wrote: >> >> I am currently using the latest nightly nutch-0.8-dev build and >> >> I am really confused about how to proceed after I have done two >> >> different "whole web" incremental crawls >> >> >> >> The tutorial is not clear to me on how to merge the results of the >> >> two crawls in order to be able to >> >> make a search operation. >> >> >> >> Could someone please give me a hint on what is the right >> procedure?!
>> >> here is what I am doing: >> >> >> >> 1. create an initial urls file /tmp/dmoz/urls.txt >> >> 2. hadoop dfs -put /tmp/urls/ url >> >> 3. nutch inject test/crawldb dmoz >> >> 4. nutch generate test/crawldb test/segments >> >> 5. nutch fetch test/segments/20060522144050 >> >> 6. nutch updatedb test/crawldb test/segments/20060522144050 >> >> 7. nutch invertlinks linkdb test/segments/20060522144050 >> >> 8. nutch index test/indexes test/crawldb linkdb >> >> test/segments/20060522144050 >> >> >> >> ..and now I am able to search... >> >> >> >> Now I run >> >> >> >> 9. nutch generate test/crawldb test/segments -topN 1000 >> >> >> >> and I will end up having a new segment : test/segments/20060522151957 >> >> >> >> 10. nutch fetch test/segments/20060522151957 >> >> 11. nutch updatedb test/crawldb test/segments/20060522151957 >> >> >> >> >> >> From this point on I cannot make much progress >> >> >> >> A) I have tried to merge the two segments into a new one with the >> >> idea to rerun an invertlinks and index on it but: >> >> >> >> nutch mergesegs test/segments -dir test/segments >> >> >> >> whatever I specify as outputdir or outputsegment I get errors >> >> >> >> B) I have also tried to make invertlinks on all test/segments with >> >> the goal to run the nutch index command to produce a second >> >> indexes directory, let's say test/indexes1, and finally run the merge >> >> index on index2 >> >> >> >> nutch in
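To put the "separate index, then merge" idea from my list above into commands - again only a rough sketch: the paths are invented and the "bin/nutch merge" (IndexMerger) syntax is how I understand it in 0.8-dev, so please check the usage output first:

  # index only the newly fetched segment into its own index directory
  newseg=`ls -d test/segments/2* | tail -1`
  bin/nutch invertlinks test/linkdb-new $newseg
  bin/nutch index test/indexes-new test/crawldb test/linkdb-new $newseg

  # then merge the existing and the new indexes into one
  bin/nutch merge test/index-merged test/indexes test/indexes-new

In the long run you would still want to merge segments and re-run dedup from time to time.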
Re: Sorting in nutch-webinterface - how?
Marko Bauhardt wrote: > >> Hmm, that works. But why - since I think the field is named lastModified. > > LastModified is only used if lastModified is available in the html > meta tags. If that is true, lastModified is stored but not indexed. > However the date field is always indexed. If lastModified is available > as a metatag, then date=lastModified. If not, date=FetchTime. Hi Marko, that hint really helped. Can you maybe also help me out with sort=title? See also: http://issues.apache.org/jira/browse/NUTCH-287 The problem is that it works on some searches - but not always. Could it be that maybe some plugins don't write a title or write the title as null/empty and that leads to problems? What could I do: a) as a quickfix to prevent the exception, and b) to track down further which result(s) actually cause the problem and why. I've taken a look at the javadoc of the lucene-interface. It looks like if you sort by something, fields[0] should always be set to the field you sorted on - but afaik it is actually null, or maybe even fields is empty or so. Regards, Stefan
Re: Sorting in nutch-webinterface - how?
Marko Bauhardt wrote: > > Am 25.05.2006 um 13:21 schrieb Stefan Neufeind: > >> Hi, >> >> I did use index-basic and index-more. I see lastModified in the >> RSS-output. Now I want to &sort=lastModified - does not work. > > Try sort=date. Hmm, that works. But why - since I think the field is named lastModified. Thank you very much for your help, Stefan
Sorting in nutch-webinterface - how?
Hi, I did use index-basic and index-more. I see lastModified in the RSS-output. Now I want to &sort=lastModified - does not work. Same for &sort=title. However &sort=url does work. What am I doing wrong here? Regards, Stefan
Re: using nutch to detect broken pages
Jorg Heymans wrote: > Hi, > > I was wondering if it's possible to get crawl to go through a website and > only report links that return a specific http response code (eg 404) ? I'm > looking to somehow automate basic site testing of rather huge websites, > inevitably one ends up in the world of crawlers (and being a java guy > myself > this means nutch). > > I'm still going through the faq and first basic steps, so apologies if what > i'm asking is the most basic nutch-thing ever :) I haven't used it yet - but I guess that's what the "store"-setting for the fetcher in nutch-config might be for. To my understanding this would allow you not to store the fetched content but only crawl the links. From the crawldb I guess you should be (somehow) able to see for which URLs retries were conducted unsuccessfully etc. Maybe you could instead also just monitor the output of the fetcher while running? Would be nice to hear if you manage to set up a working solution imho. Regards, Stefan
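For the reporting part, a sketch of what I would try (paths are just examples, and the property name for not storing content is how I recall it from nutch-default.xml - please verify it there):

  # statistics about the fetch states in the crawldb
  bin/nutch readdb crawl/crawldb -stats

  # full dump; grep it for the gone/retry entries to find the broken URLs
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

plus, in nutch-site.xml, a property along the lines of fetcher.store.content set to false so the page bodies are not kept around.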
Re: how to
Daniel wrote: > Dear friends: > > I'm new here. > I want to know > 1. how to put patches to Nutch Simply use the patch utility available e.g. under Linux. Go into the nutch-application-rootdir (from where you can see "conf" and "src" and the like). There do: patch -p1 <../pathtopath/mypatch.patch The -p1 here depends on the path-names used inside the patch - sorry. So in case there is something like nutch/trunk/src/..., then, since src is directly available from the directory you are in, you would want to strip the first two parts of the paths used in the patch. So in this case you would want to use "patch -p2" (p followed by the number of path-parts to remove). > 2. how to establish a Chinese character lexicon to the Nutch Which lexicon do you mean? Nutch is using UTF-8, so I guess there should be no problem with Chinese characters in the index in general. But maybe I got you wrong. Also, I haven't had to deal with Chinese so far ... sorry. Regards, Stefan
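A tiny example for the -p part, with made-up paths just to show the counting:

  # say the patch headers look like this:
  #   --- nutch/trunk/src/java/org/apache/nutch/foo/Bar.java
  #   +++ nutch/trunk/src/java/org/apache/nutch/foo/Bar.java
  # standing in the directory that directly contains src/, strip the two
  # leading components ("nutch/" and "trunk/"):
  patch -p2 < ../patches/mypatch.patch
  # if the paths in the patch already start with src/..., plain "patch -p0" will do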
Re: When will we see 0.8?
Benjamin Higgins wrote: > I've heard it's pretty stable. Should I use 0.8 CVS now or wait for it to > be officially released? We're using it, since we wanted some of the new features. During testing some problems turned up that we were luckily able to fix easily with available patches. For those where e.g. there was a new feature but the patch was not yet rewritten from 0.7 to 0.8, we've submitted an updated patch. So I guess a basic 0.8 should easily work for you. If you get problems in one area, see jira or ask here. But as always: If you want something stable (read: without using at least one finger to touch source), go for 0.7 imho. > Is there a friendly changelog for 0.8? Not sure, sorry. > Also, will 0.8 require its own Tomcat instance like 0.7 did, or will it > play nice not being the ROOT? Works as non-root. There is also a very simple patch in jira to implement that for 0.7 as well (1 line if I remember right). Regards, Stefan
Re: Setting query.host.boost etc. in nutch-site.xml does not work?
Wow Marko, that was damn quick. I didn't recognise the error, though I looked into the sources briefly. Thanks to you for finding the bug - and finding it in so little time. You made my day! And also thanks to Andrzej for putting a fix in the trunk already: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java?r1=383304&r2=408767&pathrev=408767 Thank you, Stefan Marko Bauhardt wrote: > This is a bug in the query-basic plugin. The boosting values in the > nutch-default.xml are not used. > We should open a bug in jira. This simple patch should work. > > Index: > src/plugin/query-basic/src/java/org/apache/nutch/searcher/basic/BasicQueryFilter.java > [...] > Am 22.05.2006 um 22:07 schrieb Stefan Neufeind: > >> Hi, >> >> I was experiencing a "strange" selection of search-results here. The >> first idea was to rate the results with the searchword in the hostname higher. >> So I set query.host.boost to quite a high value (50, later 200). But >> nothing in the result changes. >> >> Searching for the full hostname (www.example.com) does not give me any >> results at all. Could it be that the hostname is not taken into account >> during a search? >> >> What could be wrong here? Please help. *sigh*
Setting query.host.boost etc. in nutch-site.xml does not work?
Hi, I was experiencing a "strange" selection of search-results here. The first idea was to rate the results with the searchword in the hostname higher. So I set query.host.boost to quite a high value (50, later 200). But nothing in the result changes. Searching for the full hostname (www.example.com) does not give me any results at all. Could it be that the hostname is not taken into account during a search? What could be wrong here? Please help. *sigh* Thanks a lot, Stefan
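Just for completeness, this is the kind of override I mean in conf/nutch-site.xml (50 being only the test value mentioned above):

  <property>
    <name>query.host.boost</name>
    <value>50</value>
  </property>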
Applying new regex-normalizer-rules to indexed pages
Hi, during a long fetch-run I experienced session-IDs in URLs, which was a bit problematic. So I figured out how to write and test proper regex-normalizer-rules (see NUTCH-279). Now I wonder if on the next fetch-round URLs will get properly normalized or if they are now un-normalized in the crawldb and from there are fetched during generate without realizing the "duplicate" (after normalization) URLs. Also, is there a way to "clean" the page-index before actually indexing? Or would this automatically be taken care of (does the normalizer run again?) when performing the actual invertlinks/index/dedup? Regards, Stefan
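For reference, the kind of rule I mean goes into conf/regex-normalize.xml like this (a simplified example for jsessionid-style path parameters only, not the exact patterns from the NUTCH-279 patch):

  <regex>
    <pattern>;jsessionid=[0-9A-Za-z]+</pattern>
    <substitution></substitution>
  </regex>

The open question remains whether URLs already stored in the crawldb get normalized again on the next generate/updatedb round, or only newly discovered ones.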
Re: Debugging rules for RegexUrlNormalizer
Thought I just missed something. Okay, I just added a few patterns as well as a commandline-checker. See http://issues.apache.org/jira/browse/NUTCH-279 for the patch. Regards, Stefan TDLN wrote: > Sorry, I was a bit too fast there, the answer applies to the > RegexURLFilter, not the RegexUrlNormalizer. I don't think there is a > similar facility for the RegexUrlNormalizer, but let me know if you > find it :) > > Rgrds, Thomas > > On 5/22/06, TDLN <[EMAIL PROTECTED]> wrote: >> Hi Stefan >> >> try running bin/nutch org.apache.nutch.net.URLFilterChecker >> >> Rgrds, Thomas >> >> On 5/22/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote: >> > Hi, >> > >> > is there a way to debug rules for RegexUrlNormalizer, e.g. test the >> > substitution from commandline? >> > >> > >> > bin/nutch org.apache.nutch.net.RegexUrlNormalizer >> > >> > does print out the rules it uses. But afaik there is no such thing >> > possible as >> > >> > echo "http://www.example.com" | bin/nutch >> > org.apache.nutch.net.RegexUrlNormalizer >> > >> > is there? So how do you debug rules when writing new ones and testing >> > them against a set of URLs that should match / should not match?
Debugging rules for RegexUrlNormalizer
Hi, is there a way to debug rules for RegexUrlNormalizer, e.g. test the substitution from commandline? bin/nutch org.apache.nutch.net.RegexUrlNormalizer does print out the rules it uses. But afaik there is no such thing possible as echo "http://www.example.com" | bin/nutch org.apache.nutch.net.RegexUrlNormalizer is there? So how do you debug rules when writing new ones and testing them against a set of URLs that should match / should not match? Regards, Stefan
Re: Nutch fetcher "waiting" inbetween fetch
Here we're using one machine for fetching (exclusive at the moment) with about 50 fetchers and a local Bind-resolver in caching-nameserver setup. Bandwidth of the fetchers is roughly 5 to 10 Mbit inbound. What I see is that during fetching java is taking 99.9% cpu (all userland). At the point where the server "stalls" this changes to 99.9% system-usage (writing something to disk?). It stalls for about 30 seconds or a bit more. Hmm - I don't know where to look for the cause of these stalls, since you don't see what it really does at that point (in logs or so). PS: Your thoughts on this are very much appreciated. Regards, Stefan Dennis Kubes wrote: > What we were seeing is the dns server cached the addresses in memory > (bind 9x..) and because we were caching so many addresses on a single > dns server it would eat up memory and eventually begin swapping to > disk. When this occurred the server load got up to 1.5 and the iowait > was near 100%. Basically it stalled the box. Requests were still > getting through but it was very slow. Our solution (at least > temporarily) was to restart the bind service (not the box, just the > daemon) every couple of hours to flush the memory. > As for load on the boxes we are seeing very minimal loads (like .08 > loads and no iowait times). We have about 55 fetchers running (5 on > each box with 11 nodes) and right now we are bandwidth bound on a 2Mbps > pipe. So maybe it is just that we don't have enough load on each > machine to see the kind of waits that you are seeing. Is your system > distributed or on a single machine? > > Dennis > > Stefan Neufeind wrote: >> Hi Dennis, >> >> thank you for the answer. Hmm, it could theoretically be. But to prevent >> this the server already does resolving completely on its local machine. >> Also I wonder about the CPU-load moving to "system" - I suspected heavy >> disk-access or so ... but I don't know how/when the fetcher writes data >> to disk etc. >> >> >> >> Regards, >> Stefan >> >> Dennis Kubes wrote: >> >>> Is this possibly a dns issue? We are running a 5M page crawl and are >>> seeing very heavy DNS load. Just a thought. >>> >>> Dennis >>> >>> Stefan Neufeind wrote: >>> >>>> Hi, >>>> >>>> I've encountered that here nutch is fetching quite a sum of URLs from a >>>> long list (about 25.000). But from time to time nutch is "waiting" for >>>> 10 seconds or so. Nothing is locked, but system-load is 99.9% then. Is >>>> nutch writing fetched data or index to disk at that stage? Is there any >>>> way to optimize this step, e.g. by writing more often and performing >>>> the >>>> write in "background" or caching even more in mem instead of >>>> flushing to disk?
dedup after building indexed? (0.8-dev)
Hi, up to 0.7.x dedup was done before indexing, right? And for 0.8-dev I read from Crawl.java that the order to use is - invertlinks - index - dedup - merge (merging segment indexes) Is that right? I wonder why indexing is done before removing duplicates. Could somebody please explain? Also, am I right that merge is not needed if run on only one node? I already got a "complete" index from the "index"-phase. Or what is that about? Regards, Stefan
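Spelled out as commands, the sequence I mean would be roughly this - paths made up, and the arguments are only how I read them from Crawl.java and the bin/nutch usage output, so treat it as a sketch:

  bin/nutch invertlinks crawl/linkdb crawl/segments/*
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  bin/nutch dedup crawl/indexes
  bin/nutch merge crawl/index crawl/indexes

So dedup apparently works on the already-built part indexes (deleting the duplicate documents there), and merge then folds those parts into the single index the searcher uses - which would explain why it comes last, and why it's not strictly needed as long as there is only one part.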
Re: Nutch fetcher "waiting" inbetween fetch
Hi Dennis, thank you for the answer. Hmm, it could theoretically be. But to prevent this the server already does resolving completely on its local machine. Also I wonder about the CPU-load moving to "system" - I suspected heavy disk-access or so ... but I don't know how/when the fetcher writes data to disk etc. Regards, Stefan Dennis Kubes wrote: > Is this possibly a dns issue? We are running a 5M page crawl and are > seeing very heavy DNS load. Just a thought. > > Dennis > > Stefan Neufeind wrote: >> Hi, >> >> I've encountered that here nutch is fetching quite a sum of URLs from a >> long list (about 25.000). But from time to time nutch is "waiting" for >> 10 seconds or so. Nothing is locked, but system-load is 99.9% then. Is >> nutch writing fetched data or index to disk at that stage? Is there any >> way to optimize this step, e.g. by writing more often and performing the >> write in "background" or caching even more in mem instead of flushing to >> disk?
Nutch fetcher "waiting" inbetween fetch
Hi, I've encountered that here nutch is fetching quite a sum of URLs from a long list (about 25.000). But from time to time nutch is "waiting" for 10 seconds or so. Nothing is locked, but system-load is 99.9% then. Is nutch writing fetched data or index to disk at that stage? Is there any way to optimize this step, e.g. by writing more often and performing the write in "background" or caching even more in mem instead of flushing to disk? Regards, Stefan