Re: Query pdf, etc..
You can use the index-more and query-more plugins to add a field to your index that records the file type of each document. In your search you can then use type:pdf or type:msword to filter for these files. I used Nutch 0.7.2 to make it work. Regards, Lourival Júnior
On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote: Hi guys, I would like to add a new button to my web page that runs an advanced search using keywords. When the user clicks it, the search should look for the keywords only in the indexed PDF, Word, or Excel documents. Do you know how I can limit my search to PDF/Word/Excel documents? Thanks for your help. E
-- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
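A minimal sketch of how such a search button might build its query string, assuming the index-more and query-more plugins are enabled. The helper name `build_type_query` is hypothetical and not part of Nutch. The values `pdf` and `msword` come from the message above; the right value for Excel documents is not stated in this thread and should be checked against your own index.

```python
def build_type_query(keywords, doc_type):
    """Return a query string restricted to one document type.

    Hypothetical helper: it only formats text, it does not call Nutch.
    """
    doc_type = doc_type.strip().lower()
    if not doc_type or " " in doc_type:
        raise ValueError("doc_type must be a single token such as 'pdf'")
    # Collapse repeated whitespace in the user's keywords.
    terms = " ".join(keywords.split())
    # The field name and its value are joined without a space.
    return f"{terms} type:{doc_type}".strip()
```

For example, `build_type_query("annual report", "pdf")` produces `annual report type:pdf`, which can then be passed to the search page as the query.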
Re: Using nutch as a web crawler
Nutch has a file called crawl-urlfilter.txt where you can set your site domain or list of sites, so that Nutch crawls only that list. Download Nutch and see it working; that is the best way to find out :). Take a look: http://lucene.apache.org/nutch/tutorial8.html Regards,
On 4/5/07, Meryl Silverburgh [EMAIL PROTECTED] wrote: Thanks. Can you please tell me how I can plug in my own handling when Nutch visits a site, instead of building the search database for that site?
On 4/3/07, Lourival Júnior [EMAIL PROTECTED] wrote: I am completely certain that Nutch is what you are looking for. Take a look at Nutch's documentation for more details and you will see :).
On 4/3/07, Meryl Silverburgh [EMAIL PROTECTED] wrote: Hi, I would like to know whether it is a good idea to use the Nutch web crawler. Basically, this is what I need:
1. I have a list of web sites.
2. I want the web crawler to go through each site and parse the anchors. If a link is in the same domain, it should repeat the same step, up to 3 levels.
3. For each link, write to a new file.
Is Nutch a good solution, or is there a better open source alternative for my purpose? Thank you.
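The filtering idea behind crawl-urlfilter.txt can be sketched as follows. This is a toy model in Python, not Nutch's own code. It assumes the usual rule format: each line is `+` or `-` followed by a regular expression, the first matching rule decides, and a URL that matches no rule is rejected. The domain `example.org` is a placeholder.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return True if the first rule matching the URL is an include rule."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # no rule matched: the URL is not crawled

RULES = parse_rules(r"""
# accept hosts in the placeholder domain example.org
+^http://([a-z0-9]*\.)*example\.org/
# skip everything else
-.
""")
```

With these rules, `accepts(RULES, "http://www.example.org/a.html")` is true and any URL outside the domain is rejected, which is how the crawl stays on the listed sites.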
Re: Using nutch as a web crawler
I am completely certain that Nutch is what you are looking for. Take a look at Nutch's documentation for more details and you will see :).
On 4/3/07, Meryl Silverburgh [EMAIL PROTECTED] wrote: Hi, I would like to know whether it is a good idea to use the Nutch web crawler. Basically, this is what I need:
1. I have a list of web sites.
2. I want the web crawler to go through each site and parse the anchors. If a link is in the same domain, it should repeat the same step, up to 3 levels.
3. For each link, write to a new file.
Is Nutch a good solution, or is there a better open source alternative for my purpose? Thank you.
Re: java.lang.NoClassDefFoundError
Please help: I'm testing Nutch 0.8.1 and I still get this error when trying to run a simple command: Exception in thread "main" java.lang.NoClassDefFoundError What's wrong? I'm running on Windows 2000 with Cygwin.
On 7/28/06, Rick Carver [EMAIL PROTECTED] wrote: I get the same problem trying to use Nutch 0.8, just checked out. Exception in thread "main" java.lang.NoClassDefFoundError This is on OS X 10.4.7. Older Nutch runs fine.
--- Lourival Júnior [EMAIL PROTECTED] wrote: Hi all! I'm testing Nutch 0.8, but I get this error from this simple command: $ bin/nutch readdb java.lang.NoClassDefFoundError: and Exception in thread "main" I've set the NUTCH_JAVA_HOME variable, but I'm sure it is the root cause of this. What is occurring?
Re: Common terms
Have you reindexed your segments? It's important, because it makes Nutch recognize your common terms. I've tried it, and the only thing I noticed was that the index is bigger than the original (before using the common terms).
On 9/25/06, carmmello [EMAIL PROTECTED] wrote: I'm using Nutch 0.7.2 and have added some common terms in Portuguese to common-terms.utf8 in the conf folder (and also under the classes folder, inside the ROOT folder in Tomcat), one per line, like: content:da contente:de contente:eu .. However, when I try a search, I get all the results for those Portuguese common terms and, at the same time, zero results for the original English terms. I have even tried listing all the terms in alphabetical order, including the original ones, with the same results. In other words, Nutch does not seem to recognize the added common terms as such, only the original ones included in the distribution. Can anyone clarify this? Thanks
Re: Common terms
OK. If you're crawling with these settings, you don't need to reindex your segments again. What about the plugins you are using? Are you using the language-identifier plugin? If not, try it. Regards, P.S.: I speak Portuguese :)
On 9/25/06, carmmello [EMAIL PROTECTED] wrote: This issue happens even when I start a new crawl, so I'm not reindexing the segments. The indexing is done by Nutch itself, using the intranet method. Do you mean that after this is done, I have to reindex the segments once again? But if so, why are the English common terms recognized the first time? Thanks again
- Original Message - From: Lourival Júnior [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Monday, September 25, 2006 3:58 PM Subject: Re: Common terms
Have you reindexed your segments? It's important, because it makes Nutch recognize your common terms. I've tried it, and the only thing I noticed was that the index is bigger than the original (before using the common terms).
On 9/25/06, carmmello [EMAIL PROTECTED] wrote: I'm using Nutch 0.7.2 and have added some common terms in Portuguese to common-terms.utf8 in the conf folder (and also under the classes folder, inside the ROOT folder in Tomcat), one per line, like: content:da contente:de contente:eu .. However, when I try a search, I get all the results for those Portuguese common terms and, at the same time, zero results for the original English terms. I have even tried listing all the terms in alphabetical order, including the original ones, with the same results. In other words, Nutch does not seem to recognize the added common terms as such, only the original ones included in the distribution. Can anyone clarify this? Thanks
ZIP parser in Nutch 0.7.2
Hi all! Has anyone successfully set up the ZIP plugin in Nutch version 0.7.2? How can I do this? Regards,
Re: indexing folders with nutch
Yes Cam, if you use a depth of 1 you will crawl only the first document. With a depth of 2 you will crawl the first document and all the links found in it. With a depth of 3, you will crawl the first one, its links, and all links found in cycle 2. And so on. Increasing your depth will increase the size of your WebDB too. Try it ;) Regards
On 8/31/06, Sandy Polanski [EMAIL PROTECTED] wrote: Cam, try increasing the depth and see what happens. It seems that logic would say that they're on the same directory depth/level; however, just give it a try because I ran into a similar problem, and if I'm not mistaken, that fixed it.
--- Cam Bazz [EMAIL PROTECTED] wrote: Hello, I have a problem. I tried to index some local files with Nutch. What I have done is put them on a local Apache server (HTML files) and create a urls file that contains http://localhost/file01.html etc. Then I run nutch crawl urls . -dir crawl -depth 1 but the crawl stalls after a while, and nothing happens. I also tried -topN 1. Is there not a more convenient way of indexing from the file system? Best regards, -C.B.
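The depth behaviour described above can be sketched as a round-based traversal. This is an illustrative model only, not Nutch code: the `links` dictionary stands in for the web, and the function and page names are hypothetical.

```python
def crawl(seeds, links, depth):
    """Return the pages fetched after `depth` rounds, in fetch order.

    Round 1 fetches the seeds, round 2 fetches the links found in
    round 1, and so on. Each page is fetched at most once.
    """
    seen = set(seeds)
    frontier = list(seeds)
    fetched = []
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            fetched.append(url)
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return fetched

# Toy link graph: page "a" links to "b" and "c", "b" links to "d", etc.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
```

With this graph, a depth of 1 fetches only `a`, a depth of 2 adds `b` and `c`, and a depth of 3 adds `d`, which matches the description of each extra cycle following the links found in the previous one.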
Re: index/search filtering by category
Hi Ernesto! Meta tags are custom tags that you add to your web page, more exactly inside the <head></head> element, to describe the contents of the page to search engine indexes. For example, you can add meta tags to describe the author of the page, keywords, caching, and so on. What you can do for your problem is add a meta tag that describes your categories: <meta name="category" content="yourcategory" /> I hope this helps. Regards
On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote: Thanks to both of you for responding! What's a meta tag? Is it something specific to Nutch, not a Lucene field? I suppose that by implementing IndexFilter.filter: filter(Document doc, Parse parse, UTF8 url, CrawlDatum datum, Inlinks inlinks) I can add my field to a doc instance. Well, it seems the way is to try, to crash, and to try again... :) Thanks, Ernesto.
Chris Stephens wrote: You can't do it unless you write a plugin to parse a custom meta tag called category. I'm trying to do something like this now, but the plugin documentation is horrible.
Lourival Júnior wrote: Hi Ernesto! I know what you mean. Sometimes I get no answers too. Unfortunately, I'm new to Nutch and Lucene and I can't help you. Keep trying, the community will help you :).
On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote: Hi all, Please, can somebody answer my questions? I'm a Nutch beginner, and I hope that my questions/doubts are easy... ;) Or if my email is wrong, tell me. Or confirm whether I'm on the right track. Thanks a lot! Ernesto.
Ernesto De Santis wrote: Hi, I'm new to Nutch; I started yesterday, but I have experience with Lucene. I have some questions for you Nutch experts... ;) I want to split my result pages into categories, to filter them or show them separately. This is my approach:
*crawl/index* I want to index an extra field. For that I need to write my own plugin with my custom logic, and then configure my plugin in conf/nutch-site.xml. To develop my plugin, I see that I need to implement the Configurable (http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/conf/Configurable.html), IndexingFilter (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/indexer/IndexingFilter.html), and Pluggable (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/plugin/Pluggable.html) interfaces, and add the field value, the category value, to the Document instance.
*search* Here I have a doubt. One way is to set a required term on the Nutch query: query.addRequiredTerm(myCategory, category); I see that Nutch uses QueryFilters too, but I can't see how to hook one into my query.
*miscellaneous* Lucene has a rich query hierarchy that I don't see in Nutch. I don't see BooleanQuery, TermQuery, etc. Is the Query class the only point where the query is built in Nutch? The Lucene searcher has a way to separate the query from the filters: query conditions affect the rank, and filters don't. How does Nutch separate them?
*documentation* I read the documentation on the Nutch site, the tutorial, wiki, presentations and the today.java.net article: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html and part 2 too. A lot of details aren't covered there. Does anybody know of more detailed documentation? Thanks a lot. Ernesto.
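To illustrate the meta tag suggestion, here is a small standalone sketch that reads a `category` meta tag out of a page. It uses Python's standard `html.parser` module and is not a Nutch plugin; the class and function names are hypothetical. A real plugin would do the equivalent extraction at parse time and then add the value as an index field.

```python
from html.parser import HTMLParser

class CategoryMetaParser(HTMLParser):
    """Collect the content of every <meta name="category"> tag."""

    def __init__(self):
        super().__init__()
        self.categories = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values may be None when written without a value.
        if (attributes.get("name") or "").lower() == "category":
            content = attributes.get("content")
            if content:
                self.categories.append(content.strip())

def extract_categories(html):
    """Return the list of category values declared in the page."""
    parser = CategoryMetaParser()
    parser.feed(html)
    parser.close()
    return parser.categories
```

Calling `extract_categories` on a page whose head contains `<meta name="category" content="sports" />` yields a one-element list with that category, which is the value one would want to index.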
Re: index/search filtering by category
Hi Ernesto! I know what you mean. Sometimes I get no answers too. Unfortunately, I'm new to Nutch and Lucene and I can't help you. Keep trying, the community will help you :).
On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote: Hi all, Please, can somebody answer my questions? I'm a Nutch beginner, and I hope that my questions/doubts are easy... ;) Or if my email is wrong, tell me. Or confirm whether I'm on the right track. Thanks a lot! Ernesto.
Ernesto De Santis wrote: Hi, I'm new to Nutch; I started yesterday, but I have experience with Lucene. I have some questions for you Nutch experts... ;) I want to split my result pages into categories, to filter them or show them separately. This is my approach:
*crawl/index* I want to index an extra field. For that I need to write my own plugin with my custom logic, and then configure my plugin in conf/nutch-site.xml. To develop my plugin, I see that I need to implement the Configurable (http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/conf/Configurable.html), IndexingFilter (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/indexer/IndexingFilter.html), and Pluggable (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/plugin/Pluggable.html) interfaces, and add the field value, the category value, to the Document instance.
*search* Here I have a doubt. One way is to set a required term on the Nutch query: query.addRequiredTerm(myCategory, category); I see that Nutch uses QueryFilters too, but I can't see how to hook one into my query.
*miscellaneous* Lucene has a rich query hierarchy that I don't see in Nutch. I don't see BooleanQuery, TermQuery, etc. Is the Query class the only point where the query is built in Nutch? The Lucene searcher has a way to separate the query from the filters: query conditions affect the rank, and filters don't. How does Nutch separate them?
*documentation* I read the documentation on the Nutch site, the tutorial, wiki, presentations and the today.java.net article: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html and part 2 too. A lot of details aren't covered there. Does anybody know of more detailed documentation? Thanks a lot. Ernesto.
Zip Plugin
Has anyone succeeded in setting up the Zip parse plugin in Nutch 0.7.2? Regards
Re: Querying Fields
OK Lukas, I know what you mean. The community is very important to the success of a project, especially open source ones. I'm not sure I can contribute to Nutch right now, because I'm a newbie in this area. I will contribute soon. For now, I answer the questions I know something about. I really appreciate it when you answer our questions, because we feel motivated, and we'll tell other people that Nutch is very useful when you want to build a web search engine: not only useful, but the best way. Regards!
On 8/14/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Lourival, Definitely you are not alone with this feeling. Nutch is quite an active open source project, so some lack of documentation is natural, especially when Nutch hasn't reached its 1.0 release. Believe me, I have the same problem all the time. The best way to change this situation is to contribute! The wiki is open to anybody, the source code can be downloaded, and if you are a freak then you can suggest changes, and if you are a real hacker (meaning you are not ashamed to use vi for anything, including writing source code) then you can even become a committer. Once you become a committer you will be overloaded with work to the point that you won't be able to answer STFW questions on mailing lists... etc. :-) Regards, Lukas
On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote: Yes yes, I tested the index-more and query-more plugins. They allow searching these fields easily. However, if I could have found documentation about them I would not have spent time thinking about a solution. Thanks a lot!
On 8/11/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, You need to look into the source to find out what exactly it does. As far as I know it does not add any new field to the index (that should be done via the index-more plugin) but it allows you to query using type:, date: and site:, I think. Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: What exactly does the query-more plugin do? I tested it a few minutes ago and it doesn't add any field to the resulting index. Is it used in the webapp? Could you clarify this for me? Thanks!
On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, If my memory serves me correctly then query-more should work fine with Nutch 0.7.2 too. And you are right Matthew, you need to use the [type:] or [date:] filters in combination with [url:], as you can get an empty result set if they are used alone. I do queries like this: [url:http type:pdf] and it gives me the result I need. Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: All right! I've done this already. I think you didn't understand my question. What I want to do is query my indexes using something like filetype:pdf. Version 0.8 already has this feature, but I'm using version 0.7.2 and I want to add this feature manually. I don't know where I have to edit. Do you know? Regards, Lourival Junior
On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, To allow more formats to be indexed you need to modify nutch-site.xml and update/add the plugin.includes property (see nutch-default.xml for default settings). The following is what I have in nutch-site.xml:
    <property>
      <name>plugin.includes</name>
      <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
    </property>
[parse-*] is used to parse various formats, [query-more] allows you to use the [type:] filter in Nutch queries. Regards, Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: Hi Lukas and everybody! Do you know which file in Nutch 0.7.2 I should edit to add a field to my index (i.e. file type: PDF, Word or HTML)?
On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I am not sure if I can give you any useful hint, but the following is what once worked for me. Example of query: url:http date:20060801 The date: and type: options can be used in combination with url:. The filter url:http should select all documents (unless you allowed the file or ftp protocols). A plain date or type filter selects nothing if it is used alone. And be sure you don't introduce any space between the filter name and its value ([date: 20060801] is not the same as [date:20060801]). Lukas
On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote: Howie, I inspected my index using Luke and 20060801 shows up several times in the index. I'm unable to query pretty much any field. Several
Re: common-terms.utf8
Hi Timo! Thanks a lot! Now I have a clear understanding of this file. This article helps a lot too: http://searchenginewatch.com/showPage.html?page=2156061 Thanks again!
On 8/11/06, Timo Scheuer [EMAIL PROTECTED] wrote: Hi,
Could anyone explain what exactly the common-terms.utf8 file does? I don't understand the real function of this file...
During indexing (and also during searching) the common terms are used to form n-grams, which makes searching faster for common words such as articles. It is an alternative to using stop words. N-grams keep the common words by appending them to the following word. This approach increases selectivity. Cheers, Timo.
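The n-gram idea Timo describes can be sketched like this. It is a simplified model and not Nutch's actual analyzer: it only joins a common term to the word that follows it, and the hyphen used as a joiner is an assumption made for readability.

```python
def common_grams(tokens, common_terms):
    """Emit each token, plus a two-word gram after every common term.

    Simplified illustration: a common term is joined to the word that
    follows it, so a phrase containing it can be matched as one
    selective term instead of a very frequent one.
    """
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if token in common_terms and i + 1 < len(tokens):
            out.append(f"{token}-{tokens[i + 1]}")
    return out
```

For the tokens `the quick fox` with `the` as a common term, the result contains the extra term `the-quick` alongside the original words. Because extra terms are written for every common word, an index built this way holds more terms than one built without them, which is consistent with the larger index size reported elsewhere in this thread.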
Re: Querying Fields
Yes yes, I tested the index-more and query-more plugins. They allow searching these fields easily. However, if I could have found documentation about them I would not have spent time thinking about a solution. Thanks a lot!
On 8/11/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, You need to look into the source to find out what exactly it does. As far as I know it does not add any new field to the index (that should be done via the index-more plugin) but it allows you to query using type:, date: and site:, I think. Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: What exactly does the query-more plugin do? I tested it a few minutes ago and it doesn't add any field to the resulting index. Is it used in the webapp? Could you clarify this for me? Thanks!
On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, If my memory serves me correctly then query-more should work fine with Nutch 0.7.2 too. And you are right Matthew, you need to use the [type:] or [date:] filters in combination with [url:], as you can get an empty result set if they are used alone. I do queries like this: [url:http type:pdf] and it gives me the result I need. Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: All right! I've done this already. I think you didn't understand my question. What I want to do is query my indexes using something like filetype:pdf. Version 0.8 already has this feature, but I'm using version 0.7.2 and I want to add this feature manually. I don't know where I have to edit. Do you know? Regards, Lourival Junior
On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, To allow more formats to be indexed you need to modify nutch-site.xml and update/add the plugin.includes property (see nutch-default.xml for default settings). The following is what I have in nutch-site.xml:
    <property>
      <name>plugin.includes</name>
      <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
    </property>
[parse-*] is used to parse various formats, [query-more] allows you to use the [type:] filter in Nutch queries. Regards, Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: Hi Lukas and everybody! Do you know which file in Nutch 0.7.2 I should edit to add a field to my index (i.e. file type: PDF, Word or HTML)?
On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I am not sure if I can give you any useful hint, but the following is what once worked for me. Example of query: url:http date:20060801 The date: and type: options can be used in combination with url:. The filter url:http should select all documents (unless you allowed the file or ftp protocols). A plain date or type filter selects nothing if it is used alone. And be sure you don't introduce any space between the filter name and its value ([date: 20060801] is not the same as [date:20060801]). Lukas
On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote: Howie, I inspected my index using Luke and 20060801 shows up several times in the index. I'm unable to query pretty much any field. Several people seem to be having the same problem. Does anyone know what's going on? This is one of the last things I have to resolve to have Nutch deployed successfully at my organization. Unfortunately, Friday is my last day. Can anyone offer any assistance?? Thanks, Matt
Howie Wang wrote: I think that I have problems querying for numbers and words with digits in them. Now that I think of it, is it possible it has something to do with the stemming in either the query filter or indexing? In either case, I would print out the text that is being indexed and the phrases added to the query. You could also use Luke to inspect your index and see whether 20060801 shows up anywhere. Howie
I tried looking for a page that had the date 20060801 and the text test in the page. I tried the following: date: 20060801 test and date 20060721-20060803 test Neither worked, any ideas?? Matt
Matthew Holt wrote: Thanks Jake. However, it seems to me that it makes most sense that a query should return all pages that match the query, instead of acting as a content filter. However, I know it's something easy to suggest when you're not having to implement it, so just a suggestion. Matt
Vanderdray, Jacob wrote: Try querying
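Since plugin.includes is described in this thread as a regular expression over plugin directory names, a quick way to see which plugins a given value enables is to test names against it. The sketch below is only a model of that check, assuming the whole plugin name must match the expression; the function name is hypothetical. The two values are copied from the configurations quoted in the Querying Fields thread.

```python
import re

# Value quoted from Lukas Vlcek's nutch-site.xml in this thread.
LUKAS_VALUE = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|"
    "index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic"
)

# Value quoted from Matthew Holt's nutch-site.xml in this thread.
MATT_VALUE = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|"
    "index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic"
)

def is_included(plugin_includes, plugin_name):
    """Return True if the whole plugin name matches the expression."""
    return re.fullmatch(plugin_includes, plugin_name) is not None
```

Under this model both values enable `index-more` and `query-more`, while only the first one matches `query-basic` and `nutch-extensionpoints`. Whether that difference has anything to do with the empty results reported in the thread is not established here; the sketch only shows how to read the expression.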
Re: common-terms.utf8
Hi Timo! I analyzed the index before and after correctly using the common-terms.utf8 file. Before adding the common terms for my language, my index was about 3 MB. After adding the common terms it is now 5 MB! Why does this happen? Regards!
On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote: Hi Timo! Thanks a lot! Now I have a clear understanding of this file. This article helps a lot too: http://searchenginewatch.com/showPage.html?page=2156061 Thanks again!
On 8/11/06, Timo Scheuer [EMAIL PROTECTED] wrote: Hi,
Could anyone explain what exactly the common-terms.utf8 file does? I don't understand the real function of this file...
During indexing (and also during searching) the common terms are used to form n-grams, which makes searching faster for common words such as articles. It is an alternative to using stop words. N-grams keep the common words by appending them to the following word. This approach increases selectivity. Cheers, Timo.
common-terms.utf8
Hi, Could anyone explain what exactly the common-terms.utf8 file does? I don't understand the real function of this file... Regards,
Re: Querying Fields
Hi Lukas and everybody! Do you know which file in Nutch 0.7.2 I should edit to add a field to my index (i.e. file type: PDF, Word or HTML)?
On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I am not sure if I can give you any useful hint, but the following is what once worked for me. Example of query: url:http date:20060801 The date: and type: options can be used in combination with url:. The filter url:http should select all documents (unless you allowed the file or ftp protocols). A plain date or type filter selects nothing if it is used alone. And be sure you don't introduce any space between the filter name and its value ([date: 20060801] is not the same as [date:20060801]). Lukas
On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote: Howie, I inspected my index using Luke and 20060801 shows up several times in the index. I'm unable to query pretty much any field. Several people seem to be having the same problem. Does anyone know what's going on? This is one of the last things I have to resolve to have Nutch deployed successfully at my organization. Unfortunately, Friday is my last day. Can anyone offer any assistance?? Thanks, Matt
Howie Wang wrote: I think that I have problems querying for numbers and words with digits in them. Now that I think of it, is it possible it has something to do with the stemming in either the query filter or indexing? In either case, I would print out the text that is being indexed and the phrases added to the query. You could also use Luke to inspect your index and see whether 20060801 shows up anywhere. Howie
I tried looking for a page that had the date 20060801 and the text test in the page. I tried the following: date: 20060801 test and date 20060721-20060803 test Neither worked, any ideas?? Matt
Matthew Holt wrote: Thanks Jake. However, it seems to me that it makes most sense that a query should return all pages that match the query, instead of acting as a content filter. However, I know it's something easy to suggest when you're not having to implement it, so just a suggestion. Matt
Vanderdray, Jacob wrote: Try querying with both the date and something you'd expect to find in the content. The field query filter is just a filter. It only restricts your results to things that match the basic query and have the contents you require in the field. So if you query for date:2006080 text you'll be searching for documents that contain text in one of the default query fields and have the value 2006080 in the date field. Leaving out text in that example would essentially be asking for nothing in the default fields and 2006080 in the date field, which is why it doesn't return any results. Hope that helps, Jake.
-Original Message- From: Matthew Holt [mailto:[EMAIL PROTECTED] Sent: Wed 8/2/2006 4:58 PM To: nutch-user@lucene.apache.org Subject: Querying Fields
I am unable to query fields in my index in the method that has been suggested. I used Luke to examine my index and the following field types exist: anchor, boost, content, contentLength, date, digest, host, lastModified, primaryType, segment, site, subType, title, type, url. However, when I do a search using one of the fields, followed by a colon, an incorrect result is returned. I used Luke to find the top term in the date field, which is '20060801'. I then searched using the following query: date: 20060801 Unfortunately, nothing was returned. The correct plugins are enabled; here is an excerpt from my nutch-site.xml:
    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
      <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
    </property>
Any ideas? I'm not the only one having the same problem; I saw an earlier mailing list post but couldn't find any resolution... Thanks, Matt
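The two points made in this thread, that a space after the colon changes the query and that a field filter alone restricts nothing, can be illustrated with a toy tokenizer. This is not Nutch's query parser. It is a hypothetical model that treats `name:value` as a field clause only when both parts are present in the same token.

```python
def split_query(query):
    """Split a query into (field clauses, default-field terms).

    Toy model: a token counts as a field clause only if it has the
    form name:value with no space around the colon.
    """
    fields, terms = {}, []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            fields[name] = value
        else:
            terms.append(token)
    return fields, terms
```

In this model `date:20060801 test` yields one date clause plus the default term `test`, whereas `date: 20060801` yields no field clause at all, only two plain terms. A query such as `url:http date:20060801` yields two field clauses, which matches the combination Lukas reports as working.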
Re: Querying Fields
All right! I've done this already. I thing you dont understand my question. What I want to do is to query my indexes using something like filetype:pdf. The version 0.8 already have this feature. But I'm using the version 0.7.2 and I want to add this feature mannually. But I dont know where I have to edit. Do you know? Regards, Lourival Junior On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, To allow more formats to be indexed you need to modify nutch-site.xml and update/add plugin.includes property (see nutch-default.xml for default settings). The following is what I have in nutch-site.xml: property nameplugin.includes/name valuenutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic/value /property [parse-*] is used to parse various formats, [query-more] allows you to use [type:] filter in nutch queries. Regards, Lukas On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: Hi Lukas and everybody! Do you know which file in nutch 0.7.2 should I edit to add some field in my index (i.e. file type - PDF, Word or html)?' On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I am not sure if I can give you any useful hint but the follwoing is what once worked for me. Example of query: url:http date:20060801 date: and type: options can be used in combination with url: Filer url:http should select all documents (unless you allowed file, ftp protocols). Plain date ot type filter select onthing if they are used alone. And be sure you don't introduce any space between filter name and its value ([date: 20060801] is not the same as [date:20060801]) Lukas On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote: Howie, I inspected my index using Luke and 20060801 shows up several times in the index. I'm unable to query pretty much any field. Several people seem to be having the same problem. Does anyone know whats going on? 
This is one of the last things I have to resolve to have Nutch deployed successfully at my organization. Unfortunately, Friday is my last day. Can anyone offer any assistance? Thanks, Matt Howie Wang wrote: I think that I have problems querying for numbers and words with digits in them. Now that I think of it, is it possible it has something to do with the stemming in either the query filter or indexing? In either case, I would print out the text that is being indexed and the phrases added to the query. You could also use Luke to inspect your index and see whether 20060801 shows up anywhere. Howie I tried looking for a page that had the date 20060801 and the text test in the page. I tried the following: date: 20060801 test and date 20060721-20060803 test Neither worked, any ideas? Matt Matthew Holt wrote: Thanks Jake, However, it seems to me that it makes most sense for a query to return all pages that match the query, instead of acting as a content filter. However, I know it's easy to suggest something when you're not the one implementing it, so just a suggestion. Matt Vanderdray, Jacob wrote: Try querying with both the date and something you'd expect to find in the content. The field query filter is just a filter. It only restricts your results to things that match the basic query and have the contents you require in the field. So if you query for date:2006080 text you'll be searching for documents that contain text in one of the default query fields and have the value 2006080 in the date field. Leaving out text in that example would essentially be asking for nothing in the default fields and 2006080 in the date field, which is why it doesn't return any results. Hope that helps, Jake. -Original Message- From: Matthew Holt [mailto:[EMAIL PROTECTED] Sent: Wed 8/2/2006 4:58 PM To: nutch-user@lucene.apache.org Subject: Querying Fields I am unable to query fields in my index using the method that has been suggested. 
I used Luke to examine my index and the following fields exist: anchor, boost, content, contentLength, date, digest, host, lastModified, primaryType, segment, site, subType, title, type, url However, when I do a search using one of the fields followed by a colon, an incorrect result is returned. I used Luke to find the top term in the date field, which is '20060801'. I then searched using the following query: date: 20060801 Unfortunately, nothing was returned. The correct plugins are enabled; here is an excerpt from my nutch-site.xml: <property> <name>plugin.includes</name> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index
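The two rules that come out of this thread (no space between a field name and its value, and a date: or type: filter needs an ordinary term or a url: clause next to it) can be sketched as a small check to run on a query before sending it. This is only an illustration: lint_query is a hypothetical helper, not part of Nutch, and treating exactly date: and type: as filter-only fields is an assumption based on the messages above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: checks a query string against the
# two rules described in this thread before it is sent to the search page.
lint_query() (
  set -f   # disable globbing while the query is split into words
  case $1 in
    *": "*|*:) echo "space-after-colon"; exit 1 ;;
  esac
  plain=0
  for term in $1; do
    case $term in
      date:*|type:*) ;;              # assumed filter-only fields
      *) plain=$((plain + 1)) ;;     # ordinary term, or a clause such as url:http
    esac
  done
  if [ "$plain" -eq 0 ]; then
    echo "filter-only"; exit 1
  fi
  echo "ok"
)

lint_query 'date: 20060801 test'   # the form that returned nothing in the thread
lint_query 'date:20060801 test'    # the form suggested in the replies
```

The function body runs in a subshell, so the `set -f` and the `exit` calls do not affect the calling shell.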
Re: Recrawl urls
Hi Nahuel! You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Take a look at the adddays argument and the configuration property db.default.fetch.interval. They influence the result. Regards! On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: Hello, I was searching for the method to add new urls to the crawling url list and how to recrawl all urls... Can you help me? thanks, -- Nahuel ANGELINETTI -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: 0.8 Recrawl script updated
Hi Matthew! Could you update the script for version 0.7.2 with the same functionality? I wrote a script that does this, but it doesn't work very well... Regards! On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote: Just letting everyone know that I updated the recrawl script on the Wiki. It now merges the created segments, then deletes the old segs to prevent a lot of unneeded data remaining/growing on the hard drive. Matt http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: Recrawl urls
Which version are you using? On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: But the websites just added hasn't been yet crawled... And they're not crawled during recrawl... Does bin/nutch purge will restart all ? Le Thu, 3 Aug 2006 09:21:04 -0300, Lourival Júnior [EMAIL PROTECTED] a écrit : In the nutch conf/nutch-default.xml configuration file exist a property call db.default.fetch.interval. When you crawl a site, nutch schedules the next fetch to today + db.default.fetch.interval days. If execute the recrawl command and the pages that you fetch don't reach this date, they won't be re-fetched. When you add new urls to the webdb, they will be ready to be fetch. So at this moment only this pages will be fetched by the recrawl script. I hope I helped you. If I said some wrong thing, please correct me :) Regards On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: I have another question, I done what you give me... But it inject the new urls and recrawl it, but against the first crawl It doesn't download the web pages and really crawl them... perhaps I'm mistaking somewhere... Any idea ? Regards, -- Nahuel ANGELINETTI Le Thu, 3 Aug 2006 08:31:22 -0300, Lourival Júnior [EMAIL PROTECTED] a écrit : Hi Nahuel! You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script. http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Take a look to the adddays argument and to the configuration property db.default.fetch.interval.They influence to the result. Regards! On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: Hello, I was searching for the method to add new url to the crawling url list and how to recrawl all urls... Can you help me ? thanks, -- Nahuel ANGELINETTI -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: Recrawl urls
The command bin/nutch purge doesn't exist. Well, I can't tell what is happening. Send me the output you get when you run the recrawl. On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: 0.7.2 of nutch On Thu, 3 Aug 2006 09:37:24 -0300, Lourival Júnior [EMAIL PROTECTED] wrote: Which version are you using? On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: But the websites just added haven't been crawled yet... And they're not crawled during the recrawl... Will bin/nutch purge restart everything? On Thu, 3 Aug 2006 09:21:04 -0300, Lourival Júnior [EMAIL PROTECTED] wrote: In the nutch conf/nutch-default.xml configuration file there is a property called db.default.fetch.interval. When you crawl a site, nutch schedules the next fetch for today + db.default.fetch.interval days. If you execute the recrawl command and the pages have not reached this date, they won't be re-fetched. When you add new urls to the webdb, they are ready to be fetched right away. So at this moment only these pages will be fetched by the recrawl script. I hope this helps. If I said something wrong, please correct me :) Regards On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: I have another question. I did what you gave me... It injects the new urls and recrawls, but unlike the first crawl it doesn't download the web pages and really crawl them... perhaps I'm making a mistake somewhere... Any idea? Regards, -- Nahuel ANGELINETTI On Thu, 3 Aug 2006 08:31:22 -0300, Lourival Júnior [EMAIL PROTECTED] wrote: Hi Nahuel! You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Take a look at the adddays argument and the configuration property db.default.fetch.interval. They influence the result. Regards! 
On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: Hello, I was searching for the method to add new url to the crawling url list and how to recrawl all urls... Can you help me ? thanks, -- Nahuel ANGELINETTI -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
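The scheduling rule described in this thread (next fetch = last fetch + db.default.fetch.interval days, with -adddays moving the clock forward) can be sketched with plain day numbers. This is a simplified model and not Nutch code: is_due is a hypothetical function, the interval of 30 is just an example value, and treating a page as due on exactly the scheduled day is an assumption.

```shell
#!/bin/sh
# Illustration only: models the rule "next fetch = last fetch + interval"
# with plain day numbers; it does not read a real WebDB.
# usage: is_due LAST_FETCH_DAY INTERVAL_DAYS TODAY ADDDAYS
is_due() {
  next_fetch=$(( $1 + $2 ))
  effective_today=$(( $3 + $4 ))   # -adddays moves the clock forward
  [ "$effective_today" -ge "$next_fetch" ]
}

# A page fetched on day 0 with an interval of 30 days, checked on day 10.
if is_due 0 30 10 0; then echo "adddays 0: due"; else echo "adddays 0: not due"; fi
if is_due 0 30 10 20; then echo "adddays 20: due"; else echo "adddays 20: not due"; fi
```

Under this model a recrawl run shortly after the first crawl selects only newly injected urls, unless adddays is large enough to cover the remaining interval.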
NullPointException
Why does the search application throw a NullPointerException when I delete segments that have reached db.default.fetch.interval? I have to recrawl my site periodically, and deleting old segments is a problem. Does anyone have a suggestion? Regards -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: NullPointException
All right. Take a look at this output of the segread command:

060803 132735 PARSED? STARTED           FINISHED          COUNT DIR NAME
060803 132735 true    20060717-14:41:58 20060717-14:41:58 1     crawl-legislacao_copia/segments/20060717144154
060803 132735 true    20060717-14:42:03 20060717-14:43:22 77    crawl-legislacao_copia/segments/20060717144201
060803 132735 true    20060717-14:43:29 20060717-15:08:10 1464  crawl-legislacao_copia/segments/20060717144327
060803 132735 true    20060717-15:08:17 20060717-15:11:58 223   crawl-legislacao_copia/segments/20060717150815
060803 132736 true    20060718-09:02:56 20060718-09:03:10 5     crawl-legislacao_copia/segments/20060718090250
060803 132736 true    20060803-10:55:18 20060803-12:53:49 1541  crawl-legislacao_copia/segments/20060803105509
060803 132736 true    20060803-13:07:15 20060803-13:07:20 4     crawl-legislacao_copia/segments/20060803130707
060803 132736 TOTAL: 3315 entries in 7 segments.

My db.default.fetch.interval is 15. Before I ran the recrawl script I had 5 segments (200607*) and the index pointed to 1537 documents. After running the recrawl, 2 segments were created and the script indexed everything. When I analyzed the generated index I saw it had 1541 documents. As you can see, the 200607* segments are old and could be deleted. I did this: rm -rf segments/200607* Then I got the NPE. Right, I must re-index the 2 remaining segments. I've done this. When I analyzed the index again, it had only 1417 documents! My questions: Why does this occur? How can I know which segments can be deleted? I hope you can help me. On 8/3/06, Marko Bauhardt [EMAIL PROTECTED] wrote: Hi, if you delete segments, be sure that you don't have an index built from those segments. The segment contains the parsed content and the index is built from this content. If you delete the segment and search this index, an NPE occurs because no summary (parsed content) is found. 
HTH Marko Am 03.08.2006 um 16:33 schrieb Lourival Júnior: Why when I delete some segments that reach the db.default.fetcth.intervalthe search application gets the nullPointerException? Periodically I have to recrawl my Site. And delete old segments is a problem. Someone have a suggestion? Regards -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED] -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
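One cautious way to act on Marko's advice is to move old segments aside instead of deleting them, so they can be put back if the index turns out to depend on them. The sketch below is not a Nutch command: retire_segments and the holding directory are hypothetical, and it assumes segment directories are named with 14-digit timestamps as in the segread output above.

```shell
#!/bin/sh
# Hypothetical helper, not a Nutch command: moves segments older than a
# cutoff into a holding directory so they can be restored if the index
# still needs them.
# usage: retire_segments SEGMENTS_DIR CUTOFF HOLDING_DIR
#   CUTOFF is a 14-digit timestamp such as 20060801000000
retire_segments() {
  mkdir -p "$3" || return 1
  for seg in "$1"/*; do
    [ -d "$seg" ] || continue
    name=$(basename "$seg")
    case $name in
      *[!0-9]*) continue ;;          # skip anything that is not a timestamp
    esac
    if [ "$name" -lt "$2" ]; then
      mv "$seg" "$3/" && echo "$name"   # print each segment that was moved
    fi
  done
}
```

After moving the segments, the index would still have to be rebuilt from the segments that remain and checked with a few searches before the holding directory is removed for good. Whether the remaining segments contain every page you need is a separate question that this sketch does not answer.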
ZIP plugin in nutch 0.7.2
Hi all!! Could I use the zip plugin from nutch 0.8 in nutch 0.7.2? Is there any problem? Regards. -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
java.lang.NoClassDefFoundError
Hi all! I'm testing the nutch 0.8. But I get this error in this simple command: $ bin/nutch readdb java.lang.NoClassDefFoundError: and Exception in thread main I've set the NUTCH_JAVA_HOME variable, but I'm sure it is the root cause of this. What is occurring? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Total time of a search
Hi, Does anybody know how to calculate the total time of a search? Currently I use this, but I'm not sure about it:

Date d = new Date();
int iniTime = (int) d.getTime(); // start time of the index search

// The index search is executed here.
try {
    hits = bean.search(query, start + hitsToRetrieve, hitsPerSite, site, sort, reverse);
} catch (IOException e) {
    hits = new Hits(0, new Hit[0]);
}
int end = (int) Math.min(hits.getLength(), start + hitsPerPage);
int length = end - start;
int realEnd = (int) Math.min(hits.getLength(), start + hitsToRetrieve);
Hit[] show = hits.getHits(start, realEnd - start);
HitDetails[] details = bean.getDetails(show);
String[] summaries = bean.getSummary(details, query);

Date d2 = new Date();
int endTime = (int) d2.getTime();
int totalTime = endTime - iniTime; // execution time in milliseconds
double totalTimeInSec = (double) totalTime / (double) 1000;

Is it correct? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: installation de nutch
Try deleting the crawl directory in /root/nutch-0.7.2/, then run the command again. On 7/26/06, kawther khazri [EMAIL PROTECTED] wrote: Hi I am trying to run Nutch by following the instructions given in the tutorial. The environment is FEDORA 5, JDK 1.4.2 and Nutch 0.7.2 And of course Tomcat 5 I get the following errors: [EMAIL PROTECTED] ~]# /root/nutch-0.7.2/bin/nutch crawl urls -dir crawl -depth 3 -topN50 run java in /usr/lib/jvm/jre 060726 141458 parsing file:/root/nutch-0.7.2/conf/nutch-default.xml 060726 141458 parsing file:/root/nutch-0.7.2/conf/crawl-tool.xml 060726 141458 parsing file:/root/nutch-0.7.2/conf/nutch-site.xml 060726 141458 No FS indicated, using default:local Exception in thread main java.lang.RuntimeException: crawl already exists. at org.apache.nutch.tools.CrawlTool.main (CrawlTool.java:121) I really appreciate any help that I can get. Thanks a lot -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: Recrawl script for 0.8.0 completed...
You wanna say that only in windows this error occurs? I haven't tested in linux yet. Has anyone a solution for this problem in windows/tomcat? On 7/25/06, Thomas Delnoij [EMAIL PROTECTED] wrote: Lourival. I have typically seen the same issues on a cygwin/windows setup. The only thing that worked for me was shutting down and restarting tomcat, instead of just reloading the context. On linux now I don't have these issues anymore. Rgrds, Thomas On 7/21/06, Lourival Júnior [EMAIL PROTECTED] wrote: Ok. However a few minutes ago I ran the script exactly you said and I still get this error: Exception in thread main java.io.IOException: Cannot delete _0.f0 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java :195) at org.apache.lucene.store.FSDirectory.init(FSDirectory.java :176) at org.apache.lucene.store.FSDirectory.getDirectory( FSDirectory.java :141) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java :225) at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java :92) at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java :160) I dont know but I thing it occurs because nutch tries to delete some file that tomcat loads to the memory, giving permission access error. Any idea? On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote: Lourival Júnior wrote: I thing it wont work with me because i'm using the Nutch version 0.7.2. 
Actually I use this script (some comments are in Portuguese): #!/bin/bash # A simple script to run a Nutch re-crawl # Fonte do script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html #{ if [ -n $1 ] then crawl_dir=$1 else echo Usage: recrawl crawl_dir [depth] [adddays] exit 1 fi if [ -n $2 ] then depth=$2 else depth=5 fi if [ -n $3 ] then adddays=$3 else adddays=0 fi webdb_dir=$crawl_dir/db segments_dir=$crawl_dir/segments index_dir=$crawl_dir/index #Para o serviço do TomCat #net stop Apache Tomcat # The generate/fetch/update cycle for ((i=1; i = depth ; i++)) do bin/nutch generate $webdb_dir $segments_dir -adddays $adddays segment=`ls -d $segments_dir/* | tail -1` bin/nutch fetch $segment bin/nutch updatedb $webdb_dir $segment echo echo Fim do ciclo $i. echo done # Update segments echo echo Atualizando os Segmentos... echo mkdir tmp bin/nutch updatesegs $webdb_dir $segments_dir tmp rm -R tmp # Index segments echo Indexando os segmentos... echo for segment in `ls -d $segments_dir/* | tail -$depth` do bin/nutch index $segment done # De-duplicate indexes # bogus argument is ignored but needed due to # a bug in the number of args expected bin/nutch dedup $segments_dir bogus # Merge indexes #echo Unindo os segmentos... #echo ls -d $segments_dir/* | xargs bin/nutch merge $index_dir chmod 777 -R $index_dir #Inicia o serviço do TomCat #net start Apache Tomcat echo Fim. #} recrawl.log 21 How you suggested I used the touch command instead stops the tomcat. However I get that error posted in previous message. I'm running nutch in windows plataform with cygwin. I only get no errors when I stops the tomcat. I use this command to call the script: ./recrawl crawl-legislacao 1 Could you give me more clarifications? Thanks a lot! On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote: Lourival Júnior wrote: Hi Renaud! I'm newbie with shell scripts and I know stops tomcat service is not the better way to do this. 
The problem is, when a run the re-crawl script with tomcat started I get this error: 060721 132224 merging segment indexes to: crawl-legislacao2\index Exception in thread main java.io.IOException: Cannot delete _0.f0 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195) at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java :141) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:225) at org.apache.nutch.indexer.IndexMerger.merge( IndexMerger.java :92) at org.apache.nutch.indexer.IndexMerger.main( IndexMerger.java :160) So, I want another way to re-crawl my pages without this error and without restarting the tomcat. Could you suggest one? Thanks a lot! Try this updated script and tell me what command exactly you run to call the script. Let me know the error message then. Matt #!/bin/bash # Nutch recrawl script. # Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html # Modified by Matthew Holt if [ -n $1
Re: Recrawl script for 0.8.0 completed...
Hi Matt! In the article found at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html you said the re-crawl script has a problem with updating the live search index. In my tests with Nutch version 0.7.2, when I run the script the index cannot be updated because Tomcat has it loaded in memory. Could you suggest a modification to this script or to the NutchBean that accepts modifications to the index without restarting Tomcat? (Currently I run net stop Apache Tomcat before updating the index...) Thanks On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote: Thanks for putting up with all the messages to the list... Here is the recrawl script for 0.8.0 if anyone is interested. Matt ---

#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
# Modified by Matthew Holt

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi

# EDIT THIS - List the location to your nutch servlet container.
nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/

# No need to edit anything past this line #
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

# Update segments
bin/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

# De-duplicate indexes
bin/nutch dedup $new_indexes

# Merge indexes
bin/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml

# Clean up
rm -rf $new_indexes

-- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
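The argument handling at the top of the recrawl script can be written more compactly with default-value parameter expansion. The sketch below covers only that part: the variable names and defaults follow the script, while parse_args is a hypothetical wrapper added for illustration. Quoting matters here, because an unquoted test such as [ -n $1 ] succeeds even when no argument is given.

```shell
#!/bin/sh
# Sketch of the argument handling only; variable names and defaults follow
# the recrawl script, parse_args itself is a hypothetical wrapper.
parse_args() {
  if [ -z "${1:-}" ]; then
    echo "Usage: recrawl crawl_dir [depth] [adddays]" >&2
    return 1
  fi
  crawl_dir=$1
  depth=${2:-5}      # link depth, defaults to 5
  adddays=${3:-0}    # days to advance the clock, defaults to 0
}

parse_args crawl-legislacao 1 && echo "crawl_dir=$crawl_dir depth=$depth adddays=$adddays"
```

Called as in the thread (./recrawl crawl-legislacao 1), this would leave depth at 1 and adddays at its default of 0.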
Re: Recrawl script for 0.8.0 completed...
Hi Renaud! I'm newbie with shell scripts and I know stops tomcat service is not the better way to do this. The problem is, when a run the re-crawl script with tomcat started I get this error: 060721 132224 merging segment indexes to: crawl-legislacao2\index Exception in thread main java.io.IOException: Cannot delete _0.f0 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195) at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java :141) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:225) at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92) at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160) So, I want another way to re-crawl my pages without this error and without restarting the tomcat. Could you suggest one? Thanks a lot! On 7/21/06, Renaud Richardet [EMAIL PROTECTED] wrote: Hi Matt and Lourival, Matt, thank you for the recrawl script. Any plans to commit it to trunk? Lourival, here's in the script what reloads Tomcat, not the cleanest, but it should work # Tell Tomcat to reload index touch $nutch_dir/WEB-INF/web.xml HTH, Renaud Lourival Júnior wrote: Hi Matt! In the article found at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.htmlyou said the re-crawl script have a problem with updating the live search index. In my tests with Nutch version 0.7.2 when I run the script the index could not be update because the tomcat loads it to the memory. Could you suggest a modification to this script or to the NutchBean that accepts modifications to the index without restart tomcat (Actually, I use net stop Apache Tomcat before the index updation...)? Thanks On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote: Thanks for putting up with all the messages to the list... Here is the recrawl script for 0.8.0 if anyone is interested. Matt --- #!/bin/bash # Nutch recrawl script. 
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html # Modified by Matthew Holt if [ -n $1 ] then crawl_dir=$1 else echo Usage: recrawl crawl_dir [depth] [adddays] exit 1 fi if [ -n $2 ] then depth=$2 else depth=5 fi if [ -n $3 ] then adddays=$3 else adddays=0 fi # EDIT THIS - List the location to your nutch servlet container. nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/ # No need to edit anything past this line # webdb_dir=$crawl_dir/crawldb segments_dir=$crawl_dir/segments linkdb_dir=$crawl_dir/linkdb index_dir=$crawl_dir/index # The generate/fetch/update cycle for ((i=1; i = depth ; i++)) do bin/nutch generate $webdb_dir $segments_dir -adddays $adddays segment=`ls -d $segments_dir/* | tail -1` bin/nutch fetch $segment bin/nutch updatedb $webdb_dir $segment done # Update segments bin/nutch invertlinks $linkdb_dir -dir $segments_dir # Index segments new_indexes=$crawl_dir/newindexes #ls -d $segments_dir/* | tail -$depth | xargs bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/* # De-duplicate indexes bin/nutch dedup $new_indexes # Merge indexes bin/nutch merge $index_dir $new_indexes # Tell Tomcat to reload index touch $nutch_dir/WEB-INF/web.xml # Clean up rm -rf $new_indexes -- Renaud Richardet COO America Wyona Inc. - Open Source Content Management - Apache Lenya office +1 857 776-3195 mobile +1 617 230 9112 renaud.richardet at wyona.com http://www.wyona.com -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: Recrawl script for 0.8.0 completed...
Ok. However a few minutes ago I ran the script exactly you said and I still get this error: Exception in thread main java.io.IOException: Cannot delete _0.f0 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195) at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java :141) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:225) at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92) at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160) I dont know but I thing it occurs because nutch tries to delete some file that tomcat loads to the memory, giving permission access error. Any idea? On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote: Lourival Júnior wrote: I thing it wont work with me because i'm using the Nutch version 0.7.2. Actually I use this script (some comments are in Portuguese): #!/bin/bash # A simple script to run a Nutch re-crawl # Fonte do script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html #{ if [ -n $1 ] then crawl_dir=$1 else echo Usage: recrawl crawl_dir [depth] [adddays] exit 1 fi if [ -n $2 ] then depth=$2 else depth=5 fi if [ -n $3 ] then adddays=$3 else adddays=0 fi webdb_dir=$crawl_dir/db segments_dir=$crawl_dir/segments index_dir=$crawl_dir/index #Para o serviço do TomCat #net stop Apache Tomcat # The generate/fetch/update cycle for ((i=1; i = depth ; i++)) do bin/nutch generate $webdb_dir $segments_dir -adddays $adddays segment=`ls -d $segments_dir/* | tail -1` bin/nutch fetch $segment bin/nutch updatedb $webdb_dir $segment echo echo Fim do ciclo $i. echo done # Update segments echo echo Atualizando os Segmentos... echo mkdir tmp bin/nutch updatesegs $webdb_dir $segments_dir tmp rm -R tmp # Index segments echo Indexando os segmentos... 
echo for segment in `ls -d $segments_dir/* | tail -$depth` do bin/nutch index $segment done # De-duplicate indexes # bogus argument is ignored but needed due to # a bug in the number of args expected bin/nutch dedup $segments_dir bogus # Merge indexes #echo Unindo os segmentos... #echo ls -d $segments_dir/* | xargs bin/nutch merge $index_dir chmod 777 -R $index_dir #Inicia o serviço do TomCat #net start Apache Tomcat echo Fim. #} recrawl.log 21 How you suggested I used the touch command instead stops the tomcat. However I get that error posted in previous message. I'm running nutch in windows plataform with cygwin. I only get no errors when I stops the tomcat. I use this command to call the script: ./recrawl crawl-legislacao 1 Could you give me more clarifications? Thanks a lot! On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote: Lourival Júnior wrote: Hi Renaud! I'm newbie with shell scripts and I know stops tomcat service is not the better way to do this. The problem is, when a run the re-crawl script with tomcat started I get this error: 060721 132224 merging segment indexes to: crawl-legislacao2\index Exception in thread main java.io.IOException: Cannot delete _0.f0 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195) at org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176) at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java :141) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:225) at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java :92) at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java :160) So, I want another way to re-crawl my pages without this error and without restarting the tomcat. Could you suggest one? Thanks a lot! Try this updated script and tell me what command exactly you run to call the script. Let me know the error message then. Matt #!/bin/bash # Nutch recrawl script. 
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html # Modified by Matthew Holt if [ -n $1 ] then nutch_dir=$1 else echo Usage: recrawl servlet_path crawl_dir [depth] [adddays] echo servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT) echo crawl_dir - Name of the directory the crawl is located in. echo [depth] - The link depth from the root page that should be crawled. echo [adddays] - Advance the clock # of days for fetchlist generation. exit 1 fi if [ -n $2 ] then crawl_dir=$2 else echo Usage: recrawl servlet_path crawl_dir [depth] [adddays] echo servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT) echo crawl_dir - Name of the directory the crawl is located in. echo [depth] - The link depth from the root page that should be crawled. echo [adddays] - Advance the clock # of days for fetchlist generation. exit 1
Unused Segments
How can I discover which segments are unused by the index? After many recrawls I have a lot of segments, so I would like to erase some of them... Can anyone help me? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Recrawl a specific web Page
How can I recrawl a specific web page? For example, I have an HTML page that is constantly updated. Is there a command for that? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: question about plugins
Hi! Don't worry, I know what you mean. You have to modify the nutch-site.xml configuration file in the conf directory. Take a look at an example: <nutch-conf> <property> <name>plugin.includes</name> <value>nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value> <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded.</description> </property> <property> <name>http.content.limit</name> <value>-1</value> <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. </description> </property> </nutch-conf> Regards, Lourival Junior On 7/11/06, Abdelhakim Diab [EMAIL PROTECTED] wrote: How can I add a plugin to my search application and how can I activate it? Sorry if my question is stupid, but I am a newbie to nutch. -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
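Because plugin.includes is a regular expression over plugin directory names, it can help to test a name against the value before restarting anything. The sketch below does that with grep; plugin_enabled is a hypothetical helper, not a Nutch tool, and it assumes the whole directory name has to match the expression.

```shell
#!/bin/sh
# Hypothetical check, not a Nutch tool: tests a plugin directory name against
# a plugin.includes value, assuming the whole name has to match.
plugin_enabled() {
  printf '%s\n' "$2" | grep -E -x -q -e "$1"
}

# The value from the example configuration in this message.
includes='nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)'

for plugin in parse-pdf parse-zip index-more; do
  if plugin_enabled "$includes" "$plugin"; then
    echo "$plugin: included"
  else
    echo "$plugin: excluded"
  fi
done
```

With this particular value, parse-zip and index-more do not match, so those plugins would have to be added to the expression before they could be used.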
Re: OpenOffice Support?
Taking advantage of your question: does anyone know if version 0.7.2 of nutch supports the zip plugin? If so, where can I find it? Lourival Junior On 7/11/06, Matthew Holt [EMAIL PROTECTED] wrote: Just wondering, has anyone done any work on a plugin (or is anyone aware of a plugin) that supports the indexing of open office documents? Thanks. Matt -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Number of pages different to number of indexed pages
Hi all! I have a small question. My WebDB currently contains 779 pages with 899 links. When I use the segread command it also reports 779 pages in one segment. However, when I run a search or inspect the index with Luke, the maximum number of documents is 437. I've looked at the recrawl logs, and while the script is fetching pages some of them show the message: ... failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later. I think this happens because of some network problem: the fetcher tries to fetch a page but does not get it. Because of this, when the segment is indexed, only the fetched pages appear in the results. This is a problem for me. Could someone explain what I should do to refetch these pages and increase my search results? Should I change the http.max.delays and fetcher.server.delay properties in nutch-default.xml? Regards, -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
Re: Number of pages different to number of indexed pages
Yes! It really works! I'm running the recrawl right now, and it is fetching the pages that it had not fetched before... It takes longer, but the final result is more important. Thanks a lot!

On 7/7/06, Honda-Search Administrator [EMAIL PROTECTED] wrote: This is typical if you are crawling only a few sites. I crawl 7 sites nightly and often get this error. I changed my http.max.delays property from 3 to 50 and it works without a problem. The crawl takes longer, but I get almost all of the pages.

-- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]
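A rough way to see why the crawl takes longer after raising http.max.delays from 3 to 50 is to estimate how long a fetcher thread may wait on one busy host before giving up. This is only a sketch: it assumes a thread waits fetcher.server.delay each time it finds the host busy and gives up after http.max.delays such waits. The 5-second delay is an assumed value that does not come from this thread, and the class and method names are hypothetical.

```java
final class FetchDelayBudget {
    // Upper bound, in seconds, on the time one fetcher thread spends waiting
    // for a busy host, under the assumption stated above.
    static double maxWaitSeconds(int httpMaxDelays, double fetcherServerDelaySeconds) {
        return httpMaxDelays * fetcherServerDelaySeconds;
    }

    public static void main(String[] args) {
        double serverDelay = 5.0; // assumed fetcher.server.delay, in seconds

        // http.max.delays = 3, the value before the change described above
        System.out.println("max.delays=3:  " + maxWaitSeconds(3, serverDelay) + " s");

        // http.max.delays = 50, the value after the change
        System.out.println("max.delays=50: " + maxWaitSeconds(50, serverDelay) + " s");
    }
}
```

With these numbers the waiting budget per page grows from 15 to 250 seconds, which fits the observation that fewer pages fail with RetryLater while the crawl as a whole runs longer.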
Index algorithm
Could anyone point me to a link or document about nutch's indexing algorithm? I haven't found many... Regards -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]