Re: Query pdf, etc..

2007-04-24 Thread Lourival Júnior
You can use the index-more and query-more plugins to create a field in your index indicating the file type of each document. Then, in your search, you can use type:pdf or type:msword to filter these files. I used nutch 0.7.2 to make it work... Regards, Lourival Júnior On 4/24/07, ekoje ekoje [EMAIL

Re: Using nutch as a web crawler

2007-04-05 Thread Lourival Júnior
PROTECTED] wrote: Thanks. Can you please tell me how I can plug in my own handling when nutch sees a site, instead of building the search database for that site? On 4/3/07, Lourival Júnior [EMAIL PROTECTED] wrote: I am completely certain that nutch is what you are looking for. Take a look

Re: Using nutch as a web crawler

2007-04-03 Thread Lourival Júnior
I am completely certain that nutch is what you are looking for. Take a look at nutch's documentation for more details and you will see :). On 4/3/07, Meryl Silverburgh [EMAIL PROTECTED] wrote: Hi, I would like to know if it is a good idea to use the nutch web crawler? Basically, this is

Re: java.lang.NoClassDefFoundError

2006-12-01 Thread Lourival Júnior
nutch 0.8, just checked out. Exception in thread main java.lang.NoClassDefFoundError This is on OS X 10.4.7. Older nutch runs fine. --- Lourival Júnior [EMAIL PROTECTED] wrote: Hi all! I'm testing the nutch 0.8. But I get this error in this simple command: $ bin/nutch readdb

Re: Common terms

2006-09-25 Thread Lourival Júnior
Have you reindexed your segments? It's important, because it makes nutch recognize your common terms. I've tried it and the only thing I noticed was that the index is bigger than the original (before using the common terms). On 9/25/06, carmmello [EMAIL PROTECTED] wrote: I'm using Nutch

Re: Common terms

2006-09-25 Thread Lourival Júnior
again - Original Message - From: Lourival Júnior [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Monday, September 25, 2006 3:58 PM Subject: Re: Common terms Have you reindexed your segments? It's important, because it makes nutch recognize your common terms. I've tried

ZIP parser in Nutch 0.7.2

2006-09-05 Thread Lourival Júnior
Hi all! Has anyone successfully implemented the ZIP plugin in nutch version 0.7.2? How can I do this? Regards, -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: indexing folders with nutch

2006-09-01 Thread Lourival Júnior
Yes Cam, if you use a depth of 1 you will crawl only the first document. With a depth of 2 you will crawl the first document and all the links found in that document. With a depth of 3, you will crawl the first one, its links and all links found in cycle 2. And so on. Increasing your depth will increase
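The depth behaviour described above can be sketched as a round-based crawl over a small in-memory link graph. This is only an illustration in Java, not Nutch's actual implementation: the class DepthCrawlSketch, the crawl method and the link map are hypothetical names invented for this example, and real fetching, parsing and URL filtering are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one "round" per depth level, starting from a single seed.
public class DepthCrawlSketch {

    // Returns the pages fetched after `depth` rounds, in fetch order.
    public static Set<String> crawl(Map<String, List<String>> links, String seed, int depth) {
        Set<String> fetched = new LinkedHashSet<>();
        List<String> frontier = Collections.singletonList(seed);
        for (int round = 0; round < depth && !frontier.isEmpty(); round++) {
            List<String> next = new ArrayList<>();
            for (String page : frontier) {
                if (!fetched.add(page)) {
                    continue; // already fetched in this or an earlier round
                }
                // Links found on this page become candidates for the next round.
                for (String out : links.getOrDefault(page, Collections.<String>emptyList())) {
                    if (!fetched.contains(out)) {
                        next.add(out);
                    }
                }
            }
            frontier = next;
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Made-up link graph: a links to b and c, both link to d, d links to e.
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        links.put("c", Arrays.asList("d"));
        links.put("d", Arrays.asList("e"));

        System.out.println(crawl(links, "a", 1)); // [a]
        System.out.println(crawl(links, "a", 2)); // [a, b, c]
        System.out.println(crawl(links, "a", 3)); // [a, b, c, d]
    }
}
```

Each extra level of depth adds one more round, so the number of fetched pages (and the crawl time) can grow quickly on a well-linked site.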

Re: index/search filtering by category

2006-08-23 Thread Lourival Júnior
it unless you write a plugin to parse a custom meta tag called category. I'm trying to do something like this now, but the plugin documentation is horrible. Lourival Júnior wrote: Hi Ernesto! I know what you mean. Sometimes I get no answers too. Unfortunately, I'm new to nutch and lucene

Re: index/search filtering by category

2006-08-22 Thread Lourival Júnior
Hi Ernesto! I know what you mean. Sometimes I get no answers too. Unfortunately, I'm new to nutch and lucene and I can't help you. Keep trying, the community will help you :). On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote: Hi All Please, can somebody answer my questions? I'm a

Zip Plugin

2006-08-21 Thread Lourival Júnior
Has anyone succeeded in implementing the Zip parse plugin in nutch 0.7.2? Regards -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: Querying Fields

2006-08-14 Thread Lourival Júnior
to the point that you won't be able to answer STFW questions in mail-lists... etc. :-) Regards, Lukas On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote: Yes yes, I tested the index-more and query-more plugins. They allow searching these fields easily. However, if I could find documentation

Re: common-terms.utf8

2006-08-11 Thread Lourival Júnior
Hi Timo! Thanks a lot! Now I have a clear understanding of this file. This article helps a lot too: http://searchenginewatch.com/showPage.html?page=2156061 Thanks again! On 8/11/06, Timo Scheuer [EMAIL PROTECTED] wrote: Hi, Could anyone explain to me what exactly the common-terms.utf8

Re: Querying Fields

2006-08-11 Thread Lourival Júnior
to find out what exactly it does. As far as I know it does not add any new field to the index (that should be done via the index-more plugin) but it allows you to query using type: date: and site: I think. Lukas On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: What exactly does the query-more

Re: common-terms.utf8

2006-08-11 Thread Lourival Júnior
Hi Timo! I analyzed the index before and after correctly using the common-terms.utf8 file. Before adding the common terms for my language, my index was about 3 MB. After adding the common terms, it is now 5 MB! Why does this happen? Regards! On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote: Hi Timo

common-terms.utf8

2006-08-10 Thread Lourival Júnior
Hi, Could anyone explain to me what exactly the common-terms.utf8 file does? I don't understand the real function of this file... Regards, -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: Querying Fields

2006-08-09 Thread Lourival Júnior
Hi Lukas and everybody! Do you know which file in nutch 0.7.2 I should edit to add a field to my index (e.g. file type: PDF, Word or HTML)? On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I am not sure if I can give you any useful hint but the following is what once worked for me.

Re: Querying Fields

2006-08-09 Thread Lourival Júnior
formats, [query-more] allows you to use the [type:] filter in nutch queries. Regards, Lukas On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote: Hi Lukas and everybody! Do you know which file in nutch 0.7.2 I should edit to add a field to my index (e.g. file type: PDF, Word or HTML)? On 8/8/06

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior
Hi Nahuel! You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html Take a look at the adddays argument and at the configuration property

Re: 0.8 Recrawl script updated

2006-08-03 Thread Lourival Júnior
Hi Matthew! Could you port the script to version 0.7.2 with the same functionality? I wrote a script that does this, but it doesn't work very well... Regards! On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote: Just letting everyone know that I updated the recrawl script on the Wiki. It now

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior
Which version are you using? On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: But the websites just added haven't been crawled yet... And they're not crawled during the recrawl... Will bin/nutch purge restart everything? On Thu, 3 Aug 2006 09:21:04 -0300, Lourival Júnior [EMAIL PROTECTED

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior
The command bin/nutch purge doesn't exist. Well, I can't tell you what is happening. Send me the output you get when you run the recrawl. On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote: 0.7.2 of nutch On Thu, 3 Aug 2006 09:37:24 -0300, Lourival Júnior [EMAIL PROTECTED] wrote: Which version

NullPointException

2006-08-03 Thread Lourival Júnior
Why, when I delete some segments that have reached the db.default.fetch.interval, does the search application throw a NullPointerException? I have to recrawl my site periodically, and deleting old segments is a problem. Does anyone have a suggestion? Regards -- Lourival Junior Universidade Federal do Pará Curso

Re: NullPointException

2006-08-03 Thread Lourival Júnior
. The segment contains the parsed content and the index is the index of this content. If you delete the segment and then search this index, an NPE occurs because no summary (parsed content) is found. HTH Marko On 03.08.2006 at 16:33, Lourival Júnior wrote: Why when I delete some

ZIP plugin in nutch 0.7.2

2006-08-03 Thread Lourival Júnior
Hi all!! Could I use the zip plugin from nutch 0.8 in nutch 0.7.2? Is there any problem? Regards. -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

java.lang.NoClassDefFoundError

2006-07-28 Thread Lourival Júnior
Hi all! I'm testing nutch 0.8, but I get this error from this simple command: $ bin/nutch readdb java.lang.NoClassDefFoundError: and Exception in thread main I've set the NUTCH_JAVA_HOME variable, but I'm sure it is the root cause of this. What is happening? -- Lourival Junior Universidade

Total time of a search

2006-07-27 Thread Lourival Júnior
Hi, Does anybody know how to calculate the total time of a search? Currently I use this, but I'm not sure about it: Date d = new Date(); int iniTime = (int) d.getTime(); // get the start time of the index search // The index search is executed here. try{ hits =
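One way to measure the elapsed time of a search is sketched below in Java. It is an assumption-laden illustration, not code from the thread: SearchTimer, Timed and time are hypothetical names, and the Callable stands in for the real search call, which is cut off in the message above. The sketch keeps timestamps as long values from System.nanoTime() rather than casting Date.getTime() to int, since that cast discards the high bits of the timestamp.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs any search-like task and reports how long it took.
public class SearchTimer {

    // Pairs the task's result with the elapsed wall-clock time in milliseconds.
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    public static <T> Timed<T> time(Callable<T> search) throws Exception {
        long start = System.nanoTime();          // monotonic clock, kept as long
        T value = search.call();                 // the search itself runs here
        long elapsedNanos = System.nanoTime() - start;
        return new Timed<>(value, TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real search: sleeps briefly and returns a fake hit count.
        Timed<Integer> result = time(() -> {
            Thread.sleep(50);
            return 42;
        });
        System.out.println("hits=" + result.value + " time=" + result.elapsedMillis + " ms");
    }
}
```

Wrapping the search in a Callable keeps the timing code in one place, so the same helper could time the search alone or the search plus summary generation.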

Re: installation de nutch

2006-07-26 Thread Lourival Júnior
Try deleting the crawl directory in /root/nutch-0.7.2/. Then run the command again. On 7/26/06, kawther khazri [EMAIL PROTECTED] wrote: Hi I am trying to run Nutch by following the instructions given in the tutorial. The environment is FEDORA 5, JDK 1.4.2 and Nutch 0.7.2 And of course Tomcat

Re: Recrawl script for 0.8.0 completed...

2006-07-25 Thread Lourival Júnior
that worked for me was shutting down and restarting tomcat, instead of just reloading the context. On linux now I don't have these issues anymore. Rgrds, Thomas On 7/21/06, Lourival Júnior [EMAIL PROTECTED] wrote: Ok. However, a few minutes ago I ran the script exactly as you said and I still get this error

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
Hi Matt! In the article found at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html you said the re-crawl script has a problem with updating the live search index. In my tests with Nutch version 0.7.2, when I run the script the index could not be updated because the tomcat

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
$nutch_dir/WEB-INF/web.xml HTH, Renaud Lourival Júnior wrote: Hi Matt! In the article found at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html you said the re-crawl script has a problem with updating the live search index. In my tests with Nutch version 0.7.2 when I

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior
(IndexMerger.java:160) I don't know, but I think it occurs because nutch tries to delete a file that tomcat has loaded into memory, causing a permission error. Any ideas? On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote: Lourival Júnior wrote: I think it won't work for me because I'm using the Nutch

Unused Segments

2006-07-14 Thread Lourival Júnior
How can I discover which segments are unused by the index? After many recrawls I have a lot of segments, so I would like to delete some of them... Can anyone help me? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL

Recrawl a specific web Page

2006-07-13 Thread Lourival Júnior
How can I recrawl a specific web page? For example, I have an HTML page that is constantly updated. Is there a command for that? -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: question about plugins

2006-07-11 Thread Lourival Júnior
Hi! Don't worry, I know what you mean. You have to modify the nutch-site.xml configuration file in the conf directory. Take a look at an example: <nutch-conf> <property> <name>plugin.includes</name>

Re: OpenOffice Support?

2006-07-11 Thread Lourival Júnior
Taking advantage of your question, does anyone know if version 0.7.2 of nutch supports the zip plugin? If so, where can I find it? Lourival Junior On 7/11/06, Matthew Holt [EMAIL PROTECTED] wrote: Just wondering, has anyone done any work on a plugin (or aware of a plugin) that supports the

Number of pages different to number of indexed pages

2006-07-07 Thread Lourival Júnior
Hi all! I have a small question. My WebDB currently contains 779 pages with 899 links. When I use the segread command it also returns a count of 779 pages in one segment. However, when I run a search or use the Luke tool, the maximum number of documents is 437. I've looked at the recrawl logs and

Re: Number of pages different to number of indexed pages

2006-07-07 Thread Lourival Júnior
a few sites. I crawl 7 sites nightly and often get this error. I changed my http.max.delays property from 3 to 50 and it works without a problem. The crawl takes longer, but I get almost all of the pages. - Original Message - From: Lourival Júnior [EMAIL PROTECTED] To: nutch-user

Index algorithm

2006-07-07 Thread Lourival Júnior
Could anyone point me to a link or document about nutch's indexing algorithm? I haven't found many... Regards -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]