from:"Lourival Júnior"

Re: Query pdf, etc..

2007-04-24 Thread Lourival Júnior


You can use the index-more and query-more plugins to add a field to your
index that records each document's file type. Then, in your search, you can
use type:pdf or type:msword to filter for those files. I got this working
with Nutch 0.7.2.
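
A minimal sketch of how a search page could build such filtered queries. The helper name typed_queries is hypothetical and not part of Nutch; it emits one query string per file type, since this thread does not say whether several type: values can be combined in a single query.

```python
def typed_queries(keywords, types=("pdf", "msword")):
    """Return one query string per file type, e.g. 'report type:pdf'.

    Hypothetical helper, not a Nutch API. The field name and its value
    are written without a space between them (type:pdf, not type: pdf).
    """
    keywords = keywords.strip()
    return [f"{keywords} type:{file_type}" for file_type in types]


if __name__ == "__main__":
    for query in typed_queries("annual report"):
        print(query)
```

Each resulting string would then be submitted to the search front end as an ordinary query.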

Regards,

Lourival Júnior

On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote:


Hi Guys,

I would like to add a new button to my web page that runs an advanced search
using keywords.
When the user clicks it, the search should look for the keywords only in the
indexed PDF, Word, or Excel documents.

Do you know how I can filter/limit my search to PDF/Word/Excel documents?

Thanks for your help.
E





--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]

Re: Using nutch as a web crawler

2007-04-05 Thread Lourival Júnior


Nutch has a file called crawl-urlfilter.txt where you can list your site's
domain or a set of sites, so that Nutch crawls only those. Download Nutch and
try it out; that is the best way to see how it works :). Take a look:
http://lucene.apache.org/nutch/tutorial8.html
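
To illustrate how a filter file of this kind restricts a crawl, here is a small sketch that emulates the rule format in plain Python. It assumes each rule line is a + (accept) or - (reject) followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The domain example.org is a placeholder, and the functions are illustrative, not Nutch code.

```python
import re

# Rules in the style of crawl-urlfilter.txt (example.org is a placeholder).
RULES = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# accept hosts in example.org
+^http://([a-z0-9]*\.)*example\.org/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule wins; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.example.org/a.html", "http://other.com/"):
        print(url, accepts(rules, url))
```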

Regards,

On 4/5/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:


Thanks. Can you please tell me how I can plug in my own handling
when Nutch visits a site, instead of building the search database for
that site?



On 4/3/07, Lourival Júnior [EMAIL PROTECTED] wrote:
 I am quite certain that Nutch is what you are looking for. Take a look
 at Nutch's documentation for more details and you will see :).

 On 4/3/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I would like to know if it is a good idea to use the Nutch web
  crawler.
  Basically, this is what I need:
  1. I have a list of web sites.
  2. I want the web crawler to go through each site and parse the anchors. If
  a link is in the same domain, repeat the same step, up to 3 levels deep.
  3. For each link, write to a new file.
 
  Is Nutch a good solution, or is there a better open source
  alternative for my purpose?
 
  Thank you.
 










Re: Using nutch as a web crawler

2007-04-03 Thread Lourival Júnior


I am quite certain that Nutch is what you are looking for. Take a look
at Nutch's documentation for more details and you will see :).

On 4/3/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:


Hi,

I would like to know if it is a good idea to use the Nutch web
crawler.
Basically, this is what I need:
1. I have a list of web sites.
2. I want the web crawler to go through each site and parse the anchors. If
a link is in the same domain, repeat the same step, up to 3 levels deep.
3. For each link, write to a new file.

Is Nutch a good solution, or is there a better open source
alternative for my purpose?

Thank you.






Re: java.lang.NoClassDefFoundError

2006-12-01 Thread Lourival Júnior


I'm testing Nutch 0.8.1 and I still get this error when trying to run
a simple command:

Exception in thread "main"
java.lang.NoClassDefFoundError

What's wrong? I'm running in Windows 2000 with cygwin.

On 7/28/06, Rick Carver [EMAIL PROTECTED] wrote:


I get the same problem trying to use nutch 0.8, just
checked out.

Exception in thread main
java.lang.NoClassDefFoundError

This is on OS X 10.4.7. Older nutch runs fine.

--- Lourival Júnior [EMAIL PROTECTED] wrote:

 Hi all!
 I'm testing Nutch 0.8, but I get this error with
 this simple command:

 $ bin/nutch readdb
 java.lang.NoClassDefFoundError: and
 Exception in thread main

 I've set the NUTCH_JAVA_HOME variable, but I'm sure
 it is the root cause of
 this.

 What is occurring?










Re: Common terms

2006-09-25 Thread Lourival Júnior


Have you reindexed your segments? It's important, because that is what makes
Nutch recognize your common terms. I've tried it, and the only difference I
noticed was that the index is larger than the original (before using the
common terms).

On 9/25/06, carmmello [EMAIL PROTECTED] wrote:


I'm using Nutch 0.7.2 and have added some common terms in Portuguese to the
common-terms.utf8 file in the conf folder (and also under the classes folder,
inside the ROOT folder on Tomcat), one per line, like:

content:da
contente:de
contente:eu
..
However, when I try some searches, I get all the results for those
Portuguese common terms and, at the same time, zero results for the
original English terms. I have even tried listing all the terms in
alphabetical order, including the original ones, with the same results. In
other words, Nutch does not seem to recognize the added common
terms as such, only the original ones included in the distribution.
Can anyone clarify this?
Thanks






Re: Common terms

2006-09-25 Thread Lourival Júnior


OK. If you're crawling with these settings, you don't need to reindex your
segments again. What about the plugins you are using? Are you using
the language-identifier plugin? If not, try it.

Regards,

P.S.: I speak Portuguese :)

On 9/25/06, carmmello [EMAIL PROTECTED] wrote:


This issue happens even when I start a new crawl, so I'm not reindexing
the segments. The indexing is done by Nutch itself, using the intranet
method.
Do you mean that after this is done, I have to reindex the segments
once again? But if so, why are the English common terms recognized the
first time?
Thanks again
- Original Message -
From: Lourival Júnior [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Monday, September 25, 2006 3:58 PM
Subject: Re: Common terms


Have you reindexed your segments? It's important, because that is what makes
Nutch recognize your common terms. I've tried it, and the only difference I
noticed was that the index is larger than the original (before using the
common terms).

On 9/25/06, carmmello [EMAIL PROTECTED] wrote:

 I'm using Nutch 0.7.2 and have added some common terms in Portuguese to the
 common-terms.utf8 file in the conf folder (and also under the classes folder,
 inside the ROOT folder on Tomcat), one per line, like:
 
 content:da
 contente:de
 contente:eu
 ..
 However, when I try some searches, I get all the results for those
 Portuguese common terms and, at the same time, zero results for the
 original English terms. I have even tried listing all the terms in
 alphabetical order, including the original ones, with the same results.
 In other words, Nutch does not seem to recognize the added common
 terms as such, only the original ones included in the distribution.
 Can anyone clarify this?
 Thanks

















ZIP parser in Nutch 0.7.2

2006-09-05 Thread Lourival Júnior


Hi all!

Has anyone successfully set up the ZIP plugin in Nutch 0.7.2? How
can I do this?

Regards,


Re: indexing folders with nutch

2006-09-01 Thread Lourival Júnior


Yes Cam, if you use a depth of 1 you will crawl only the first document. With
a depth of 2 you will crawl the first document and all the links found in that
document. With depth 3, you will crawl the first one, its links, and all
links found in cycle 2, and so on. Increasing your depth will increase the
size of your WebDB too. Try it ;)
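
The depth behaviour described above can be sketched as a round-based traversal over a toy link graph. This only models the idea and is not how Nutch is implemented; the crawl function and the LINKS graph are made up for the example.

```python
def crawl(seeds, links, depth):
    """Return the pages fetched after `depth` rounds over a toy link graph.

    Round 1 fetches the seeds, round 2 fetches the links found in round 1,
    and so on. `links` maps a page to the pages it links to.
    """
    fetched = []
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            fetched.append(page)
            for target in links.get(page, []):
                if target not in seen:  # never queue a page twice
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return fetched


# Hypothetical site: a links to b and c, b links to d, d links to e.
LINKS = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}

if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth, crawl(["a"], LINKS, depth))
```

With this graph, depth 1 fetches only the seed, and each extra round adds the pages discovered in the previous one, which is why a larger depth also means a larger database of known pages.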

Regards

On 8/31/06, Sandy Polanski [EMAIL PROTECTED] wrote:


Cam, try increasing the depth and see what happens.
It seems that logic would say that they're on the same
directory depth/level; however, just give it a try
because I ran into a similar problem, and if I'm not
mistaken, that fixed it.

--- Cam Bazz [EMAIL PROTECTED] wrote:

 Hello,

 I have a problem. I tried to index some localfiles
 with nutch.

 What I have done is put them in a local apache
 server, (html files)
 and create a urls file that contains
 http://localhost/file01.html etc.

 then I do a nutch crawl urls . -dir crawl -depth 1

 but the crawl stalls after a while, and nothing
 happens.

 I also tried -topN 1

 Isn't there a more convenient way of indexing from
 the file system?

 Best regards,
 -C.B.









Re: index/search filtering by category

2006-08-23 Thread Lourival Júnior


Hi Ernesto!

Meta tags are tags that you add to your web page, more precisely
inside the <head></head> element, to describe the contents of the
page to search engine indexers. For example, you can add meta tags to
describe the author of the page, keywords, caching, and so on. What you can do
for your problem is add a meta tag to describe your categories:

<meta name="category" content="yourcategory" />
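
As a rough illustration of what a parsing plugin would need to do with such a tag, the sketch below pulls the category out of a page using Python's standard html.parser module. It is not Nutch plugin code; the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class MetaCategoryParser(HTMLParser):
    """Remember the content of a <meta name="category"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.category = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "category":
            self.category = attributes.get("content")


def extract_category(html):
    """Return the page's category, or None when no such meta tag exists."""
    parser = MetaCategoryParser()
    parser.feed(html)
    return parser.category


if __name__ == "__main__":
    page = '<html><head><meta name="category" content="news" /></head></html>'
    print(extract_category(page))
```

The extracted value is what an indexing step would then store in its own field so that searches can filter on it.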

I hope this helps.

Regards

On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote:


Thanks to both of you for responding!

What's a meta tag?
Is it something specific to Nutch, rather than a Lucene field?

I suppose that by implementing IndexingFilter.filter:

filter(Document doc, Parse parse, UTF8 url, CrawlDatum datum, Inlinks
inlinks)

I can add my field to the doc instance.

Well, it seems the way forward is to try, fail, and try again... :)

Thanks,
Ernesto.

Chris Stephens escribió:
 You can't do it unless you write a plugin to parse a custom meta tag
 called category.

 I'm trying to do something like this now, but the plugin documentation
 is horrible.

 Lourival Júnior wrote:
 Hi Ernesto!

 I know what you mean. Sometimes I get no answers either. Unfortunately,
 I'm new to Nutch and Lucene and I can't help you. Keep trying; the
 community will help you :).

 On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote:

 Hi All

 Please, can somebody answer my questions?
 I'm a Nutch beginner; I hope that my questions/doubts are easy... ;)

 Or if there is something wrong with my email, tell me. Or confirm that I'm
 on the right track.

 Thanks a lot!
 Ernesto.

 Ernesto De Santis escribió:
  Hi
 
  I'm new to Nutch; I started yesterday.
  But I have experience with Lucene.
 
  I have some questions for you, the Nutch experts... ;)
 
  I want to split my result pages into categories, to filter them or to
  show them separately.
  This is my approach:
 
  *crawl/index*
 
  I want to index an extra field.
  So I need to write my own plugin for that, to implement my custom logic.
  Then I configure my plugin in conf/nutch-site.xml.
 
  To develop my plugin, I see that I need to implement the Configurable
  (http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/conf/Configurable.html),
  IndexingFilter
  (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/indexer/IndexingFilter.html),
  and Pluggable
  (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/plugin/Pluggable.html)
  interfaces.
 
  Then I add the field value, the category value, to the Document instance.
 
  *search*
 
  Here I have a doubt. One way is to set a required term on the Nutch query:
 
  query.addRequiredTerm(myCategory, category);
 
  I see that Nutch uses QueryFilters too, but I can't see how to hook
  one into my query.
 
  *miscellaneous*
 
  Lucene has a rich query hierarchy, which I don't see in Nutch. I don't
  see BooleanQuery, TermQuery, etc. Is the Query class the only way to
  build a query in Nutch?
 
  The Lucene searcher has a way to separate the query from the filters. The
  query conditions affect the ranking, and filters don't. How does Nutch
  separate them?
 
  *documentation*
 
  I read the documentation on the Nutch site, the tutorial, the wiki, the
  presentations, and the today.java.net article:
 
  http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
 
  and part 2 too.
 
  A lot of details aren't covered there. Does anybody know of more detailed
  documentation?
 
  Thanks a lot.
  Ernesto.
 





















Re: index/search filtering by category

2006-08-22 Thread Lourival Júnior


Hi Ernesto!

I know what you mean. Sometimes I get no answers either. Unfortunately, I'm
new to Nutch and Lucene and I can't help you. Keep trying; the community will
help you :).

On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote:


Hi All

Please, can somebody answer my questions?
I'm a Nutch beginner; I hope that my questions/doubts are easy... ;)

Or if there is something wrong with my email, tell me. Or confirm that I'm on
the right track.

Thanks a lot!
Ernesto.

Ernesto De Santis escribió:
 Hi

 I'm new to Nutch; I started yesterday.
 But I have experience with Lucene.

 I have some questions for you, the Nutch experts... ;)

 I want to split my result pages into categories, to filter them or to
 show them separately.
 This is my approach:

 *crawl/index*

 I want to index an extra field.
 So I need to write my own plugin for that, to implement my custom logic.
 Then I configure my plugin in conf/nutch-site.xml.

 To develop my plugin, I see that I need to implement the Configurable
 (http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/conf/Configurable.html),
 IndexingFilter
 (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/indexer/IndexingFilter.html),
 and Pluggable
 (http://lucene.apache.org/nutch/apidocs-0.8/org/apache/nutch/plugin/Pluggable.html)
 interfaces.

 Then I add the field value, the category value, to the Document instance.

 *search*

 Here I have a doubt. One way is to set a required term on the Nutch query:

 query.addRequiredTerm(myCategory, category);

 I see that Nutch uses QueryFilters too, but I can't see how to hook
 one into my query.

 *miscellaneous*

 Lucene has a rich query hierarchy, which I don't see in Nutch. I don't
 see BooleanQuery, TermQuery, etc. Is the Query class the only way to
 build a query in Nutch?

 The Lucene searcher has a way to separate the query from the filters. The
 query conditions affect the ranking, and filters don't. How does Nutch
 separate them?

 *documentation*

 I read the documentation on the Nutch site, the tutorial, the wiki, the
 presentations, and the today.java.net article:

 http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
 and part 2 too.

 A lot of details aren't covered there. Does anybody know of more detailed
 documentation?

 Thanks a lot.
 Ernesto.











Zip Plugin

2006-08-21 Thread Lourival Júnior


Has anyone succeeded in setting up the Zip parse plugin in Nutch 0.7.2?

Regards


Re: Querying Fields

2006-08-14 Thread Lourival Júnior


OK Lukas, I know what you mean. The community is very important to the
success of a project, especially an open source one. I'm not sure I can
contribute to Nutch right now, because I'm a newbie in this area, but I will
contribute soon. For now, I answer the questions I know something about. I
really appreciate it when you answer our questions, because it motivates us,
and we'll tell other people that Nutch is not just very useful for building
a web search engine, but the best way to do it.

Regards!

On 8/14/06, Lukas Vlcek [EMAIL PROTECTED] wrote:


Lourival,

You are definitely not alone with this feeling. Nutch is quite an active
open source project, so some lack of documentation is natural,
especially as Nutch hasn't reached its 1.0 release yet. Believe me, I
have the same problem all the time.

The best way to change this situation is to contribute! The wiki is
open to anybody, the source code can be downloaded, and if you are a freak
you can suggest changes, and if you are a real hacker (meaning you
are not ashamed to use vi for anything, including writing source code)
then you can even become a committer. Once you become a committer
you will be overloaded with work to the point that you won't be able
to answer STFW questions on the mailing lists... etc. :-)

Regards,
Lukas

On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote:
 Yes, I tested the index-more and query-more plugins. They make it easy to
 search these fields. However, if I had been able to find documentation about
 them, I would not have spent time working out a solution.

 Thanks a lot!

 On 8/11/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
 
  Hi,
 
  You need to look into the source to find out exactly what it does. As far
  as I know it does not add any new field to the index (that should be done
  via the index-more plugin), but it allows you to query using type:, date:
  and site:, I think.
 
  Lukas
 
  On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
   What exactly does the query-more plugin do? I tested it a few minutes ago
   and it doesn't add any field to the resulting index. Is it used in the
   webapp? Could you clarify this for me?
  
   Thanks!
  
   On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
   
Hi,
   
If my memory serves me correctly, query-more should work fine with
Nutch 0.7.2 too.
And you are right, Matthew: you need to use the [type:] or [date:]
filters in combination with [url:], as you can get an empty result
set if they are used alone. I do queries like this: [url:http type:pdf]
and it gives me the result I need.
   
Lukas
   
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
 All right! I've done this already. I think you didn't understand my
 question.
 What I want to do is query my indexes using something like
 filetype:pdf. Version 0.8 already has this feature, but I'm using
 version 0.7.2 and I want to add this feature manually. I don't know
 where I have to edit, though. Do you know?

 Regards,

 Lourival Junior

 On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
 
  Hi,
 
  To allow more formats to be indexed you need to modify nutch-site.xml
  and update/add the plugin.includes property (see nutch-default.xml for
  default settings). The following is what I have in nutch-site.xml:
 
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
  </property>
 
  [parse-*] is used to parse various formats; [query-more] allows you to
  use the [type:] filter in Nutch queries.
 
  Regards,
  Lukas
 
  On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
   Hi Lukas and everybody!
  
   Do you know which file in Nutch 0.7.2 I should edit to add a
   field to my index (e.g. file type: PDF, Word or HTML)?
  
   On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
   
Hi,
   
I am not sure if I can give you any useful hint, but the following is
what once worked for me.
Example of query: url:http date:20060801

date: and type: options can be used in combination with url:
The filter url:http should select all documents (unless you allowed the
file or ftp protocols). A plain date or type filter selects nothing if it is
used alone.

And be sure you don't introduce any space between the filter name and its
value ([date: 20060801] is not the same as [date:20060801]).
   
Lukas
   
On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote:
 Howie,
I inspected my index using Luke and 20060801 shows up
  several
  times
 in the index. I'm unable to query pretty much any field.
  Several

Re: common-terms.utf8

2006-08-11 Thread Lourival Júnior


Hi Timo!

Thanks a lot! Now I have a clear understanding of this file. This article
helps a lot too: http://searchenginewatch.com/showPage.html?page=2156061

Thanks again!

On 8/11/06, Timo Scheuer  [EMAIL PROTECTED] wrote:


Hi,

 Could anyone explain what exactly the common-terms.utf8 file does? I
 don't understand the real purpose of this file...

During indexing (and also during searching) the common terms are used to
form n-grams to make search faster for common words, such as articles.
It is an alternative to using stop words. N-grams keep the common words by
appending them to the following word. This approach increases the
selectivity.
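
A minimal sketch of the idea, assuming a common term is joined to the word that follows it with a hyphen. The joiner, the function name common_grams and the sample terms are illustrative choices, not taken from Nutch.

```python
def common_grams(tokens, common):
    """Emit each token, plus a two-word gram after every common term.

    A common term such as "da" followed by "maria" also yields the far
    more selective gram "da-maria". The "-" joiner is an assumption made
    for this example.
    """
    output = []
    for index, token in enumerate(tokens):
        output.append(token)
        if token in common and index + 1 < len(tokens):
            output.append(f"{token}-{tokens[index + 1]}")
    return output


if __name__ == "__main__":
    print(common_grams(["casa", "da", "maria"], {"da", "de"}))
```

Because every common term followed by another word contributes an extra gram, this approach stores more terms than plain tokenization would.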


Cheers,
Timo.






Re: Querying Fields

2006-08-11 Thread Lourival Júnior


Yes, I tested the index-more and query-more plugins. They make it easy to
search these fields. However, if I had been able to find documentation about
them, I would not have spent time working out a solution.

Thanks a lot!

On 8/11/06, Lukas Vlcek [EMAIL PROTECTED] wrote:


Hi,

You need to look into the source to find out exactly what it does. As far
as I know it does not add any new field to the index (that should be done
via the index-more plugin), but it allows you to query using type:, date:
and site:, I think.

Lukas

On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
 What exactly does the query-more plugin do? I tested it a few minutes ago
 and it doesn't add any field to the resulting index. Is it used in the
 webapp? Could you clarify this for me?

 Thanks!

 On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
 
  Hi,
 
  If my memory serves me correctly, query-more should work fine with
  Nutch 0.7.2 too.
  And you are right, Matthew: you need to use the [type:] or [date:]
  filters in combination with [url:], as you can get an empty result
  set if they are used alone. I do queries like this: [url:http type:pdf]
  and it gives me the result I need.
 
  Lukas
 
  On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
   All right! I've done this already. I think you didn't understand my
   question.
   What I want to do is query my indexes using something like
   filetype:pdf. Version 0.8 already has this feature, but I'm using
   version 0.7.2 and I want to add this feature manually. I don't know
   where I have to edit, though. Do you know?
  
   Regards,
  
   Lourival Junior
  
   On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
   
Hi,
   
To allow more formats to be indexed you need to modify nutch-site.xml
and update/add the plugin.includes property (see nutch-default.xml for
default settings). The following is what I have in nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
</property>

[parse-*] is used to parse various formats; [query-more] allows you to
use the [type:] filter in Nutch queries.
   
Regards,
Lukas
   
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
 Hi Lukas and everybody!

 Do you know which file in Nutch 0.7.2 I should edit to add a
 field to my index (e.g. file type: PDF, Word or HTML)?

 On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I am not sure if I can give you any useful hint, but the following is
  what once worked for me.
  Example of query: url:http date:20060801
 
  date: and type: options can be used in combination with url:
  The filter url:http should select all documents (unless you allowed the
  file or ftp protocols). A plain date or type filter selects nothing if it
  is used alone.
 
  And be sure you don't introduce any space between the filter name and its
  value ([date: 20060801] is not the same as [date:20060801]).
 
  Lukas
 
  On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote:
   Howie,
   I inspected my index using Luke and 20060801 shows up several times
   in the index. I'm unable to query pretty much any field. Several people
   seem to be having the same problem. Does anyone know what's going on?
  
   This is one of the last things I have to resolve to have Nutch deployed
   successfully at my organization. Unfortunately, Friday is my last day.
   Can anyone offer any assistance??
   Thanks,
 Matt
  
   Howie Wang wrote:
I think that I have problems querying for numbers and
words with digits in them. Now that I think of it, is it
possible it has something to do with the stemming in
either the query filter or indexing? In either case, I
would
print out the text that is being indexed and the phrases
added to the query. You could also use Luke to inspect
your index and see whether 20060801 shows up anywhere.
   
Howie
   
I tried looking for a page that had the date 20060801 and the text
test in the page. I tried the following:
   
date: 20060801 test
   
and
   
date 20060721-20060803 test
   
Neither worked, any ideas??
   
Matt
   
Matthew Holt wrote:
Thanks Jake,
However, it seems to me that it makes most sense that a query
should return all pages that match the query, instead of acting as a
content filter. However, I know it's something easy to suggest when
you're not having to implement it, so just a suggestion.
   
Matt
   
Vanderdray, Jacob wrote:
Try querying

Re: common-terms.utf8

2006-08-11 Thread Lourival Júnior


Hi Timo!

I analyzed the index before and after correctly using the
common-terms.utf8 file. Before adding the common terms in my language,
my index was about 3 MB.
After adding the common terms it is now 5 MB! Why does this happen?

Regards!

On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote:


Hi Timo!

Thanks a lot! Now I have a clear understanding of this file. This article
helps a lot too: http://searchenginewatch.com/showPage.html?page=2156061

Thanks again!


On 8/11/06, Timo Scheuer  [EMAIL PROTECTED] wrote:

 Hi,

  Could anyone explain what exactly the common-terms.utf8 file does?
  I don't understand the real purpose of this file...

 During indexing (and also during searching) the common terms are used to
 form n-grams to make search faster for common words, such as articles.
 It is an alternative to using stop words. N-grams keep the common words by
 appending them to the following word. This approach increases the
 selectivity.


 Cheers,
 Timo.










common-terms.utf8

2006-08-10 Thread Lourival Júnior


Hi,

Could anyone explain what exactly the common-terms.utf8 file does? I
don't understand the real purpose of this file...

Regards,


Re: Querying Fields

2006-08-09 Thread Lourival Júnior

Hi Lukas and everybody!

Do you know which file in Nutch 0.7.2 I should edit to add a field to my
index (e.g. file type: PDF, Word or HTML)?

On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote:

Hi,

I am not sure if I can give you any useful hint, but the following is
what once worked for me.
Example of query: url:http date:20060801

date: and type: options can be used in combination with url:
The filter url:http should select all documents (unless you allowed the file
or ftp protocols). A plain date or type filter selects nothing if it is
used alone.

And be sure you don't introduce any space between the filter name and its
value ([date: 20060801] is not the same as [date:20060801]).

Lukas

On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote:
Howie,
I inspected my index using Luke and 20060801 shows up several times
in the index. I'm unable to query pretty much any field. Several people
seem to be having the same problem. Does anyone know whats going on?

This is one of the last things I have to resolve to have Nutch deployed
successfully at my organization. Unfortunately, Friday is my last day.
Can anyone offer any assistance??
Thanks,
Matt

Howie Wang wrote:
I think that I have problems querying for numbers and
words with digits in them. Now that I think of it, is it
possible it has something to do with the stemming in
either the query filter or indexing? In either case, I would
print out the text that is being indexed and the phrases
added to the query. You could also use Luke to inspect
your index and see whether 20060801 shows up anywhere.

Howie

I tried looking for a page that had the date 20060801 and the text
test in the page. I tried the following:

date: 20060801 test

and

date 20060721-20060803 test

Neither worked, any ideas??

Matt

Matthew Holt wrote:
Thanks Jake,
However, it seems to me that it makes most sense that a query
should return all pages that match the query, instead of acting as a
content filter. However, I know its something easy to suggest when
you're not having to implement it, so just a suggestion.

Matt

Vanderdray, Jacob wrote:
Try querying with both the date and something you'd expect to find
in the content. The field query filter is just a filter. It only
restricts your results to things that match the basic query and has
the contents you require in the field. So if you query for
date:2006080 text you'll be searching for documents that contain
text in one of the default query fields and has the value 2006080
in the date field. Leaving out text in that example would
essentially be asking for nothing in the default fields and 2006080
in the date field which is why it doesn't return any results.
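
The behaviour Jake describes can be pictured with a toy model in which field filters only narrow the hits of the basic keyword query. This is a simplified illustration of that description, not Nutch's query code; the documents and the search function are invented for the example.

```python
# Two made-up documents with a content field and a date field.
DOCS = [
    {"content": "nutch test page", "date": "20060801"},
    {"content": "another test", "date": "20060715"},
]


def search(query, docs=DOCS):
    """Toy search: plain terms select documents, field:value only filters."""
    terms, filters = [], {}
    for part in query.split():
        if ":" in part:
            field, value = part.split(":", 1)
            filters[field] = value
        else:
            terms.append(part)
    if not terms:
        # Nothing was asked of the default fields, so nothing can match.
        return []
    hits = [d for d in docs if all(t in d["content"].split() for t in terms)]
    return [d for d in hits if all(d.get(f) == v for f, v in filters.items())]


if __name__ == "__main__":
    print(len(search("date:20060801")))       # filter only
    print(len(search("test date:20060801")))  # keyword plus filter
```

In this model a query made of a filter alone returns no results, while the same filter combined with a keyword narrows the keyword's hits.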

Hope that helps,
Jake.

-Original Message-
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Wed 8/2/2006 4:58 PM
To: nutch-user@lucene.apache.org
Subject: Querying Fields
I am unable to query fields in my index in the method that has
been suggested. I used Luke to examine my index and the following
field types exist:
anchor, boost, content, contentLength, date, digest, host,
lastModified, primaryType, segment, site, subType, title, type, url

However, when I do a search using one of the fields, followed by a
colon, an incorrect result is returned. I used Luke to find the top
term in the date field which is '20060801'. I then searched using
the following query:
date: 20060801

Unfortunately, nothing was returned. The correct plugins are
enabled, here is an excerpt from my nutch-site.xml:

<property>
  <name>plugin.includes</name>

  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain text via
  HTTP, and basic indexing and search plugins.
  </description>
</property>

Any ideas? I'm not the only one having this problem; I saw an
earlier mailing list post but couldn't find any resolution... Thanks,

Matt


Re: Querying Fields

2006-08-09 Thread Lourival Júnior


All right! I've done this already. I think you didn't understand my question.
What I want to do is query my indexes using something like
filetype:pdf. Version 0.8 already has this feature, but I'm using
version 0.7.2 and I want to add this feature manually. I don't know
which files I have to edit. Do you know?

Regards,

Lourival Junior

On 8/9/06, Lukas Vlcek [EMAIL PROTECTED] wrote:


Hi,

To allow more formats to be indexed you need to modify nutch-site.xml
and update/add plugin.includes property (see nutch-default.xml for
default settings). The following is what I have in nutch-site.xml:

<property>
  <name>plugin.includes</name>

  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
</property>

[parse-*] is used to parse various formats, [query-more] allows you to
use [type:] filter in nutch queries.

Regards,
Lukas

On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
 Hi Lukas and everybody!

 Do you know which file in Nutch 0.7.2 I should edit to add a field to my
 index (e.g. file type: PDF, Word or HTML)?

 On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I am not sure if I can give you any useful hint, but the following is
  what once worked for me.
  Example of query: url:http date:20060801
 
  The date: and type: options can be used in combination with url:.
  The filter url:http should select all documents (unless you allowed the file
  or ftp protocols). A plain date or type filter selects nothing if it is
  used alone.
 
  And be sure you don't introduce any space between the filter name and its
  value ([date: 20060801] is not the same as [date:20060801])
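
The two points above (pair the field filter with another clause, and leave no space after the colon) can be sketched as a small shell helper. This is only an illustration: `build_field_query` is a hypothetical name, not part of Nutch, and the default companion clause `url:http` is simply the one from the example query above.

```bash
#!/bin/bash
# Hypothetical helper (not part of Nutch): builds a query string that pairs
# a field filter with a companion clause, with no space after the colon.
build_field_query() {
  local field=$1
  local value=${2//[[:space:]]/}   # drop stray whitespace, e.g. " 20060801"
  local term=${3:-url:http}        # assumed default, taken from the example above
  printf '%s %s:%s\n' "$term" "$field" "$value"
}

build_field_query date " 20060801"   # prints: url:http date:20060801
build_field_query type pdf nutch     # prints: nutch type:pdf
```

The helper only assembles the string; whether the filter matches still depends on the query-more plugin being enabled and on the field actually being in the index.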
 
  Lukas
 
  On 8/8/06, Matthew Holt [EMAIL PROTECTED] wrote:
   Howie,
   I inspected my index using Luke and 20060801 shows up several times
   in the index. I'm unable to query pretty much any field. Several people
   seem to be having the same problem. Does anyone know what's going on?
  
   This is one of the last things I have to resolve to have Nutch deployed
   successfully at my organization. Unfortunately, Friday is my last day.
   Can anyone offer any assistance??
   Thanks,
 Matt
  
   Howie Wang wrote:
I think that I have problems querying for numbers and
words with digits in them. Now that I think of it, is it
possible it has something to do with the stemming in
either the query filter or indexing? In either case, I would
print out the text that is being indexed and the phrases
added to the query. You could also using luke to inspect
your index and see whether 20060801 shows up anywhere.
   
Howie
   
I tried looked for a page that had the date 20060801 and the text
test in the page. I tried the following:
   
date: 20060801 test
   
and
   
date 20060721-20060803 test
   
Neither worked, any ideas??
   
Matt
   
Matthew Holt wrote:
Thanks Jake,
  However, it seems to me that it makes most sense that a query
should return all pages that match the query, instead of acting
as a
content filter. However, I know its something easy to suggest
when
you're not having to implement it, so just a suggestion.
   
Matt
   
Vanderdray, Jacob wrote:
Try querying with both the date and something you'd expect to
find
in the content.  The field query filter is just a filter.  It
only
restricts your results to things that match the basic query and
has
the contents you require in the field.  So if you query for
date:2006080 text you'll be searching for documents that
contain
text in one of the default query fields and has the value
2006080
in the date field.  Leaving out text in that example would
essentially be asking for nothing in the default fields and
2006080
in the date field which is why it doesn't return any results.
   
Hope that helps,
Jake.
   
   
-Original Message-
From: Matthew Holt [mailto:[EMAIL PROTECTED]
Sent: Wed 8/2/2006 4:58 PM
To: nutch-user@lucene.apache.org
Subject: Querying Fields
 I am unable to query fields in my index in the method that has
been suggested. I used Luke to examine my index and the
following
field types exist:
anchor, boost, content, contentLength, date, digest, host,
lastModified, primaryType, segment, site, subType, title, type,
url
   
However, when I do a search using one of the fields, followed
by a
colon, an incorrect result is returned. I used Luke to find the
top
term in the date field which is '20060801'. I then searched
using
the following query:
date: 20060801
   
Unfortunately, nothing was returned. The correct plugins are
enabled, here is an excerpt from my nutch-site.xml:
   
property
  nameplugin.includes/name
   
   
 
valueprotocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index

Re: Recrawl urls

2006-08-03 Thread Lourival Júnior


Hi Nahuel!

You could use the command bin/nutch inject $nutch-dir/db -urlfile
urlfile.txt. To recrawl your WebDB you can use this script:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

Take a look at the adddays argument and at the configuration property
db.default.fetch.interval. They influence the result.

Regards!

On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:


Hello,

I'm looking for a way to add new URLs to the crawl URL list
and to recrawl all URLs...

Can you help me ?

thanks,

--
Nahuel ANGELINETTI






Re: 0.8 Recrawl script updated

2006-08-03 Thread Lourival Júnior


Hi Matthew!

Could you update the script for version 0.7.2 with the same
functionality? I wrote a script that does this, but it doesn't work very
well...

Regards!

On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote:


Just letting everyone know that I updated the recrawl script on the
Wiki. It now merges the created segments, then deletes the old segments to
prevent a lot of unneeded data remaining/growing on the hard drive.
  Matt


http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03






Re: Recrawl urls

2006-08-03 Thread Lourival Júnior


Which version are you using?

On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:


But the websites I just added haven't been crawled yet... And they're not
crawled during the recrawl...
Will bin/nutch purge restart everything?



Le Thu, 3 Aug 2006 09:21:04 -0300,
Lourival Júnior [EMAIL PROTECTED] a écrit :

 In Nutch's conf/nutch-default.xml configuration file there is a
 property called db.default.fetch.interval. When you crawl a site, Nutch
 schedules the next fetch for today + db.default.fetch.interval days.
 If you execute the recrawl command and the pages you fetched haven't
 reached this date, they won't be re-fetched. When you add new URLs to
 the webdb, they are ready to be fetched. So at this moment only
 these pages will be fetched by the recrawl script.
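
The scheduling rule described here can be sketched as plain arithmetic. This is not Nutch's actual code, only an illustration of the rule as stated in this thread; `is_due` and its four arguments are hypothetical, and the adddays behaviour ("advance the clock by that many days") follows the usage text of the recrawl script.

```bash
#!/bin/bash
# Illustrative arithmetic only; not Nutch's actual scheduling code.
# Arguments are all assumptions for this sketch:
#   $1 last fetch time (epoch seconds)   $2 fetch interval in days
#   $3 adddays value                     $4 current time (epoch seconds)
is_due() {
  local last_fetch=$1 interval_days=$2 adddays=$3 now=$4
  local day=86400
  local next_fetch=$(( last_fetch + interval_days * day ))
  # adddays pretends the clock is that many days ahead
  (( now + adddays * day >= next_fetch ))
}

# Page fetched at t=0 with a 30-day interval, checked 10 days later:
is_due 0 30 0  $((10 * 86400)) && echo due || echo "not due"   # not due
is_due 0 30 31 $((10 * 86400)) && echo due || echo "not due"   # due
```

Under this reading, a recrawl run with adddays 0 shortly after the first crawl selects only the newly injected URLs, which matches the behaviour described above.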

 I hope this helps. If I said something wrong, please correct me :)

 Regards

 On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:
 
  I have another question. I did what you suggested. It injects
  the new URLs and recrawls, but unlike the first crawl it
  doesn't download the web pages and really crawl them... perhaps I'm
  making a mistake somewhere...
  Any idea?
 
  Regards,
 
  --
  Nahuel ANGELINETTI
 
  Le Thu, 3 Aug 2006 08:31:22 -0300,
  Lourival Júnior [EMAIL PROTECTED] a écrit :
 
   Hi Nahuel!
  
   You could use the command bin/nutch inject $nutch-dir/db -urlfile
   urlfile.txt. To recrawl your WebDB you can use this
   script.
 
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
  
   Take a look to the adddays argument and to the configuration
   property db.default.fetch.interval.They influence to the result.
  
   Regards!
  
   On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:
   
Hello,
   
I was searching for the method to add new url to the crawling
url list and how to recrawl all urls...
   
Can you help me ?
   
thanks,
   
--
Nahuel ANGELINETTI
   
  
  
  
 









Re: Recrawl urls

2006-08-03 Thread Lourival Júnior


The command bin/nutch purge doesn't exist. I can't tell you what is
happening yet. Send me the output you get when you run the recrawl.

On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:


0.7.2 of nutch

Le Thu, 3 Aug 2006 09:37:24 -0300,
Lourival Júnior [EMAIL PROTECTED] a écrit :

 Which version are you using?

 On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:
 
  But the websites just added hasn't been yet crawled... And they're
  not crawled during recrawl...
  Does bin/nutch purge will restart all ?
 
 
 
  Le Thu, 3 Aug 2006 09:21:04 -0300,
  Lourival Júnior [EMAIL PROTECTED] a écrit :
 
   In the nutch conf/nutch-default.xml configuration file exist a
   property call db.default.fetch.interval. When you crawl a site,
   nutch schedules the next fetch to today +
   db.default.fetch.interval days. If execute the recrawl command
   and the pages that you fetch don't reach this date, they won't be
   re-fetched. When you add new urls to the webdb, they will be
   ready to be fetch. So at this moment only this pages will be
   fetched by the recrawl script.
  
   I hope I helped you. If I said some wrong thing, please correct
   me :)
  
   Regards
  
   On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:
   
I have another question, I done what you give me... But it
inject the new urls and recrawl it, but against the first
crawl It doesn't download the web pages and really crawl
them... perhaps I'm mistaking somewhere...
Any idea ?
   
Regards,
   
--
Nahuel ANGELINETTI
   
Le Thu, 3 Aug 2006 08:31:22 -0300,
Lourival Júnior [EMAIL PROTECTED] a écrit :
   
 Hi Nahuel!

 You could use the command bin/nutch inject $nutch-dir/db
 -urlfile urlfile.txt. To recrawl your WebDB you can use this
 script.
   
 
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

 Take a look to the adddays argument and to the configuration
 property db.default.fetch.interval.They influence to the
 result.

 Regards!

 On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:
 
  Hello,
 
  I was searching for the method to add new url to the
  crawling url list and how to recrawl all urls...
 
  Can you help me ?
 
  thanks,
 
  --
  Nahuel ANGELINETTI
 



   
  
  
  
 









NullPointException

2006-08-03 Thread Lourival Júnior


Why does the search application throw a NullPointerException when I delete
segments that have passed the db.default.fetch.interval? I have to
recrawl my site periodically, and deleting old segments is a problem. Does
anyone have a suggestion?

Regards


Re: NullPointException

2006-08-03 Thread Lourival Júnior


All right. Take a look at this output of the segread command:

060803 132735 PARSED?  STARTED            FINISHED           COUNT  DIR NAME
060803 132735 true     20060717-14:41:58  20060717-14:41:58  1      crawl-legislacao_copia/segments/20060717144154
060803 132735 true     20060717-14:42:03  20060717-14:43:22  77     crawl-legislacao_copia/segments/20060717144201
060803 132735 true     20060717-14:43:29  20060717-15:08:10  1464   crawl-legislacao_copia/segments/20060717144327
060803 132735 true     20060717-15:08:17  20060717-15:11:58  223    crawl-legislacao_copia/segments/20060717150815
060803 132736 true     20060718-09:02:56  20060718-09:03:10  5      crawl-legislacao_copia/segments/20060718090250
060803 132736 true     20060803-10:55:18  20060803-12:53:49  1541   crawl-legislacao_copia/segments/20060803105509
060803 132736 true     20060803-13:07:15  20060803-13:07:20  4      crawl-legislacao_copia/segments/20060803130707
060803 132736 TOTAL: 3315 entries in 7 segments.

My db.default.fetch.interval is 15. Before I ran the recrawl script I had 5
segments (200607*) and the index pointed to 1537 documents. After running the
recrawl, 2 segments were created and the script indexed everything. When I
analyzed the generated index, I saw it had 1541 documents. As you can
see, the 200607* segments are old and could be deleted. I did this:

rm -rf segments/200607*

Then I got the NPE. Right, I have to re-index the 2 remaining segments. I
did this. So I analyzed the index again, and it has only 1417 documents!

My questions:
Why does this happen? How can I know which segments can be deleted?

I hope you can help me

On 8/3/06, Marko Bauhardt [EMAIL PROTECTED] wrote:


Hi,
if you delete segments, be sure that you don't have an index
built from those segments.
The segment contains the parsed content, and the index is built
from this content. If you delete the segment and then search
this index, an NPE occurs because no summary (parsed content) can be
found.
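
As a cautious first step before deleting anything, old segments can be listed by the date in their directory name (the segread listing in this thread shows names of the form YYYYMMDDhhmmss). The sketch below is only that: `list_old_segments` is a hypothetical helper, and age alone does not prove a segment is unused. In the case reported in this thread, removing the 200607* segments and re-indexing reduced the index from 1541 to 1417 documents.

```bash
#!/bin/bash
# Hypothetical helper: prints segment directories whose timestamp name
# (YYYYMMDDhhmmss, as in the segread listing) is older than a cutoff date.
# It only lists candidates; it deletes nothing.
list_old_segments() {
  local segments_dir=$1 cutoff=$2   # cutoff as YYYYMMDD, e.g. 20060801
  local seg name
  for seg in "$segments_dir"/*/; do
    [ -d "$seg" ] || continue
    name=$(basename "$seg")
    # compare the leading 8 digits (the date part) as strings
    if [[ $name =~ ^[0-9]{14}$ && ${name:0:8} < $cutoff ]]; then
      echo "${seg%/}"
    fi
  done
}

# Example: list_old_segments crawl-legislacao_copia/segments 20060801
```

Whatever is removed afterwards, the index has to be rebuilt from the segments that remain, otherwise searches hit the NPE described above.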


HTH
Marko



Am 03.08.2006 um 16:33 schrieb Lourival Júnior:

 Why when I delete some segments that reach the
 db.default.fetcth.intervalthe search application gets the
 nullPointerException? Periodically I have to
 recrawl my Site. And delete old segments is a problem. Someone have a
 suggestion?

 Regards

 --
 Lourival Junior
 Universidade Federal do Pará
 Curso de Bacharelado em Sistemas de Informação
 http://www.ufpa.br/cbsi
 Msn: [EMAIL PROTECTED]






ZIP plugin in nutch 0.7.2

2006-08-03 Thread Lourival Júnior


Hi all!!

Could I use the zip plugin from nutch 0.8 in nutch 0.7.2? Is there any
problem?

Regards.


java.lang.NoClassDefFoundError

2006-07-28 Thread Lourival Júnior


Hi all!
I'm testing Nutch 0.8, but I get this error with this simple command:

$ bin/nutch readdb
java.lang.NoClassDefFoundError: and
Exception in thread main

I've set the NUTCH_JAVA_HOME variable, but I'm sure it is the root cause of
this.

What is occurring?


Total time of a search

2006-07-27 Thread Lourival Júnior


Hi,

Does anybody know how to calculate the total time of a search? Currently I use
this, but I'm not sure about it:

Date d = new Date();
int iniTime = (int) d.getTime(); // get the start time of the index search

// The index search is executed here.
try{
   hits = bean.search(query, start + hitsToRetrieve, hitsPerSite, site,
sort, reverse);
} catch (IOException e){
   hits = new Hits(0,new Hit[0]);
}


int end = (int)Math.min(hits.getLength(), start + hitsPerPage);
int length = end-start;
int realEnd = (int)Math.min(hits.getLength(), start + hitsToRetrieve);


Hit[] show = hits.getHits(start, realEnd-start);
HitDetails[] details = bean.getDetails(show);
String[] summaries = bean.getSummary(details, query);

Date d2 = new Date();
int endTime = (int) d2.getTime();
int totalTime = endTime-iniTime; // execution time in milliseconds
double totalTimeInSec = (double)totalTime/(double)1000;

Is it correct?


Re: installation de nutch

2006-07-26 Thread Lourival Júnior


Try deleting the crawl directory in /root/nutch-0.7.2/, then run the command
again.
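
If the old crawl data might still be needed, a variation on this advice is to rename the directory instead of deleting it. A minimal sketch, assuming a hypothetical helper named `move_aside`; the crawl command in the comment is the one from the quoted message:

```bash
#!/bin/bash
# Hypothetical helper: if the target crawl directory already exists,
# rename it with a timestamp suffix instead of deleting it outright.
move_aside() {
  local dir=$1
  if [ -e "$dir" ]; then
    local backup="$dir.$(date +%Y%m%d%H%M%S)"
    mv -- "$dir" "$backup" && echo "moved $dir to $backup"
  fi
}

# Example:
#   move_aside crawl
#   bin/nutch crawl urls -dir crawl -depth 3
```

The crawl tool then starts from an empty target, which is what the "crawl already exists" check requires.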

On 7/26/06, kawther khazri [EMAIL PROTECTED] wrote:



Hi
I am trying to run Nutch by following the instructions
given in the tutorial.
The environment is FEDORA 5, JDK 1.4.2 and Nutch 0.7.2
And of course Tomcat 5

I get the following errors:


[EMAIL PROTECTED] ~]# /root/nutch-0.7.2/bin/nutch crawl urls -dir crawl
-depth 3 -topN50
run java in /usr/lib/jvm/jre
060726 141458 parsing file:/root/nutch-0.7.2/conf/nutch-default.xml
060726 141458 parsing file:/root/nutch-0.7.2/conf/crawl-tool.xml
060726 141458 parsing file:/root/nutch-0.7.2/conf/nutch-site.xml
060726 141458 No FS indicated, using default:local
Exception in thread main java.lang.RuntimeException: crawl already
exists.
   at org.apache.nutch.tools.CrawlTool.main (CrawlTool.java:121)

I really appreciate any help that I can get. Thanks a lot









Re: Recrawl script for 0.8.0 completed...

2006-07-25 Thread Lourival Júnior


Do you mean that this error occurs only on Windows? I haven't tested on
Linux yet. Does anyone have a solution for this problem on Windows/Tomcat?

On 7/25/06, Thomas Delnoij [EMAIL PROTECTED] wrote:


Lourival.

I have typically seen the same issues on a cygwin/windows setup. The
only thing that worked for me was shutting down and restarting tomcat,
instead of just reloading the context. On linux now I don't have these
issues anymore.

Rgrds, Thomas

On 7/21/06, Lourival Júnior [EMAIL PROTECTED] wrote:
 Ok. However, a few minutes ago I ran the script exactly as you said and I
 still get this error:

 Exception in thread "main" java.io.IOException: Cannot delete _0.f0
     at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
     at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:176)
     at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
     at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
     at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

 I don't know, but I think it occurs because Nutch tries to delete a file
 that Tomcat has loaded into memory, giving a permission error. Any idea?

 On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote:
 
  Lourival Júnior wrote:
   I thing it wont work with me because i'm using the Nutch version
0.7.2.
   Actually I use this script (some comments are in Portuguese):
  
   #!/bin/bash
  
   # A simple script to run a Nutch re-crawl
   # Fonte do script:
  
 
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
  
   #{
  
   if [ -n $1 ]
   then
crawl_dir=$1
   else
echo Usage: recrawl crawl_dir [depth] [adddays]
exit 1
   fi
  
   if [ -n $2 ]
   then
depth=$2
   else
depth=5
   fi
  
   if [ -n $3 ]
   then
adddays=$3
   else
adddays=0
   fi
  
   webdb_dir=$crawl_dir/db
   segments_dir=$crawl_dir/segments
   index_dir=$crawl_dir/index
  
   #Para o serviço do TomCat
   #net stop Apache Tomcat
  
   # The generate/fetch/update cycle
   for ((i=1; i <= depth ; i++))
   do
bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
segment=`ls -d $segments_dir/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb $webdb_dir $segment
echo
echo Fim do ciclo $i.
echo
   done
  
   # Update segments
   echo
   echo Atualizando os Segmentos...
   echo
   mkdir tmp
   bin/nutch updatesegs $webdb_dir $segments_dir tmp
   rm -R tmp
  
   # Index segments
   echo Indexando os segmentos...
   echo
   for segment in `ls -d $segments_dir/* | tail -$depth`
   do
bin/nutch index $segment
   done
  
   # De-duplicate indexes
   # bogus argument is ignored but needed due to
   # a bug in the number of args expected
   bin/nutch dedup $segments_dir bogus
  
   # Merge indexes
   #echo Unindo os segmentos...
   #echo
   ls -d $segments_dir/* | xargs bin/nutch merge $index_dir
  
   chmod 777 -R $index_dir
  
   #Inicia o serviço do TomCat
   #net start Apache Tomcat
  
   echo Fim.
  
   #} > recrawl.log 2>&1
  
   How you suggested I used the touch command instead stops the tomcat.
   However
   I get that error posted in previous message. I'm running nutch in
  windows
   plataform with cygwin. I only get no errors when I stops the tomcat.
I
   use
   this command to call the script:
  
   ./recrawl crawl-legislacao 1
  
   Could you give me more clarifications?
  
   Thanks a lot!
  
   On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote:
  
   Lourival Júnior wrote:
Hi Renaud!
   
I'm newbie with shell scripts and I know stops tomcat service is
   not the
better way to do this. The problem is, when a run the re-crawl
script
with
tomcat started I get this error:
   
060721 132224 merging segment indexes to: crawl-legislacao2\index
Exception in thread main java.io.IOException: Cannot delete
_0.f0
   at
org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
   at
   org.apache.lucene.store.FSDirectory.init(FSDirectory.java:176)
   at
org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java
:141)
   at
org.apache.lucene.index.IndexWriter.init(IndexWriter.java:225)
   at org.apache.nutch.indexer.IndexMerger.merge(
IndexMerger.java
   :92)
   at org.apache.nutch.indexer.IndexMerger.main(
IndexMerger.java
   :160)
   
So, I want another way to re-crawl my pages without this error
and
without
restarting the tomcat. Could you suggest one?
   
Thanks a lot!
   
   
   Try this updated script and tell me what command exactly you run to
  call
   the script. Let me know the error message then.
  
   Matt
  
  
   #!/bin/bash
  
   # Nutch recrawl script.
   # Based on 0.7.2 script at
  
 
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
  
   # Modified by Matthew Holt
  
   if [ -n $1

Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior


Hi Matt!

In the article found at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html you
said the re-crawl script has a problem with updating the live search
index. In my tests with Nutch version 0.7.2, when I run the script the index
cannot be updated because Tomcat loads it into memory. Could you
suggest a modification to this script or to the NutchBean that accepts
modifications to the index without restarting Tomcat? (Currently, I use net stop
Apache Tomcat before updating the index...)

Thanks

On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote:


Thanks for putting up with all the messages to the list... Here is the
recrawl script for 0.8.0 if anyone is interested.
Matt
---

#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
# Modified by Matthew Holt

if [ -n "$1" ]
then
  crawl_dir=$1
else
  echo "Usage: recrawl crawl_dir [depth] [adddays]"
  exit 1
fi

if [ -n "$2" ]
then
  depth=$2
else
  depth=5
fi

if [ -n "$3" ]
then
  adddays=$3
else
  adddays=0
fi


# EDIT THIS - List the location to your nutch servlet container.
nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/

# No need to edit anything past this line #
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  bin/nutch generate "$webdb_dir" "$segments_dir" -adddays "$adddays"
  # Segment names are timestamps, so the last one listed is the newest
  segment=`ls -d "$segments_dir"/* | tail -1`
  bin/nutch fetch "$segment"
  bin/nutch updatedb "$webdb_dir" "$segment"
done

# Update segments
bin/nutch invertlinks "$linkdb_dir" -dir "$segments_dir"

# Index segments
new_indexes=$crawl_dir/newindexes
#ls -d $segments_dir/* | tail -$depth | xargs
bin/nutch index "$new_indexes" "$webdb_dir" "$linkdb_dir" "$segments_dir"/*

# De-duplicate indexes
bin/nutch dedup "$new_indexes"

# Merge indexes
bin/nutch merge "$index_dir" "$new_indexes"

# Tell Tomcat to reload index
touch "$nutch_dir/WEB-INF/web.xml"

# Clean up
rm -rf "$new_indexes"
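
The three if/else blocks at the top of the script can also be written with shell parameter defaults. The following is a sketch of that idiom only; `parse_recrawl_args` is a hypothetical name, and the defaults (depth 5, adddays 0) are the ones used in the script above.

```bash
#!/bin/bash
# Sketch of the script's argument handling using parameter defaults.
# parse_recrawl_args is a hypothetical name; the defaults (depth 5,
# adddays 0) are the ones used in the recrawl script.
parse_recrawl_args() {
  if [ -z "$1" ]; then
    echo "Usage: recrawl crawl_dir [depth] [adddays]" >&2
    return 1
  fi
  crawl_dir=$1
  depth=${2:-5}
  adddays=${3:-0}
}

parse_recrawl_args crawl-legislacao 1 && echo "$crawl_dir $depth $adddays"
# prints: crawl-legislacao 1 0
```

Quoting "$1" matters here: an unquoted test such as [ -n $1 ] is true even when no argument is given, so the usage message would never be shown.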






Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior


Hi Renaud!

I'm a newbie with shell scripts and I know stopping the Tomcat service is not
the best way to do this. The problem is, when I run the re-crawl script with
Tomcat started I get this error:

060721 132224 merging segment indexes to: crawl-legislacao2\index
Exception in thread "main" java.io.IOException: Cannot delete _0.f0
   at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
   at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:176)
   at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
   at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
   at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
   at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

So, I want another way to re-crawl my pages without this error and without
restarting Tomcat. Could you suggest one?

Thanks a lot!

On 7/21/06, Renaud Richardet [EMAIL PROTECTED] wrote:


Hi Matt and Lourival,

Matt, thank you for the recrawl script. Any plans to commit it to trunk?

Lourival, here's the part of the script that reloads Tomcat. It's not the
cleanest, but it should work:
# Tell Tomcat to reload index
touch $nutch_dir/WEB-INF/web.xml
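
A slightly more defensive version of that step could look like the sketch below. `reload_webapp` is a hypothetical wrapper, and it assumes Tomcat is configured to reload a context when its WEB-INF/web.xml changes; it does not help if Tomcat still holds the old index files open, as reported on Windows elsewhere in this thread.

```bash
#!/bin/bash
# Hypothetical wrapper around the touch command above. It assumes Tomcat is
# configured to reload a context when WEB-INF/web.xml changes.
reload_webapp() {
  local webapp_dir=${1%/}
  local descriptor="$webapp_dir/WEB-INF/web.xml"
  if [ ! -f "$descriptor" ]; then
    echo "no web.xml under $webapp_dir; check the path" >&2
    return 1
  fi
  touch "$descriptor"
}

# Example, with the path from the recrawl script:
#   reload_webapp /usr/local/apache-tomcat-5.5.17/webapps/nutch/
```

Checking for the file first avoids creating an empty web.xml when nutch_dir points at the wrong directory.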

HTH,
Renaud


Lourival Júnior wrote:
 Hi Matt!

 In the article found at

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.htmlyou

 said the re-crawl script have a problem with updating the live search
 index. In my tests with Nutch version 0.7.2 when I run the script the
 index
 could not be update because the tomcat loads it to the memory. Could you
 suggest a modification to this script or to the NutchBean that accepts
 modifications to the index without restart tomcat (Actually, I use net
 stop
 Apache Tomcat before the index updation...)?

 Thanks

 On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote:

 Thanks for putting up with all the messages to the list... Here is the
 recrawl script for 0.8.0 if anyone is interested.
 Matt
 ---

 #!/bin/bash

 # Nutch recrawl script.
 # Based on 0.7.2 script at

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

 # Modified by Matthew Holt

 if [ -n $1 ]
 then
   crawl_dir=$1
 else
   echo Usage: recrawl crawl_dir [depth] [adddays]
   exit 1
 fi

 if [ -n $2 ]
 then
   depth=$2
 else
   depth=5
 fi

 if [ -n $3 ]
 then
   adddays=$3
 else
   adddays=0
 fi


 # EDIT THIS - List the location to your nutch servlet container.
 nutch_dir=/usr/local/apache-tomcat-5.5.17/webapps/nutch/

 # No need to edit anything past this line #
 webdb_dir=$crawl_dir/crawldb
 segments_dir=$crawl_dir/segments
 linkdb_dir=$crawl_dir/linkdb
 index_dir=$crawl_dir/index

 # The generate/fetch/update cycle
 for ((i=1; i <= depth ; i++))
 do
   bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
   segment=`ls -d $segments_dir/* | tail -1`
   bin/nutch fetch $segment
   bin/nutch updatedb $webdb_dir $segment
 done

 # Update segments
 bin/nutch invertlinks $linkdb_dir -dir $segments_dir

 # Index segments
 new_indexes=$crawl_dir/newindexes
 #ls -d $segments_dir/* | tail -$depth | xargs
 bin/nutch index $new_indexes $webdb_dir $linkdb_dir $segments_dir/*

 # De-duplicate indexes
 bin/nutch dedup $new_indexes

 # Merge indexes
 bin/nutch merge $index_dir $new_indexes

 # Tell Tomcat to reload index
 touch $nutch_dir/WEB-INF/web.xml

 # Clean up
 rm -rf $new_indexes





--
Renaud Richardet
COO America
Wyona Inc.  -   Open Source Content Management   -   Apache Lenya
office +1 857 776-3195 mobile +1 617 230 9112
renaud.richardet at wyona.com  http://www.wyona.com






Re: Recrawl script for 0.8.0 completed...

2006-07-21 Thread Lourival Júnior


Ok. However, a few minutes ago I ran the script exactly as you said and I still
get this error:

Exception in thread "main" java.io.IOException: Cannot delete _0.f0
   at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
   at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:176)
   at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
   at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
   at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
   at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)

I don't know, but I think it occurs because Nutch tries to delete a file
that Tomcat has loaded into memory, giving a permission error. Any idea?

On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote:


Lourival Júnior wrote:
 I think it won't work for me because I'm using Nutch version 0.7.2.
 Currently I use this script (some comments are in Portuguese):

 #!/bin/bash

 # A simple script to run a Nutch re-crawl
 # Fonte do script:

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

 #{

 if [ -n "$1" ]
 then
  crawl_dir=$1
 else
  echo Usage: recrawl crawl_dir [depth] [adddays]
  exit 1
 fi

 if [ -n "$2" ]
 then
  depth=$2
 else
  depth=5
 fi

 if [ -n "$3" ]
 then
  adddays=$3
 else
  adddays=0
 fi

 webdb_dir=$crawl_dir/db
 segments_dir=$crawl_dir/segments
 index_dir=$crawl_dir/index

 #Para o serviço do TomCat
 #net stop Apache Tomcat

 # The generate/fetch/update cycle
 for ((i=1; i <= depth ; i++))
 do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
  echo
  echo Fim do ciclo $i.
  echo
 done

 # Update segments
 echo
 echo Atualizando os Segmentos...
 echo
 mkdir tmp
 bin/nutch updatesegs $webdb_dir $segments_dir tmp
 rm -R tmp

 # Index segments
 echo Indexando os segmentos...
 echo
 for segment in `ls -d $segments_dir/* | tail -$depth`
 do
  bin/nutch index $segment
 done

 # De-duplicate indexes
 # bogus argument is ignored but needed due to
 # a bug in the number of args expected
 bin/nutch dedup $segments_dir bogus

 # Merge indexes
 #echo Unindo os segmentos...
 #echo
 ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

 chmod 777 -R $index_dir

 #Inicia o serviço do TomCat
 #net start Apache Tomcat

 echo Fim.

 #} > recrawl.log 2>&1

 As you suggested, I used the touch command instead of stopping Tomcat.
 However, I get the error posted in my previous message. I'm running Nutch on
 the Windows platform with Cygwin. I only get no errors when I stop Tomcat. I
 use this command to call the script:

 ./recrawl crawl-legislacao 1

 Could you give me more clarifications?

 Thanks a lot!

 On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote:

 Lourival Júnior wrote:
  Hi Renaud!
 
  I'm newbie with shell scripts and I know stops tomcat service is
 not the
  better way to do this. The problem is, when a run the re-crawl script
  with
  tomcat started I get this error:
 
  060721 132224 merging segment indexes to: crawl-legislacao2\index
  Exception in thread "main" java.io.IOException: Cannot delete _0.f0
      at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:195)
      at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:176)
      at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:141)
      at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:225)
      at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:92)
      at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:160)
 
  So I would like another way to re-crawl my pages without this error and
  without restarting Tomcat. Could you suggest one?
 
  Thanks a lot!
 
 
 Try this updated script and tell me exactly what command you run to call
 the script. Then let me know the error message.

 Matt


 #!/bin/bash

 # Nutch recrawl script.
 # Based on 0.7.2 script at
 # http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
 # Modified by Matthew Holt

 # Quote the parameters so an empty argument is detected, and quote the
 # usage text so that "#" and "[...]" are printed literally.
 if [ -n "$1" ]
 then
   nutch_dir=$1
 else
   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
   echo "crawl_dir - Name of the directory the crawl is located in."
   echo "[depth] - The link depth from the root page that should be crawled."
   echo "[adddays] - Advance the clock # of days for fetchlist generation."
   exit 1
 fi

 if [ -n "$2" ]
 then
   crawl_dir=$2
 else
   echo "Usage: recrawl servlet_path crawl_dir [depth] [adddays]"
   echo "servlet_path - Path of the nutch servlet (i.e. /usr/local/tomcat/webapps/ROOT)"
   echo "crawl_dir - Name of the directory the crawl is located in."
   echo "[depth] - The link depth from the root page that should be crawled."
   echo "[adddays] - Advance the clock # of days for fetchlist generation."
   exit 1
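
 The argument checks in this script depend on `[ -n ... ]` seeing an empty string when a parameter is missing. A small sketch of how the test behaves with and without quotes around the parameter; the function names are hypothetical and exist only for this demo:

```shell
#!/bin/bash
# Sketch: both functions are called with no arguments, so $1 is empty.

check_unquoted() {
  # With $1 empty this collapses to `[ -n ]`, a one-argument test that is
  # true because the string "-n" itself is non-empty.
  if [ -n $1 ]; then echo "set"; else echo "empty"; fi
}

check_quoted() {
  # The quotes keep the empty string as an argument, so -n can reject it.
  if [ -n "$1" ]; then echo "set"; else echo "empty"; fi
}

unquoted_result=$(check_unquoted)
quoted_result=$(check_quoted)
echo "unquoted test reports: $unquoted_result"
echo "quoted test reports:   $quoted_result"
```

 With the unquoted form the usage message is never shown, which is why a missing argument can go unnoticed until a later command fails.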

Unused Segments

2006-07-14 Thread Lourival Júnior


How can I discover which segments are unused by the index? After many
recrawls I have a lot of segments, so I would like to delete some of them.

Can anyone help me?


Recrawl a specific web Page

2006-07-13 Thread Lourival Júnior


How can I recrawl a specific web page? For example, I have an HTML page that
is constantly updated. Is there a command for that?


Re: question about plugins

2006-07-11 Thread Lourival Júnior


Hi! Don't worry, I know what you mean. You have to modify the nutch-site.xml
configuration file in the conf directory. Take a look at an example:

<nutch-conf>
<property>
 <name>plugin.includes</name>
 <value>nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to include.
 Any plugin not matching this expression is excluded.</description>
</property>
<property>
 <name>http.content.limit</name>
 <value>-1</value>
 <description>The length limit for downloaded content, in bytes.
 If this value is nonnegative (>=0), content longer than it will be truncated;
 otherwise, no truncation at all.
 </description>
</property>
</nutch-conf>

Regards,

Lourival Junior

On 7/11/06, Abdelhakim Diab [EMAIL PROTECTED] wrote:


How can I add a plugin to my search application, and how can I activate it?
Sorry if my question is stupid, but I am new to Nutch.






Re: OpenOffice Support?

2006-07-11 Thread Lourival Júnior


Taking advantage of your question: does anyone know whether Nutch 0.7.2
supports the zip plugin? If so, where can I find it?

Lourival Junior

On 7/11/06, Matthew Holt [EMAIL PROTECTED] wrote:


Just wondering, has anyone done any work on a plugin (or is anyone aware of
a plugin) that supports the indexing of OpenOffice documents? Thanks.
Matt






Number of pages different to number of indexed pages

2006-07-07 Thread Lourival Júnior


Hi all!

I have a quick question. My WebDB currently contains 779 pages with 899
links. When I use the segread command, it also reports 779 pages in one
segment. However, when I run a search or open the index in Luke, the
maximum number of documents is 437. I looked at the recrawl logs, and while
the script is fetching pages, some of them show this message:

... failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
Exceeded http.max.delays: retry later.

I think this happens because of some network problem: the fetcher tries to
fetch a page but does not get it. Because of this, when the segment is
indexed, only the pages that were fetched appear in the results. This is a
problem for me.

Could someone explain what I should do to refetch these pages and increase
my search results? Should I change the http.max.delays and
fetcher.server.delay properties in nutch-default.xml?

Regards,



Re: Number of pages different to number of indexed pages

2006-07-07 Thread Lourival Júnior


Yes! It really works! I'm running the recrawl right now, and it is fetching
the pages that it hadn't fetched yet. It takes longer, but the final
result is more important.

Thanks a lot!

On 7/7/06, Honda-Search Administrator [EMAIL PROTECTED] wrote:


This is typical if you are crawling only a few sites. I crawl 7 sites
nightly and often get this error. I changed my http.max.delays property
from 3 to 50 and it works without a problem. The crawl takes longer, but I
get almost all of the pages.
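
To see how many pages a crawl skipped for this reason, one option is to count the error in the recrawl log before and after changing the property. A minimal sketch, assuming the crawl output was redirected to a log file; the sample log lines are made up for the demo, and only the error text is taken from the message in this thread:

```shell
#!/bin/bash
# Sketch: the log below is fabricated; a real recrawl log would be used instead.
log=$(mktemp)
cat > "$log" <<'EOF'
fetching page-a
fetch of page-b failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
fetching page-c
fetch of page-d failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
EOF

# Count the lines where the fetcher gave up after hitting the delay limit.
# -F treats the pattern as a fixed string, -c prints the number of matches.
retry_failures=$(grep -cF 'Exceeded http.max.delays' "$log")
echo "pages skipped by http.max.delays: $retry_failures"

rm "$log"
```

If the count drops after the property is raised, the missing pages were being lost to the delay limit rather than to some other fetch error.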

- Original Message -
From: Lourival Júnior [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Friday, July 07, 2006 10:20 AM
Subject: Number of pages different to number of indexed pages


Hi all!

I have a quick question. My WebDB currently contains 779 pages with 899
links. When I use the segread command, it also reports 779 pages in one
segment. However, when I run a search or open the index in Luke, the
maximum number of documents is 437. I looked at the recrawl logs, and while
the script is fetching pages, some of them show this message:

... failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
Exceeded http.max.delays: retry later.

I think this happens because of some network problem: the fetcher tries to
fetch a page but does not get it. Because of this, when the segment is
indexed, only the pages that were fetched appear in the results. This is a
problem for me.

Could someone explain what I should do to refetch these pages and increase
my search results? Should I change the http.max.delays and
fetcher.server.delay properties in nutch-default.xml?

Regards,



Index algorithm

2006-07-07 Thread Lourival Júnior


Could anyone point me to a link or document about Nutch's indexing
algorithm? I haven't found many...

Regards

