You can use the plugins index-more and query-more to create a field on your
index indicating the file type of the document. Then, in your search, you can
use type:pdf or type:msword to filter these files. I used Nutch 0.7.2 to
make it work...
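For reference, a sketch of how these plugins might be enabled in nutch-site.xml. The plugin.includes value below is illustrative, not a verbatim default: keep whatever plugins your setup already lists and make sure index-more and query-more appear in the regex.

```xml
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- illustrative value: your existing plugin list, plus
         index-more (adds the type field at index time) and
         query-more (makes type: usable in queries) -->
    <value>protocol-http|parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more)</value>
  </property>
</nutch-conf>
```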
Regards,
Lourival Júnior
On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote:
Thanks. Can you please tell me how can I plugin in my own handling
when nutch sees a site instead of building the search database for
that site?
On 4/3/07, Lourival Júnior [EMAIL PROTECTED] wrote:
I am totally certain that Nutch is what you are looking for. Take a look
at Nutch's documentation for more details and you will see :).
On 4/3/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
Hi,
I would like to know if it is a good idea to use the Nutch web
crawler?
Basically, this is Nutch 0.8, just checked out.
Exception in thread "main" java.lang.NoClassDefFoundError
This is on OS X 10.4.7. Older nutch runs fine.
--- Lourival Júnior [EMAIL PROTECTED] wrote:
Hi all!
I'm testing Nutch 0.8, but I get this error on
this simple command:
$ bin/nutch readdb
Have you reindexed your segments? It's important, because it makes Nutch
recognize your common terms. I've tried it, and the only thing I noticed was
that the index size is larger than the original (before using the common
terms).
On 9/25/06, carmmello [EMAIL PROTECTED] wrote:
I'm using Nutch
again
- Original Message -
From: Lourival Júnior [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Monday, September 25, 2006 3:58 PM
Subject: Re: Common terms
Have you reindexed your segments? It's important, because it makes Nutch
recognize your common terms. I've tried
Hi all!
Has anyone successfully implemented the ZIP plugin in Nutch version 0.7.2? How
can I do this?
Regards,
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Yes Cam, if you use depth 1 you will crawl only the first document. With
depth 2 you will crawl the first document and all the links found on that
document. With depth 3, you will crawl the first one, its links, and all
links found in cycle 2. And so on. Increasing your depth will increase
it unless you write a plugin to parse a custom meta tag
called category.
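The depth semantics described above can be sketched as a breadth-first traversal over a toy link graph (the graph and all names below are made up for illustration, not Nutch code):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthDemo {
    // Hypothetical link graph for illustration: page -> its outlinks.
    static Map<String, List<String>> links = Map.of(
            "seed", List.of("a", "b"),
            "a", List.of("c"),
            "b", List.of(),
            "c", List.of());

    // Pages fetched for a given crawl depth, following the semantics
    // above: depth 1 fetches only the seed; each extra level fetches
    // the outlinks discovered in the previous round.
    static Set<String> crawl(String seed, int depth) {
        Set<String> fetched = new LinkedHashSet<>();
        List<String> frontier = List.of(seed);
        for (int round = 0; round < depth; round++) {
            List<String> next = new ArrayList<>();
            for (String page : frontier) {
                if (fetched.add(page)) { // fetch each page only once
                    next.addAll(links.getOrDefault(page, List.of()));
                }
            }
            frontier = next;
        }
        return fetched;
    }

    public static void main(String[] args) {
        System.out.println(crawl("seed", 1)); // [seed]
        System.out.println(crawl("seed", 2)); // [seed, a, b]
        System.out.println(crawl("seed", 3)); // [seed, a, b, c]
    }
}
```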
I'm trying to do something like this now, but the plugin documentation
is horrible.
Hi Ernesto!
I know what you mean. Sometimes I get no answers too. Unfortunately, I'm new
to Nutch and Lucene and I can't help you. Keep trying, the community will
help you :).
On 8/22/06, Ernesto De Santis [EMAIL PROTECTED] wrote:
Hi All
Please, can somebody answer my questions?
I'm a
Has anyone succeeded in implementing the Zip parse plugin in Nutch 0.7.2?
Regards
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
to the point that you won't be able
to answer STFW questions on mailing lists... etc. :-)
Regards,
Lukas
On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote:
Yes yes, I tested the index-more and query-more plugins. They allow you to
search these fields easily. However, if I could find some documentation
Hi Timo!
Thanks a lot! Now I have a clear understanding of this file. This article
helps a lot too: http://searchenginewatch.com/showPage.html?page=2156061
Thanks again!
On 8/11/06, Timo Scheuer [EMAIL PROTECTED] wrote:
Hi,
Could anyone explain to me what exactly the common-terms.utf8
to find out what exactly it does. As far
as I know it does not add any new field into the index (that should be done
via the index-more plugin), but it allows you to query using type:, date:,
and site:, I think.
Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
What exactly does the query-more
Hi Timo!
I analyzed the index before and after correctly using the
common-terms.utf8 file. Before adding the common terms in my language,
my index was about 3 MB.
After adding the common terms it is now 5 MB! Why does this occur?
Regards!
On 8/11/06, Lourival Júnior [EMAIL PROTECTED] wrote:
Hi Timo
Hi,
Could anyone explain to me what exactly the common-terms.utf8 file does? I
don't understand the real functionality of this file...
Regards,
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Hi Lukas and everybody!
Do you know which file in Nutch 0.7.2 I should edit to add some field to my
index (i.e. file type - PDF, Word or HTML)?
On 8/8/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
Hi,
I am not sure if I can give you any useful hint, but the following is
what once worked for me.
formats, [query-more] allows you to
use the [type:] filter in Nutch queries.
Regards,
Lukas
On 8/9/06, Lourival Júnior [EMAIL PROTECTED] wrote:
Hi Lukas and everybody!
Do you know which file in Nutch 0.7.2 I should edit to add some field to
my index (i.e. file type - PDF, Word or HTML)?
On 8/8/06
Hi Nahuel!
You could use the command bin/nutch inject $nutch-dir/db -urlfile
urlfile.txt. To recrawl your WebDB you can use this script:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
Take a look at the adddays argument and at the configuration property
Hi Matthew!
Could you update the script to version 0.7.2 with the same
functionality? I wrote a script that does this, but it doesn't work very
well...
Regards!
On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote:
Just letting everyone know that I updated the recrawl script on the
Wiki. It now
Which version are you using?
On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:
But the websites I just added haven't been crawled yet... And they're not
crawled during the recrawl...
Will bin/nutch purge restart everything?
On Thu, 3 Aug 2006 09:21:04 -0300,
Lourival Júnior [EMAIL PROTECTED
The command bin/nutch purge doesn't exist. Well, I can't tell you what is
happening. Give me the output from when you run the recrawl.
On 8/3/06, Nahuel ANGELINETTI [EMAIL PROTECTED] wrote:
0.7.2 of nutch
On Thu, 3 Aug 2006 09:37:24 -0300,
Lourival Júnior [EMAIL PROTECTED] wrote:
Which version
Why, when I delete some segments that reach the
db.default.fetch.interval, does the search application get a
NullPointerException? Periodically I have to
recrawl my site, and deleting old segments is a problem. Does anyone have a
suggestion?
Regards
--
Lourival Junior
Universidade Federal do Pará
Curso
The segment contains the parsed content, and the index is the index
built from this content. If you delete the segment and then do a search
on this index, an NPE occurs because no summary (parsed content) is
found.
HTH
Marko
Am 03.08.2006 um 16:33 schrieb Lourival Júnior:
Why when I delete some
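The segment/index relationship described in Marko's reply can be modeled in a few lines of Java. This is a toy sketch, not Nutch's real classes: the "index" maps doc ids to a segment name, and the "segment" map holds the parsed content, so deleting the segment while the index still lists the doc reproduces the NPE.

```java
import java.util.HashMap;
import java.util.Map;

public class SummaryLookup {
    // Toy model: index entries point at a segment; the segment map
    // stands in for the parsed content stored on disk.
    static Map<Integer, String> index = new HashMap<>();
    static Map<String, String> segments = new HashMap<>();

    static String summary(int docId) {
        String seg = index.get(docId);
        // NPE here when the index still lists the doc but its
        // segment (the parsed content) has been deleted
        return segments.get(seg).substring(0, 6);
    }

    public static void main(String[] args) {
        index.put(1, "seg1");
        segments.put("seg1", "parsed content");
        System.out.println(summary(1)); // prints "parsed"
        segments.remove("seg1"); // delete the segment, keep the index
        try {
            summary(1);
        } catch (NullPointerException e) {
            System.out.println("NPE: no summary found for doc 1");
        }
    }
}
```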
Hi all!!
Could I use the zip plugin from Nutch 0.8 in Nutch 0.7.2? Is there any
problem with that?
Regards.
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Hi all!
I'm testing Nutch 0.8, but I get this error on this simple command:
$ bin/nutch readdb
Exception in thread "main"
java.lang.NoClassDefFoundError
I've set the NUTCH_JAVA_HOME variable, but I'm not sure that is the root cause of
this.
What is occurring?
--
Lourival Junior
Universidade
Hi,
Does somebody know how to calculate the total time of a search? Currently I use
this, but I'm not sure about it:
Date d = new Date();
int iniTime = (int) d.getTime(); // get the start time of the search on the indexes
// Here the search on the indexes is executed.
try {
hits =
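A minimal, self-contained way to time a search in milliseconds. The searchAction parameter is a placeholder for the real index lookup; note that casting getTime() to int, as in the snippet above, truncates the epoch value, so a long should be used throughout.

```java
public class SearchTimer {
    // Run the given search action and return the elapsed wall-clock
    // time in milliseconds. Using long avoids the overflow caused by
    // (int) d.getTime().
    static long timeMillis(Runnable searchAction) {
        long start = System.currentTimeMillis();
        searchAction.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // placeholder for the real search on the indexes
            for (int i = 0; i < 1_000_000; i++) { Math.sqrt(i); }
        });
        System.out.println("Search took " + elapsed + " ms");
    }
}
```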
Try deleting the crawl directory in /root/nutch-0.7.2/, then run the command
again.
On 7/26/06, kawther khazri [EMAIL PROTECTED] wrote:
Hi
I am trying to run Nutch by following the instructions
given in the tutorial.
The environment is FEDORA 5, JDK 1.4.2 and Nutch 0.7.2
And of course Tomcat
that worked for me was shutting down and restarting Tomcat,
instead of just reloading the context. On Linux I don't have these
issues anymore.
Rgrds, Thomas
On 7/21/06, Lourival Júnior [EMAIL PROTECTED] wrote:
Ok. However, a few minutes ago I ran the script exactly as you said and I
still
get this error
Hi Matt!
In the article found at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
you said the re-crawl script has a problem with updating the live search
index. In my tests with Nutch version 0.7.2, when I run the script the index
could not be updated because Tomcat
$nutch_dir/WEB-INF/web.xml
HTH,
Renaud
Lourival Júnior wrote:
Hi Matt!
In the article found at
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
you said the re-crawl script has a problem with updating the live search
index. In my tests with Nutch version 0.7.2, when I
(IndexMerger.java:160)
I don't know, but I think it occurs because Nutch tries to delete some file
that Tomcat has loaded into memory, giving a permission access error. Any idea?
On 7/21/06, Matthew Holt [EMAIL PROTECTED] wrote:
Lourival Júnior wrote:
I think it won't work for me because I'm using the Nutch
How can I discover which segments are unused by the index? After many
recrawls I have a lot of segments, so I would like to erase some of them...
Who can help me?
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL
How can I recrawl a specific web page? For example, I have an HTML page that
is constantly updated. Is there a command for that?
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Hi! Don't worry, I know what you mean. You have to modify the
nutch-site.xml configuration file in the conf directory. Take a look at an
example:
<nutch-conf>
  <property>
    <name>plugin.includes</name>
Taking advantage of your question: does anyone know if version 0.7.2 of Nutch
supports the zip plugin? If so, where can I find it?
Lourival Junior
On 7/11/06, Matthew Holt [EMAIL PROTECTED] wrote:
Just wondering, has anyone done any work on a plugin (or aware of a
plugin) that supports the
Hi all!
I have a little doubt. My WebDB currently contains 779 pages with 899
links. When I use the segread command it also reports 779 pages in one
segment. However, when I do a search, or when I use the Luke tool, the
maximum number of documents is 437. I've looked at the recrawl logs and
nightly and often get this error. I changed my http.max.delays property
from 3 to 50 and it works without a problem. The crawl takes longer, but
I
get almost all of the pages.
- Original Message -
From: Lourival Júnior [EMAIL PROTECTED]
To: nutch-user
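For reference, the http.max.delays change described above would go in conf/nutch-site.xml; a minimal sketch (the values come from the message, the comment is a paraphrase of what the property controls):

```xml
<property>
  <name>http.max.delays</name>
  <!-- how many times the fetcher will wait out the politeness delay
       before giving up on a page; raised from 3 to 50 as described -->
  <value>50</value>
</property>
```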
Could anyone give me some link or document about Nutch's indexing algorithm? I
haven't found many...
Regards
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]