Re: Lucene keeps memory of old crawl ???

Annie Dumont Tue, 27 Jun 2006 22:49:55 -0700

Hi Solprovider, hi all,

Yes, I use Lenya 1.2.3.
I tried running the Indexer as "new", and that works.
Thank you for your help, Solprovider!
Regards, Annie


[EMAIL PROTECTED] wrote:

On 6/27/06, Annie Dumont <[EMAIL PROTECTED]> wrote:

I have a problem with Lucene.
If you go to our site and use the search form (on the right side)
to search for "biologie", look at result number 4:

Pas d'excerpt disponible ("No excerpt available"):
http://www.univ-reunion.fr/lenya_univ/universite/live/lenya_univ/
univ_reunion/live/formations/catalogue/ufr/sciences/licence/bop.html
/opt/tomcat55/webapps/lenya_univ/lenya/pubs/universite/work/search/
lucene/htdocs_dump/live/lenya_univ/univ_reunion/live/formations/
catalogue/ufr/sciences/licence/bop.html biologie
java.io.FileNotFoundException: /opt/tomcat55/webapps/lenya_univ/lenya/
pubs/universite/work/search/lucene/htdocs_dump/live/lenya_univ/
univ_reunion/live/formations/catalogue/ufr/sciences/licence/bop.html
(No such file or directory)

which is perfectly normal: work/search/lucene/htdocs_dump/live/
lenya_univ/univ_reunion has been deleted (these pages no longer
exist).
http://www.univ-reunion.fr/lenya_univ/universite/live/lenya_univ/
univ_reunion/live/formations/catalogue/ufr/sciences/licence/bop.html
is not in uri.txt
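
One way to confirm that a URL really is absent from uri.txt is a small check like the one below. It is only a sketch: it assumes uri.txt lists one URL per line, which this thread does not confirm, and the function name `url_listed` is hypothetical.

```python
from pathlib import Path


def url_listed(uri_file, url):
    """Return True if `url` appears as a whole line in `uri_file`.

    Assumption: the file holds one URL per line (not confirmed in this thread).
    """
    wanted = url.strip()
    # errors="replace" keeps the check from failing on unexpected bytes.
    with Path(uri_file).open(encoding="utf-8", errors="replace") as handle:
        return any(line.strip() == wanted for line in handle)
```

Call it with the path to uri.txt and the full URL written on a single line, without the line wrapping used in this mail.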

What do I have to do so that this type of URL no longer appears?
Where does Lucene read those old URLs from? Does anybody know?



I am assuming you are using a version of Lenya before 1.2.4, since
crawling is no longer part of the Search process starting with that
version.

Under work/search/lucene there should be two directories:
"htdocs_dump" holds complete copies of every page on every website crawled.
"index" is the index used by search.

(I rewrote the search algorithm to use the content directly as soon as
I realized the dump included the entire page, formatting and
navigation menus included, so I am not very familiar with the old
version, but...)

Deleting the "htdocs_dump" directory should have no effect on the
results.  Search should only look at the "index" directory when
creating a results Page.  To refresh the results, either delete the
"index" directory and rerun the Indexer, or run the Indexer as "new"
rather than "incremental".  If using the old code, you'll need to run
the crawler to recreate the "htdocs_dump" directory first.
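
The first option above (remove the "index" directory, then rerun the Indexer) could be sketched as a small Python helper. This is only a sketch under stated assumptions: the function name `retire_index` and the backup naming are hypothetical, and by default it renames the directory instead of deleting it so the old index can be restored. It does not run the Indexer; that step still has to be started the usual way for your Lenya version.

```python
import shutil
import time
from pathlib import Path


def retire_index(lucene_dir, keep_backup=True):
    """Move aside (or delete) the "index" directory under work/search/lucene.

    Returns the backup path, or None if no backup was kept or no index existed.
    """
    index_dir = Path(lucene_dir) / "index"
    if not index_dir.is_dir():
        return None  # nothing to retire
    if keep_backup:
        # Hypothetical naming scheme: index.old-<timestamp> next to the original.
        backup = index_dir.with_name("index.old-" + time.strftime("%Y%m%d%H%M%S"))
        index_dir.rename(backup)
        return backup
    shutil.rmtree(index_dir)
    return None
```

For the site discussed in this thread, `lucene_dir` would presumably be /opt/tomcat55/webapps/lenya_univ/lenya/pubs/universite/work/search/lucene (the path shown in the error message). It is probably safest to do this while no indexing run is in progress.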

The new Search code is available at:
http://solprovider.com/lenya/search

solprovider

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


