NPE when using explain
Hi,

I'm trying to use IndexSearcher.explain(Query query, int doc) and am getting an NPE. If I remove the "explain" call, the search works fine. I poked a little at the TermQuery.java code, but I can't really tell what's causing the exception. This is with 1.3rc3.

Exception in thread "main" java.lang.NullPointerException
        at org.apache.lucene.search.TermQuery$TermWeight.explain(TermQuery.java:142)
        at org.apache.lucene.search.BooleanQuery$BooleanWeight.explain(BooleanQuery.java:186)
        at org.apache.lucene.search.IndexSearcher.explain(IndexSearcher.java:196)
        at LuceneCli.search(LuceneCli.java:78)
        at LuceneLine.handleCommand(LuceneLine.java:188)
        at LuceneLine.(LuceneLine.java:117)
        at LuceneLine.main(LuceneLine.java:136)

The area of the code that caused this:

    Hits hits = initSearch(queryString);
    System.out.println(hits.length() + " total matching documents");
    final int HITS_PER_PAGE = 10;
    message("--");
    for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
        int end = Math.min(hits.length(), start + HITS_PER_PAGE);
        for (int ii = start; ii < end; ii++) {
            Document doc = hits.doc(ii);
            message(" " + ii + " score:" + hits.score(ii) + "-");
            if (explain) {
                Explanation exp = searcher.explain(query, ii);
                message("Explanation:" + exp.toString());
            }
            printHit(doc);
        }
    }

Regards,

Dror

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
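A likely cause, going by the stack trace: IndexSearcher.explain(Query, int) expects a Lucene-internal document number, while the loop variable ii above is a position in the Hits list, and the two are not the same thing. Assuming the 1.3 Hits API (which exposes the internal id via hits.id(int)), the fix would be a one-line change; this is a hedged sketch, not a verified patch:

```java
// Sketch: convert the hit position to Lucene's internal document
// number before asking for an explanation.
if (explain) {
    Explanation exp = searcher.explain(query, hits.id(ii));  // not: explain(query, ii)
    message("Explanation:" + exp.toString());
}
```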
Re: SearchBlox J2EE Search Component Version 1.1 released
On Tuesday 02 December 2003 09:51, Tun Lin wrote:
> Anyone knows a search engine that supports xml formats?

There's no way to generally "support XML formats", as XML is just a meta-language. However, building specific search engines on the Lucene core, it should be reasonably straightforward to implement more accurate XML-structure-aware tokenization for specific XML applications like DocBook or other domain-specific formats. So, if any search engine advertises "indexing XML content", one had better read the fine print to learn what they really claim.

It might be interesting to create a Lucene plug-in that, given a specification of how sub-trees under specific elements should be handled, would tokenize and index content into separate fields. The implementation shouldn't be very difficult: just use a standard XML parser (SAX, DOM), match XPaths, feed the text to an analyzer, and add it to the index. This could also be used for HTML (pre-filtering with JTidy or similar first to get XML-compliant HTML). I wouldn't be surprised if someone on the list has already done this?

-+ Tatu +-
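A minimal sketch of the element-to-field idea from the message above, using only the JDK's SAX parser (class and method names here are illustrative, not an existing plug-in); each collected entry would then become a separate field on a Lucene Document:

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: route the text under each element into a separate named field,
// ready to be added to a Lucene Document (that last step is omitted here).
public class XmlFieldSplitter {

    public static Map<String, String> split(String xml) throws Exception {
        final Map<String, StringBuilder> fields = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private String current;  // name of the element whose text we are in
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                current = qName;
            }
            @Override
            public void characters(char[] ch, int start, int len) {
                if (current != null) {
                    fields.computeIfAbsent(current, k -> new StringBuilder()).append(ch, start, len);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        Map<String, String> out = new LinkedHashMap<>();
        fields.forEach((k, v) -> out.put(k, v.toString().trim()));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<article><title>Lucene and XML</title><body>Index me.</body></article>";
        // Each entry would become e.g. a separate field on a Lucene Document.
        System.out.println(split(doc));
    }
}
```

A real plug-in would match full XPaths rather than bare element names, but the routing idea is the same.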
summary
Hi,

In the Lucene demo, the summary that is displayed contains text from inside HTML tags (like margin, top, left, and so on). How can I display an actual summary in the page that is related to the page description? Your help is appreciated.

Thanking you,
Mahesh
AW: Document Similarity
Hi,

>> Do they produce same ranking results?

No; Lucene's query-weight and length-normalization operations are not equivalent to a vanilla cosine in vector space.

>> I guess the 2nd approach will be more precise but slow.

Query similarity will indeed be faster, but may actually not be worse. A straightforward cosine without IDF weighting of terms (as Lucene does) will almost certainly be less precise if you have documents of different length: word-occurrence probabilities in texts of different lengths vary greatly, and the cosine of independent longer texts will often be greater than that of texts that actually have the same topic but are short, just because of randomly shared non-content words.

If, on the other hand, you choose the right TF/IDF weighting of terms, the cosine in this warped vector space could be (a) equivalent to the one Lucene computes (this requires some work), or (b) might even get better on average. However, the last time I counted, there were about 250 different TF/IDF formulas around in IR publications, machine learning, computational linguistics, and so on. Performance depends on domain and language. But if I were you, I would just start playing and have fun with the stuff...

Karsten

-----Original Message-----
From: Jing Su [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 2, 2003 18:12
To: [EMAIL PROTECTED]
Subject: Document Similarity

Hi,

I have read some posts in the user/developer archives about Lucene-based document similarity comparison. In summary, two approaches are mentioned:

1 - Construct the document as a query;
2 - Treat each document as a vector, then rank according to their distance (cosine).

Do they produce the same ranking results? Is there any other way to do this? I guess the 2nd approach will be more precise but slow.

Thanks,
Jing
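For concreteness, the vector/cosine approach (2) can be sketched in a few lines of plain Java (names are illustrative). The weights below are raw term frequencies; plugging in any of the many TF/IDF variants only changes how the maps are filled, not the cosine itself:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: cosine similarity between two bag-of-words term-weight vectors.
// Weights are raw term frequencies here; a TF/IDF variant would only change
// how termFreqs() fills the maps, not the cosine computation.
public class CosineSimilarity {

    public static Map<String, Double> termFreqs(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tf.merge(t, 1.0, Double::sum);
        }
        return tf;
    }

    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // shared terms contribute
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(cosine(termFreqs("lucene is a search library"),
                                  termFreqs("lucene is a java search engine")));
    }
}
```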
RE: Probabilistic Model in Lucene - possible?
I think I am missing the original question, but by most accepted definitions, the TF/IDF model in Lucene is a probabilistic model. It has strange normalizations, though, that don't allow comparisons of rank values across queries.

It isn't terribly hard to make a normalized probabilistic model that allows comparing document scores across queries and assigns a meaning to the score. I've done it. However, that means abandoning IDF and keeping actual term frequencies for each document, plus document size. Once you normalize this way, you can intermingle document scores from different queries and different corpora and make statements about the absolute value of the score. It also leads directly into the discussion we had earlier about inter-term correlations and how to handle them properly, since the full inter-term probabilistic model has the traditional TF/IDF model as a special case. Interjecting Boolean conditions and boosts makes the model much more complicated.

Herb

-----Original Message-----
From: Karsten Konrad [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 03, 2003 4:51 PM
To: Lucene Users List
Subject: AW: Probabilistic Model in Lucene - possible?

>> I would highly appreciate it if the experts here (especially Karsten or Chong) look at my idea and tell me if this would be possible.

Sorry, I have no idea about how to use a probabilistic approach with Lucene, but if anyone does so, I would like to know, too. I am currently puzzled by a related question: I would like to know if there are any approaches to get a confidence value for relevance rather than a ranking. I.e., it would be nice to have a ranking weight whose value has some kind of semantics, such that we could compare results from different queries. Can probabilistic approaches do anything like this?
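One standard textbook construction with the property Herb describes (absolute, cross-query-comparable scores built from raw term frequencies and document sizes, no IDF) is unigram query likelihood with linear smoothing. This sketch is illustrative, not Herb's actual model; the names and the lambda value are ours:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: unigram query-likelihood scoring with Jelinek-Mercer smoothing.
// Each score is P(query | document), a true probability in [0, 1], so scores
// from different queries (and different corpora) live on the same scale.
public class QueryLikelihood {

    static final double LAMBDA = 0.7;  // weight of the document model vs. the corpus model

    public static double score(String[] query,
                               Map<String, Integer> docTf, int docLen,
                               Map<String, Integer> corpusTf, long corpusLen) {
        double p = 1.0;
        for (String term : query) {
            double pDoc = docTf.getOrDefault(term, 0) / (double) docLen;
            double pCorpus = corpusTf.getOrDefault(term, 0) / (double) corpusLen;
            p *= LAMBDA * pDoc + (1 - LAMBDA) * pCorpus;  // smoothed term probability
        }
        return p;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = new HashMap<>();
        doc.put("lucene", 3); doc.put("index", 2);
        Map<String, Integer> corpus = new HashMap<>();
        corpus.put("lucene", 10); corpus.put("index", 50); corpus.put("web", 40);
        System.out.println(score(new String[]{"lucene"}, doc, 5, corpus, 100));
    }
}
```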
AW: Probabilistic Model in Lucene - possible?
Hi,

>> I would highly appreciate it if the experts here (especially Karsten or Chong) look at my idea and tell me if this would be possible.

Sorry, I have no idea about how to use a probabilistic approach with Lucene, but if anyone does so, I would like to know, too. I am currently puzzled by a related question: I would like to know if there are any approaches to get a confidence value for relevance rather than a ranking. I.e., it would be nice to have a ranking weight whose value has some kind of semantics, such that we could compare results from different queries. Can probabilistic approaches do anything like this?

Any help appreciated,
Karsten

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 3, 2003 15:13
To: [EMAIL PROTECTED]
Subject: Probabilistic Model in Lucene - possible?

Hello group,

from the very inspiring conversations with Karsten I know that Lucene is based on a vector space model. I am just wondering if it would be possible to turn this into a probabilistic model approach. Of course I do know that I cannot change the underlying indexing and searching principles. However, it would be possible to change the index term weight to either 1.0 (relevant) or 0.0 (non-relevant). For the similarity I would need to implement another similarity algorithm.

I would highly appreciate it if the experts here (especially Karsten or Chong) look at my idea and tell me if this would be possible. If yes, how much effort would need to go into that? I am sure there are many other issues which I have not considered...

Kind Regards,
Ralf

--
+++ GMX - the first address for Mail, Message, More +++
New: price reduction for MMS and FreeMMS! http://www.gmx.net
RE: What about Spindle
There is LARM, there is Nutch, there is Egothor (doesn't use Lucene), etc.

Otis

--- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> I think it is a common task to index a JSP-based web site. A lot of
> people ask how to do so on this mailing list. However, Lucene does not
> have a ready-to-use web crawler. My question is: has anybody used
> Spindle to index a JSP-based web site, or is there any other tool out
> there?
>
> Thanks,
> Oliver
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 03, 2003 11:25 AM
> To: Lucene Users List
> Subject: Re: What about Spindle
>
> You should ask the Spindle author(s). The error doesn't look like
> something that is related to Lucene, really.
>
> Otis
>
> --- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> > What about Spindle? Has anybody used it to crawl a JSP-based web
> > site? Do I need to install listlib.jar to do so?
> >
> > I got the error message "Jsp Translate: Unable to find setter method
> > for attribute: class" when I tried to run listlib-example.jsp in WSAD.
> >
> > Thanks,
> > Oliver

__
Do you Yahoo!?
Free Pop-Up Blocker - Get it now
http://companion.yahoo.com/
Re: Ways to search indexes
On Wed, Dec 03, 2003 at 02:49:12PM +, jt oob wrote:
> --- Dror Matalon <[EMAIL PROTECTED]> wrote:
> > On Tue, Dec 02, 2003 at 01:54:58PM +, jt oob wrote:
> > > Hi,
> > >
> > > I have just indexed a lot of news (nntp) postings.
> > > I now have an index for each topic (a topic can have many newsgroups).
> > >
> > > The index sizes are:
> > >
> > > 2.6G  Current Affairs
> > > 2.4G  Celebs
> > > 119M  Recreation
> > > 3.0M  Tech - Mac
> > > 2.4G  Tech - Windows
> > > 936M  Tech - Linux
> > > 702M  Tech - Other
> > >  96M  Tech - Consoles
> >
> > Around 15 gigs. How many days of news?
>
> Not sure how many days, but it's around 5 million postings.

So each posting is roughly 3K. More than I would have thought, but not too surprising. The main reason I asked how many days is to get a sense of growth. 15 gig is a big index, but to understand the performance repercussions, the rate of growth is equally important. I suspect that by the time you hit 100 gigs, you'll have one of the biggest indexes around, and you'll have to throw quite heavy hardware at it or distribute the load to get reasonable performance.

Dror
Re: What about Spindle
You can try Capek (needs JDK 1.4, because it uses NIO). It can crawl whatever you like.

API: http://www.egothor.org/api/robot/
Console demo (*.dundee.ac.uk): http://www.egothor.org/egothor/index.jsp?q=http%3A%2F%2Fwww.compbio.dundee.ac.uk%2F

Leo

Zhou, Oliver wrote:
> I think it is a common task to index a JSP-based web site. A lot of
> people ask how to do so on this mailing list. However, Lucene does not
> have a ready-to-use web crawler. My question is: has anybody used
> Spindle to index a JSP-based web site, or is there any other tool out
> there?
>
> Thanks,
> Oliver
RE: What about Spindle
I think it is a common task to index a JSP-based web site. A lot of people ask how to do so on this mailing list. However, Lucene does not have a ready-to-use web crawler. My question is: has anybody used Spindle to index a JSP-based web site, or is there any other tool out there?

Thanks,
Oliver

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 03, 2003 11:25 AM
To: Lucene Users List
Subject: Re: What about Spindle

You should ask the Spindle author(s). The error doesn't look like something that is related to Lucene, really.

Otis

--- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> What about Spindle? Has anybody used it to crawl a JSP-based web
> site? Do I need to install listlib.jar to do so?
>
> I got the error message "Jsp Translate: Unable to find setter method
> for attribute: class" when I tried to run listlib-example.jsp in WSAD.
>
> Thanks,
> Oliver
Re: What about Spindle
You should ask the Spindle author(s). The error doesn't look like something that is related to Lucene, really.

Otis

--- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
> What about Spindle? Has anybody used it to crawl a JSP-based web
> site? Do I need to install listlib.jar to do so?
>
> I got the error message "Jsp Translate: Unable to find setter method
> for attribute: class" when I tried to run listlib-example.jsp in WSAD.
>
> Thanks,
> Oliver
What about Spindle
What about Spindle? Has anybody used it to crawl a JSP-based web site? Do I need to install listlib.jar to do so?

I got the error message "Jsp Translate: Unable to find setter method for attribute: class" when I tried to run listlib-example.jsp in WSAD.

Thanks,
Oliver
Re: Hits - how many documents?
That was actually the answer. Originally I thought Hits provided a reference to all documents. However, it seems logical that documents with a score of 0.0 should not be contained.

Thank you,
Ralf

> I'm a bit confused by what you're asking. Hits points to all documents
> that matched the query. A score > 0.0 is needed.
>
> Erik
Re: Hits - how many documents?
On Wednesday, December 3, 2003, at 10:16 AM, Ralph wrote:

> Does this mean Hits points to ALL documents and the last one might have
> a score of 0.0? If it does not contain all documents, where is the
> threshold then? Or based on which condition does it stop pointing to
> certain documents?

I'm a bit confused by what you're asking. Hits points to all documents that matched the query. A score > 0.0 is needed.

Erik
Re: Hits - how many documents?
Does this mean Hits points to ALL documents and the last one might have a score of 0.0? If it does not contain all documents, where is the threshold then? Or based on which condition does it stop pointing to certain documents?

Ralf

> On Wednesday, December 3, 2003, at 09:36 AM, Ralph wrote:
> > Is there a maximum of documents Hits provide, or is it unlimited
> > (meaning limited to the heap size of the VM)? If there is a maximum,
> > what is the number?
>
> Hits represents all documents that matched the query (and optionally
> filtered).
>
> But, Hits does not *contain* the documents - it points to them so that
> its memory footprint is quite small. (There is some slight caching of
> up to 200 documents.)
>
> Erik
Re: Ways to search indexes
--- Dror Matalon <[EMAIL PROTECTED]> wrote:
> On Tue, Dec 02, 2003 at 01:54:58PM +, jt oob wrote:
> > Hi,
> >
> > I have just indexed a lot of news (nntp) postings.
> > I now have an index for each topic (a topic can have many newsgroups)
> >
> > The index sizes are:
> >
> > 2.6G  Current Affairs
> > 2.4G  Celebs
> > 119M  Recreation
> > 3.0M  Tech - Mac
> > 2.4G  Tech - Windows
> > 936M  Tech - Linux
> > 702M  Tech - Other
> >  96M  Tech - Consoles
>
> Around 15 gigs. How many days of news?

Not sure how many days, but it's around 5 million postings.

Download Yahoo! Messenger now for a chance to win Live At Knebworth DVDs
http://www.yahoo.co.uk/robbiewilliams
Re: Hits - how many documents?
On Wednesday, December 3, 2003, at 09:36 AM, Ralph wrote:

> Is there a maximum of documents Hits provide, or is it unlimited
> (meaning limited to the heap size of the VM)? If there is a maximum,
> what is the number?

Hits represents all documents that matched the query (and optionally filtered).

But, Hits does not *contain* the documents - it points to them, so that its memory footprint is quite small. (There is some slight caching of up to 200 documents.)

Erik
Hits - how many documents?
Hi,

Is there a maximum number of documents Hits provides, or is it unlimited (meaning limited to the heap size of the VM)? If there is a maximum, what is the number?

Ralf
Probabilistic Model in Lucene - possible?
Hello group,

from the very inspiring conversations with Karsten I know that Lucene is based on a vector space model. I am just wondering if it would be possible to turn this into a probabilistic model approach. Of course I do know that I cannot change the underlying indexing and searching principles. However, it would be possible to change the index term weight to either 1.0 (relevant) or 0.0 (non-relevant). For the similarity I would need to implement another similarity algorithm.

I would highly appreciate it if the experts here (especially Karsten or Chong) look at my idea and tell me if this would be possible. If yes, how much effort would need to go into that? I am sure there are many other issues which I have not considered...

Kind Regards,
Ralf
RE: Query reformulation (Relevance Feedback) in Lucene?
There is no direct support in Lucene for this. There are several strategies for automatic query expansion, and most of them rely on either extensive domain-specific analysis of the top N documents (on the assumption that the search engine performs well enough to guarantee that the top N documents are all relevant), or on a special domain-specific corpus of "good documents", where the initial search runs against these hand-picked documents and their terms are mined to augment the initial query before resubmitting it to the original corpus. All of these are things you have to do yourself. Term reweighting happens by using term boosts; how much to boost by is an open question.

Herb...

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 03, 2003 6:55 AM
To: [EMAIL PROTECTED]
Subject: Query reformulation (Relevance Feedback) in Lucene?

Hello group of Lucene users,

query reformulation is understood as an effective way to improve retrieval power significantly. The theory teaches us that it consists of two basic steps:

a) Query expansion (with new terms)
b) Reweighting of the terms in the expanded query

User relevance feedback is the most popular strategy to perform query reformulation, because it is user-centered. Does Lucene generally support this approach? Especially I am wondering if...

1) there are classes which directly support query expansion, OR
2) I would need to do some programming on top of more generic parts?

I do not know about 1). All I know about 2) is what I think could work, with no evidence that it actually does :-) I think query expansion with new terms is easy and would just need a new QueryParser object with the existing terms plus the top n (most frequent) terms of the (from the user's point of view) relevant documents. Then I would have an expanded query (a). However, I do not know how I can reweight these terms? When I formulate the query, I do not actually know their weights, since this is done internally. Does anybody have any idea? Did anybody try to solve this, and does he/she have some examples to share?

Cheers,
Ralf
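The classical recipe that covers both steps a) and b) at once is Rocchio's relevance-feedback formula: each term's new weight is alpha times its original weight plus beta times its average weight in the user-marked relevant documents. A plain-Java sketch (constants and names are illustrative, not a Lucene API); the resulting weights would be applied as per-term boosts when rebuilding the query:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Rocchio relevance feedback: expand and reweight a query from
// user-marked relevant documents. Weights here are raw term frequencies;
// the output weights would become per-term boosts in the rebuilt query.
// ALPHA and BETA are conventional illustrative values.
public class Rocchio {

    static final double ALPHA = 1.0;   // trust in the original query
    static final double BETA = 0.75;   // trust in the relevant documents

    public static Map<String, Double> expand(Map<String, Double> query,
                                             List<Map<String, Double>> relevantDocs) {
        Map<String, Double> out = new HashMap<>();
        // keep the original terms, scaled by ALPHA
        query.forEach((t, w) -> out.merge(t, ALPHA * w, Double::sum));
        // add each relevant document's terms, averaged and scaled by BETA
        for (Map<String, Double> doc : relevantDocs) {
            doc.forEach((t, w) -> out.merge(t, BETA * w / relevantDocs.size(), Double::sum));
        }
        return out;  // term -> new weight (boost) for the reformulated query
    }

    public static void main(String[] args) {
        Map<String, Double> q = new HashMap<>();
        q.put("lucene", 1.0);
        Map<String, Double> d = new HashMap<>();
        d.put("lucene", 2.0); d.put("indexing", 1.0);
        System.out.println(expand(q, List.of(d)));
    }
}
```

In Lucene terms, one would then build a BooleanQuery of TermQuery clauses, calling setBoost with these weights (keeping only the top n feedback terms, as suggested above).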
Query reformulation (Relevance Feedback) in Lucene?
Hello group of Lucene users,

query reformulation is understood as an effective way to improve retrieval power significantly. The theory teaches us that it consists of two basic steps:

a) Query expansion (with new terms)
b) Reweighting of the terms in the expanded query

User relevance feedback is the most popular strategy to perform query reformulation, because it is user-centered. Does Lucene generally support this approach? Especially I am wondering if...

1) there are classes which directly support query expansion, OR
2) I would need to do some programming on top of more generic parts?

I do not know about 1). All I know about 2) is what I think could work, with no evidence that it actually does :-) I think query expansion with new terms is easy and would just need a new QueryParser object with the existing terms plus the top n (most frequent) terms of the (from the user's point of view) relevant documents. Then I would have an expanded query (a). However, I do not know how I can reweight these terms? When I formulate the query, I do not actually know their weights, since this is done internally. Does anybody have any idea? Did anybody try to solve this, and does he/she have some examples to share?

Cheers,
Ralf
Re: Translation.
Uh, I get to do this dirty job. :(

Lucene-user and lucene-dev are not the appropriate fora for questions such as this one. Please ask the original author of the text for help, or use an online translation service, such as the one at http://babelfish.av.com

Also, for questions about Lucene usage, problems, help, etc., please email _only_ the lucene-user mailing list. The lucene-dev mailing list is used by developers of Lucene, not by developers who are using Lucene.

Thanks,
Otis

--- Tun Lin <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Can anyone translate this text for me? I cannot understand the
> instructions. Please help!
>
> Thanks.
>
> ===
>
> | LUCY 1.1 | readme.txt    Last updated: 18/03/2003
>
> STRUCTURE
>
> Lucy 1.1 -> Lucene 1.2
>          -> HTMLParser 1.2
>          -> PdfBox 0.5.6
>          -> wvWare 0.7.2-3
>          -> xlhtml 0.4.9
>          -> antiword 0.33
>          -> Xpdf 2.01
>          -> Snowball 0.1
>          -> NGramJ 01.12.11
>          -> it.corila.lucy -> IndexAll.java
>                            -> SearchIndex.java
>                            -> HTMLDocument.java
>                            -> PDFDocument.java
>                            -> ExternalParser.java
>                            -> ItalianStemFilter.java
>                            -> EnglishStemFilter.java
>                            -> ApostropheFilter.java
>                            -> IndexAnalyzer.java
>                            -> SearchAnalyzer.java
>                            -> LanguageCategorizer
>                            -> NgramjCategorizer.java
>
> DESCRIPTION
>
> Lucy can index all files with the extensions txt, html, pdf, doc, ppt,
> and xls contained in a base folder and its subfolders. It supports
> searches from the DOS command line or through a web interface. It
> handles Italian and English texts with language-specific lexical
> processing.
>
> SUPPORTED OPERATING SYSTEMS
>
> Windows 98 / Windows 2000 / Windows XP
>
> SYSTEM REQUIREMENTS
>
> None, except the permissions needed to write files to a system folder.
> To use the search module with the web interface you need Apache
> Tomcat, version 3 or 4.
>
> INSTALLATION
>
> Run the automatic installer Lucy1.1.exe, or unpack the file
> Lucy1.1.zip into a folder (NB: the path must not contain spaces).
> By default the application uses its own Java virtual machine. You can
> use another one already installed on the system by changing the value
> of the MYJAVAPATH variable in the file jvm.bat. In that case the jre
> folder can be deleted to reduce disk usage by about 40 MBytes.
>
> CONFIGURATION
>
> Change the values of the variables in the file properties.txt, in the
> application's base folder:
>
> lucy.path: folder where the application is installed
> log.files.dir: folder where the log files will be created
> del.temp.files: delete temporary files at the end of indexing (yes/no)
> doc.parser: parser to use for .doc files (antiword/wvware)
> pdf.parser: parser to use for .pdf files (xpdf/pdfbox)
> index.dir: folder where the indexes will be stored
> index.name: name of the index to be created
> indexing.folder: folder to be indexed
>
> IMPORTANT: all paths must be written using two backslashes (\\) as
> directory separators instead of a single backslash.
>
> USAGE
>
> The three batch files in the application's base folder can be launched
> directly from Windows with a double click.
>
> indicizza.bat    creates an index
> aggiorna.bat     updates an index
> cerca.bat        searches an index
>
> All the required parameters (name and location of the index, path of
> the folder to be indexed) must be specified beforehand in the file
> properties.txt.
>
> Alternatively, the procedures can be run from the DOS command line,
> again after first editing properties.txt. In this case, using the
> syntax:
>
> cerca index-path
>
> you can also search other previously created indexes without changing
> properties.txt.
>
> NOTES ON USING THE PARSERS
>
> The default values set for the parsers are those recommended for the
> first execution