Getting word frequency?
Hello all, I would like to get a word frequency list from a text. How can I achieve this in the most direct way using Lucene classes? Example: I have a very long text. I parse this text with a WhitespaceAnalyzer. From this text I generate an index. From this index I get each word together with its absolute frequency / relative frequency. Can I do it without generating an index? Cheers, Ralf -- +++ GMX - die erste Adresse für Mail, Message, More +++ Neu: Preissenkung für MMS und FreeMMS! http://www.gmx.net - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Getting word frequency?
Hello Erik, I know that. However, I still wonder if this is already solved somewhere in Lucene. I would prefer using Lucene methods instead of a workaround. On the other hand, generating an index only to get hold of the words and their frequencies would make it too complicated. I basically want to transfer a String (or InputStream) into a word frequency list... Thanks for the help so far! On Jan 13, 2004, at 7:26 AM, [EMAIL PROTECTED] wrote: Example: I have a very long text. I parse this text with a WhitespaceAnalyzer. From this text I generate an index. From this index I get each word together with its absolute frequency / relative frequency. Can I do it without generating an index? There may be other ways to do it, but a poor man's solution would be to take the output (a TokenStream) of an analyzer directly, iterate over it, and insert each token into a Map. If it is already in the Map, add one to the counter; if not, insert it with a counter of one. Erik
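[Editor's note] Erik's "poor man's solution" above can be sketched in plain Java. The class name `WordFreq` is made up for illustration, and the whitespace split stands in for the Lucene TokenStream iteration he describes; with a real analyzer you would feed its tokens into the same Map.

```java
import java.util.HashMap;
import java.util.Map;

public class WordFreq {

    // Count absolute frequencies: one Map entry per distinct token,
    // incremented on each occurrence (Erik's suggestion).
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            Integer n = freq.get(token);
            freq.put(token, n == null ? 1 : n + 1);
        }
        return freq;
    }

    // Relative frequency = absolute count / total number of tokens.
    public static double relativeFrequency(Map<String, Integer> freq, String word) {
        int total = 0;
        for (int n : freq.values()) total += n;
        Integer count = freq.get(word);
        return count == null ? 0.0 : (double) count / total;
    }
}
```

This answers Ralf's follow-up directly: no index is needed, only the analyzer (or here, a split) and a Map.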
Lucene based projects...?
Hello group, who knows of other software projects (like Nutch) which are based on and built around Lucene? I think it can be quite interesting and helpful for new people to see and learn from examples... Cheers, Ralf
HTML tag filter...
Hi group, would it be possible to implement an Analyzer that filters HTML code out of an HTML page? As a result I would have only the text, free of any tagging. Is it maybe better to use other existing open source software for that? Has somebody tried that here? Cheers, Ralf
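[Editor's note] A minimal regex-based sketch of the tag stripping Ralf asks about is below. The class name `HtmlStripper` is invented for illustration, and this naive approach only suits reasonably well-formed pages; for production, a real HTML parser (such as the HTMLParser shipped with Lucene's demo, or a library like JTidy) is the safer route.

```java
public class HtmlStripper {

    // Naive tag stripper: drops <script>/<style> blocks wholesale,
    // removes remaining <...> markup, and collapses whitespace.
    public static String stripTags(String html) {
        // Remove script and style elements including their content.
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Remove all remaining tags.
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        // Collapse runs of whitespace left behind by removed markup.
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```

The stripped text could then be handed to any normal Analyzer for indexing.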
Retrieving the content from hits...
Hi Group, I have a little problem which should be easily solved by the expertise within this group. An index has been generated. The document used looks like this: Document doc = new Document(); doc.add(Field.Text("contents", new FileReader(file))); doc.add(Field.Keyword("filename", file.getCanonicalPath())); When I now search, I get a correct hit. However, it seems the contents field does not exist. When I enumerate the fields, only filename exists... Here is some code showing how I parse the hits object: Document d = hits.doc(0); Enumeration enum = d.fields(); while (enum.hasMoreElements()) { Field f = (Field) enum.nextElement(); System.out.println("Field value = " + f.stringValue()); } Where is the problem? Ralf
Re: Retrieving the content from hits...
Hi, thank you for this advice. I guess the usual way of searching and retrieving the document is to search like I did (with the reduced info in the index (only cleaned text)) and later load the file using the filename information. I just realised that no example for this simple task is actually available. Cheers, Ralf Actually, creating a Field with a Reader means the field data is unstored. It is indexed, but the original text is not retrievable as it is not in the index (yes, it is tokenized, but not kept as a unit, and is very unlikely to be the same as the original text). If you need the text to be stored in the index, read the text into a String and use that Field.Text variant rather than a Reader. Erik On Jan 5, 2004, at 11:35 AM, Grant Ingersoll wrote: I believe since you created the field using a Reader, you have to use the Field.readerValue() method instead of the stringValue() method and then handle the reader appropriately. I don't know if there is any way to determine which one is used for a given field other than to test for null on readerValue(). -Grant [EMAIL PROTECTED] 01/05/04 11:27AM Hi Group, I have a little problem which should be easily solved by the expertise within this group. An index has been generated. The document used looks like this: Document doc = new Document(); doc.add(Field.Text("contents", new FileReader(file))); doc.add(Field.Keyword("filename", file.getCanonicalPath())); When I now search, I get a correct hit. However, it seems the contents field does not exist. When I enumerate the fields, only filename exists... Here is some code showing how I parse the hits object: Document d = hits.doc(0); Enumeration enum = d.fields(); while (enum.hasMoreElements()) { Field f = (Field) enum.nextElement(); System.out.println("Field value = " + f.stringValue()); } Where is the problem? Ralf
Re: Summarization; sentence-level and document-level filters.
Hello Gregor and Maurits, I am not quite sure what you want to do. I think you want to search the normal text and present the summarized text on the screen, where the user is able to get the full text on request. Is this the case? If so, then you could create a set of summarized texts from the full texts, create another index for them, and have an extra field in the text which is not summarized. You could use this field to find the summarized version of a full text and retrieve the full text from the summarized text in order to present it to the user. In this case you would put your summarizer before the analyzer (in terms of workflow), which would fit perfectly into the existing concept of Lucene. I am not sure if I caught your idea. Please educate me further if I misunderstood something... Cheers, Ralf Hi Gregor, So far as I know there is no summarizer in the plans. But maybe I can help you along the way. Have a look at the Classifier4J project on SourceForge. http://classifier4j.sourceforge.net/ It has a small document summarizer besides a Bayes classifier. It might speed up your coding. On the level of Lucene, I have no idea. My gut feeling says that a summary should be built before the text is tokenized! The tokenizer can of course be used when analysing a document, but hooking into the Lucene indexing is a bad idea, I think. Does someone else have any ideas? regards, Maurits - Original Message - From: Gregor Heinrich [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Monday, December 15, 2003 7:41 PM Subject: Summarization; sentence-level and document-level filters. Hi, is there any possibility to do sentence-level or document-level analysis with the current Analysis/TokenStream architecture? Or where else is the best place to plug in customised document-level and sentence-level analysis features? Is there any precedent?
My technical problem: I'd like to include a summarization feature in my system, which should (1) make the best use of the architecture already there in Lucene, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information, such as full stops and commas. To preserve this punctuation, a special Tokenizer can be used that outputs such landmarks as tokens instead of filtering them out. The actual SummaryFilter then filters out the punctuation for its successors in the Analyzer's filter chain. The other, more complex thing is the document-level information: as Lucene's architecture uses a filter concept that does not know about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization is a bit awkward with this (and originally not intended, I guess). On the other hand, I'd like to have the existing filter structure in place for preprocessing of the input, because my raw texts are generated by converters from other formats that output unwanted chars (from figures, page numbers, etc.), which are filtered out anyway by my custom Analyzer. Any idea how to solve this second problem? Is there any support for such document / sentence structure analysis planned? Thanks and regards, Gregor
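[Editor's note] The extraction-style summarization Maurits points to (Classifier4J's approach) can be sketched in a few lines of plain Java, run before tokenization as he suggests: score each sentence by the summed corpus frequency of its words and keep the best-scoring one. The class name `SimpleSummarizer` and the regex-based sentence split are assumptions for illustration, not the Classifier4J API.

```java
import java.util.HashMap;
import java.util.Map;

public class SimpleSummarizer {

    // Score each sentence by the summed frequency of its words across
    // the whole text and return the highest-scoring one -- the
    // "most representative sentence" heuristic.
    public static String bestSentence(String text) {
        // Split on whitespace that follows sentence-ending punctuation.
        String[] sentences = text.split("(?<=[.!?])\\s+");
        // Build word frequencies over the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) freq.merge(token, 1, Integer::sum);
        }
        String best = "";
        int bestScore = -1;
        for (String s : sentences) {
            int score = 0;
            for (String token : s.toLowerCase().split("[^a-z]+")) {
                score += freq.getOrDefault(token, 0);
            }
            if (score > bestScore) { bestScore = score; best = s; }
        }
        return best;
    }
}
```

Running this before the Analyzer keeps the summarizer outside the token-filter chain, which sidesteps Gregor's document-level problem entirely.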
Re: Query expansion
Hi, expanding a query is basically done by generating a new one, reusing the existing terms plus the selected ones from your ontology/taxonomy. There has been discussion here before and you should search the archive for that. Extracting and using the right bits from your ontology is basically a task for your program logic and highly depends on your reasoning and choice. Cheers, Ralf Hi Everybody, I wish to use a hierarchy of concepts provided by an ontology to refine or expand my query answer with Lucene. May I know if someone has tried it yet? Thanks, Gayo
Query reformulation (Relevance Feedback) in Lucene?
Hello group of Lucene users, query reformulation is understood as an effective way to improve retrieval power significantly. The theory teaches us that it consists of two basic steps: a) Query expansion (with new terms) b) Reweighting of the terms in the expanded query User relevance feedback is the most popular strategy to perform query reformulation because it is user centered. Does Lucene generally support this approach? Specifically, I am wondering if... 1) there are classes which directly support query expansion OR 2) I would need to do some programming on top of more generic parts? I do not know about 1). All I know about 2) is what I think could work, with no evidence that it actually does :-) I think query expansion with new terms is easy and would just need to create a new QueryParser object with the existing terms plus the top n (most frequent) terms of the (from the user's point of view) relevant documents. Then I would have an expanded query (a). However, I do not know how I can reweight these terms. When I formulate the query I do not actually know about their weights, since it is done internally. Does anybody have any idea? Did anybody try to solve this and has some examples which he/she would like to provide? Cheers, Ralf
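[Editor's note] The reweighting step Ralf asks about is classically done with Rocchio's formula: the new term weights are a blend of the original query weights and the centroid of the user-marked relevant documents. A plain-Java sketch follows (the class name `RocchioExpansion` and the map-based document representation are assumptions for illustration; in Lucene the resulting weights could be applied via the `term^boost` query syntax).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RocchioExpansion {

    // Rocchio-style expansion: new weight = alpha * original weight
    // + beta * average term frequency over the relevant documents.
    // Documents are represented as simple term-frequency maps.
    public static Map<String, Double> expand(Map<String, Double> query,
                                             List<Map<String, Integer>> relevantDocs,
                                             double alpha, double beta) {
        Map<String, Double> expanded = new HashMap<>();
        // Carry over (and scale) the original query terms.
        for (Map.Entry<String, Double> e : query.entrySet()) {
            expanded.put(e.getKey(), alpha * e.getValue());
        }
        // Add each relevant document's contribution to the centroid.
        for (Map<String, Integer> doc : relevantDocs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                double contribution = beta * e.getValue() / relevantDocs.size();
                expanded.merge(e.getKey(), contribution, Double::sum);
            }
        }
        return expanded;
    }
}
```

Terms that appear only in the relevant documents enter the expanded query with a small weight, which covers both steps a) and b) from the post.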
Probabilistic Model in Lucene - possible?
Hello group, from the very inspiring conversations with Karsten I know that Lucene is based on a Vector Space Model. I am just wondering if it would be possible to turn this into a probabilistic model approach. Of course I do know that I cannot change the underlying indexing and searching principles. However, it would be possible to change the index term weight to either 1.0 (relevant) or 0.0 (non-relevant). For the similarity I would need to implement another similarity algorithm. I would highly appreciate it if the experts here (especially Karsten or Chong) would look at my idea and tell me if this is possible. If yes, how much effort would need to go into that? I am sure there are many other issues which I have not considered... Kind Regards, Ralf
Re: Hits - how many documents?
That was actually the answer. Originally I thought Hits provides a reference to all documents. However, it seems logical that documents with a score of 0.0 should not be contained. Thank you, Ralf I'm a bit confused by what you're asking. Hits points to all documents that matched the query. A score > 0.0 is needed. Erik
Re: AW: AW: Real Boolean Model in Lucene?
Hello Karsten, that is fine for me. An implementation cannot be matched 100% to some theory, as the ISO OSI model has perfectly shown. :-) That's OK for me, and I want to thank you again for the clarification I gained from this conversation. Cheers Hello Ralf, According to your description, Lucene basically maps the boolean query into the vector space and measures the cosine similarity towards other documents in the vector space. If I understood you correctly you mean that if a document is found by Lucene based on a boolean query, it is relevant (boolean true). If it is not returned, it was boolean false. The score sits on top of that and can be used for ranking. If I would like to use a true boolean model I would therefore just need to ignore the score of the Hits documents. Did I understand correctly? Yes, I think that this is indeed pretty close to some theoretical foundation: the Boolean Model explains which documents fit a query, while some appropriate (Lucene is good!) similarity function in vector space yields the ranking. Now hell would be the place for me where I would have to prove that Lucene's ranking is exactly equivalent to some transformation of vector space and then using the *cosine* for the ranking. Can't be, really, as Lucene sometimes returns results > 1.0 and only some ruthless normalisation keeps it within 0.0 to 1.0. In other words, there still are some rough corners in Lucene where a good theorist could find some work. Could we leave this topic aside until some suicid.. err, I mean enthusiastic fellow tries to work out a really good theory? Regards, Karsten -Original Message- From: Ralf B [mailto:[EMAIL PROTECTED] Sent: Monday, 1 December 2003 14:28 To: Lucene Users List Subject: Re: AW: Real Boolean Model in Lucene? Hi Karsten, I want to thank you for your qualified answer as well as your answer from the 14th of November, where you agreed with me that Lucene is basically a VSM implementation.
Sometimes it is difficult to make the link between the clean theory and its implementation. According to your description, Lucene basically maps the boolean query into the vector space and measures the cosine similarity towards other documents in the vector space. If I understood you correctly you mean that if a document is found by Lucene based on a boolean query, it is relevant (boolean true). If it is not returned, it was boolean false. The score sits on top of that and can be used for ranking. If I would like to use a true boolean model I would therefore just need to ignore the score of the Hits documents. Did I understand correctly? I agree that nobody really wants to do that. My question intended to find out more about the theory implemented within Lucene. Cheers, Ralph Hi, My Question: Does Lucene use TF/IDF for getting this? (which would mean it does not use the boolean model for the boolean query...) Lucene indeed uses TF/IDF with length normalization for fields and documents. However, Lucene is downward compatible with the Boolean Model, where documents are represented as 0/1-vectors in vector space. Ranking just adds weights to the elements of the result set, so the underlying interpretation of a query result can still be that of a propositional/Boolean model. If a document appears in the result, its tokens evaluate the query (which actually is a propositional formula formed over words and phrases) to true. The representation of documents is more complex in Lucene than required for the Boolean Model, and as a result, Lucene can efficiently handle phrases and proximity searches, but these seem to be compatible extensions - if you can do it in the Boolean Model, you can do it in Lucene :) One place where Lucene is not 100% compatible with a basic Boolean Model is that full negation is a bit tricky - you cannot simply ask for all documents that do not contain a certain term unless you also have some term that appears in all documents. Not a great deal, really.
If TF/IDF weighting is a problem for you, the Similarity interface implementation allows you to remove all references to length normalization and document frequencies. With kind regards from Saarbrücken -- Dr.-Ing. Karsten Konrad Head of Artificial Intelligence Lab XtraMind Technologies GmbH Stuhlsatzenhausweg 3 D-66123 Saarbrücken Phone: +49 (681) 3025113 Fax: +49 (681) 3025109 [EMAIL PROTECTED] www.xtramind.com -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Monday, 1 December 2003 13:11 To: [EMAIL PROTECTED] Subject: Real Boolean Model in Lucene? Hi, is it possible to use a real boolean model in Lucene for searching, when one is using the QueryParser with a boolean query
Example VSM
Hi, regarding the discussion about the Vector Space Model (VSM), can somebody provide an example of how to use Lucene's (hidden) VSM? Maybe somebody has already created an example or knows a good tutorial that refers to this. The tutorials I know do not cover that... Kind Regards Ralph -- HoHoHo! Seid Ihr auch alle schön brav gewesen? GMX Weihnachts-Special: Die 1. Adresse für Weihnachts- männer und -frauen! http://www.gmx.net/de/cgi/specialmail +++ GMX - die erste Adresse für Mail, Message, More! +++
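[Editor's note] As a companion to the VSM discussion above, here is the textbook cosine similarity at the heart of the model, in plain Java over term-frequency maps. This is an illustration of the concept, not Lucene's internal scoring (which, as Karsten notes elsewhere in the thread, is only approximately cosine-like); the class name `CosineSimilarity` is invented.

```java
import java.util.Map;

public class CosineSimilarity {

    // Cosine of the angle between two term-frequency vectors:
    // dot(a, b) / (|a| * |b|). 1.0 means identical direction,
    // 0.0 means no terms in common.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A query is scored against a document by treating both as sparse vectors over the vocabulary and taking this cosine.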
Collaborative Filtering API
Hello all, I am asking this group because I think people here might know about this, since it is a similar approach. Is there a Java based API which assists developers of collaborative filtering in their programs? By this I mean software which uses user ratings between items and provides ways (algorithms, methods) to find users with similar interests for prediction generation. Finding an API like Lucene would be a dream for me, but any pointer to other APIs (also in other programming languages) to see and learn from would be appreciated. Kind Regards -- GMX Weihnachts-Special: Seychellen-Traumreise zu gewinnen! Rentier entlaufen. Finden Sie Rudolph! Als Belohnung winken tolle Preise. http://www.gmx.net/de/cgi/specialmail/ +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: Collaborative Filtering API
Hello Mike, I had a quick look over the javadoc and it looks promising, as you said. Did Jon Herlocker work on GroupLens? I know GroupLens was quite a pioneering work in the early days of collaborative systems... Cheers Ralph You should check out the work of Jon Herlocker at Oregon State (http://eecs.oregonstate.edu/iis/). They have written a CF engine that has been on my to-do list to check out for a few months (sounds good on paper). If you get the chance to play with it, I'd be curious to hear your feedback. Having a CF engine in the open source domain would be a great thing. -Mike At 10:49 AM 11/25/2003, you wrote: Hello all, I am asking this group because I think people here might know about this, since it is a similar approach. Is there a Java based API which assists developers of collaborative filtering in their programs? By this I mean software which uses user ratings between items and provides ways (algorithms, methods) to find users with similar interests for prediction generation. Finding an API like Lucene would be a dream for me, but any pointer to other APIs (also in other programming languages) to see and learn from would be appreciated. Kind Regards
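[Editor's note] The "find users with similar interests" step mentioned above is commonly implemented with the Pearson correlation over co-rated items, the neighbour-selection measure popularized by GroupLens. A plain-Java sketch follows; the class name `UserSimilarity` is invented, and this is a generic illustration, not the API of the Oregon State engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class UserSimilarity {

    // Pearson correlation between two users' ratings, computed only
    // over the items both users have rated. Returns a value in
    // [-1, 1]; near 1 means very similar tastes.
    public static double pearson(Map<String, Double> a, Map<String, Double> b) {
        List<String> common = new ArrayList<>();
        for (String item : a.keySet()) {
            if (b.containsKey(item)) common.add(item);
        }
        int n = common.size();
        if (n < 2) return 0.0; // not enough overlap to correlate
        double meanA = 0, meanB = 0;
        for (String item : common) { meanA += a.get(item); meanB += b.get(item); }
        meanA /= n;
        meanB /= n;
        double num = 0, denA = 0, denB = 0;
        for (String item : common) {
            double da = a.get(item) - meanA, db = b.get(item) - meanB;
            num += da * db;
            denA += da * da;
            denB += db * db;
        }
        if (denA == 0 || denB == 0) return 0.0;
        return num / Math.sqrt(denA * denB);
    }
}
```

Predictions for an unrated item are then typically a similarity-weighted average of the neighbours' ratings for that item.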
Overview to Lucene
Hello group, can somebody give me an overview of Lucene? What high-level components does it include? In particular I want to answer the following questions regarding available functionality: 1) Does Lucene provide a Vector Space IR model (with TF/IDF and cosine similarity)? 2) Does Lucene provide any collaborative filtering algorithms like correlation / user ranking etc.? 3) Does Lucene provide a probabilistic model? 4) Does Lucene provide anything for indexing XML documents and using the XML document structure for that? Or does it just work on flat text files? Does anybody know good articles which demonstrate parts of that or give a good start into Lucene? Thanks, Ralf -- NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
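[Editor's note] For question 1, the textbook TF/IDF weight referred to throughout these threads is sketched below. Note this is the classic formula from the IR literature, not Lucene's exact scoring (Lucene's Similarity additionally applies square-root term-frequency scaling and length norms); the class name `TfIdf` is invented.

```java
public class TfIdf {

    // Classic tf-idf weight: raw term frequency in the document times
    // the log-scaled inverse document frequency over the collection.
    // Terms appearing in every document get weight 0 (log 1 = 0).
    public static double weight(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
        if (termFreqInDoc == 0 || docsContainingTerm == 0) return 0.0;
        return termFreqInDoc * Math.log((double) totalDocs / docsContainingTerm);
    }
}
```

Combined with the cosine similarity discussed in the VSM threads above, this is the standard vector space ranking scheme.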
Vector Space Model in Lucene?
Hi, does Lucene implement a Vector Space Model? If yes, does anybody have an example of how to use it? Cheers, Ralf