Getting word frequency?

2004-01-13 Thread ambiesense
Hello all,

I would like to get a word frequency list from a text. How can I achieve
this in the most direct way using Lucene classes?

Example: I have a very long text. I parse this text with a
WhitespaceAnalyzer. From this text I generate an index. From this index I get each word
together with its absolute frequency / relative frequency.
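
For illustration, a minimal sketch of the index-based flow described above (hedged: this assumes a Lucene 1.x-era API and a hypothetical index directory named "index"; it enumerates every term in the index and sums its per-document frequencies):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Sketch only: absolute term frequencies read back from an existing index.
public class IndexTermFrequencies {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");   // hypothetical index path
        Map freqs = new HashMap();                         // term text -> absolute frequency
        TermEnum terms = reader.terms();                   // enumerate all terms in the index
        while (terms.next()) {
            Term term = terms.term();
            int total = 0;
            TermDocs docs = reader.termDocs(term);
            while (docs.next()) {
                total += docs.freq();                      // occurrences of the term in one document
            }
            docs.close();
            freqs.put(term.text(), new Integer(total));
        }
        terms.close();
        reader.close();
        System.out.println(freqs);
    }
}

Relative frequencies would then just divide each count by the total number of tokens.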

Can I do it without generating an index?

Cheers,
Ralf




Re: Getting word frequency?

2004-01-13 Thread ambiesense
Hello Erik,

I know that. However, I still wonder whether this is already solved somehow
in Lucene. I would prefer using Lucene methods instead of a workaround. On the
other hand, generating an index only to get hold of words and their frequencies
would make it too complicated. I basically want to transform a String (or
InputStream) into a word frequency list...

Thanks for the help so far!


 On Jan 13, 2004, at 7:26 AM, [EMAIL PROTECTED] wrote:
  Example: I have a very long text. I parse this text with a
  WhitespaceAnalyzer. From this text I generate an index. From this
  index I get each word together with its absolute frequency / relative frequency.
 
  Can I do it without generating an index?
 
 There may be other ways to do it, but a poor man's solution would be to take
 the output (a TokenStream) of an analyzer directly, iterate over it,
 and insert each token into a Map.  If it is already in the Map, add one to the
 counter; if not, insert it with a counter of one.
 
   Erik
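
A minimal sketch of Erik's suggestion (hedged: assumes the pre-2.9 TokenStream API where next() returns a Token or null; the class name and field name are illustrative):

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

// Sketch only: count word frequencies straight from an analyzer, no index involved.
public class WordFrequencies {
    public static Map count(String text) throws Exception {
        Analyzer analyzer = new WhitespaceAnalyzer();
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        Map freqs = new HashMap();                                   // token text -> count
        for (Token token = stream.next(); token != null; token = stream.next()) {
            String word = token.termText();
            Integer count = (Integer) freqs.get(word);
            freqs.put(word, new Integer(count == null ? 1 : count.intValue() + 1));
        }
        stream.close();
        return freqs;
    }
}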
 
 
 




Lucene based projects...?

2004-01-12 Thread ambiesense
Hello group,

Who knows of other software projects (like Nutch) which are based on and built
around Lucene? I think it can be quite interesting and helpful for new people
to see and learn from examples...

Cheers,
Ralf




HTML tag filter...

2004-01-10 Thread ambiesense
Hi group,

Would it be possible to implement an Analyzer that filters HTML code out of an
HTML page? As a result I would have only the text, free of any tagging.

Or is it maybe better to use other existing open source software for that? Has
somebody tried that here?
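
As a rough illustration of the stripping idea (hedged: a crude regex pass before analysis, not a robust parser; a real HTML parser, e.g. from the Lucene demo or a dedicated library, handles entities, scripts and malformed markup far better):

// Sketch only: remove markup so that only plain text reaches the Analyzer.
public class HtmlStripper {
    public static String strip(String html) {
        String text = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " "); // drop script/style blocks
        text = text.replaceAll("<[^>]+>", " ");                                    // drop remaining tags
        return text.replaceAll("\\s+", " ").trim();                                // collapse whitespace
    }
}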

Cheers,
Ralf




Retrieving the content from hits...

2004-01-05 Thread ambiesense
Hi Group,

I have a little problem which can easily be solved with the
expertise within this group.

An index has been generated. The document used looks like this:

Document doc = new Document();
doc.add(Field.Text("contents", new FileReader(file)));
doc.add(Field.Keyword("filename", file.getCanonicalPath()));


When I now search, I get a correct hit. However, it seems the "contents"
field does not exist. When I enumerate the fields, only "filename" exists...

Here is some code showing how I parse the Hits object:

Document d = hits.doc(0);
Enumeration enum = d.fields();
while (enum.hasMoreElements()){
  Field f = (Field)enum.nextElement();
  System.out.println("Field value = " + f.stringValue());
}

Where is the problem? 

Ralf





Re: Retrieving the content from hits...

2004-01-05 Thread ambiesense
Hi,

thank you for this advice. I guess the usual way of searching and retrieving
the document is to search like I did (with only the reduced info, i.e. the cleaned
text, in the index) and later load the file using the filename information. I
just realised that no example for this simple task is actually available.
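
Since no example seems to be around, here is a minimal sketch of that flow (hedged: Lucene 1.x-era API; the index path, query string and analyzer are placeholders, and the field names follow the indexing code quoted below):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch only: search the unstored "contents" field, then reload the original
// file through the stored "filename" keyword field.
public class SearchAndLoad {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");
        Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String path = doc.get("filename");                 // stored via Field.Keyword(...)
            BufferedReader in = new BufferedReader(new FileReader(path));
            StringBuffer fullText = new StringBuffer();
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                fullText.append(line).append('\n');
            }
            in.close();
            System.out.println(path + " (" + fullText.length() + " chars)");
        }
        searcher.close();
    }
}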

Cheers,
Ralf

 Actually, creating a Field with a Reader means the field data is 
 unstored.  It is indexed, but the original text is not retrievable as 
 
 it is not in the index (yes, it is tokenized, but not kept as a unit, 
 and is very unlikely to be the same as the original text)
 
 If you need the text to be stored in the index, read the text into a 
 String and use that Field.Text variant rather than a Reader.
 
   Erik
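
A minimal sketch of the stored alternative Erik mentions (hedged: readFileIntoString is a hypothetical helper; the point is that Field.Text(String, String) is stored while Field.Text(String, Reader) is not):

// Sketch only: the String variant keeps the original text in the index,
// so stringValue() returns it at search time.
Document doc = new Document();
String contents = readFileIntoString(file);                   // hypothetical helper
doc.add(Field.Text("contents", contents));                    // tokenized, indexed, stored
doc.add(Field.Keyword("filename", file.getCanonicalPath()));  // stored, untokenized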
 
 
 On Jan 5, 2004, at 11:35 AM, Grant Ingersoll wrote:
 
  I believe since you created the field using a Reader, you have to use
  the Field.readerValue() method instead of the stringValue() method and
  then handle the reader appropriately.  I don't know if there is any way
  to determine which one is used for a given field other than to test
  for null on the readerValue()?
 
  -Grant
 
  [EMAIL PROTECTED] 01/05/04 11:27AM 
  Hi Group,
 
  I have a little problem which is able of being solved easily from the
  expertise within this group.
 
  An index has been generated. The document used looks like this:
 
  Document doc = new Document();
  doc.add(Field.Text("contents", new FileReader(file)));
  doc.add(Field.Keyword("filename", file.getCanonicalPath()));
 
 
  When I now search, I get a correct hit. However it seems the 
 contents
  field does not exist. When I get the field, only filename exists...
 
  Here some code how I parse the hits object:
 
  Document d = hits.doc(0);
  Enumeration enum = d.fields();
  while (enum.hasMoreElements()){
Field f = (Field)enum.nextElement();
System.out.println("Field value = " + f.stringValue());
  }
 
  Where is the problem?
 
  Ralf
 
 
 




Re: Summarization; sentence-level and document-level filters.

2003-12-16 Thread ambiesense
Hello Gregor and Maurits,

I am not quite sure what you want to do. I think you want to search the
normal text and present the summarized text on the screen where the user is able
to get the full text on request. Is this the case?

If this is the case, then you could create a set of summarized texts from the
full texts, create another index for them, and have an extra field referring to the
text which is not summarized. You could use this field to find the summarized
version of a full text and to retrieve the full text from the summarized text in
order to present it to the user. In this case you would put your summarizer
before the analyzer (in terms of workflow), which would fit perfectly into the
existing concept of Lucene.
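
A minimal sketch of that extra-field idea (hedged: summarize() and the field names are hypothetical; Field.Keyword keeps the reference stored and untokenized):

// Sketch only: index the summary for searching, keep a stored pointer to the full text.
Document doc = new Document();
doc.add(Field.Text("summary", summarize(fullText)));          // searchable, summarized text
doc.add(Field.Keyword("fullTextPath", pathToFullText));       // reference used to load the full text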

I am not sure if I have caught your idea. Please educate me further if I
misunderstood something...

Cheers,
Ralf

 Hi Gregor,
 
 So far as I know there is no summarizer in the plans. But maybe I can help
 you along the way. Have a look
 at the Classifier4J project on SourceForge.
 
 http://classifier4j.sourceforge.net/
 
 It has a small document summarizer besides a Bayes classifier. It might
 speed up your coding.
 
 On the level of Lucene, I have no idea. My gut feeling says that a summary
 should be built before the text is tokenized! The tokenizer can of course
 be used when analysing a document, but hooking into the Lucene indexing
 is a bad idea, I think.
 
 Does someone else have any ideas?
 
 regards,
 
 Maurits
 
 
 
 
 - Original Message - 
 From: Gregor Heinrich [EMAIL PROTECTED]
 To: 'Lucene Users List' [EMAIL PROTECTED]
 Sent: Monday, December 15, 2003 7:41 PM
 Subject: Summarization; sentence-level and document-level filters.
 
 
  Hi,
 
   is there any possibility to do sentence-level or document-level analysis
   with the current Analysis/TokenStream architecture? Or where else is the
   best place to plug in customised document-level and sentence-level analysis
   features? Is there any precedent for this?
 
  My technical problem:
 
   I'd like to include a summarization feature into my system, which should (1)
   best make use of the architecture already there in Lucene, and (2) should be
   able to trigger summarization on a per-document basis while requiring
   sentence-level information, such as full-stops and commas. To preserve this
   punctuation, a special Tokenizer can be used that outputs such landmarks
   as tokens instead of filtering them out. The actual SummaryFilter then
   filters out the punctuation for its successors in the Analyzer's filter
   chain.
 
   The other, more complex thing is the document-level information: As Lucene's
   architecture uses a filter concept that does not know about the document the
   tokens are generated from (which is good abstraction), a document-specific
   operation like summarization is a bit of an awkward thing with this (and
   originally not intended, I guess). On the other hand, I'd like to have the
   existing filter structure in place for preprocessing of the input, because
   my raw texts are generated by converters from other formats that output
   unwanted chars (from figures, page numbers, etc.), which are filtered out
   anyway by my custom Analyzer.
 
  Any idea how to solve this second problem? Is there any support for such
  document / sentence structure analysis planned?
 
  Thanks and regards,
 
  Gregor
 
 
 
 
 
 
 




Re: Query expansion

2003-12-10 Thread ambiesense
Hi,

Expanding a query is basically done by generating a new one, reusing the
existing terms plus the ones selected from your ontology/taxonomy. There has
been discussion here before, and you should search the archive for that.
Extracting and using the right bits from your ontology is basically a task for your
program logic and depends highly on your reasoning and choices.
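
For illustration, a minimal sketch of that kind of expansion (hedged: it uses the static QueryParser.parse of the Lucene 1.x line; field name, analyzer and the term strings are placeholders, and picking the ontology terms is up to your own logic):

// Sketch only: "expansion" here is just building a longer query string and parsing it again.
String originalTerms = "car engine";
String ontologyTerms = "automobile motor";                    // selected from the ontology by your code
Query expanded = QueryParser.parse(originalTerms + " " + ontologyTerms,
                                   "contents", new StandardAnalyzer());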

Cheers,
Ralf 


 Hi Everybody,
 
 I wish to use a hierarchy of concepts provided by an ontology to refine
 or expand my query answers with Lucene.
 May I know if someone has tried it yet?
 
 Thanks,
 Gayo
 
 
 




Query reformulation (Relevance Feedback) in Lucene?

2003-12-03 Thread ambiesense
Hello Group of Lucene users,

Query reformulation is understood as an effective way to improve retrieval
power significantly. The theory teaches us that it consists of two basic steps:

a) Query expansion (with new terms)
b) Reweighting of the terms in the expanded query

User relevance feedback is the most popular strategy for performing query
reformulation because it is user-centered.

Does Lucene generally support this approach? In particular, I am wondering
whether ...

1) there are classes which directly support query expansion OR
2) I would need to do some programming on top of more generic parts? 

I do not know about 1). All I know about 2) is what I think could work, with
no evidence that it actually does :-) I think query expansion with new terms is
easy: I would just need to create a new query with the QueryParser from the existing
terms plus the top n (most frequent) terms of the documents that are relevant from
the user's point of view. Then I would have an expanded query (a). However, I do
not know how I can reweight these terms (b). When I formulate the query I do not
actually know their weights, since weighting is done internally. Does anybody have
an idea? Did anybody try to solve this and have some examples which he/she
would like to provide?
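
As a hedged sketch of how step (b) might look in code (assuming the old BooleanQuery.add(query, required, prohibited) signature and Query.setBoost(); the terms and boost values are purely illustrative):

// Sketch only: reweight expansion terms relative to the original ones via setBoost().
BooleanQuery expanded = new BooleanQuery();
TermQuery original = new TermQuery(new Term("contents", "retrieval"));
original.setBoost(1.0f);                                      // keep the original weight
TermQuery feedback = new TermQuery(new Term("contents", "feedback"));
feedback.setBoost(0.5f);                                      // down-weight an expansion term
expanded.add(original, false, false);                         // optional clause
expanded.add(feedback, false, false);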

Cheers,
Ralf




Probabilistic Model in Lucene - possible?

2003-12-03 Thread ambiesense
Hello group,

from the very inspiring conversations with Karsten I know that Lucene is
based on a Vector Space Model. I am just wondering if it would be possible to
turn this into a probabilistic model approach. Of course I do know that I
cannot change the underlying indexing and searching principles. However, it
should be possible to change the index term weights to either 1.0 (relevant) or
0.0 (non-relevant). For the similarity I would need to implement another
similarity algorithm.
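
A minimal sketch of that direction (hedged: it assumes a Lucene version where Similarity is pluggable, as the Similarity interface is mentioned elsewhere in this archive; whether flattening these factors really yields a probabilistic model is a separate question):

import org.apache.lucene.search.DefaultSimilarity;

// Sketch only: flatten tf, idf and the length norm so every matching term
// contributes the same weight, approximating binary (1.0 / 0.0) term weights.
public class BinarySimilarity extends DefaultSimilarity {
    public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
    public float idf(int docFreq, int numDocs) { return 1.0f; }
    public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }
    public float coord(int overlap, int maxOverlap) { return 1.0f; }
}

It would then (illustratively) be installed with writer.setSimilarity(new BinarySimilarity()) at indexing time and searcher.setSimilarity(new BinarySimilarity()) at search time.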

I would highly appreciate it if the experts here (especially Karsten or
Chong) look at my idea and tell me if this would be possible. If yes, how much
effort would need to go into that? I am sure there are many other issues which
I have not considered...

Kind Regards,
Ralf





Re: Hits - how many documents?

2003-12-03 Thread ambiesense
That was actually the answer. Originally I thought Hits provides a reference
to all documents. However, it seems logical that documents with a score of 0.0
should not be contained.

Thank you,
Ralf

 I'm a bit confused by what you're asking.  Hits points to all documents 
 that matched the query.  A score > 0.0 is needed.
 
   Erik
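
For illustration, iterating such a Hits object (hedged: searcher, query and the "filename" field follow the earlier examples in this archive):

// Sketch only: every entry in Hits is a matching document with a score > 0.0.
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.score(i) + "\t" + hits.doc(i).get("filename"));
}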
 
 
 




Re: AW: AW: Real Boolean Model in Lucene?

2003-12-01 Thread ambiesense
Hello Karsten,

That is fine for me. An implementation cannot be matched 100% to some theory,
as the ISO OSI model has perfectly shown. :-) That's OK for me, and I want to
thank you again for the clarification I gained from this conversation.

Cheers

 
 Hello Ralf,
 
 
 According to your description, Lucene basically maps the boolean query 
 into the vector space and measures the cosine similarity towards other 
 documents in the vector space. If I understood you correctly you mean if 
  a document is found by Lucene based on a boolean query it is relevant 
  (boolean true). If it is not returned, it was boolean false. The score 
  sits on top of that and can be used for ranking. If I would like to use 
  a true boolean model, I would therefore just need to ignore the scores of 
  the Hits documents. Did I understand correctly?
 
 
 Yes, I think that this is indeed pretty close to some theoretical 
 foundation: The Boolean Model 
 explains which documents fit to a query, while some appropriate (Lucene 
 is good!) similarity 
 function in vector space yields the ranking.
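
A minimal sketch of the "ignore the score" reading (hedged: it uses the old HitCollector callback API; searcher and query are assumed to exist):

import java.util.BitSet;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch only: collect the pure boolean result set of a query, keeping document
// numbers and discarding the scores.
public class BooleanResultSet {
    public static BitSet collect(IndexSearcher searcher, Query query) throws Exception {
        final BitSet matches = new BitSet();
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                matches.set(doc);                          // membership only, score ignored
            }
        });
        return matches;
    }
}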
 
  Now hell would be the place for me where I would have to prove that 
  Lucene's ranking is exactly equivalent to some transformation of vector space 
  and then using the *cosine* for the ranking. Can't be, really, as Lucene 
  sometimes returns results > 1.0 and only some ruthless normalisation keeps 
  them within 0.0 to 1.0. In other words, there still are some rough corners 
  in Lucene where a good theorist could find some work.
 
 Could  we leave this topic aside until some suicid.. err, I mean 
 enthusiastic fellow
 tries to work out a really good theory?
 
 Regards,
 
 Karsten
 
 
 
 
 
  -Original Message-
  From: Ralf B [mailto:[EMAIL PROTECTED] 
  Sent: Monday, 1 December 2003, 14:28
  To: Lucene Users List
  Subject: Re: AW: Real Boolean Model in Lucene?
 
 
 Hi Karsten,
 
 I want to thank you for your qualified answer as well as your answer 
 from the 14th of November, where you agreed with me that Lucene is 
 basically a VSM implementation. Sometimes it is difficult to make the 
 link between the clear theory and its implementation.
 
 According to your description, Lucene basically maps the boolean query 
 into the vector space and measures the cosine similarity towards other 
 documents in the vector space. If I understood you correctly you mean if 
  a document is found by Lucene based on a boolean query it is relevant 
  (boolean true). If it is not returned, it was boolean false. The score 
  sits on top of that and can be used for ranking. If I would like to use 
  a true boolean model, I would therefore just need to ignore the scores of 
  the Hits documents. Did I understand correctly?
 
  I agree that nobody really wants to do that. My question was intended to 
  find out more about the theory implemented within Lucene.
 
 Cheers,
 Ralph
 
 
  
  Hi,
  
  
  My Question: Does Lucene use TF/IDF for getting this? (which would 
  mean
  it does not use the boolean model for the boolean query...)
  
  
  Lucene indeed uses TF/IDF with length normalization for fields and
  documents. 
  
   However, Lucene is downward compatible with the Boolean Model, where 
   documents are represented as 0/1-vectors in vector space. Ranking just 
   adds weights to the elements of the result set, so the underlying 
   interpretation of a query result can still be that of a 
   propositional/Boolean model. If a document appears in the result, its 
   tokens evaluate the query (which actually is a propositional formula 
   formed over words and phrases) to true. The representation of 
   documents is more complex in Lucene than required for the Boolean 
   Model, and as a result, Lucene can efficiently handle phrases and 
   proximity searches, but these seem to be compatible extensions - if 
   you can do it in the Boolean Model, you can do it in Lucene :)
  
   One place where Lucene is not 100% compatible with a basic Boolean Model 
   is that full negation is a bit tricky - you cannot simply ask for all 
   documents that do not contain a certain term unless you also have some 
   term that appears in all documents. Not a great deal, really.
  
   If TF/IDF weighting is a problem for you, the Similarity interface 
   implementation allows you to remove all references to length normalization 
   and document frequencies.
  
  Regards,
  
  With kind regards from Saarbrücken
  
  --
  
  Dr.-Ing. Karsten Konrad
  Head of Artificial Intelligence Lab
  
  XtraMind Technologies GmbH
  Stuhlsatzenhausweg 3
  D-66123 Saarbrücken
  Phone: +49 (681) 3025113
  Fax: +49 (681) 3025109
  [EMAIL PROTECTED]
  www.xtramind.com
  
  
  
   -Original Message-
   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
   Sent: Monday, 1 December 2003, 13:11
   To: [EMAIL PROTECTED]
   Subject: Real Boolean Model in Lucene?
  
  
  Hi,
  
   Is it possible to use a real Boolean model in Lucene for searching? When 
   one is using the QueryParser with a boolean query 

Example VSM

2003-11-29 Thread ambiesense
Hi,

Regarding the discussion about the Vector Space Model (VSM): can somebody provide
an example of how to use Lucene's (hidden) VSM? Maybe somebody has already
created an example or knows a good tutorial that refers to this. The tutorials I
know do not cover that...

Kind Regards
Ralph




Collaborative Filtering API

2003-11-25 Thread ambiesense
Hello all,

I am asking this group because I think people here might know about this
since it is a similar approach. 

Is there a Java-based API which assists developers with collaborative filtering
in their programs? By this I mean software which uses user ratings of items
and provides ways (algorithms, methods) to find users with similar
interests for prediction generation. Finding an API like Lucene would be a
dream for me, but any pointer to other APIs (also in other programming languages)
to see and learn from would be appreciated.

Kind Regards




Re: Collaborative Filtering API

2003-11-25 Thread ambiesense
Hello Mike,

I had a quick look over the javadoc and it looks promising, as you said. Did
Jon Herlocker work on GroupLens? I know GroupLens was quite a pioneering work
in the early days of collaborative systems...

Cheers
Ralph

 You should check out the work of Jon Herlocker at Oregon State 
 (http://eecs.oregonstate.edu/iis/).  They have written a CF engine that has 
 been on my to-do list to check out for a few months (sounds good on 
 paper).  If you get the chance to play with it, I'd be curious to hear 
 your feedback.  Having a CF engine in the open source domain would be a 
 great thing.
 
 -Mike
 
 At 10:49 AM 11/25/2003, you wrote:
 Hello togehter,
 
 I am asking this group because I think people here might know about this
 since it is a similar approach.
 
 Is there a Java based API which assist developers of collaborative
 filtering
 in their programs. With this I mean software, which does use user ratings
 between items and provide ways (algorithsm, methods) to find users with 
 similar
 interests for prediction generation. Finding a API like Lucene would be
 dream for me but any pointer to other API's (also in other programming 
 lanuages)
 to see and learn from would be appreciated.
 
 Kind Regards
 
 
 
 




Overview to Lucene

2003-11-12 Thread ambiesense
Hello group,

Can somebody give me an overview of Lucene? What high-level components does
it include? In particular, I want to answer the following questions regarding
the available functionality:

1) Does Lucene provide a Vector Space IR Model (with TF/IDF and Cosine
Similarity)?
2) Does Lucene provide any collaborative filtering algorithms like
correlation / user ranking etc.?
3) Does Lucene provide a Probabilistic Model?
4) Does Lucene provide anything for indexing XML documents and using XML
document structure for that? Or does it just work on flat text files?

Does anybody know good articles which demonstrate parts of that or give a
good start with Lucene?

Thanks,
Ralf




Vector Space Model in Lucene?

2003-11-12 Thread ambiesense
Hi,

Does Lucene implement a Vector Space Model? If yes, does anybody have an
example of how to use it?

Cheers,
Ralf
