readVInt, what is it for?
Hi,

I am fairly new to Lucene and am currently going through its source code. I am trying to determine how Lucene calculates the frequency of a term in each document it finds.

I came across a method named readVInt() in the IndexInput class. It seems that every time this method is called, it produces the document number and the frequency of the term in each document.

I am wondering how it works and have failed to find any information on it on the Internet. Could anyone explain it to me? Thanks

--
View this message in context: http://www.nabble.com/readVInt%2C-what-is-it-for--tp18233802p18233802.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
RE: readVInt, what is it for?
Thanks, I am clear on that now. But does anyone know where the frequency of the term in each document is calculated? I mean, which class and which method?

Thanks

Uwe Schindler wrote:
> A VInt is the way integers are stored in the index file, in a compressed
> and variable-length manner.
>
> Read here: http://lucene.apache.org/java/2_3_2/fileformats.html#VInt
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [EMAIL PROTECTED]
>
>> -----Original Message-----
>> From: blazingwolf7 [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, July 02, 2008 11:47 AM
>> To: java-dev@lucene.apache.org
>> Subject: readVInt, what is it for?
>>
>> Hi,
>>
>> I am fairly new to Lucene and am currently going through its source
>> code. I am trying to determine how Lucene calculates the frequency
>> of a term in each document it finds.
>>
>> I came across a method named readVInt() in the IndexInput class. It seems
>> that every time this method is called, it produces the document
>> number and the frequency of the term in each document.
>>
>> I am wondering how it works and have failed to find any information on it
>> on the Internet. Could anyone explain it to me? Thanks
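To make Uwe's description of a VInt concrete, here is a minimal, self-contained sketch of the encoding as described in the file formats document he links to: each byte carries seven bits of the value, low-order bits first, and the high bit of a byte says whether another byte follows. The class name VIntSketch and the byte-array based writeVInt/readVInt helpers are made up for illustration; this is not the code of Lucene's IndexInput or IndexOutput.

```java
import java.io.ByteArrayOutputStream;

/** Toy VInt codec; an illustration of the format, not Lucene's implementation. */
final class VIntSketch {

    /** Encodes value using 7 data bits per byte, low-order group first.
     *  The high bit is set on every byte except the last one. */
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {          // more than 7 bits still to write
            out.write((value & 0x7F) | 0x80);   // low 7 bits plus continuation flag
            value >>>= 7;
        }
        out.write(value);                       // final byte, high bit clear
        return out.toByteArray();
    }

    /** Decodes a single VInt that starts at bytes[0]. */
    static int readVInt(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;       // each further byte adds 7 higher bits
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384};
        for (int v : samples) {
            byte[] enc = writeVInt(v);
            System.out.println(v + " -> " + enc.length + " byte(s), decoded " + readVInt(enc));
        }
    }
}
```

Small numbers therefore cost one byte and only large ones cost more, which is why the format suits document-number gaps and frequencies, both of which tend to be small.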
Re: readVInt, what is it for?
Hmmm, I don't think I get it. How is it tracked at index time? I indexed my files earlier. Later I open the index and perform a search. Shouldn't the frequency of each term in each matching document be calculated during the search?

Yonik Seeley wrote:
> The frequency is tracked at index time. It's simply a read at query
> time. See TermDocs.
> If you really want to understand more about the code internals of
> Lucene, I'd suggest stepping through more example queries with a
> debugger.
>
> -Yonik
>
> On Wed, Jul 2, 2008 at 8:49 PM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
>> Thanks, I am clear on that now. But does anyone know where the frequency
>> of the term in each document is calculated? I mean, which class and
>> which method?
>> Thanks
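Yonik's point that the frequency is tracked at index time and merely read at query time can be illustrated with a toy in-memory inverted index. This is only a sketch of the idea: ToyInvertedIndex, addDocument and docsAndFreqs are hypothetical names, the whitespace-style tokenization is a stand-in for a real analyzer, and none of it reflects how Lucene lays out its index on disk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: all counting happens when a document is added. */
final class ToyInvertedIndex {
    // term -> (docId -> frequency of the term in that document)
    private final Map<String, TreeMap<Integer, Integer>> postings =
            new HashMap<String, TreeMap<Integer, Integer>>();
    private int nextDocId = 0;

    /** "Index time": split the text and count each term for this document. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;                       // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, t -> new TreeMap<Integer, Integer>())
                    .merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** "Query time": nothing is counted here, the stored postings are just looked up. */
    Map<Integer, Integer> docsAndFreqs(String term) {
        TreeMap<Integer, Integer> found = postings.get(term);
        return found == null ? new TreeMap<Integer, Integer>() : found;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("the cat and the hat");
        index.addDocument("a dog");
        index.addDocument("The end");
        System.out.println(index.docsAndFreqs("the"));
    }
}
```

In this picture, TermDocs plays the role of docsAndFreqs: it walks a list that was written when the documents were added, so the search never has to rescan the document text.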
Class in Lucene that Perform Search
Hi,

I am currently using Lucene to build a search engine and am going through its source code to understand it better. I traced it from beginning to end and managed to locate all the classes that calculate the score and return the results.

What I am missing is the class that performs the actual comparison to determine whether a query matches any term in a document. I also failed to locate the class that is responsible for retrieving the documents that contain the specified term.

Can anyone help me with this? Maybe just tell me the related classes. Thanks
Re: Class in Lucene that Perform Search
Ah, thanks! I am clear now. I have to change tactics to achieve what I need. Which class creates the .frq file at index time? If possible, I want to add an extra value to it so that I can retrieve the information during the search. Thanks

Yonik Seeley wrote:
> On Wed, Jul 2, 2008 at 10:30 PM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
>> What I am missing is the class that performs the actual comparison
>> to determine whether a query matches any term in a document.
>
> You need to understand the inverted index format. Documents that
> match a term are determined at index time, not at query time. The .frq
> file lists all documents that match each term.
>
> TermDocs iterates over all documents that match the term by reading
> the .frq file.
>
> -Yonik
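This also explains the observation from the first message that repeated readVInt() calls yield both a document number and a frequency. As I read the 2.3.2 file formats document, each posting in the .frq file starts with a VInt called DocDelta: DocDelta divided by two is the gap to the previous document number, and if DocDelta is odd the frequency is one, otherwise the frequency follows as another VInt. The sketch below decodes such a sequence from values that have already been read as VInts; FrqSketch and decode are hypothetical names, and skip data and everything else in the real file are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy decoder for the DocDelta/Freq scheme described for the .frq file. */
final class FrqSketch {

    /** Turns a sequence of already-decoded VInt values into {docId, freq} pairs. */
    static List<int[]> decode(int[] vints) {
        List<int[]> postings = new ArrayList<int[]>();
        int doc = 0;
        int i = 0;
        while (i < vints.length) {
            int docDelta = vints[i++];
            doc += docDelta >>> 1;              // upper bits: gap to the previous doc number
            int freq = (docDelta & 1) != 0
                    ? 1                         // odd DocDelta: the term occurs exactly once
                    : vints[i++];               // even DocDelta: frequency is the next VInt
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }

    public static void main(String[] args) {
        // Example from the file formats document: once in doc 7, three times in doc 11.
        for (int[] p : decode(new int[] {15, 8, 3})) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

Run as is, this prints doc=7 freq=1 and doc=11 freq=3 for the sequence 15, 8, 3.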
RE: readVInt, what is it for?
Thanks for all the help. I understand how it works now. Next I need to know how to modify the .frq file. Can anyone help me with this?

Mukherjee, Prasenjit wrote:
> Slide 16 in the following presentation might be of some help. Let me know if
> it helps.
>
> http://docs.google.com/Presentation?docid=dmsxgtg_98dbh529dn
>
> -Prasen
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 03, 2008 8:08 AM
> To: java-dev@lucene.apache.org
> Subject: Re: readVInt, what is it for?
>
> I'd suggest starting with a couple of places:
> http://lucene.apache.org/java/2_3_2/fileformats.html
>
> and
>
> http://lucene.apache.org/java/2_3_2/scoring.html
>
> and then do as Yonik said and step through the internals, starting with
> a simple TermQuery which leads to the TermScorer.
>
> -Grant
>
> On Jul 2, 2008, at 10:04 PM, blazingwolf7 wrote:
>> Hmmm, I don't think I get it. How is it tracked at index time? I
>> indexed my files earlier. Later I open the index and perform a
>> search. Shouldn't the frequency of each term in each matching
>> document be calculated during the search?
Re: Class in Lucene that Perform Search
I am trying to retrieve the contentLength and the URL of each document from the index without repeatedly calling the IndexReader, e.g. reader.document(docId).get("url");

I would like to retrieve all these values and store them in an array, using the IndexReader only once or twice. I thought that if I stored some extra values in the .frq file, I would not need to keep calling the reader. Can anyone suggest another approach? Thanks

Yonik Seeley wrote:
> On Thu, Jul 3, 2008 at 4:03 AM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
>> Ah, thanks! I am clear now. I have to change tactics to achieve what I
>> need. Which class creates the .frq file at index time?
>
> DocumentsWriter (called from IndexWriter).
>
>> If possible, I want to add an extra value to it so that I can retrieve
>> the information during the search. Thanks
>
> Look at payloads first.
> What problem are you trying to solve? Someone may have an easier
> approach for you if payloads don't work.
>
> -Yonik
Untokenized URL
Hi,

I am currently working on retrieving the url and contentLength of each document found during a search. I want to retrieve them while the score is being calculated so that I can influence the score in some other way.

I used the methods from TermDocs and TermEnum to get the information. However, the url I retrieve is, as most of you know, tokenized. It is broken down into several parts and I have to rejoin them. Can anyone help me with this? I am stuck wondering how to get back the whole url without using a Reader.

Also, I tried to retrieve the contentLength, but the result returned is null. Why is that? I opened the index in Luke and the contentLength is there, but when I try to get it this way, the result is null.

Can anyone help me with both of these problems? Any help will be appreciated. Thanks
Re: Untokenized URL
No, I didn't store the contentLength; I only added it to the index. I am still scratching my head because I can't think of another way to retrieve it without repeatedly using the reader.

As for the url, I use doc.add(new Field("url", url, Store.NO, Index.TOKENIZED)). I would like to keep it this way, with the url tokenized. I am looking for a way to untokenize it: I retrieve it using a method that reads the entire field and then extracts the information in it, but the url comes back broken into pieces. I am seeking a way to reconstruct it in its original format. Can it be done?

Shai Erera wrote:
> Hi
>
> Regarding the contentLength, when you add it to the document, do you
> *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
>
> Regarding the URL, how do you add it to the document? For example, if you do
> doc.add(new Field("url", "http://www.cnn.com", Store.NO, Index.UN_TOKENIZED)),
> it would create a token like "url:http://www.cnn.com"
> without breaking it into its parts. Is that what you're looking for?
>
> Shai
>
> On Fri, Jul 4, 2008 at 11:19 AM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I am currently working on retrieving the url and contentLength of each
>> document found during a search. I want to retrieve them while the score
>> is being calculated so that I can influence the score in some other way.
>>
>> I used the methods from TermDocs and TermEnum to get the information.
>> However, the url I retrieve is, as most of you know, tokenized. It is
>> broken down into several parts and I have to rejoin them. Can anyone
>> help me with this? I am stuck wondering how to get back the whole url
>> without using a Reader.
>>
>> Also, I tried to retrieve the contentLength, but the result returned is
>> null. Why is that? I opened the index in Luke and the contentLength is
>> there, but when I try to get it this way, the result is null.
>>
>> Can anyone help me with both of these problems? Any help will be
>> appreciated. Thanks
>
> --
> Regards,
>
> Shai Erera
Re: Untokenized URL
I am trying to retrieve the url and use it as a filter. The main problem is that I don't want to use a reader to repeatedly retrieve the url for each document found.

TermDocs termDocs = reader.termDocs();
TermEnum termEnum = reader.terms (new Term (field, ""));
do{
   Term term = termEnum.term();
}while(termEnum.next());

I am using this code to retrieve the field containing the url, but it is tokenized. Is there any way to untokenize it, or is there a better way to do this?

Shai Erera wrote:
> I think that the simplest solution will be to index the URL field twice,
> once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
> un_tokenized term.
> If you have a document in hand and only want to fetch its URL, then add
> the URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
> COMPRESS and Index.NO.
>
> Perhaps I don't understand the entire scenario. When do you need to fetch
> the contentLength and URL? To what purpose?
>
> On Sun, Jul 6, 2008 at 4:26 AM, blazingwolf7 <[EMAIL PROTECTED]> wrote:
>> No, I didn't store the contentLength; I only added it to the index.
>> I am still scratching my head because I can't think of another way
>> to retrieve it without repeatedly using the reader.
>>
>> As for the url, I use doc.add(new Field("url", url, Store.NO, Index.TOKENIZED)).
>> I would like to keep it this way, with the url tokenized. I am looking
>> for a way to untokenize it: I retrieve it using a method that reads the
>> entire field and then extracts the information in it, but the url comes
>> back broken into pieces. I am seeking a way to reconstruct it in its
>> original format. Can it be done?
>
> --
> Regards,
>
> Shai Erera
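Why Shai suggests a second, untokenized copy of the field can be seen with a small toy model of what ends up in the term dictionary. This is a sketch under loud assumptions: UrlTermsSketch and its two methods are hypothetical, the split rule is a stand-in for whatever analyzer is actually in use (a real analyzer may cut a URL differently), and a sorted set merely imitates the fact that terms are kept in sorted order. Once a value is split into terms, the separators and the original order are gone, so a term enumeration has nothing to rebuild the URL from.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of the terms produced for a tokenized and an untokenized field. */
final class UrlTermsSketch {

    /** Stand-in for a tokenizing analyzer: lower-cases and splits on non-alphanumerics. */
    static SortedSet<String> tokenizedTerms(String url) {
        SortedSet<String> terms = new TreeSet<String>();
        for (String token : url.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);               // separators and original order are lost here
            }
        }
        return terms;
    }

    /** Stand-in for an untokenized field: the whole value becomes a single term. */
    static SortedSet<String> untokenizedTerms(String url) {
        SortedSet<String> terms = new TreeSet<String>();
        terms.add(url);
        return terms;
    }

    public static void main(String[] args) {
        String url = "http://www.cnn.com";
        System.out.println(tokenizedTerms(url));    // [cnn, com, http, www]
        System.out.println(untokenizedTerms(url));  // [http://www.cnn.com]
    }
}
```

With both copies indexed, the tokenized field keeps serving normal searches while the untokenized one gives a term enumeration the full URL as a single term.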
RE: Untokenized URL
Well, I am open to suggestions, except for using a reader. How does Document.get() & Co. work?

Uwe Schindler wrote:
> As Shai said before, you should store the field twice: as a tokenized field
> for your search, and under a different name (e.g. "field-untokenized") as an
> untokenized one. For your TermEnum code you may use the untokenized field,
> for normal search queries the tokenized one.
> If you want to retrieve the field contents with Document.get() & Co. instead
> of TermEnum, you may store the field one time with the flags Tokenized &
> Stored. But this does not work with your TermEnum solution.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [EMAIL PROTECTED]
>
>> -----Original Message-----
>> From: blazingwolf7 [mailto:[EMAIL PROTECTED]
>> Sent: Monday, July 07, 2008 7:39 AM
>> To: java-dev@lucene.apache.org
>> Subject: Re: Untokenized URL
>>
>> I am trying to retrieve the url and use it as a filter. The main problem
>> is that I don't want to use a reader to repeatedly retrieve the url for
>> each document found.
>>
>> TermDocs termDocs = reader.termDocs();
>> TermEnum termEnum = reader.terms (new Term (field, ""));
>> do{
>>    Term term = termEnum.term();
>> }while(termEnum.next());
>>
>> I am using this code to retrieve the field containing the url, but it is
>> tokenized. Is there any way to untokenize it, or is there a better way to
>> do this?
RE: Untokenized URL
Thanks for the help

Uwe Schindler wrote:
> Hi,
>
> Read here: http://wiki.apache.org/lucene-java/LuceneFAQ
>
> And I think that this type of question is more for the Lucene Users
> mailing list
> (http://lucene.apache.org/java/docs/mailinglists.html#Java%20User%20List).
> This list is for developers of Lucene itself, not for users asking for
> help with how to implement something specific with Lucene.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [EMAIL PROTECTED]
>
>> -----Original Message-----
>> From: blazingwolf7 [mailto:[EMAIL PROTECTED]
>> Sent: Monday, July 07, 2008 9:15 AM
>> To: java-dev@lucene.apache.org
>> Subject: RE: Untokenized URL
>>
>> Well, I am open to suggestions, except for using a reader. How does
>> Document.get() & Co. work?
How efficient is IndexReader?
Hi,

I want to use a Reader to read a document every time a matching document is found at search time. So basically, every time the score for a document is calculated, I will use the reader to retrieve some information from the index.

Will this lower the search performance? I mean, there will be millions of files involved. Will this approach be efficient, or should I find some other way to retrieve this information?