readVInt, what is it for?

2008-07-02 Thread blazingwolf7

Hi, 

I am fairly new to Lucene and am currently going through its source
code. I am trying to determine how Lucene calculates the frequency
of a term in each matching document.

I encountered a method named readVInt() in the IndexInput class. It seems
that each time this method is called, it produces the document
number and the frequency of the term in that document.

I am wondering how it works and failed to find any information on it on the
Internet. Could anyone explain it to me? Thanks
-- 
View this message in context: 
http://www.nabble.com/readVInt%2C-what-is-it-for--tp18233802p18233802.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: readVInt, what is it for?

2008-07-02 Thread blazingwolf7

Thanks, I am clear on that now. But does anyone know where the frequency of
the term for each document is calculated? I mean, which class and
which method?
Thanks


Uwe Schindler wrote:
> 
> A VInt is how integers are stored in the index files: in a compressed,
> variable-length format.
> 
> Read here: http://lucene.apache.org/java/2_3_2/fileformats.html#VInt
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [EMAIL PROTECTED]
> 
> 
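To make the format concrete, here is a minimal sketch in plain Java of the VInt scheme described on the file-formats page linked above: seven data bits per byte, low-order group first, with the high bit set on every byte except the last. The class name VIntSketch and its array-based signatures are made up for illustration; Lucene's own code reads and writes these values through IndexInput.readVInt() and IndexOutput.writeVInt().

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; not part of Lucene.
final class VIntSketch {

    // Encode an int as a VInt: 7 bits per byte, low-order bits first.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte has the high bit clear
        return out.toByteArray();
    }

    // Decode one VInt starting at pos[0]; pos[0] is advanced past it.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 127, 128, 16383, 16384}) {
            byte[] bytes = writeVInt(n);
            int decoded = readVInt(bytes, new int[] {0});
            System.out.println(n + " -> " + bytes.length + " byte(s) -> " + decoded);
        }
    }
}
```

Values up to 127 fit in one byte, 128 needs two (0x80 0x01), and 16384 needs three, which is why small numbers such as document gaps and frequencies are cheap to store this way.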




Re: readVInt, what is it for?

2008-07-02 Thread blazingwolf7

Hmmm, I don't think I get it. How is it tracked at index time? I indexed my
files earlier. Later I open the index and perform a search. Shouldn't
the frequency of each term in each matching document be calculated during
the search?


Yonik Seeley wrote:
> 
> The frequency is tracked at index time.  It's simply a read at query
> time.  See TermDocs.
> If you really want to understand more about the code internals of
> Lucene, I'd suggest stepping through more example queries with a
> debugger.
> 
> -Yonik
> 
> 
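As a rough illustration of "tracked at index time", the toy inverted index below counts term frequencies while documents are added, so a query only looks the numbers up. This is plain Java, not Lucene code: ToyIndex, addDocument, freq and docs are invented names, and splitting on whitespace stands in for a real analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model for illustration only; Lucene writes its postings to index files.
final class ToyIndex {
    // term -> (document number -> frequency of the term in that document)
    private final Map<String, TreeMap<Integer, Integer>> postings = new HashMap<>();
    private int nextDoc = 0;

    // "Index time": tokenize and count. All the counting happens here.
    int addDocument(String text) {
        int doc = nextDoc++;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(doc, 1, Integer::sum);
        }
        return doc;
    }

    // "Query time": a lookup only; nothing is recounted.
    int freq(String term, int doc) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.getOrDefault(doc, 0);
    }

    // Documents containing the term, in increasing document number order.
    Iterable<Integer> docs(String term) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        return docs == null ? Collections.<Integer>emptySet() : docs.keySet();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("lucene stores term frequency");
        index.addDocument("frequency frequency lucene");
        for (int doc : index.docs("frequency")) {
            System.out.println("doc " + doc + " freq " + index.freq("frequency", doc));
        }
    }
}
```

The search side never sees the original text; it only reads back what addDocument recorded, which is the division of labor Yonik describes.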




Class in Lucene that Perform Search

2008-07-02 Thread blazingwolf7

Hi, 

I am currently using Lucene to build a search engine and am trying to
understand it better, so I am going through its source code. I traced it all
the way from beginning to end, and have managed to locate all the classes
that calculate the score and return the results.

What I am missing is the class that performs the actual comparison to
determine whether a query matches any term in a document. I also failed
to locate the class that is responsible for retrieving the documents that
contain the specified term. Can anyone help me with this? Maybe just tell me
the related classes. Thanks



Re: Class in Lucene that Perform Search

2008-07-03 Thread blazingwolf7

Ah, thanks! I am clear now. I have to change tactics to achieve what I need.
Which class creates the .frq file at indexing time?

If possible, I want to add an extra value to it so that I can retrieve that
information during the search. Thanks


Yonik Seeley wrote:
> 
> On Wed, Jul 2, 2008 at 10:30 PM, blazingwolf7 <[EMAIL PROTECTED]>
> wrote:
>> What I am missing is that I fail to locate the class that perform the
>> actual
>> comparison to determine if a query match any term in a document.
> 
> You need to understand the inverted index format.  Which documents
> match a term is determined at index time, not at query time.  The .frq
> file lists all documents that match each term.
> 
> TermDocs iterates over all documents that match the term by reading
> the .frq file.
> 
> -Yonik
> 
> 
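On what reading the .frq file involves: the file-formats page for Lucene 2.3 describes each posting as a DocDelta VInt, optionally followed by a Freq VInt. DocDelta/2 is the gap from the previous document number, and an odd DocDelta means the frequency is 1 with no separate Freq value. The sketch below decodes that layout from an array of already-decoded VInt values. It is a plain-Java illustration with made-up names (FrqSketch, decode), not the TermDocs implementation, and it ignores skip data.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; names are hypothetical and skip data is ignored.
final class FrqSketch {

    // Decode docFreq postings from a sequence of already-decoded VInt values.
    // Each result element is {documentNumber, frequency}.
    static List<int[]> decode(int[] vints, int docFreq) {
        List<int[]> postings = new ArrayList<>();
        int doc = 0;
        int i = 0;
        for (int n = 0; n < docFreq; n++) {
            int docDelta = vints[i++];
            doc += docDelta >>> 1;            // upper bits: gap from the previous document
            int freq = (docDelta & 1) != 0    // low bit set: the term occurs exactly once
                    ? 1
                    : vints[i++];             // otherwise the frequency follows as a VInt
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }

    public static void main(String[] args) {
        // Example from the file-formats page: once in doc 7, three times in doc 11.
        for (int[] p : decode(new int[] {15, 8, 3}, 2)) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

This is why successive readVInt() calls appear to "generate" both the document number and the frequency: both were written at index time and are simply unpacked here.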




RE: readVInt, what is it for?

2008-07-03 Thread blazingwolf7

Thanks for all the help. I understand how it works now. Next I need
to know how to modify the .frq file. Can anyone help me with this?


Mukherjee, Prasenjit wrote:
> 
> Slide 16 in the following presentation might be of some help. Let me
> know if it helps.
> 
> http://docs.google.com/Presentation?docid=dmsxgtg_98dbh529dn
> 
> -Prasen 
> 
> -Original Message-
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, July 03, 2008 8:08 AM
> To: java-dev@lucene.apache.org
> Subject: Re: readVInt, what is it for?
> 
> I'd suggest starting with a couple of places:
> http://lucene.apache.org/java/2_3_2/fileformats.html
> 
> and
> 
> http://lucene.apache.org/java/2_3_2/scoring.html
> 
> and then do as Yonik said and step through the internals, starting with
> a simple TermQuery which leads to the TermScorer.
> 
> -Grant
> 
> 

Re: Class in Lucene that Perform Search

2008-07-03 Thread blazingwolf7

I am trying to retrieve the contentLength and the URL of each document from
the index without repeatedly calling the IndexReader, e.g.:
reader.document(docId).get("url");

I am trying to find a way to retrieve all these values and store them in an
array by using the IndexReader only once or twice. I thought that if I could
store some extra values in the .frq file, I would have no need to
keep using the reader. Can anyone suggest another approach? Thanks


Yonik Seeley wrote:
> 
> On Thu, Jul 3, 2008 at 4:03 AM, blazingwolf7 <[EMAIL PROTECTED]>
> wrote:
>> Ah, thanks! I am clear now. Have to change tactics to achieve what I
>> need.
>> Which class during indexing time will create the .frq file?
> 
> DocumentsWriter (called from IndexWriter).
> 
>> If possible, I want to add an extra value into it so that I can retrieve
>> the
>> information during the searching process. Thank
> 
> Look at payloads first.
> What problem are you trying to solve?  Someone may have an easier
> approach for you if payloads don't work.
> 
> -Yonik
> 
> 
> 
> 
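The "read once into an array" idea from the message above can be sketched without any Lucene types: pay for the per-document read a single time, keep the values in an array indexed by document number, and use plain array lookups while scoring. DocValueCache and the IntUnaryOperator loader that stands in for the expensive stored-field read are assumptions made for this sketch. Lucene's FieldCache follows a similar load-once pattern, though it builds its arrays from indexed terms.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical helper for illustration; not a Lucene class.
final class DocValueCache {
    private final int[] values;

    // Calls the (expensive) loader exactly once per document number.
    DocValueCache(int maxDoc, IntUnaryOperator loader) {
        values = new int[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            values[doc] = loader.applyAsInt(doc);
        }
    }

    // Cheap lookup, suitable for calling once per scored document.
    int get(int doc) {
        return values[doc];
    }

    public static void main(String[] args) {
        int[] reads = {0};
        // Stand-in for reading contentLength from the index for one document.
        DocValueCache lengths = new DocValueCache(3, doc -> {
            reads[0]++;
            return 100 * (doc + 1);
        });
        System.out.println(lengths.get(2) + " " + lengths.get(2) + " reads=" + reads[0]);
    }
}
```

The cost is memory proportional to the number of documents, and the array has to be rebuilt whenever the index is reopened.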




Untokenized URL

2008-07-04 Thread blazingwolf7

Hi,

I am currently working on retrieving the url and contentLength of each
document found during the search. I want to retrieve them during the score
calculation so that I can influence the score in some other way.

I used the methods of TermDocs and TermEnum to get the information.
However, the url I retrieve is, as most of you know, tokenized. It is broken
down into several parts and I would have to rejoin them. Can anyone help me
with this? I am stuck wondering how to get the whole url back without
using a reader.

Also, I tried to retrieve the contentLength, but the result returned is null.
Why is that? I opened the index using Luke and the contentLength is there,
but when I try to get it this way, the result is null.

Can anyone help me with both of these problems? Any help will be
appreciated. Thanks



Re: Untokenized URL

2008-07-05 Thread blazingwolf7

No, I didn't store the contentLength. I just added it to the index. I am
still scratching my head, as I can't think of another way to
retrieve it without repeatedly using the reader.

As for the url, I use
doc.add(new Field("url", url, Store.NO, Index.TOKENIZED)). I
would like to keep it this way, with the url tokenized. I am looking for
a way to untokenize it. I retrieved it using a method that retrieves
the entire field and then extracts the information in it. But the problem is
that the url is broken down. I am seeking a way to reconstruct it in its
original format. Can it be done?


Shai Erera wrote:
> 
> Hi
> 
> Regarding the contentLength, when you add it to the document, do you
> *store* it as well (i.e., passing Store.YES or Store.COMPRESS)?
> 
> Regarding the URL, how do you add it to the document? For example, if you
> do
> doc.add(new Field("url", "http://www.cnn.com", Store.NO,
> Index.UN_TOKENIZED)), it would create a token like "url:http://www.cnn.com"
> without breaking it into its parts. Is that what you're looking for?
> 
> Shai
> 
> 
> 
> -- 
> Regards,
> 
> Shai Erera
> 
> 




Re: Untokenized URL

2008-07-06 Thread blazingwolf7

I am trying to retrieve the url and use it as a filter. The main problem is
that I don't want to use a reader to repeatedly retrieve the url for each
matching document.

TermDocs termDocs = reader.termDocs();
TermEnum termEnum = reader.terms(new Term(field, ""));
do {
    Term term = termEnum.term();
    // stop once the enumeration moves past the requested field
    if (term == null || !term.field().equals(field)) break;
    // term.text() is a single indexed token, not the whole url
} while (termEnum.next());
termEnum.close();

I am using this code to retrieve the field containing the url, but it is
tokenized. Is there any way to untokenize it, or is there a better way to do
this?


Shai Erera wrote:
> 
> I think that the simplest solution will be to index the URL field twice,
> once as TOKENIZED and once as UN_TOKENIZED. Then you can look up the
> un_tokenized term.
> If you have a document in hand and only want to fetch its URL, then add
> the
> URL twice, once as Store.NO, Index.TOKENIZED and once as Store.YES /
> COMPRESS and Index.NO.
> 
> Perhaps I don't understand the entire scenario. When do you need to fetch
> the contentLength and URL? To what purpose?
> 
> 
> 
> -- 
> Regards,
> 
> Shai Erera
> 
> 
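The effect of indexing the URL twice can be seen in a toy model. In the sketch below, a deliberately crude tokenizer (split on anything that is not a letter or digit, which is an assumption and not the behavior of any particular Lucene analyzer) produces the terms for the tokenized field, while the untokenized field keeps the whole value as a single term. All names, including the "url_exact" label, are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only; real term extraction depends on the Analyzer in use.
final class UrlTerms {

    // Terms a tokenized field might yield: the URL in fragments.
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!part.isEmpty()) terms.add(part);
        }
        return terms;
    }

    // Terms an untokenized field yields: exactly one, the whole value.
    static List<String> untokenized(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String url = "http://www.cnn.com";
        System.out.println("url:       " + tokenized(url));
        System.out.println("url_exact: " + untokenized(url));
    }
}
```

Since a term enumeration returns terms in sorted order and the same term is shared by every document that contains it, the fragments of a tokenized field cannot reliably be joined back into the original URL. A second, untokenized (or stored) copy avoids the problem.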




RE: Untokenized URL

2008-07-07 Thread blazingwolf7

Well, I am open to suggestions, except for using the reader. How do
Document.get() & Co. work?


Uwe Schindler wrote:
> 
> As Shai said before, you should store the field twice: as a tokenized field
> for your search, and again under a different name (e.g.
> "field-untokenized"). Use the untokenized field in your TermEnum code and
> the tokenized one for normal search queries.
> If you want to retrieve the field contents with Document.get() & Co.
> instead of TermEnum, you can store the field once with the flags
> Tokenized & Stored.
> But this does not work with your TermEnum solution.
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [EMAIL PROTECTED]
> 

RE: Untokenized URL

2008-07-07 Thread blazingwolf7

Thanks for the help


Uwe Schindler wrote:
> 
> Hi,
> 
> Read here: http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> I think this type of question belongs on the Lucene Users mailing list
> (http://lucene.apache.org/java/docs/mailinglists.html#Java%20User%20List).
> This list is for developers of Lucene itself, not for users asking for
> help with implementing something specific with Lucene.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [EMAIL PROTECTED]
> 

How effcient is IndexReader?

2008-07-07 Thread blazingwolf7

Hi,

I want to use a reader to read a document every time a matching document is
found at search time. So basically, each time the score is calculated for a
document, I will use the reader to retrieve some information from the
index. Will this lower the search performance?

I mean, there will be millions of files involved. Will this approach be
efficient, or should I find some other way to retrieve this information?