Re: Parametric/faceted Searching

2008-07-24 Thread Konstantyn Smirnov

I solved that using a single field in the document.

Its content is based on a simple convention.
Say I have 2 docs with values BirthsMarriagesDeath_Deaths_Females and 
BirthsMarriagesDeath_Divorces.
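
A minimal sketch of this convention in plain Java, without Lucene: counting a
category or sub-category amounts to counting the field values that start with
a given prefix. The class and method names here are hypothetical, and the list
stands in for the single category field of each document.

```java
import java.util.Arrays;
import java.util.List;

public class PrefixFacetSketch {
    // Counts the category values that start with the given prefix, which is
    // what a trailing-wildcard ("prefix*") match over the field would select.
    static long countByPrefix(List<String> categoryValues, String prefix) {
        return categoryValues.stream().filter(v -> v.startsWith(prefix)).count();
    }

    public static void main(String[] args) {
        // One category value per document, using the underscore convention.
        List<String> docs = Arrays.asList(
                "BirthsMarriagesDeath_Deaths_Females",
                "BirthsMarriagesDeath_Divorces");

        // Whole category: both documents match.
        System.out.println(countByPrefix(docs, "BirthsMarriagesDeath"));        // prints 2
        // Sub-category: only the first document matches.
        System.out.println(countByPrefix(docs, "BirthsMarriagesDeath_Deaths")); // prints 1
    }
}
```

For the prefix to select exactly one subtree, the field has to be indexed as
a single term per value.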

Now when I need to get the total count for the BirthsMarriagesDeath category, I
run a "BirthsMarriagesDeath*" query. If I need to look in a sub-category, I use
"BirthsMarriagesDeath_Deaths*".
-- 
View this message in context: 
http://www.nabble.com/Parametric-faceted-Searching-tp18587632p18628000.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene delete by query

2008-07-24 Thread Michael McCandless


Unfortunately, this is not an easy question to answer ... it's really  
up to you to test it out for your application & production env, and  
see.  We certainly try very hard not to break things, but a lot of
sizable changes have gone into the trunk since 2.3.


Lucene has good test case coverage, and that coverage keeps improving  
with time -- every time we find something broken we make a test case  
first to catch it & prevent it in the future, then fix it.


If you do test the trunk it'd be great to hear back how it went, good  
or bad, because that helps us improve, faster...


Mike

Cam Bazz wrote:


how reliable is the version in the trunk? is it ok for production?



On Wed, Jul 23, 2008 at 5:25 PM, Yonik Seeley <[EMAIL PROTECTED]>  
wrote:



It's in the lucene trunk (current development version).
IndexWriter.deleteDocuments(Query query)

-Yonik

On Wed, Jul 23, 2008 at 9:53 AM, Cam Bazz <[EMAIL PROTECTED]> wrote:

hello,

Wasn't there a Lucene delete-by-query feature coming up? I remember
something like that, but I could not find any references.

best regards,
-c.b.






Re: Parametric/faceted Searching

2008-07-24 Thread Karsten F.

Hi,

my question: How did ebay solve this problem?

Take a look at the faceted browsing in the Mark Twain Project:
http://www.marktwainproject.org/xtf/search?keyword=Berlin&style=mtp
http://tinyurl.com/5cvb3c

This solution is open source and from the xtf project (they use lucene).
http://xtf.wiki.sourceforge.net/programming_Faceted_Browsing

It also uses prefix search, but it counts the hits for the subtree without an
extra search:
The tree of categories is stored very compactly in main memory, so the count
is quite fast.

"Compactly" means that for each category, and for each document which belongs
to that category, one "int" is stored in main memory.

The faceted browsing of xtf can be used without the rest of xtf (just as the
sorting in lucene can be used without lucene): you can break the
coupling, but you should consider using xtf as a whole.

Keep me informed which solution you selected :-)

Best regards
  Karsten
-- 
View this message in context: 
http://www.nabble.com/Parametric-faceted-Searching-tp18587632p18630935.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Ignoring XML tags when Indexing

2008-07-24 Thread Marcelo Schneider
Do you just want to ignore them and store everything in one field? If you know
in advance which tags are used, I guess you could set up a stop word list
with them. If not, you could write an "XMLAnalyzer" that simply ignores
everything inside '<>'...


If you want to split the XML content into separate fields, you have to
parse it before indexing; take a look at this article:
http://www.ibm.com/developerworks/library/j-lucene/


I'm a little bit new to Lucene, so I might be missing something here, 
but I wouldn't expect it to have an API for this...



Kalani Ruwanpathirana wrote:

Hi all,

I am searching for a way to ignore XML tags in the input when indexing. Is
there a built in functionality in Lucene to get this done?
I am sorry if this was discussed before. I searched but couldn't find a
clear solution.

Thanks in advance
Kalani

  


--


Marcelo Frantz Schneider
SIC - TCO - Tecnologia em Engenharia do Conhecimento

DÍGITRO TECNOLOGIA
E-mail: [EMAIL PROTECTED]
Site: www.digitro.com




Re: storing the contents of a document in the lucene index

2008-07-24 Thread starz10de

Dear Erick ,

 Thanks for your answer. I tried another way, where I read the text files
before I index them. I will also try your solution here.

best regards


Erick Erickson wrote:
> 
> OK, I'm finally catching on. You have to change the demo code to
> get the contents into something besides an input stream, so you
> can use one of the alternate forms of the Field constructor. For
> instance, you could read it all into a string and use the form:
> 
> doc.add(new Field("content", ,
>Field.Store.YES, Field.Index.TOKENIZED))
> 
> 
> Or, you can do something like this, which produces identical results
> to the above
> 
> while (more text to read) {
>  String line = read a line of text from the file
>  doc.add(new Field("content", line, Field.Store.YES,
> Field.Index.TOKENIZED))
> }
> 
> You can add to the same field as often as you want and it just appends the
> content of calls 2 to N to the same field.
> 
> 
> Best
> Erick
> 
> 
> On Wed, Jul 23, 2008 at 3:42 AM, starz10de <[EMAIL PROTECTED]> wrote:
> 
>>
>> Hi Erik,
>>
>>  I don't remove the stop words, as I index parallel corpora which are used
>> for learning the translations between pairs of languages, so every word is
>> important. I even developed my own analyzer for Arabic which just removes
>> punctuation and special symbols and returns only Arabic text.
>>
>> I guess in the   FileDocument.java   the whole text is already stored
>>
>> doc.add(Field.Text("contents", IN));
>>
>> where IN is
>>
>> IN = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
>>
>> If this is not the case, could you please tell me how to store the whole
>> text inside the index?
>>
>> I am new to lucene and I don't know how to use this "Field.Store.YES" to
>> store whole text.
>>
>>
>>
>> Best regards
>> Farag
>>
>>
>>
>> starz10de wrote:
>> >
>> >   Could anyone please tell me how to print the content of the document
>> > after reading the index?
>> > For example, if I want to print the index terms, then I do:
>> >
>> > IndexReader ir = IndexReader.open(index);
>> > TermEnum termEnum = ir.terms();
>> > while (termEnum.next()) {
>> >   TermDocs dok = ir.termDocs();
>> >   dok.seek(termEnum);
>> >   while (dok.next()) {
>> >     System.out.println(termEnum.term().text().trim());
>> >   }
>> > }
>> >
>> > I can print the text files before indexing them, but because of encoding
>> > issues I would like to print them from the index.
>> > As far as I know, the content of the document (the whole text) is also stored
>> > in the index; my question is how to print this content.
>> >
>> > So in the end I will print the path of the current document, the index terms
>> > and the content of the document.
>> >
>> >
>> > thanks in advance
>> >
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/storing-the-contents-of-a-document-in-the--lucene-index-tp18595855p18631887.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





SnowballFilter

2008-07-24 Thread Eric Hamacher
Hello:

 

Is it my imagination, or did SnowballFilter disappear from 2.3.x?  I am
looking through lucene-core-2.3.0.jar and lucene-core-2.3.2.jar and
cannot find it.  If so, what replaced it?

 

Thanks

 

Regards,

Eric Hamacher

 

**

THIS EMAIL IS INTENDED ONLY FOR THE REVIEW OF THE ADDRESSEE(S), AND MAY
CONTAIN CONFIDENTIAL AND LEGALLY PRIVILEGED INFORMATION. INTERCEPTION,
COPYING, DISSEMINATION, OR OTHER USE BY OTHER THAN THE ADDRESSEE(S) IS
PROHIBITED AND MAY BE PENALIZED UNDER APPLICABLE PRIVACY LAWS. IF YOU
RECEIVED THIS EMAIL IN ERROR, PLEASE DELETE IT AND NOTIFY ME BY RETURN
EMAIL TO [EMAIL PROTECTED] ***

 



RE: SnowballFilter

2008-07-24 Thread Eric Hamacher
Oh, I see it's a separate distribution. Sorry!

-----Original Message-----
From: Eric Hamacher [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 24, 2008 9:34 AM
To: java-user@lucene.apache.org
Subject: SnowballFilter

Hello:

 

Is it my imagination, or did SnowballFilter disappear from 2.3.x?  I am
looking through lucene-core-2.3.0.jar and lucene-core-2.3.2.jar and
cannot find it.  If so, what replaced it?

 

Thanks

 

Regards,

Eric Hamacher

 




Luke shows in top terms but no search results??

2008-07-24 Thread samd

Can someone explain this to me?

After indexing I can see the terms I expect in the top terms using Luke but
then when I search I get no results??

This is really bizarre and is blocker for me.

Thanks.
-- 
View this message in context: 
http://www.nabble.com/Luke-shows-in-top-terms-but-no-search-results---tp18638011p18638011.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Luke shows in top terms but no search results??

2008-07-24 Thread Aravind . Yarram
Are you using the same analyzer in Luke that you used for indexing?

Regards, 
Aravind R Yarram
Enabling Technologies
Equifax Information Services LLC
1525 Windward Concourse, J42E
Alpharetta, GA 30005
desk: 770 740 6951
email: [EMAIL PROTECTED] 



samd <[EMAIL PROTECTED]> 
07/24/2008 02:45 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Luke shows in top terms but no search results??







Can someone explain this to me?

After indexing I can see the terms I expect in the top terms using Luke 
but
then when I search I get no results??

This is really bizarre and is blocker for me.

Thanks.

This message contains information from Equifax Inc. which may be confidential 
and privileged.  If you are not an intended recipient, please refrain from any 
disclosure, copying, distribution or use of this information and note that such 
actions are prohibited.  If you have received this transmission in error, 
please notify by e-mail [EMAIL PROTECTED]


Re: Luke shows in top terms but no search results??

2008-07-24 Thread Erick Erickson
This is almost certainly a coding error, and it's impossible to help without
seeing some code. Please post:

1> your indexing code (suitably pared down)
2> your search code along with a sample query

Best
Erick

On Thu, Jul 24, 2008 at 2:45 PM, samd <[EMAIL PROTECTED]> wrote:

>
> Can someone explain this to me?
>
> After indexing I can see the terms I expect in the top terms using Luke but
> then when I search I get no results??
>
> This is really bizarre and is blocker for me.
>
> Thanks.
>
>


Re: Luke shows in top terms but no search results??

2008-07-24 Thread samd

Hello, and yes, I'm using the Standard analyzer in both cases.


Aravind.Yarram wrote:
> 
> r u using the same analyzer, which u used for indexing, in the luke as 
> well?
> 
> Regards, 
> Aravind R Yarram
> Enabling Technologies
> Equifax Information Services LLC
> 1525 Windward Concourse, J42E
> Alpharetta, GA 30005
> desk: 770 740 6951
> email: [EMAIL PROTECTED] 
> 
> 
> 
> samd <[EMAIL PROTECTED]> 
> 07/24/2008 02:45 PM
> Please respond to
> java-user@lucene.apache.org
> 
> 
> To
> java-user@lucene.apache.org
> cc
> 
> Subject
> Luke shows in top terms but no search results??
> 
> 
> 
> 
> 
> 
> 
> Can someone explain this to me?
> 
> After indexing I can see the terms I expect in the top terms using Luke 
> but
> then when I search I get no results??
> 
> This is really bizarre and is blocker for me.
> 
> Thanks.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Luke-shows-in-top-terms-but-no-search-results---tp18638011p18638286.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Luke shows in top terms but no search results??

2008-07-24 Thread samd

Oh and the field is not tokenized and stored.
-- 
View this message in context: 
http://www.nabble.com/Luke-shows-in-top-terms-but-no-search-results---tp18638011p18638323.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Luke shows in top terms but no search results??

2008-07-24 Thread samd

Indexing is done via Hibernate Search. I ran the search in Luke and it
returns nothing: I select the field from the drop-down and type the text
that matches the top term value. No results.

Note that this happens only for some fields, not all of them.




Erick Erickson wrote:
> 
> This is almost certainly a coding error, and it's impossible to help
> without
> seeing some code. Please post:
> 
> 1> your indexing code (suitably pared down)
> 2> your search code along with a sample query
> 
> Best
> Erick
> 
> On Thu, Jul 24, 2008 at 2:45 PM, samd <[EMAIL PROTECTED]> wrote:
> 
>>
>> Can someone explain this to me?
>>
>> After indexing I can see the terms I expect in the top terms using Luke
>> but
>> then when I search I get no results??
>>
>> This is really bizarre and is blocker for me.
>>
>> Thanks.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Luke-shows-in-top-terms-but-no-search-results---tp18638011p18638495.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Luke shows in top terms but no search results??

2008-07-24 Thread Matthew Hall

Erm.. if it's not tokenized, that's your problem.

You are setting up an Analyzer when indexing.. but then not actually 
USING it.


Whereas when you are searching you are running your query through the 
analyzer, which transforms your text in such a way that it no longer 
matches against your untokenized form.


So, rerun your index, changing untokenized to tokenized, and I think you 
will see the results you are looking for.
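
A minimal sketch of that mismatch in plain Java, without Lucene. The toy
analyze() method (lowercase, split on non-alphanumerics) is only an assumed
stand-in for a real analyzer, and all names and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenizedMismatchSketch {
    // Toy stand-in for an analyzer: lowercases and splits on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Untokenized indexing: the whole value is kept as one verbatim term.
    static Set<String> indexUntokenized(String value) {
        return Collections.singleton(value);
    }

    // Tokenized indexing: the value goes through the analyzer first.
    static Set<String> indexTokenized(String value) {
        return new HashSet<>(analyze(value));
    }

    // The query text is always analyzed; it matches only if every resulting
    // token is present as a term in the index.
    static boolean matches(Set<String> indexedTerms, String queryText) {
        List<String> queryTokens = analyze(queryText);
        return !queryTokens.isEmpty() && indexedTerms.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        String value = "Acme Widget-42";
        System.out.println(matches(indexUntokenized(value), value)); // prints false
        System.out.println(matches(indexTokenized(value), value));   // prints true
        // A value that analysis leaves unchanged still matches untokenized.
        System.out.println(matches(indexUntokenized("acme"), "acme")); // prints true
    }
}
```

The last case may be why some untokenized fields do return hits: a value that
is already a single lowercase token looks the same before and after analysis.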


Matt

samd wrote:

Oh and the field is not tokenized and stored.
  






Re: Luke shows in top terms but no search results??

2008-07-24 Thread samd

Yes, that did it, and thanks. The examples I have seen showed cases where
you can specify values which aren't tokenized and yet search against
them. Those cases were for something where the name was unique, as it is in
this case.

Now as I said before some fields have found matches which were not tokenized
and some did not. I guess I really need to understand more about Lucene but
for the time being I can work with this.

Thank you for your help.


Matthew Hall-7 wrote:
> 
> Erm.. if its not tokenized that's your problem.
> 
> You are setting up an Analyzer when indexing.. but then not actually 
> USING it.
> 
> Whereas when you are searching you are running your query through the 
> analyzer, which transforms your text in such a way that it no longer 
> matches against your untokenized form.
> 
> So, rerun your index, changing untokenized to tokenized, and I think you 
> will see the results you are looking for.
> 
> Matt
> 
> samd wrote:
>> Oh and the field is not tokenized and stored.
>>   
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Luke-shows-in-top-terms-but-no-search-results---tp18638011p18638704.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





RE: Fastest way to get just the "bits" of matching documents

2008-07-24 Thread Robert Stewart
Queries are very complex in our case; some have up to 100 or more clauses (over
several fields), including disjunctions and prohibited clauses.  Some queries
take over 5 seconds total time on a 10 million document index.  I think it is
because the queries are too big and complicated.  Are there any smarter ways to
optimize Boolean queries than using the default Boolean query classes?  Would it
make sense to pursue some custom "query optimizer"?





-----Original Message-----
From: eks dev [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 22, 2008 4:26 PM
To: java-user@lucene.apache.org
Subject: Re: Fastest way to get just the "bits" of matching documents

No, at the moment you cannot make pure boolean queries. But 1.5 seconds on a
10-million-document index sounds like a bit too much (we are well under 200 ms
on a 150-million-document collection). What you can do:

1. Use a Filter for high-frequency terms, e.g. via ConstantScoreQuery, as much as
you can, but you have to cache the filters (CachingWrapperFilter or something like
that). SortedVIntList can help a lot in reducing memory requirements for
filter caching.
2. Use RAMDisk if it fits in RAM, or MMAPDisk.
3. Provide more details: what is the structure of the Query that takes so long,
what is the data in the index... so that someone can really help you. Your
question is just too abstract now.
4. Try to sort your index so that things you expect in the result are close
together, e.g. if you search predominantly on some number, sort on it... if you
can... This helps reduce IO stress due to locality.
5. Try https://issues.apache.org/jira/browse/LUCENE-1340, as you do not need
term frequencies for scoring.
6. Try using your own HitCollector instead of QueryFilter.Bits() to get your bits.
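
A rough sketch of point 6 in plain Java, without Lucene. The HitCallback
interface below is a hypothetical stand-in that only mirrors the general shape
of a per-hit callback (one call per matching document, with a doc id and a
score), and fakeSearch() stands in for the real search call. The collector
sets one bit per hit and ignores the score. Note that this only avoids
building a ranked result list; whether the score itself is still computed
before the callback depends on the search implementation.

```java
import java.util.BitSet;

public class BitCollectorSketch {
    // Hypothetical stand-in for a per-hit callback: invoked once per match.
    interface HitCallback {
        void collect(int doc, float score);
    }

    // Records matching doc ids as bits; the score argument is never used.
    static final class BitSetCollector implements HitCallback {
        private final BitSet bits;

        BitSetCollector(int maxDoc) {
            this.bits = new BitSet(maxDoc);
        }

        @Override
        public void collect(int doc, float score) {
            bits.set(doc);
        }

        BitSet bits() {
            return bits;
        }
    }

    // Stand-in for a search: reports each matching doc id to the callback.
    static void fakeSearch(int[] matchingDocs, HitCallback callback) {
        for (int doc : matchingDocs) {
            callback.collect(doc, 1.0f);
        }
    }

    public static void main(String[] args) {
        BitSetCollector collector = new BitSetCollector(10);
        fakeSearch(new int[] {1, 4, 7}, collector);
        System.out.println(collector.bits());               // prints {1, 4, 7}
        System.out.println(collector.bits().cardinality()); // prints 3
    }
}
```

The resulting BitSet can then be combined with other filters (date ranges,
entitlements) using and()/andNot() before any sorting is done.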


If you have tried all these options and it still does not work fast enough, and you
really have a bottleneck in scoring (I doubt it), then you have 2 options:
- Wait for Paul to come back from holidays; he wanted to make "pure Boolean"
queries, without scoring, possible :)
- Invest in faster CPU/Memory


have fun
eks



----- Original Message -----
> From: Robert Stewart <[EMAIL PROTECTED]>
> To: "java-user@lucene.apache.org" 
> Sent: Tuesday, 22 July, 2008 9:37:26 PM
> Subject: Fastest way to get just the "bits" of matching documents
>
> I need to execute a boolean query and get back just the bits of all the
> matching documents.  I do additional filtering (date ranges and entitlements)
> and then do my own sorting later on.  I know that using QueryFilter.Bits() will
> still compute scores for all matching documents.  I do not want to compute any
> scores.  For queries with large results (over 5 million), it seems somewhat
> slow, and maybe computing scores is taking some time.  I have a 10 million
> document index, and for some very broad queries (4-5 million matching
> documents), getting the bits seems slow (1.5 seconds).  I can do my own
> sorting of results for the requested page in under 30 ms, since I have
> efficient cached permutations of sorting by various fields.  Is there a way,
> given a BooleanQuery, to get matching bits without computing any scores
> internally?  I looked at ConstantScoreQuery but I believe it actually still
> computes scores since it gets bits from the underlying query anyway.  In fact
> I tested it and it is actually slower to use ConstantScoreQuery than not to.
>
> Is it possible to use a custom similarity class to make scoring faster (by
> returning 0 values, etc)?
>
>
>
>
> Thanks,
> Bob






Re: Ignoring XML tags when Indexing

2008-07-24 Thread Kalani Ruwanpathirana
Hi Marcelo,

Thanks for the reply. Yes, I want to ignore all the tags and store the text
in one field. The tags are not known in advance, so it seems the "XMLAnalyzer"
is the solution. However, I think Lucene itself does not provide an XMLAnalyzer.
Do I have to write it myself?

Kalani

On Thu, Jul 24, 2008 at 6:10 PM, Marcelo Schneider <
[EMAIL PROTECTED]> wrote:

> Do you just want to ignore them and store all in one field? If you know the
> used tags previously, I guess you could set up a stop words list with them.
> If not, you could do an "XMLAnalyzer" that simply ignores everything inside
> '<>'...
>
> If you want to split the xml content in separate fields, you have to parse
> it before indexing, take a look at this article:
> http://www.ibm.com/developerworks/library/j-lucene/
>
> I'm a little bit new to Lucene, so I might be missing something here, but I
> wouldn't expect it to have an API for this...
>
>
> Kalani Ruwanpathirana escreveu:
>
>> Hi all,
>>
>> I am searching for a way to ignore XML tags in the input when indexing. Is
>> there a built in functionality in Lucene to get this done?
>> I am sorry if this was discussed before. I searched but couldn't find a
>> clear solution.
>>
>> Thanks in advance
>> Kalani
>>
>>
>>
>
>
>


-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa


Re: Ignoring XML tags when Indexing

2008-07-24 Thread Daniel Noll

Kalani Ruwanpathirana wrote:

Hi Marcelo,

Thanks for the reply. Yes I want to ignore all the tags and store the text
in one field. Previously used tags are not known and seems the "XMLAnalyzer"
is the
solution. Anyway I think Lucene itself does not support a XMLAnalyzer. Do I
have to do it manually?


What makes more sense (at least the way I see it) is to implement a 
Reader which returns the text you need from the XML.  This sort of thing 
is relatively simple to do with the newer StAX API.  You can have your 
reader return even small chunks of text, and it should perform okay as 
long as you have a BufferedReader wrapped around the entire thing.
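
A minimal sketch of such a Reader, assuming the StAX implementation bundled
with the JDK (javax.xml.stream). The class name XmlTextReader and the helper
extractText() are hypothetical. Emitting a single space at each closing tag
is one possible design choice, made here so that text from adjacent elements
does not run together.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlTextReader extends Reader {
    private final XMLStreamReader xml;
    private String chunk = ""; // text of the current event
    private int pos = 0;       // read position inside chunk

    public XmlTextReader(Reader source) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        this.xml = factory.createXMLStreamReader(source);
    }

    // Advances to the next event that yields text; false at end of document.
    private boolean fill() throws IOException {
        try {
            while (pos >= chunk.length()) {
                if (!xml.hasNext()) return false;
                int event = xml.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA
                        || event == XMLStreamConstants.SPACE) {
                    chunk = xml.getText();
                    pos = 0;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    chunk = " "; // separator between elements
                    pos = 0;
                }
            }
            return true;
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (!fill()) return -1;
        int n = Math.min(len, chunk.length() - pos);
        chunk.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        try {
            xml.close();
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }

    // Convenience helper for the example: reads all tag-free text.
    static String extractText(String xmlSource) {
        try (Reader in = new BufferedReader(new XmlTextReader(new StringReader(xmlSource)))) {
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) out.append((char) c);
            return out.toString();
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xmlSource = "<doc><title>Hello</title><body>big world</body></doc>";
        System.out.println(extractText(xmlSource).trim()); // prints Hello big world
    }
}
```

A Reader like this could be handed to the indexing code in place of the raw
file reader, so that only the element text reaches the analyzer.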


Daniel

--
Daniel Noll
