Indexing sections of TEI XML files

2008-08-13 Thread ao1
Dear users,

A question on approaches to indexing TEI XML or similarly sectioned/subsectioned
files.

I'm indexing TEI P4 XML files using Lucene 2.x.

Currently, each TEI XML file corresponds to a Lucene document.
I extract the data from each XML file using XPath expressions, e.g. for the
body text: "/TEI.2/text//p". I also extract and store various metadata
(author, title, publication data, etc.) per document.

The issue is that TEI documents can be very large and contain several
chapters. Ideally, search terms would return references to chapter(s)
in which the terms were found. The user would then follow a hyperlink to a
particular subsection rather than retrieving the entire file.

I think it is possible to transform TEI files into chapterised sections
using XSLT, although I have not managed this yet. The final system
is likely to use Apache Cocoon to present documents in various formats, but
that is a separate issue.

I'm tending towards a solution that indexes each section as a Lucene
document (possibly with only the front matter carrying the metadata,
e.g. title) and then perhaps uses XPointer to associate each section with
its source document.
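For illustration, roughly the per-section indexing I have in mind (only a
sketch: the div1 XPath, the field names, and surrounding variables such as
teiDom, fileName, and writer are assumptions):

    // one Lucene document per TEI chapter; assumes top-level chapters sit
    // at /TEI.2/text/body/div1 -- adjust to the actual markup
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList chapters = (NodeList) xpath.evaluate(
            "/TEI.2/text/body/div1", teiDom, XPathConstants.NODESET);
    for (int i = 0; i < chapters.getLength(); i++) {
        Document section = new Document();
        // identify source file and section so a hit can link to the chapter
        section.add(new Field("source", fileName,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        section.add(new Field("section", String.valueOf(i + 1),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // the searchable body text of this chapter only
        section.add(new Field("text", chapters.item(i).getTextContent(),
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(section);
    }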

Any comments/approaches taken to similar issues appreciated.

Thanks,

Aodh Ó Lionáird.








Re: Indexing sections of TEI XML files

2008-08-13 Thread Erik Hatcher

Have you looked at XTF?   

It does what you're after and much, much more.

Erik





Re: Indexing sections of TEI XML files

2008-08-13 Thread ao1
Thanks, Erik, but I'm developing this system from scratch, as it has
specific use cases, including support for multiple languages and for
multiple forms of a particular minority language (Irish).

I'm going to look at XTF anyway just to see how they managed it!

Thanks,

A.




Listing fields in an index

2008-08-13 Thread John Patterson

Hi,

How do I list all the fields in an index? Some documents do not contain all
fields.

Thanks,

John




Number range search

2008-08-13 Thread m.harig

hi all,

I am indexing a price field with:

doc.add(new Field("price", "1450", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("price", "3800", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("price", "2500", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("price", "7020", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("price", "3500", Field.Store.YES, Field.Index.TOKENIZED));

The indexing completes without problems.

For searching I use:

IndexSearcher searcher = new IndexSearcher(indexDir);

Analyzer analyzer = new StandardAnalyzer();

QueryParser parser = new QueryParser("contents", analyzer);
Query query = parser.parse(qryStr);

Hits hits = searcher.search(query);

and my query is price:[1000 TO 4000]. When I search this, it returns
nothing, and when I change the query to price:[2000 TO 4000] it returns
all hits. Where am I going wrong? I am not getting any correct output.
Could anyone help me out of this? Please.



Re: Listing fields in an index

2008-08-13 Thread Erik Hatcher


Have a look at IndexReader#getFieldNames().  That'll give you back field
names regardless of which documents have them.
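For example (a sketch against a 2.3-era API; the index path is a
placeholder):

    IndexReader reader = IndexReader.open("/path/to/index");
    // FieldOption.ALL lists every field name; other options narrow this
    // to e.g. indexed-only fields or fields with term vectors
    Collection fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL);
    for (Iterator it = fieldNames.iterator(); it.hasNext();) {
        System.out.println(it.next());
    }
    reader.close();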


Erik





Re: possible to read index into memory?

2008-08-13 Thread Darren Govoni
Hoss,
   Thank you for the detailed response. What I found weird was it
seemed to take 0.09 seconds to create a RAMDirectory off a 17MB index.
Suspiciously fast, but ok.

Yet, when I do a simple fuzzy search on a single field 

"word: someword~0.76"

It was taking .35 seconds. That's a very, very long time, all things
considered. I understand about the OS paging and such, but in
doing some variations of this to "throw the OS off", I still saw
no difference between on-disk and RAM times. But despite that, the
times are really slow.

Any ideas?

thanks again,
Darren

On Tue, 2008-08-12 at 19:55 -0700, Chris Hostetter wrote:
> : On one index, I am seeing no speed change when flipping between
> : RAMDirectory IndexSearcher and file system version.
> 
> that is probably because even if you just use an FSDirectory, your OS will 
> cache the disk "pages" in RAM for you -- all using a RAMDirectory does for 
> you is guarantee that the entire index is copied into the heap you allocate 
> for your JVM.  If you've got 16GB of RAM, and a 5GB index, and you 
> allocated 12GB of RAM to the JVM and read your index into a RAMDirectory, 
> your index will always be in RAM, no matter what other processes do on 
> your machine.
> 
> If instead you only allocate 6GB of RAM to the JVM, and nothing else is 
> using up the rest of your RAM, the OS has plenty to load the whole index 
> into RAM as part of the filesystem cache once you use it -- but if another 
> process comes along and really needs that RAM (or if something reads a lot 
> of other pages of disk) your index might get bumped from the filesystem 
> cache, and the next few reads could be slow.
> 
> : Creating the RAMDirectory from the on-disk index only takes 0.09
> : seconds. It appears it is not loading the data into memory, but maybe
> : just the file names of the index?
> 
> passing an FSDirectory to the constructor of a RAMDirectory uses the 
> Directory.copy() method whose source is fairly straightforward and easy 
> to read -- unless your index is ginormous it's not surprising that it's 
> "fast", particularly if it's already in the filesystem cache.
> 
> 
> 
> 
> -Hoss
> 
> 



Re: Listing fields in an index

2008-08-13 Thread John Patterson

Thanks!  I was looking in IndexReader for a good couple of minutes and didn't
see that!





Re: Number range search

2008-08-13 Thread Doron Cohen
The code seems correct (although it doesn't show
which analyzer was used at indexing).

Note that when adding numbers like this there's no real point in analyzing
them, so I would add that field as UN_TOKENIZED. This would be more
efficient, and would also comply with the query parser, which does not apply
the analyzer to range query terms. Anyhow, this does not explain the problem
you see, because StandardAnalyzer would not modify these numbers.
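Separately, note that range query terms are compared lexicographically, so
prices with different digit counts will misbehave (e.g. "700" sorts above
"4000"). A sketch of one guard against that, padding values via NumberTools
(assuming a 2.3-era API):

    // index side: encode so lexicographic order matches numeric order
    doc.add(new Field("price", NumberTools.longToString(1450L),
            Field.Store.YES, Field.Index.UN_TOKENIZED));

    // query side: must use the same encoding
    Query q = new ConstantScoreRangeQuery("price",
            NumberTools.longToString(1000L), NumberTools.longToString(4000L),
            true, true);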

Which Lucene version are you using?

Can you provide a small self-contained program demonstrating the unexpected
behavior? (Often just writing such a small program leads to the solution of
a problem like this.)

Doron



Re: possible to read index into memory?

2008-08-13 Thread Erick Erickson
How are you measuring? There is a bunch of setup work for the first
few queries that go through the system. In either case (RAM or FS),
you should fire a few representative warmup queries at the search
engine before you go ahead and measure the response time.

You also *must* isolate your search time from your response
assembly time. That is, if you have something like
Hits hits = search()
for (each element of hits) {
   do something with the hit
}

you MUST measure the time for the search() call exclusive of the
for loop before you know where to concentrate your efforts.

In this example, if you get more than 100 hits, your query is
actually re-executed for every 100 hits you step through in the above loop.

There are other gotchas if you process your query results other ways,
so be sure you know exactly what is taking the time before worrying
about the proper way to speed things up.
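One way to structure the measurement (just a sketch):

    // warm up so one-time setup cost is excluded from the timing
    for (int i = 0; i < 5; i++) {
        searcher.search(query);
    }
    // time the search() call alone, with no result processing inside
    long start = System.currentTimeMillis();
    Hits hits = searcher.search(query);
    long searchMillis = System.currentTimeMillis() - start;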

I strongly suspect that the RAMDir is a complete red herring. A 17MB index
will almost certainly be cached by the OS after a bit of use.

There's a whole section up on the Lucene website that talks about various
ways to speed up processing.

Measure, *then* optimize...

Best
Erick



Re: Indexing sections of TEI XML files

2008-08-13 Thread Karsten F.

Hi A.

XTF's starting point was the TEI format, so I am very curious whether you
will find anything missing for your needs. (I have already used it with
Cocoon.)

I have never seen a better implementation of XML-aware searching: each hit
knows its exact position inside the indexed (= source) XML file :-)

If you dive into XTF, feel free to ask your questions at:
http://groups.google.de/group/xtf-user

Best regards
  Karsten





Case Sensitivity

2008-08-13 Thread Dino Korah
Hi All,
 
Once I have indexed a bunch of documents with a StandardAnalyzer (and the
effort needed to reindex the documents is not worth it), is there a way to
search the index without case sensitivity? I do not use any sophisticated
analyzer that makes use of LowerCaseTokenizer.

Please let me know if there is a solution to circumvent this case
sensitivity problem.
 
Many thanks
Dino
 


RE: Case Sensitivity

2008-08-13 Thread Dino Korah
I would also like to highlight the version of Lucene I am using: it is 2.0.0.



Re: Indexing sections of TEI XML files

2008-08-13 Thread Tricia Williams

Hi,

   Take a look at what I've done with SOLR-380 
(https://issues.apache.org/jira/browse/SOLR-380). The part you might 
find particularly useful is the Tokenizer.


Tricia

[EMAIL PROTECTED] wrote:

Dear users,

Question on approaches to indexing TEI XML or similar section/subsectioned
files.

I'm indexing TEI P4 XML files using Lucene 2.x.

Currently, each TEI XML file corresponds to a Lucene document.
I extract the data from each XML file using XPath expressions e.g. for the
body text: "/TEI.2/text//p". I also extract and store various meta data
e.g. author, title, publishing data etc. per document.

The issue is that TEI documents can be very large and contain several
chapters. Ideally, search terms would return references to chapter(s)
in which the terms were found. The user would then follow a hyperlink to a
particular subsection rather than retrieving the entire file.

I think it is possible to transform TEI files into chapterised sections
using XSLT although I have not managed this yet. The final system
is likely to use Apache Cocoon to present documents in various formats but
that is a separate issue.

I'm tending towards a solution involving indexing each section as a
document (possibly with only the front-matter being associated with the
meta data e.g. title) and then maybe using XPointer to associate the
source document.

Any comments/approaches taken to similar issues appreciated.

Thanks,

Aodh Ó Lionáird.





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Case Sensitivity

2008-08-13 Thread Steven A Rowe
Hi Dino,

StandardAnalyzer incorporates StandardTokenizer, StandardFilter, 
LowerCaseFilter, and StopFilter.  Any index you create using it will only 
provide case-insensitive matching.
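In other words, the lowercasing already happens at index time; a sketch of
the equivalent chain (core 2.x classes):

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new StandardFilter(stream);  // strips possessives, acronym dots
        stream = new LowerCaseFilter(stream); // this lowercases your terms
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        return stream;
    }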

Steve




Re: Case Sensitivity

2008-08-13 Thread Erick Erickson
What analyzer are you using at *query* time? I suspect that's where your
problem lies if you indeed "don't use any sophisticated analyzers", since
you *are* using a sophisticated analyzer at index time. You almost
invariably want to use the same analyzer at query time and index time.
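For instance (a sketch against the 2.0 API; the path and query text are
placeholders):

    Analyzer analyzer = new StandardAnalyzer();
    // build the index and parse queries with the *same* analyzer
    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
    // ... add documents, close the writer ...
    QueryParser parser = new QueryParser("contents", analyzer);
    Query query = parser.parse("MiXeD CaSe TeRmS"); // terms come out lowercased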

Please start a separate thread with your second question. Google
"Thread Hijacking" for the explanation of why that's a good idea.

Best
Erick



Re: possible to read index into memory?

2008-08-13 Thread Darren Govoni
Erick,
Thank you for the valuable tips. The time I'm measuring is
just around the Lucene search call with the standard analyzer, such as:

word = "helloo"
starttime = ...
query = QueryParser("word", analyzer).parse(word+"~0.76")
hits = searcher.search(query)
endtime = ...
endtime-starttime = .33 seconds

It's a fuzzy match, which I presume should take longer, but for a single
word in the query against a tiny 17MB index, the above code takes
.33 seconds for the first couple dozen runs or so, then about .15-.20 after
that. Still way, way too long for a simple query like this. Do those
figures sound right for Lucene doing this kind of single-field match?

Darren




Re: Searching Tokenized x Un_tokenized

2008-08-13 Thread Andre Rubin
Thanks Otis,

I created a custom analyzer and it's working fine.

Here's my analyzer, for reference:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

/** Treats the whole field value as a single token and lowercases it. */
public class KeywordLowerAnalyzer extends Analyzer {

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // one token for the entire input...
    TokenStream result = new KeywordTokenizer(reader);
    // ...then lowercased, so matching is case-insensitive
    result = new LowerCaseFilter(result);
    return result;
  }
}
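Usage looks like this (the field name and query are placeholders):

    Analyzer analyzer = new KeywordLowerAnalyzer();
    QueryParser parser = new QueryParser("country", analyzer);
    // wildcard terms bypass the analyzer, so let the parser lowercase them
    parser.setLowercaseExpandedTerms(true);
    Query query = parser.parse("U*"); // now matches a field indexed as "usa"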

Cheers


Andre

On Tue, Aug 12, 2008 at 9:22 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Perhaps you can lowercase the text prior to passing it to Lucene?
> Or perhaps you can have a custom Analyzer that treats the whole input as 1
> Token (see KeywordAnalyzer --
> http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/KeywordAnalyzer.html ),
> but also includes LowerCaseFilter that's applied to that 1 Token.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
>> From: Andre Rubin <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, August 13, 2008 12:15:25 AM
>> Subject: Re: Searching Tokenized x Un_tokenized
>>
>> Thanks Otis, that was exactly what was happening.
>>
>> 1) According to here:
>> http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a
>> wildcard queries are not passed through the Analyzer, but they are
>> always set to lower case.
>>
>> 2) And according to here:
>> http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c
>> un_tokenized fields are not passed through the Analyzer either.
>>
>> So by creating an untokenized field and setting
>> parser.setLowercaseExpandedTerms(false), I managed to make my use case
>> work in a case-sensitive manner. That is, 'u*' returns 'usa' and 'U*'
>> returns 'USA'.
>>
>> The thing is, how do I make this case-insensitive? I can make #1 work by
>> setting parser.setLowercaseExpandedTerms(true). But how do I make #2
>> work, that is, apply a LowerCaseFilter to an un_tokenized field?
>>
>> Thanks,
>>
>>
>> Andre
>>
>> On Tue, Aug 12, 2008 at 7:57 PM, Otis Gospodnetic wrote:
>> > Andre,
>> >
>> > Check the Lucene FAQ, there is an entry about wildcards and analysis
>> > (which doesn't take place for wildcard queries).  Could that be it?
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > - Original Message 
>> >> From: Andre Rubin
>> >> To: java-user@lucene.apache.org
>> >> Sent: Tuesday, August 12, 2008 5:30:47 PM
>> >> Subject: Re: Searching Tokenized x Un_tokenized
>> >>
>> >> My searches for my String tokenized field was working properly. I
>> >> switched the field to un_tokenized, rebuilt the index, and now my
>> >> searches only return strings that match the query string in lower
>> >> case.
>> >>
>> >> For example, searching for 'us*':
>> >>
>> >> The tokenized field version would find 'USA' and 'usa'
>> >>
>> >> The untokenized field version only finds 'usa'
>> >>
>> >> I'm using the StandardAnalyzer in both cases.
>> >>
>> >> Thanks
>> >>
>> >>
>> >> Andre
>> >>
>> >> On Thu, Aug 7, 2008 at 8:16 PM, Otis Gospodnetic
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Perhaps you can give some examples.  Yes, untokenized means "full
>> >> > string" - it requires an "exact match".
>> >> >
>> >> > Otis
>> >> > --
>> >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >> >
>> >> >
>> >> >
>> >> > - Original Message 
>> >> >> From: Andre Rubin
>> >> >> To: java-user@lucene.apache.org
>> >> >> Sent: Thursday, August 7, 2008 8:04:04 PM
>> >> >> Subject: Searching Tokenized x Un_tokenized
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> When I switched a String field from tokenized to untokenized, some
>> >> >> searches started not returning some obvious values. Am I missing
>> >> >> something on querying untokenized fields? Another question is, do I
>> >> >> need an Analyzer if my search is on an Untokenized field, wouldn't the
>> >> >> search be based on the full String rather than its tokens?
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >>
>> >> >> Andre

Payloads and tokenizers

2008-08-13 Thread Antony Bowesman
I started playing with payloads and have been trying to work out how to get
the data into the payload.

I have a field where I want to add the following untokenized values

A1
A2
A3

With these values, I would like to add the payloads

B1
B2
B3

Firstly, it looks like you cannot add payloads to untokenized fields.  Is this
correct?  In my usage, A and B are simply external IDs, so they must not be
tokenized, and there is always a 1-->1 relationship between them.

Secondly, what is the way to provide the payload data to the tokenizer?  It
looks like I have to add a List/Map of payload data to a custom Tokenizer and
Analyzer, which is then consumed on each next(Token) call.  However, it would
be nice if, in my use case, I could use some kind of construct like:

Document doc = new Document();
Field f = new Field("myField", "A1", Field.Store.NO, Field.Index.UN_TOKENIZED);
f.setPayload("B1");
doc.add(f);

and avoid the whole unnecessary Tokenizer/Analyzer overhead and get support
for payloads in untokenized fields.

It looks like it would be trivial to implement in DocumentsWriter.invertField().
 Or would this corrupt the Fieldable interface in an undesirable way?
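In the meantime, handing a pre-analyzed TokenStream to the
Field(String, TokenStream) constructor looks like it might sidestep the
analyzer; a sketch (assuming the 2.3-era Token/Payload API; the class and
names are mine):

// emits exactly one token that carries the external id as its payload
class IdPayloadTokenStream extends TokenStream {
    private final String term;
    private final byte[] payload;
    private boolean done = false;

    IdPayloadTokenStream(String term, String payload) {
        this.term = term;
        this.payload = payload.getBytes();
    }

    public Token next() {
        if (done) return null;
        done = true;
        Token t = new Token(term, 0, term.length());
        t.setPayload(new Payload(payload));
        return t;
    }
}

// usage: doc.add(new Field("myField", new IdPayloadTokenStream("A1", "B1")));

Note that a TokenStream field cannot be stored, which matches the
Field.Store.NO above.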


Antony







Re: search for special condition.

2008-08-13 Thread Mr Shore
Can Nutch or Lucene support searching for special characters like '.'?
When I search for ".net", many results come back for "net"; I want to
exclude them.
PS: I love the Korean language a lot.

2008/8/13 장용석 <[EMAIL PROTECTED]>

> hi. thank you for your response.
>
> I found the way with your help.
>
> There are classes named ConstantScoreRangeQuery and NumberTools.
>
> Reference site is here.
>
> http://markmail.org/message/dcirmifoat6uqf7y#query:org.apache.lucene.document.NumberTools+page:1+mid:tld3uekaylmu2cwt+state:results
>
>
> Thanks very much. :)
>
>
>
> 2008/8/13, Otis Gospodnetic <[EMAIL PROTECTED]>:
> >
> > Hi,
> >
> > Lucene doesn't have the greater than operator.  Perhaps you can use range
> > queries to accomplish the same thing.
> >
> >
> >
> http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Range%20Searches
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > - Original Message 
> > > From: 장용석 <[EMAIL PROTECTED]>
> > > To: java-user@lucene.apache.org
> > > Sent: Tuesday, August 12, 2008 6:01:00 AM
> > > Subject: search for special condition.
> > >
> > > hi.
> > >
> > > I am searching for lucene api or function like query "FIELD > 1000"
> > >
> > > For example, a user wants to search for a product whose price is bigger
> > > than the user's input.
> > > If the user's input is 1, then the results are the products in the index
> > > matching "PRICE > 1"
> > >
> > > Is there any way to search like that?
> > >
> > > thanks.
> > > Jang.
> > > --
> > > DEV용식
> > > http://devyongsik.tistory.com
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> --
> DEV용식
> http://devyongsik.tistory.com
>


Re: search for special condition.

2008-08-13 Thread 장용석
Hi. I was very happy to hear you love the Korean language :)
So you want to search for special characters?

If you want to include special characters when indexing, you can override a
method in the CharTokenizer class. The method's name is isTokenChar(char c):

protected boolean isTokenChar(char c) {
    return Character.isLetter(c);
}

As you can see, that method returns true when the character c is a letter^^

If you change it to "return Character.isLetter(c) || c == '.';"
then you will get result tokens that keep special characters like '.'.
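A complete sketch (my class name; assumes Lucene's CharTokenizer):

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// treats '.' as part of a token, so ".net" survives tokenization intact
public class DotKeepingTokenizer extends CharTokenizer {
    public DotKeepingTokenizer(Reader in) {
        super(in);
    }

    protected boolean isTokenChar(char c) {
        return Character.isLetter(c) || c == '.';
    }
}

Remember to use the same tokenizer at query time, or a query for ".net"
will still be reduced to "net".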

thanks. :)

Jang.




-- 
DEV용식
http://devyongsik.tistory.com