Re: Unique Fields

2008-03-12 Thread Ion Badita

The "problem" is that my unique field is a title, with many terms per field.
I want to make an index with titles and I don't want to have duplicates.

John


Erick Erickson wrote:

You can easily find whether a term is in the index with TermEnum/TermDocs
(I think TermEnum is all you really need).

Except, you'll probably also have to keep an internal map of IDs added since
the searcher was opened and check against that too.
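A rough sketch of that bookkeeping in plain Java, with collections standing in for the index lookup (in real code the first check would go through TermEnum/TermDocs; the class and method names here are invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model: "indexed" holds IDs already committed and visible to the
// searcher; "pending" holds IDs added since the searcher was opened.
class UniqueChecker {
    private final Set<String> indexed = new HashSet<>(); // stands in for TermEnum lookup
    private final Set<String> pending = new HashSet<>(); // not yet visible to searcher

    public boolean addIfUnique(String id) {
        if (indexed.contains(id) || pending.contains(id)) {
            return false; // duplicate: already indexed or queued
        }
        pending.add(id);
        return true;
    }

    // Called after reopening the searcher: pending docs are now searchable.
    public void searcherReopened() {
        indexed.addAll(pending);
        pending.clear();
    }
}
```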

Best
Erick

On Tue, Mar 11, 2008 at 11:04 AM, Ion Badita <[EMAIL PROTECTED]>
wrote:

  

Hi,

I want to create an index with one unique field.
Before inserting a document i must be sure that "unique field" is unique.



John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





  




Re: Specialized XML handling in Lucene

2008-03-12 Thread Eran Sevi
Indeed it seems like a problematic way.

I would also have a problem searching for documents with more than one
value. If the query is something simple like "value1 AND value2", I would
expect to get all XML docs with both values, but if I use the doc=element
method, I won't get any results, because each doc contains only value1 or
value2 or something else, even if their xml_doc_id is the same.
Back to the drawing board...
On Tue, Mar 11, 2008 at 9:50 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:

> On 03/11/2008 at 11:48 AM, Steven A Rowe wrote:
> > 5 billion docs is within the range that Lucene can handle.  I
> > think you should try doc = element and see how well it works.
>
> Sorry, Eran, I was dead wrong about this assertion.  See this thread for
> more information:
>
> <
> http://www.nabble.com/MultiSearcher-to-overcome-the-Integer.MAX_VALUE-limit-td15876190.html
> >
>
> Looks like doc = element is *not* the way to go.
>
> Steve
>
>
>
>


Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-12 Thread thogau

Thanks for your suggestion markmiller. When I try this query, I get both
documents as hits: the one with the field having a value and also the one
with the field not set...
Any idea why?


markrmiller wrote:
> 
> You cannot have a purely negative query like you can in Solr.
> 
> Try: *:* -MY_FIELD_NAME:[* TO *]
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Searching-for-null-%28empty%29-fields%2C-how-to-use--field%3A-*-TO-*--tp15976538p16000127.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Michael McCandless


Daniel Noll wrote:

I have filtered out lines in the log which indicated an exception adding
the document; these occur when our Reader throws an IOException, and there
were so many that it bloated the file.


OK, I think very likely this is the issue: when IndexWriter hits an
exception while processing a document, the portion of the document
already indexed is left in the index, and then its docID is marked
for deletion.  You can see these deletions in your infoStream:

  flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments

This means you have deletions in your index, by docID, and so when
you optimize, the docIDs are then compacted.
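As a toy model of that compaction (plain Java, nothing Lucene-specific): deleting docIDs and then rewriting the segment renumbers the survivors, so every doc after a deleted slot shifts down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class Compaction {
    // Rewrite a segment's doc list without the deleted slots; surviving
    // docs get new, contiguous docIDs (their position in the new list).
    static List<String> optimize(List<String> docs, Set<Integer> deletedIds) {
        List<String> compacted = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deletedIds.contains(id)) {
                compacted.add(docs.get(id));
            }
        }
        return compacted;
    }
}
```

So a doc that had docID 2 before the deletes can come out with docID 1 afterwards, which is exactly the shuffling observed.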


Mike




Using Lucene from scripting language without any java coding

2008-03-12 Thread Mathieu Lecarme
Here is a POC about using Lucene, via Compass, from PHP or Python (other
languages will come later), with only XML configuration, object notation,
and native use of the scripting language.

http://blog.garambrogne.net/index.php?post/2008/03/11/Using-Compass-without-dirtying-its-hands-with-java

It looks like Solr, but it's different.

Solr:
framework: Lucene
serialisation: XML
transport: HTTP via servlet

Goniometre:
framework: Compass + Spring
serialisation: JSON
transport: socket via Mina

Another difference: Solr is a mature project with admin pages, caching and
production-ready stuff. Goniometre is a prototype for Compass fans who
like XML and coding in any language except Java.


Examples, test and code are available via svn.

M.




Re: Unique Fields

2008-03-12 Thread Erick Erickson
So, you're tokenizing the title field? If so, I don't understand how you
expect this to work. Would the titles "this is one order" and "is one order
this" be considered identical? Would capitalization matter? Punctuation?
Throwing all the terms of a title into a tokenized field and expecting
some magic to keep out duplicates is beyond the scope of Lucene; you'll
have to roll some customized solution.

For instance, index your title UN_TOKENIZED in a duplicate field (after
applying whatever massaging you want re: punctuation, spaces, etc.). Use
TermDocs/TermEnum on that field to detect duplicates. You won't search on
this field.

Or create a hash of the title and index *that* in a separate field, and
check against the hash with TermEnum/TermDocs. Or...

But no, there's no magic that makes Lucene DWIM (Do What I Mean)...
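A sketch of the hash variant in plain Java (the normalization rules here are only an example of the "massaging" mentioned above; choose whatever canonical form fits your titles):

```java
import java.security.MessageDigest;

class TitleKey {
    // Canonicalize a title, then hash it; index the hex digest in an
    // UN_TOKENIZED field and check it with TermEnum/TermDocs before adding.
    static String hash(String title) {
        // Example normalization: lowercase, drop punctuation, collapse spaces.
        String normalized = title.toLowerCase()
                .replaceAll("[^a-z0-9 ]", "")
                .replaceAll("\\s+", " ")
                .trim();
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(normalized.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that under these rules "This is one order" and "  THIS is one, order! " collapse to the same key, while a reordering like "is one order this" does not.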

Best
Erick

On Wed, Mar 12, 2008 at 2:01 AM, Ion Badita <[EMAIL PROTECTED]>
wrote:

> The "problem" is that my unique field is a title, many terms per field.
> I want to make an index with titles and i don't want to have duplicates.
>
> John
>
>
> Erick Erickson wrote:
> > You can easily find whether a term is in the index with
> TermEnum/TermDocs
> > (I think TermEnum is all you really need).
> >
> > Except, you'll probably also have to keep an internal map of IDs added
> since
> > the searcher was opened and check against that too.
> >
> > Best
> > Erick
> >
> > On Tue, Mar 11, 2008 at 11:04 AM, Ion Badita <
> [EMAIL PROTECTED]>
> > wrote:
> >
> >
> >> Hi,
> >>
> >> I want to create an index with one unique field.
> >> Before inserting a document i must be sure that "unique field" is
> unique.
> >>
> >>
> >>
> >> John
> >>
> >
> >
>
>


Re: Searching for null (empty) fields, how to use -field:[* TO *]

2008-03-12 Thread thogau

Thanks Erick, I ended up following your second suggestion.
It was a bit tricky since I had to plug into a MapConverter, but it
works as expected.
Thanks to all.

--thogau



You could also think about making a filter, probably when you open
your searcher. You can use TermDocs/TermEnum to find all of the documents
that *do* have entries for your field, assemble those into a filter, then
invert that filter. Keep the filter around and use it whenever you need
to. Perhaps CachingWrapperFilter would help here (although I've never
used it).
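For the filter route, the inversion itself is cheap, since filters of this vintage are java.util.BitSet-based; here is the shape of it, with the TermDocs walk replaced by a plain boolean array for illustration:

```java
import java.util.BitSet;

class NullFieldFilter {
    // hasField[i] is what you'd learn from TermDocs: does doc i have
    // a value for the field? Build the "has it" set, then flip it.
    static BitSet docsMissingField(boolean[] hasField) {
        BitSet bits = new BitSet(hasField.length);
        for (int doc = 0; doc < hasField.length; doc++) {
            if (hasField[doc]) {
                bits.set(doc);
            }
        }
        bits.flip(0, hasField.length); // invert: set bits = docs WITHOUT the field
        return bits;
    }
}
```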

Another possibility is to index an extra field only for those documents
that don't have any value for MY_FIELD_NAME. So when indexing a doc, you
have something like (Lucene 2.3 API):

if (value != null) {
   doc.add(new Field("MY_FIELD_NAME", value, Field.Store.YES, Field.Index.TOKENIZED));
} else {
   doc.add(new Field("NO_MY_FIELD_NAME", "no", Field.Store.NO, Field.Index.UN_TOKENIZED));
}

Now finding docs without your field really is just searching on
NO_MY_FIELD_NAME:no.

Your index would be very slightly bigger in this instance.

FWIW
Erick





Highlighter Hits

2008-03-12 Thread JensBurkhardt

Hello everybody,

I have a slight problem using Lucene's highlighter. If I have the
highlighter enabled, a query creates 0 hits; if I disable the highlighter,
I get the hits. It seems like, when I call searcher.search() and pass my
Hits to the highlighter function, the program quits. All prints after the
highlighter call also do not appear.
I have no idea what the problem is.

Thanks in advance

Jens Burkhardt





Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Erick Erickson
I certainly found that lazy loading changed my speed dramatically, but
that was on a particularly field-heavy index.

I wonder if TermEnum/TermDocs would be fast enough on an indexed
(UN_TOKENIZED???) field for a unique id.

Mostly, I'm hoping you'll try this and tell me if it works so I don't have
to sometime 

Erick

On Tue, Mar 11, 2008 at 9:26 PM, Daniel Noll <[EMAIL PROTECTED]> wrote:

> On Wednesday 12 March 2008 09:53:58 Erick Erickson wrote:
> > But to me, it always seems...er...fraught to even *think* about relying
> > on doc ids. I know you've been around the block with Lucene, but do you
> > have a compelling reason to use the doc ID and not your own unique ID?
>
> From memory it was around a factor of 10 times slower to use a text field
> for
> this; I haven't tested it recently and the case of retrieving the Document
> should be slightly faster now that we have FieldSelector, but it certainly
> won't be faster as to get the document you need the ID in the first place.
>
> For single documents it wasn't a problem, the use cases are:
>  1. Bulk database operations based on the matched documents.
>  2. Creating a filter BitSet based on a database query.
>
> Effectively this is required because Lucene offered no way to update a
> Document after it was indexed; if it had that feature we would never have
> needed a database. ;-)
>
> Daniel
>
>
>


Re: Highlighter Hits

2008-03-12 Thread Erick Erickson
What does your stack trace look like? I've never seen Lucene "just quit"
without throwing an exception, and printStackTrace() is your friend.

Or are you catching exceptions without logging them? If so, shame
on you.

Best
Erick

P.S. I can't recommend strongly enough that you get a good IDE
and debug in it. I spent far too much of my life debugging with
printlns and never, ever want to go back there again... Eclipse
is free, if sometimes "interesting" to set up. IntelliJ is sweet. And
a unit test or two will help significantly too. Sorry if you know all this,
but your comment about prints lights me right up.



On Wed, Mar 12, 2008 at 9:54 AM, JensBurkhardt <[EMAIL PROTECTED]> wrote:

>
> Hello everybody,
>
> I have s slight problem using lucenes highlighter. If i have the
> highlighter
> enabled, a query creates 0 hits, if i disable the highlighter i get the
> hits.
> It seems like, when i call searcher.search() and pass my Hits hits to the
> highlighter function, the program quits. All prints after the highlighter
> call also do not appear.
> I have no idea what the problem is.
>
> Thanks in advise
>
> Jens Burkhardt
>
>
>
>


cannot delete cfs files on windows

2008-03-12 Thread Ioannis Cherouvim

Hello

I can index many times and delete the index files (manually). But if I
search once, then the cfs file is locked and cannot be deleted.
Subsequent indexings create new cfs files. Even if I undeploy the Tomcat
web application which holds the search code, the cfs file cannot be deleted.



O/S:
Windows XP


Code to index:
IndexWriter writer = new IndexWriter(
   PATH,
   new StandardAnalyzer(),
   true);

writer.addDocument(doc1);
writer.addDocument(doc2);
writer.addDocument(doc3);

writer.optimize();
writer.close();


Code to search:
Searcher searcher = null;
IndexReader indexReader = null;
try {
   indexReader = IndexReader.open(PATH);
   searcher = new IndexSearcher(indexReader);
   Hits hits = searcher.search(query);
   ...
} finally {
   // guard against an exception thrown before these were assigned
   if (searcher != null) searcher.close();
   if (indexReader != null) indexReader.close();
}



Am I doing something wrong?

thanks,
Ioannis




Re: Highlighter Hits

2008-03-12 Thread Matthew Hall
I suspect you are using a different analyzer to highlight than you are
using to search.

A couple of things you can check:

Immediately after your query, simply print out hits.length(); this should
conclusively tell you whether your query is in fact working. After that,
ensure that you are using the same analyzer for your highlighter that
you are using for your query parser.


If you are not, it's entirely possible that the text you are trying to
highlight is being transformed differently than it was in the query, and
as a result isn't matching against your fields anymore.
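A toy illustration of that mismatch in plain Java (these two "analyzers" are simplified stand-ins, not Lucene classes): if the query side lowercases and the highlighter side doesn't, the same text yields different tokens and nothing lines up.

```java
import java.util.Arrays;
import java.util.List;

class AnalyzerMismatch {
    // Stand-in for an analyzer that lowercases (as StandardAnalyzer does).
    static List<String> lowercasing(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // Stand-in for an analyzer that keeps case as-is.
    static List<String> caseKeeping(String text) {
        return Arrays.asList(text.split("\\s+"));
    }
}
```

A query token "lucene" matches the token stream from the first analyzer but not the one from the second, so a highlighter built on the second finds nothing to mark.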


Hope that helps,

Matt

JensBurkhardt wrote:

Hello everybody,

I have s slight problem using lucenes highlighter. If i have the highlighter
enabled, a query creates 0 hits, if i disable the highlighter i get the
hits.
It seems like, when i call searcher.search() and pass my Hits hits to the
highlighter function, the program quits. All prints after the highlighter
call also do not appear.
I have no idea what the problem is. 


Thanks in advise

Jens Burkhardt
  






Indexing Yes and No

2008-03-12 Thread Raq

Querying Lucene with

includeNews:Yes

works fine and brings back expected results.

includeNews:No

does not work and brings back nothing.

There are definitely documents in my index that have the word "No" in the
includeNews field.

Tested in Luke with all the analyzers.

Any ideas? Any thoughts? Any workarounds?

Any help much appreciated.

Thanks

Raq





IndexReader deleteDocument

2008-03-12 Thread varun sood
Hi,
I am trying to delete a document without using the Hits object.
What is the unique field in the index that I can use to delete the document?

I am trying to make a web interface where the index can be modified, a
smaller subset of what Luke does, but using JSPs and servlets.

To use deleteDocument(int docNum) I need docNum. How can I get this? Or
does it have to come only via Hits?

Thanks,
Varun


Re: IndexReader deleteDocument

2008-03-12 Thread Mark Miller
Have you seen the work that Mark Harwood has done making a GWT version
of Luke? I think it's in the latest release.


varun sood wrote:

Hi,
  I am trying to delete a document without using the hits object.
What is the unique field in the index that I can use to delete the document?

I am trying to make a web interface where index can be modified, smaller
subset of what Luke does but using JSPs and Servlet.

to use deleteDocument(int docNum)
I need docNum how can I get this? or does it have to come only vis Hits?

Thanks,
Varun

   





Re: Indexing Yes and No

2008-03-12 Thread Mark Miller
Well, if you're using a stopword list, "no" is likely to be on it and
"yes" is not.
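A quick way to see the effect without Lucene: run the tokens through a small English stopword set (the list below is illustrative, not StandardAnalyzer's exact list, so check the one your analyzer actually uses):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopwordDemo {
    // A few entries from a typical English stopword list (illustrative only).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "no", "not", "is"));

    // Lowercase each token and drop it if it is a stopword,
    // mimicking what an analyzer with a stop filter does.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase();
            if (!STOPWORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }
}
```

"Yes" survives analysis and is searchable; "No" is silently dropped at both index and query time, so includeNews:No can never match.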


Raq wrote:

Querying Lucene with

includeNews:Yes

Works fine and brings back expected results..

includeNews:No

Does not work and brings back nothing..

There are definitely documents in my index that has the word "No" in the
includeNews field.

Tested in Luke with all the analyzers.

Any ideas? Any thoughts? Any workarounds..?

Any help much appreciated.

Thanks

Raq
   





Re: IndexReader deleteDocument

2008-03-12 Thread varun sood
No, I haven't, but I will, even though I would like to make my own
implementation. So, any idea of how to get the doc num?

Thanks for replying.
Varun

On Wed, Mar 12, 2008 at 5:15 PM, Mark Miller <[EMAIL PROTECTED]> wrote:

> Have you seen the work that Mark Harwood has done making a GWT version
> of Luke? I think its in the latest release.
>
> varun sood wrote:
> > Hi,
> >   I am trying to delete a document without using the hits object.
> > What is the unique field in the index that I can use to delete the
> document?
> >
> > I am trying to make a web interface where index can be modified, smaller
> > subset of what Luke does but using JSPs and Servlet.
> >
> > to use deleteDocument(int docNum)
> > I need docNum how can I get this? or does it have to come only vis Hits?
> >
> > Thanks,
> > Varun
> >
> >
>
>
>


Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Daniel Noll
On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
> OK, I think very likely this is the issue: when IndexWriter hits an
> exception while processing a document, the portion of the document
> already indexed is left in the index, and then its docID is marked
> for deletion.  You can see these deletions in your infoStream:
>
>flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments
>
> This means you have deletions in your index, by docID, and so when
> you optimize the docIDs are then compacted.

Aha.  Under 2.2, a failure would result in nothing being added to the text
index, so this would explain the problem.  It would also explain why smaller
data sets are less likely to cause the problem (it's less likely for there to
be an error in them).

Workarounds?
  - flush() after any IOException from addDocument()  (overhead?)
  - use ++ to determine the next document ID instead of
index.getWriter().docCount()  (out of sync after an error, but fixes itself
on optimize())
  - Use a field for a separate ID (slower later when reading the index)
  - ???

Daniel




indexing api wrt Analyzer

2008-03-12 Thread John Wang
Hi all:

Maybe this has been asked before:

I am building an index consisting of multiple languages (stored as a
field), and I have different analyzers depending on the language of the
document to be indexed. But the IndexWriter takes only an Analyzer.

I was hoping to have IndexWriter take an AnalyzerFactory, where the
AnalyzerFactory produces an Analyzer depending on some criteria of the
document, e.g. language.

Maybe I am going about this the wrong way.

Any suggestions on how to go about it?

Thanks

-John


Re: indexing api wrt Analyzer

2008-03-12 Thread Asgeir Frimannsson
On Thu, Mar 13, 2008 at 10:40 AM, John Wang <[EMAIL PROTECTED]> wrote:

> Hi all:
>
>Maybe this has been asked before:
>
>I am building an index consists of multiple languages, (stored as a
> field), and I have different analyzers depending on the language of the
> language to be indexed. But the IndexWriter takes only an Analyzer.
>
>I was hoping to have IndexWriter take an AnalyzerFactory, where the
> AnalyzerFactory produces Analyzer depending on some criteria of the
> document, e.g. language.
>
>Maybe I am going about the wrong way.
>
>Any suggestions on how to go about?
>

Perhaps this is what you are searching for:

http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

With PerFieldAnalyzerWrapper, you can specify which analyzer to use with
each field, as well as a default analyzer.

cheers,
asgeir


Re: indexing api wrt Analyzer

2008-03-12 Thread Daniel Noll
On Thursday 13 March 2008 15:21:19 Asgeir Frimannsson wrote:
> >I was hoping to have IndexWriter take an AnalyzerFactory, where the
> > AnalyzerFactory produces Analyzer depending on some criteria of the
> > document, e.g. language.

> With PerFieldAnalyzerWrapper, you can specify which analyzer to use with
> each field, as well as a default analyzer.

Certainly this would work as long as you store each language in a different
Lucene field.  This is probably a good idea anyway, as it will make things
easier for the QueryParser, where there won't necessarily be enough text to
determine the language easily.
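For what it's worth, one way to wire that up is to derive the field name from the document's language at index time (the naming scheme below is just an illustration, not an established convention), and then register one analyzer per such field in PerFieldAnalyzerWrapper:

```java
class LanguageFields {
    // Example scheme: index each document's text under "body_<lang>",
    // so PerFieldAnalyzerWrapper can map each field name to the right
    // analyzer, and the QueryParser targets the field for the query's
    // language.
    static String fieldFor(String baseField, String languageCode) {
        return baseField + "_" + languageCode.toLowerCase();
    }
}
```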

Daniel




Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-12 Thread Daniel Noll
On Thursday 13 March 2008 00:42:59 Erick Erickson wrote:
> I certainly found that lazy loading changed my speed dramatically, but
> that was on a particularly field-heavy index.
>
> I wonder if TermEnum/TermDocs would be fast enough on an indexed
> (UN_TOKENIZED???) field for a unique id.
>
> Mostly, I'm hoping you'll try this and tell me if it works so I don't have
> to sometime 

I added a "uid" field to our existing fields.  After the load there were some 
gaps in the values for this field; presumably those were documents where 
adding the doc failed and adding the fallback doc also failed.  The index 
contains 20004 documents.  Each test I ran over 10 iterations and times below 
are an average of the last 5 as it took around 5 rounds to warm up.

Filter building, for a filter returning 1000 documents randomly selected:

   Time to build filter by UID (100% Derby) - 93ms
   Additional time to build filter by DocID - 12ms (13% penalty)

13% penalty is acceptable IMO.  The problem comes next.

Bulk operation building, for a query returning around 2800 documents:

   Time to build the bulkop by DocID (100% Hits) - 6ms
   Time to fetch the "uid" field from the document - 152ms (2600% penalty)
   Time to do the DB query (not counting commit though) - 10ms

For interest's sake I also timed fetching the document with no FieldSelector, 
that takes around 410ms for the same documents.  So there is still a big 
benefit in using the field selector, it just isn't anywhere near enough to 
get it close to the time it takes to retrieve the doc IDs.

Daniel




Re: Good way of Indexing TextFiles

2008-03-12 Thread Sebastin

Hi All,
I tried one indexing strategy:
 1. I am having unique numbers as the search column. For example, my
search query would be:

  9840836588 AND dateSc:[13/03/2008 TO 16/03/2008]

While indexing, I divide the number by 3:

9840836588%3 = 26588

creating a folder in the following format:

"/20080301-20080316/26588"

I index and store the records in that folder, so while searching I get the
modulo and search the records only in that folder.

Is this a good way of indexing?
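If the idea is a fixed bucketing function, it could be sketched like this (the bucket count below is hypothetical; the message's own bucket function isn't fully clear, so this shows the general shape only):

```java
class IndexPartitioner {
    // Hypothetical number of sub-index buckets per date-range folder.
    static final long BUCKETS = 100000L;

    // Map a record's unique number to the folder holding its sub-index;
    // search then opens only this folder instead of the whole store.
    static String folderFor(String dateRange, long number) {
        long bucket = number % BUCKETS;
        return "/" + dateRange + "/" + bucket;
    }
}
```

The same function must be applied at index time and at search time, or lookups will land in the wrong folder.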
 
   

Sebastin wrote:
> 
> Hi All,
>I am going to create a Lucene Index Store of Size 300 GB per
> month.I read Lucene Index Performance tips in wiki.can anyone suggest what
> are all the steps need to be followed while dealing with big Indexes.My
> Index Store gets updated every second.I used to search 15 days records
> approximately 150 GB records,in a time.Does anyone give me a clue,what
> have to set JVM for both Index and Search to avoid Out of memory error and
> how can i create Index store for large Indexes?
> 





