RE: WildCardQuery

2004-10-01 Thread Stephane James Vaucher
Can you be a little more precise about how you process your documents?

1) What's your analyser? SimpleAnalyzer?
2) How do you parse the query? Out-of-the-box QueryParser?

 Can we not enter a space or do an OR search with two words, one of which
 has a wildcard?

Simple answer, yes.

Complicated answer: words are delimited by your tokeniser, which is part
of your analyser (hence my question above). The asterisk syntax comes
from using a query parser that transforms the query into a PrefixQuery
object.

sv

On Fri, 1 Oct 2004, Robinson Raju wrote:
 Hi,
 Would there be a problem if one enters a space while using wildcards?
 Say I search for 'abc'. I get 100 hits as results.
 'man' gives 200.
 'abc man' gives 300.
 But
 'ab* man'
 'abc ma*'
 'ab* ma*'
 'ab* OR ma*'
 ..
 all of these return 0 results.
 Can we not enter a space or do an OR search with two words, one of which
 has a wildcard?

 Regards,
 Robin




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



IndexHTML parser + Constructor

2004-10-01 Thread Karthik N S


Hi,

Apologies.

Can somebody please tell me how to add a constructor to
'org.apache.lucene.demo.html.HtmlParser.java' that reads a String
argument, strips the HTML tags, and gives back the String without tags?
Currently 'org.apache.lucene.demo.html.HtmlParser.java' accepts the
full path of a file and then reads its content to strip the tags.

Thanks in advance,
Karthik


-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Saturday, September 25, 2004 12:47 AM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?


On Friday 24 September 2004 19:58, Fred Toth wrote:

 I've got unicode in my source HTML. In particular, within meta tags,
 and it's getting broken by the indexer. Note that I'm not trying to
 query on any of this, just store and retrieve document titles with
 unicode characters.

Please try again with the code from CVS, Christoph Goller committed a fix
for this problem (at least I think it was this problem) 1-3 weeks ago.

Regards
 Daniel

--
http://www.danielnaber.de




Re: WildCardQuery

2004-10-01 Thread Robinson Raju
The analyzer is StandardAnalyzer.
I use MultiFieldQueryParser to parse.

The flow is this: I have indexed a database view and now need to search
against a few of its columns. I take the search criteria and the search
field, construct a wildcard query, and add it to a boolean query:

// the whole search string goes into a single wildcard term
WildcardQuery wQuery = new WildcardQuery(new Term(searchFields[0],
searchString));
// required = true, prohibited = false
booleanQuery.add(wQuery, true, false);
Query queryFilter = MultiFieldQueryParser.parse(filterString,
filterFields, flags, analyzer);
hits = parallelMultiSearcher.search(booleanQuery, queryFilter);

When I don't use wildcards, the query is taken as
+((ITM_SHRT_DSC:natal ITM_SHRT_DSC:tylenol) (ITM_LONG_DSC:natal
ITM_LONG_DSC:tylenol))
But when a wildcard is used, it is taken as
+ITM_SHRT_DSC:nat* tylenol +ITM_LONG_DSC:nat* Tylenol

The first returns around 300 records; the second returns 0.

Any help would be appreciated.
Thanks
Robin
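
One possible reading of the zero hits, offered as a sketch and not as a confirmed diagnosis: a wildcard pattern is matched against individual indexed terms, and StandardAnalyzer breaks text into separate words, so a pattern that still contains a space ("nat* tylenol") has no single term it could match. The plain-Java example below uses no Lucene classes; WildcardWords, its methods, and the sample term list are hypothetical stand-ins. It splits the search string into one pattern per word, which is what each wildcard clause would need to receive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: it only imitates term-level wildcard matching.
final class WildcardWords {

    // Translate a wildcard pattern ('*' = any run of characters, '?' = one character) into a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True if the pattern matches at least one whole indexed term.
    static boolean matchesAnyTerm(String wildcard, List<String> indexedTerms) {
        Pattern p = toRegex(wildcard);
        for (String term : indexedTerms) {
            if (p.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    // Split the raw search string into one lower-cased pattern per word.
    // Lower-casing is an assumption here: it mirrors what the analyzer did to the indexed text.
    static List<String> splitIntoWordPatterns(String searchString) {
        List<String> words = new ArrayList<>();
        for (String w : searchString.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("natal", "tylenol");
        System.out.println(matchesAnyTerm("nat* tylenol", terms));
        for (String word : splitIntoWordPatterns("nat* Tylenol")) {
            System.out.println(word + " -> " + matchesAnyTerm(word, terms));
        }
    }
}
```

Each per-word pattern could then be wrapped in its own wildcard query and added to the boolean query as a separate clause, instead of passing the whole string as one term.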



Re: BooleanQuery - Too Many Clauses on date range.

2004-10-01 Thread Scott Ganyo
You can use:
BooleanQuery.setMaxClauseCount(int maxClauseCount);
to increase the limit.
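
A rough way to judge how far the limit has to be raised, sketched under two assumptions that are not stated in this thread: that the range expands into one clause per distinct indexed term inside it, and that dates are indexed at day granularity (one term per calendar day). Under those assumptions the clause count is bounded by the number of days in the range. ClauseEstimate and its methods are hypothetical names.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: estimates an upper bound on clauses for a day-granularity date range.
final class ClauseEstimate {

    // The cap of 1024 is the figure quoted in the original question.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Upper bound on distinct day terms between two dates, both ends inclusive.
    static long maxDayTerms(LocalDate from, LocalDate to) {
        return ChronoUnit.DAYS.between(from, to) + 1;
    }

    // Whether a range of this size could stay within the default cap.
    static boolean fitsDefaultLimit(LocalDate from, LocalDate to) {
        return maxDayTerms(from, to) <= DEFAULT_MAX_CLAUSES;
    }

    public static void main(String[] args) {
        LocalDate from = LocalDate.of(2000, 1, 1);
        LocalDate to = LocalDate.of(2004, 10, 1);
        System.out.println(maxDayTerms(from, to) + " day terms at most, fits default limit: "
                + fitsDefaultLimit(from, to));
    }
}
```

With day granularity, a range of under three years stays below 1024 terms; anything longer would need a higher limit or a different indexing scheme.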

On Sep 30, 2004, at 8:24 PM, Chris Fraschetti wrote:
 I recently read, in regard to my problem, that date_field:[0820483200
 TO 110448]
 is evaluated into a series of boolean queries, which has a cap of
 1024. Considering that my documents will have dates spanning many
 years, and I need the granularity of 'by day' searching, are there
 any recommendations on how to make this work?

 Currently with query: +content_field:sometext +date_field:[0820483200
 TO 110448]
 I get the following exception:
 org.apache.lucene.search.BooleanQuery$TooManyClauses

 Any suggestions on how I can still keep the granularity of by day, but
 without limiting my search results? Are there any date formats that I
 can change those numbers to that would allow me to complete the search
 (i.e. Feb 15, 2004)? Can Lucene's range do a proper search on
 formatted dates?
 Is there a combination of RangeQuery and Query/MultiTermQuery that I
 can use?

 Your help is greatly appreciated.
 --
 Chris Fraschetti
 e [EMAIL PROTECTED]


Re: BooleanQuery - Too Many Clauses on date range.

2004-10-01 Thread Damian Gajda
On Fri, 01-10-2004 at 07:57 -0500, Scott Ganyo wrote:
 You can use:
 
 BooleanQuery.setMaxClauseCount(int maxClauseCount);

I had a similar problem with date ranges. Someone on the list suggested
a solution to me that was cleverer than the one above, which helps but
makes searches slower and memory hungry (many terms are loaded into
memory and then searched).

The suggested solution was to split dates into sub-fields during
indexing and to use those fields while searching. This is more
efficient, but it makes queries harder to create (personally I prefer
working with queries built using the Lucene API to ones parsed by
QueryParser).

For instance the time stamp 2004-10-01 15:34:26.001 may be split into
following fields:
some-date_year: 2004
some-date_month: 10
some-date_day: 01
some-date_time: 153426001

The above fields should be indexed so that they can be searched. They
open up some nice possibilities, for instance fast and easy querying
for all documents that have a date in a particular year, month or day
of month. For convenience one could also store weekdays.

A query for a date range from 15th August to 10th October 2004 (in no
particular query language - this just gives an idea):
some-date_year = 2004 AND (
   (some-date_month = 08 AND some-date_day >= 15) OR
   (some-date_month = 09) OR
   (some-date_month = 10 AND some-date_day <= 10)
)

As you can see, it is easy to build such a query with the Lucene API.
The equalities are Term queries. The inequalities are Range queries. The
AND and OR operators can be provided by Boolean queries.

Have fun implementing the solution. It has only one disadvantage: it
makes sorting the results less easy. The way around that is to use
multiple sort fields, or another stored field containing the full date
(one will almost surely need to store a date for each hit anyway,
unless you want to write some baroque code to calculate the date from
the split field values).

Have fun,
-- 
Damian Gajda
Caltha Sp. j.
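
The split-field idea described above can be sketched in plain Java, without any Lucene classes. SplitDateFields and its methods are hypothetical names; the field names and the sample timestamp follow the message. The output of rangeWithinYear is the same informal notation used in the message, not a real query syntax. Values are zero-padded on the assumption that range comparisons on terms are lexicographic.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: splits a timestamp into sub-fields and describes a date range over them.
final class SplitDateFields {

    private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("HHmmssSSS");

    // Produce the four sub-field values for one timestamp, zero-padded to fixed width.
    static Map<String, String> split(String prefix, LocalDateTime ts) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(prefix + "_year", String.format(Locale.ROOT, "%04d", ts.getYear()));
        fields.put(prefix + "_month", two(ts.getMonthValue()));
        fields.put(prefix + "_day", two(ts.getDayOfMonth()));
        fields.put(prefix + "_time", ts.format(TIME));
        return fields;
    }

    // Describe an inclusive date range that lies within a single year.
    static String rangeWithinYear(String prefix, LocalDate from, LocalDate to) {
        if (from.getYear() != to.getYear() || from.isAfter(to)) {
            throw new IllegalArgumentException("expected from <= to within one year");
        }
        String day = prefix + "_day";
        int first = from.getMonthValue();
        int last = to.getMonthValue();
        List<String> parts = new ArrayList<>();
        if (first == last) {
            // Both ends fall in the same month: one clause with two day bounds.
            parts.add("(" + month(prefix, first) + " AND " + day + " >= " + two(from.getDayOfMonth())
                    + " AND " + day + " <= " + two(to.getDayOfMonth()) + ")");
        } else {
            // Partial first month, whole months in between, partial last month.
            parts.add("(" + month(prefix, first) + " AND " + day + " >= " + two(from.getDayOfMonth()) + ")");
            for (int m = first + 1; m < last; m++) {
                parts.add("(" + month(prefix, m) + ")");
            }
            parts.add("(" + month(prefix, last) + " AND " + day + " <= " + two(to.getDayOfMonth()) + ")");
        }
        return prefix + "_year = " + String.format(Locale.ROOT, "%04d", from.getYear())
                + " AND (" + String.join(" OR ", parts) + ")";
    }

    private static String month(String prefix, int m) {
        return prefix + "_month = " + two(m);
    }

    private static String two(int n) {
        return String.format(Locale.ROOT, "%02d", n);
    }

    public static void main(String[] args) {
        System.out.println(split("some-date", LocalDateTime.of(2004, 10, 1, 15, 34, 26, 1_000_000)));
        System.out.println(rangeWithinYear("some-date", LocalDate.of(2004, 8, 15), LocalDate.of(2004, 10, 10)));
    }
}
```

A range that crosses a year boundary would need the same treatment one level up (partial first year, whole years in between, partial last year); the sketch leaves that out.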





removing duplicate Documents from Hits

2004-10-01 Thread Timm, Andy (ETW)
Hello, I've searched previous posts on this topic but couldn't find an
answer. I want to query my index (which is a number of 'flattened'
Oracle tables) for some criteria, then return Hits such that no two
Documents duplicate a particular field. In the case where table A has a
one-to-many relationship to table B, I get one Document for each
(A1-B1, A1-B2, A1-B3...). My index needs to have each of these records,
as 'B' is a searchable field in the index. However, after the query is
executed, I want my resulting Hits to be unique on 'A'. I'm only
returning the Oracle object ID, so once I've seen it once I don't need
it again. It looks like some sort of custom Filter is in order.

My fix at the moment is to run the query, then store unique ids in a
Map to build another query that will return singletons on field 'A'. I
could skip this step if there were a way to remove documents from Hits
(I didn't see one). Has anyone written a filter that does this? Are
there others using Lucene to mimic a relational DB? I've got a complex
SQL search that joins (mostly outer joins) some 40 tables. Query
performance is important, and the tables are relatively static. I find
the IDs of the objects that match the users' criteria, then go to the
DB to instantiate them. Any comments are appreciated.





Re: removing duplicate Documents from Hits

2004-10-01 Thread Doug Cutting
Timm, Andy (ETW) wrote:
 Hello, I've searched previous posts on this topic but couldn't find an
 answer. I want to query my index (which is a number of 'flattened'
 Oracle tables) for some criteria, then return Hits such that no two
 Documents duplicate a particular field. In the case where table A has a
 one-to-many relationship to table B, I get one Document for each
 (A1-B1, A1-B2, A1-B3...). My index needs to have each of these records,
 as 'B' is a searchable field in the index. However, after the query is
 executed, I want my resulting Hits to be unique on 'A'. I'm only
 returning the Oracle object ID, so once I've seen it once I don't need
 it again. It looks like some sort of custom Filter is in order.

I'd suggest a HitCollector that uses a FieldCache of the A values to
check for duplicates, and collects only the best document id for each
value of A.  This would use a bit of RAM, but be very fast.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/HitCollector.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/FieldCache.html
Doug
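
The bookkeeping behind that suggestion can be sketched in plain Java. This is an illustration, not Lucene code: BestPerValueCollector is a hypothetical class whose collect(int doc, float score) method only mirrors the shape of the collector callback, and the String array passed to the constructor stands in for the per-document A values that the field cache would supply.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a collector that keeps one best-scoring document per value of A.
final class BestPerValueCollector {

    // One remembered hit: document number and its score.
    private static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // valueByDoc[doc] is the A value of that document (the role a field cache would play).
    private final String[] valueByDoc;
    private final Map<String, Hit> bestByValue = new HashMap<>();

    BestPerValueCollector(String[] valueByDoc) {
        this.valueByDoc = valueByDoc;
    }

    // Called once per matching document; keeps the hit only if it beats the current best for its A value.
    void collect(int doc, float score) {
        String value = valueByDoc[doc];
        Hit best = bestByValue.get(value);
        if (best == null || score > best.score) {
            bestByValue.put(value, new Hit(doc, score));
        }
    }

    // Document numbers of the surviving hits, highest score first.
    List<Integer> bestDocs() {
        List<Hit> hits = new ArrayList<>(bestByValue.values());
        hits.sort((x, y) -> Float.compare(y.score, x.score));
        List<Integer> docs = new ArrayList<>();
        for (Hit h : hits) {
            docs.add(h.doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        BestPerValueCollector c = new BestPerValueCollector(new String[] {"A1", "A1", "A2", "A1"});
        c.collect(0, 0.5f);
        c.collect(1, 0.9f);
        c.collect(2, 0.3f);
        c.collect(3, 0.1f);
        System.out.println(c.bestDocs());
    }
}
```

The memory cost is one array entry per document plus one map entry per distinct A value, which matches the "bit of RAM" trade-off mentioned above.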


new release: 1.4.2

2004-10-01 Thread Doug Cutting
There's a new release of Lucene, 1.4.2, which mostly fixes bugs in 
1.4.1.  Details are at http://jakarta.apache.org/lucene/.

Doug


Memory leak in ParallelMultiSearcher?

2004-10-01 Thread Edwin Tang
Hello,
 
I ran across this post (http://java2.5341.com/msg/77213.html) in the mailing
list archives, and am wondering whether anyone has any updates on this?

Thanks,
Ed




multiple threads

2004-10-01 Thread Justin Swanhart
As I understand it, if two writers try to access the same index for
writing, then one of the writers should block waiting for a lock until
the lock timeout period expires, and then fail with a Lock wait
timeout exception.

I have a multithreaded indexing application that writes into one of
multiple indexes depending on a hash value, and I intend to merge all
of the per-hash indexes when the indexing finishes.  Locking usually
works, but sometimes it doesn't and I get IO exceptions such as the
following:

java.io.IOException: Cannot delete _19.fnm
    at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:198)
    at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:157)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:100)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:389)
    at org.en.global.indexer.IndexGroup.run(IndexGroup.java:387)


Any idea on why this could be happening?  I am using NFS currently,
but the problem appears on the local filesystem as well.




Question regarding using Lucene or not

2004-10-01 Thread AmitShukla
Hello,
I have a stand-alone Java application. We have a new requirement where
there will be around 1000 data files in XML format, all with the same
format. Nodes will have values and attributes. In the application, the
user will search for a particular spec (the data file) by defining
parameters. The parameters are both string and numeric. For example, the
model should be Cargo and its HP value should be 55,000 or near it. If
we specify a tolerance value of 5000, then it should search for all the
data files where the model node is Cargo (definitive match) and the HP
value is between 50,000 and 60,000, with the one having 55,000 coming
out as the 100% match.
Do you think Lucene can meet this requirement, or do I need to look into
another product?

Please let me know.

Thanks.
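
The requirement above can be written down as a small plain-Java sketch, independent of any search library. The inclusive range check follows the example in the message; the linear fall-off from 100% at the target value to 0% at the edge of the tolerance is an assumption made here for illustration, and ToleranceMatch is a hypothetical name.

```java
// Hypothetical helper: expresses the "exact model, HP within tolerance" requirement.
final class ToleranceMatch {

    // True if value lies within target +/- tolerance, both ends inclusive.
    static boolean withinTolerance(double value, double target, double tolerance) {
        return Math.abs(value - target) <= tolerance;
    }

    // 1.0 at the target, falling linearly to 0.0 at the tolerance edge (assumed scoring rule).
    static double closeness(double value, double target, double tolerance) {
        if (tolerance <= 0) {
            return value == target ? 1.0 : 0.0;
        }
        return Math.max(0.0, 1.0 - Math.abs(value - target) / tolerance);
    }

    // A spec matches only if the model is identical and the HP value is in range.
    static boolean matches(String model, double hp, String wantedModel, double wantedHp, double tolerance) {
        return wantedModel.equals(model) && withinTolerance(hp, wantedHp, tolerance);
    }

    public static void main(String[] args) {
        System.out.println(matches("Cargo", 57500, "Cargo", 55000, 5000));
        System.out.println(closeness(57500, 55000, 5000));
    }
}
```

Any search engine used for this would need to provide the equivalent of the range check for filtering and some way to rank by distance from the target value.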