Hi Pooja,
poojasreejith wrote:
I am using Lucene 2.2.0 for my application. I have a Searcher.java class.
The problem I am facing is that it does not support
Query query = QueryParser.parse(q, contents, new StandardAnalyzer()); it
shows the error: the method parse in the type QueryParser is not
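That static parse method was removed in Lucene 2.x; a minimal sketch of the instance-based replacement (the field name "contents" and variable q are taken from the snippet above):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Construct a QueryParser instance instead of calling the removed
// static parse(); the default field is given in the constructor.
QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
Query query = parser.parse(q);  // throws ParseException on bad syntax
```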
Hi Liaqat,
Liaqat Ali wrote:
I want to index the Urdu language corpus (200 documents in CES XML DTD
format). Is it necessary to break the XML file into 200 different files,
or can it be indexed in its original form using Lucene? Kindly guide in
this regard.
A Lucene document is composed of
Hi Ivan,
Ivan Vasilev wrote:
But how to understand the meaning of this: “To overcome this, you
have to index chinese characters as single tokens (this will increase
recall, but decrease precision).”
I understand it as follows: to increase the results I have to use, instead
of the Chinese, another
Hi Durga,
I have moved this discussion to the java-user list, since the java-dev
list is devoted to development of the Java Lucene library, and not to
questions about its capabilities. My answers are inline below.
[EMAIL PROTECTED] wrote:
1) What are the various languages supported by
poeta simbolista wrote:
I'd like to know the best way to look for strange encodings in a Lucene
index.
I have several inputs which can have been encoded in different character
sets. I don't always know whether my guess about the encoding was right.
Hence, I'd thought of querying the index for some
Hi Madhu,
Madhu wrote:
I am indexing PDF documents using PDFBox 7.4; it's working fine for some PDF
files. For Japanese PDF files it's giving the exception below:
caught a class java.io.IOException
with message: Unknown encoding for 'UniJIS-UCS2-H'
Can anyone help me with how to set the
Mike wrote:
I've searched the mailing list archives, the web, read the FAQ, etc and I
don't see anything relevant so here it goes…
I'm trying to implement a radius based searching based on zip/postal codes.
Here is a selection of interesting threads from the Lucene ML with
relevant info:
Hi Cedric,
Cedric Ho wrote:
On 8/13/07, Erick Erickson [EMAIL PROTECTED] wrote:
Are you iterating through a Hits object that has more than
100 (maybe it's 200 now) entries? Are you loading each document that
satisfies the query? Etc. Etc.
Unfortunately, yes. And I know this is another big
Hi Antonello,
Antonello Provenzano wrote:
I've been working for a while on the implementation of a website
oriented to contents that would contain millions of entries, most of
them indexable (such as descriptions, texts, names, etc.).
The ideal solution to make them searchable would be to use
qaz zaq wrote:
I have search terms T1, T2, ... Tn. Also I have document fields F1, F2, ... Fm.
I want to search for the matching documents across fields F1 to Fm, i.e., all
of T1, T2, ... Tn need to be matched, but the matches may fall in any
combination of the fields.
I check the
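One common way to express "every term must match in at least one field" is a BooleanQuery of per-term disjunctions; a sketch with hypothetical field and term names:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Every term is required (MUST), but each requirement can be satisfied
// in any one of the fields (SHOULD within the nested clause).
String[] fields = {"F1", "F2", "F3"};
String[] terms = {"t1", "t2"};
BooleanQuery outer = new BooleanQuery();
for (int i = 0; i < terms.length; i++) {
    BooleanQuery perTerm = new BooleanQuery();
    for (int j = 0; j < fields.length; j++) {
        perTerm.add(new TermQuery(new Term(fields[j], terms[i])),
                    BooleanClause.Occur.SHOULD);
    }
    outer.add(perTerm, BooleanClause.Occur.MUST);
}
```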
Hi Ed,
Ed Murray wrote:
Could
someone let me know the best Analyzer to use to get an exact match on a Unix
filename when it is inserted into an untokenized field.
Filenames
obviously contain spaces and forward slashes along with other characters. I
am using
a WhitespaceAnalyzer but when
Hi Aliaksandr,
Aliaksandr Radzivanovich wrote:
What if I need to search for synonyms, but synonyms can be expanded to
phrases of several words?
For example, user enters query tcp, then my application should also
find documents containing phrase Transmission Control Protocol. And
conversely,
[EMAIL PROTECTED]:
Hi Steven.
When I access this address, this message appeared:
Forbidden
You don't have permission to access /servlets/ProjectHome on this server.
What's the problem?
Thanks.
Steven Rowe wrote:
Mahdi Rahimi wrote:
Hi.
How can I access JavaCC??
Thanks
Hi Rob,
Robert Walpole wrote:
At the moment I am attempting to do this as follows...
analyzer = new PorterStemAnalyzer();
parser = new QueryParser(content, analyzer);
Query query = parser.parse("keywords:relaxing");
Hits hits = idxSearcher.search(query);
...but this is not returning any
Mahdi Rahimi wrote:
Hi.
How can I access JavaCC??
Thanks
https://javacc.dev.java.net/
--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
Hi Sawan,
Sawan Sharma wrote:
Now, the problem occurred when I passed multiple words in the term query,
e.g.:
QueryFilter filter = new QueryFilter(new TermQuery(new Term(FieldName,
FieldValue)));
where the field name and field value are obtained dynamically.
Here we take an example value.
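A TermQuery matches exactly one indexed token, so a multi-word value will never match against a tokenized field. A hedged sketch using a PhraseQuery instead (the field name and words are illustrative placeholders):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.QueryFilter;

// One Term per token, in order; the index-time analyzer must have
// produced the same tokens for this to match.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("FieldName", "first"));
phrase.add(new Term("FieldName", "second"));
QueryFilter filter = new QueryFilter(phrase);
```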
Hi Sebastin,
Sebastin wrote:
I index my documents using SimpleAnalyzer(). When I search the indexed
field in the searcher class it doesn't give me the results. Help me to
sort out this issue.
My Code:
test=9840836598
test1=bch01
testRecords=(test + " " + test1);
Hi Jes,
Jesse Prabawa wrote:
The Lucene FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
mentions that the position of the matches in the text does not affect
scoring. So is there any way that I can make the position of the
matches affect scoring? For example, I want matches that occur at
Hi Daniel,
Daniel Noll wrote:
On Saturday 16 June 2007 11:39:35 Chris Hostetter wrote:
: The mailing list has already answered this question dozens of times.
: I've been wondering lately, does this list have a FAQ? If so, is this
: question on it?
The wiki is open to editing by all.
Daniel Noll wrote:
On Friday 15 June 2007 11:07:25 Antony Sequeira wrote:
Hi
I am aware that with Lucene I can not do negative only queries such as
-foo:bar
The mailing list has already answered this question dozens of times. I've
been wondering lately, does this list have a FAQ? If
Hi Harini,
Harini Raghavan wrote:
I am trying to create a lucene query to search for companies based on
areacode. The phone no. is stored in the lucene index in the form of
'415-567-2323'. I need to create a query like +areaCode:415-. But the
QueryParser is stripping off the hyphen(-).
Bowesman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 06, 2007 11:36 PM
To: java-user@lucene.apache.org
Subject: Re: How can I search over all documents NOT in a certain subset?
Steven Rowe wrote:
Conceptually (caveat: untested), you could:
1. Extend Filter[1] (call it DejaVuFilter
Hi Tim,
Tim Smith wrote:
How can I restore the behavior of the old
WildcardQuery under 2.1?
I badly need 'cat???' to match 'cat' again just like
in the older versions.
The behavior you want was last sighted in Java Lucene four releases ago
(v1.4.3).
See Doug Cutting's response to a similar
Hi Divya,
The Lucene library itself provides no support for backup.
You might be interested in the Solr project[1], which extends Lucene,
and which automates index replication. From the Solr Introduction /
Features page[2]:
Replication
* Efficient distribution of index parts that have
Hi Hilton,
Hilton Campbell wrote:
Hello all,
In my application I want to perform a search over all the documents
that are NOT in a certain subset, and I'm not sure how I should do
it.
Specifically, the application performs a search and the top N results
are shown to the user. The user
Hi Mohammad,
Mohammad Norouzi wrote:
[Hoss wrote:]
...are there Persian characters with a category type of SPACE_SEPARATOR,
LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
How can I know that?
The Unicode standard's codes[1] for these are:
SPACE SEPARATOR: Zs
LINE SEPARATOR: Zl
Hi Michael,
Michael Böckling wrote:
Hi folks!
The topic says it all: I want to modify the StandardAnalyzer so that it also
splits words after punctuation characters (.,: etc.) that are NOT followed
by a whitespace character, in addition to punctuation characters that ARE
followed by
Hi Terry,
The one place I know where KeywordAnalyzer is definitely useful is when
it is used in conjunction with PerFieldAnalyzerWrapper.
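A minimal sketch of that combination (directory path and field names are hypothetical):

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Tokenize everything with StandardAnalyzer, except the "id" field,
// which KeywordAnalyzer keeps as one single token.
PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
wrapper.addAnalyzer("id", new KeywordAnalyzer());
IndexWriter writer = new IndexWriter("/path/to/index", wrapper, true);
```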
Steve
dontspamterry wrote:
Hi Otis,
I tried both ways, did some queries, and results are the same, so I guess
it's a matter of preference???
-Terry
The change I've made is to not ignore
Unicode characters in the Persian and Arabic languages, because with the
original WhitespaceAnalyzer it didn't work fine, whether because it ignores
them or something else, I
don't know; but I extended my classes and now I am using my analyzer to index.
On 5/22/07, Steven Rowe [EMAIL
Hi Mohammad,
May I ask what your language is? And what kind of changes to
WhitespaceAnalyzer were required to make it work with your language?
If you have made modifications to WhitespaceAnalyzer that are generally
useful, please consider contributing your changes back to the Lucene
project.
Mike Klaas wrote:
On 18-May-07, at 1:01 PM, charlie w wrote:
Is there an upper limit on the number of fields comprising a document,
and if so what is it?
There is not. They are relatively costless if omitNorms=False
Mike, I think you meant relatively costless if omitNorms=True.
Steve
Hi Charles,
The need presented by your use case sounds very similar to that served
by the SynonymAnalyzer given in Erik Hatcher's and Otis Gospodnetic's
excellent book Lucene in Action - take a look:
http://lucenebook.com/
Steve
Charles Patridge wrote:
I have looked around on Lucene web
$ whenever you encountered any of the items in your
list, then when concept searching is called for, search on
WildAnimals$.
Highlighting might be tricky, but certainly do-able, especially with
the capabilities of a MemoryIndex..
Erick
On 5/16/07, Steven Rowe [EMAIL PROTECTED] wrote:
Hi
due to requirements and the
fact that we were having memory issues for cases where a parent had an
extremely large number of children (~200,000).
-Terry
Steven Rowe wrote:
Hi Terry,
Why not have another index in which a document has one field for the
parent and another field containing
Krishna Prasad Mekala wrote:
I have to create the index from my Oracle database. Can anybody tell me
how to create the index from Oracle using lucene?
Check out Marcelo Ochoa's Oracle/Lucene integration:
http://issues.apache.org/jira/browse/LUCENE-724
Steve
Karl Wettin's code to facilitate index copying may be useful (the below
link is to a post of Karl's to the java-dev mailing list):
http://www.nabble.com/Resolving-term-vector-even-when-not-stored--t3412160.html
Steve
Erick Erickson wrote:
In the immortal words of Erik H. ...it depends...
Mohammad Norouzi wrote:
Steven,
what does this mean:
Each index added must have the same number of documents, but
typically each contains different fields. Each document contains the
union of the fields of all documents with the same document number.
When searching, matches for a query term are
Hi Mohammad,
Have you looked at MultiSearcher?
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/MultiSearcher.html
Section 5.6 of Lucene in Action covers its use.
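A sketch of searching two indexes at once (the paths are placeholders, and query is whatever Query you have built):

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

// One sub-searcher per index; MultiSearcher merges the results.
Searchable[] searchables = {
    new IndexSearcher("/path/to/index1"),
    new IndexSearcher("/path/to/index2")
};
MultiSearcher searcher = new MultiSearcher(searchables);
Hits hits = searcher.search(query);
```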
Steve
Mohammad Norouzi wrote:
hi
I have two separate indexes but there are some fields that are common
between
to one index,
you need to add the same documents in the same order to the other
indexes. Failure to do so will result in undefined behavior.
-
Steve
Steven Rowe wrote:
Hi Mohammad,
Have you looked at MultiSearcher?
http://lucene.apache.org/java/docs/api/org/apache/lucene/search
I think ParallelReader, first released in Lucene-Java 1.9, should meet
your needs:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/ParallelReader.html
-
An IndexReader which reads multiple, parallel indexes. Each index added
must have the same number of documents, but
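A sketch of how the parallel indexes are stitched together (the paths are placeholders):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.search.IndexSearcher;

// Both indexes must hold the same documents in the same order;
// document N in the combined view is the union of their fields.
ParallelReader reader = new ParallelReader();
reader.add(IndexReader.open("/path/to/common-fields"));
reader.add(IndexReader.open("/path/to/extra-fields"));
IndexSearcher searcher = new IndexSearcher(reader);
```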
Hi Chris,
Chris Lu wrote:
Hi, Steven,
Thanks for the instant reply! But let's see the warning in the
ParallelReader javadoc:
It is up to you to make sure all indexes are created and modified
the same way. For example, if you add documents to one index, you need
to add the same documents
This may help:
http://www.pdfbox.org/userguide/text_extraction.html#Lucene+Integration
ashwin kumar wrote:
Hi all, I am able to convert a PDF into a text file using PDFBox, and this
is the code that I used:
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import
Hi Ruchika,
Are there any quote characters in your index (may the Luke be with
you[1])? If not, you could just remove all quotes from your query
(except the surrounding ones indicating phrase matching, of course), and
things will work, as you have indicated.
Which version of Lucene are you
Hi senthil,
senthil kumaran wrote:
I've indexed 4 of 5 fields with Field.Store.YES, Field.Index.NO, and
indexed the remaining one, say its field name is *content*, with
Field.Store.YES, Field.Index.TOKENIZED (its value is the collective value
of the other 4 fields and some more values). So my
Check out QueryParser.setAllowLeadingWildcard():
http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html#setAllowLeadingWildcard(boolean)
(though AFAICT this feature is not in any released version of Lucene yet
- you'll have to use a nightly build).
poeta simbolista
heikki doeleman wrote:
One question though... is there an easy way to download the sources
from the svn repository in one go? I did it now by right-clicking
links to files
The Source Code section of the Lucene Java Developer Resources page
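Assuming a Subversion client is installed, a single checkout command fetches everything in one go (the URL below follows the standard Apache repository layout of the time):

```shell
svn checkout http://svn.apache.org/repos/asf/lucene/java/trunk lucene-java
```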
Jason wrote:
Hi all,
I have come across what I think is a curious but insidious bug with
the java lucene hit highlighter.
[...]
when I search for - Acquisition Plan -
in my search results I get:
summary (ancillary stuff deleted):
attached to the <em>Acquisition</em>
<em>Plan</em> and
Walt Stoneburner wrote:
Do I have correct and complete understanding of the two operators?
Not entirely complete :) - more information in the October 2006 thread
QueryParser is Badly Broken:
http://www.gossamer-threads.com/lists/lucene/java-user/40945
Hi Sdeck,
sdeck wrote:
The query for collecting a specific actor is around 200-300 milliseconds,
and the movie one, that actually queries each actor, takes roughly 500-700
milliseconds. Yet, for a genre, where you may have 50-100 movies, it takes
500 milliseconds*# of movies
I'm having
/store/RAMDirectory.html?
Hope it helps,
Steve
Steven Rowe wrote:
Hi Sdeck,
sdeck wrote:
The query for collecting a specific actor is around 200-300 milliseconds,
and the movie one, that actually queries each actor, takes roughly
500-700
milliseconds. Yet, for a genre, where you may have
Hi Scott,
sdeck wrote:
I guess, any ideas why I would run out of heap memory by combining all of
those boolean queries together and then running the query? What is happening
in the background that would make that occur? Is it storing something in
memory, like all of the common terms or
Hi Kapil,
Kapil Chhabra wrote:
Hi Steve,
Thanks for the response.
Actually I am not looking for a query language. My question is whether
Lucene supports nested queries or self-joins.
As per
http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html
In BNF, the
Antonio Bruno wrote:
But using the docId directly would make the searches much more efficient
and faster. What are your thoughts on the possibility of applying a
first CachingWrapperFilter F1 on one index and a second
CachingWrapperFilter F2 on another index, and afterwards computing (F1 AND
F2), and to
Hi Adrian,
I don't see anything obviously wrong with your code.
Can you give more details about which field values are different from
what you expect? I'm guessing it's the id field you're worried about,
but it's not clear from what you have written whether it's the title or
the id field which
Karl Koch wrote:
The coord(q,d) normalisation is a score factor based on how many of
the query terms are found in the specified document. and described
here:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_coord
Does this have a theoretical basis? On
Karl Koch wrote:
Is there any other paper that actually shows the benefit of doing
this particular normalisation with coord_q_d? I am not suggesting
here that it is not useful, I am just looking for evidence how the
idea developed.
I think it's a mischaracterization to call coordination a
abdul aleem wrote:
How do I actually retrieve the content of a search?
In most of the examples in Lucene in Action,
the Searcher gives the results found as a number of
documents,
but I couldn't find an API to retrieve the line or
paragraph where the search matched.
Hi Abdul,
I don't know what
Bhavin,
Mark Harwood gives a solution that looks almost exactly like what you want:
http://www.mail-archive.com/java-user@lucene.apache.org/msg05154.html
Steve
Chris Hostetter wrote:
search the archives for faceted searching and category counts and you
should find lots of discussions on
static String QueryParser.escape(String) should do the trick:
http://lucene.apache.org/java/docs/api/org/apache/lucene/queryParser/QueryParser.html#escape(java.lang.String)
Look at the bottom of the below-linked page for the list of characters
that the above method will escape:
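A short sketch of its use (the input string is just an example):

```java
import org.apache.lucene.queryParser.QueryParser;

// Backslash-escapes the query-syntax metacharacters so the parser
// treats them as literal text.
String escaped = QueryParser.escape("(1+1):2");
// escaped can now be handed safely to an instance's parse() method
```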
George Aroush wrote:
From your email, I take it that even for the Java folks, they can't
accumulate the list of files that make up 2.0.1. Am I right?
There has never been and likely will never be a 2.0.1 release.
2.0.1, 2.1 -- these are labels for *potential* future releases.
2.1 is much
Hi Jong,
Jong Kim wrote:
I'm looking for a stemmer that is capable of returning all morphological
variants of a query term (to be used for high-recall search). For example,
given a query term of 'cares', I would like to be able to generate 'cares',
'care', 'cared', and 'caring'.
To
Hi Bill,
Bill Taylor wrote:
On Oct 16, 2006, at 5:44 AM, Christoph Pächter wrote:
I know that I can index pdf-files (using a third-party library).
Could you please tell me where to find this library?
There are several PDF extraction packages listed here (look under the
Lucene Document
Hi Rahil,
Rahil wrote:
I was just wondering whether there is a
difference between the regular expression you sent me i.e.
(i) \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
and
(ii) \\b
as they lead to the same output. For example, the string search testing
a-new string=3/4 results in
Marcus Falck wrote:
Any good approaches for allowing case sensitive and case insensitive
searches?
Except adding an additional field and skipping the LowerCaseFilter.
Since this severely increases the index size (and the index already
is around 1 TB).
Hi Marcus,
How about a filter that
Hi Rahil,
Rahil wrote:
I couldn't figure out a valid regular expression to pass to
Pattern.compile(String regex) which can tokenise the string O/E -
visual acuity R-eye=6/24 into O, /, E, -, visual, acuity,
R, -, eye, =, 6, /, 24.
The following regular expression should match
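For the token list shown above, a plain java.util.regex sketch (not Lucene-specific) that emits word runs and individual punctuation characters would be:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeSketch {
    // A run of word characters, or any single non-space character,
    // becomes one token; whitespace is skipped.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\S");

    public static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [O, /, E, -, visual, acuity, R, -, eye, =, 6, /, 24]
        System.out.println(tokenize("O/E - visual acuity R-eye=6/24"));
    }
}
```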
Vladimir Olenin wrote:
- is there a place I can get already crawled internet web pages in an
archive (10 - 100Gb of data)
I don't know the sizes of the corpora mentioned on the Lucene Wiki's
Resources page, but it's a good place to start:
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_0_0/CHANGES.txt
Otis Gospodnetic wrote:
CHANGES.txt is your best source for that answer.
KEGan [EMAIL PROTECTED] wrote:
What about the internal of Lucene? Are there any major changes in there?
The Resources page on the Lucene Wiki has a collection of articles that
may be useful to you:
http://wiki.apache.org/jakarta-lucene/Resources
Michael McCandless wrote:
Mark Miller wrote:
I'll one up you:
http://www.manning.com/hatcher2/
Might as well save yourself a whole lot of time and
Hi Luis,
Chris Hostetter wrote:
Luis Rodrigo Aguado wrote:
: I've been looking through the documentation in the official
: web-site, and the Javadoc belongs to v2.1, which I could not find
: anywhere; does anyone have a clue about where to find it or when it will
: be officially released?
There has been a long-running thread on the java-dev list about how to
allow application-specific extra stuff to be placed in the index, at
multiple levels of granularity. Some of this conversation is captured
on the Wiki at:
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
Maybe you
As Jason says, you can structure each Lucene document with one Field per
content type, and index all data that way. The database is not required.
To address your search complexity concern, you can create queries that
search only those Field(s) the user wants -- there is no need to have a
Field
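A sketch of that layout (field names and the variables analyzer, descriptionText, nameText, and userInput are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// One field per content type on a single document ...
Document doc = new Document();
doc.add(new Field("description", descriptionText,
                  Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("name", nameText,
                  Field.Store.YES, Field.Index.TOKENIZED));

// ... and a query restricted to whichever field the user chose.
Query query = new QueryParser("description", analyzer).parse(userInput);
```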
Michael J. Prichard wrote:
Hey Otis,
Sure I would love to! Can you ping me at [EMAIL PROTECTED] and
let me know what I need to do? Do I just post it to JIRA?
Thanks,
Michael
Otis Gospodnetic wrote:
A good place for that is JIRA. Could you put it there? We have a
bunch of
Rajan, Renuka wrote:
I am trying to match accented characters with non-accented characters
in French/Spanish and other Western European languages. The use case
is that the users may type letters without accents in error and we
still want to be able to retrieve valid matches. The one idea,
Hugh Ross wrote:
The problem is that the standard analyzer removes the stop word (i.e.
of) before indexing and searching. Is there a
workaround for this?
See my response to a similar question here:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200510.mbox/[EMAIL
PROTECTED]
In
Mordo, Aviran (EXP N-NANNATEK) wrote:
What you are asking is not possible. The whole purpose of the analyzer
is to tokenize the fields, so if you want them to be tokenized don't use
the Keyword fields.
Um, KeywordAnalyzer?
Anton Feldmann wrote:
3) How do I display the sentence before and after the sentence the hit
is in?
You could:
1. Make your Lucene Document be a set of three sentences (before,
searchable, after), which you store, but write a custom Analyzer which
only returns tokens for the searchable
Mufaddal Khumri wrote:
Let's say I do this while indexing:
doc.add(Field.Text("categoryNames", categoryNames));
Now while searching categoryNames, I do a search for digital cameras.
I only want to match the exact phrase "digital cameras" against documents
that have exactly the phrase "digital cameras"
MALCOLM CLARK wrote:
Could you send me the url for HighFreqTerms.java in cvs?
ViewCVS URL:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java
Code and examples for embedding Lucene in HSQLDB and Derby relational
databases:
http://issues.apache.org/jira/browse/LUCENE-434
Rick Hillegas wrote:
Thanks to Yonik for replying to my last question about queries and filters.
Now I have another issue. I would appreciate any pointers to
Hi Bob,
StandardAnalyzer filters the token stream created by StandardTokenizer
through StandardFilter, LowercaseFilter, and then StopFilter. Unless
you supply a stoplist to the StandardAnalyzer constructor, you get the
default set of English stopwords, from StopAnalyzer:
public static
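So, a sketch of supplying your own stoplist (the words below are arbitrary examples):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// An explicit stoplist replaces the default English stopwords.
String[] myStopWords = {"the", "of", "and"};
StandardAnalyzer analyzer = new StandardAnalyzer(myStopWords);
```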
There is a proposal to extend indexing (item #11 in the API Changes
section):
http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard
An excerpt:
11. (Hard) Make indexing more flexible, so that one could
e.g., not store positions or even frequencies, or alternately,
to store extra