Re: What kind of System Resources are required to index 625 million row table...???

2011-08-18 Thread Erick Erickson

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Erick Erickson
sort is interesting.  We must "retain" the granularity > of the "original" timestamp for Index maintenance purposes, > but we could add another field, with a granularity of > "date" instead of "date+time", which would be used for > sorting only.

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Erick Erickson
About your OOM. Grant asked a question that's pretty important, how many unique terms in the field(s) you sorted on? At a guess, you tried sorting on your timestamp and your timestamp has millisecond or less granularity, so there are 625M of them. Memory requirements for sorting grow as the number
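The remedy discussed in the neighboring thread, adding a day-granularity copy of the timestamp used only for sorting, can be sketched in plain Java (the timestamp format and field usage here are assumptions for illustration, not code from the thread):

```java
public class SortFieldGranularity {
    // Truncate "yyyy-MM-dd HH:mm:ss.SSS" to "yyyy-MM-dd": every document in
    // the same calendar day then shares one sort term, so the sort cache holds
    // thousands of unique values instead of ~625M millisecond timestamps.
    static String toDayGranularity(String timestamp) {
        return timestamp.substring(0, 10);
    }

    public static void main(String[] args) {
        System.out.println(toDayGranularity("2011-08-16 14:23:11.042")); // 2011-08-16
    }
}
```

The original millisecond field stays in the index for maintenance; only the truncated field is sorted on.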

Re: Store the documents content in the index

2011-07-18 Thread Erick Erickson
It's certainly possible as others have said, but don't be surprised if it's not performant. At root, you still have a disk out there that's being used for fetching the data. Simply moving it from fetching individual files to fetching that data from the index doesn't change that fundamental fact. B

Re: Questions on index Writer

2011-07-15 Thread Erick Erickson
Index files should not be disappearing unless you're using the form of opening an indexwriter that creates a new index. We'd need to see the code you use to open the IW to provide more help. If all you're doing is looking at the index directory, segments will disappear as they are merged so that'

Re: jigar query please help me

2011-07-15 Thread Erick Erickson
You can ignore the warning. But you haven't told us a thing about *how* the failure occurs or what gets reported. What exactly are you doing? What exactly fails (i.e. do you just not find files? Get a stack trace? Get a "class not found error"?) We really cannot help at all without more informatio

Re: Large indexes

2011-07-08 Thread Erick Erickson
Simply breaking up your index into separate pieces on the same machine buys you nothing, in fact it costs you considerably. Have you put a profiler on the system to see what's happening? I expect you're swapping all over the place and are memory-constrained. Have you considered sharding your index

Re: questions about fieldCache

2011-06-21 Thread Erick Erickson
adOnlyDirectoryReader Entries from fieldCache. > Only sorting produces now an entry. > > So action that starts a new searcher and closes the old one (like > replication) > should release cache from fieldCache through garbage collection? > > Regards > Bernd > > Am 21.06.

Re: questions about fieldCache

2011-06-21 Thread Erick Erickson
Hmmm, I'm not going to even try to talk about the code itself, but I will add a couple of clarifications: Jetty has nothing to do with it. It's in Lucene, and it's used for sorting and sometimes faceting. The cache is associated with a reader on a machine used to search. When replication happens,

Re: anyway to store value as bytes?

2011-06-21 Thread Erick Erickson
Does this help? http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/util/IndexableBinaryStringTools.html If not, here's a note from Ryan McKinley on another thread (googling lucene storing binary data brought it up)... ** You can store binary data using a binary field type -- then y

Re: How to deal with not analyzed fields and analyzed ones in the same query

2011-06-20 Thread Erick Erickson
See PerFieldAnalyzerWrapper, then form your query like field1:word1 OR field2:word1 Best Erick On Mon, Jun 20, 2011 at 10:40 AM, G.Long wrote: > Hi :) > > I know it is possible to create a query on different fields with different > analyzers with PerFieldAnalyzer class but is it possible to also

Re: looks like no allowing of paging without counting entire result set?

2011-06-20 Thread Erick Erickson
a plain indexing library for those typical rdbms indexing use-cases that > you have. > > Dean

Re: looks like no allowing of paging without counting entire result set?

2011-06-20 Thread Erick Erickson
re: 20020101 to the end of time.. Use a clause like [2002-01-01 TO *] About paging... Yes, you have to start all over again for each search. The basic problem is that you have to score every document each search, the last document scored might be the highest-scoring document. But let's back up a
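Why deep paging re-scores everything: a search keeps only a bounded top-k heap while scanning every candidate, because the last document scanned might still be the best hit. A toy top-k collector (an illustration of the principle, not Lucene's actual TopDocs machinery):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TopKCollector {
    // Scan every (docId, score) pair, keeping the k best in a min-heap.
    // You cannot stop early: page N means collecting the top N*pageSize
    // and discarding the earlier pages, which is why deep paging costs more.
    static List<Integer> topK(float[] scores, int k) {
        PriorityQueue<Integer> heap =
                new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
        for (int doc = 0; doc < scores.length; doc++) {
            heap.offer(doc);
            if (heap.size() > k) heap.poll(); // evict the current worst
        }
        List<Integer> out = new ArrayList<>(heap);
        out.sort((a, b) -> Float.compare(scores[b], scores[a])); // best first
        return out;
    }

    public static void main(String[] args) {
        System.out.println(topK(new float[] {1f, 5f, 3f, 2f}, 2)); // [1, 2]
    }
}
```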

Re: getting OutOfMemoryError

2011-06-17 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists You've given us no information to go on here, what are you trying to do when this happens? What have you tried? What is the query you're running when this happens? How much memory are you allocating to the JVM? You're apparently sorting

Re: So many compiling errors in the 3.3 source release when added into Eclipse

2011-06-13 Thread Erick Erickson
How did you do this? Did you execute the "ant eclipse" target first? See the instructions at: http://wiki.apache.org/solr/HowToContribute#Eclipse_.28Galileo.2C_J2EE_version_1.2.2.20100217-2310.2C_but_any_relatively_recent_Eclipse_should_do.29: Best Erick On Sat, Jun 11, 2011 at 2:16 AM, dyzc2010

Re: Slow ness of IndexWriter.close()

2011-06-13 Thread Erick Erickson
My first question is "what are you trying to do at a higher level"? Because asking people to check your code without telling us what you're trying to accomplish makes it difficult to know what to look at. You might review: http://wiki.apache.org/solr/UsingMailingLists That said, at a guess, your

Re: Index size and performance degradation

2011-06-11 Thread Erick Erickson
<<>> Hmmm, then it's pretty hopeless I think. Problem is that anything you say about running on a machine with 2G available memory on a single processor is completely incomparable to running on a machine with 64G of memory available for Lucene and 16 processors. There's really no such thing as an

Re: Updating a document

2011-06-10 Thread Erick Erickson
Well, taking the code all together, what I expect is that you'll have a document after all is done that only has a "DocId" in it. Nowhere do you fetch the document from the index. What is your evidence that you haven't deleted the document? If you haven't reopened your reader after the above, you'

Re: Boosting a document at query time, based on a field value/range

2011-06-09 Thread Erick Erickson
I take it from this that you want documents with values #outside# 20-30 to still be found? In that case you can do something like add a clause like: OR field:[20 TO 30]^10 or similar. Best Erick BTW, is there a reason you decided not to use Solr? In many ways it's easier than straight Lucene...

Re: MultiFieldQueryParser with default AND and stopfilter

2011-06-08 Thread Erick Erickson
to MFQP?  Worth a try. > > > -- > Ian. > > > On Wed, Jun 8, 2011 at 2:38 PM, Erick Erickson > wrote: >> Could you just construct a BooleanQuery with the >> terms against different fields instead of using MFQP? >> e.g. >> >> bq.add(qp.parse("title

Re: MultiFieldQueryParser with default AND and stopfilter

2011-06-08 Thread Erick Erickson
Could you just construct a BooleanQuery with the terms against different fields instead of using MFQP? e.g. bq.add(qp.parse("title:(the AND project)", SHOULD)) bq.add(qp.parse("desc:(the AND project)", SHOULD)) etc...? If your QueryParser was created with a PerFieldAnalyzerWrapper I think you mig
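The per-field expansion Erick describes can also be expressed in Lucene's string query syntax; a hypothetical helper (field names and combining logic are illustrative, not from the thread) that fans one user phrase out across several fields might look like:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PerFieldExpander {
    // Expand one user query into a SHOULD-style clause per field, e.g.
    // fields [title, desc] + "the AND project"
    //   -> title:(the AND project) OR desc:(the AND project)
    static String expand(List<String> fields, String userQuery) {
        return fields.stream()
                .map(f -> f + ":(" + userQuery + ")")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("title", "desc"), "the AND project"));
    }
}
```

The resulting string would then be handed to a query parser configured with the appropriate per-field analyzers.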

Re: Lucene Result

2011-06-08 Thread Erick Erickson
> > But thanks for the reply. > > On Wed, Jun 8, 2011 at 6:14 PM, Erick Erickson wrote: > >> hard to say. You should get a copy of Luke and inspect your index to >> see if what you >> think you put there is actually there. When you added data to your >> index,

Re: Lucene Result

2011-06-08 Thread Erick Erickson
hard to say. You should get a copy of Luke and inspect your index to see if what you think you put there is actually there. When you added data to your index, did you perform a commit? Best Erick On Wed, Jun 8, 2011 at 2:45 AM, Pranav goyal wrote: > There is one field DocId which I am storing as

Re: multiple small indexes or one big index?

2011-06-03 Thread Erick Erickson
ddition I am going to switch to another collector as well. ATM I >> collect the results and then sort them using the std. Collections.sort >> approach... I have to look what Lucene offers and switch to something >> else. >> >> Thanks, >> Alex >> >> O

Re: Federated relevance ranking

2011-06-02 Thread Erick Erickson
e a fairly > complex system, and adding anything Hadoop-related feels like it might > push us over a tipping point into the realm of unwieldy overcomplexity. >  But, this is a hard problem after all, so some amount of complexity is > inevitable. > > On 06/02/2011 07:05 PM, Eric

Re: Federated relevance ranking

2011-06-02 Thread Erick Erickson
As you've found out, raw scores certainly aren't comparable across different indexes #unless# the documents are fairly distributed. You're talking large indexes here, so if the documents are balanced across all your indexes, the results should be pretty comparable. This pre-supposes that the indexe

Re: boosting fields

2011-06-02 Thread Erick Erickson
Have you tried using the explain method on a Searcher and examining the results? Best Erick On Thu, Jun 2, 2011 at 3:51 PM, Clemens Wyss wrote: > I have a minimal unit test in which I add three documents to an index. The > documents have two fields "year" and "descritpion". > doc1(year = "2007"

Re: multiple small indexes or one big index?

2011-06-02 Thread Erick Erickson
me. > > Thanks, > Alex > > On 02.06.2011 13:04, Erick Erickson wrote: >> >> At this size, really consider going to a single index. The lack of >> administrative headaches alone is probably well worth the effort >> >> I almost guarantee that the time yo

Re: multiple small indexes or one big index?

2011-06-02 Thread Erick Erickson
Multi-threaded searching will be next and if that hasn't helped, I will > switch to one big index. > All indexes together are rather small, ~200MB and 50.000 documents. > > -Alex > > On 01.06.2011 23:26, Erick Erickson wrote: >> >> I'd start by putting them

Re: multiple small indexes or one big index?

2011-06-01 Thread Erick Erickson
I'd start by putting them all in one index. There's no penalty in Lucene for having empty fields in a document, unlike an RDBMS. Alternately, if you're opening then closing searchers each time, that's very expensive. Could you open the searchers once and keep them open (all 90 of them)? That alone

Re: QueryParser/StopAnalyzer question

2011-05-28 Thread Erick Erickson
ed case, too. > > This probably leaves me with a single option which is not to use > stopwords at all, allowing me to get the best of the both worlds. Does > anyone have any experience on how much of increased index size > (roughly) can I expect? > > Regards, > Mindau

Re: QueryParser/StopAnalyzer question

2011-05-23 Thread Erick Erickson
Hmmm, somehow I missed this days ago Anyway, the Lucene query parsing process isn't quite Boolean logic. I encourage you to think in terms of "required", "optional", and "prohibited". Both queries are equivalent, to see this try attaching &debugQuery=on to your URL and look at the "parsed que

Re: how to create a range query with string parameters

2011-05-17 Thread Erick Erickson
Actually, there are no results in the range [l220-2 TO l220-10] This is basically a string comparison, and l220-2 > l220-10 so this range would never match. Best Erick On Tue, May 17, 2011 at 1:51 PM, G.Long wrote: > I set the field article to NOT_ANALYZED and I didn't quote the article > valu
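Erick's point is plain lexicographic ordering: as strings, "l220-2" sorts after "l220-10". A common workaround (an illustration, not something from the thread) is to zero-pad the numeric suffix at index time so string order agrees with numeric order:

```java
public class PaddedTerms {
    // Zero-pad the digits after the last '-' so that string comparison
    // agrees with numeric comparison: l220-2 -> l220-0002.
    static String pad(String term, int width) {
        int dash = term.lastIndexOf('-');
        String num = term.substring(dash + 1);
        return term.substring(0, dash + 1)
                + "0".repeat(width - num.length()) + num;
    }

    public static void main(String[] args) {
        // Unpadded: "l220-2" > "l220-10", so [l220-2 TO l220-10] is empty.
        System.out.println("l220-2".compareTo("l220-10") > 0);                 // true
        // Padded terms restore the expected ordering.
        System.out.println(pad("l220-2", 4).compareTo(pad("l220-10", 4)) < 0); // true
    }
}
```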

Re: How do we reverse sort on the docid ?

2011-05-16 Thread Erick Erickson
Ahhh, I probably should have read more carefully! At any rate, I think all you need to do is specify the reverse boolean in the SortField c'tor??? Best Erick On Mon, May 16, 2011 at 8:12 AM, shrinath.m wrote: > > Erick Erickson wrote: >> >> Why do you want to do this? t

Re: How do we reverse sort on the docid ?

2011-05-16 Thread Erick Erickson
Why do you want to do this? the internal doc ids are transient. If you update a document by delete/add, the internal id will now be different. What I'm getting at is that I'd like to be sure the use case here does what you think it will because this smells like an XY problem, see: http://people.apa

Re: Lucene 3.3 in Eclipse

2011-05-16 Thread Erick Erickson
bq: Just curious. How would this version be published if there are missing jar and there are compiling errors? Well, the fact that it has been published probably means that you've missed a step somewhere. There'd have been howls of outrage if something as egregious as this were the case... That

Re: Querying Lucene property for exact value

2011-05-06 Thread Erick Erickson
I'm a bit confused by this: *** With my query, I would like to only return 'patrol' items and nothing else. Is there a way to do this?? My current querying code is below. This returns all items with 'patrol' in it. ** Are you saying that if you're searching on "p

Re: AW: AW: AW: AW: "fuzzy prefix" search

2011-05-06 Thread Erick Erickson
Well, Solr officially uses Lucene, but you'll do disappointingly little Java coding. Which some people think is a plus :). The biggest issue will be making really, really sure that your schema.xml file in Solr reflects your use in the Lucene code Actually, I'd swallow the blue pill and just make t

Re: Higher scoring if term is at the beginning of a field/document

2011-05-04 Thread Erick Erickson
moon > The moon is bright > This is a moon > > i.e. the "leftmost hit" of my search term should be rated highest/best... > > How should I analyze/search my documents to get this search/rating behavior? > >> -Original Message- >> From: Erick

Re: Higher scoring if term is at the beginning of a field/document

2011-05-04 Thread Erick Erickson
What is the problem you're trying to solve? I'm wondering if this is an XY problem. See: http://people.apache.org/~hossman/#xyproblem Best Erick On Wed, May 4, 2011 at 3:16 AM, Clemens Wyss wrote: > Given the I have 3 documents with exactly one field and the fields have the > following contents

Re: AW: AW: AW: "fuzzy prefix" search

2011-05-04 Thread Erick Erickson
Shingles won't do that either, so I suspect you'll have to write a custom tokenizer. Best Erick On Wed, May 4, 2011 at 2:07 AM, Clemens Wyss wrote: > I know this is just an example. > But even the WhitespaceAnalyzer takes the words apart, which I don't want. I > would like the phrases as they a

Re: How to fix the number of searched terms for a field

2011-05-03 Thread Erick Erickson
Why do you want to do this? I'm wondering if this is an XY problem... See: http://people.apache.org/~hossman/#xyproblem Best Erick On Tue, May 3, 2011 at 7:55 AM, harsh srivastava wrote: > Hi All, > > > I want to know any inbuilt method in lucene that can help me to fix the > number of searched

Re: lucene 3.1 contrib packages source code download

2011-04-28 Thread Erick Erickson
Are you sure you need to? They may simply have moved. Which ones are you using? If you tell us maybe we can suggest where to find them. Best Erick On Thu, Apr 28, 2011 at 9:14 AM, Tanuj Jain wrote: > Hi, > Can anyone please tell where I could download *packages > org.apache.lucene.analysis.* *so

Re: Reg: Query behavior

2011-04-26 Thread Erick Erickson
You can also specify a large slop in your phrase (e.g. "arcos biosciences"~500 which will take distance into account when scoring, although it may not be enough to rank the document where you want. Sujit's comment is probably a better place to start. Best Erick On Tue, Apr 26, 2011 at 2:59 PM, Su

Re: Locking Issue with Concurrency

2011-04-20 Thread Erick Erickson
What do you mean by "access"? Are you trying to write to the common index with more than one of your machines? Best Erick On Wed, Apr 20, 2011 at 8:32 AM, Yogesh Dabhi wrote: > > > Three Instance of My application & they access common lucene directory > > > > Instance1 jdk64 ,64 os > > Instance2

Re: Solr 1.4.1: Weird query results

2011-04-20 Thread Erick Erickson
ersion isn't exactly >> the fastest and easiest thing so I'd like to strike out all other >> possibilities before :) >> >> Best regards, >> >>    Erik >> >> >> On 20.04.2011 01:07, Lance Norskog wrote: >>> >>> Lo

Re: Solr 1.4.1: Weird query results

2011-04-19 Thread Erick Erickson
Hmmm, I don't see the problem either. It *sounds* like you don't really have the default search field defined the way you think you do. Did you restart Solr after making that change? I'm assuming that when you say "not created by Solr" you mean that it's created by Lucene. What version of Lucene

Re: German*Filter, Analyzer "cutting" off letters from (french) words...

2011-04-18 Thread Erick Erickson
You can easily string together your own tokenizer and any number of filters to create an analyzer that does exactly what you need. Lucene In Action shows an example for creating your own analyzer by assembling the standard parts Best Erick On Mon, Apr 18, 2011 at 3:08 AM, Clemens Wyss wrote:

Re: Applying a sample data set to lucene

2011-04-14 Thread Erick Erickson
I would not go there first. There are examples out there to, for instance, index Wikipedia but that is, IMO, too complex for just starting to get your feet wet. I think you'd be better off looking at the Lucene demo code and trying to understand/modify that as a starting point, see: http://lucene.

Re: speed of CheckIndex

2011-04-14 Thread Erick Erickson
What information do you need? Could you just ping the stats component and parse the results (basically the info on the admin/stats page). Best Erick On Thu, Apr 14, 2011 at 11:56 AM, jm wrote: > Hi, > > I need to collect some diagnostic info from customer sites, so I would like > to get info on

Re: lucene 3.0.3 | searching problem with *.docx file

2011-04-12 Thread Erick Erickson
You haven't given us anything to go on here, except "it doesn't work". You might review this page: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Tue, Apr 12, 2011 at 9:05 AM, Ranjit Kumar wrote: > Hi, > > I am creating index with help of StandardAnalyzer for *.docx file it's > fine. Bu

Re: German*Filter, Analyzer "cutting" off letters from (french) words...

2011-04-12 Thread Erick Erickson
I don't quite get why the German analyzer would do this, but all the Filters I see are stemmers and I expect they'd reduce the words as you indicate. What version of Lucene are you using? Best Erick On Tue, Apr 12, 2011 at 8:46 AM, Clemens Wyss wrote: > I try to apply German*Filter and or Anal

Re: Help with delimited text

2011-04-06 Thread Erick Erickson
A TermQuery is really dumb. It doesn't do anything at all to the input, it assumes you've done all that up front. Try parsing a query rather than using TermQuery And I suspect you'll have problems with casing, but that's another story Best Erick On Wed, Apr 6, 2011 at 6:33 AM, Mark Wilts

Re: Question about open files

2011-04-06 Thread Erick Erickson
I suspect you're already aware of this, but I've overlooked the obvious so many times I thought I'd mention it... A classic mistake is to assign a reader with reopen and not close the old reader, see: http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/IndexReader.html#reopen() <

Re: OutOfMemoryError with FSDirectory

2011-04-04 Thread Erick Erickson
FSDirectory will, indeed, store the index on disk. However, when *using* that index, lots of stuff happens. Specifically: When indexing, there is a buffer that accumulates documents until it's flushed to disk. Are you indexing? When searching (and this is the more important part), various caches a

Re: How to do Multiple-Cluster Query?

2011-04-01 Thread Erick Erickson
You might consider a multiValued field and a positionIncrementGap longer than the longest tuple. At that point, you can search for phrase queries where the slop is less than the positionIncrementGap. I'm a bit rushed, so if you need more details we can talk later Best Erick 2011/4/1 袁武 [GMa

Re: Performance and index size (rephrased question)

2011-03-31 Thread Erick Erickson
5-10 G indexes are pretty small by Lucene/Solr standards, so given reasonable hardware resources this should be no problem. That said, only measurement will nail this down. But an often-used rule of thumb is that you need to consider some better strategies in the 40G range. CAUTION: you haven't sp

Re: minimum string length for proximity search

2011-03-30 Thread Erick Erickson
Uhhhm, doesn't "term1 term2"~5 work? If not, why not? You might get some use from http://lucene.apache.org/java/2_4_0/queryparsersyntax.html Or if that's not germane, perhaps you can explain your use case. Best Erick On Wed, Mar 30, 2011 at 5:49 PM, Andy Yang wrote: > Is there a minimum string
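The slop syntax Erick points at, `"term1 term2"~5`, matches when the terms sit close together. A toy position check conveys the idea (order-insensitive and simplified; Lucene's real slop is an edit-distance over term positions, so this is only an approximation):

```java
import java.util.Arrays;
import java.util.List;

public class ProximityCheck {
    // True if t1 and t2 both occur with at most `slop` tokens between them,
    // roughly what a sloppy phrase query "t1 t2"~slop would match.
    static boolean within(List<String> doc, String t1, String t2, int slop) {
        for (int i = 0; i < doc.size(); i++)
            for (int j = 0; j < doc.size(); j++)
                if (doc.get(i).equals(t1) && doc.get(j).equals(t2)
                        && Math.abs(i - j) - 1 <= slop)
                    return true;
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("term1", "a", "b", "c", "term2");
        System.out.println(within(doc, "term1", "term2", 5)); // true
        System.out.println(within(doc, "term1", "term2", 2)); // false
    }
}
```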

Re: cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Erick Erickson
You get this in response to doing what? Are you sure you've unpackaged the nightly build and aren't inadvertently getting older jars? Best Erick On Tue, Mar 29, 2011 at 7:21 AM, Patrick Diviacco wrote: > I've downloaded the nightly build of Lucene (TRUNK) and I'm referring to the > following doc

Re: a faster way to addDocument and get the ID just added?

2011-03-29 Thread Erick Erickson
I'm always skeptical of storing the doc IDs since they can change out from underneath you (just delete even a single document and optimize). What is it you're doing with the doc ID that you couldn't do with the guid? If your "guid list" were ordered, I can imagine building filters quite quickly fro

Re: Results: get per field scores ?

2011-03-22 Thread Erick Erickson
ly because it > slows down a lot computations. Is it true ? > > On 22 March 2011 14:29, Erick Erickson wrote: > >> Try Searcher.explain. >> >> Best >> Erick >> >> On Tue, Mar 22, 2011 at 4:34 AM, Patrick Diviacco >> wrote: >> > Is

Re: Results: get per field scores ?

2011-03-22 Thread Erick Erickson
Try Searcher.explain. Best Erick On Tue, Mar 22, 2011 at 4:34 AM, Patrick Diviacco wrote: > Is there a way to display Lucene scores per field instead of the global one > ? > Both my query and my docs have 3 fields. > > I would like to see the scores for each field in the results. Can I ? > > Or

Re: Am I correctly parsing the strings ? Terms or Phrases ?

2011-03-22 Thread Erick Erickson
A good habit to develop is to print out the toString() of the assembled queries, that'll get you going pretty quickly understanding what the query assembly is all about without having to wait for people to respond. But the short form is that phrase queries require all the terms to be adjacent, whi

Re: Performance problems with lazily loaded fields

2011-03-22 Thread Erick Erickson
Don't do that... Let's back up a second and ask why in the world you want to do this, what's the use-case you're satisfying? Because spinning through all the results and getting information from the underlying documents is inherently expensive since, as Sanne says, you're doing disk seeks. Most L

Re: Building a query of single terms...

2011-03-22 Thread Erick Erickson
The easiest way to figure out this kind of thing is to print out the toString() on the queries after they're assembled. I believe you'll find that the difference is that the PhraseQuery would find text like "Term1 Term2 Term3" but not text like "Term1 some stuff Term2 more stuff Term3" whereas Bool
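The distinction Erick draws, adjacency versus mere presence, can be illustrated without Lucene at all (a toy check over a pre-tokenized document, not Lucene's matching code):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PhraseVsBoolean {
    // BooleanQuery-with-MUST semantics: every term occurs somewhere.
    static boolean matchesAll(List<String> doc, List<String> terms) {
        return doc.containsAll(terms);
    }

    // PhraseQuery (slop 0) semantics: the terms occur consecutively, in order.
    static boolean matchesPhrase(List<String> doc, List<String> terms) {
        return Collections.indexOfSubList(doc, terms) >= 0;
    }

    public static void main(String[] args) {
        List<String> adjacent = Arrays.asList("Term1", "Term2", "Term3");
        List<String> scattered =
                Arrays.asList("Term1", "some", "stuff", "Term2", "more", "stuff", "Term3");
        List<String> query = Arrays.asList("Term1", "Term2", "Term3");
        System.out.println(matchesPhrase(adjacent, query));  // true
        System.out.println(matchesPhrase(scattered, query)); // false
        System.out.println(matchesAll(scattered, query));    // true
    }
}
```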

Re: How to normalize Lucene scores... (over all queries)

2011-03-22 Thread Erick Erickson
You can't. If by "normalize" you mean compare the scores between two different queries, it's meaningless. The scores from one query to another are not comparable. If by "normalize" you mean make into a value between 0 and 1, anywhere you have access to raw scores I believe you also have access to
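Rescaling into [0, 1] where you already have the raw scores is straightforward; a minimal sketch (and, as Erick notes, purely cosmetic: it does nothing to make scores from different queries comparable):

```java
import java.util.Arrays;

public class ScoreScaling {
    // Divide every raw score by the maximum so the top hit scores 1.0.
    // Scores from *different* queries remain incomparable afterwards.
    static float[] scaleToUnit(float[] raw) {
        float max = 0f;
        for (float s : raw) max = Math.max(max, s);
        float[] out = new float[raw.length];
        for (int i = 0; i < raw.length; i++)
            out[i] = (max == 0f) ? 0f : raw[i] / max;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(scaleToUnit(new float[] {2.0f, 1.0f, 0.5f})));
        // [1.0, 0.5, 0.25]
    }
}
```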

Re: Issue with disk space on UNIX

2011-03-14 Thread Erick Erickson
This sounds like you're not closing your index searchers and the file system is keeping them around. On the Unix box, does your index space reappear just by restarting the process? Not using reopen correctly is sometimes the culprit, you need something like this (taken from the javadocs). IndexRe

Re: Analyzer enquiry

2011-03-14 Thread Erick Erickson
passed an english stop words set. My > question was if I have to call any other function of the german analyzer for > it to be correct. > > Thank you. > > > Quoting Erick Erickson : > >> I don't understand what you're saying here. If you put a stemmer in the

Re: Analyzer enquiry

2011-03-14 Thread Erick Erickson
4, 2011 at 8:21 AM, Vasiliki Gkouta wrote: > Thanks a lot for your help Erick! About the fields you mentioned: If I don't > use stemmers, except for the constructor argument related to the stop words, > is there anything else that I have to modify? > > Thanks,

Re: Analyzer enquiry

2011-03-13 Thread Erick Erickson
StandardAnalyzer works well for most European languages. The problem will be stemming. Applying stemming via English rules to non-English languages produces...er...interesting results. You can go ahead and create language-specific fields for each language and use StandardAnalyzer with the appropri

Re: matching multi-word terms

2011-03-12 Thread Erick Erickson
This looks like just a phrase query, perhaps with no slop. Term query definitely won't work if you've tokenized a the field, because your terms would be "A" and "B", but not "A B". SpanQueries should also work if you want, there's no reason to subclass anything, just use SpanNearQuery... You can

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Erick Erickson
Solr doesn't do it. There exist various tokenizers/filters that just strip the HTML tags, but there's nothing built into Solr that I know of that understands HTML, HTML-aware operations are outside Solr's purview. Best Erick On Fri, Mar 11, 2011 at 6:50 AM, shrinath.m wrote: > On Fri, Mar 11, 20

Re: Indexing of multilingual labels

2011-03-11 Thread Erick Erickson
It's not so much a matter of problems with indexing/searching as it is with search behavior. The reason these strategies are implemented is that using English stemming, say, on other languages will produce "interesting" results. There's no a-priori reason you can't index multiple languages in the

Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Erick Erickson
What mail client are you using? I also had this problem and it's solved in Gmail by sending the mail as "plain text" rather than "Rich formatting". Best Erick On Fri, Mar 11, 2011 at 4:35 AM, Li Li wrote: > hi >    it seems my mail is judged as spam. >    Technical details of permanent failure:

Re: IndexSearcher Single Instance Bottleneck?

2011-03-10 Thread Erick Erickson
No, Lucene itself shouldn't be doing this, the recommendation is for multiple threads to share a single searcher. I'd first look upstream, are your requests being processed serially? I.e. is there a single thread that's handling requests? Best Erick On Thu, Mar 10, 2011 at 4:25 PM, RobM wrote: >

Re: document object

2011-03-10 Thread Erick Erickson
If you're loading 100,000 documents, you can expect it to be slow. If you're loading 10 documents, it should be quite fast... So how big is hits.length? And what version of Lucene are you using? The Hits object has been deprecated for quite some time I believe. The problem here is that you're

Re: "shared fields"?

2011-03-09 Thread Erick Erickson
How large is (large)? What machines are you intending to run this on? In general, though, don't worry about index size until you actually have some numbers to deal with. Solr generally has resource issues based on the number of #unique# terms in an index. So repeating the same thing in a bunch of

Re: performance issues in multivalued fields

2011-03-07 Thread Erick Erickson
You have to describe in detail what "taking a huge performance hit" means, there's not much to go on here... But in general, adding N elements to a mutli-valued field isn't a problem at all. This bit of code: Document D = searcher.doc(hits[i].doc); is very suspicious. Does your cLucene version h

Re: query with long names

2011-02-16 Thread Erick Erickson
Sure, just use a field that is not analyzed. Perhaps you want to define a new field in your documents like "nameKey" that is analyzed with something like KeywordAnalyzer. See: http://lucene.apache.org/java/3_0_3/api/all/index.html PerFieldAnalyzerWrapper will let you use different analyzers for di

Re: I can't post email to d...@lucene.apache.org maillist

2011-02-15 Thread Erick Erickson
What does "can't post" mean? Bounced as spam? rejected for other reasons? This question came through so obviously you can post something... I found that sending mail as "plain text" kept the spam filter from kicking in. Best Erick On Tue, Feb 15, 2011 at 7:29 AM, Li Li wrote: > hi all >    is

Re: Multi Index Search Query

2011-02-15 Thread Erick Erickson
This is usually something you should not do. Is there any possibility you can combine these indexes into one? Maybe sharded? Because this approach is almost guaranteed to scale poorly. This smells like an XY problem, perhaps you can back up and explain what the higher-level problem you're trying t

Re: About WordNet synonyms search

2011-02-12 Thread Erick Erickson
d content:brown) > fox (content:fox content:trick content:throw content:slyboots content:fuddle > content:fob content:dodger content:discombobulate content:confuse > content:confound content:befuddle content:bedevil) > > 2011/2/13 Erick Erickson > >> At a guess make is a synon

Re: Iterating over all documents in an index

2011-02-12 Thread Erick Erickson
Be aware that when you do a doc.get(), the fields are the *stored* fields in their original, unanalyzed form. Is that really what you want? Or do you want the tokenized form of the fields? If the latter, you might get the Luke code, it reconstructs all the fields in the document from the terms tha

Re: About WordNet synonyms search

2011-02-12 Thread Erick Erickson
At a guess "make" is a synonym for one of your search terms. doc.get returns the original content, not synonyms. So what are your synonyms that might be a factor here? Best Erick On Sat, Feb 12, 2011 at 6:04 AM, Gong Li wrote: > Hi, > > I am tying WordNet synonyms into an SynonymAnalyzer. But I

Re: How to index part numbers

2011-01-28 Thread Erick Erickson
I wonder if you can define the problem away? It sounds like you have essentially random input here. That is, the users can put in whatever they want so whatever you do will be wrong sometime. Could you sidestep the problem with auto-complete and prefix queries (essentially adding * to the user's in
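The auto-complete idea, matching on whatever prefix the user has typed instead of guessing a full part number, reduces to prefix filtering over the indexed terms. A minimal stand-in for what a Lucene prefix query would do (the part numbers and case handling here are hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PartNumberPrefix {
    // Suggest every indexed part number starting with what the user has
    // typed so far (case-insensitive, as an autocomplete typically is).
    static List<String> suggest(List<String> partNumbers, String typed) {
        String p = typed.toLowerCase();
        return partNumbers.stream()
                .filter(n -> n.toLowerCase().startsWith(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> parts = List.of("AB-1000", "AB-1001", "CD-2000");
        System.out.println(suggest(parts, "ab-10")); // [AB-1000, AB-1001]
    }
}
```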

Re: Highlight Wildcard Queries: Scores

2011-01-26 Thread Erick Erickson
It is, I think, a legitimate question to ask whether scoring is worthwhile on wildcards. That is, does it really improve the user experience? Because the MaxBooleanClause gets tripped pretty quickly if you add the terms back in, so you'd have to deal with that. Would your users be satisfied with s

Re: Query parse errors for dashes in Lucene (3.0.3)

2011-01-24 Thread Erick Erickson
Yes. You're confusing an *engine* with a full-blown application. The user here is a Java programmer. I argue that guessing, which is what you're asking for, is emphatically NOT in the domain of the search *engine*, which is what Lucene is. Imagine the poor programmer trying to understand why certa
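Lucene's query syntax reserves characters like `-`, which is exactly why the engine should not guess: the programmer decides whether user input is syntax or literal text, typically by escaping it before parsing. A simplified escaper in the spirit of Lucene's QueryParser.escape (the character list is an assumption; check the escape rules of the version you use):

```java
public class QueryEscaper {
    // Backslash-escape characters that Lucene query syntax reserves, so a
    // dash in user input is searched literally rather than being parsed as
    // the prohibit operator or other syntax.
    private static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\/";

    static String escape(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (char c : userInput.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("wi-fi (2.4GHz)")); // wi\-fi \(2.4GHz\)
    }
}
```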

Re: Indexing with weights

2011-01-24 Thread Erick Erickson
I think all you need to do is index the keywords in one field and weights in another. Then just search on keywords and sort on weight. Note: the field you sort on should NOT be tokenized. Best Erick On Mon, Jan 24, 2011 at 4:02 PM, Chris Schilling wrote: > Hello, > > I have a bunch of text doc
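The advice above is to index keywords in one field and the weight in another, then search on keywords and sort on the (untokenized) weight. A plain-Java sketch of that match-then-sort idea; `Doc` and the sample data are hypothetical stand-ins, and in Lucene itself this would be a query on the keyword field plus a `Sort` on the weight field:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of "search on keywords, sort on weight" with plain collections.
public class WeightSortSketch {
    public static class Doc {
        public final String keywords;
        public final long weight;
        public Doc(String keywords, long weight) {
            this.keywords = keywords;
            this.weight = weight;
        }
    }

    // Return docs whose keyword field contains the term, highest weight first.
    public static List<Doc> search(List<Doc> docs, String term) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (Arrays.asList(d.keywords.split("\\s+")).contains(term)) {
                hits.add(d);
            }
        }
        hits.sort((a, b) -> Long.compare(b.weight, a.weight));
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc("lucene search", 2),
                                 new Doc("lucene index", 9),
                                 new Doc("solr faceting", 5));
        System.out.println(search(docs, "lucene").get(0).weight); // 9
    }
}
```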

Re: [REINDEX] Note: re-indexing required !

2011-01-23 Thread Erick Erickson
<<>> yes <<>> Unknown. The devs are trying mightily to keep this kind of thing out of the 3_x branch, but this was a fairly nasty bug rather than an enhancement which made it important enough to put in the 3x branch. This is NOT the same sort of issue you've seen in messages about rebuilding tru

Re: Filter documents on a field value while searching the index

2011-01-22 Thread Erick Erickson
're not going to return the docs anyway, why filter them later? Best Erick > -amg > > On Sat, Jan 22, 2011 at 12:32 PM, Erick Erickson > wrote: > > I guess I don't see what the problem is. These look to me like > > standard Lucene query syntax options. If I'm o

Re: Filter documents on a field value while searching the index

2011-01-22 Thread Erick Erickson
I guess I don't see what the problem is. These look to me like standard Lucene query syntax options. If I'm off base here, let me know. If you're building your own BooleanQuery, you can add these as sub-clauses Here's the Lucene query syntax: http://lucene.apache.org/java/2_9_1/queryparsersyntax.

Re: Filter Performance

2011-01-22 Thread Erick Erickson
That's certainly valid. You could also consider n-grams here as another approach. It's also useful to restrict the number of leading (or trailing) characters you allow. For instance, requiring at least 3 non-wildcard leading characters makes a big difference. It's also a legitimate question how wel
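A minimal sketch of the leading-character restriction mentioned above, rejecting wildcard queries with too few literal prefix characters; the 3-character minimum is the figure from the reply, and the class and method names are made up for illustration:

```java
// Sketch: count literal characters before the first wildcard and reject
// queries whose prefix is too short to be walked efficiently.
public class WildcardGuard {
    public static int leadingLiterals(String q) {
        int n = 0;
        while (n < q.length() && q.charAt(n) != '*' && q.charAt(n) != '?') {
            n++;
        }
        return n;
    }

    public static boolean acceptable(String q, int minLiterals) {
        return leadingLiterals(q) >= minLiterals;
    }

    public static void main(String[] args) {
        System.out.println(acceptable("ab*", 3));  // false
        System.out.println(acceptable("abc*", 3)); // true
    }
}
```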

Re: Question on writer optimize() / file merging?

2011-01-17 Thread Erick Erickson
See below: On Sun, Jan 16, 2011 at 10:15 AM, sol myr wrote: > Hi, > > I'm trying to understand the behavior of file merging / optimization. > I see that whenever my IndexWriter calls 'commit()', it creates a new file > (or fileS). > I also see these files merged when calling 'optimize()' , as mu

Re: lucene across many clients

2011-01-14 Thread Erick Erickson
That is certainly one way of approaching it. Another is to add a ClientId field to each document and add a mandatory "AND ClientId=thisclient" to each query. This will have some effect on relevance since the statistics are gathered over the whole corpus rather than just the individual client. Als
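The single-index approach above amounts to wrapping every user query with a mandatory client clause. A minimal sketch in Lucene query syntax; the field name `ClientId` follows the reply, while the wrapper itself is a hypothetical illustration (with a real `BooleanQuery` you would add the client term as a MUST clause instead of string concatenation):

```java
// Sketch: scope every search to one client by appending a mandatory clause.
public class ClientScopedQuery {
    public static String scope(String userQuery, String clientId) {
        // parenthesize the user query so the AND binds to the whole thing
        return "(" + userQuery + ") AND ClientId:" + clientId;
    }

    public static void main(String[] args) {
        System.out.println(scope("title:lucene", "acme"));
        // (title:lucene) AND ClientId:acme
    }
}
```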

Re: Searching Multivalued fileds

2011-01-14 Thread Erick Erickson
You need to create (and it's pretty easy) your own analysis chain that returns a larger position increment gap, which is intended for this very situation using the proximity (or SpanNear) as Jokin suggested. Best Erick On Tue, Jan 11, 2011 at 1:24 AM, Jokin Cuadrado wrote: > you should use a pr
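A plain-Java illustration of why a large position increment gap between the values of a multivalued field keeps phrase and proximity queries from matching across value boundaries. Positions are assigned the way an analyzer overriding `getPositionIncrementGap()` would assign them; the gap value of 100 is an arbitrary example, and the demo assumes distinct tokens for simplicity:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: simulate token positions for a multivalued field with a gap
// between values, so adjacent-looking tokens from different values end up
// far apart and cannot satisfy a phrase query.
public class PositionGapDemo {
    public static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> pos = new LinkedHashMap<>();
        int p = 0;
        for (String value : values) {
            for (String tok : value.split("\\s+")) {
                pos.put(tok, p++);
            }
            p += gap; // jump the position counter between field values
        }
        return pos;
    }

    public static void main(String[] args) {
        Map<String, Integer> p = positions(List.of("red fox", "lazy dog"), 100);
        // "fox" ends one value and "lazy" starts the next; they are
        // 101 positions apart, so the phrase "fox lazy" cannot match.
        System.out.println(p.get("lazy") - p.get("fox"));
    }
}
```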

Re: parsing Java log file with Lucene 3.0.3

2011-01-04 Thread Erick Erickson
Lucene In Action has an example of creating a synonymanalyzer that you can adapt. The general idea is to subclass from Analyzer and implement the required functions, perhaps wrapping a Tokenizer in a bunch of Filters. You might be able to crib some ideas from solr.analysis.WordDelimiterFilter Best
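The core of the synonym-analyzer idea from Lucene in Action is a filter that emits each token plus its synonyms at the same position. A plain-Java sketch of that expansion step with lists instead of a real `TokenFilter`; the synonym map and terms are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: for each incoming token, also emit its synonyms. In a real
// analysis chain this logic lives in a TokenFilter that sets a position
// increment of 0 on the injected synonym tokens.
public class SynonymExpander {
    public static List<String> expand(List<String> tokens,
                                      Map<String, List<String>> synonyms) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
            out.addAll(synonyms.getOrDefault(t, List.of()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> syn = Map.of("quick", List.of("fast", "speedy"));
        System.out.println(expand(List.of("quick", "fox"), syn));
        // [quick, fast, speedy, fox]
    }
}
```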

Re: parsing Java log file with Lucene 3.0.3

2011-01-02 Thread Erick Erickson
Some days I just can't read... First question: Why do you require standard analyzer? Are you really making use of the special processing? Take a look at other analyzer options. PatternAnalyzer, SimpleAnalyzer, etc. If you really require StandardAnalyzer, consider using two fields. field_original a

Re: parsing Java log file with Lucene 3.0.3

2011-01-01 Thread Erick Erickson
<<>> No, that is not the case. Storing a field stores an exact copy of the input, without any analysis. The intent of storing a field is to return something to display in the results list that reflects the original document. What use would it be to store something that had gone through the analysi

Re: two IndexSearchers on one dir?

2010-12-31 Thread Erick Erickson
It's not a problem, but it's best to share the underlying reader. You could open your short-lived searcher by getting a reader via getIndexReader() on your long-lived searcher What's the underlying use-case you're trying to make happen? Best Erick On Fri, Dec 31, 2010 at 8:01 PM, Paul Libbre

Re: parsing Java log file with Lucene 3.0.3

2010-12-31 Thread Erick Erickson
Have you looked at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Best Erick On Fri, Dec 31, 2010 at 6:12 AM, Benzion G wrote: > Hi, > > I need to parse the Java log files with Lucene 3.0.3. The StandardAnalyzer > is > OK, except it's handling of dots. > > E.g. it handles "java.la
