Re: lucene-core-3.3.0 not optimizing

2011-12-06 Thread Erick Erickson
Try taking a look at the patch, but on a quick glance it doesn't look like the underlying code has changed much. But note the whole point of this is that optimize is overused given its former name; why do you want to keep using it? Best Erick On Tue, Dec 6, 2011 at 1:04 AM, KARTHIK SHIVAKUMAR w

Re: [OT] editing index with Luke question

2011-12-14 Thread Erick Erickson
You may have to re-open the index. Also, is there any option to commit? I have to admit I haven't tried this... Best Erick On Wed, Dec 14, 2011 at 5:08 AM, Michael Südkamp wrote: > sure > > -Ursprüngliche Nachricht- > Von: Vinaya Kumar Thimmappa [mailto:vthimma...@ariba.com] > Gesendet:

Re: Using Lucene to match document sets to each other

2011-12-16 Thread Erick Erickson
Have you looked at Lucene's "MoreLikeThis"? I confess I haven't worked with this enough to recommend *how* to use it, but it seems like it's in the general area you're talking about. http://lucene.apache.org/java/3_5_0/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html Best Er

Re: Help: About performance of search with sorting.

2011-12-20 Thread Erick Erickson
What are you specifying for your sort criteria? And what kind of field is it we're talking about here? Best Erick On Tue, Dec 20, 2011 at 8:45 AM, Qiurun wrote: > Dear all, > > I select some of docs that meet some criteria by using TopDocs search(Query > query, int n). Also It's easy to select

Re: Retrieving large numbers of documents from several disks in parallel

2011-12-22 Thread Erick Erickson
I call into question why you "retrieve and materialize as many as 3,000 Documents from each index in order to display a page of results to the user". You have to be doing some post-processing because displaying 12,000 documents to the user is completely useless. I wonder if this is an "XY" problem

Re: Retrieving large numbers of documents from several disks in parallel

2011-12-27 Thread Erick Erickson
completely redesign the system - but maybe > thats just what we'll have to do. > > Anyways, thanks for your help! Any other suggestions would be appreciated, > but if there is no (relatively) easy solution, thats ok. > > Rob > > On Thu, Dec 22, 2011 at 4:51 AM, Erick Erick

Re: Commit data to disk ...

2012-01-05 Thread Erick Erickson
Lucene 2.0? I don't even know how to find the docs any more, I really suggest you upgrade to something more recent. In the 2.9 both IndexReader and IndexWriter have commit() methods. Best Erick On Tue, Jan 3, 2012 at 8:35 AM, Dragon Fly wrote: > > Hi, I'm using Lucene 2.0 and was wondering how

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Erick Erickson
the time interval is just a RangeQuery in the Lucene world. The rest is pretty standard search stuff. You probably want to have a look at the NRT (near real time) stuff in trunk. Your reads/writes are pretty high, so you'll need some experimentation to size your site correctly. Best Erick On We

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Erick Erickson
d vaues ) satisying > a query criteria ? e.g. SELECT SUM(price) WHERE item=fruits > > Or I need to use hitCollector to achieve that ? > > Any sample solr/lucene query to compte aggregates ( like SUM ) will be great. > > -Thanks, > Prasenjit > > On Thu, Jan 5, 2012 at 7:10

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Erick Erickson
onent exist for Solr with > the SUM function built in? > > http://wiki.apache.org/solr/StatsComponent > > On Thu, Jan 5, 2012 at 1:37 PM, Erick Erickson > wrote: >> You will encounter endless grief until you stop >> thinking of Solr/Lucene as a replacement for >> an R

Re: Help running out of files

2012-01-06 Thread Erick Erickson
Can you show the code? In particular are you re-opening the index writer? Bottom line: This isn't a problem anyone expects in 3.1 absent some programming error on your part, so it's hard to know what to say without more information. 3.1 has other problems if you use spellcheck.collate, you might

Re: Unsubscribe failure

2012-01-11 Thread Erick Erickson
If your e-mail client sends things in anything but plain text, you might try switching the format to plain text. I've had the spam filter reject formatted e-mail before... May not be relevant, but it's worth a try. Best Erick On Wed, Jan 11, 2012 at 12:44 PM, Bennett, Tony wrote: > I tried to u

Re: Differences between BooleanQuery and QueryParser

2012-01-30 Thread Erick Erickson
The parsing will be a trivial part of the overall query time, so small that I wouldn't worry about it in the least. I'd concentrate on doing the thing that takes the least maintenance. In the examples you're positing, it's not at all clear you could even measure the difference... Do what's easies

Re: using character '%' in queries (Lucene v3.1.0)

2012-01-31 Thread Erick Erickson
Depending on your analyzer, this could well be stripped from the input. Perhaps try using Luke to examine the actual values in the index to see if it's there. And the escape character for Lucene is the backslash. See: http://lucene.apache.org/java/2_9_1/queryparsersyntax.html#Escaping Special Cha

Re: lucene-3.0.3

2012-02-01 Thread Erick Erickson
What did you try and what exceptions did you get? You might review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Wed, Feb 1, 2012 at 8:54 AM, Prasad KVSH wrote: > It will be great if you provide some working examples on this. We tried > to deploy solr.war but getting exceptions.

Re: Need to enforce logging of Lucene queries

2012-02-06 Thread Erick Erickson
Solr already logs the queries themselves although there isn't any way that I know of to associate that with a user. Although in Solr land, it seems that whatever servlet container that you would use for Solr should be able to log all the URLs that hit the server. Best Erick On Mon, Feb 6, 2012 a

Re: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Erick Erickson
I'm curious what the nature of your data is such that you have 1.25 trillion documents. Even at 100M/shard, you're still talking 12,500 shards. The "laggard" problem will rear its ugly head, not to mention the administration of that many machines will be, shall we say, non-trivial... Best Erick

Re: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Erick Erickson
13,000 is a bit different. Now I'm in even more need of > help. > > How is "easy" - 15 million audit records a month, coming from several active > systems, and a requirement to keep and search across seven years of data. > > > > Thanks a lot, > The Captn >

Re: Top matched data should be on Top

2012-02-14 Thread Erick Erickson
You cannot simply count words like this and expect the docs to be ordered as you imply. The problem is that the lengths of the fields are encoded in a byte (or perhaps an int, I forget). Thus, some loss of precision is inherent in the process. You have to encode values from 1 to 2^31 or so in somet
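The precision loss can be sketched in plain Java. This is an illustrative quantization only, not Lucene's actual SmallFloat norm encoding; the log base and scale factor here are made up for the demo:

```java
public class NormPrecisionDemo {
    // Illustrative only: squeeze a positive field length into one byte-sized
    // bucket on a logarithmic scale (NOT the real Lucene norm encoding).
    static int encode(int length) {
        return Math.min(255, (int) (Math.log(length) / Math.log(2) * 16));
    }

    public static void main(String[] args) {
        // Nearby lengths collapse into the same byte value...
        System.out.println(encode(1000) == encode(1010)); // true
        // ...while lengths far apart still get distinct values.
        System.out.println(encode(100) == encode(200));   // false
    }
}
```

Once two field lengths map to the same byte, no downstream scoring can tell those documents apart by length, which is why the ordering in the original question isn't guaranteed.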

Re: Indexing 100Gb of readonly numeric data

2012-02-15 Thread Erick Erickson
Actually, you might well have your index be larger than your source, assuming you're going to be both storing and indexing everything. There's also the "deep paging" issue, see: https://issues.apache.org/jira/browse/SOLR-1726 which comes into play if you expect to return a lot of rows. Solr really

Re: date issues

2012-02-23 Thread Erick Erickson
1> Don't use sint, it's being deprecated. And it'll take up more space than a TrieDate 2> Precision. Sure, use the coarsest time you can, normalizing everything to day would be a good thing. You won't get any space savings by storing to day resolution, it's just a long under the covers. But depend

Re: Most recent document within a group ...

2012-02-26 Thread Erick Erickson
Have you looked at the Searcher.search variant that takes a Sort parameter? Best Erick On Sun, Feb 26, 2012 at 8:30 AM, Dragon Fly wrote: > > Hi, > > Let's say I have 6 documents and each document has 2 fields (i.e. > CustomerName and OrderDate).  For example: > > Doc 1    John    20120115 > Do

Re: Most recent document within a group ...

2012-02-27 Thread Erick Erickson
Just try it. Sorting doesn't load the document, it does load the unique values for the sort field. Which is why indexing dates benefits from using the coarsest resolution you can, i.e. don't store millisecond resolution if all you care about is the day something was published. In fact, sorting doe

Re: Breaking up a query results based upon ROWNUM or something similar?

2012-03-21 Thread Erick Erickson
Sure, in Solr you can specify start/rows parameters on queries like: &start=0&rows=1 &start=1&rows=1 &start=2&rows=1 You'll hit the "deep paging" problem, however. Briefly, as you page deeper and deeper your response time will drop, see: https://issues.apache.org/jira/browse/S
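A plain-Java sketch of the start/rows pattern and of why deep paging hurts. The parameter names match Solr's, but `pageParams` and `docsRanked` are hypothetical helpers, not any real API:

```java
public class PagingDemo {
    // Build the start/rows query parameters for a zero-based page number.
    static String pageParams(int page, int rowsPerPage) {
        return "&start=" + (page * rowsPerPage) + "&rows=" + rowsPerPage;
    }

    // Deep-paging cost: to serve a page, the engine must rank and keep a
    // priority queue of start + rows documents, not just rows.
    static int docsRanked(int page, int rowsPerPage) {
        return (page + 1) * rowsPerPage;
    }

    public static void main(String[] args) {
        System.out.println(pageParams(0, 10));   // &start=0&rows=10
        System.out.println(docsRanked(999, 10)); // 10000 docs ranked for "page 1000"
    }
}
```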

Re: NumericField exception java.lang.IllegalStateException: call set???Value() before usage in lucene 3.5

2012-03-27 Thread Erick Erickson
I'll, of course, defer to Uwe for technical Lucene issues, but you've got a copy/paste error it looks like. I doubt it's the root of your problem, but this code reuses priceField, it seems like you intend the second to use salesField NumericField priceField = new NumericField("price");

Re: search for token starting with a wildcard

2012-04-12 Thread Erick Erickson
Typically, they index the text in reverse order as well as forward order (similar to synonyms) so if you have a term in your field "reverse", you index "esrever" and now your leading-wildcard search for "*verse" becomes a trailing search for "esrev*". There is an implementation in Solr, see: http:
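The reversal trick can be shown with plain string manipulation. This is an illustration of the idea only, not Solr's actual ReversedWildcardFilter:

```java
public class ReverseWildcardDemo {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // At index time, store the reversed form of each term alongside the forward form.
        String indexed = reverse("reverse"); // "esrever"

        // A leading-wildcard query "*verse" becomes a cheap trailing-wildcard
        // (prefix) query on the reversed field: strip the '*', reverse the rest.
        String prefix = reverse("*verse".substring(1)); // "esrev"

        System.out.println(indexed.startsWith(prefix)); // true: prefix match, no full term scan
    }
}
```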

Re: a higher-level layer above lucene

2012-04-16 Thread Erick Erickson
What kind of hiding are you interested in? Solr does a lot of this... Best Erick On Mon, Apr 16, 2012 at 1:37 PM, Akos Tajti wrote: > Hi All, > > I'm looking for a solution that hides the complexity and the low level > structure of Lucene (to make it much simpler to use). I came across the > Com

Re: a higher-level layer above lucene

2012-04-16 Thread Erick Erickson
a search server and the communication eith it is > done through a RESTful API. What I need is a Java API that I can use > programmatically. > > Ákos Tajti > > > > On 2012.04.16., at 19:58, Erick Erickson wrote: > >> What kind of hiding are you interested in? Solr doe

Re: a higher-level layer above lucene

2012-04-17 Thread Erick Erickson
tom code. > > Ákos Tajti > > > > > On Tue, Apr 17, 2012 at 12:23 AM, Erick Erickson > wrote: > >> To do what? You're asking very general questions that are >> hard to answer simply because of the lack of any detail, >> use cases, etc. >> >>

Re: Weighted cosine similarity calculation using Lucene

2012-04-20 Thread Erick Erickson
Maybe I'm missing something here, but why not just boost the terms in the fields at query time? Best Erick On Fri, Apr 20, 2012 at 4:20 AM, Kasun Perera wrote: > I have documents that are marked up with Taxonomy and Ontology terms > separately. > When I calculate the document similarity, I want

Re: searching minus digit

2012-04-24 Thread Erick Erickson
the - is part of the query syntax, you must escape it. See: http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/queryparsersyntax.html Best Erick On Tue, Apr 24, 2012 at 5:44 AM, S Eslamian wrote: > How can I search a minus digit like -123 in lucene? > When I search this,lucene excl
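A minimal sketch of backslash-escaping, along the lines of what QueryParser.escape does. The character set below is illustrative; check the linked syntax page for the authoritative list:

```java
public class QueryEscapeDemo {
    // Prefix Lucene query-syntax characters with a backslash so they are
    // treated as literal text rather than operators.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&/".indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unescaped, the leading '-' is the prohibit operator and the query
        // excludes docs containing 123 instead of matching -123.
        System.out.println(escape("-123")); // \-123
    }
}
```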

Re: searching minus digit

2012-04-24 Thread Erick Erickson
2012 13:40, S Eslamian a écrit : >> >>  Thank you but when I search this : Query termQuery = new TermQuery >>> ("field","\-1234"); I get this exception : >>> Invalid escape sequence (valid one are \b \t \n \f \r \" \' \\) >>> >>

Re: Re-indexing a particular field only without re-indexing the entire enclosing document in the index

2012-04-25 Thread Erick Erickson
There's no update-in-place, currently you _have_ to re-index the entire document. But to the original question: There is a "limited join" capability you might investigate that would allow you to split up the textual data and metadata into two different documents and join them. I don't know how we

Re: Storing same field twice (analyzed+not-analyzed), sorting

2012-04-27 Thread Erick Erickson
Hmmm, putting analyzed and unanalyzed values in the same field seems like it'd be difficult to get right. In the Solr world, two separate fields are usually used. Sorting is right out, the results are unpredictable. What does it mean to sort on a field with multiple tokens? For a doc with "aardva

Re: pruning package- pruneAllPositions

2012-05-02 Thread Erick Erickson
Not unless you provide a lot more context, there's nothing to go on here! You might review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Wed, May 2, 2012 at 6:11 AM, Zeynep P. wrote: > Hi, > > In the pruning package, pruneAllPositions throws an exception. In the code > it is comm

Re: Restricting search results to a dynamic slice of documents

2012-05-05 Thread Erick Erickson
On the face of it, it looks like one of the subclasses of lucene.search.Filter should be what you're looking for. Or is the "dynamic slice" something you couldn't formulate into a query? Best Erick On Fri, May 4, 2012 at 2:51 PM, Earl Hood wrote: > I require the ability to perform a search on a d

Re: Getting the frequencies by corresponding order of documents were indexed

2012-05-14 Thread Erick Erickson
In general you can't rely on anything like this. I admit the merge stuff isn't my area of expertise, but when segments are merged, there's no guarantee that they're merged in order. In general the internal Lucene doc ID should be treated as predictable only for closed segments. Your solution of us

Re: Problem querying Lucene after escaping

2012-06-25 Thread Erick Erickson
TermQuerys are assumed to be parsed already. So you're looking for a _single_ term "ncbi-geneid:379474 or XI.24622". You'd construct something like Query query1 = new TermQuery(new Term("type", "gene")); Query query2 = new TermQuery(new Term("alt_Id", "ncbi-geneid:379474")); Query query3 = new Te

Re: Join support across multiple document types in Lucene

2012-06-29 Thread Erick Erickson
Performance here is an issue. The performance of Solr/Lucene's query joins is a function of the number of _unique_ values in the fields being joined, and can be unacceptably slow in the wild; it depends on your use-case. Denormalization is the usual approach, but that gives DB folks the hives. Bu

Re: incrementally indexing

2012-07-05 Thread Erick Erickson
Hmmm, it's not quite clear what the problem is. But let's say you have indexed your hard drive. Somewhere you'll have to keep a record of what you've done, say the timestamp when you started looking at your hard drive to index it. Next time you run, you simply only index files that have changed si
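The timestamp bookkeeping described above can be sketched with java.nio; `needsReindex` and `changedSince` are hypothetical helper names, not part of any Lucene API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class IncrementalScan {
    // A file needs (re)indexing when it changed after the last run's timestamp.
    static boolean needsReindex(long fileModifiedMillis, long lastRunMillis) {
        return fileModifiedMillis > lastRunMillis;
    }

    // Walk the tree, keeping only files changed since the recorded timestamp.
    static List<Path> changedSince(Path root, long lastRunMillis) throws IOException {
        List<Path> changed = new ArrayList<>();
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> needsReindex(p.toFile().lastModified(), lastRunMillis))
                 .forEach(changed::add);
        }
        return changed;
    }

    public static void main(String[] args) {
        System.out.println(needsReindex(2_000L, 1_000L)); // true: modified after last run
        System.out.println(needsReindex(500L, 1_000L));   // false: already indexed
    }
}
```

Record the timestamp just before the walk starts, so files modified during indexing are picked up on the next run rather than missed.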

Re: index.merge.scheduler exception - java.io.IOException: Input/output error

2012-07-09 Thread Erick Erickson
no, you can't delete those files, and you can't regenerate just those files, all the various segment files are necessary and intertwined... Consider using the CheckIndex facility, see: http://solr.pl/en/2011/01/17/checkindex-for-the-rescue/ note, the CheckIndex class is contained in the lucene co

Re: about some date store

2012-07-12 Thread Erick Erickson
You can only show what is stored (Field.Store.YES). Only then can you use document.get(...) and get something to display. Best Erick On Thu, Jul 12, 2012 at 2:55 AM, sam wrote: > it's take a new problem,what even I seaching,I can only get the first line > data,if the data can be seach.and ,when i

Re: Pattern Analyzer

2012-07-13 Thread Erick Erickson
Sure, you can do it that way. But first I'd look over the zillion tokenizers and filters that are available and string together the ones that best suit your need. For instance, WhitespaceTokenizer and PatternReplaceFilter might make your regex much easier since the PatternReplaceFilter gets just th

Re: Multiple sort field

2012-07-18 Thread Erick Erickson
Lucene certainly supports multiple sort criteria, see IndexSearcher.search, any one that takes a Sort object. The Sort object can contain a list of fields where any ties in the first N field(s) are decided by looking at field N+1. But, Ganesh, be a little careful about resolving by internal Lucene
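The tie-break behavior of a multi-field Sort can be mimicked in plain Java with chained Comparators. This is an illustration of the semantics only, not Lucene code; the `Doc` record and sample data echo the CustomerName/OrderDate example from the earlier thread:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MultiSortDemo {
    record Doc(String customer, int orderDate) {}

    // Sort by customer; ties broken by orderDate descending, the way a Lucene
    // Sort with two SortFields resolves field-1 ties by looking at field 2.
    static List<Doc> sorted(List<Doc> docs) {
        List<Doc> out = new ArrayList<>(docs);
        out.sort(Comparator.comparing(Doc::customer)
                           .thenComparing(Comparator.comparingInt(Doc::orderDate).reversed()));
        return out;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("John", 20120115), new Doc("Ann", 20120310), new Doc("John", 20120301));
        // Ann sorts first; the two Johns are ordered newest-order-first.
        System.out.println(sorted(docs));
    }
}
```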

Re: Matching on "owned" docs -- filter or query? Or sort?

2012-07-22 Thread Erick Erickson
Hmmm, what about simply boosting very high on owner, and probably grouping on title? If you boosted on owner, you wouldn't even have to index the title separately for each user, your "owner" field could be multivalued and contain _all_ the owner IDs. In that case you wouldn't have to group at all.

Re: Matching on "owned" docs -- filter or query? Or sort?

2012-07-23 Thread Erick Erickson
of somehow doing an empty query to fetch all docs, sorting them to > put docs with the userId first, and then running a DuplicateFilter on title > with KM_USE_FIRST_OCCURRENCE. This is the duplicate elimination behavior I > want. Then do a text search on the remainder. But this

Re: Lucene 4.0 GA time frame

2012-07-31 Thread Erick Erickson
My guess is in the September/October time frame. Things are actually moving along pretty quickly, and there's considerable sentiment to get it out the door... Best Erick On Mon, Jul 30, 2012 at 11:11 PM, Vitaly Funstein wrote: > Given that the Alpha is out, are there any more or less definitive

Re: easy way to figure out most common tokens?

2012-08-15 Thread Erick Erickson
I don't see how you could without indexing everything first since you can't know what the most frequent terms are until you've processed all your documents. If you know these terms in advance, it seems like you could just call them stopwords and use the common stopword processing. If you have to e
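A plain-Java sketch of the "process everything, then count" idea, using an in-memory map instead of Lucene's term statistics (a real index would expose these counts via its term dictionary):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopTermsDemo {
    // Count term frequencies across all documents, then take the top N terms.
    static List<String> topTerms(List<String> docs, int n) {
        Map<String, Long> counts = docs.stream()
            .flatMap(d -> Arrays.stream(d.toLowerCase().split("\\s+")))
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of("the cat sat", "the dog sat", "the cat ran");
        // "the" appears in every doc, so it tops the list: a stopword candidate.
        System.out.println(topTerms(docs, 1));
    }
}
```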

Re: Negative query issue

2012-09-07 Thread Erick Erickson
The first thing you should do is enumerate what you expect and what you get. We have no way of knowing what expectations of yours are not being met. Here's an interesting blog you might want to read: http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/ Best Erick On Wed, Sep 5, 2012 at 8:

Re: Variable term weighting while indexing

2012-09-29 Thread Erick Erickson
Yeah, payloads are probably what you want, otherwise the words are indistinguishable. Best Erick On Sat, Sep 29, 2012 at 12:23 PM, parnab kumar wrote: > Hi All, > >I have an algorithm by which i measure the importance of a term > in a document . While indexing i want to store weight

Re: Variable term weighting while indexing

2012-10-01 Thread Erick Erickson
> be the payload weight . >Is the above intuition correct ? > > Thanks, > Parnab > > On Sun, Sep 30, 2012 at 2:13 AM, Erick Erickson > wrote: > >> Yeah, payloads are probably what you want, otherwise the words are >> indistinguishable. >> >> Be

Re: Restrict Lucene search in concrete document ids

2012-10-16 Thread Erick Erickson
You probably want to use a Lucene Filter then use one of the query methods that takes a filter. Best Erick On Tue, Oct 16, 2012 at 4:36 AM, sxam wrote: > Hi, > Prior to search I have a concrete list of Lucene Document Ids (different > every time) and I want to limit my search only to those speci

Re: Restrict Lucene search in concrete document ids

2012-10-16 Thread Erick Erickson
eds to change > for every query, that might be a problem. > > 2012/10/16 Erick Erickson > >> You probably want to use a Lucene Filter then use >> one of the query methods that takes a filter. >> >> Best >> Erick >> >> On Tue, Oct 16, 2012 at 4:

Re: Multiple Blocking Threads with search during an Index reload

2012-10-24 Thread Erick Erickson
You haven't given us much to go on here, you might review: http://wiki.apache.org/solr/UsingMailingLists But one can imagine that this must be something you're doing that's unusual, or more people would have reported something similar. At a guess (since you haven't really told us _anything_ abou

Re: using CharFilter to inject a space

2012-11-03 Thread Erick Erickson
So I've gotta ask... _why_ do you want to inject the spaces? If it's just to break this up into tokens, wouldn't something like LetterTokenizer do? Assuming you aren't interested in leaving in numbers. Or even StandardTokenizer unless you have e-mail & etc. Or what about PatternReplaceCharFilt

Re: using CharFilter to inject a space

2012-11-04 Thread Erick Erickson
kenizers is that they get rid of the commas > so if I use a ShingleFilter after them there's no way to tell if there was > a comma there or not. > > (another option I consider is to add an Attribute to specify if there was > a comma before or after a token) > > if there&#x

Re: Overriding DefaultSimilarity to not consider tf/idf and friends

2012-11-05 Thread Erick Erickson
First I'd see if omitting term frequencies, positions, and norms does what you need; these are all things you can disable OOB... Best Erick On Mon, Nov 5, 2012 at 5:26 AM, Damian Birchler wrote: > Hi everyone > > ** ** > > We are using Lucene to search for possible duplicates in an address

Re: content disappears in the index

2012-11-12 Thread Erick Erickson
First, sorting on tokenized fields is undefined/unsupported. You _might_ get away with it if the author field always reduces to one token, i.e. if you're always indexing only the last name. I should say unsupported/undefined when more than one token is the result of analysis. You can do things lik

Re: content disappears in the index

2012-11-12 Thread Erick Erickson
cement="" > replace="all"/> > > pattern="(.{1,30})(.{31,})" > > replacement="$1" > replace="all"/> > > > > > > > > It reduces long lis

Re: content disappears in the index

2012-11-13 Thread Erick Erickson
There's nothing in Solr that I know of that does this. It would be a pretty easy custom filter to create though FWIW, Erick On Tue, Nov 13, 2012 at 7:02 AM, Robert Muir wrote: > On Mon, Nov 12, 2012 at 10:47 PM, Bernd Fehling > wrote: > > By the way, why does TrimFilter option updateOffse

Re: content disappears in the index

2012-11-15 Thread Erick Erickson
{ > > if (input.incrementToken()) { > > if (termAtt.length() > maxLength) { > > termAtt.setLength(maxLength); > > } > > > > return true; > > } else { > > return false; > > } > >

Re: Which stemmer?

2012-11-15 Thread Erick Erickson
I'd make it easy for myself. Generate (programmatically) a list like you showed for a _lot_ more terms, send it to your customer, and let _them_ pick. Unfortunately, the customer has no idea what "aggressive" means (for that matter, I don't know how Porter handles specific words either, I

Re: Using alternative scoring mechanism.

2012-12-02 Thread Erick Erickson
I think you're looking for per-field similiarity, does this help? https://issues.apache.org/jira/browse/LUCENE-2236 Note, in 4.0 only Best Erick On Sat, Dec 1, 2012 at 1:43 PM, Eyal Ben Meir wrote: > Can one replace the basic scoring algorithm (TF/IDF) for a specific field, > to use a differe

Re: Which token filter can combine 2 terms into 1?

2012-12-21 Thread Erick Erickson
If it's a fixed list and not excessively long, would synonyms work? But if there's some kind of logic you need to apply, I don't think you're going to find anything OOB. The problem is that by the time a token filter gets called, the terms are already split up; you'll probably have to write a custom fil

Re: all the documents are not getting indexed in solr 3.6.1

2012-12-26 Thread Erick Erickson
Try looking at the admin/analysis page, that'll probably tell you a lot. You'll have to provide quite a bit more information to help us help you, you might want to review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Sat, Dec 22, 2012 at 9:28 PM, wrote: > I am trying to index so

Re: WordDelimiterFilter Question (lucene 4.0)

2012-12-26 Thread Erick Erickson
It's always amazing to me how hitting the "send" button makes the solution obvious...too late . Been there, done that... more times than I want to count Best Erick On Sun, Dec 23, 2012 at 2:30 PM, Jeremy Long wrote: > Have you ever wished you could retract your question to a mailing list?

Re: Pulling lucene 4.1

2013-01-04 Thread Erick Erickson
BTW, if all you're interested in is the compiled code, you can always get the latest build from: http://wiki.apache.org/solr/NightlyBuilds (4x-SNAPSHOT). That code will be compiled from the link Shai pointed out except for any commits since the build... FWIW, Erick On Wed, Jan 2, 2013 at 2:01 PM,

Re: Field.Store.YES vs Field.Store.NO

2013-01-11 Thread Erick Erickson
This page might add a little insight: http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html#file-names. It's for 3.5, but the take-away is that *.fdt and *.fdx files are where a raw copy of the data goes when you specify Store.YES. They are completely independent of all t

Re: Tool for Lucene storage recovery

2013-01-21 Thread Erick Erickson
Maybe do the handling as an overridable method and make it abstract? That would give the skeleton of all the recovery stuff, but then require the user to implement the actual recovery? Just a thought Erick On Mon, Jan 21, 2013 at 9:06 AM, Michał Brzezicki wrote: > I don't think it is possible to

Re: Tool for Lucene storage recovery

2013-01-21 Thread Erick Erickson
P.S. Or just attach the code without your customized doc recovery stuff with a note about how to carry it forward? That way someone could pick it up if interested and generalize it. Best Erick On Mon, Jan 21, 2013 at 12:37 PM, Erick Erickson wrote: > Maybe do the handling as an overrida

Re: Real-time Get and Atomic Updates for SolrJ

2013-01-31 Thread Erick Erickson
I haven't used it myself, but I did find this for atomic updates: http://www.mumuio.com/solrj-4-0-0-alpha-atomic-updates/ Don't know if there really is need for specific support in SolrJ for RTG, isn't that all over on the Solr side and automagic? Best Erick On Wed, Jan 30, 2013 at 5:47 PM, Dye

Re: Is there a limit for a field size in Lucene 3.0.2

2013-02-21 Thread Erick Erickson
There's an overridable default of 10,000 tokens; that's the first place I'd look. I forget just how to set it to a higher value. Best Erick. P.S. Please don't hit reply to a message and change the title, but start an e-mail fresh. See: http://people.apache.org/~hossman/#threadhijack On Thu, Fe

Re: Bulk indexing and delete old index files

2013-03-05 Thread Erick Erickson
If you kept an "indexed_time" field, you could always just index to the same instance and then do a delete by query, something like timestamp:[* TO NOW/DAY], commit and go. That would delete everything indexed before midnight last night (NOW/DAY rounds down). Note, most of this would be already r
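The NOW/DAY rounding can be reproduced with java.time. This sketches the rounding semantics only; the actual date-math expression is evaluated by Solr server-side:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class NowDayDemo {
    // Solr's NOW/DAY date math rounds down to midnight (UTC); same idea here.
    static Instant startOfDay(Instant now) {
        return now.truncatedTo(ChronoUnit.DAYS);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-03-05T14:30:00Z");
        // Everything with indexed_time before this instant matches
        // indexed_time:[* TO NOW/DAY] at query time.
        System.out.println(startOfDay(now)); // 2013-03-05T00:00:00Z
    }
}
```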

Re: Accent insensitive analyzer

2013-03-24 Thread Erick Erickson
ISOLatin1AccentFilter has been deprecated for quite some time, ASCIIFoldingFilter is preferred Best Erick On Fri, Mar 22, 2013 at 2:59 PM, Jerome Blouin wrote: > Thanks. I'll check that later. > > -Original Message- > From: Sujit Pal [mailto:sujitatgt...@gmail.com] On Behalf Of SUJ

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Erick Erickson
@Tom - done On Mon, Mar 25, 2013 at 12:48 PM, Tom Burton-West wrote: > Please add tburtonw to contributors > Tom Burton-West > tburtonw at umich dot edu > > Tom > > On Mon, Mar 25, 2013 at 9:05 AM, Steve Rowe wrote: > > > > > On Mar 25, 2013, at 8:49 AM, Rafał Kuć wrote: > > > Could you add Ra

Re: Assert / NPE using MultiFieldQueryParser

2013-03-25 Thread Erick Erickson
@Simon did I actually catch a reference to: http://xkcd.com/722/ ??? that's one of my all-time favorites on XKCD, I think it describes my entire professional life "Bobby Tables" is another (http://xkcd.com/327/). There, I've done my bit to stop productivity today! Erick On Mon, Mar 25

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Erick Erickson
t you put) to > tburtonw. > > Steve > > On Mar 25, 2013, at 1:19 PM, Erick Erickson > wrote: > > > @Tom - done > > > > > > On Mon, Mar 25, 2013 at 12:48 PM, Tom Burton-West >wrote: > > > >> Please add tburtonw to contributors > >> Tom

Re: Consultant Inquiry

2013-03-29 Thread Erick Erickson
There are a bunch of possibilities listed here: http://wiki.apache.org/solr/Support Best Erick On Thu, Mar 28, 2013 at 2:32 PM, Nick Hoffman wrote: > I'm looking for a consultant for Lucene Solr. > > Our team of 3 extended OpenBravo (Java ERP) with a built-in Shopping Cart > (written in JS). >

Re: Highlighting search words in full document

2013-04-07 Thread Erick Erickson
Sounds like what you want to do is 1> with each verse, store the chapter ID. This could be the ID of another document. There's no requirement that all docs in an index have the same structure. In this case, you could have a "type" field in each doc with values like "verse" and "chapter". For your v

Re: Highlighting search words in full document

2013-04-07 Thread Erick Erickson
entire chapter that were highlighted > in the selected verse. > > Thanks! > > Sent from my iPhone > > On Apr 7, 2013, at 5:38 AM, Erick Erickson wrote: > >> Sounds like what you want to do is >> 1> with each verse, store the chapter ID. This could be the ID

Re: Why I can not interrupt the search?

2013-06-05 Thread Erick Erickson
Have you seen TimeLimitingCollector? Best Erick On Wed, Jun 5, 2013 at 6:39 AM, 朱彦安 wrote: > Hello! > > In the search hit a lot,I want to hit the 2000 docs return data immediately. > > I can not find such a method of lucene. > > I have tried: > > public int score(Collector collector,int max){ >

Re: build of trunk hangs

2013-06-22 Thread Erick Erickson
What Adrien said. I've had this happen when I kill a build partway through (but just sometimes). If you're on a fast network, I'll sometimes just delete the entire .ivy2 cache, but that's a little drastic. Erick On Thu, Jun 20, 2013 at 9:15 AM, Adrien Grand wrote: > Hi, > > On Thu, Jun 20, 2013

Re: Securing stored data using Lucene

2013-06-23 Thread Erick Erickson
Security has at least two parts. First, allowing users access to specific documents, for which Alon's comments are the "usual" way to do this in Solr/Lucene. But the patch you referenced doesn't address this, it's all about encrypting the data stored on disk. This is useful for keeping people who

Re: Questions about doing a full text search with numeric values

2013-07-01 Thread Erick Erickson
WordDelimiterFilter(Factory if you're experimenting with Solr as Jack suggests) will fix a number of your cases since it splits on case change and numeric/alpha changes. There are a bunch of ways to recombine things so be aware that it'll take some fiddling with the parameters. As Jack suggests, us

Re: In memory index (current status in Lucene)

2013-07-01 Thread Erick Erickson
Hey Emma! It's been a while. Building on what Steven said, here's Uwe's blog on MMapDirectory and Lucene: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html I've always considered RAMDirectory for rather restricted use-cases. I.e. if I know without doubt that the index is

Re: Questions about doing a full text search with numeric values

2013-07-06 Thread Erick Erickson
; see if "this letter sequence occur(s)" in it? I'm thinking I'm missing > something because that seems no different than using wildcards. Or am I > missing a subtle difference? > > Thank you. > > -Original Message- > From: Erick Erickson [mailto:erick

Re: Lucene in Action

2013-07-10 Thread Erick Erickson
Right, unfortunately, there's nothing that I know of that's super-recent. Jack Krupansky is e-publishing a book on Solr, which will be more up to date but I don't know how thoroughly it dives into the underlying Lucene code. Otherwise, I think the best thing is to tackle a real problem (perhaps tr

Re: What is text searching algorithm in Lucene 4.3.1

2013-07-17 Thread Erick Erickson
Note: as of Lucene 4.x, you can plug in your own scoring algorithm, it ships with several variants (e.g. BM25) so you can look at the pluggable scoring where all the code for the various algorithms is concentrated. Erick On Wed, Jul 17, 2013 at 12:40 AM, Jack Krupansky wrote: > The source code i

Re: Partial word match using n-grams

2013-07-19 Thread Erick Erickson
Well, it depends on what you put between your tokenizer and ngram filter. Putting WordDelimiterFilterFactory would break up on the underscore (and lots of other things besides) and submit the separate tokens which would then be n-grammed separately. That has other implications, of course, but you g
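The chain Erick describes — tokenizer, then WordDelimiterFilterFactory to break on underscores, then n-gramming each resulting token separately — could look like this hypothetical field type (the name `text_ngram` and the gram sizes are illustrative, not from the thread):

```xml
<!-- Illustrative analysis chain: WordDelimiterFilter splits compound
     tokens (e.g. on underscores) before NGramFilter grams each part. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
  </analyzer>
</fieldType>
```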

Re: Trying to search java.lang.NullPointerException in log file.

2013-07-22 Thread Erick Erickson
Even though you're on the Lucene list, consider installing Solr just to see the admin/analysis page to see how your index and query analysis works. There's no reason you couldn't split this up on periods into separate words and then just use phrase query to find java.lang.NullPointerException, but
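The split-on-periods idea can be illustrated outside Lucene. A minimal Python sketch of what such an analysis step would emit (the function name is mine, not a Lucene API):

```python
import re

def split_dotted(token):
    """Break a dotted class name into the separate words that a
    period-splitting analysis step would produce."""
    return [part for part in re.split(r"\.", token) if part]

# The resulting words are what a phrase query would then match in order.
print(split_dotted("java.lang.NullPointerException"))
```

A phrase query over these tokens ("java lang NullPointerException") then finds the exception name even though the original text contained it as one dotted token.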

Re: Tokenize String using Operators(Logical Operator, : operator etc)

2013-07-23 Thread Erick Erickson
I really don't see what the use-case here is. When you say "later", what does that mean? You're indexing what and querying how? Best Erick On Tue, Jul 23, 2013 at 7:19 AM, dheerajjoshim wrote: > Greetings, > > I am looking a way to tokenize the String based on Logical operators > > Below String

Re: need searcher example to read indexes generated by solr

2013-07-27 Thread Erick Erickson
Have you looked at either the Blacklight or Velocity Response Writer? This latter is shipped standard with Solr, access it by the /browse handler. It's pretty easily customizable. Blacklight is here: http://projectblacklight.org/ Best Erick On Thu, Jul 25, 2013 at 1:14 PM, mlotfi wrote:

Re: Phonetic Filter

2013-08-06 Thread Erick Erickson
Take a look at the BeiderMorseFilterFactory perhaps? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory Here's a mention that it explicitly works for French: http://docs.lucidworks.com/display/solr/Phonetic+Matching But I admit there's not much here on _how_,
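As to the _how_, a hypothetical field type using BeiderMorseFilterFactory restricted to French might look like this (the name `text_phonetic` and the attribute choices are illustrative, not from the thread):

```xml
<!-- Illustrative phonetic field type; languageSet narrows the
     Beider-Morse rules to French instead of auto-detection. -->
<fieldType name="text_phonetic" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.BeiderMorseFilterFactory"
            nameType="GENERIC" ruleType="APPROX"
            concat="true" languageSet="french"/>
  </analyzer>
</fieldType>
```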

Re: Query

2013-08-11 Thread Erick Erickson
You probably want something more like "electro hydraulic power assist steering"~5, quote marks and all. And note that it's not quite "within 5 positions", it's more "up to five single-word transpositions" which is kind of a slippery concept. "electro hydraulic assist power steering"~5 would requi

Re: IllegalStateException in SpanTermQuery

2013-08-14 Thread Erick Erickson
As Mike said, this is an intended change. The test passed in 3.5 because there was no check if Span queries were working on a field that supported them. In 4.x this is checked and an error is thrown. Best Erick On Wed, Aug 14, 2013 at 12:22 AM, Yonghui Zhao wrote: > In our old code, we create t

Re: Lucene index customization

2013-08-24 Thread Erick Erickson
Have you looked at the whole flexible indexing functionality? Here's a couple of places to start: http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/ http://www.slideshare.net/LucidImagination/flexible-indexing-in-lucene-40 I'm still not quite sure why you want to do this,

Re: Stream Closed Exception and Lock Obtain Failed Exception while reading the file in chunks iteratively.

2013-09-01 Thread Erick Erickson
I really recommend you restructure your program, it's hard to follow. For instance, you open a new IndexWriter every time through the while (flags) loop. You only close it in the if (iwcTemp1.getConfig().getOpenMode() == OpenMode.CREATE_OR_APPEND) { case. That may be the root of your problem rig
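The restructuring Erick suggests — one writer for the whole run instead of a new one per loop iteration — can be sketched generically. This toy `Writer` is illustrative only; it is not the Lucene IndexWriter API, it just models a resource that, like an IndexWriter, must not be reopened while an earlier instance still holds the write lock:

```python
class Writer:
    """Toy stand-in for an index writer holding an exclusive lock."""
    def __init__(self):
        self.docs = []
        self.closed = False

    def add(self, doc):
        if self.closed:
            raise ValueError("writer already closed")
        self.docs.append(doc)

    def close(self):
        self.closed = True

def ingest(chunks):
    writer = Writer()          # opened once, outside the chunk loop
    try:
        for chunk in chunks:
            writer.add(chunk)  # every chunk goes through the same writer
    finally:
        writer.close()         # closed exactly once, after all chunks
    return len(writer.docs)
```

Opening the writer once and closing it in a finally block avoids both the stream-closed and lock-obtain-failed symptoms that come from constructing a fresh writer per iteration.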

Re: Stream Closed Exception and Lock Obtain Failed Exception while reading the file in chunks iteratively.

2013-09-02 Thread Erick Erickson
ile is read. > > If there's another alternative I will be more than happy to know . > > As of now, I still get StreamClosedException and > LockObtainFailedException. So any help on this will be deeply appreciated.. > > > On 9/1/2013 5:46 PM, Erick Erickson wrote: > >> I

Re: Making lucene indexing multi threaded

2013-09-02 Thread Erick Erickson
Stop. Back up. Test. The very _first_ thing I'd do is just comment out the bit that actually indexes the content. I'm guessing you have some loop like:

    while (more files) {
        read the file
        transform the data
        create a Lucene document
        index the document
    }

Just comment out the "index
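The measurement technique Erick describes — run the loop with the indexing step disabled and compare timings to see where the time really goes — can be sketched like this. The dict-building and counter are stand-ins of my own, not Lucene calls:

```python
import time

def run_ingest(files, index_enabled=True):
    """Run the read/transform/index loop; with index_enabled=False only
    the read/transform stand-ins execute, so comparing the two elapsed
    times shows whether indexing or data preparation dominates."""
    indexed = 0
    start = time.perf_counter()
    for name in files:
        doc = {"id": name, "body": name.lower()}  # stand-in: read + transform
        if index_enabled:
            indexed += 1                          # stand-in: writer.addDocument(doc)
    return indexed, time.perf_counter() - start

full_count, t_full = run_ingest(["A.txt", "B.txt"])
skip_count, t_skip = run_ingest(["A.txt", "B.txt"], index_enabled=False)
```

If `t_skip` is nearly as large as `t_full`, the bottleneck is in reading and transforming the data, not in the indexing call — which is exactly what commenting out the index step reveals.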

Re: Smart Chinese Analyzer Performance

2013-09-06 Thread Erick Erickson
Well, various people have measured between a 50% and 70+% reduction in memory used for identical data, so I'd say so. The CHANGES.txt is where I'd look to see if anything mentioned is worth your time. Not to mention SolrCloud... Erick On Fri, Sep 6, 2013 at 3:41 PM, Darren Hoffman wrote: > I
