Re: Help with a fieldcomparator!

2015-01-17 Thread Erick Erickson
There are a lot of different patterns but not too many documents as a result of the query filter. For that reason I think the best way is a custom FieldComparator. Thanks, Víctor Podberezski On Fri, Jan 16, 2015 at 9:31 PM, Erick Erickson erickerick...@gmail.com wrote: Personally I would do

Re: Help with a fieldcomparator!

2015-01-16 Thread Erick Erickson
Personally I would do this on the ingestion side with a new field. That is, analyze the input field when you were indexing the doc, extract the min value from any numbers, and put that in a new field. Then it's simply sorting by the new field. This is likely to be much more performant than
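A minimal sketch of the ingestion-side approach Erick describes, assuming the goal is "smallest number in the text". The class name and regex are illustrative, not from the thread: at index time, pull the minimum value out of the raw field and write it to a separate numeric field, then sort on that field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MinValueExtractor {
    private static final Pattern NUMBER = Pattern.compile("-?\\d+");

    // Returns the smallest integer found in the text, or null if none.
    // At index time this value would go into a new numeric field,
    // so queries can sort on it directly instead of re-analyzing.
    public static Long minNumber(String text) {
        Matcher m = NUMBER.matcher(text);
        Long min = null;
        while (m.find()) {
            long v = Long.parseLong(m.group());
            if (min == null || v < min) min = v;
        }
        return min;
    }
}
```

Doing this once at ingestion is cheap; doing the equivalent per-hit in a custom FieldComparator repeats the work on every sort.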

Re: Indexing

2015-01-15 Thread Erick Erickson
Basically there is a stored fork and an indexed fork. If you specify the input should be stored, a verbatim copy is put in a special segment file with the extension .fdt. This is entirely orthogonal to indexing the tokens, which are what search operates on. So you can store and index, store but

Re: Details on setting block parameters for Lucene41PostingsFormat

2015-01-10 Thread Erick Erickson
Tom: I'll be very interested to see your final numbers. I did a worst-case test at one point and saw a 2/3 reduction, but that was deliberately worst case, I used a bunch of string/text types, did some faceting on them, etc, IOW not real-world at all. So it'll be cool to see what you come up

Re: Lucene search/count performance abrupt degradation (MMapDirectory)

2015-01-08 Thread Erick Erickson
Thanks for closing this off. On Thu, Jan 8, 2015 at 7:21 AM, Piotr Idzikowski piotridzikow...@gmail.com wrote: We have detected the problem: the excessive(!) amount of memory allocated to the Java heap. These articles helped us find the issue:

Re: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Erick Erickson
Should be, but it's a bit confusing because the query syntax is not pure boolean, so there's no set to take away the docs with entries in field 1, you need the match all docs bit, i.e. *:* -field1:[* TO *] (That's asterisk:asterisk -field1:[* TO *] in case the silly list interprets the asterisks
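The pure-negative-query pattern above can be captured in a trivial helper (the class and method names are hypothetical): because the query syntax is not pure boolean, a lone negative clause has no base set to subtract from, so you prefix the match-all `*:*`.

```java
public class PureNegative {
    // A lone negative clause like -field1:[* TO *] matches nothing on
    // its own; prefixing the match-all *:* gives it a set to subtract
    // from, yielding "docs where field1 is empty or unset".
    public static String negate(String clause) {
        return "*:* -" + clause;
    }
}
```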

Re: manually merging Directories

2014-12-23 Thread Erick Erickson
I doubt this is going to work. I have to ask why you're worried about the I/O; this smacks of premature optimization. Not only do the files have to be moved, but the right control structures need to be in place to inform Solr (well, Lucene) exactly what files are current. There's a lot of room for

Re: OutOfMemoryError indexing large documents

2014-11-25 Thread Erick Erickson
Well, 1) don't send 20 docs at once. Or send docs over some size N by themselves. 2) Seriously consider the utility of indexing a 100+M file. Assuming it's mostly text, lots and lots and lots of queries will match it, and it'll score pretty low due to length normalization. And you probably can't

Re: analyzers for Thai, Telugu, Vietnamese, Korean, Urdu,...

2014-11-08 Thread Erick Erickson
There are a bunch of different examples in the schema file that should point you in the right direction, whether these specific languages are supported is an open question though. Best, Erick On Sat, Nov 8, 2014 at 2:47 AM, Olivier Binda olivier.bi...@wanadoo.fr wrote: Hello What should I use

Re: Caused by: java.lang.OutOfMemoryError: Map failed

2014-11-07 Thread Erick Erickson
bq: Our server runs many hundreds (soon to be thousands) of indexes simultaneously This is actually kind of scary. How do you expect to fit many thousands of indexes into memory? Raising per-process virtual memory to unlimited still doesn't handle the amount of RAM the Solr process needs. It

Re: Caused by: java.lang.OutOfMemoryError: Map failed

2014-11-07 Thread Erick Erickson
has removed the related session from the UI. Yes, it’s a necessary kind of scary… -Brian On Nov 7, 2014, at 4:20 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Our server runs many hundreds (soon to be thousands) of indexes simultaneously This is actually kind of scary. How do

Re: Negative Wildcard Queries

2014-10-31 Thread Erick Erickson
, Prad Nelluru prn...@microsoft.com wrote: Thanks! Is it possible to say -"hello world*" ? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, October 30, 2014 10:15 PM To: java-user Subject: Re: Negative Wildcard Queries Actually, "hello world

Re: Negative Wildcard Queries

2014-10-31 Thread Erick Erickson
happen if they put in this. Thanks! -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, October 31, 2014 11:47 AM To: java-user Subject: Re: Negative Wildcard Queries Um not sure what that means. Are you looking to not return docs where hello

Re: Payload and Similarity Function: Always same value

2014-10-30 Thread Erick Erickson
Ralf: Here's an end-to-end Payloads example you can use to compare, although it sounds like you've already figured out your immediate problem.. https://lucidworks.com/blog/end-to-end-payload-example-in-solr/ Best, Erick On Thu, Oct 30, 2014 at 1:24 PM, Ralf Bierig ralf.bie...@gmail.com wrote:

Re: Negative Wildcard Queries

2014-10-30 Thread Erick Erickson
Actually, "hello world*" is possible with the ComplexPhraseQueryParser as of 4.8, see SOLR-1604 (yeah, it's been hanging around for a while). But to your question: Just prefix it with *:*, i.e. q=*:* -hello* Best, Erick On Thu, Oct 30, 2014 at 6:29 PM, Prad Nelluru prn...@microsoft.com wrote: Hi

Re: Making lucene indexing multi threaded

2014-10-28 Thread Erick Erickson
bq: When I loop the result set, I reuse the same Document instance. I really, really, _really_ hope you're calling new for the Document in the loop. Otherwise that single document will eventually contain all the data from your entire corpus! I'd expect some other errors to pop out if you are
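A plain-Java sketch of why reusing one mutable document object across the loop goes wrong. The Doc class here is a stand-in for Lucene's Document, not the real API: fields accumulate unless a fresh instance is created per row.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseBug {
    // Stand-in for a Lucene Document: added fields accumulate.
    public static class Doc { final List<String> fields = new ArrayList<>(); }

    // Returns how many fields the last "document" carries when indexed.
    // With reuse=true the single instance grows with every row.
    public static int lastDocFieldCount(String[] rows, boolean reuse) {
        Doc doc = new Doc();
        int last = 0;
        for (String row : rows) {
            if (!reuse) doc = new Doc();   // correct: fresh Document per row
            doc.fields.add(row);
            last = doc.fields.size();      // what would get indexed
        }
        return last;
    }
}
```

With three rows, the reused instance ends up carrying all three fields; the per-row instance carries one, as intended.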

Re: NOTICE: Seeking Moderators for java-user@lucene

2014-09-30 Thread Erick Erickson
Sure, keep me as a moderator. Sorry, I let the e-mail one slip by. Erick On Tue, Sep 30, 2014 at 11:03 AM, Steve Rowe sar...@gmail.com wrote: Please keep me on the list of moderators. My inattention this past week is temporary and non-vacation-related. - Steve On Sep 30, 2014, at 12:51 PM,

Re: Can lucene index tokenized files?

2014-09-15 Thread Erick Erickson
How are they delimited? If they're just a text stream, it seems all you need is a whitespace tokenizer. How are you going to search them though? Is your query submission process going to _also_ do the transformations or will you have to construct a query-time analysis chain that mimics the

Re: Please add to Lucene Wiki

2014-09-03 Thread Erick Erickson
Peter: I'd be glad to add you to the Wiki, but in order to do so I need your Wiki login. And are you interested in the Lucene Wiki, the Solr Wiki or both? Best, Erick On Wed, Sep 3, 2014 at 1:33 AM, Peter Oehler oeh...@axonic.net wrote: Hi! I am Peter, one of two founders of Lookeen

Re: Question regarding complex queries and long tail suggestions

2014-09-03 Thread Erick Erickson
Take a look at the ComplexPhraseQueryParser here: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser Best, Erick On Wed, Sep 3, 2014 at 12:41 PM, Mirko Sertic mirko.ser...@web.de wrote: Hi@all I am using Lucene 4.9 for a search application.

Re: indexing all suffixes to support leading wildcard?

2014-08-28 Thread Erick Erickson
The usual approach is to index to a second field but backwards. See ReverseStringFilter... Then all your leading wildcards are really trailing wildcards in the reversed field. Best, Erick On Thu, Aug 28, 2014 at 10:38 AM, Rob Nikander rob.nikan...@gmail.com wrote: Hi, I've got some
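The reversed-field trick can be sketched like this (helper names are illustrative; in a real analysis chain the index-time reversal would be done by ReverseStringFilter on the second field): reverse each token when indexing, then rewrite a leading-wildcard query into a trailing-wildcard query against the reversed field.

```java
public class ReversedWildcard {
    // Index time: store the token reversed in a second field.
    public static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    // Query time: a leading wildcard like *ing becomes the trailing
    // wildcard gni* against the reversed field, which the term
    // dictionary can satisfy with a cheap prefix scan.
    public static String rewriteLeadingWildcard(String q) {
        if (!q.startsWith("*")) return q;
        return reverse(q.substring(1)) + "*";
    }
}
```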

Re: Calculate Term Frequency

2014-08-19 Thread Erick Erickson
Hmmm, I'm not at all an expert here, but Solr has a function query termfreq that does what you're doing I think? I wonder if the code for that function query would be a good place to copy (or even make use of)? See TermFreqValueSource... Maybe not helpful at all, but... Erick On Tue, Aug 19,

Re: BitSet in Filters

2014-08-12 Thread Erick Erickson
bq: Unless, I can cache these filters in memory, the cost of constructing this filter at run time per query is not practical Why do you say that? Do you have evidence? Because lots and lots of Solr installations do exactly this and they run fine. So I suspect there's something you're not telling

Re: escaping characters

2014-08-11 Thread Erick Erickson
Take a look at the admin/analysis page for the field in question. The next bit of critical information is adding debug=query to the URL. The former will tell you what happens to the input stream at query and index time, the latter will tell you how the query got through the query parsing process.

Re: Performance StringCoding.decode

2014-08-05 Thread Erick Erickson
Well, that code is when you're reading the fields of documents off disk. Stored fields are compressed/decompressed automatically. So one question is what is your test doing? In other words, is it artificially hitting this? The theory is that this should only be done when you gather the final top

Re: Is housekeeping of Lucene indexes block index update but allow search ?

2014-08-04 Thread Erick Erickson
Right. 1) Occasionally the merge will require 2x the disk space (3x in compound file system). The merging is, indeed, done in the background, it is NOT a blocking operation. 2) n/a. It shouldn't block at all. Here's a cool video by Mike McCandless on the merging process, plus some explanations:

Re: How to set threshold for categorized document as Matched ,Partial Match, No Match with query.

2014-07-12 Thread Erick Erickson
Well, first be aware that the scores are not comparable across different queries. So any rule like don't show any score < X is meaningless. A _very_ good match at the top position (as judged by humans) for one query might score less than X, and a poor match for another query that was nonetheless

Re: How to capture number of page e number of line in file pdf indexed?

2014-07-06 Thread Erick Erickson
This isn't a Solr problem, but a PDF problem. The Tika project is what's used to extract the PDF info, including a bunch of metadata. Tika uses PDFBox, which at least allows you to extract a page at a time and maybe much more (I just barely looked at the interface)... You can use Tika from a

Re: Incremental Field Updates

2014-07-01 Thread Erick Erickson
This JIRA is complicated, don't really expect it in 4.9 as it's been hanging around for quite a while. Everyone would like this, but it's not easy. Atomic updates will work, but you have to have stored=true for all source fields. Under the covers this actually reads the document out of the stored

Re: QueryParserUtil, big query with wildcards - runs endlessly and produces heavy load

2014-06-26 Thread Erick Erickson
I suspect you're getting leading wildcard searches as well, which must do entire term scans unless you're doing the reverse trick. Replacing all successive whitespace gives you:
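The whitespace normalization the snippet breaks off at is a one-liner; a sketch (class name hypothetical): collapse any run of whitespace to a single space before handing the string to the query parser.

```java
public class Whitespace {
    // Collapse runs of spaces/tabs/newlines to a single space and
    // trim the ends, so the parser sees clean single-space-separated terms.
    public static String collapse(String q) {
        return q.trim().replaceAll("\\s+", " ");
    }
}
```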

Re: Custom Sorting

2014-06-25 Thread Erick Erickson
that allows one to write their own sort on a field that is not indexed by Lucene. Thanks again, --- Thanks n Regards, Sandeep Ramesh Khanzode On Wednesday, June 25, 2014 1:21 AM, Erick Erickson erickerick...@gmail.com wrote: I'm a little confused here. Sure

Re: Custom Sorting

2014-06-24 Thread Erick Erickson
I'm a little confused here. Sure, sorting on a number of fields will increase memory, the basic idea here is that you need to cache all the sort values (plus support structures) for performance reasons. If you create your own custom sort that goes out to a DB and gets the doc, you have to be

Re: timing merges

2014-06-12 Thread Erick Erickson
Michael is, of course, the Master of Merges... I have to ask, though, have you demonstrated to your satisfaction that you're actually seeing a problem? And that fewer merges would actually address that problem? 'cause this might be an XY problem Best, Erick On Thu, Jun 12, 2014 at 4:11 AM,

Re: timing merges

2014-06-12 Thread Erick Erickson
for a long time. Currently, the following settings are applied. TieredMergePolicy logMergePolicy = new TieredMergePolicy(); logMergePolicy.setSegmentsPerTier(1000); conf.setMergePolicy(logMergePolicy); What's a good way to resolve this? Regards Jamie On 2014/06/12, 4:04 PM, Erick Erickson

Re: timing merges

2014-06-12 Thread Erick Erickson
at a time. Regards Jamie On 2014/06/12, 4:39 PM, Erick Erickson wrote: What version of Solr/Lucene? Merging is supposed to be happening in the background for quite a while, so I'd be surprised if this was really the culprit unless you're on an older version of Lucene. See: http

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-30 Thread Erick Erickson
Try a cURL statement like: curl "http://localhost:8983/solr/update/extract?literal.id=doc33&captureAttr=true&defaultField=text" -F myfile=@testRTFVarious.rtf first, then work up to the post.jar bits... Two cautions: 1) make sure to commit afterwards. Something like

Re: MultiReader docid reliability

2014-05-30 Thread Erick Erickson
If you do an optimize, btw, the internal doc IDs may change. But _why_ do you want to keep them? You may have very good reasons, but it's not clear that this is necessary/desirable from what you've said so far... Best, Erick On Fri, May 30, 2014 at 7:49 AM, Nicola Buso nb...@ebi.ac.uk

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-30 Thread Erick Erickson
Hmmm, you might want to move this over to the Solr user's list. This list is lucene, which doesn't have anything to do with post.jar ;)... On Fri, May 30, 2014 at 8:25 AM, Erick Erickson erickerick...@gmail.com wrote: Try a cURL statement like: curl http://localhost:8983/solr/update

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-28 Thread Erick Erickson
bq: We did a detailed analysis for each step and observed that indexing per RTF file(i.e using path and content(with File Reader)) happened at the same millisecond and On an average it took 95millisec for each file to get indexed and took anywhere between 200 to 500millisec for file to get

Re: NewBie To Lucene || Perfect configuration on a 64 bit server

2014-05-26 Thread Erick Erickson
bq: We don’t want to search on the complete document store Why not? Alexandre's comment is spot on. For 500 docs you could easily form a filter query like fq=id1 OR id2 OR id3 (solr-style, but easily done in Lucene). You get these IDs from the DB search. This will still be MUCH faster than

Re: IndexWriter.addIndexes() multithread correct?

2014-05-22 Thread Erick Erickson
right, for docs with the same score, ties are broken by the internal Lucene ID. This may even change _on the same node_ due to merges! If you want to control this, consider always specifying a secondary sort by, say, your id field if you have one, or date stamp or.. Best, Erick On Thu, May
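A plain-Java comparator as an analogy for the deterministic secondary sort (in Lucene/Solr this would be an extra SortField on your id field; the Hit class here is illustrative): score descending first, then id ascending, so tied documents come back in the same order on every node regardless of internal doc IDs.

```java
import java.util.Arrays;
import java.util.Comparator;

public class TieBreakSort {
    public static class Hit {
        public final float score;
        public final String id;
        public Hit(float score, String id) { this.score = score; this.id = id; }
    }

    // Primary: score descending. Secondary: id ascending.
    // The id tie-break makes the ordering stable across merges/nodes.
    public static final Comparator<Hit> STABLE =
        Comparator.<Hit>comparingDouble(h -> -h.score)
                  .thenComparing(h -> h.id);
}
```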

Re: Multi-thread indexing, should the commit be called from each thread?

2014-05-21 Thread Erick Erickson
I'll be more emphatic than Shai; you should _definitely_ not commit from each thread, especially if you are doing a hard commit with openSearcher=true or a soft commit. In either case you open a new searcher, which fires all your autowarming queries. IOW, they're expensive operations. More

Re: ContributorsGroup Add

2014-04-10 Thread Erick Erickson
You're already in the Solr contributors group, are you asking to be added to the Lucene contributors group too? Erick On Thu, Apr 10, 2014 at 10:34 AM, Keith Mericle kmeri...@innoventsolutions.com wrote: Hello, My userID is KeithMericle. I would like to be added to the ContributorsGroup

Re: Index size for Same DataSet.

2014-03-25 Thread Erick Erickson
You're probably fine. Part of indexing is merging segments, and when segments are merged the data from deleted (or updated) documents is reclaimed. Any slight variance in the commit algorithm will potentially reclaim more or less space. What happens if you optimize (forceMerge) as a final step?

Re: How to implement and search?

2014-03-01 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists Then work your way through: https://lucene.apache.org/solr/4_7_0/tutorial.html If you're having problems with the tutorial, tell us what they are. You'll get a _lot_ more help if you ask specific questions that show you're trying

Re: updateDocument (somtimes) no longer deleting documents after Update to 4.6

2014-02-24 Thread Erick Erickson
I suspect you're finding the old doc that is simply marked as deleted. Did you check for that? One quick way to see if this is even in the right ballpark would be to do a forceMerge. If the problem disappears, then this is relevant I'd guess. Warning: The operative word here is guess, I haven't

Re: Lucene performance

2014-01-25 Thread Erick Erickson
You'll have to do some tuning with that kind of ingestion rate, and you're talking about a significant size cluster here. At 172M documents/day or so, you're not going to store very many days per node. Storing doesn't make much of any difference as far as search speed is concerned, the raw data

Re: exporting a query to String with default operator = AND ?

2014-01-24 Thread Erick Erickson
First of all, query.toString is not idempotent. You cannot count on feeding the result of query.toString back into the query parser and getting the same query, so that's out. Not quite sure what the right solution is, though. Best, Erick On Fri, Jan 24, 2014 at 11:29 AM, Olivier Binda

Re: Presence of uncommitted changes

2014-01-17 Thread Erick Erickson
You might want to look at the soft/hard commit options for insuring data integrity vs. latency. Here's a blog on this topic at the Solr level, but all the Solr stuff is realized at the Lucene level eventually, so

Re: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

2014-01-15 Thread Erick Erickson
the lengths of fields are encoded and lose some precision. So I suspect the length of the field calculated for the two documents are the same after encoding. Adding debug=all to the query will show you if this is the case. Best Erick On Wed, Jan 15, 2014 at 3:39 AM, andy yhl...@sohu.com wrote:

Re: Custom Tokenizer

2013-12-05 Thread Erick Erickson
You can also string together one of a myriad of TokenFilters, see: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters I'd recommend spending some time on the admin/analysis page to understand what all the combinations do. I'd also recommend against dealing with punctuation etc by using

Re: Deletion of Index not happening in Lucene 4.3

2013-11-29 Thread Erick Erickson
Did you open a new writer to search with? Erick On Fri, Nov 29, 2013 at 6:23 AM, VIGNESH S vigneshkln...@gmail.com wrote: Hi, I searched the word again and document is appearing and iam getting the hit On Fri, Nov 29, 2013 at 4:26 PM, Ian Lea ian@gmail.com wrote: How do you know

Re: Deletion of Index not happening in Lucene 4.3

2013-11-29 Thread Erick Erickson
Bah. "open a new writer" should be "open a new searcher" On Fri, Nov 29, 2013 at 8:36 AM, Erick Erickson erickerick...@gmail.com wrote: Did you open a new writer to search with? Erick On Fri, Nov 29, 2013 at 6:23 AM, VIGNESH S vigneshkln...@gmail.com wrote: Hi, I searched the word again

Re: Scanning through inverted index

2013-11-27 Thread Erick Erickson
Probably should explain what your end goal here is. Reconstructing the entire document? Just finding out what documents a few words belong to? The former will be painful and lossy, Luke does that for instance. FWIW, Erick On Mon, Nov 25, 2013 at 11:54 AM, Michael Berkovsky

Re: Alphanumeric Field Comparison : Lucene 4.5

2013-11-27 Thread Erick Erickson
If this is your complete pattern, can you index two different fields, one text and one numeric, say name_text_sort that holds Bay name_int_sort (make sure it's a number field!) that holds the 1, 2, 11, etc. Then just do a primary sort on name_text_sort and secondary sort on name_int_sort? FWIW,
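A sketch of splitting a value like "Bay 11" into the two proposed sort fields (the field names name_text_sort / name_int_sort come from the mail; the parsing helper and regex are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSortFields {
    // Text prefix (lazy) + optional space + trailing digits.
    private static final Pattern PAT = Pattern.compile("(\\D*?)\\s*(\\d+)$");

    // "Bay 11" -> { "Bay", 11 }: the text part would be indexed into
    // name_text_sort, the numeric part into name_int_sort, then sort
    // primary on the text field and secondary on the numeric field.
    public static Object[] split(String value) {
        Matcher m = PAT.matcher(value);
        if (!m.matches()) return new Object[]{ value, null };
        return new Object[]{ m.group(1), Integer.parseInt(m.group(2)) };
    }
}
```

Because the trailing number is indexed as a real number, "Bay 2" sorts before "Bay 11" without any string tricks.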

Re: Alphanumeric Field Comparison : Lucene 4.5

2013-11-27 Thread Erick Erickson
are the same width, so Bay 1 < Bay 10. Or even left-pad the digits with 0; the users wouldn't see that unless they used the terms component or something. On Wed, Nov 27, 2013 at 8:51 AM, Erick Erickson erickerick...@gmail.com wrote: If this is your complete pattern, can you index two different
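The zero-padding alternative can be sketched as follows (helper name and width are illustrative): pad every digit run to a fixed width in the sort field so that plain string order agrees with numeric order.

```java
public class PadSortKey {
    // Left-pad every digit run to a fixed width so string comparison
    // matches numeric comparison: "Bay 2" -> "Bay 000002" < "Bay 000010".
    public static String pad(String value, int width) {
        return java.util.regex.Pattern.compile("\\d+").matcher(value)
            .replaceAll(m -> String.format("%0" + width + "d",
                                           Long.parseLong(m.group())));
    }
}
```

Requires Java 9+ for `Matcher.replaceAll(Function)`; the padded form lives only in the hidden sort field, so users never see it unless they inspect the terms.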

Re: lucene enrypted index

2013-11-20 Thread Erick Erickson
Use an encrypting filesystem rather than encrypt the index IMO. Here's the problem. Any encryption process that you could use for encoding short tokens that you can then search is easily broken (ask Adobe about that!). Wildcards won't work. Consider that you've indexed (encrypted) running and

Re: WhitespaceAnalyzer vs StandardAnalyzer

2013-11-18 Thread Erick Erickson
-Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, November 15, 2013 4:45 PM To: java-user Subject: Re: WhitespaceAnalyzer vs StandardAnalyzer Well, your example will work exactly as you want. And if your input is strictly controlled, that's fine

Re: WhitespaceAnalyzer vs StandardAnalyzer

2013-11-15 Thread Erick Erickson
Well, your example will work exactly as you want. And if your input is strictly controlled, that's fine. But if you're putting in text, for instance, punctuation will be part of the token. I.e. in the sentence just before this one, token would not be found, but token. would. The admin/analysis

Re: Does a Lucene Filter reduce the search space of the underlying Query?

2013-11-12 Thread Erick Erickson
Not quite sure what you're after here. numDocs and docFreq are index-wide numbers, they're not re-calculated on a per-query basis. AFAIK, filters have nothing at all to do with these numbers. Why do you care? What is it that you'd like to behave differently and why would that be good? Or did I

Re: Filter by tags

2013-11-09 Thread Erick Erickson
You probably want to look at minimum should match (mm) in edismax... Best, Erick On Sat, Nov 9, 2013 at 6:33 PM, Laécio Freitas Chaves laeciofrei...@gmail.com wrote: Hi, I'm needing to filter a search for images using tags. For example, a search for the museum, house, art and sea return

Re: Twitter analyser

2013-11-05 Thread Erick Erickson
If your universe of items you want to match this way is small, consider something akin to synonyms. Your indexing process emits two tokens, with and without the @ or #, which should cover your situation. FWIW, Erick On Tue, Nov 5, 2013 at 2:40 AM, Stéphane Nicoll stephane.nic...@gmail.com wrote:

Re: Twitter analyser

2013-11-05 Thread Erick Erickson
(but performance/scalability is a concern). I have the control over the query. Another solution would be to translate a query on foo to foo or #foo or @foo WDYT? Thanks! S. On Tue, Nov 5, 2013 at 2:17 PM, Erick Erickson erickerick...@gmail.com wrote: If your universe of items you want to match

Re: Query performance in Lucene 4.x

2013-09-27 Thread Erick Erickson
Hmmm, since 4.1, fields have been stored compressed by default. I suppose it's possible that this is a result of compressing/uncompressing. What happens if 1) you enable lazy field loading 2) don't load any fields? FWIW, Erick On Thu, Sep 26, 2013 at 10:55 AM, Desidero desid...@gmail.com wrote:

Re: Search in a specific ScoreDoc result

2013-09-17 Thread Erick Erickson
Why not? You can use a standard query as a filter query from the Solr side, so it's got to be possible in Lucene. What about using filters doesn't seem to work for this case? Best, Erick On Tue, Sep 17, 2013 at 6:54 AM, David Miranda david.b.mira...@gmail.com wrote: Hi, I want to do a kind

Re: Regarding Compression Tool

2013-09-14 Thread Erick Erickson
content) than the actual documents size. I thought that I can use the CompressionTool to minimize the memory size. You can help, if there is any possiblities or way to store the entire content and to use the highlighter feature. Thankyou On Fri, Sep 13, 2013 at 6:54 PM, Erick Erickson

Re: Regarding Compression Tool

2013-09-13 Thread Erick Erickson
Compression is for the _stored_ data, which is not searched. Ignore the compression and insure that you index the data. The compressing/decompressing for looking at stored values is, I believe, done at a very low level that you don't need to care about at all. If you index the data in the field,

Re: Profiling Solr Lucene for query

2013-09-08 Thread Erick Erickson
Why have 36 shards for just a few million docs each? That's the first thing I'd look at. How many physical boxes? How much memory per JVM? How many JVMs? How much physical memory per box? 'Cause this seems excessive time-wise for loading the info. Best Erick On Sun, Sep 8, 2013 at 7:03 AM,

Re: Expunge deleting using excessive transient disk space

2013-09-08 Thread Erick Erickson
How much free disk space do you have when you try the merge? Is this a typo? <int name="|maxMergeAtOnce">2</int> Note the | in the name. Best Erick On Sun, Sep 8, 2013 at 7:26 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi again, In order to delete part of my index I run a delete by query that

Re: Smart Chinese Analyzer Performance

2013-09-06 Thread Erick Erickson
Well, various people have measured between a 50% and 70+% reduction in memory used for identical data, so I'd say so. The CHANGES.txt is where I'd look to see if anything mentioned is worth your time. Not to mention SolrCloud... Erick On Fri, Sep 6, 2013 at 3:41 PM, Darren Hoffman

Re: Stream Closed Exception and Lock Obtain Failed Exception while reading the file in chunks iteratively.

2013-09-02 Thread Erick Erickson
StreamClosedException and LockObtainFailedException. So any help on this will be deeply appreciated.. On 9/1/2013 5:46 PM, Erick Erickson wrote: I really recommend you restructure your program, it's hard to follow. For instance, you open a new IndexWriter every time through the while (flags) loop

Re: Making lucene indexing multi threaded

2013-09-02 Thread Erick Erickson
Stop. Back up. Test. <G> The very _first_ thing I'd do is just comment out the bit that actually indexes the content. I'm guessing you have some loop like: while (more files) { read the file; transform the data; create a Lucene document; index the document } Just comment out the index

Re: Stream Closed Exception and Lock Obtain Failed Exception while reading the file in chunks iteratively.

2013-09-01 Thread Erick Erickson
I really recommend you restructure your program, it's hard to follow. For instance, you open a new IndexWriter every time through the while (flags) loop. You only close it in the if (iwcTemp1.getConfig().getOpenMode() == OpenMode.CREATE_OR_APPEND) { case. That may be the root of your problem

Re: Lucene index customization

2013-08-24 Thread Erick Erickson
Have you looked at the whole flexible indexing functionality? Here's a couple of places to start: http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/ http://www.slideshare.net/LucidImagination/flexible-indexing-in-lucene-40 I'm still not quite sure why you want to do

Re: IllegalStateException in SpanTermQuery

2013-08-14 Thread Erick Erickson
As Mike said, this is an intended change. The test passed in 3.5 because there was no check if Span queries were working on a field that supported them. In 4.x this is checked and an error is thrown. Best Erick On Wed, Aug 14, 2013 at 12:22 AM, Yonghui Zhao zhaoyong...@gmail.com wrote: In our

Re: Query

2013-08-11 Thread Erick Erickson
You probably want something more like "electro hydraulic power assist steering"~5, quote marks and all. And note that it's not quite within 5 positions, it's more up to five single-word transpositions, which is kind of a slippery concept. "electro hydraulic assist power steering"~5 would require 1

Re: Phonetic Filter

2013-08-06 Thread Erick Erickson
Take a look at the BeiderMorseFilterFactory perhaps? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory Here's a mention that it explicitly works for French: http://docs.lucidworks.com/display/solr/Phonetic+Matching But I admit there's not much here on _how_,

Re: need searcher example to read indexes generated by solr

2013-07-27 Thread Erick Erickson
Have you looked at either the Blacklight or Velocity Response Writer? This latter is shipped standard with Solr, access it by the /browse handler. It's pretty easily customizable Blacklight is here: http://projectblacklight.org/ Best Erick On Thu, Jul 25, 2013 at 1:14 PM, mlotfi

Re: Tokenize String using Operators(Logical Operator, : operator etc)

2013-07-23 Thread Erick Erickson
I really don't see what the use-case here is. When you say later, what does that mean? You're indexing what and querying how? Best Erick On Tue, Jul 23, 2013 at 7:19 AM, dheerajjoshim dheeraj.ma...@gmail.com wrote: Greetings, I am looking a way to tokenize the String based on Logical

Re: Trying to search java.lang.NullPointerException in log file.

2013-07-22 Thread Erick Erickson
Even though you're on the Lucene list, consider installing Solr just to see the admin/analysis page to see how your index and query analysis works. There's no reason you couldn't split this up on periods into separate words and then just use phrase query to find java.lang.NullPointerException, but
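The period-splitting idea is simple to sketch (helper name illustrative): split the qualified name into tokens at index time, and a phrase query over the same parts then matches the original string.

```java
public class DotSplitter {
    // Index time: "java.lang.NullPointerException" becomes the tokens
    // java / lang / NullPointerException; a phrase query over these
    // parts then finds the original dotted name in the log text.
    public static String[] tokens(String term) {
        return term.split("\\.");
    }
}
```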

Re: Partial word match using n-grams

2013-07-19 Thread Erick Erickson
Well, it depends on what you put between your tokenizer and ngram filter. Putting WordDelimiterFilterFactory would break up on the underscore (and lots of other things besides) and submit the separate tokens which would then be n-grammed separately. That has other implications, of course, but you

Re: What is text searching algorithm in Lucene 4.3.1

2013-07-17 Thread Erick Erickson
Note: as of Lucene 4.x, you can plug in your own scoring algorithm, it ships with several variants (e.g. BM25) so you can look at the pluggable scoring where all the code for the various algorithms is concentrated. Erick On Wed, Jul 17, 2013 at 12:40 AM, Jack Krupansky j...@basetechnology.com

Re: Lucene in Action

2013-07-10 Thread Erick Erickson
Right, unfortunately, there's nothing that I know of that's super-recent. Jack Krupansky is e-publishing a book on Solr, which will be more up to date, but I don't know how thoroughly it dives into the underlying Lucene code. Otherwise, I think the best thing is to tackle a real problem (perhaps

Re: Questions about doing a full text search with numeric values

2013-07-06 Thread Erick Erickson
than using wildcards. Or am I missing a subtle difference? Thank you. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, July 01, 2013 5:23 AM To: java-user Subject: Re: Questions about doing a full text search with numeric values

Re: Questions about doing a full text search with numeric values

2013-07-01 Thread Erick Erickson
WordDelimiterFilter(Factory if you're experimenting with Solr as Jack suggests) will fix a number of your cases since it splits on case change and numeric/alpha changes. There are a bunch of ways to recombine things so be aware that it'll take some fiddling with the parameters. As Jack suggests,

Re: Securing stored data using Lucene

2013-06-23 Thread Erick Erickson
Security has at least two parts. First, allowing users access to specific documents, for which Alon's comments are the usual way to do this in Solr/Lucene. But the patch you referenced doesn't address this, it's all about encrypting the data stored on disk. This is useful for keeping people who

Re: build of trunk hangs

2013-06-22 Thread Erick Erickson
What Adrien said. I've had this happen when I kill a build partway through (but just sometimes). If you're on a fast network, I'll sometimes just delete the entire .ivy2 cache, but that's a little drastic. Erick On Thu, Jun 20, 2013 at 9:15 AM, Adrien Grand jpou...@gmail.com wrote: Hi, On

Re: Why I can not interrupt the search?

2013-06-05 Thread Erick Erickson
Have you seen TimeLimitingCollector? Best Erick On Wed, Jun 5, 2013 at 6:39 AM, 朱彦安 shaco@gmail.com wrote: Hello! When a search hits a lot of documents, I want to return data immediately once 2000 docs are hit. I cannot find such a method in Lucene. I have tried: public int score(Collector
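TimeLimitingCollector wraps any other collector and aborts the search once a time budget is spent; a sketch (assuming a caller-supplied `searcher` and `query`, and noting that `TopScoreDocCollector.create()` signatures vary across versions):

```java
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.util.Counter;

public class TimedSearch {
    static TopDocs searchWithBudget(IndexSearcher searcher, Query query, long millis) throws Exception {
        // The global counter is advanced by a shared TimerThread that Lucene starts lazily.
        Counter clock = TimeLimitingCollector.getGlobalCounter();
        TopScoreDocCollector top = TopScoreDocCollector.create(2000);
        Collector limited = new TimeLimitingCollector(top, clock, millis);
        try {
            searcher.search(query, limited);
        } catch (TimeLimitingCollector.TimeExceededException e) {
            // Budget exceeded: whatever was collected before the timeout is still in `top`.
        }
        return top.topDocs();
    }
}
```

Note this gives partial results on timeout rather than "the first 2000 hits instantly"; stopping after N hits would instead be a collector that throws after `collect()` has been called N times.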

Re: Highlighting search words in full document

2013-04-07 Thread Erick Erickson
Sounds like what you want to do is: with each verse, store the chapter ID. This could be the ID of another document. There's no requirement that all docs in an index have the same structure. In this case, you could have a type field in each doc with values like verse and chapter. For your verse
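The verse/chapter layout above might look like this as a sketch (field names such as "type", "chapter_id", and "text" are invented for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class VerseDocs {
    // A verse doc carries a pointer (chapter_id) back to its chapter doc.
    static Document verse(String chapterId, String verseText) {
        Document doc = new Document();
        doc.add(new StringField("type", "verse", Field.Store.YES));
        doc.add(new StringField("chapter_id", chapterId, Field.Store.YES));
        doc.add(new TextField("text", verseText, Field.Store.YES));
        return doc;
    }

    // A chapter doc holds the full chapter text under its own ID.
    static Document chapter(String chapterId, String chapterText) {
        Document doc = new Document();
        doc.add(new StringField("type", "chapter", Field.Store.YES));
        doc.add(new StringField("id", chapterId, Field.Store.YES));
        doc.add(new TextField("text", chapterText, Field.Store.YES));
        return doc;
    }
}
```

At query time a filter on `type:verse` or `type:chapter` keeps the two document shapes from mixing in one result list.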

Re: Highlighting search words in full document

2013-04-07 Thread Erick Erickson
in the entire chapter that were highlighted in the selected verse. Thanks! Sent from my iPhone On Apr 7, 2013, at 5:38 AM, Erick Erickson erickerick...@gmail.com wrote: Sounds like what you want to do is 1 with each verse, store the chapter ID. This could be the ID of another document. There's

Re: Consultant Inquiry

2013-03-29 Thread Erick Erickson
There are a bunch of possibilities listed here: http://wiki.apache.org/solr/Support Best Erick On Thu, Mar 28, 2013 at 2:32 PM, Nick Hoffman njhof...@gmail.com wrote: I'm looking for a consultant for Lucene Solr. Our team of 3 extended OpenBravo (Java ERP) with a built-in Shopping Cart

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Erick Erickson
@Tom - done On Mon, Mar 25, 2013 at 12:48 PM, Tom Burton-West tburt...@umich.eduwrote: Please add tburtonw to contributors Tom Burton-West tburtonw at umich dot edu Tom On Mon, Mar 25, 2013 at 9:05 AM, Steve Rowe sar...@gmail.com wrote: On Mar 25, 2013, at 8:49 AM, Rafał Kuć

Re: Assert / NPE using MultiFieldQueryParser

2013-03-25 Thread Erick Erickson
@Simon did I actually catch a reference to: http://xkcd.com/722/ ??? that's one of my all-time favorites on XKCD, I think it describes my entire professional life Bobby Tables is another (http://xkcd.com/327/). There, I've done my bit to stop productivity today! Erick On Mon, Mar 25,

Re: [ANNOUNCE] Wiki editing change

2013-03-25 Thread Erick Erickson
) to tburtonw. Steve On Mar 25, 2013, at 1:19 PM, Erick Erickson erickerick...@gmail.com wrote: @Tom - done On Mon, Mar 25, 2013 at 12:48 PM, Tom Burton-West tburt...@umich.edu wrote: Please add tburtonw to contributors Tom Burton-West tburtonw at umich dot edu Tom On Mon, Mar 25

Re: Accent insensitive analyzer

2013-03-24 Thread Erick Erickson
ISOLatin1AccentFilter has been deprecated for quite some time; ASCIIFoldingFilter is preferred. Best Erick On Fri, Mar 22, 2013 at 2:59 PM, Jerome Blouin jblo...@expedia.com wrote: Thanks. I'll check that later. -Original Message- From: Sujit Pal [mailto:sujitatgt...@gmail.com]
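An accent-insensitive analyzer built on ASCIIFoldingFilter might look like this (a sketch against the Lucene 5.x `createComponents(String)` signature; 4.x adds a `Reader` argument, and `LowerCaseFilter` moved packages in later versions):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class AccentInsensitiveAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new ASCIIFoldingFilter(result); // "café" -> "cafe", "über" -> "uber"
        return new TokenStreamComponents(source, result);
    }
}
```

The same analyzer must be used at index and query time, otherwise "café" in the index never matches "cafe" in the query.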

Re: Bulk indexing and delete old index files

2013-03-05 Thread Erick Erickson
If you kept an indexed_time field, you could always just index to the same instance and then do a delete by query, something like <delete><query>timestamp:[* TO NOW/DAY]</query></delete>, commit and go. That would delete everything indexed before midnight last night (NOW/DAY rounds down). Note, most of

Re: Is there a limit for a field size in Lucene 3.0.2

2013-02-21 Thread Erick Erickson
There's an overridable default of 10,000 tokens; that's the first place I'd look. I forget just how to set it to a higher value. Best Erick. P.S. Please don't hit reply to a message and change the title, but start an e-mail fresh. See: http://people.apache.org/~hossman/#threadhijack On Thu,
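In the Lucene 3.0.x API the thread asks about, the cap is set via `IndexWriter.MaxFieldLength`; a sketch (assuming a caller-supplied `Directory` and `Analyzer`):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class BigFieldWriter {
    // Lucene 3.0.x sketch: the default MaxFieldLength.LIMITED caps each field at 10,000 tokens.
    static IndexWriter open(Directory dir, Analyzer analyzer) throws Exception {
        return new IndexWriter(dir, analyzer,
                new IndexWriter.MaxFieldLength(100000)); // or IndexWriter.MaxFieldLength.UNLIMITED
    }
}
```

Tokens past the limit are silently dropped at index time, which is why a document can "contain" a term yet never match it in search.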

Re: Real-time Get and Atomic Updates for SolrJ

2013-01-31 Thread Erick Erickson
I haven't used it myself, but I did find this for atomic updates: http://www.mumuio.com/solrj-4-0-0-alpha-atomic-updates/ Don't know if there really is need for specific support in SolrJ for RTG, isn't that all over on the Solr side and automagic? Best Erick On Wed, Jan 30, 2013 at 5:47 PM,
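The SolrJ atomic-update pattern from that link boils down to sending a field value as a map whose key is the update operation; a sketch (SolrJ 4.x-era `SolrServer` API, with the field names invented for illustration):

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdate {
    static void setPrice(SolrServer server, String id, double newPrice) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);                                           // unique key of the doc to patch
        doc.addField("price", Collections.singletonMap("set", newPrice)); // "set" replaces; "add"/"inc" also exist
        server.add(doc);
        server.commit();
    }
}
```

No special RTG call is needed on the client for this to work, which matches the "automagic on the Solr side" point above, though atomic updates do require all fields to be stored (or in later versions, docValues-backed).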

Re: Tool for Lucene storage recovery

2013-01-21 Thread Erick Erickson
Maybe do the handling as an overridable method and make it abstract? That would give the skeleton of all the recovery stuff, but then require the user to implement the actual recovery? Just a thought Erick On Mon, Jan 21, 2013 at 9:06 AM, Michał Brzezicki mbrzezi...@gmail.com wrote: I don't

Re: Tool for Lucene storage recovery

2013-01-21 Thread Erick Erickson
P.S. Or just attach the code without your customized doc recovery stuff with a note about how to carry it forward? That way someone could pick it up if interested and generalize it. Best Erick On Mon, Jan 21, 2013 at 12:37 PM, Erick Erickson erickerick...@gmail.com wrote: Maybe do the handling
