Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
If your ranges are always contiguous, you could index two fields: range-start and range-end and then perform queries like: range-start:[* TO 30] AND range-end:[5 TO *] If you have multiple ranges which could have gaps in between then you need something more complicated :) On 02/27/2012
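
The two-field approach above can be sketched in plain Java. This is only an illustration, assuming the field names range-start and range-end from the message; the class and method names are hypothetical and no Solr library is involved.

```java
// Hypothetical helper: names other than the two field names are illustrative only.
final class RangeOverlap {
    /** Builds the two-clause query matching documents whose range overlaps [queryStart, queryEnd]. */
    static String buildQuery(int queryStart, int queryEnd) {
        return "range-start:[* TO " + queryEnd + "] AND range-end:[" + queryStart + " TO *]";
    }

    /** The same overlap condition expressed directly: the document starts before the query ends, and ends after the query starts. */
    static boolean overlaps(int docStart, int docEnd, int queryStart, int queryEnd) {
        return docStart <= queryEnd && docEnd >= queryStart;
    }
}
```

Calling RangeOverlap.buildQuery(5, 30) reproduces the query string from the message; overlaps shows why the two clauses together test for intersection of a single contiguous range.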

Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
I think your example case would end up like this: <doc> ... <str name="start-range">1</str> <-- single-valued range field <str name="end-range">15</str> ... </doc> On 02/27/2012 04:26 PM, federico.wachs wrote: Michael thanks a lot for your quick answer, but I'm not exactly sure I understand

Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
No; contiguous means there are no gaps between them. You need something like what you described initially. Another approach is to de-normalize your data so that you have a single document for every range. But this might or might not suit your application. You haven't said anything about the

Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
Yes, I see - I think your best bet is to index every day as a distinct value. Don't worry about having 100's of values. -Mike On 02/27/2012 05:11 PM, federico.wachs wrote: This is used on an apartment booking system, and what I store as solr documents can be seen as apartments. These
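
Indexing every day as a distinct value amounts to expanding each range into its individual days before sending the document. A minimal sketch, assuming days are represented as plain integers (the message does not say how they are encoded); the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: expands an inclusive day range into one value per day,
// suitable for supplying as the values of a multi-valued field.
final class DayExpander {
    static List<Integer> expand(int firstDay, int lastDay) {
        List<Integer> days = new ArrayList<>();
        for (int day = firstDay; day <= lastDay; day++) {
            days.add(day);
        }
        return days; // empty when lastDay < firstDay
    }
}
```

A document with several non-contiguous ranges would simply concatenate the expansions of each range, which is what makes gaps between ranges unproblematic.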

Re: Is there a way to implement a IntRangeField in Solr?

2012-02-27 Thread Mike Sokolov
I don't know if this would help with OOM conditions, but are you using a tint type field for this? That should be more efficient to search than a regular int or string. -Mike On 02/27/2012 05:27 PM, federico.wachs wrote: Yeah that's what I'm doing right now. But whenever I try to index an

Re: StreamingUpdateSolrServer - exceptions not propagated

2012-03-27 Thread Mike Sokolov
On 3/27/2012 11:14 AM, Mark Miller wrote: On Mar 27, 2012, at 10:51 AM, Shawn Heisey wrote: On 3/26/2012 6:43 PM, Mark Miller wrote: It doesn't get thrown because that logic needs to continue - you don't necessarily want one bad document to stop all the following documents from being added.

Re: Populating 'multivalue' fields (m:1 relationships)

2012-05-11 Thread Mike Sokolov
You can specify a solr field as multi-valued, and then supply multiple values for it. What that really does is concatenate all the values with a positional gap between them to prevent phrases and other positional queries from traversing the boundary between the distinct values. -Mike On

Re: slow highlighting because of stemming

2011-07-29 Thread Mike Sokolov
I'm not sure I would identify stemming as the culprit here. Do you have very large documents? If so, there is a patch for FVH committed to limit the number of phrases it looks at; see hl.phraseLimit, but this won't be available until 3.4 is released. You can also limit the amount of each

ideas for versioning query?

2011-08-01 Thread Mike Sokolov
A customer has an interesting problem: some documents will have multiple versions. In search results, only the most recent version of a given document should be shown. The trick is that each user has access to a different set of document versions, and each user should see only the most recent

Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
Thanks, Tomas. Yes we are planning to keep a current flag in the most current document. But there are cases where, for a given user, the most current document is not that one, because they only have access to some older documents. I took a look at http://wiki.apache.org/solr/FieldCollapsing

Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
I think a 30% increase is acceptable. Yes, I think we'll try it. Although our case is more like # groups ~ # documents / N, where N is a smallish number (~1-5?). We are planning for a variety of different index sizes, but aiming for a sweet spot around a few M docs. -Mike On 08/01/2011

Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Mike Sokolov
If you want to avoid re-indexing, you could consider building a synonym file that is generated using your rule set, and then using that to expand your queries. You'd need to get a list of all terms in your index and then process them to generate synonyms. Actually, I don't know how to get a

Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread Mike Sokolov
You have a few choices: 1) flatten your field structure - like your undesirable example, but wouldn't you want to have the document identifier as a field value also? 2) use phrase queries to make sure the key/value pairs are adjacent 3) use a join query That's all I can think of -Mike On

Re: Strategies for sorting by array, when you can't sort by array?

2011-08-03 Thread Mike Sokolov
Although you weren't very clear about it, it sounds as if you want the results to be sorted by a name that actually matched the query? In general that is not going to be easy, since it is not something that can be computed in advance and thus indexed. -Mike On 08/03/2011 10:39 AM, Olson,

Re: Index not getting refreshed

2011-09-15 Thread Mike Sokolov
Is it possible you have two solr instances running off the same index folder? This was a mistake I stumbled into early on - I was writing with one, and reading with the other, so I didn't see updates. -Mike On 09/15/2011 12:37 AM, Pawan Darira wrote: I am committing but not doing replication

Re: org.apache.pdfbox.pdmodel.PDPage Error

2011-10-25 Thread Mike Sokolov
On 10/24/2011 02:35 PM, MBD wrote: Is this really a stumper? This is my first experience with Solr and having spent only an hour or so with it I hit this barrier (below). I'm sure *I* am doing something completely wrong just hoping someone more familiar with the platform can help me identify

Re: How to delete a SOLR document if that particular data doesnt exist in DB?

2010-10-20 Thread Mike Sokolov
Since you are performing a complete reload of all of your data, I don't understand why you can't create a new core, load your new data, swap your application to look at the new core, and then erase the old one, if you want. Even so, you could track the timestamps on all your documents, which

Re: different results depending on result format

2010-10-21 Thread Mike Sokolov
quick follow-up: I also notice that the query from solrj gets version=1, whereas the admin webapp puts version=2.2 on the query string, although this param doesn't seem to change the xml results at all. Does this indicate an older version of solrj perhaps? -Mike On 10/21/2010 04:47 PM, Mike

Re: different results depending on result format

2010-10-22 Thread Mike Sokolov
? -Mike On 10/21/2010 04:47 PM, Mike Sokolov wrote: I'm experiencing something really weird: I get different results depending on whether I specify wt=javabin, and retrieve using SolrJ, or wt=xml. I spent quite a while staring at query params to make sure everything else is the same

Re: different results depending on result format

2010-10-22 Thread Mike Sokolov
looking into the virtual hosts config in tomcat; it seems as if there must indeed be another solr instance running; in fact I'm now concerned there might be two solr instances running against the same data folder. yargh. -Mike On 10/22/2010 09:05 AM, Mike Sokolov wrote: Yes - I really only have

Re: How do I this in Solr?

2010-10-27 Thread Mike Sokolov
Right - my point was to combine this with the previous approaches to form a query like: samsung AND android AND GPS AND word_count:3 in order to exclude documents containing additional words. This would avoid the combinatoric explosion problem others had alluded to earlier. Of course this
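
The query shape described above can be generated mechanically. A minimal sketch, assuming word_count is an indexed field holding the number of words in the document field and that the query terms are distinct; the class name is hypothetical.

```java
import java.util.List;

// Hypothetical helper: requires every term and pins the word count to the
// number of terms, so documents containing additional words are excluded.
final class ExactWordSetQuery {
    static String build(List<String> terms) {
        return String.join(" AND ", terms) + " AND word_count:" + terms.size();
    }
}
```

The query grows linearly with the number of terms, which is the point of the suggestion: no enumeration of term combinations is needed.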

Re: How do I this in Solr?

2010-10-27 Thread Mike Sokolov
seems like the only working idea. Maybe Varun could comment on the maximum numbers of terms that his queries will contain? Regards, Toke Eskildsen On Wed, 2010-10-27 at 15:02 +0200, Mike Sokolov wrote: Right - my point was to combine this with the previous approaches to form a query like

Re: Query question

2010-11-03 Thread Mike Sokolov
Another alternative (prettier to my eye), would be: (city:Chicago AND Romantic AND View)^10 OR (Romantic AND View) -Mike On 11/03/2010 09:28 AM, kenf_nc wrote: Unfortunately the default operator is set to AND and I can't change that at this time. If I do (city:Chicago^10 OR Romantic OR

Re: multi-core solr, specifying the data directory

2011-03-02 Thread Mike Sokolov
Yes - I commented out the dataDir element in solrconfig.xml and then got the expected behavior: the core used a data subdirectory in the core subdirectory. It seems like the problem arises from using the solrconfig.xml that's distributed as example/solr/conf/solrconfig.xml The

Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Suppose your analysis stack includes lower-casing, but your synonyms are only supposed to apply to upper-case tokens. For example, PET might be a synonym of positron emission tomography, but pet wouldn't be. -Mike On 04/26/2011 09:51 AM, Robert Muir wrote: On Tue, Apr 26, 2011 at 12:24 AM,

Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Yes, I see. Makes sense. It is a bit hard to see a bad case for your proposal in that light. Here is one other example; I'm not sure whether it presents difficulties or not, and may be a bit contrived, but hey, food for thought at least: Say you have set up synonyms between names and

Re: Searching for escaped characters

2011-04-28 Thread Mike Sokolov
StandardTokenizer will have stripped punctuation I think. You might try searching for all the entity names though: (agrave | egrave | omacron | etc... ) The names are pretty distinctive. Although you might have problems with Greek letters. -Mike On 04/28/2011 12:10 PM, Paul wrote: I'm

Re: Replicaiton Fails with Unreachable error when master host is responding.

2011-04-28 Thread Mike Sokolov
No clue. Try wireshark to gather more data? On 04/28/2011 02:53 PM, Jed Glazner wrote: Anybody? On 04/27/2011 01:51 PM, Jed Glazner wrote: Hello All, I'm having a very strange problem that I just can't figure out. The slave is not able to replicate from the master, even though the master is

updates not reflected in solr admin

2011-05-02 Thread Mike Sokolov
This is in 1.4 - we push updates via SolrJ; our application sees the updates, but when we use the solr admin screens to run test queries, or use Luke to view the schema and field values, it sees the database in its state prior to the commit. I think eventually this seems to propagate, but I'm

Re: updates not reflected in solr admin

2011-05-02 Thread Mike Sokolov
Thanks - we are issuing a commit via SolrJ; I think that's the same thing, right? Or are you saying really we need to do a separate commit (via HTTP) to update the admin console's view? -Mike On 05/02/2011 11:49 AM, Ahmet Arslan wrote: This is in 1.4 - we push updates via SolrJ; our

Re: updates not reflected in solr admin

2011-05-02 Thread Mike Sokolov
Ah - I didn't expect that. Thank you! On 05/02/2011 12:07 PM, Ahmet Arslan wrote: Thanks - we are issuing a commit via SolrJ; I think that's the same thing, right? Or are you saying really we need to do a separate commit (via HTTP) to update the admin console's view? Yes separate commit

Re: how to do offline adding/updating index

2011-05-10 Thread Mike Sokolov
I think the key question here is what's the best way to perform indexing without affecting search performance, or without affecting it much. If you have a batch of documents to index (say a daily batch that takes an hour to index and merge), you'd like to do that on an offline system, and

Re: how to do offline adding/updating index

2011-05-10 Thread Mike Sokolov
Thanks - that sounds like what I was hoping for. So the I/O during replication will have *some* impact on search performance, but presumably much less than reindexing and merging/optimizing? -Mike Master/slave replication does this out of the box, easily. Just set the slave to update on

Re: What is correct use of HTMLStripCharFilter in Solr 3.1

2011-05-12 Thread Mike Sokolov
It preserves the location of the terms in the original HTML document so that you can highlight terms in HTML. This makes it possible (for instance) to display the entire document, with all the search terms highlighted, or (with some careful surgery) to display formatted HTML (bold, italic,

document storage

2011-05-13 Thread Mike Sokolov
Would anyone care to comment on the merits of storing indexed full-text documents in Solr versus storing them externally? It seems there are three options for us: 1) store documents both in Solr and externally - this is what we are doing now, and gives us all sorts of flexibility, but doesn't

Re: document storage

2011-05-16 Thread Mike Sokolov
On 05/15/2011 11:48 AM, Erick Erickson wrote: Where are the documents coming from? Because storing them ONLY in Solr risks losing them if your index is somehow hosed. In our case, we generally have source documents and can reproduce the index if need be, but that's a good point. Storing

Re: boolean versus non-boolean search

2011-05-16 Thread Mike Sokolov
On 05/16/2011 09:24 AM, Dmitry Kan wrote: Dear list, Might have missed it from the literature and the list, sorry if so, but: SOLR 1.4.1 <solrQueryParser defaultOperator="AND"/> Consider the query: term1 term2 OR term1 term2 OR term1 term3 I think what's happening is that your query gets

Re: [POLL] How do you (like to) do logging with Solr

2011-05-16 Thread Mike Sokolov
We use log4j explicitly and find it irritating to deal with the built-in JDK logging default. We also have conflicts with other packages that have their own ideas about how to bind slf4j, so the less of this the better, IMO. The 1.6.1 no-op default behavior seems a bit unfortunate as

Re: [Contribution] Multiword Inline-Prefix Autocomplete Idea

2011-05-20 Thread Mike Sokolov
Cool! Suggestion: you might want to replace externalVal.toLowerCase().split(" "); with externalVal.toLowerCase().split("\\s+"); also I bet folks might have different ideas about what to do with hyphens, so maybe: externalVal.toLowerCase().split("[-\\s]+"); In fact why not make it a
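
The difference between the three split patterns mentioned above shows up as soon as the input contains repeated whitespace or hyphens. A small sketch using only java.lang.String; the class and method names are hypothetical, and only the regular expressions come from the message.

```java
// Hypothetical wrapper around the three split variants discussed in the thread.
final class TokenSplit {
    /** Splits on each single space; consecutive spaces produce empty tokens. */
    static String[] onSingleSpace(String s) {
        return s.toLowerCase().split(" ");
    }

    /** Splits on runs of any whitespace, so repeated spaces and tabs collapse. */
    static String[] onWhitespace(String s) {
        return s.toLowerCase().split("\\s+");
    }

    /** Splits on runs of whitespace and/or hyphens, so hyphenated words are broken apart. */
    static String[] onWhitespaceOrHyphen(String s) {
        return s.toLowerCase().split("[-\\s]+");
    }
}
```

For an input such as "Multi-Word" followed by two spaces and "Auto", the first variant yields an empty token between the words, the second yields two tokens, and the third yields three.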

Re: Solr Highlight Component

2011-05-24 Thread Mike Sokolov
A possible workaround is to re-fetch the documents in your result set with a query that is: +id=(id1 or id2 or ... id20) (highlight query) where id1..id20 are the doc ids in your result set. That would require two round-trips, though. -Mike On 05/24/2011 08:19 AM, Koji Sekiguchi wrote: (11/05/24
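
The second-round-trip query described above could be assembled along these lines. This is a sketch under assumptions: the message uses informal notation, so the field:(a OR b) form here is my rendering in standard Lucene query syntax, the unique key field is assumed to be named id, and the ids are assumed to need no escaping. The class name is hypothetical.

```java
import java.util.List;

// Hypothetical helper: restricts results to the given ids (required clause)
// and appends the highlight query as an optional clause.
final class RefetchQuery {
    static String build(List<String> ids, String highlightQuery) {
        return "+id:(" + String.join(" OR ", ids) + ") (" + highlightQuery + ")";
    }
}
```

The id clause is marked required so that only documents from the first result set come back; the highlight query is left optional so it does not drop any of them.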

Re: solr Invalid Date in Date Math String/Invalid Date String

2011-05-27 Thread Mike Sokolov
The * endpoint for range terms wasn't implemented yet in 1.4.1. As a workaround, we use very large and very small values. -Mike On 05/27/2011 12:55 AM, alucard001 wrote: Hi all I am using SOLR 1.4.1 (according to solr info), but no matter what date field I use (date or tdate) defined in

Re: Obtaining query AST?

2011-05-31 Thread Mike Sokolov
I believe there is a query parser that accepts queries formatted in XML, allowing you to provide a parse tree to Solr; perhaps that would get you the control you're after. -Mike On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote: Hi, I want to write my own query expander. It needs to obtain

Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov
Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction,

Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov
oops, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more

Re: Text field case sensitivity problem

2011-06-15 Thread Mike Sokolov
but can get complex. The query that I'm executing may have things like ranges which require some words to be upper case (i.e. TO). I think this would be much better solved on Solr's end; is there a JIRA about this? On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov <soko...@ifactory.com>

Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread Mike Sokolov
I'd be very interested in this, as well, if you do it before me and are willing to share... A related question I have tried to ask on this list, and have never really gotten a good answer to, is whether it makes sense to just chuck the external storage and treat the lucene index as the

Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread Mike Sokolov
not familiar at all with doing this and have spent 0 time looking into it). Once configured the if(fieldName.equals("title")) line would be replaced with something like if(externalFields.contains(fieldName)){...} or something like that. Thoughts/comments? On Mon, Jun 20, 2011 at 9:05 AM, Mike

Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread Mike Sokolov
would need at present. On Mon, Jun 20, 2011 at 9:54 AM, Mike Sokolov <soko...@ifactory.com> wrote: Another option for determining whether to go to external storage would be to examine the SchemaField, see if it is stored, and if not, try

Re: MultiValued facet behavior question

2011-06-22 Thread Mike Sokolov
On 06/22/2011 04:01 AM, Dennis de Boer wrote: Hi Bill, as far as I understood now, with the help of my friend, you can't. Multivalued fields don't work that way. You can however always filter the facet results manually in the JSP. You know what the user chose as a facet. Yes - that is the

Re: MultiValued facet behavior question

2011-06-22 Thread Mike Sokolov
We always remove the facet filter when faceting: in other words, for a good user experience, you generally want to show facets based on the query excluding any restriction based on the facets. So in your example (facet B selected), we would continue to show *all* facets. Only if you performed

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
Actually - you are both wrong! It is true that 0xffff is a valid UTF8 character, and not a valid UTF8 byte sequence. But the parser is reporting (or trying to) that 0xffff is an invalid XML character. And Robert - if the wording offends you, you might want to send a note to Tatu

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
issues like this is to wrap the stream I'm handing to the parser in some kind of cleanup stream that handles a few yucky issues. You could, eg, just strip out invalid XML characters. Maybe Nutch should be doing this, or at least handling the error better? -Mike On 06/27/2011 09:19 AM, Mike

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
I don't think this is a BOM - that would be 0xfeff. Anyway the problem we usually see w/processing XML with BOMs is in UTF8 (which really doesn't need a BOM since it's a byte stream anyway), in which if you transform the stream (bytes) into a reader (chars) before the xml parser can see it,

Re: Solr 3.1 indexing error Invalid UTF-8 character 0xffff

2011-06-27 Thread Mike Sokolov
Markus - if you want to make sure not to offend XML parsers, you should strip all characters not in this list: http://en.wikipedia.org/wiki/XML#Valid_characters You'll see that article talks about XML 1.1, which accepts a wider range of characters than XML 1.0, and I believe the Woodstox
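
Stripping characters that an XML 1.0 parser will reject can be done with a small filter. The sketch below uses the Char production from the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF); the class and method names are hypothetical, and this is not the cleanup code used by Nutch, Solr or Woodstox.

```java
// Hypothetical filter: keeps only code points allowed by the XML 1.0 Char production.
final class XmlCharCleaner {
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every invalid code point removed (including U+FFFF and unpaired surrogates). */
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars depending on the code point
        }
        return out.toString();
    }
}
```

Applied to text before it is serialized as XML, this removes the 0xffff character from the thread subject while leaving tabs, newlines and ordinary text untouched. XML 1.1 accepts a wider range, so the predicate would need adjusting for that case.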

Re: Looking for Custom Highlighting guidance

2011-06-29 Thread Mike Sokolov
Does the phonetic analysis preserve the offsets of the original text field? If so, you should probably be able to hack up FastVectorHighlighter to do what you want. -Mike On 06/29/2011 02:22 PM, Jamie Johnson wrote: I have a schema with a text field and a text_phonetic field and would like

Re: Looking for Custom Highlighting guidance

2011-06-30 Thread Mike Sokolov
It's going to be a bit complicated, but I would start by looking at providing a facility for merging an array of FieldTermStacks. The constructor for FieldTermStack() takes a fieldName and builds up a list of TermInfos (terms with positions and offsets): I *think* that if you make two of

Re: Text field case sensitivity problem

2011-06-30 Thread Mike Sokolov
like ranges which require some words to be upper case (i.e. TO). I think this would be much better solved on Solr's end; is there a JIRA about this? On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov <soko...@ifactory.com> wrote: oops, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike

Re: Text field case sensitivity problem

2011-06-30 Thread Mike Sokolov
wrote: oops, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope

Re: TermVectors and custom queries

2011-07-01 Thread Mike Sokolov
Yes, that's right. But at the moment the HL code basically has to reconstruct and re-run your query - it doesn't have any special knowledge. There's some work going on to try and fix that, but it seems like it's going to require some fairly major deep re-plumbing. -Mike On 07/01/2011 07:54

Re: How do I add a custom field?

2011-07-07 Thread Mike Sokolov
Did you ever commit? On 07/07/2011 01:58 PM, Gabriele Kahlout wrote: so, how about this: Document doc = searcher.doc(i); // i get the doc doc.removeField(wc); // remove the field in case there's addWc(doc, docLength); //add the new field

Re: How do I specify a different analyzer at search-time?

2011-07-11 Thread Mike Sokolov
There is a syntax that allows you to specify different analyzers to use for indexing and querying, in schema.xml. But if you don't do that, it should use the same analyzer in both cases. -Mike On 07/11/2011 10:58 AM, Gabriele Kahlout wrote: With a lucene QueryParser instance it's possible to

Re: strip html from data

2011-07-25 Thread Mike Sokolov
I think you need to list the charfilter earlier in the analysis chain; before the tokenizer. Probably Solr should tell you this... -Mike On 07/25/2011 09:03 AM, Merlin Morgenstern wrote: sounds logical. I just changed it to the following, restarted and reindexed with commit:

Re: strip html from data

2011-07-25 Thread Mike Sokolov
Hmm - I'm not sure about that; see https://issues.apache.org/jira/browse/SOLR-2119 On 07/25/2011 12:01 PM, Markus Jelsma wrote: charFilters are executed first regardless of their position in the analyzer. On Monday 25 July 2011 17:53:59 Mike Sokolov wrote: I think you need to list

Re: strip html from data

2011-07-25 Thread Mike Sokolov
1 2; term text: bla, bla; keyword: false, false; type: word, word; startOffset: 6, 10; endOffset: 9, 13 On Monday 25 July 2011 18:07:29 Mike Sokolov wrote: Hmm - I'm not sure about that; see https://issues.apache.org/jira/browse/SOLR-2119 On 07/25

creating SchemaField and FieldType programmatically

2012-06-02 Thread Mike Sokolov
I'm creating some Solr plugins that index and search documents in a special way, and I'd like to make them as easy as possible to configure. Ideally I'd like users to be able to just drop a jar in place without having to copy any configuration into schema.xml, although I suppose they will

Re: creating SchemaField and FieldType programmatically

2012-06-02 Thread Mike Sokolov
ok, never mind all is well - I had a mismatch between the schema-declared field and my programmatic field, where I was overzealous in using OMIT_TF_POSITIONS. -Mike On 6/2/2012 5:02 PM, Mike Sokolov wrote: I'm creating some Solr plugins that index and search documents in a special way

Re: creating SchemaField and FieldType programmatically

2012-06-02 Thread Mike Sokolov
()); setQueryAnalyzer(new WhitespaceGapAnalyzer()); } protected Field.Index getFieldIndex(SchemaField field, String internalVal) { return Field.Index.ANALYZED; } } On 6/2/2012 5:48 PM, Mike Sokolov wrote: ok, never mind all is well - I had a mismatch

Re: Efficiently mining or parsing data out of XML source files

2012-06-06 Thread Mike Sokolov
I agree, that seems odd. We routinely index XML using either HTMLStripCharFilter, or XmlCharFilter (see patch: https://issues.apache.org/jira/browse/SOLR-2597), both of which parse the XML, and we don't see such a huge speed difference from indexing other field types. XmlCharFilter also

highlighting field boundary detection

2012-06-19 Thread Mike Sokolov
Does anybody know of a way to detect when the highlight snippet begins at the beginning of the field or ends at the end of the field using one of the standard highlighters shipped w/Solr? We'd like to display ellipses only when there is additional text surrounding the snippet in the original

Re: Problem while indexing XML file with special characters represented uuml

2012-07-10 Thread Mike Sokolov
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't use a true XML parser? You might want to try passing your documents through xmllint -noent (basically parse and reserialize) - that should inline the characters as UTF-8? On 07/09/2012 03:18 PM, Michael Belenki wrote:

Re: Problem while indexing XML file with special characters represented uuml

2012-07-11 Thread Mike Sokolov
I think the issue here is that DIH uses Woodstox BasicStreamReader (see http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html) which has only minimal DTD support. It might be best to use ValidatingStreamReader