Replication not triggered
We have old-fashioned replication configured between one master and one slave. Everything used to work, but today I noticed that recent records were not present in the slave (the same query gives hits on the master but none on the slave). The replication communication seems to work. This is what I get in the logs:

INFO: [default] webapp=/solr path=/replication params={command=fetchindex&_=1430136325501&wt=json} status=0 QTime=0
Apr 27, 2015 2:05:25 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
Apr 27, 2015 2:05:25 PM org.apache.solr.core.SolrCore execute
INFO: [default] webapp=/solr path=/replication params={command=details&_=1430136325600&wt=json} status=0 QTime=21

It says both are in sync, but obviously they are not, and even the replication page of the admin view shows different Version, Gen and size:

Master (Searching)   1430107573634  27  287.19 GB
Master (Replicable)  1430107573634  27  -
Slave (Searching)    1429762011916  23  287.14 GB

Any idea why the replication is not triggered here or what I could try to fix it? Solr version is 4.10.3. -Michael
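In case it helps anyone hitting the same symptom: one thing worth trying before digging deeper is to force a fetch on the slave via the replication handler (command=fetchindex, the same handler the logs above show). A minimal sketch in Python; the host and core name are placeholders for your setup, not values from the thread:

```python
from urllib.parse import urlencode

def replication_url(base, command, **params):
    # Build a URL for Solr's replication handler on a core.
    query = urlencode({"command": command, "wt": "json", **params})
    return f"{base}/replication?{query}"

# 'http://slave:8983/solr/default' is a placeholder for the slave core URL.
url = replication_url("http://slave:8983/solr/default", "fetchindex")
# import urllib.request; urllib.request.urlopen(url)  # would trigger the fetch
```

If even a forced fetch reports "in sync" while Version/Gen differ, comparing the full output of command=details on both master and slave is the next diagnostic step.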
Re: variation on boosting recent documents gives exception
On 13.02.2015 11:18, Gonzalo Rodriguez wrote: You can always change the type of your sortyear field to an int, or create an int version of it and use copyField to populate it.

But that would require me to reindex. It would be nice to have some type conversion available within a function query.

And using NOW/YEAR will round the current date to the start of the year; you can read more about this in the Javadoc: http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/util/DateMathParser.html You can test it using the example collection: http://localhost:8983/solr/collection1/select?q=*:*&boost=recip(ms(NOW/YEAR,manufacturedate_dt),3.16e-11,1,1)&fl=id,manufacturedate_dt,score,[explain]&defType=edismax and checking the explain field for the numeric value given to NOW/YEAR vs NOW/HOUR, etc.

The definition of *_dt fields in the example schema is 'date', but my field is text or (t)int if I have to reindex. To compare against this int field I need another (comparable) int. ms(NOW/YEAR,manufacturedate_dt) is an int, but a huge one, which is very difficult to bring into a sensible relationship to e.g. '2015'. Your suggestion would only work if I change my year to a date like 2015-01-01T00:00:00Z, which is not a sensible format for a publication year and not even easily creatable by copyField. What I need is a real year number, not a date truncated to the year, which is only accessible as the number of milliseconds since the epoch (Jan 1st, 00:00:00), which is not very handy. -Michael
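For anyone puzzled by the magic constant in that boost: 3.16e-11 is roughly the reciprocal of the number of milliseconds in a year, which is what tames the huge ms() values. A quick sketch of the arithmetic (Python, just to illustrate the numbers, not Solr code):

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # ~3.156e10 milliseconds

def recip(x, m, a, b):
    # Solr's recip(x, m, a, b) = a / (m*x + b)
    return a / (m * x + b)

# Boost for a document dated exactly at NOW vs one year before NOW:
brand_new = recip(0, 3.16e-11, 1, 1)              # 1.0
one_year_old = recip(MS_PER_YEAR, 3.16e-11, 1, 1)  # ~0.5
```

So the boost halves roughly every year of age, which is why the constant is chosen that way.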
variation on boosting recent documents gives exception
Since my field to measure recency is not a date field but a string field (with only year-numbers in it), I tried a variation on the suggested boost function for recent documents: recip(sub(2015,min(sortyear,2015)),1,10,10) But this gives an exception when used in a boost or bf parameter. I guess the reason is that all the mathematics doesn't work with a string field even if it only contains numbers. Am I right with this guess? And if so, is there a function I can use to change the type to something numeric? Or are there other problems with my function? Another related question: as you can see the current year (2015) is hard coded. Is there an easy way to get the current year within the function? Messing around with NOW looks very complicated. -Michael
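For what it's worth, the arithmetic of the proposed function is sound once sortyear is numeric; sketched in Python, with recip(x,m,a,b) = a/(m*x+b) as in Solr's function queries:

```python
def recip(x, m, a, b):
    # Solr's recip(x, m, a, b) = a / (m*x + b)
    return a / (m * x + b)

def year_boost(sortyear, current=2015):
    # recip(sub(2015, min(sortyear, 2015)), 1, 10, 10) from the question
    age = current - min(sortyear, current)
    return recip(age, 1, 10, 10)
```

A 2015 title would get boost 1.0, a 2005 title 0.5, a 1995 title about 0.33; so the failure is purely the string-typed field, not the math.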
pf doesn't work like normal phrase query
My aim is to boost exactish matches similar to the recipe described in [1]. The anchoring works in q but not in pf, where I need it. Here is an example that shows the effect:

q=title_exact:anatomie&pf=title_exact^2000

debugQuery says it is interpreted this way:

+title_exact:anatomie (title_exact:""^2000.0)

As you can see, the contents of q is missing in the boosted part. Of course I also tried more realistic variants like

q=title:anatomie&pf=title_exact^10

(regular field and no quotes in q, exact field in pf), which gives:

+title:anatomie (title_exact:""^10.0)

The fieldType definition is not exactly as in [1] but very similar and working in q (see the first example above). Here are the relevant parts of my schema.xml:

<field name="title_exact" type="text_lr" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="title_exact" />

<fieldType name="text_lr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^(.*)$" replacement=" $1 " />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Any idea what is going wrong here? And even more important, how I can fix it? --Michael

[1] http://robotlibrarian.billdueber.com/2012/03/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/
Re: pf doesn't work like normal phrase query
On 11.01.2015 14:01, Ahmet Arslan wrote: What happens when you do not use a fielded query? q=anatomie&qf=title_exact instead of q=title_exact:anatomie

Then it works (with qf=title):

+(title:anatomie) (title_exact:"anatomie"^20.0)

The only problem is that my frontend always does a fielded query. Is there a way to make it work for fielded queries? Or put another way: how can I do this boost in more complex queries like:

title:foo AND author:miller AND year:[2010 TO *]

It would be nice to have a title "foo" before another title "some foo and bar" (given the other criteria also match both titles). In such cases it is almost impossible to move the search fields to the qf parameter. --Michael
Re: pf doesn't work like normal phrase query
On 11.01.2015 14:19, Michael Lackhoff wrote: Or put another way: how can I do this boost in more complex queries like: title:foo AND author:miller AND year:[2010 TO *] It would be nice to have a title "foo" before another title "some foo and bar" (given the other criteria also match both titles). In such cases it is almost impossible to move the search fields to the qf parameter.

How about this one: it should be possible to construct a query with a combination of more than one query parser. Is it possible to get this pseudo-code variant of the above example into a working search URL?

(defType=edismax q=anatomie qf=title^10 related_title^5 pf=title_exact^20)
AND
(defType=edismax q=miller qf=author^10 editor^5)
AND
(defType=edismax, or perhaps another defType, q=[2010 TO *] qf=year)

My knowledge of the syntax is just not good enough to build such a beast and test it. What would a select request look like to do such a query? Or would it be far too slow because of the complexity? --Michael
Re: pf doesn't work like normal phrase query
Hi Ahmet,

You might find this useful: https://lucidworks.com/blog/whats-a-dismax/

I have a basic understanding but will do further reading...

Regarding your example: title:foo AND author:miller AND year:[2010 TO *] -- the last two clauses are better served as a filter query. http://wiki.apache.org/solr/CommonQueryParameters#fq

You are right for a hand-crafted query, but I have to deal with arbitrarily complex user queries which are syntax-checked within the front-end application but not much more. I find it difficult to automatically detect what part of the query can be moved to a filter query.

By the way, it is possible to combine different query parsers in a single query, but I believe your use-case does not need that. https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries

Perhaps not, but how can I tackle my original problem then? Is there a way to boost exact titles (or whatever is in pf for that matter) within fielded queries, since that is what I have to deal with? The example above was just that -- an example -- people can come up with all sorts of complex/fielded queries, but most of them contain a title (or part of it) and I want to boost those that have an exact(ish) match. --Michael
Re: pf doesn't work like normal phrase query
On 11.01.2015 18:30, Jack Krupansky wrote: It's still not quite clear to me what your specific goal is. From your vague description it seems somewhat different from the blog post that you originally cited. So, let's try one more time... explain in plain English what use case you are trying to satisfy.

I think it is the use case from the blog entry. I got the complaint that users didn't find (at least not on the first result page) titles they entered exactly -- and I wanted to fix this by boosting exact matches. The example given to me was the title "Anatomie". So I tried it: title:anatomie and got lots of hits, all of which contained the word in the title, but among the first 10 hits there was none with the (exact) title "Anatomie" the user was looking for. As the next step I did a web search, found the blog entry, implemented it, was happy with the simple case but couldn't make it work with fielded queries (which we have to support, see below). At the moment we even have only fielded queries, since the application makes the default search field explicit -- which I could change but would like to keep if possible. But even if I change this case I still have to cope with fielded queries that are not just targeting the default search field.

You mention fielded queries, but in my experience very few end-users would know about let alone use them. So, either you are giving your end-users specific guidance for writing queries - in which case you can give them more specific guidance that achieves your goals - or if these fielded queries are in fact generated by the client or app layer code, then maybe you just need to put more intelligence into that query-generation code in the client.

It is the old library search problem: most users don't use it, but we also have various kinds of experts among our users (few but important) who really use all the bells and whistles.
And I have to somehow satisfy both groups: those who only do a one-word-search within the default search field and those with complex fielded queries -- and both should find titles they enter exactly at the top, even if combined with dozens of other criteria. And it doesn't really help to question the demand since the demand is there and somewhat external. The point is how to best meet it. --Michael
Re: pf doesn't work like normal phrase query
Thanks everyone for all the advice! To sum up, there seems to be no easy solution. I only have the option to either
- make things really complicated
- only help some users/query structures
- accept the status quo

What could help is an analogue of field aliases: if it was possible to say

f.title.pf=title_exact^10 title_proper^5

analogous to (the existing)

f.title.qf=title_proper^10 title_related

everything should work just fine. But I guess this will only come if or when one of the developers has an itch to scratch ;-) Anyway, thanks a lot for all the help and a great product --Michael
Re: Solution for reverse order of year facets?
Hi Ahmet,

I forgot to include what I did for one customer:
1) Using StatsComponent I get min and max values of the field (year)
2) Calculate smart gap/range values according to minimum and maximum.
3) Re-issue the same query (for the second time) that includes a set of facet.query.

It's amazing, everyone I am talking with about this problem seems to remember some hack(s) to work around the problem ;-) On one hand it shows there are some options (and thanks for giving me some more!), but on the other hand it also shows how much need there is for a real solution like SOLR-1672. I really hope Shawn finds some time to make it work. -Michael
Solution for reverse order of year facets?
If I understand the docs right, it is only possible to sort facets by count or by value in ascending order. Both variants are not very helpful for year facets if I want the most recent years at the top (or to appear at all if I restrict the number of facet entries). It looks like a requirement that was articulated repeatedly, and the recommended solution seems to be to do some math like 10000 - year and index that. So far so good. The only problem is that I have many data sources and I would like to avoid changing every connector to include the new field. I think a better solution would be to have a custom TokenFilterFactory that does it. Since it seems a common request, did someone already build such a TokenFilterFactory? If not, do you think I could build one myself? I do some (script-)programming but have no experience with Java, so I think I could adapt an example. Are there any guides out there? Or even better, is there a built-in solution I haven't heard of? -Michael
Re: Solution for reverse order of year facets?
On 03.03.2014 16:33 Ahmet Arslan wrote: Currently there are two sorting criteria available. However sort by index - to return the constraints sorted in their index order (lexicographic by indexed term) - should return the most recent year at top, no?

No, it returns them -- as you say -- in lexicographic order, and that means oldest first, like:
1815
1820
...
2012
2013 (might well stop before we get here)
2014
-Michael
Re: Solution for reverse order of year facets?
Hi Ahmet,

There is no built-in solution for this.

Yes, I know, that's why I would like the TokenFilterFactory.

Two workarounds:
1) use facet.limit=-1 and invert the list (faceting response) at client side
2) use multiple facet.query parameters:
a) facet.query=year:[2012 TO 2014]&facet.query=year:[2010 TO 2012]
b) facet.query=year:2014&facet.query=year:2013 ...

I thought about these, but they have the disadvantage that 1) could return hundreds of facet entries. 2b) is better but would need about 30 facet queries, which makes quite a long URL, and it wouldn't always work as expected. There are subjects that were very popular in the past but with no (or very few) recent publications. For these I would get empty results for my 2014-1985 facet queries but miss all the stuff from the 1960s. From all these thoughts I came to the conclusion that a custom TokenFilterFactory could do exactly what I want. In effect it would give me a reverse sort:
10000 - 2014 = 7986
10000 - 2013 = 7987
...
The client code can easily regain the original year values for display. And I think it shouldn't be too difficult to write such a beast, the only problem is I am not a Java programmer. That is why I asked if someone has done it already or if there is a guide I could use. After all it is just a simple subtraction... -Michael
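To make the intended filter concrete, here is the subtraction in Python (the base 10000 is just one convenient choice; any constant larger than the biggest year works):

```python
def reverse_year_key(year, base=10000):
    # What the proposed TokenFilter would index: 10000 - 2014 = 7986, etc.
    return base - int(year)

years = [1815, 1820, 2013, 2014]
keys = sorted(reverse_year_key(y) for y in years)   # ascending index order
restored = [10000 - k for k in keys]                # client-side display values
```

Sorting the keys ascending (which facet.sort=index does lexicographically for equal-width values) yields the years most-recent-first, and the client recovers the original year with the same subtraction.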
Re: Solution for reverse order of year facets?
On 03.03.2014 19:58 Shawn Heisey wrote: There's already an issue in Jira. https://issues.apache.org/jira/browse/SOLR-1672

Thanks, this is of course the best solution. The only problem is that I use a custom version from a vendor (based on version 4.3) I want to enhance. But perhaps they will apply the patch. In the meantime I still think the custom filter could be a workaround.

I can't take a look now, but I will later if someone else hasn't taken it up.

That would be great! Thanks -Michael
Re: SOLR 3.3.0 multivalued field sort problem
On 13.08.2011 18:03 Erick Erickson wrote: The problem I've always had is that I don't quite know what sorting on multivalued fields means. If your field had tokens "a" and "z", would sorting on that field put the doc at the beginning or end of the list? Sure, you can define rules (first token, last token, average of all tokens (whatever that means)), but each solution would be wrong sometime, somewhere, and/or completely useless.

Of course it would need rules, but I think it wouldn't be too hard to find rules that are at least far better than the current situation. My wish would include an option that decides if the field can be used just once or every value on its own. If the option is set to FALSE, only the first value would be used; if it is TRUE, every value of the field would get its place in the result list. So, if we have e.g.

record1: ccc and bbb
record2: aaa and zzz

it would be either

record2 (aaa)
record1 (ccc)

or

record2 (aaa)
record1 (bbb)
record1 (ccc)
record2 (zzz)

I find these two outcomes most plausible, so I would allow them if technically possible, but whatever rule looks more plausible to the experts: some solution is better than no solution. -Michael
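The two proposed outcomes can be sketched in a few lines of Python (plain sorting here, of course, not Lucene internals):

```python
records = {"record1": ["ccc", "bbb"], "record2": ["aaa", "zzz"]}

# Option FALSE: only the first value of each record is used for sorting.
by_first_value = sorted(records, key=lambda r: records[r][0])

# Option TRUE: every value gets its own place in the result list,
# so a record can appear more than once.
exploded = sorted((v, r) for r, values in records.items() for v in values)
```

The first variant yields record2 (aaa) before record1 (ccc); the second yields the four-entry list from the example, with record1 appearing twice.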
Re: SOLR 3.3.0 multivalued field sort problem
On 13.08.2011 20:31 Martijn v Groningen wrote: The first solution would make sense to me. Some kind of a strategy mechanism for this would allow anyone to define their own rules. Duplicating results would be confusing to me.

That is why I would only activate it on request (setting a special option). Example use case: a library catalogue with an author sort. All books of an author would be together, no matter how many co-authors the book has. So I think it could be useful (as an option), but I have no idea how difficult it would be to implement. As I said, it would be nice to have at least something. Any possible customization would be an extra bonus. -Michael
Re: SOLR 3.3.0 multivalued field sort problem
On 13.08.2011 21:28 Erick Erickson wrote: Fair enough, but what's the "first value" in the list? There's nothing special about multiValued fields, that is where the schema has multiValued="true". Under the covers, this is no different than just concatenating all the values together and putting them in at one go, except for some games with the position between one term and another (positionIncrementGap). Part of my confusion is that the term multi-valued is sometimes used to refer to multiValued="true" and sometimes used to refer to documents with more than one *token* in a particular field (often as the result of the analysis chain).

I guess, since multivalued fields are not really different under the hood, they should be treated the same. So, no matter if the different values are the result of multiValued="true" or of the analysis chain: if the whole thing starts with an "a" put it first, if it starts with a "z" put it last.

Example (multivalued field): "Smith, Adam" / "Duck, Dagobert" = sort as s (or S)
Example (tokenized field): "This is a tokenized field" = sort as t (or T)

The second case seems to be more in the grouping/field collapsing arena, although that doesn't work on fields with more than one value yet either. But that seems a more sensible place to put the second case rather than overloading sorting.

It depends how you see the meaning of sorting:
1. Sort the records based on one single value per record (and return them in this order)
2. Sort the values of the field to sort on (and return the records belonging to the respective values)

As long as sorting is only allowed on single-value fields, both are identical. As soon as you allow multivalued fields to be sorted on, both interpretations mean something different, and I think both have their valid use case. But I don't want to stress this too far. -Michael
Re: problem in setting field attribute in schema.xml
On 26.05.2011 12:52, Romi wrote: i have done it, i deleted old indexes and created new indexes but still able to search it through *:*, and no result when i search it as field:value. really surprising result. :-O

I really don't understand your problem. This is not at all surprising but the expected behaviour: *:* just gives you every document in your index, no matter what of the document is stored or indexed, it just gives _everything_, whereas field:value does an actual search for an indexed value "value" in field "field". So no surprise either that you didn't get a result here if you didn't index the field. -Michael
Re: problem in setting field attribute in schema.xml
On 26.05.2011 14:10, Romi wrote: did you mean when i set indexed=false and stored=true, solr does not index the field's value but stores its value as it is???

I don't know if you are asking me, since you do not quote anything, but yes, of course this is exactly the purpose of indexed and stored. -Michael
Re: problem in setting field attribute in schema.xml
On 25.05.2011 15:47, Vignesh Raj wrote: It's very strange. Even I tried the same now and am getting the same result. I have set both indexed=false and stored=false. But still if I search for a keyword using my default search, I get the results in these fields as well. But if I specify field:value, it shows 0 results. Can anyone explain?

I guess you copy the field to your default search field. -Michael
Re: Is semicolon a character that needs escaping?
On 08.09.2010 00:05 Chris Hostetter wrote:
: Subject: Is semicolon a character that needs escaping?
...
: From this I conclude that there is a bug either in the docs or in the
: query parser or I missed something. What is wrong here?

Back in Solr 1.1, the standard query parser treated ; as a special character and looked for sort instructions after it. Starting in Solr 1.2 (released in 2007) a sort param was added, and semicolon was only considered a special character if you did not explicitly mention a sort param (for back compatibility). Starting with Solr 1.4, the default was changed so that semicolon wasn't considered a meta-character even if you didn't have a sort param -- you have to explicitly select the lucenePlusSort QParser to get this behavior. I can only assume that if you are seeing this behavior, you are either using a very old version of Solr, or you have explicitly selected the lucenePlusSort parser somewhere in your params/config. This was heavily documented in CHANGES.txt for Solr 1.4 (you can find mention of it when searching for either ; or semicolon).

I am using 1.3 without a sort param, which explains it, I think. It would be nice to update to 1.4, but we try to avoid such actions on a production server as long as everything runs fine (the semicolon thing was only reported recently). Many thanks for your detailed explanation! -Michael
Is semicolon a character that needs escaping?
According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html only these characters need escaping:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

but with this simple query: TI:stroke; AND TI:journal I got the error message:

HTTP ERROR: 400 Unknown sort order: TI:journal

My first guess was that it was a URL encoding issue, but everything looks fine:

http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on

as you can see, the semicolon is encoded as %3B. There is no problem when the query ends with the semicolon: TI:stroke; gives no error. The first query also works if I escape the semicolon: TI:stroke\; AND TI:journal

From this I conclude that there is a bug either in the docs or in the query parser or I missed something. What is wrong here? -Michael
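For client code that escapes terms before handing them to the query parser, the fix amounts to adding ';' to the escape set. A sketch in Python; the character class follows the documented Lucene specials (the two-character operators && and || are left out of this sketch):

```python
import re

# Documented Lucene specials plus ';', which pre-1.4 Solr
# also treats as a meta-character.
_SPECIAL = re.compile(r'[+\-!(){}\[\]^"~*?:\\;]')

def escape_term(text):
    # Prefix each special character with a backslash.
    return _SPECIAL.sub(lambda m: '\\' + m.group(0), text)
```

Note this must be applied to the user's term only, not to the whole fielded query, since it also escapes the colon you add yourself in TI:stroke.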
Re: Is semicolon a character that needs escaping?
On 03.09.2010 00:57 Ken Krugler wrote: The docs need to be updated, I believe. From some code I wrote back in 2006... [...]

Thanks, this explains it very well.

But in general escaping characters in a query gets tricky - if you can directly build queries versus pre-processing text sent to the query parser, you'll save yourself some pain and suffering.

What do you mean by these two alternatives? That is, what exactly could I do better?

Also, since I did the above code the DisMaxRequestHandler has been added to Solr, and it (IIRC) tries to be smart about handling this type of escaping for you.

Dismax is not (yet) an option because we need the full Lucene syntax within the query. Perhaps this will change with the new enhanced dismax request handler, but I didn't play with it enough (will do with the next release). -Michael
Re: Is semicolon a character that needs escaping?
Hi Ken,

But in general escaping characters in a query gets tricky - if you can directly build queries versus pre-processing text sent to the query parser, you'll save yourself some pain and suffering.

What do you mean by these two alternatives? That is, what exactly could I do better?

By "can build...", I meant if you can come up with a GUI whereby the user doesn't have to use special characters (other than, say, quoting), then you can take a collection of clauses and programmatically build your query, without using the query parser.

I think I have that (escaping of characters that have a special meaning in Solr). I just didn't know that the semicolon is one of them. So it would be nice if the docs could be updated to account for this. Thanks again -Michael
Re: Very basic questions: Indexing text
On 28.06.2010 23:00 Ahmet Arslan wrote: 1) I can get my docs in the index, but when I search, it returns the entire document. I'd love to have it only return the line (or two) around the search term. Solr can generate Google-like snippets as you describe. http://wiki.apache.org/solr/HighlightingParameters

I didn't know this is possible and am also interested in this feature, but even after reading the given wiki page I cannot make out which parameter to use. The only parameter that could be similar is 'hl.maxAlternateFieldLength', where it is possible to give a length to return, but according to the description that is for the case of no match. And there is hl.fragmentsBuilder, but with no explanation (the referred page SolrFragmentsBuilder does not yet exist). Could you give an example? E.g. let's say I have a field 'title' and a field 'fulltext' and my search term is 'solr'. What would be the right set of parameters to get back the whole title field but only a snippet of 50 words (or three sentences or whatever the unit) from the fulltext field? Thanks -Michael
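Not an authoritative answer, but the standard highlighting parameters should get close to what is asked for here, with the caveat that fragment sizes are measured in characters rather than words or sentences. A sketch of the request parameters, built in Python; the field names 'title' and 'fulltext' match the example in the question:

```python
from urllib.parse import urlencode

params = {
    "q": "fulltext:solr",
    "fl": "title",        # stored fields listed here come back in full
    "hl": "true",         # enable highlighting
    "hl.fl": "fulltext",  # build snippets only for this field
    "hl.snippets": 3,     # up to three fragments per document
    "hl.fragsize": 300,   # fragment size in characters, not words
}
query_string = urlencode(params)
```

The highlighted fragments then arrive in the separate "highlighting" section of the response, keyed by document id, alongside the normal doc list that carries the full title.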
Re: exceptionhandling error-reporting?
On 06.04.2010 17:49 Alexander Rothenberg wrote: On Monday 05 April 2010 20:14:44 Chris Hostetter wrote: define crashes? ... presumably you are talking about the client crashing because it can't parse the error response, correct? ... the best suggestion given the current state of Solr is to make the client smart enough to not attempt parsing of the response unless the response code is 200. Yes, it tries to parse the HTML output while expecting JSON syntax. Because it is a perl-mod from CPAN, I don't really want to customize it...

You don't have to. Just wrap the call in an eval, at least that is what I do. -Michael
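The same guard works in any language: check the HTTP status before parsing. A Python sketch of the idea (the Perl equivalent being the eval wrapper mentioned above); the function name is just illustrative:

```python
import json

def parse_solr_response(status, body):
    # Only parse when Solr returned 200; otherwise surface the raw
    # (often HTML) error body instead of a confusing JSON parse failure.
    if status != 200:
        raise RuntimeError("Solr error %d: %s" % (status, body[:200]))
    return json.loads(body)
```

The caller then gets either a parsed structure or a clean exception carrying Solr's actual error text, never a parser crash.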
Re: Confused by Solr Ranking
On 09.03.2010 16:01 Ahmet Arslan wrote: I kind of suspected stemming to be the reason behind this. But I consider stemming to be a good feature. This is the side effect of stemming. Stemming increases recall while harming precision.

But most people want the best possible combination of both, something like:

(raw_field:word OR stemmed_field:word^0.5)

and it is nice that Solr allows such arrangements, but it would be even nicer to have some sort of automatic "take this field, transform the contents in a couple of ways and do some boosting in the order given". At least this would be my wish for the recent question about the one feature I would like to see. Or even better, allow not only a hierarchy of transformations but also a hierarchy of fields (like in dismax, but with the full power of the standard request handler). -Michael
Re: schema-based Index-time field boosting
On 23.11.2009 19:33 Chris Hostetter wrote: ...if there was a way to boost fields at index time that was configured in the schema.xml, then every doc would get that boost on its instances of those fields, but the only purpose of index-time boosting is to indicate that one document is more significant than another doc -- if every doc gets the same boost, it becomes a No-OP. (think about the math -- field boosts become multipliers in the fieldNorm -- if every doc gets the same multiplier, then there is no net effect)

Coming in a bit late, but I would like a variant that is not a No-OP. Think of something like

title:searchstring^10 OR catch_all:searchstring

Of course I can always add the boosting at query time, but it would make life easier if I could define a default boost in the schema so that my query could just be

title:searchstring OR catch_all:searchstring

but still get the boost for the title field. Thinking this further, it would be even better if it was possible to define one (or more) fallback field(s) with an associated boost factor in the schema. Then it would be enough to query for title:searchstring and it would be automatically expanded to e.g.

title:searchstring^10 OR title_other_language:searchstring^5 OR catchall:searchstring

or whatever you define in the schema. -Michael
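Hostetter's no-op argument is easy to verify: multiplying every document's score by the same constant cannot change the ordering. A toy demonstration in Python (made-up scores, not Lucene's actual scoring):

```python
scores = {"d1": 1.2, "d2": 0.7, "d3": 2.5}  # toy per-document scores

# A schema-wide index-time field boost would multiply every
# document's fieldNorm by the same constant k ...
k = 10.0
boosted = {d: s * k for d, s in scores.items()}

# ... which cannot change the ranking:
original_order = sorted(scores, key=scores.get, reverse=True)
boosted_order = sorted(boosted, key=boosted.get, reverse=True)
```

Which is exactly why the requested cross-field default boost only makes sense as query-time expansion (as sketched in the reply), not as a uniform index-time boost.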
Re: How to import multiple RSS-feeds with DIH
On 09.11.2009 09:46 Noble Paul നോബിള് नोब्ळ् wrote: When you say the second example does not work, what does it mean? some exception? (if yes, please post the stacktrace)

Very mysterious. Now it works, but I am sure I got an exception before. All I remember is something like java.io.IOException: FULL. In the right frame of the DIH debugging screen I got an error message from Firefox: the connection was reset while displaying the page. But I don't think it is reproducible now, perhaps some unrelated problem like low memory or such. Thanks anyway and sorry for the noise. -Michael
Getting started with DIH
I would like to start using DIH to index some RSS feeds and mail folders. To get started I tried the RSS example from the wiki, but as it is, Solr complains about the missing id field. After some experimenting I found two ways to fill the id:

- <copyField source="link" dest="id"/> in schema.xml
This works but isn't very flexible. Perhaps I have other types of records with a real id or a multivalued link field. Then this solution would break.

- Changing the id field to type uuid
Again, I would like to keep real ids where I have them and not a random UUID.

What didn't work but looks like the potentially best solution is to fill the id in my data-config by using the link twice:

<field column="link" xpath="/RDF/item/link" />
<field column="id" xpath="/RDF/item/link" />

This would be a definition just for this single data source, but I don't get any docs (also no error message). No trace of any inserts whatsoever. Is it possible to fill the id that way?

Another question regarding MailEntityProcessor. I found this example:

<document>
  <entity processor="MailEntityProcessor"
          user="someb...@gmail.com"
          password="something"
          host="imap.gmail.com"
          protocol="imaps"
          folders="x,y,z"/>
</document>

But what is the dataSource (the enclosing tag to document)? That is, how would a minimal but complete data-config.xml look like to index mails from an IMAP server? And finally, is it possible to combine the definitions for several RSS feeds and mail accounts into one data-config? Or do I need a separate config file and request handler for each of them? -Michael
Re: Getting started with DIH
On 08.11.2009 17:03 Lucas F. A. Teixeira wrote: You have an example on using mail DIH in the Solr distro.

<blush>Don't know where my eyes were. Thanks!</blush>

While I was at it, I looked at the schema.xml for the RSS example and it uses link as UniqueKey, which is of course good if you only have RSS items, but not so good if you also plan to add other data sources. So I am still interested in a good solution for my id problem:

What didn't work but looks like the potentially best solution is to fill the id in my data-config by using the link twice:

<field column="link" xpath="/RDF/item/link" />
<field column="id" xpath="/RDF/item/link" />

This would be a definition just for this single data source, but I don't get any docs (also no error message). No trace of any inserts whatsoever. Is it possible to fill the id that way?

and this one:

And finally, is it possible to combine the definitions for several RSS feeds and mail accounts into one data-config? Or do I need a separate config file and request handler for each of them?

Thanks -Michael
Re: Getting started with DIH
On 08.11.2009 16:56 Michael Lackhoff wrote: What didn't work but looks like the potentially best solution is to fill the id in my data-config by using the link twice: <field column="link" xpath="/RDF/item/link" /> <field column="id" xpath="/RDF/item/link" /> This would be a definition just for this single data source but I don't get any docs (also no error message). No trace of any inserts whatsoever. Is it possible to fill the id that way?

Found the answer in the list archive: use TemplateTransformer:

<field column="link" xpath="/RDF/item/link" />
<field column="id" template="${slashdot.link}" />

Only minor and cosmetic problem: there are brackets around the id field (like [http://somelink/]). For an id this doesn't really matter, but I would like to understand what is going on here. In the wiki I found only this info: "The rules for the template are same as the templates in 'query', 'url' etc" but I couldn't find any info about those either. Is this documented somewhere? -Michael
Re: Getting started with DIH
On 09.11.2009 06:54 Erik Hatcher wrote: The brackets probably come from it being transformed as an array. Try saying multiValued=false on your field specifications. Indeed. Thanks Erik, that was it. My first steps with DIH showed me what a powerful tool this is, but although the DIH wiki page might well be the longest in the whole wiki, there are so many mysteries left for the uninitiated. Is there any other documentation I might have missed? Thanks -Michael
Re: Getting started with DIH
On 09.11.2009 08:20 Noble Paul നോബിള് नोब्ळ् wrote: It just started off as a single page and the features just got piled up and the page just got bigger. we are thinking of cutting it down to smaller more manageable pages Oh, I like it the way it is as one page, so that the browser's full-text search can help. It is just that the features and power seem to grow even faster than the wiki page ;-) E.g. I couldn't find a way to add a second rss feed. I tried with a second entity parallel to the slashdot one but got an exception: java.io.IOException: FULL whatever that means, so I must be doing something wrong but couldn't find a hint. -Michael
How to import multiple RSS-feeds with DIH
[A new thread for this particular problem] On 09.11.2009 08:44 Noble Paul നോബിള് नोब्ळ् wrote: The tried and tested strategy is to post the question in this mailing list w/ your data-config.xml.

See my data-config.xml below. The first entity is the usual slashdot example with my 'id' addition, the second a very simple additional feed. The second example works if I delete the slashdot feed, but as I said I would like to have them both. -Michael

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="slashdot" pk="link" url="http://rss.slashdot.org/Slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="TemplateTransformer,DateFormatTransformer">
      <field column="source" xpath="/RDF/channel/title" commonField="true" />
      <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
      <field column="subject" xpath="/RDF/channel/subject" commonField="true" />
      <field column="title" xpath="/RDF/item/title" />
      <field column="link" xpath="/RDF/item/link" />
      <field column="id" template="${slashdot.link}" />
      <field column="description" xpath="/RDF/item/description" />
      <field column="creator" xpath="/RDF/item/creator" />
      <field column="item-subject" xpath="/RDF/item/subject" />
      <field column="slash-department" xpath="/RDF/item/department" />
      <field column="slash-section" xpath="/RDF/item/section" />
      <field column="slash-comments" xpath="/RDF/item/comments" />
      <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
    </entity>
    <entity name="heise" pk="link" url="http://www.heise.de/newsticker/heise.rdf"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="TemplateTransformer">
      <field column="source" xpath="/RDF/channel/title" commonField="true" />
      <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
      <field column="title" xpath="/RDF/item/title" />
      <field column="link" xpath="/RDF/item/link" />
      <field column="id" template="${heise.link}" />
    </entity>
  </document>
</dataConfig>
Re: Preparing the ground for a real multilang index
On 08.07.2009 00:50 Jan Høydahl wrote: itself and do not need to know the query language. You may then want to do a copyfield from all your text_lang -> text for convenient one-field-to-rule-them-all search.

Would that really help? As I understand it, copyfield takes the raw, not yet analyzed field value. I cannot yet see the advantage of this text field over the current situation with no text_lang fields at all. The copied-to text field has to be language agnostic with no stemming at all, so it would miss many hits. Or is there a way to combine many differently stemmed variants into one field, to be able to search against all of them at once? That would be great indeed! -Michael
EnglishPorterFilterFactory and PatternReplaceFilterFactory
In Germany we have a strange habit of seeing some sort of equivalence between umlaut letters and a two-letter representation. Example: 'ä' and 'ae' are expected to give the same search results. To achieve this I added this filter to the text fieldtype definition:

<filter class="solr.PatternReplaceFilterFactory" pattern="ä" replacement="ae" replace="all" />

to both index and query analyzers (and more for the other umlauts). This works well when I search for a name (a word not stemmed) but not e.g. with the word Wärme:
- search for 'wärme' works
- search for 'waerme' does not work
- search for 'waerm' works if I move the EnglishPorterFilterFactory after the PatternReplaceFilterFactory

DebugQuery for waerme gives a parsedquery FS:waerm. What I don't understand is why the (existing) records are not found. If I understand it right, there should be 'waerm' in the index as well. By the way, the reason I keep the EnglishPorterFilterFactory is that the records are in many languages and English stemming gives good results in many cases, and I don't want (yet) to multiply my fields into language-specific versions. But even if the stemming is not right because the language is not English, I think records should be found as long as the analyzers are the same for index and query. This is with Solr 1.3. Can someone shed some light on what is going on and how I can achieve my goal? -Michael
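For concreteness, the reordered analyzer under discussion would look roughly like this; the fieldType wrapper, tokenizer and lowercase filter are assumptions for illustration (the real schema may differ), and the same chain would be used for both the index and query analyzer:

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- typewriter-umlaut replacements, one filter per character -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="ä" replacement="ae" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ö" replacement="oe" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ü" replacement="ue" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ß" replacement="ss" replace="all"/>
    <!-- stemmer moved after the replacements, as discussed above -->
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>
```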
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
On 02.07.2009 16:34 Walter Underwood wrote: First, don't use an English stemmer on German text. It will give some odd results. I know but at the moment I only have the choice between no stemmer at all and one stemmer and since more than half of the records are English (about 60% English, 30% German, some Italian, French and others) the results are not too bad. Are you using the same conversions on the index and query side? Yes, index and query look exactly the same. That is what I don't understand. I am not complaining about a misbehaving stemmer, unless it does already something odd with the umlauts. The German stemmer might already handle typewriter umlauts. If it doesn't, use the pattern replace factory. You will also need to convert ß to ss. That is what I tried. And yes I also have a filter for ß to ss. It just doesn't work as expected. You really do need separate fields for each language. Eventually. But now I have to get ready really soon with a small application and people don't find what they expect. Handling these characters is language-specific. The typewriter umlaut conversion is wrong for English. It is correct, but rare, to see a diaresis in English when vowels are pronounced separately, like coöperate. In Swedish, it is not OK to convert ö to another letter or combination of letters. It is just for German users and at the moment it would be totally ok to have coöperate indexed as cooeperate, I know it is wrong and it will be fixed but given the tight schedule all I want at the moment is the combination of some stemming (perhaps 70% right or more) and typewriter umlauts (perhaps 90% correct, you gave examples for the missing 10%). Do I have any chance? -Michael
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
On 02.07.2009 17:28 Erick Erickson wrote: I'm shooting a bit in the dark here, but I'd guess that these are actually understandable results. Perhaps not too much in the dark... Your implicit assumption, it seems to me, is that 'wärme' and 'waerme' should go through the stemmer and become 'wärm' and 'waerm', which you can then do the substitution on and produce the same output. I don't think that's a valid assumption. Sounds very reasonable. Will see what I can make out of all this to keep our librarians happy... Yonik Seeley wrote: Also, check out MappingCharFilterFactory in Solr 1.4 and mapping-ISOLatin1Accent.txt in example/solr/conf Thanks for the hint, looking forward to the 1.4 release ;-) at the moment we are on 1.3 though, I hope to upgrade soon but probably not soon enough for this app. -Michael
Preparing the ground for a real multilang index
As pointed out in the recent thread about stemmers and other language specifics I should handle them all in their own right. But how? The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases? Given I somehow know record1 is English and record2 is German, I then need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER and both will have their respective stemmer. But what about exotic languages? Use a catch-all language without a stemmer? Now a user searches for TITLE:term and I don't know beforehand the language of term. Do I have to expand the query to something like TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there some sort of copyfield for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query. Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible. -Michael
Re: Preparing the ground for a real multilang index
On 03.07.2009 00:49 Paul Libbrecht wrote: [I'll try to address the other responses as well]

I believe the proper way is for the server to compute a list of accepted languages in order of preferences. The web-platform language (e.g. the user-setting), and the values in the Accept-Language http header (which are from the browser or platform).

All this is not going to help much because the main application is a scientific search portal for books and articles with many users searching cross-language. The most typical use case is a German user searching multilingual. So we might even get the search multilingual, e.g. TITLE:cancer OR TITLE:krebs. No way here to watch out for Accept-headers or a language select field (would be left on "any" in most cases). Other popular use cases are citations (in whatever language) cut and pasted into the search field.

Then you expand your query for "surfing waves" (say) to:
- phrase query: "surfing waves" exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
  - english: surf wav (^1.0)
  - german: surfing wave (^0.9)
  - ...
- then maybe even try the phonetic analyzer (matched in a separate field probably)

This is an even more sophisticated variant of the multiple OR I came up with. Oh well...

I think this is a common pattern on the web where the users, browsers, and servers are all somewhat multilingual.

Indeed, and often users are not even aware of it; especially in a scientific context they use their native tongue and English almost interchangeably -- and they expect the search engine to cope with it. I think the best would be to process the data according to its language but not make any assumptions about the query language, and I am totally lost how to get a clever schema.xml out of all this. Thanks everyone for listening and I am still open for good suggestions to deal with this problem! -Michael
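A rough sketch of the expansion Paul outlines, as it might be generated client-side. The field names (TITLE, TITLE_ENG, ...) and boosts are illustrative assumptions, and real stemmed variants would come from the per-language analyzers at query time, not from this function:

```python
def expand_query(terms, languages=("ENG", "GER")):
    """Build a language-agnostic query by ORing per-language clauses.

    Clauses, in decreasing boost order: exact phrase, unstemmed single
    terms, then one clause per language field (which Solr would stem
    with that field's analyzer).
    """
    phrase = " ".join(terms)
    clauses = ['TITLE:"%s"^2.0' % phrase]             # exact phrase
    clauses += ["TITLE:%s^1.5" % t for t in terms]    # unstemmed terms
    for lang in languages:                            # per-language fields
        clauses += ["TITLE_%s:%s" % (lang, t) for t in terms]
    return " OR ".join(clauses)

print(expand_query(["surfing", "waves"]))
```

The obvious drawback, as noted above, is that the clause count grows with the number of languages times the number of query terms.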
Re: Moving from single core to multicore
On 10.02.2009 02:39 Chris Hostetter wrote:
: Now all that is left is a more cosmetic change I would like to make:
: I tried to place the solr.xml in the example dir to get rid of the
: -Dsolr.solr.home=multicore for the start and changed the first entry
: from core0 to solr and moved the core1 dir from multicore directly
: under the example dir
: Idea behind all this: Use the original single core under solr as core0
: and add a second one on the same directory level (core1 parallel to
: solr). Then I started solr with the old java -jar start.jar in the
: example dir. But the multicore config seems to be ignored then, I get

solr looks for conf/solr.xml relative to the Solr Home Dir and if it doesn't find it then it looks for conf/solrconfig.xml ... if you don't set the solr.solr.home system property then the Solr Home Dir defaults to ./solr/ so putting your new solr.xml file in example/solr/conf should be what you are looking for.

Almost. I had to change solr.xml like this, otherwise everything was expected under ./solr (looking for solr/solr and solr/core1):

<cores adminPath="/admin/cores">
  <core name="core0" instanceDir=".">
    <property name="dataDir" value="./data" />
  </core>
  <core name="core1" instanceDir="../core1">
    <property name="dataDir" value="../core1/data" />
  </core>
</cores>

Though the dataDir property seems to be ignored; I had to set it in solrconfig.xml of both cores. Thanks for all your help, the support all of you are giving is really outstanding! --Michael
Moving from single core to multicore
Hello, I am not that experienced but managed to get a Solr index going by copying the example dir from the distribution (1.3 released version) and changing the fields in schema.xml to my needs. As I said everything is working very well so far. Now I need a second index on the same machine and the natural solution seems to be multicore (I would really like to keep the two distinct so I didn't put everything in one index). But I have some problems setting this up. As long as I try the multicore sample everything works but when I copy my schema.xml into the multicore/core0/conf dir I only get 404 error messages when I enter the admin url. Looks like I cannot just copy over a single core config to a multicore environment and that is o.k., what I am missing is some guidance what to look out for. What are the settings that have to be adjusted to multicore? I would like to avoid trial and error for every single setting I have in my config. And a related question: I would like to keep the existing data dir as core0-datadir (/path_to_installation/example/solr/data). Is this possible with the dataDir parameter? And if yes, what would be the correct value? /solr/data/ or /path_to_installation/example/solr/data/? Do I need an absolute path or is it relative to the dir where my start.jar is? Thanks, Michael
Re: Moving from single core to multicore
On 09.02.2009 15:40 Ryan McKinley wrote: But I have some problems setting this up. As long as I try the multicore sample everything works but when I copy my schema.xml into the multicore/core0/conf dir I only get 404 error messages when I enter the admin url. what is the url you are hitting? those from the wiki: http://localhost:8983/solr/core0/select?q=*:* Do you see links from the index page? Sorry, I don't know what you mean by this Are there any messages in the log files? This looks like the key. The output is a bit difficult to follow but I found the most likely reason: the txt files were missing (stopwords.txt, synonyms.txt ...) and then the fieldtype definitions failed. After I copied the complete conf dir over to multicore it is almost working now. Only problems: I get this warning: 2009-02-09 16:27:31.177::WARN: /solr/admin/ java.lang.IllegalStateException: STREAM at org.mortbay.jetty.Response.getWriter(Response.java:571) [lots more] and both cores seem to reference the old single core data. If I do a search both give (the same) results (from the old core), I expected them to be empty, searching in a newly created index somewhere below the multicore dir. I couldn't find a datadir definition so I still don't know how to add a real second core (not just two cores with the same data). Any ideas? Thanks so far Michael
Re: Moving from single core to multicore
On 09.02.2009 17:01 Ryan McKinley wrote: Check your solrconfig.xml, you probably have something like this:

<!-- Used to specify an alternate directory to hold all index data
     other than the default ./data under the Solr home. If replication
     is in use, this should match the replication configuration. -->
<dataDir>${solr.data.dir:./solr/data}</dataDir>

(from the example) either remove that or make each one point to the correct location

Thanks, that's it! Now all that is left is a more cosmetic change I would like to make: I tried to place the solr.xml in the example dir to get rid of the -Dsolr.solr.home=multicore for the start and changed the first entry from core0 to solr and moved the core1 dir from multicore directly under the example dir. Idea behind all this: Use the original single core under solr as core0 and add a second one on the same directory level (core1 parallel to solr). Then I started solr with the old java -jar start.jar in the example dir. But the multicore config seems to be ignored then, I get my old single core, e.g. http://localhost:8983/solr/core1/select?q=*:* is no longer found. As I said, everything works if I leave it in the multicore subdir and start with -Dsolr.solr.home=multicore, but it would be nice if I could do without that extra subdir and the extra start parameter. --Michael
Re: date range query performance
On 31.10.2008 19:16 Chris Hostetter wrote: for the record, you don't need to index as a StrField to get this benefit, you can still index using DateField, you just need to round your dates to some less granular level ... if you always want to round down, you don't even need to do the rounding yourself, just add /SECOND or /MINUTE or /HOUR to each of your dates before sending them to solr. (SOLR-741 proposes adding a config option to DateField to let this be done server side) Is this also possible for the timestamp that is automatically added to all new/updated docs? I would like to be able to search (quickly) for everything that was added within the last week or month or whatever. And because I update the index only once a day, a granularity of /DAY (if that exists) would be fine. - Michael
Re: date range query performance
On 01.11.2008 06:10 Erik Hatcher wrote: Yeah, this should work fine:

<field name="timestamp" type="date" indexed="true" stored="true" default="NOW/DAY" multiValued="false"/>

Wow, that was fast, thanks! -Michael
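With such a day-rounded timestamp, the "added within the last week" search mentioned above can be expressed with Solr date math in the range syntax; a small sketch for building the query parameter (the field name matches the definition above, the rest is illustrative):

```python
from urllib.parse import urlencode

def recent_docs_params(days=7, field="timestamp"):
    """URL-encoded q parameter matching docs added in the last `days` days.

    Relies on Solr date math; since the field defaults to NOW/DAY, the
    lower bound can safely be rounded to day granularity as well.
    """
    return urlencode({"q": "%s:[NOW/DAY-%dDAYS TO NOW]" % (field, days)})

print(recent_docs_params())
```

The resulting string would be appended to the usual /select URL; swap `days=30` for a "last month" view.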
Re: Searching for future or null dates
On 26.09.2008 06:17 Chris Hostetter wrote: that's true, regrettably there is no prefix operator to indicate a SHOULD clause in the Lucene query language, so if you set the default op to AND you can't then override it on individual clauses. this is one of the reasons i never make the default op AND. Just for symmetry, or to get rid of this restriction, wouldn't it be a good idea to add such a prefix operator? i'm sure your food will still taste pretty good :) That's what my wife keeps telling me ;-) Many thanks. I think I will leave it as is for the current application but use OR-default plus prefix operators for new projects. -Michael
Re: Searching for future or null dates
On 23.09.2008 00:30 Chris Hostetter wrote:
: Here is what I was able to get working with your help.
:
: (productId:(102685804)) AND liveDate:[* TO NOW] AND ((endDate:[NOW TO *]) OR
: ((*:* -endDate:[* TO *])))
:
: the *:* is what I was missing.

Please, PLEASE ... do yourself a favor and stop using AND and OR ... food will taste better, flowers will smell fresher, and the world will be a happy shiny place...

+productId:102685804 +liveDate:[* TO NOW] +(endDate:[NOW TO *] (*:* -endDate:[* TO *]))

I would also like to follow your advice but don't know how to do it with defaultOperator=AND. What I am missing is the equivalent to OR:

AND: +
NOT: -
OR: ???

I didn't find anything on the Solr or Lucene query syntax pages. If there is such an equivalent then I guess the query would become:

productId:102685804 liveDate:[* TO NOW] (endDate:[NOW TO *] OR (*:* -endDate:[* TO *]))

I switched to the AND-default because that is the default in my web frontend, so I don't have to change logic. What should I do in this situation? Go back to the OR-default? It is not so much this example I am after, but I have a syntax translator in my application that must be able to handle similar expressions and I want to keep it simple and still have tasty food ;-) -Michael
Re: wildcard newbie question
On 31.01.2008 00:31 Alessandro Senserini wrote: I have a text field type called courseTitle and it contains Struts 2 If I search courseTitle:strut* I get the documents but if I search with courseTitle:struts* I do not get any results. Could you please explain why? Just a guess: It might be because of stemming. Do you have the same effect with words that don't end in an 's' or similar? If my guess is correct, only 'strut' is in the index, not 'struts'. -Michael
Out of heap space with simple updates
I wanted to try to do the daily update with XML updates (which was mentioned recently as the recommended way) but got an OutOfMemoryError: Java heap space after 319000 records. I am sending one document at a time through the http update interface, so every request should be short enough not to run out of memory. Do I have to commit after every few thousand records to avoid the error? My understanding was that I have to do a commit only at the very end. Or are there other things I could try? How can I increase the heap size? I use the included jetty and start solr with java -jar start.jar. After I ran into the error a commit wasn't possible either. What is the best way to avoid this sort of problem? Thanks -Michael
Re: Out of heap space with simple updates
On 23.01.2008 20:57 Chris Harris wrote: I'm using java -Xms512M -Xmx1500M -jar start.jar Thanks! I did see the -X... params in recent threads but didn't know where to place them -- not being a java guy at all ;-) -Michael
Re: Another text I cannot get into SOLR with csv
After a long weekend I could take a deeper look into this one, and it looks as if the problem has to do with splitting.

This one works for me fine.
$ cat t2.csv
id,name
12345,'s-Gravenhage
12345,'s-Gravenhage
12345,s-Gravenhage
$ curl http://localhost:8983/solr/update/csv?commit=true --data-binary @t2.csv -H 'Content-type:text/csv; charset=utf-8'

My csv-file:
DBRECORDID,PUBLPLACE
43298,'s-Gravenhage

The URL (giving a 400 error):
http://localhost:8983/solr/update/csv?f.PUBLPLACE.split=true&commit=true
(PUBLPLACE is defined as a multivalued field)

If I remove the f.PUBLPLACE.split=true parameter OR make sure that the apostrophe is not the first character, everything is fine. But I need the field to be multivalued and thus need the split parameter (not for this record but for others), and as the example shows, some values have an apostrophe as the first character. Any ideas how to deal with this? -Michael
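The failure mode can be illustrated with Python's csv module (an analogy only, not Solr's parser): when the apostrophe is configured as the quote character, a value starting with ' is taken as an opening quote, so a lenient parser silently swallows it, while a stricter one rejects the line:

```python
import csv

# Illustrative line, mirroring the failing record above.
line = "43298,'s-Gravenhage"

# Treating ' as the quote character: the leading apostrophe is consumed
# as an opening quote, so the parsed value silently loses it.
with_quote = next(csv.reader([line], quotechar="'"))

# With quoting disabled, the apostrophe is just data and survives intact.
without_quote = next(csv.reader([line], quoting=csv.QUOTE_NONE))

print(with_quote)
print(without_quote)
```

This matches the behaviour reported: the split-field parser treats the leading apostrophe as an encapsulator, and the fix discussed in the follow-up is to disable that encapsulator.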
Re: Another text I cannot get into SOLR with csv
On 08.01.2008 16:11 Yonik Seeley wrote: Ahh, wait, it looks like a single quote is the encapsulator for split field values by default. Try adding f.PUBLPLACE.encapsulator=%00 to disable the encapsulation.

Hmm. Yes, this works, but:
- I didn't find anything about it in the docs (wiki). On the contrary, it suggests that the single quote has to be explicitly set: f.tags.encapsulator=' (http://wiki.apache.org/solr/UpdateCSV?#head-c238cb494f800d345766acda16e08d82663127ce)
- A literal encapsulator should be possible to add by doubling it (' -> '') but this gives the same error
- is it possible to change the split field separator for all fields? The URL is getting rather long already.
Re: correct escapes in csv-Update files
On 03.01.2008 17:16 Yonik Seeley wrote: CSV doesn't use backslash escaping. http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm "This is text with a ""quoted"" string"

Thanks for the hint, but the result is the same, that is, ""quoted"" behaves exactly like \"quoted\":
- both leave the single unescaped quote in the record: "quoted"
- both have the problem with a backslash before the escaped quote: "This is text with a \""quoted"" string" gives an error "invalid char between encapsulated token end delimiter".

So, is it possible to get a record into the index with csv that originally looks like this?: This is text with an unusual \"combination" of characters

A single quote is no problem: just double it (" -> ""). A single backslash is no problem: just leave it alone (\ -> \). But what about a backslash followed by a quote (\" -> ???) -Michael
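For comparison, Python's csv module (RFC-4180 style: quotes escaped by doubling, backslash has no special meaning) round-trips exactly this problematic value; this only illustrates standard CSV rules, not necessarily what Solr's loader of that era accepts:

```python
import csv
import io

# The value from the question: a backslash immediately before a quote.
value = 'This is text with an unusual \\"combination" of characters'

# Writing: the writer quotes the field and doubles the embedded quotes,
# leaving the backslash alone.
buf = io.StringIO()
csv.writer(buf).writerow(["1", value])
encoded = buf.getvalue()

# Reading it back: \"" inside a quoted field parses as a literal backslash
# followed by one (doubled) quote, recovering the original value.
row = next(csv.reader(io.StringIO(encoded)))
print(row[1] == value)
```

So under plain RFC-4180 rules the answer is `\" -> \""` (leave the backslash, double the quote); the error reported above suggests the parser in use deviated from that.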
Re: Another text I cannot get into SOLR with csv
On 04.01.2008 16:55 Yonik Seeley wrote: On Jan 4, 2008 10:25 AM, Michael Lackhoff [EMAIL PROTECTED] wrote: If the field's value is: 's-Gravenhage I cannot get it into SOLR with CSV.

This one works for me fine.
$ cat t2.csv
id,name
12345,'s-Gravenhage
12345,'s-Gravenhage
12345,s-Gravenhage
$ curl http://localhost:8983/solr/update/csv?commit=true --data-binary @t2.csv -H 'Content-type:text/csv; charset=utf-8'

But you are cheating ;-) This works for me too, but I am using a local csv file for the update:

http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

Perhaps the problem is that I cannot define a charset for the stream.file? -Michael
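One hypothetical workaround along the lines of the curl example above: keep the extra CSV parameters in the URL but send the file in the request body instead of via stream.file, so the Content-type header can carry an explicit charset. A sketch of building such a URL (the parameter names are the ones from the message above):

```python
from urllib.parse import urlencode

# Hypothetical sketch: the separator/split/commit parameters stay in the
# URL, while the file itself would be POSTed in the body (as with
# curl --data-binary) with 'Content-type: text/csv; charset=utf-8'.
params = urlencode({"separator": "\t", "f.SIGNATURE.split": "true", "commit": "true"})
url = "http://localhost:8983/solr/update/csv?" + params
headers = {"Content-type": "text/csv; charset=utf-8"}
print(url)
```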