+(-...) vs +(*:* -...) vs -(+...)
Dear reader, why does +(-x_ss:y) find 0 docs, while -(+x_ss:y) finds many docs? OK... +(*:* -x_ss:y) works, too, but I'm a bit surprised. Kind regards, J. Barth
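The behaviour follows from Lucene's BooleanQuery semantics: prohibited (MUST_NOT) clauses only subtract from the candidate set produced by the positive clauses, so a nested clause list containing *only* a negative clause yields nothing, and `*:*` supplies the missing universe. (As far as I know, Solr special-cases a purely negative query only at the *top level* of the lucene parser, which is why `-(+x_ss:y)` still works.) A toy simulation, not the real Lucene API:

```python
# Toy model of Lucene BooleanQuery semantics (not the actual library):
# MUST_NOT clauses only filter the docs produced by positive clauses.

def boolean_query(docs, must=(), must_not=()):
    """Evaluate a simplified BooleanQuery over an iterable of doc dicts.

    With no positive clause the candidate set is empty, mirroring why a
    nested purely negative query like +(-x_ss:y) finds 0 docs."""
    if must:
        candidates = [d for d in docs if all(m(d) for m in must)]
    else:
        candidates = []   # nothing to subtract from
    return [d for d in candidates if not any(n(d) for n in must_not)]

docs = [{"x_ss": "y"}, {"x_ss": "z"}, {}]
match_all = lambda d: True                  # plays the role of *:*
is_y      = lambda d: d.get("x_ss") == "y"

# +(-x_ss:y): only a MUST_NOT clause -> 0 docs
assert boolean_query(docs, must_not=[is_y]) == []
# +(*:* -x_ss:y): *:* supplies the candidate universe -> 2 docs
assert boolean_query(docs, must=[match_all], must_not=[is_y]) == [{"x_ss": "z"}, {}]
```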
graph query delete: this IndexWriter is closed??
Dear reader, I'm still using Solr 8.1.1 because of this: https://issues.apache.org/jira/browse/SOLR-13738 I tried to delete approx. 25,000 Solr docs (each ca. 1 kB) using this query: curl http://serv7:8982/solr/Suchindex/update -H "Content-type: text/xml" --data-binary '+id:d-nb.info* -rnd_d:[0 TO 10] -id:*#* -_query_:"{!graph from=id to=parent_ids}class_s:meta"' and now get 2020-02-21 13:47:03.523 ERROR (qtp548482954-59) [ x:Suchindex] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: this IndexWriter is closed Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed Caused by: java.lang.ClassCastException: class org.apache.lucene.search.IndexSearcher cannot be cast to class org.apache.solr.search.SolrIndexSearcher (org.apache.lucene.search.IndexSearcher and org.apache.solr.search.SolrIndexSearcher are in unnamed module of loader org.eclipse.jetty.webapp.WebAppClassLoader @c1fca1e) Oops... even commit does not work. A rollback helps. Deleting without the -_query_:"..." part works. Kind regards, Jochen -- Jochen Barth * Universitätsbibliothek Heidelberg, IT * Telefon 06221 54-2580
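The mail shows only the query string as the POST body; Solr's XML update handler expects a delete-by-query wrapped in `<delete><query>...</query></delete>` (whether that wrapper was merely elided in the mail is an assumption). A minimal sketch building the payload, with host, core, and the query minus the problematic graph clause taken from the mail:

```python
# Sketch: build the <delete><query>...</query></delete> payload that
# Solr's /update handler expects for an XML delete-by-query.
from xml.sax.saxutils import escape

def delete_by_query_xml(query: str) -> str:
    """Wrap a Solr query string in a delete-by-query XML update body."""
    return "<delete><query>%s</query></delete>" % escape(query)

# The query from the mail, without the -_query_:"{!graph ...}" clause
# that triggered the AlreadyClosedException:
payload = delete_by_query_xml('+id:d-nb.info* -rnd_d:[0 TO 10] -id:*#*')
assert payload == '<delete><query>+id:d-nb.info* -rnd_d:[0 TO 10] -id:*#*</query></delete>'
# POST it with:
# curl http://serv7:8982/solr/Suchindex/update -H "Content-type: text/xml" --data-binary "$payload"
```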
WITHDRAWN! Re: q.op=AND vs default (q.op=OR)
Mea culpa ... I ran the different queries against two different Solr instances. Everything works fine. Kind regards, Jochen
Re: q.op=AND vs default (q.op=OR)
Found https://cwiki.apache.org/confluence/display/lucene/BooleanQuerySyntax But this does not explain the problem... Oh... and there is a bug in the abbreviated queries: the X behind the second _g_dn should be Y - but this does not affect the further AND/OR/+/- query structure. Jochen On 04.12.19 at 12:39, Jochen Barth wrote: Dear reader, I'm using Solr 8.1.1. I'm trying to switch from q.op=OR to q.op=AND, because the parser with which I generate the queries for Solr is somewhat simpler to develop with q.op=AND. But the new query is returning fewer hits. I have shortened the query string for better readability: _g_up stands for »from=id to=parent_ids«, _g_dn stands for »from=parent_ids to=id«, _tft stands for »-type_s:multivolume_work -type_s:periodical -type_s:issue -type_s:journal«, and I have dropped some lengthy query terms (e.g. text_tei_ft, text_abstract_ft, meta_title_txt, meta_name_txt, ...) (for readability). Here the original query without q.op=AND: q=_query_:"{!graph _g_up}filter(+((_query_:\"{!graph _g_dn}meta_subject_txt:X meta_shelflocator_txt:X\" _query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X text_pdf_ft:X\") (_query_:\"{!graph _g_dn}meta_subject_txt:X meta_shelflocator_txt:X\" _query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X text_pdf_ft:X\") ) +class_s:meta )" Here the new query with q.op=AND: q=_query_:"{!graph _g_up}filter(((_query_:\"{!graph _g_dn}meta_subject_txt:X OR meta_shelflocator_txt:X\" OR _query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X OR text_pdf_ft:X\") OR (_query_:\"{!graph _g_dn}meta_subject_txt:X OR meta_shelflocator_txt:X\" OR _query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X OR text_pdf_ft:X\") ) class_s:meta )" For comparison, the whole (unshortened) query as reported by debug=query from Solr; without q.op=AND: # parsedquery: "GraphQuery([[filter(+(([[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\"
meta_subject_txt:\"h schliemann\" meta_shelflocator_txt:\"h schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] [[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\" text_ocr_ft:\"h schliemann\" text_heidicon_ft:\"h schliemann\" text_watermark_ft:\"h schliemann\" text_catalogue_ft:\"h schliemann\" text_index_ft:\"h schliemann\" text_tei_ft:\"h schliemann\" text_abstract_ft:\"h schliemann\" text_pdf_ft:\"h schliemann\"],id=parent_ids] [TraversalFilter: class_s:meta -type_s:multivolume_work -type_s:periodical -type_s:issue -type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false]) ([[meta_title_txt:\"henry schliemann\" meta_name_txt:\"henry schliemann\" meta_subject_txt:\"henry schliemann\" meta_shelflocator_txt:\"henry schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] [[meta_title_txt:\"henry schliemann\" meta_name_txt:\"henry schliemann\" text_ocr_ft:\"henry schliemann\" text_heidicon_ft:\"henry schliemann\" text_watermark_ft:\"henry schliemann\" text_catalogue_ft:\"henry schliemann\" text_index_ft:\"henry schliemann\" text_tei_ft:\"henry schliemann\" text_abstract_ft:\"henry schliemann\" text_pdf_ft:\"henry schliemann\"],id=parent_ids] [TraversalFilter: class_s:meta -type_s:multivolume_work -type_s:periodical -type_s:issue -type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])) +class_s:meta)],id=parent_ids][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])" with q.op=AND: # parsedquery: "+GraphQuery([[+filter(+(([[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\" meta_subject_txt:\"h schliemann\" meta_shelflocator_txt:\"h schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] [[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\" text_ocr_ft:\"h schliemann\" text_heidicon_ft:\"h schliemann\" text_watermark_ft:\"h schliemann\" 
text_catalogue_ft:\"h schliemann\" text_index_ft:\"h schliemann\" text_tei_ft:\"h schliemann\" text_abstract_ft:\"h schliemann\" text_pdf_ft:\"h schliemann\"],id=parent_ids] [TraversalFilter: +class_s:meta -type_s:multivolume_work -type_s:periodical -type_s:issue -type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false
q.op=AND vs default (q.op=OR)
xt:\"henry schliemann\" text_ocr_ft:\"henry schliemann\" text_heidicon_ft:\"henry schliemann\" text_watermark_ft:\"henry schliemann\" text_catalogue_ft:\"henry schliemann\" text_index_ft:\"henry schliemann\" text_tei_ft:\"henry schliemann\" text_abstract_ft:\"henry schliemann\" text_pdf_ft:\"henry schliemann\"],id=parent_ids] [TraversalFilter: +class_s:meta -type_s:multivolume_work -type_s:periodical -type_s:issue -type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])) +class_s:meta)],id=parent_ids][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])" wdiff of both: "GraphQuery([[filter(+(([[meta_title_txt:\"h"+GraphQuery([[+filter(+(([[meta_title_txt:\"h ... [TraversalFilter:class_s:meta +class_s:meta -type_s:multivolume_work ... [TraversalFilter:class_s:meta +class_s:meta -type_s:multivolume_work ... So the + before the »filter(« shouldn't be strictly necessary nor be the problem, and the + before class_s:meta isn't necessary either, but can't be the problem either, in my opinion. What I found out is that "+" and "-" have higher precedence than "AND" and "OR"... but I don't see my error... Does someone have a hint for me? Kind regards, Jochen
filter in JSON Query DSL
Dear reader, this query works as expected: curl -XGET http://localhost:8982/solr/Suchindex/query -d ' {"query": { "bool": { "must": "*:*" } }, "filter": [ "meta_subject_txt:globe" ] }' but this one does not (nor without the curly braces around "filter"): curl -XGET http://localhost:8982/solr/Suchindex/query -d ' {"query": { "bool": { "must": [ "*:*", { "filter": [ "meta_subject_txt:globe" ] } ] } } }' Is "filter" possible within deeper (nested) queries? I've got some complex queries with a "kernel" somewhat below the top level... Is "canonical" JSON important for matching a query cache entry? Would it help to serialize these queries to standard syntax and then use filter(...)? Kind regards, Jochen
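A bare `{"filter": [...]}` object is not itself a query, which is why it is rejected inside the `must` list. One possible rewrite keeps the filter attached to the nested bool as a clause of its own; this is a sketch under the assumption that the BoolQParser in your Solr version accepts a `filter` clause next to `must` (verify with debug=query). Serializing the kernel to standard syntax and wrapping it in `filter(...)` is the other route to filterCache reuse, as the mail suggests.

```python
import json

# Sketch (assumption: the bool query parser accepts a "filter" clause
# alongside "must"): the filter moves up to be a clause of the bool,
# instead of a bare {"filter": [...]} entry inside the "must" list.
query = {
    "query": {
        "bool": {
            "must": ["*:*"],
            "filter": ["meta_subject_txt:globe"],
        }
    }
}
body = json.dumps(query)
assert json.loads(body)["query"]["bool"]["filter"] == ["meta_subject_txt:globe"]
# curl -XGET http://localhost:8982/solr/Suchindex/query -d "$body"
```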
solr-8.1.1 -> solr-8.2.0, "lucene... cannot be cast"
$ReservedThread.run(ReservedThreadExecutor.java:366) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917) at java.base/java.lang.Thread.run(Thread.java:834) start=0=50=sort_title_s%20asc=id=true=1=type_s=meta_name_ss=meta_periodical_title_s=meta_subject_ss=meta_date_dtrs_date_dtrs.facet.range.start=-01-01T00%3A00%3A00Z_date_dtrs.facet.range.end=2100-01-01T00%3A00%3A00Z_date_dtrs.facet.range.gap=%2B1YEAR={"query":{"bool":{"must":[{"bool":{"must":["sort_shelflocator_s:cod\\ pal\\ lat\\ 00*"],"should":[{"graph":{"from":"parent_ids","query":"parent_ids:\"/digi.ub.uni-heidelberg.de/collection/sammlung51\"","to":"id"}},{"graph":{"from":"parent_ids","query":"parent_ids:\"/digi.ub.uni-heidelberg.de/collection/sammlung52\"","to":"id"}}]}},"class_s:meta"],"must_not":[{"join":{"from":"id","query":{"bool":{"must":[{"bool":{"must":["sort_shelflocator_s:cod\\ pal\\ lat\\ 00*"],"should":[{"graph":{"from":"parent_ids","query":"parent_ids:\"/digi.ub.uni-heidelberg.de/collection/sammlung51\"","to":"id"}},{"graph":{"from":"parent_ids","query":"parent_ids:\"/digi.ub.uni-heidelberg.de/collection/sammlung52\"","to":"id"}}]}},"class_s:meta"]}},"to":"parent_ids"}}]}}}
Re: facetting+tagging in JSON Query DSL
Oops... my Thunderbird did not preserve the color... the keywords to look for in the query are »facet.field=%7B%21ex%3Dtype_s%7Dtype_s« and »{"#type_s":"type_s:article"}«. Kind regards, Jochen
facetting+tagging in JSON Query DSL
Dear reader, I'm trying to do this: https://lucene.apache.org/solr/guide/8_1/faceting.html#tagging-and-excluding-filters with the JSON Query DSL: https://lucene.apache.org/solr/guide/8_1/json-query-dsl.html#tagging-in-json-query-dsl here is the complete query - essential parts in red (see below): But it does not seem to work - does this perhaps have to do with bool:{should:[...]} in {filter:...}? Kind regards, Jochen start=0=50=sort_title_s%20asc=id=true=1=%7B%21ex%3Dtype_s%7Dtype_s=meta_name_ss=meta_periodical_title_s=meta_subject_ss=meta_date_dtrs_date_dtrs.facet.range.start=-01-01T00%3A00%3A00Z_date_dtrs.facet.range.end=2100-01-01T00%3A00%3A00Z_date_dtrs.facet.range.gap=%2B1YEAR={"filter":[{"bool":{"should":[{"#type_s":"type_s:article"}]}}],"query":{"bool":{"must":[{"bool":{"should":[{"bool":{"should":[{"graph":{"from":"parent_ids","query":"meta_title_txt:sonne meta_name_txt:sonne meta_subject_txt:sonne meta_shelflocator_txt:sonne","to":"id"}},{"graph":{"from":"id","query":"text_ocr_ft:sonne text_heidicon_ft:sonne text_watermark_ft:sonne text_catalogue_ft:sonne text_index_ft:sonne text_tei_ft:sonne text_abstract_ft:sonne text_pdf_ft:sonne","to":"parent_ids","traversalFilter":"class_s:meta -type_s:multivolume_work -type_s:periodical -type_s:issue -type_s:journal"}}]}}]}},"class_s:meta"],"must_not":[{"join":{"from":"parent_ids","query":{"bool":{"must":[{"bool":{"should":[{"bool":{"should":[{"graph":{"from":"parent_ids","query":"meta_title_txt:sonne meta_name_txt:sonne meta_subject_txt:sonne meta_shelflocator_txt:sonne","to":"id"}},{"graph":{"from":"id","query":"text_ocr_ft:sonne text_heidicon_ft:sonne text_watermark_ft:sonne text_catalogue_ft:sonne text_index_ft:sonne text_tei_ft:sonne text_abstract_ft:sonne text_pdf_ft:sonne","to":"parent_ids","traversalFilter":"class_s:meta -type_s:multivolume_work -type_s:periodical -type_s:issue -type_s:journal"}}]}}]}},"class_s:meta"]}},"to":"id"}}]}}}
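One likely culprit is exactly the nesting the mail suspects: the `{"#tag": ...}` wrapper tags an entry of the top-level `filter` list, and burying it inside a `bool`/`should` may keep it from being recognized (an assumption to verify with debug=query). A sketch that tags the filter entry directly and excludes it on the facet side via the JSON Facet API's `excludeTags`, instead of the classic `facet.field={!ex=...}` parameter:

```python
import json

# Sketch: the {"#tag": ...} wrapper goes directly on the top-level
# "filter" entry; the facet ignores it via domain.excludeTags.
request = {
    "query": "*:*",                              # stand-in for the real query
    "filter": [{"#type_s": "type_s:article"}],   # tag applies to this entry
    "facet": {
        "type_s": {
            "type": "terms",
            "field": "type_s",
            "domain": {"excludeTags": "type_s"},  # count docs as if the filter were absent
        }
    },
}
body = json.dumps(request)
assert json.loads(body)["filter"] == [{"#type_s": "type_s:article"}]
```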
Re: graph query parser: depth dependent score?
Dear reader, I've found a different solution for my problem and don't need a depth-dependent score anymore. Kind regards, Jochen
graph / boolean performance
Dear reader, I have queries of the following kind: +( X ) - {!join from=parent_ids to=id}( X ) where X is a {!graph} query. Is there a way to tell Solr to cache the result of "X", because the result is needed again within the whole query (inside the {!join...})? Example query (JSON): { "query" : { "bool" : { "must" : [ { "graph" : { "from" : "id", "query" : "fulltext_ocr_txtlarge:troja", "to" : "parent_ids", "useAutn" : "true" } } ], "must_not" : [ { "join" : { "from" : "parent_ids", "query" : { "bool" : { "must" : [ { "graph" : { "from" : "id", "query" : "fulltext_ocr_txtlarge:troja", "to" : "parent_ids", "useAutn" : "true" } } ] } }, "to" : "id" } } ] } } } Kind regards, Jochen
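One way to get reuse is the standard-syntax route: define the graph subquery once as a request parameter, reference it twice via parameter dereferencing, and wrap it in `filter(...)` so its DocSet lands in the filterCache. A sketch, assuming `v=$gx` dereferencing, `{!query}`, and `filter()` behave here as documented for the lucene parser (verify with debug=query):

```python
# Sketch: the shared graph subquery X is defined once as parameter "gx",
# wrapped in filter(...) so its DocSet is cached in the filterCache, and
# referenced from both the positive clause and the {!join} clause.
params = {
    "gx": "{!graph from=id to=parent_ids useAutn=true}fulltext_ocr_txtlarge:troja",
    "q": '+filter(_query_:"{!query v=$gx}") '
         '-_query_:"{!join from=parent_ids to=id}filter(_query_:\\"{!query v=$gx}\\")"',
}
assert params["q"].count("$gx") == 2   # one definition, two uses
```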
graph query parser: depth dependent score?
Dear reader, I have a hierarchical graph "like a book": { id:solr_doc1; title:book } { id:solr_doc2; title:chapter; parent_ids: solr_doc1 } { id:solr_doc3; title:subchapter; parent_ids: solr_doc2 } etc. Now to match all docs under "book" and "chapter" I could do: +_query_:"{!graph from=parent_ids to=id}title:book" +_query_:"{!graph from=parent_ids to=id}title:chapter" The result would be solr_doc2 and solr_doc3; but is there a way to "boost" or "put a higher score" on solr_doc2 than on solr_doc3 because of the direct match (and not via {!graph...})? The only way to do so seems to be a {!boost} before {!graph}, but what I can do there does not depend on the match nor on {!graph}, I think. Kind regards, Jochen
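There is no depth-aware scoring in {!graph} as far as I know, but the effect can be approximated: add the plain query as an extra optional clause next to the graph clause, so direct matches collect an additional score contribution. A sketch of the combined query string (the ^2 boost factor is an illustrative choice, not from the original):

```python
# Sketch: the graph clause keeps the result set (solr_doc2 and solr_doc3),
# while the extra boosted direct clause lifts solr_doc2's score because it
# matches title:chapter directly.
q = ('+_query_:"{!graph from=parent_ids to=id}title:book" '
     '+(_query_:"{!graph from=parent_ids to=id}title:chapter" '
     '(title:chapter)^2)')
assert q.count("{!graph") == 2 and "^2" in q
```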
Highlighting: single id:... x1000 vs (id: OR id: ... x1000)
Dear reader, I'd like to highlight very large OCR docs (termvectors=true etc.). Therefore I've made a separate highlight-store collection, where I want to highlight ids selected by another query against a separate collection (containing the same ids). Now, querying like this: q=ocr:abc AND id:x1 hl=true hl.fl=ocr hl.useFastVector... ... q=ocr:abc AND id:x2 hl=true hl.fl=ocr hl.useFastVector.. q=ocr:abc AND id:x3 hl=true hl.fl=ocr hl.useFastVector.. q=ocr:abc AND id:x4 hl=true hl.fl=ocr hl.useFastVector.. ... up to x1000, runs very much faster than q=ocr:abc AND (id:x1 OR id:x2 OR id:x3 OR id... ... id:x1000) Why? Kind regards, Jochen Barth -- J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580 pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
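One plausible reason: the OR variant is parsed and scored as a single BooleanQuery with 1000 term clauses, while each per-id query is tiny. If your Solr has the terms query parser (available since 4.10), the ids can instead go into a non-scoring `fq`, which matches them as one set lookup and is filterCache-friendly. A sketch (parameter values are illustrative):

```python
# Sketch: instead of OR-ing 1000 scored id clauses into one BooleanQuery,
# pass the ids as a non-scoring fq via the terms query parser.
ids = ["x%d" % i for i in range(1, 1001)]
params = {
    "q": "ocr:abc",
    "fq": "{!terms f=id}" + ",".join(ids),   # one set-based filter
    "hl": "true",
    "hl.fl": "ocr",
}
assert params["fq"].startswith("{!terms f=id}x1,")
assert params["fq"].count(",") == 999
```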
Re: Stored vs non-stored very large text fields
I found out that storing documents as separate docs+id does not help either. You must have a completely separate collection/core to get things working fast. Kind regards, Jochen Quoting Jochen Barth ba...@ub.uni-heidelberg.de: Ok, https://wiki.apache.org/solr/SolrPerformanceFactors states that: Retrieving the stored fields of a query result can be a significant expense. This cost is affected largely by the number of bytes stored per document--the higher byte count, the sparser the documents will be distributed on disk and more I/O is necessary to retrieve the fields (usually this is a concern when storing large fields, like the entire contents of a document). But in my case (with docValues=true) there should be no reason to access *.fdt. Kind regards, Jochen Quoting Jochen Barth ba...@ub.uni-heidelberg.de: Something is really strange here: even when configuring the fields id + sort_... with docValues=true -- so there's nothing to get from the stored-documents file -- performance is still terrible with ocr stored=true, _even_ with my patch which stores uncompressed like solr4.0.0 (checked with strings -a on *.fdt). Just reading http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html .. perhaps things will clear up soon (will check if splitting to indexed+non-stored and non-indexed+stored could help here) Kind regards, J. Barth Quoting Shawn Heisey s...@elyograg.org: On 4/29/2014 4:20 AM, Jochen Barth wrote: BTW: stored field compression: are all stored fields within a document put into one compressed chunk, or is it per field? Here's the issue that added the compression to Lucene: https://issues.apache.org/jira/browse/LUCENE-4226 It was made the default stored field format for Lucene, which also made it the default for Solr. At this time, there is no way to remove compression on Solr without writing custom code. I filed an issue to make it configurable, but I don't know how to do it. Nobody else has offered a solution either.
One day I might find some time to take a look at the issue and see if I can solve it myself. https://issues.apache.org/jira/browse/SOLR-4375 Here's the author's blog post that goes into more detail than the LUCENE issue: http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene Thanks, Shawn
Stored vs non-stored very large text fields
Dear reader, I'm trying to use Solr for a hierarchical search: metadata from the higher-level elements is copied to the lower ones, and each element has the complete OCR text that belongs to it. At volume level, of course, we will have the complete OCR text in one doc, and we need to store it for highlighting. My Solr instance is configured like this: java -Xms12000m -Xmx12000m -jar start.jar [ imported with 4.7.0, performance tests with 4.8.0 ] Solr index files are of this size:
0.013gb .tip The index into the Term Dictionary
0.017gb .nvd Encodes length and boost factors for docs and fields
0.546gb .tim The term dictionary, stores term info
1.332gb .doc Contains the list of docs which contain each term along with frequency
4.943gb .pos Stores position information about where a term occurs in the index
12.743gb .tvd Contains information about each document that has term vectors
17.340gb .fdt The stored fields for documents (ocr)
Configuring the ocr field as non-stored, I get these performance measures (see docs/s) after warmup: jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s asc,id+asc rows=100 time: 3.96 s bytes: 1.878 MB 64768 docs found; got 64768 docs 16353 docs/s; 0.474 MB/s ... and with ocr stored, even _not_ requesting ocr with fl=..., with disabled documentCache (<documentCache class="solr.LRUCache" ... />) and <enableLazyFieldLoading>false</enableLazyFieldLoading> [ with documentCache and enableLazyFieldLoading the results are even worse ] ... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51): jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s asc,id+asc rows=100 time: 61.58 s bytes: 1.878 MB 64768 docs found; got 64768 docs 1052 docs/s; 0.030 MB/s ...
using solr-4.8.0 and oracle-jdk1.7.0_55: jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s asc,id+asc rows=100 time: 58.80 s bytes: 1.878 MB 64768 docs found; got 64768 docs 1102 docs/s; 0.032 MB/s Is there any reason why stored vs non-stored is 16 times slower? Is there a way to store the ocr field in a separate index or something like this? Kind regards, J. Barth
Re: Stored vs non-stored very large text fields
On 29.04.2014 11:19, Alexandre Rafalovitch wrote:
> Couple of random thoughts:
> 1) The latest (4.8) Solr has support for nested documents, as well as for expand components. Maybe that will let you have a more efficient architecture: http://heliosearch.org/expand-block-join/
Yes, I've seen this, but as far as I understood you have to know on which nesting level you do your query. My search should work on any level, say, volume title 1986 chapter 1.1 author marc chapter 1.1.3 title does not matter chapter 1.1.3.1 title abc chapter 1.1.3.2 title xyz should match by querying +author:marc +title:abc // or // +author:marc +title:xyz but // not // +title:abc +title:xyz (we'll have an unknown number of levels)
> 2) Do you return OCR text to the client? Or just search it? If just search it, you don't need to store it
I want to get highlighted snippets.
> 3) If you do need to store it and return it, do you always have to return it? If not, you could look at lazy-loading the field (setting in solrconfig.xml).
Let's see, perhaps this is a sorting problem which could be solved by setting the fields id and sort_... to docValues=true.
> 4) Is OCR text or image? The stored fields are compressed by default, I wonder if the compression/decompression of a large image is an issue.
Text.
> 5) JDK 8 apparently makes Lucene much happier (speed of some operations). Might be something to test if all else fails.
Ok... Thanks, J. Barth
> Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
Re: Stored vs non-stored very large text fields
BTW: stored field compression: are all stored fields within a document are put into one compressed chunk, or by per-field basis? Kind regards, J. Barth Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth ba...@ub.uni-heidelberg.de wrote: Dear reader, I'm trying to use solr for a hierarchical search: metadata from the higher-levelled elements is copied to the lower ones, and each element has the complete ocr text which it belongs to. At volume level, of course, we will have the complete ocr text in one doc and we need to store it for highlighting. My solr instance is configured like this: java -Xms12000m -Xmx12000m -jar start.jar [ imported with 4.7.0, performance tests with 4.8.0 ] Solr index files are of this size: 0.013gb .tip The index into the Term Dictionary 0.017gb .nvd Encodes length and boost factors for docs and fields 0.546gb .tim The term dictionary, stores term info 1.332gb .doc Contains the list of docs which contain each term along with frequency 4.943gb .pos Stores position information about where a term occurs in the index 12.743gb .tvd Contains information about each document that has term vectors 17.340gb .fdt The stored fields for documents ocr Configuring the ocr field as non-stored I'll get those performance measures (see docs/s) after warmup: jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s asc,id+asc rows=100 time: 3.96 s bytes: 1.878 MB 64768 docs found; got 64768 docs 16353 docs/s; 0.474 MB/s ... and with ocr stored, even _not_ requesting ocr with fl=... with disabled documentCache class=solr.LRUCache ... / and enableLazyFieldLoadingfalse/enableLazyFieldLoading [ with documentCache and enableLazyFieldLoading results are even worser ] ... 
using solr-4.7.0 and Ubuntu 12.04 openjdk7 (...u51):

jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s+asc,id+asc rows=100
time: 61.58 s; bytes: 1.878 MB; 64768 docs found; got 64768 docs; 1052 docs/s; 0.030 MB/s

... using solr-4.8.0 and oracle-jdk1.7.0_55:

jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s+asc,id+asc rows=100
time: 58.80 s; bytes: 1.878 MB; 64768 docs found; got 64768 docs; 1102 docs/s; 0.032 MB/s

Is there any reason why stored vs. non-stored is 16 times slower? Is there a way to keep the ocr field in a separate index, or something like that?

Kind regards, J. Barth

-- J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580 pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
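Regarding the chunking question above: LUCENE-4226's CompressingStoredFieldsFormat compresses whole documents together in chunks (16 KB by default, the `1 << 14` visible in the patch below), not per field. A toy Python sketch (not Lucene code; the record layout is invented for illustration) of why fetching even a tiny stored field can be expensive when documents carry a large OCR blob:

```python
import zlib

# Toy model: each "document" has a tiny id field and a large ocr field.
docs = [{"id": f"doc-{i}", "ocr": ("word " * 4000)} for i in range(8)]

# Chunked compression: serialize several whole documents into one block
# and compress the block as a unit (roughly what Lucene 4.1+ does).
chunk = "".join(d["id"] + "\x00" + d["ocr"] + "\x00" for d in docs).encode()
compressed_chunk = zlib.compress(chunk)

# To read just doc-5's id, the whole chunk must be decompressed first.
inflated = zlib.decompress(compressed_chunk).decode()
records = inflated.split("\x00")  # fields alternate: [id, ocr, id, ocr, ...]
assert records[10] == "doc-5"

print(f"inflated {len(chunk)} bytes to read one {len(records[10])}-byte field")
```

So even `fl=id` pays for decompressing every large neighbour in the same chunk, which matches the slowdown measured above.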
Re: Stored vs non-stored very large text fields
Dear Shawn,

see the attachment for my first brute-force no-compression attempt.

Kind regards, Jochen

Quoting Shawn Heisey s...@elyograg.org: On 4/29/2014 4:20 AM, Jochen Barth wrote: BTW: stored field compression: are all stored fields within a document put into one compressed chunk, or per field? Here's the issue that added the compression to Lucene: https://issues.apache.org/jira/browse/LUCENE-4226 It was made the default stored field format for Lucene, which also made it the default for Solr. At this time, there is no way to remove compression in Solr without writing custom code. I filed an issue to make it configurable, but I don't know how to do it. Nobody else has offered a solution either. One day I might find some time to take a look at the issue and see if I can solve it myself. https://issues.apache.org/jira/browse/SOLR-4375 Here's the author's blog post that goes into more detail than the LUCENE issue: http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene Thanks, Shawn

diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2014-04-29 13:58:27.0 +0200
***************
*** 38,43 ****
--- 38,44 ----
  import org.apache.lucene.codecs.lucene40.Lucene40NormsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat;
+ import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentInfo;
  import org.apache.lucene.store.Directory;
***************
*** 56,62 ****
  @Deprecated
  public class Lucene41Codec extends Codec {
    // TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new CompressingStoredFieldsFormat("Lucene41StoredFields", CompressionMode.FAST, 1 << 14) {
      @Override
      public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
        throw new UnsupportedOperationException("this codec can only be used for reading");
--- 57,63 ----
  @Deprecated
  public class Lucene41Codec extends Codec {
    // TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat() {
      @Override
      public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
        throw new UnsupportedOperationException("this codec can only be used for reading");

diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2014-04-29 13:57:08.0 +0200
***************
*** 32,38 ****
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene41.Lucene41StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
--- 32,38 ----
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
***************
*** 53,59 ****
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsFormat();
    private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
    private final FieldInfosFormat fieldInfosFormat = new Lucene42FieldInfosFormat();
    private final SegmentInfoFormat infosFormat = new Lucene40SegmentInfoFormat();
--- 53,59 ----
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();
    private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
    private final FieldInfosFormat fieldInfosFormat = new Lucene42FieldInfosFormat();
    private final SegmentInfoFormat infosFormat = new Lucene40SegmentInfoFormat();
Re: Stored vs non-stored very large text fields
Something is really strange here: even when configuring the fields id + sort_... with docValues=true -- so there is nothing to fetch from the stored-documents file -- performance is still terrible with ocr stored=true, _even_ with my patch, which stores uncompressed like solr-4.0.0 (checked with strings -a on *.fdt).

Just reading http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html ... perhaps things will clear up soon (I will check whether splitting into an indexed+non-stored and a non-indexed+stored field could help here).

Kind regards, J. Barth

Quoting Shawn Heisey s...@elyograg.org: [...]
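The split mentioned above could look like this in schema.xml (field and type names here are illustrative, not from the original schema):

```xml
<!-- searched but not stored: postings/positions only -->
<field name="ocr" type="text_general" indexed="true" stored="false"/>

<!-- stored for retrieval only, never searched -->
<field name="ocr_stored" type="text_general" indexed="false" stored="true"/>

<!-- index-time copy, so clients send only ocr_stored -->
<copyField source="ocr_stored" dest="ocr"/>
```

The idea is that queries touch only the indexed field, while the large stored blob lives in a field that is read exclusively when the client explicitly asks for it.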
Re: Stored vs non-stored very large text fields
Ok, https://wiki.apache.org/solr/SolrPerformanceFactors states that:

"Retrieving the stored fields of a query result can be a significant expense. This cost is affected largely by the number of bytes stored per document -- the higher the byte count, the sparser the documents will be distributed on disk and the more I/O is necessary to retrieve the fields (usually this is a concern when storing large fields, like the entire contents of a document)."

But in my case (with docValues=true) there should be no reason to access *.fdt at all.

Kind regards, Jochen

Quoting Jochen Barth ba...@ub.uni-heidelberg.de: [...]

Quoting Shawn Heisey s...@elyograg.org: [...]
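The expectation above (with docValues, sorting plus fl=id should never touch *.fdt) can be pictured with a toy row-store vs. column-store model. Nothing below is Solr code; it only shows the access-pattern difference between stored fields (row-oriented) and docValues (column-oriented):

```python
# Row store: each entry is a whole document, large ocr blob included.
stored = [
    {"id": i, "sort_name_s": f"name{i:03d}", "ocr": "x" * 10_000}
    for i in range(1000)
]

# Column store (docValues): one compact array per field, no ocr at all.
doc_values = {
    "id": list(range(1000)),
    "sort_name_s": [f"name{i:03d}" for i in range(1000)],
}

# Sorting and returning the first 100 ids needs only the two small columns;
# the row store (and its 10 KB blobs) is never read.
order = sorted(range(1000), key=lambda d: doc_values["sort_name_s"][d])
ids = [doc_values["id"][d] for d in order[:100]]

print(ids[:3])
```

If retrieval is still slow with docValues in place, the remaining cost must come from the stored-fields path being exercised anyway, which is exactly the puzzle this thread is chasing.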