+(-...) vs +(*:* -...) vs -(+...)

2020-05-21 Thread Jochen Barth

Dear reader,

why does +(-x_ss:y) find 0 docs,

while -(+x_ss:y) finds many docs?

Ok... +(*:* -x_ss:y) works too, but I'm a bit surprised.
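For what it's worth, as I understand it Solr's lucene parser special-cases purely negative queries only at the *top* level (implicitly adding *:* there), while a nested clause like (-x_ss:y) reaches Lucene as a BooleanQuery with only a prohibited clause, which matches nothing. A runnable sketch of that matching rule (documents and field values are invented for illustration):

```python
# Model of Lucene's BooleanQuery matching rule, illustrating why the
# nested pure-negative clause (-x_ss:y) matches nothing while
# (*:* -x_ss:y) does. Docs and values are made up for illustration.

docs = [
    {"id": "a", "x_ss": "y"},
    {"id": "b", "x_ss": "z"},
]

def bool_matches(doc, positive, prohibited):
    # A BooleanQuery with no positive (MUST/SHOULD) clause has no
    # candidate set to subtract the prohibited clauses from, so it
    # matches no document at all.
    if not positive:
        return False
    return (all(p(doc) for p in positive)
            and not any(p(doc) for p in prohibited))

has_y = lambda d: d["x_ss"] == "y"      # x_ss:y
match_all = lambda d: True              # *:*

# (-x_ss:y): purely negative sub-clause -> no hits
inner_neg = [d["id"] for d in docs if bool_matches(d, [], [has_y])]

# (*:* -x_ss:y): *:* supplies the positive clause -> hits
with_all = [d["id"] for d in docs if bool_matches(d, [match_all], [has_y])]

print(inner_neg, with_all)  # [] ['b']
```

This mirrors the observed behaviour: the explicit *:* in +(*:* -x_ss:y) supplies the positive clause that the inner clause of +(-x_ss:y) lacks, while -(+x_ss:y) is pure-negative at the top level and so gets the implicit *:* rewrite.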

Kind regards, J. Barth



graph query delete: this IndexWriter is closed ??

2020-02-21 Thread Jochen Barth

Dear reader,

still using solr 8.1.1 because of this: 
https://issues.apache.org/jira/browse/SOLR-13738


tried to delete approx. 25,000 solr docs (each ca. 1 kB in size)

using this query:

curl http://serv7:8982/solr/Suchindex/update -H "Content-type: text/xml" --data-binary '<delete><query>+id:d-nb.info* -rnd_d:[0 TO 10] -id:*#* -_query_:"{!graph from=id to=parent_ids}class_s:meta"</query></delete>'


now

2020-02-21 13:47:03.523 ERROR (qtp548482954-59) [   x:Suchindex] 
o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: this 
IndexWriter is closed
Caused by: org.apache.lucene.store.AlreadyClosedException: this 
IndexWriter is closed
Caused by: java.lang.ClassCastException: class 
org.apache.lucene.search.IndexSearcher cannot be cast to class 
org.apache.solr.search.SolrIndexSearcher 
(org.apache.lucene.search.IndexSearcher and 
org.apache.solr.search.SolrIndexSearcher are in unnamed module of loader 
org.eclipse.jetty.webapp.WebAppClassLoader @c1fca1e)


Oops... even commit does not work.

Did a rollback; that helps.

Did the delete without the -_query_:"..." part; that works.

Kind regards.

Jochen

--
Jochen Barth * Universitätsbibliothek Heidelberg, IT * Telefon 06221 54-2580



WITHDRAWN! Re: q.op=AND vs default (q.op=OR)

2019-12-04 Thread Jochen Barth

Mea culpa ...
I ran the different queries against two different solr instances.

Everything works fine.

Kind regards,
Jochen



Am 04.12.19 um 13:20 schrieb Jochen Barth:

Found
https://cwiki.apache.org/confluence/display/lucene/BooleanQuerySyntax

But this does not explain the problem...

Oh... and there is a bug in the abbreviated queries: the X behind the 
second _g_dn should be Y - but this does not affect the further 
AND/OR/+/- query structure.


Jochen


Am 04.12.19 um 12:39 schrieb Jochen Barth:

Dear reader, I'm using solr 8.1.1.

I'm trying to switch from q.op=OR to q.op=AND, because the parser I 
use to generate queries for solr is somewhat simpler to develop 
with q.op=AND.


but the new query returns fewer hits;

I have shortened the query string for better readability; _g_up 
stands for »from=id to=parent_ids«, _g_dn stands for »from=parent_ids 
to=id«,


_tft stands for »-type_s:multivolume_work -type_s:periodical 
-type_s:issue -type_s:journal«


and I have dropped some lengthy query terms (e. g. text_tei_ft, 
text_abstract_ft, meta_title_txt, meta_name_txt, ...) (for readability)


here the original query without q.op=AND

q=_query_:"{!graph _g_up}filter(+((_query_:\"{!graph 
_g_dn}meta_subject_txt:X meta_shelflocator_txt:X\" _query_:\"{!graph 
traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X 
text_pdf_ft:X\") (_query_:\"{!graph _g_dn}meta_subject_txt:X 
meta_shelflocator_txt:X\" _query_:\"{!graph 
traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X 
text_pdf_ft:X\") ) +class_s:meta )"


here the new query with q.op=AND:

q=_query_:"{!graph _g_up}filter(((_query_:\"{!graph 
_g_dn}meta_subject_txt:X OR meta_shelflocator_txt:X\" OR 
_query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" 
_g_up}text_ocr_ft:X OR text_pdf_ft:X\") OR (_query_:\"{!graph 
_g_dn}meta_subject_txt:X OR meta_shelflocator_txt:X\" OR 
_query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" 
_g_up}text_ocr_ft:X OR text_pdf_ft:X\") ) class_s:meta )"



for comparison, the whole (unshortened) query as reported by 
debug=query from solr;


without q.op=AND:

# parsedquery: "GraphQuery([[filter(+(([[meta_title_txt:\"h 
schliemann\" meta_name_txt:\"h schliemann\" meta_subject_txt:\"h 
schliemann\" meta_shelflocator_txt:\"h 
schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] 
[[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\" 
text_ocr_ft:\"h schliemann\" text_heidicon_ft:\"h schliemann\" 
text_watermark_ft:\"h schliemann\" text_catalogue_ft:\"h schliemann\" 
text_index_ft:\"h schliemann\" text_tei_ft:\"h schliemann\" 
text_abstract_ft:\"h schliemann\" text_pdf_ft:\"h 
schliemann\"],id=parent_ids] [TraversalFilter: class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false]) 
([[meta_title_txt:\"henry schliemann\" meta_name_txt:\"henry 
schliemann\" meta_subject_txt:\"henry schliemann\" 
meta_shelflocator_txt:\"henry 
schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] 
[[meta_title_txt:\"henry schliemann\" meta_name_txt:\"henry 
schliemann\" text_ocr_ft:\"henry schliemann\" 
text_heidicon_ft:\"henry schliemann\" text_watermark_ft:\"henry 
schliemann\" text_catalogue_ft:\"henry schliemann\" 
text_index_ft:\"henry schliemann\" text_tei_ft:\"henry schliemann\" 
text_abstract_ft:\"henry schliemann\" text_pdf_ft:\"henry 
schliemann\"],id=parent_ids] [TraversalFilter: class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])) 
+class_s:meta)],id=parent_ids][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])" 



with q.op=AND:

# parsedquery: "+GraphQuery([[+filter(+(([[meta_title_txt:\"h 
schliemann\" meta_name_txt:\"h schliemann\" meta_subject_txt:\"h 
schliemann\" meta_shelflocator_txt:\"h 
schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] 
[[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\" 
text_ocr_ft:\"h schliemann\" text_heidicon_ft:\"h schliemann\" 
text_watermark_ft:\"h schliemann\" text_catalogue_ft:\"h schliemann\" 
text_index_ft:\"h schliemann\" text_tei_ft:\"h schliemann\" 
text_abstract_ft:\"h schliemann\" text_pdf_ft:\"h 
schliemann\"],id=parent_ids] [Trave

Re: q.op=AND vs default (q.op=OR)

2019-12-04 Thread Jochen Barth

Found
https://cwiki.apache.org/confluence/display/lucene/BooleanQuerySyntax

But this does not explain the problem...

Oh... and there is a bug in the abbreviated queries: the X behind the 
second _g_dn should be Y - but this does not affect the further 
AND/OR/+/- query structure.


Jochen


Am 04.12.19 um 12:39 schrieb Jochen Barth:

Dear reader, I'm using solr 8.1.1.

I'm trying to switch from q.op=OR to q.op=AND, because the parser I 
use to generate queries for solr is somewhat simpler to develop with 
q.op=AND.


but the new query returns fewer hits;

I have shortened the query string for better readability; _g_up stands 
for »from=id to=parent_ids«, _g_dn stands for »from=parent_ids to=id«,


_tft stands for »-type_s:multivolume_work -type_s:periodical 
-type_s:issue -type_s:journal«


and I have dropped some lengthy query terms (e. g. text_tei_ft, 
text_abstract_ft, meta_title_txt, meta_name_txt, ...) (for readability)


here the original query without q.op=AND

q=_query_:"{!graph _g_up}filter(+((_query_:\"{!graph 
_g_dn}meta_subject_txt:X meta_shelflocator_txt:X\" _query_:\"{!graph 
traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X 
text_pdf_ft:X\") (_query_:\"{!graph _g_dn}meta_subject_txt:X 
meta_shelflocator_txt:X\" _query_:\"{!graph 
traversalFilter=\\\"class_s:meta _tft\\\" _g_up}text_ocr_ft:X 
text_pdf_ft:X\") ) +class_s:meta )"


here the new query with q.op=AND:

q=_query_:"{!graph _g_up}filter(((_query_:\"{!graph 
_g_dn}meta_subject_txt:X OR meta_shelflocator_txt:X\" OR 
_query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" 
_g_up}text_ocr_ft:X OR text_pdf_ft:X\") OR (_query_:\"{!graph 
_g_dn}meta_subject_txt:X OR meta_shelflocator_txt:X\" OR 
_query_:\"{!graph traversalFilter=\\\"class_s:meta _tft\\\" 
_g_up}text_ocr_ft:X OR text_pdf_ft:X\") ) class_s:meta )"



for comparison, the whole (unshortened) query as reported by 
debug=query from solr;


without q.op=AND:

# parsedquery: "GraphQuery([[filter(+(([[meta_title_txt:\"h 
schliemann\" meta_name_txt:\"h schliemann\" meta_subject_txt:\"h 
schliemann\" meta_shelflocator_txt:\"h 
schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] 
[[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\" 
text_ocr_ft:\"h schliemann\" text_heidicon_ft:\"h schliemann\" 
text_watermark_ft:\"h schliemann\" text_catalogue_ft:\"h schliemann\" 
text_index_ft:\"h schliemann\" text_tei_ft:\"h schliemann\" 
text_abstract_ft:\"h schliemann\" text_pdf_ft:\"h 
schliemann\"],id=parent_ids] [TraversalFilter: class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false]) 
([[meta_title_txt:\"henry schliemann\" meta_name_txt:\"henry 
schliemann\" meta_subject_txt:\"henry schliemann\" 
meta_shelflocator_txt:\"henry 
schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] 
[[meta_title_txt:\"henry schliemann\" meta_name_txt:\"henry 
schliemann\" text_ocr_ft:\"henry schliemann\" text_heidicon_ft:\"henry 
schliemann\" text_watermark_ft:\"henry schliemann\" 
text_catalogue_ft:\"henry schliemann\" text_index_ft:\"henry 
schliemann\" text_tei_ft:\"henry schliemann\" text_abstract_ft:\"henry 
schliemann\" text_pdf_ft:\"henry schliemann\"],id=parent_ids] 
[TraversalFilter: class_s:meta -type_s:multivolume_work 
-type_s:periodical -type_s:issue 
-type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])) 
+class_s:meta)],id=parent_ids][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])" 



with q.op=AND:

# parsedquery: "+GraphQuery([[+filter(+(([[meta_title_txt:\"h 
schliemann\" meta_name_txt:\"h schliemann\" meta_subject_txt:\"h 
schliemann\" meta_shelflocator_txt:\"h 
schliemann\"],parent_ids=id][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false] 
[[meta_title_txt:\"h schliemann\" meta_name_txt:\"h schliemann\" 
text_ocr_ft:\"h schliemann\" text_heidicon_ft:\"h schliemann\" 
text_watermark_ft:\"h schliemann\" text_catalogue_ft:\"h schliemann\" 
text_index_ft:\"h schliemann\" text_tei_ft:\"h schliemann\" 
text_abstract_ft:\"h schliemann\" text_pdf_ft:\"h 
schliemann\"],id=parent_ids] [TraversalFilter: +class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false

q.op=AND vs default (q.op=OR)

2019-12-04 Thread Jochen Barth
xt:\"henry schliemann\" 
text_ocr_ft:\"henry schliemann\" text_heidicon_ft:\"henry schliemann\" 
text_watermark_ft:\"henry schliemann\" text_catalogue_ft:\"henry 
schliemann\" text_index_ft:\"henry schliemann\" text_tei_ft:\"henry 
schliemann\" text_abstract_ft:\"henry schliemann\" text_pdf_ft:\"henry 
schliemann\"],id=parent_ids] [TraversalFilter: +class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])) 
+class_s:meta)],id=parent_ids][maxDepth=-1][returnRoot=true][onlyLeafNodes=false][useAutn=false])"


wdiff of both:

old: "GraphQuery([[filter(+(([[meta_title_txt:\"h
new: "+GraphQuery([[+filter(+(([[meta_title_txt:\"h

old: [TraversalFilter: class_s:meta -type_s:multivolume_work ...
new: [TraversalFilter: +class_s:meta -type_s:multivolume_work ...
(the TraversalFilter difference appears twice)
 ...

so the + before the »filter(« shouldn't be strictly necessary, nor 
should it be the problem,


and the + before class_s:meta isn't necessary either, but can't be the 
problem either, in my opinion.



What I found out is that "+" and "-" have higher precedence than "AND" 
and "OR"... but I don't see my error...


Does someone have a hint for me?


Kind regards,

Jochen





filter in JSON Query DSL

2019-09-27 Thread Jochen Barth

Dear reader,

this query works as expected:

curl -XGET http://localhost:8982/solr/Suchindex/query -d '
{"query": { "bool": { "must": "*:*" } },
"filter": [ "meta_subject_txt:globe" ] }'

this does not (nor without the curly braces around "filter"):

curl -XGET http://localhost:8982/solr/Suchindex/query -d '
{"query": { "bool": { "must": [ "*:*", { "filter": [ 
"meta_subject_txt:globe" ] } ] } } }'


Is "filter" within deeper queries possible?

I've got some complex queries with a "kernel" somewhat below the top 
level...


Is "canonical" JSON important for matching a query cache entry?

Would it help to serialize these queries to standard syntax and then 
use filter(...)?
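One possible direction, sketched under the assumption that this Solr version's bool query parser also accepts a filter clause next to must/must_not/should (worth verifying against the 8.x Reference Guide):

```json
{ "query": { "bool": { "must": [
    { "bool": { "must": [ "*:*" ],
                "filter": [ "meta_subject_txt:globe" ] } }
] } } }
```

If that is not supported, serializing the kernel to standard syntax and putting "_query_:\"filter(meta_subject_txt:globe)\"" into the must list should, as far as I know, also route it through the filterCache.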


Kind regards,

Jochen






solr-8.1.1 -> solr-8.2.0, "lucene... cannot be cast"

2019-08-30 Thread Jochen Barth
$ReservedThread.run(ReservedThreadExecutor.java:366)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)

    at java.base/java.lang.Thread.run(Thread.java:834)


start=0=50=sort_title_s%20asc=id=true=1=type_s=meta_name_ss=meta_periodical_title_s=meta_subject_ss=meta_date_dtrs_date_
dtrs.facet.range.start=-01-01T00%3A00%3A00Z_date_dtrs.facet.range.end=2100-01-01T00%3A00%3A00Z_date_dtrs.facet.range.gap=%2B1YEAR={"query":{"bool":{"must":[{"bool":{"must":["sort_shelflocator_s:cod\
\ pal\\ lat\\ 
00*"],"should":[{"graph":{"from":"parent_ids","query":"parent_ids:\"/digi.ub.uni-heidelberg.de/collection/sammlung51\"","to":"id"}},{"graph":{"from":"parent_ids","query":"parent_ids:\"/digi.ub.uni-heidelberg
.de/collection/sammlung52\"","to":"id"}}]}},"class_s:meta"],"must_not":[{"join":{"from":"id","query":{"bool":{"must":[{"bool":{"must":["sort_shelflocator_s:cod\\ 
pal\\ lat\\ 00*"],"should":[{"graph":{"from":"parent_ids","

query":"parent_ids:\"/digi.ub.uni-heidelberg.de/collection/sammlung51\"","to":"id"}},{"graph":{"from":"parent_ids","query":"parent_ids:\"/digi.ub.uni-heidelberg.de/collection/sammlung52\"","to":"id"}}]}},"class_s:meta"]}}
,"to":"parent_ids"}}]}}}




Re: facetting+tagging in JSON Query DSL

2019-08-26 Thread Jochen Barth

Oops... my Thunderbird did not preserve color...
the keywords to look for in the query are

»facet.field=%7B%21ex%3Dtype_s%7Dtype_s«
and
»{"#type_s":"type_s:article"}«

Kind regards, Jochen

Am 26.08.19 um 15:25 schrieb Jochen Barth:

Dear reader,

I'm trying to do this: 
https://lucene.apache.org/solr/guide/8_1/faceting.html#tagging-and-excluding-filters 



with JSON Query DSL: 
https://lucene.apache.org/solr/guide/8_1/json-query-dsl.html#tagging-in-json-query-dsl


here is the complete query - essential parts in red (see below):

But it does not seem to work - does this perhaps have to do with 
bool:{should:[...]} in {filter:...}?


Kind regards, Jochen

start=0=50=sort_title_s%20asc=id=true=1=%7B%21ex%3Dtype_s%7Dtype_s=meta_name_ss=meta_periodical_title_s=meta_subject_ss=meta_date_dtrs_date_dtrs.facet.range.start=-01-01T00%3A00%3A00Z_date_dtrs.facet.range.end=2100-01-01T00%3A00%3A00Z_date_dtrs.facet.range.gap=%2B1YEAR={"filter":[{"bool":{"should":[{"#type_s":"type_s:article"}]}}],"query":{"bool":{"must":[{"bool":{"should":[{"bool":{"should":[{"graph":{"from":"parent_ids","query":"meta_title_txt:sonne 
meta_name_txt:sonne meta_subject_txt:sonne 
meta_shelflocator_txt:sonne","to":"id"}},{"graph":{"from":"id","query":"text_ocr_ft:sonne 
text_heidicon_ft:sonne text_watermark_ft:sonne text_catalogue_ft:sonne 
text_index_ft:sonne text_tei_ft:sonne text_abstract_ft:sonne 
text_pdf_ft:sonne","to":"parent_ids","traversalFilter":"class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal"}}]}}]}},"class_s:meta"],"must_not":[{"join":{"from":"parent_ids","query":{"bool":{"must":[{"bool":{"should":[{"bool":{"should":[{"graph":{"from":"parent_ids","query":"meta_title_txt:sonne 
meta_name_txt:sonne meta_subject_txt:sonne 
meta_shelflocator_txt:sonne","to":"id"}},{"graph":{"from":"id","query":"text_ocr_ft:sonne 
text_heidicon_ft:sonne text_watermark_ft:sonne text_catalogue_ft:sonne 
text_index_ft:sonne text_tei_ft:sonne text_abstract_ft:sonne 
text_pdf_ft:sonne","to":"parent_ids","traversalFilter":"class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal"}}]}}]}},"class_s:meta"]}},"to":"id"}}]}}}







facetting+tagging in JSON Query DSL

2019-08-26 Thread Jochen Barth

Dear reader,

I'm trying to do this: 
https://lucene.apache.org/solr/guide/8_1/faceting.html#tagging-and-excluding-filters 



with JSON Query DSL: 
https://lucene.apache.org/solr/guide/8_1/json-query-dsl.html#tagging-in-json-query-dsl


here is the complete query - essential parts in red (see below):

But it does not seem to work - does this perhaps have to do with 
bool:{should:[...]} in {filter:...}?


Kind regards, Jochen

start=0=50=sort_title_s%20asc=id=true=1=%7B%21ex%3Dtype_s%7Dtype_s=meta_name_ss=meta_periodical_title_s=meta_subject_ss=meta_date_dtrs_date_dtrs.facet.range.start=-01-01T00%3A00%3A00Z_date_dtrs.facet.range.end=2100-01-01T00%3A00%3A00Z_date_dtrs.facet.range.gap=%2B1YEAR={"filter":[{"bool":{"should":[{"#type_s":"type_s:article"}]}}],"query":{"bool":{"must":[{"bool":{"should":[{"bool":{"should":[{"graph":{"from":"parent_ids","query":"meta_title_txt:sonne 
meta_name_txt:sonne meta_subject_txt:sonne 
meta_shelflocator_txt:sonne","to":"id"}},{"graph":{"from":"id","query":"text_ocr_ft:sonne 
text_heidicon_ft:sonne text_watermark_ft:sonne text_catalogue_ft:sonne 
text_index_ft:sonne text_tei_ft:sonne text_abstract_ft:sonne 
text_pdf_ft:sonne","to":"parent_ids","traversalFilter":"class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal"}}]}}]}},"class_s:meta"],"must_not":[{"join":{"from":"parent_ids","query":{"bool":{"must":[{"bool":{"should":[{"bool":{"should":[{"graph":{"from":"parent_ids","query":"meta_title_txt:sonne 
meta_name_txt:sonne meta_subject_txt:sonne 
meta_shelflocator_txt:sonne","to":"id"}},{"graph":{"from":"id","query":"text_ocr_ft:sonne 
text_heidicon_ft:sonne text_watermark_ft:sonne text_catalogue_ft:sonne 
text_index_ft:sonne text_tei_ft:sonne text_abstract_ft:sonne 
text_pdf_ft:sonne","to":"parent_ids","traversalFilter":"class_s:meta 
-type_s:multivolume_work -type_s:periodical -type_s:issue 
-type_s:journal"}}]}}]}},"class_s:meta"]}},"to":"id"}}]}}}





Re: graph query parser: depth dependent score?

2019-02-27 Thread Jochen Barth

Dear reader, I've found a different solution for my problem
and don't need a depth-dependent score anymore.
Kind regards, Jochen

Am 19.02.19 um 14:42 schrieb Jochen Barth:

Dear reader,

I have a hierarchical graph "like a book":

{ id:solr_doc1; title:book }

{ id:solr_doc2; title:chapter; parent_ids: solr_doc1 }

{ id:solr_doc3; title:subchapter; parent_ids: solr_doc2 }

etc.

Now to match all docs with title:book and title:chapter I could do:

+_query_:"{!graph from=parent_ids to=id}title:book"

+_query_:"{!graph from=parent_ids to=id}title:chapter",

The result would be solr_doc2 and solr_doc3;

but is there a way to "boost" or "put a higher score" on solr_doc2 
than on solr_doc3 because of direct match (and not via {!graph... ) ?



The only way to do so seems to be a {!boost before the {!graph, but 
what I can do there does not depend on the match or the {!graph, I think.



Kind regards,

Jochen






graph / boolean performance

2019-02-27 Thread Jochen Barth

Dear reader

I have queries of the following kind:

+( X )

- {!join from=parent_ids to=id}( X )

X is a {!graph query.

Is there a way to tell Solr to cache the result of "X", because the 
result is needed again within the same query (inside {!join...)?


example query (json):

{
  "query": {
    "bool": {
      "must": [
        {
          "graph": {
            "from": "id",
            "query": "fulltext_ocr_txtlarge:troja",
            "to": "parent_ids",
            "useAutn": "true"
          }
        }
      ],
      "must_not": [
        {
          "join": {
            "from": "parent_ids",
            "query": {
              "bool": {
                "must": [
                  {
                    "graph": {
                      "from": "id",
                      "query": "fulltext_ocr_txtlarge:troja",
                      "to": "parent_ids",
                      "useAutn": "true"
                    }
                  }
                ]
              }
            },
            "to": "id"
          }
        }
      ]
    }
  }
}
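One hedged option (my own assumption, not something confirmed in the thread): wrapping the repeated graph clause in filter(...) makes the lucene parser run it through the filterCache, so the second occurrence inside the {!join} should be answered from the cache, assuming the filterCache is enabled:

```json
{
  "query": {
    "bool": {
      "must": [
        "filter({!graph from=id to=parent_ids useAutn=true}fulltext_ocr_txtlarge:troja)"
      ],
      "must_not": [
        {
          "join": {
            "from": "parent_ids",
            "to": "id",
            "query": "filter({!graph from=id to=parent_ids useAutn=true}fulltext_ocr_txtlarge:troja)"
          }
        }
      ]
    }
  }
}
```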

Kind regards,

Jochen





graph query parser: depth dependent score?

2019-02-19 Thread Jochen Barth

Dear reader,

I have a hierarchical graph "like a book":

{ id:solr_doc1; title:book }

{ id:solr_doc2; title:chapter; parent_ids: solr_doc1 }

{ id:solr_doc3; title:subchapter; parent_ids: solr_doc2 }

etc.

Now to match all docs with title:book and title:chapter I could do:

+_query_:"{!graph from=parent_ids to=id}title:book"

+_query_:"{!graph from=parent_ids to=id}title:chapter",

The result would be solr_doc2 and solr_doc3;

but is there a way to "boost" or "put a higher score" on solr_doc2 than 
on solr_doc3 because of direct match (and not via {!graph... ) ?



The only way to do so seems to be a {!boost before the {!graph, but 
what I can do there does not depend on the match or the {!graph, I think.
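One hedged idea (my own sketch, not from the thread): OR a boosted direct clause with the graph clause, so that solr_doc2, which matches title:chapter directly, outranks solr_doc3, which only matches via the graph traversal:

```
q=+_query_:"{!graph from=parent_ids to=id}title:book"
  +(title:chapter^10 _query_:"{!graph from=parent_ids to=id}title:chapter")
```

Both documents satisfy the second required clause, but only solr_doc2 also picks up the ^10 boost from the direct term match.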



Kind regards,

Jochen




Highlighting: single id:... x1000 vs (id: OR id: ... x1000)

2014-05-15 Thread Jochen Barth
Dear reader,

I'd like to highlight very large ocr docs (termvectors=true etc.).
Therefore I've made a separate highlight-store collection, where I
want to highlight ids selected by another query from a separate
collection (containing the same ids).

Now querying like this:


q=ocr:abc AND id:x1 hl=true hl.fl=ocr hl.useFastVector... ...
q=ocr:abc AND id:x2 hl=true hl.fl=ocr hl.useFastVector..
q=ocr:abc AND id:x3 hl=true hl.fl=ocr hl.useFastVector..
q=ocr:abc AND id:x4 hl=true hl.fl=ocr hl.useFastVector..

... up to x1000 works very much faster than

q=ocr:abc AND (id:x1 OR id:x2 OR id:x3 OR id... ... id:x1000)

Why?
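I can't say for certain why the big OR clause is so much slower, but a hedged alternative for restricting a query to many ids is the terms query parser (added in Solr 4.10, one release after the 4.8 used here), which builds a single set-membership filter instead of 1000 scored boolean clauses. A sketch that assembles such a request (the ids mirror the x1..x1000 placeholders above):

```python
# Build one {!terms} filter query over 1000 ids instead of OR-ing
# 1000 separate id:... clauses. Ids are placeholders from the question.
ids = [f"x{n}" for n in range(1, 1001)]
fq = "{!terms f=id}" + ",".join(ids)

params = {
    "q": "ocr:abc",
    "fq": fq,            # unscored id restriction, filterCache-friendly
    "hl": "true",
    "hl.fl": "ocr",
    "hl.useFastVectorHighlighter": "true",
}
print(fq[:25], len(ids))
```

Keeping the id restriction in fq rather than q also keeps it out of scoring entirely, which is usually what one wants for a pure id whitelist.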

Kind regards,
Jochen Barth


-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc


Re: Stored vs non-stored very large text fields

2014-05-05 Thread Jochen Barth
I found out that storing documents as separate docs+id does not  
help either.

You must have a completely separate collection/core to get things working fast.

Kind regards,
Jochen


Zitat von Jochen Barth ba...@ub.uni-heidelberg.de:


Ok, https://wiki.apache.org/solr/SolrPerformanceFactors

states that: Retrieving the stored fields of a query result can be  
a significant expense. This cost is affected largely by the number  
of bytes stored per document--the higher byte count, the sparser the  
documents will be distributed on disk and more I/O is necessary to  
retrieve the fields (usually this is a concern when storing large  
fields, like the entire contents of a document).


But in my case (with docValues=true) there should be no reason to  
access *.fdt.


Kind regards,
Jochen

Zitat von Jochen Barth ba...@ub.uni-heidelberg.de:


Something is really strange here:

even when configuring fields id + sort_... to docValues=true --  
so there's nothing to get from stored documents file --  
performance is still terrible with ocr stored=true _even_ with my  
patch which stores uncompressed like solr4.0.0 (checked with  
strings -a on *.fdt).


Just reading  
http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html 
... perhaps things will clear up soon (will check if splitting into 
indexed+non-stored and non-indexed+stored fields could help here)



Kind regards,
J. Barth


Zitat von Shawn Heisey s...@elyograg.org:


On 4/29/2014 4:20 AM, Jochen Barth wrote:

BTW: stored field compression:
are all stored fields within a document are put into one  
compressed chunk,

or by per-field basis?


Here's the issue that added the compression to Lucene:

https://issues.apache.org/jira/browse/LUCENE-4226

It was made the default stored field format for Lucene, which also made
it the default for Solr.  At this time, there is no way to remove
compression on Solr without writing custom code.  I filed an issue to
make it configurable, but I don't know how to do it.  Nobody else has
offered a solution either.  One day I might find some time to take a
look at the issue and see if I can solve it myself.

https://issues.apache.org/jira/browse/SOLR-4375

Here's the author's blog post that goes into more detail than the LUCENE
issue:

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks,
Shawn





Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
Dear reader,

I'm trying to use solr for a hierarchical search:
metadata from the higher-levelled elements is copied to the lower ones,
and each element has the complete ocr text which it belongs to.

At volume level, of course, we will have the complete ocr text in one
doc and we need to store it for highlighting.

My solr instance is configured like this:
java -Xms12000m -Xmx12000m -jar start.jar
[ imported with 4.7.0, performance tests with 4.8.0 ]

Solr index files are of this size:
  0.013gb .tip The index into the Term Dictionary
  0.017gb .nvd Encodes length and boost factors for docs and fields
  0.546gb .tim The term dictionary, stores term info
  1.332gb .doc Contains the list of docs which contain each term along
with frequency
  4.943gb .pos Stores position information about where a term occurs in
the index
 12.743gb .tvd Contains information about each document that has term
vectors
 17.340gb .fdt The stored fields for documents ocr

Configuring the ocr field as non-stored I'll get those performance
measures (see docs/s) after warmup:

jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s asc,id+asc
rows=100
time: 3.96 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
16353 docs/s; 0.474 MB/s

... and with ocr stored, even _not_ requesting ocr with fl=... with a
disabled documentCache (<documentCache class="solr.LRUCache" ... />) and
<enableLazyFieldLoading>false</enableLazyFieldLoading>
[ with documentCache and enableLazyFieldLoading the results are even worse ]

... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):
jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s asc,id+asc
rows=100
time: 61.58 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1052 docs/s; 0.030 MB/s

... using solr-4.8.0 and oracle-jdk1.7.0_55 :
jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=jsonq={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s asc,id+asc
rows=100
time: 58.80 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1102 docs/s; 0.032 MB/s

Is there any reason why stored vs non-stored is 16 times slower?
Is there a way to store the ocr field in a separate index or something
like this?

Kind regards,
J. Barth






Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
Am 29.04.2014 11:19, schrieb Alexandre Rafalovitch:
 Couple of random thoughts:
 1) The latest (4.8) Solr has support for nested documents, as well as
 for expand components. Maybe that will let you have more efficient
 architecture: http://heliosearch.org/expand-block-join/

Yes, I've seen this, but as far as I understood you have to know at
which nesting level you do your query.
My search should work on any level, say,

volume title 1986
chapter 1.1 author marc
chapter 1.1.3 title does not matter
chapter 1.1.3.1 title abc
chapter 1.1.3.2 title xyz

should match by querying +author:marc +title:abc // or // +author:marc
+title:xyz
but // not // +title:abc +title:xyz

(we'll have an unknown number of levels)


 2) Do you return OCR text to the client? Or just search it? If just
 search it, you don't need to store it

I'll want to get highlighted snippets.

 3) If you do need to store it and return it, do you always have to
 return it? If not, you could look at lazy-loading the field (setting
 in solrconfig.xml).

Let's see, perhaps this is a sorting problem which could be solved by
setting the fields id and sort_... to docValues=true.

 4) Is OCR text or image? The stored fields are compressed by default,
 I wonder if the compression/decompression of a large image is an
 issue.

Text.


 5) JDK 8 apparently makes Lucene much happier (speed of some
 operations). Might be something to test if all else fails.

Ok...

Thanks,
J. Barth


 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency
 
 
 On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth
 ba...@ub.uni-heidelberg.de wrote:
 Dear reader,

 I'm trying to use solr for a hierarchical search:
 metadata from the higher-levelled elements is copied to the lower ones,
 and each element has the complete ocr text which it belongs to.

 At volume level, of course, we will have the complete ocr text in one
 doc and we need to store it for highlighting.

 My solr instance is configured like this:
 java -Xms12000m -Xmx12000m -jar start.jar
 [ imported with 4.7.0, performance tests with 4.8.0 ]

 Solr index files are of this size:
   0.013gb .tip The index into the Term Dictionary
   0.017gb .nvd Encodes length and boost factors for docs and fields
   0.546gb .tim The term dictionary, stores term info
   1.332gb .doc Contains the list of docs which contain each term along
 with frequency
   4.943gb .pos Stores position information about where a term occurs in
 the index
  12.743gb .tvd Contains information about each document that has term
 vectors
  17.340gb .fdt The stored fields for documents ocr

 Configuring the ocr field as non-stored I'll get those performance
 measures (see docs/s) after warmup:

 jb@serv7:~ perl solr-performance.pl zeit 6
 http://127.0.0.1:58983/solr/collection1/select
 ?wt=json
 q={%21q.op%3dAND}ocr%3A%28zeit%29
 fq=mashed_b%3Afalse
 fl=id
 sort=sort_name_s asc,id+asc
 rows=100
 time: 3.96 s
 bytes: 1.878 MB
 64768 docs found; got 64768 docs
 16353 docs/s; 0.474 MB/s

 ... and with ocr stored, even _not_ requesting ocr with fl=... with a
 disabled documentCache (<documentCache class="solr.LRUCache" ... />) and
 <enableLazyFieldLoading>false</enableLazyFieldLoading>
 [ with documentCache and enableLazyFieldLoading the results are even worse ]

 ... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):
 jb@serv7:~ perl solr-performance.pl zeit 6
 http://127.0.0.1:58983/solr/collection1/select
 ?wt=json
 q={%21q.op%3dAND}ocr%3A%28zeit%29
 fq=mashed_b%3Afalse
 fl=id
 sort=sort_name_s asc,id+asc
 rows=100
 time: 61.58 s
 bytes: 1.878 MB
 64768 docs found; got 64768 docs
 1052 docs/s; 0.030 MB/s

 ... using solr-4.8.0 and oracle-jdk1.7.0_55 :
 jb@serv7:~ perl solr-performance.pl zeit 6
 http://127.0.0.1:58983/solr/collection1/select
 ?wt=jsonq={%21q.op%3dAND}ocr%3A%28zeit%29
 fq=mashed_b%3Afalse
 fl=id
 sort=sort_name_s asc,id+asc
 rows=100
 time: 58.80 s
 bytes: 1.878 MB
 64768 docs found; got 64768 docs
 1102 docs/s; 0.032 MB/s

 Is there any reason why stored vs non-stored is 16 times slower?
 Is there a way to store the ocr field in a separate index or something
 like this?

 Kind regards,
 J. Barth




 --
 J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

 pgp public key:
 http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc



Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
BTW: stored field compression:
are all stored fields within a document put into one compressed chunk,
or is it done on a per-field basis?

Kind regards,
J. Barth



 
 
 
 On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth
 ba...@ub.uni-heidelberg.de wrote:
 Dear reader,

 I'm trying to use solr for a hierarchical search:
 metadata from the higher-levelled elements is copied to the lower ones,
 and each element has the complete ocr text which it belongs to.

 At volume level, of course, we will have the complete ocr text in one
 doc and we need to store it for highlighting.



Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Dear Shawn,

see the attachment for my first brute-force no-compression attempt.

Kind regards,
Jochen


Quoting Shawn Heisey s...@elyograg.org:


On 4/29/2014 4:20 AM, Jochen Barth wrote:

BTW: stored field compression:
are all stored fields within a document put into one compressed chunk,
or is it done on a per-field basis?


Here's the issue that added the compression to Lucene:

https://issues.apache.org/jira/browse/LUCENE-4226

It was made the default stored field format for Lucene, which also made
it the default for Solr.  At this time, there is no way to remove
compression on Solr without writing custom code.  I filed an issue to
make it configurable, but I don't know how to do it.  Nobody else has
offered a solution either.  One day I might find some time to take a
look at the issue and see if I can solve it myself.

https://issues.apache.org/jira/browse/SOLR-4375

Here's the author's blog post that goes into more detail than the LUCENE
issue:

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks,
Shawn
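The win from chunking that LUCENE-4226 exploits is easy to reproduce outside Lucene. The sketch below is only an illustration -- it uses zlib rather than Lucene's actual LZ4-based codec, and the document strings are made up -- but it shows why compressing many small stored documents as one chunk beats compressing each one on its own:

```python
import zlib

def deflated_size(data: bytes) -> int:
    """Size of `data` after DEFLATE compression."""
    return len(zlib.compress(data))

# 100 small, similar "documents" (the field layout here is made up)
docs = [f"id:doc-{i} title:Heidelberg ocr:some repeated boilerplate text".encode()
        for i in range(100)]

per_doc = sum(deflated_size(d) for d in docs)  # each value compressed on its own
chunked = deflated_size(b"".join(docs))        # all values in one compressed chunk

print(f"compressed individually: {per_doc} bytes; as one chunk: {chunked} bytes")
```

Each tiny value compressed alone pays the DEFLATE header overhead and cannot reference redundancy in its neighbours, which is the motivation for grouping documents into ~16 KB chunks (the 1 << 14 in the codec) instead.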



diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2014-04-29 13:58:27.0 +0200
***************
*** 38,43 ****
--- 38,44 ----
  import org.apache.lucene.codecs.lucene40.Lucene40NormsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat;
+ import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentInfo;
  import org.apache.lucene.store.Directory;
***************
*** 56,62 ****
  @Deprecated
  public class Lucene41Codec extends Codec {
// TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new CompressingStoredFieldsFormat("Lucene41StoredFields", CompressionMode.FAST, 1 << 14) {
  @Override
  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
      throw new UnsupportedOperationException("this codec can only be used for reading");
--- 57,63 ----
  @Deprecated
  public class Lucene41Codec extends Codec {
// TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat() {
  @Override
  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
      throw new UnsupportedOperationException("this codec can only be used for reading");
diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2014-04-29 13:57:08.0 +0200
***************
*** 32,38 ****
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene41.Lucene41StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
--- 32,38 ----
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
***************
*** 53,59 ****
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsFormat();
private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
private final FieldInfosFormat fieldInfosFormat = new Lucene42FieldInfosFormat();
private final SegmentInfoFormat infosFormat = new Lucene40SegmentInfoFormat();
--- 53,59 ----
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();
private final TermVectorsFormat vectorsFormat

Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Something is really strange here:

even when configuring fields id + sort_... to docValues=true -- so  
there's nothing to get from stored documents file -- performance is  
still terrible with ocr stored=true _even_ with my patch which stores  
uncompressed like solr4.0.0 (checked with strings -a on *.fdt).


Just reading
http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html
... perhaps things will clear up soon (will check if splitting into
indexed+non-stored and non-indexed+stored fields could help here)
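One way to try that split -- sketched here with hypothetical field names and types, not taken from the actual schema -- is two fields plus a copyField: one indexed-only copy for searching, one stored-only copy for retrieval:

```xml
<!-- searched, but never returned to the client -->
<field name="ocr" type="text_general" indexed="true" stored="false"/>

<!-- returned for display/highlighting input, but never searched -->
<field name="ocr_stored" type="string" indexed="false" stored="true"/>

<copyField source="ocr" dest="ocr_stored"/>
```

Note the stored copy still lives in the same per-segment .fdt files as the small fields; a fully separate core would be needed to isolate the large values on disk.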



Kind regards,
J. Barth







Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Ok, https://wiki.apache.org/solr/SolrPerformanceFactors

states that: "Retrieving the stored fields of a query result can be a
significant expense. This cost is affected largely by the number of
bytes stored per document--the higher byte count, the sparser the
documents will be distributed on disk and more I/O is necessary to
retrieve the fields (usually this is a concern when storing large
fields, like the entire contents of a document)."


But in my case (with docValues=true) there should be no reason to  
access *.fdt.


Kind regards,
Jochen
