Too much data in highlighting results

Thorsten Heit Thu, 15 May 2025 14:36:41 -0700

Hi,

we use a standalone Solr server instance (version 9.8.1) for indexing documents. The core we use was created from the sample techproducts config. When a (preprocessed) document is uploaded, the following fields are used for storing its data:


- id (string)
- content (text_general)
  (contains the extracted text from the document)
- resourcename (text_general)
  (the file name of the original document)
- index_date (pdate)
  (the timestamp when the document is to be indexed in Solr)

A simple search using a core with some test data

%> curl 'https://solrserver:8983/solr/mycore/query?q=test&q.op=AND&indent=true&start=0&rows=100&fl=id,resourcename&sort=id%20asc'


currently returns 39 results.

When adding highlighting, I'm a bit puzzled by the amount of (additional?) data I'm receiving:

%> curl 'https://solrserver:8983/solr/mycore/query?q=test&q.op=AND&indent=true&start=0&rows=100&fl=id,resourcename&sort=id%20asc&hl=true&hl.fl=content'


now returns
{
  "responseHeader":{
    "status":0,
    "QTime":2858,
    "params":{
      "q":"test",
      "hl":"true",
      "indent":"true",
      "fl":"id,resourcename",
      "start":"0",
      "q.op":"AND",
      "sort":"id asc",
      "hl.fl":"content",
      "rows":"100"
    }
  },
  "response":{
    "numFound":39,
    "start":0,
    "numFoundExact":true,
    "docs":[{
      "resourcename":"Dokument.txt",
      "id":"202309050002"
    },{
      "resourcename":"2023000319__InvestTagListe_2023_09_264__14_27",
      "id":"202309210004"
    },{
      "resourcename":"test.txt",
      "id":"202309210006"
    },
    (...)
    {
      "resourcename":"result.csv",
      "id":"202407240005"
    },
    (...)
    ]
  },
  "highlighting":{
    "202309050002":{
      "content":["<em>test</em>: als wort\r\ngaming\n"]
    },
    "202309210004":{
      "content":["Tag Liste (...)"]
    },
    "202309210006":{
      "content":["<em>test</em>\n"]
    },
    (...)
    "202407240005":{
      "content":["\"RequestId\" (...) \"\";\"\";\"\";\"\"\n\n"]
    },
    (...)
  }
}

The first and the third highlighting results seem correct; their content entries match the original documents (apart from the <em> tags).

The second highlighting result is a 1133-character string. The first highlighted part begins at #485 and the second (and last) at #548, so the two are fairly close together, but they are surrounded by a lot of (IMHO) irrelevant data. Apart from that, the original content string is 3670 characters long and contains the search text at two more locations (#2350 and #2625) that are not highlighted here.

The highlight result with id 202407240005 is almost 8k long and contains four highlighted parts at #3385, #4623, #5860 and #7098, i.e. 3k of irrelevant data at the beginning, more than 1k between the highlighted parts and about 1k at the end.
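
For reference, lengths and offsets like the ones quoted above can be measured with a small script along these lines. This is only a sketch: highlight_offsets is a hypothetical helper name, and it assumes that offsets are counted in the raw returned string, <em> tags included. The sample fragment is the first highlighting result from the response above.

```python
import re


def highlight_offsets(fragment: str, tag: str = "em") -> tuple[int, list[int]]:
    """Return the fragment length and the start offset of each opening tag.

    Assumption: offsets are counted in the raw string as returned by Solr,
    i.e. the highlighting tags themselves are part of the count.
    """
    opening = f"<{tag}>"
    offsets = [m.start() for m in re.finditer(re.escape(opening), fragment)]
    return len(fragment), offsets


# First highlighting result from the response (id 202309050002).
fragment = "<em>test</em>: als wort\r\ngaming\n"
length, offsets = highlight_offsets(fragment)
print(f"length={length}, highlighted parts start at {offsets}")
```

Running the same helper over every entry in the "highlighting" section makes it easy to spot fragments that are much longer than the text around their highlighted parts.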

When I use the original highlighter (&hl.method=original) instead of the default unified one, the results seem more "usable", i.e. they are much shorter (<=142 chars), but in all cases except one they contain only a single highlighted part, even when the document has more matches...

Perhaps I'm misunderstanding something here, but could someone please explain this behaviour?



Regards

Thorsten

