Re: Some highlighted snippets aren't being returned

2013-09-12 Thread Eric O'Hanlon
maxAnalyzedChars did it!  I wasn't setting that param, and I'm working with 
some very long documents.  I also made the hl.fl param formatting change that 
you suggested, Aloke.

Thanks again!

- Eric


Re: Some highlighted snippets aren't being returned

2013-09-11 Thread Eric O'Hanlon
Thank you, Aloke and Bryan!  I'll give this a try and I'll report back on what 
happens!

- Eric


Re: Some highlighted snippets aren't being returned

2013-09-09 Thread Aloke Ghoshal
Hi Eric,

As Bryan suggests, you should look at appropriately setting up the
hl.fragsize and hl.maxAnalyzedChars for long documents.

One issue I find with your search request is that in trying to
highlight across three separate fields, you have added each of them as
a separate request param:
hl.fl=contents&hl.fl=title&hl.fl=original_url

The way to do it would be
(http://wiki.apache.org/solr/HighlightingParameters#hl.fl) to pass
them as values to one comma (or space) separated field:
hl.fl=contents,title,original_url
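
As a rough illustration of the single-parameter form above, here is a minimal Python sketch that builds such a request URL. The host, port, and the `build_highlight_query` helper are assumptions made for the example (only the `/solr-4.2` webapp path and `/select` handler appear in the thread), and the parameter set is trimmed to the highlighting-related ones.

```python
from urllib.parse import urlencode

# Hypothetical host and port; the webapp path and handler come from the
# Tomcat log quoted in this thread.
SOLR_SELECT = "http://localhost:8080/solr-4.2/select"


def build_highlight_query(term, fields):
    """Build a select URL that passes all highlight fields in one hl.fl value."""
    params = {
        "q": term,
        "defType": "edismax",
        "hl": "true",
        # One comma-separated value instead of repeating hl.fl per field.
        "hl.fl": ",".join(fields),
        "hl.fragsize": 600,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_highlight_query("Unangan", ["contents", "title", "original_url"])
print(url)
```

Note that `urlencode` percent-encodes the commas, so the field list travels as `contents%2Ctitle%2Coriginal_url` and is decoded again on the server side.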

Regards,
Aloke




Re: Some highlighted snippets aren't being returned

2013-09-08 Thread Eric O'Hanlon
Hi again Everyone,

I didn't get any replies to this, so I thought I'd re-send in case anyone 
missed it and has any thoughts.

Thanks,
Eric

On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon <elo2...@columbia.edu> wrote:

 Hi Everyone,
 
 I'm facing an issue in which my solr query is returning highlighted snippets 
 for some, but not all results.  For reference, I'm searching through an index 
 that contains web crawls of human-rights-related websites.  I'm running solr 
 as a webapp under Tomcat and I've included the query's solr params from the 
 Tomcat log:
 
 ...
 webapp=/solr-4.2
 path=/select
 params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=<code>&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=</code>&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true}
  hits=8 status=0 QTime=108
 ...
 
 For the query above (which can be simplified to say: find all documents that 
 contain the word "unangan" and return facets, highlights, etc.), I get five 
 search results.  Only three of these are returning highlighted snippets.  
 Here's the highlighting portion of the solr response (note: printed in ruby 
 notation because I'm receiving this response in a Rails app):
 
 
 "highlighting"=>
  {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
    {},
   "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
    {},
   "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
    {},
   "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
    {"contents"=>
      [...actual snippet is returned here...]},
   "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
    {"contents"=>
      [...actual snippet is returned here...]},
   "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
    {"contents"=>
      [...actual snippet is returned here...]},
   "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
    {"contents"=>
      [...actual snippet is returned here...]},
   "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>
    {}}
 
 
 I have eight (as opposed to five) results above because I'm also doing a 
 grouped query, grouping by a field called original_url, and this leads to 
 five grouped results.
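
The relationship between the eight hits and the five groups can be checked with a short Python sketch. It assumes that each document id has the form `<capture timestamp>/<original_url>`, which is inferred from the ids in the response rather than stated anywhere in the thread; the `group_by_original_url` helper is hypothetical, and each id is mapped to a simple flag recording whether a snippet was returned.

```python
from collections import defaultdict

# Document ids copied from the highlighting section of the response.
# True means a snippet came back for that id, False means an empty {}.
highlighting = {
    "20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf": False,
    "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf": False,
    "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf": False,
    "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf": True,
    "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf": True,
    "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999": True,
    "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw": True,
    "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf": False,
}


def group_by_original_url(results):
    """Group snippet flags by the URL part of each id (assumed id layout)."""
    groups = defaultdict(list)
    for doc_id, has_snippet in results.items():
        # Split off the leading capture timestamp; the rest is the URL.
        _timestamp, original_url = doc_id.split("/", 1)
        groups[original_url].append(has_snippet)
    return groups


groups = group_by_original_url(highlighting)
groups_with_snippets = sum(1 for flags in groups.values() if any(flags))
print(len(highlighting), len(groups), groups_with_snippets)
```

Under that assumption the counts come out as 8 documents, 5 groups, and 3 groups with at least one snippet, which matches the numbers reported in the message.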
 
 I've confirmed that my highlight-lacking results DO contain the word 
 "unangan", as expected, and this term is appearing in a text field that's 
 indexed and stored, and being searched for all text searches.  For example, 
 one of the search results is for a crawl of this document: 
 http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf
 
 And if you view that document on the web, you'll see that it does contain 
 "unangan".
 
 Has anyone seen this before?  And does anyone have any good suggestions for 
 troubleshooting/fixing the problem?
 
 Thanks!
 
 - Eric



Re: Some highlighted snippets aren't being returned

2013-09-08 Thread Bill Bell
Zip up all your configs 

Bill Bell
Sent from mobile




RE: Some highlighted snippets aren't being returned

2013-09-08 Thread Bryan Loofbourrow
Eric,

Your example document is quite long. Are you setting hl.maxAnalyzedChars?
If you don't, the highlighter you appear to be using will not look past
the first 51,200 characters of the document for snippet candidates.

http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
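
To make the effect of that limit concrete, here is a small Python sketch. It is only an approximation: a plain substring search over the stored text stands in for the highlighter's real analysis, and the helper names and the value 1000000 are assumptions chosen for the example, not recommendations from the thread.

```python
from urllib.parse import urlencode

# Default window cited above: characters examined for snippet candidates.
DEFAULT_MAX_ANALYZED_CHARS = 51200


def first_match_outside_window(text, term, window=DEFAULT_MAX_ANALYZED_CHARS):
    """Return True if the term occurs, but not before `window` characters.

    Crude stand-in for the highlighter: a case-insensitive substring search.
    """
    position = text.lower().find(term.lower())
    # find() returns -1 when the term is absent, which is never >= window.
    return position >= window


def highlight_params(term, max_analyzed_chars):
    """Query-string fragment that raises the analysis limit (hypothetical helper)."""
    return urlencode({
        "q": term,
        "hl": "true",
        "hl.fl": "contents",
        "hl.maxAnalyzedChars": max_analyzed_chars,
    })


# A long document whose only match sits past the default window.
long_doc = "x" * 60000 + " unangan"
print(first_match_outside_window(long_doc, "Unangan"))
print(highlight_params("Unangan", 1000000))
```

A document for which `first_match_outside_window` returns True is one where a match exists but would fall outside the default window, so an empty highlighting entry is the expected result until the limit is raised.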

-- Bryan

