RE: [EXT] Re: Faceting with a multi valued field

2018-09-27 Thread Hanjan, Harinder
I control everything except the data that's being indexed. So I can manipulate 
the Solr query as needed.

I tried the facet.prefix option and initial testing shows promise. 
q=*:*&facet=on&facet.field=Communities&facet.prefix=BANFF+TRAIL+-+BNF

Thanks much! 


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Tuesday, September 25, 2018 3:14 PM
To: solr-user
Subject: [EXT] Re: Faceting with a multi valued field

What specifically do you control? Just keyword (and "Communities:"
part is locked?) or anything after q= or anything that allows multiple 
variables?

Because if you could isolate search value, you could use for example 
facet.prefix, set in solrconfig as a default parameter and populated from the 
same variable as the Communities search.
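For example, a handler could carry the faceting defaults in solrconfig.xml, with facet.prefix filled in per request from the same variable as the Communities filter (a sketch; the handler name and values are illustrative):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">on</str>
    <str name="facet.field">Communities</str>
    <str name="facet.mincount">1</str>
    <!-- substituted per request, e.g. facet.prefix=BANFF TRAIL - BNF -->
  </lst>
</requestHandler>
```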

You may also want to set facet.mincount=1 in solrconfig.xml to avoid 0-value 
facets in general:
https://lucene.apache.org/solr/guide/7_4/faceting.html

Regards,
   Alex.


On 25 September 2018 at 16:50, John Blythe  wrote:
> you can update your filter query to be a facet query; this will apply
> the query to the resulting facet set instead of the Communities field itself.
>
> --
> John Blythe
>
>
> On Tue, Sep 25, 2018 at 4:15 PM Hanjan, Harinder 
> 
> wrote:
>
>> Hello!
>>
>> I am doing faceting on a field which has multiple values and it's
>> yielding expected but undesirable results. I need different
>> behaviour but am not sure how to formulate a query for it. Here is my
>> current setup.
>>
>> = Data Set =
>>   {
>> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"], "Document 
>> Type":"Engagement - What We Heard Report", "Navigation":"Livelink", 
>> "SolrId":"http://thesimpsons.com/one"
>>   }
>>   {
>> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"], "Document 
>> Type":"Engagement - What We Heard Report", "Navigation":"Livelink", 
>> "Id":"http://thesimpsons.com/two"
>>   }
>>   {
>> "Communities":["SUNALTA - SNA"],
>> "Document Type":"Engagement - What We Heard Report", 
>> "Navigation":"Livelink", 
>> "Id":"http://thesimpsons.com/three"
>>   }
>>
>> = Query I run now =
>>
>> http://localhost:8984/solr/everything/select?q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF TRAIL - BNF"
>>
>>
>> = Results I get now =
>> {
>>   ...
>>   "facet_counts":{
>> "facet_queries":{},
>> "facet_fields":{
>>   "Communities":[
>> "BANFF TRAIL - BNF",2,
>> "PARKDALE - PKD",2,
>> "SUNALTA - SNA",0]},
>>...
>>
>> Notice that the Communities facet has two non-zero results. I
>> understand this is because I'm using fq to get only documents which
>> contain BANFF TRAIL, but those documents also contain PARKDALE.
>>
>> Now, I am using facets to drive navigation on my page. The business 
>> case is that user can select a community to get documents pertaining 
>> to that specific community only. This works with the query I have 
>> above. However, the facets results also contain other communities 
>> which then get displayed to the user. For example, with the query 
>> above, user will see both BANFF TRAIL and

RE: [EXT] Re: Faceting with a multi valued field

2018-09-27 Thread Hanjan, Harinder
John,

I just want to make sure I understand correctly. Replace fq with facet.query?

So then the resultant query goes from:
q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF TRAIL - BNF"

to:
q=*:*&facet=on&facet.field=Communities&facet.query="BANFF TRAIL - BNF"


If that's correct, then this does not resolve the issue. I still get 2 values
under the Communities facet.

Harinder

-Original Message-
From: John Blythe [mailto:johnbly...@gmail.com] 
Sent: Tuesday, September 25, 2018 2:50 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Re: Faceting with a multi valued field

you can update your filter query to be a facet query; this will apply the query
to the resulting facet set instead of the Communities field itself.

--
John Blythe


On Tue, Sep 25, 2018 at 4:15 PM Hanjan, Harinder 
wrote:

> Hello!
>
> I am doing faceting on a field which has multiple values and it's
> yielding expected but undesirable results. I need different behaviour
> but am not sure how to formulate a query for it. Here is my current setup.
>
> = Data Set =
>   {
> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"], "Document 
> Type":"Engagement - What We Heard Report", "Navigation":"Livelink", 
> "SolrId":"http://thesimpsons.com/one"
>   }
>   {
> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"], "Document 
> Type":"Engagement - What We Heard Report", "Navigation":"Livelink", 
> "Id":"http://thesimpsons.com/two"
>   }
>   {
> "Communities":["SUNALTA - SNA"],
> "Document Type":"Engagement - What We Heard Report", 
> "Navigation":"Livelink", 
> "Id":"http://thesimpsons.com/three"
>   }
>
> = Query I run now =
>
> http://localhost:8984/solr/everything/select?q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF TRAIL - BNF"
>
>
> = Results I get now =
> {
>   ...
>   "facet_counts":{
> "facet_queries":{},
> "facet_fields":{
>   "Communities":[
> "BANFF TRAIL - BNF",2,
> "PARKDALE - PKD",2,
> "SUNALTA - SNA",0]},
>...
>
> Notice that the Communities facet has two non-zero results. I understand
> this is because I'm using fq to get only documents which contain BANFF
> TRAIL, but those documents also contain PARKDALE.
>
> Now, I am using facets to drive navigation on my page. The business 
> case is that user can select a community to get documents pertaining 
> to that specific community only. This works with the query I have 
> above. However, the facets results also contain other communities 
> which then get displayed to the user. For example, with the query 
> above, user will see both BANFF TRAIL and PARKDALE as selected values 
> even though the user only selected BANFF TRAIL. It's worth noting
> that I have no control over the data being sent to Solr and can't change it.
>
> How can I formulate a query to ensure that when user selects BANFF 
> TRAIL, only BANFF TRAIL is returned under Solr facets?
>
> Thanks!
> Harinder
>
> 
> NOTICE -
> This communication is intended ONLY for the use of the person or 
> entity named above and may contain information that is confidential or 
> legally privileged. If you are not the intended recipient named above 
> or a person responsible for delivering messages or communications to 
> the intended recipient, YOU ARE HEREBY NOTIFIED that any use, 
> distribution, or copying of this communication or any of the 
> information contained in it is strictly prohibited. If you have 
> received this communication in error, please notify us immediately by 
> telephone and then destroy or delete this communication, or return it 
> to us by mail if requested by us. The City of Calgary thanks you for your 
> attention and co-operation.
>


Faceting with a multi valued field

2018-09-25 Thread Hanjan, Harinder
Hello!

I am doing faceting on a field which has multiple values and it's yielding
expected but undesirable results. I need different behaviour but am not sure
how to formulate a query for it. Here is my current setup.

= Data Set =
  {
"Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
"Document Type":"Engagement - What We Heard Report",
"Navigation":"Livelink",
"SolrId":"http://thesimpsons.com/one"
  }
  {
"Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
"Document Type":"Engagement - What We Heard Report",
"Navigation":"Livelink",
"Id":"http://thesimpsons.com/two"
  }
  {
"Communities":["SUNALTA - SNA"],
"Document Type":"Engagement - What We Heard Report",
"Navigation":"Livelink",
"Id":"http://thesimpsons.com/three"
  }

= Query I run now =
http://localhost:8984/solr/everything/select?q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF TRAIL - BNF"


= Results I get now =
{
  ...
  "facet_counts":{
"facet_queries":{},
"facet_fields":{
  "Communities":[
"BANFF TRAIL - BNF",2,
"PARKDALE - PKD",2,
"SUNALTA - SNA",0]},
   ...

Notice that the Communities facet has two non-zero results. I understand this
is because I'm using fq to get only documents which contain BANFF TRAIL, but
those documents also contain PARKDALE.

Now, I am using facets to drive navigation on my page. The business case is 
that user can select a community to get documents pertaining to that specific 
community only. This works with the query I have above. However, the facets 
results also contain other communities which then get displayed to the user. 
For example, with the query above, user will see both BANFF TRAIL and PARKDALE 
as selected values even though the user only selected BANFF TRAIL. It's worth
noting that I have no control over the data being sent to Solr and can't
change it.

How can I formulate a query to ensure that when user selects BANFF TRAIL, only 
BANFF TRAIL is returned under Solr facets?

Thanks!
Harinder




RE: [EXT] Re: field was indexed without position data; cannot run SpanTermQuery

2018-08-22 Thread Hanjan, Harinder
Perhaps my memory fails me; I remember setting it up as a text field and it not
working. In any case, I have set it up as follows and things are working well.
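For reference, a text-based field with position data kept (the default for TextField) looks roughly like this; the field and type names are illustrative:

```xml
<field name="suggestion" type="text_general" indexed="true" stored="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```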


  


  


Thanks Erick!

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, August 22, 2018 11:16 AM
To: solr-user 
Subject: [EXT] Re: field was indexed without position data; cannot run 
SpanTermQuery

StrFields are, by definition, unanalyzed. They cannot have position information 
because they are always one token. Searching for a phrase on a single token 
makes no sense. The error message is a little odd, but accurate.

You'll have to re-index as a text-based field if you want this kind of
functionality, and position information must be enabled on that field or you'll
get the same error (see omitTermFreqAndPositions, omitPositions, etc. here:
https://lucene.apache.org/solr/guide/6_6/defining-fields.html)

Best,
Erick

On Wed, Aug 22, 2018 at 9:58 AM, Hanjan, Harinder  
wrote:
> Hello!
>
> I am doing wildcard queries to satisfy our search type-ahead requirement for
> both single-word and multi-word (phrase) queries.
> I just noticed this error in the logs.
>
> 2018-08-22 16:36:48.433 INFO  (qtp1654589030-18) [   x:suggestions] 
> o.a.s.c.S.Request [suggestions]  webapp=/solr path=/select 
> params={q={!complexphrase}"traffic+c*"} status=500 QTime=7
> 2018-08-22 16:36:48.433 ERROR (qtp1654589030-18) [   x:suggestions] 
> o.a.s.s.HttpSolrCall null:java.lang.IllegalStateException: field "suggestion" 
> was indexed without position data; cannot run SpanTermQuery (term=traffic)
>
> The error is only thrown when I search for phrases but works fine for single 
> word queries.
> This causes the error: q={!complexphrase}"traffic c*"
> This runs fine: q={!complexphrase}"traffic*"
>
> <field name="suggestion" type="string" indexed="true" stored="true" required="true" />
> <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
>
> How can I troubleshoot this? The field 'suggestion' is stored as a StrField;
> could that be the reason?
>
>
> Full stack trace:
> 2018-08-22 16:55:49.164 INFO  (qtp1654589030-17) [   x:suggestions] 
> o.a.s.c.S.Request [suggestions]  webapp=/solr path=/select 
> params={q={!complexphrase}"traffic+c*"} status=500 QTime=6
> 2018-08-22 16:55:49.164 ERROR (qtp1654589030-17) [   x:suggestions] 
> o.a.s.s.HttpSolrCall null:java.lang.IllegalStateException: field "suggestion" 
> was indexed without position data; cannot run SpanTermQuery (term=traffic)
> at 
> org.apache.lucene.search.spans.SpanTermQuery$SpanTermWeight.getSpans(S
> panTermQuery.java:119) at 
> org.apache.lucene.search.spans.SpanNearQuery$SpanNearWeight.getSpans(S
> panNearQuery.java:214) at 
> org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:121)
> at 
> org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:38)
> at org.apache.lucene.search.Weight.bulkScorer(Weight.java:147)
> at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:657)
> at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:462)
> at 
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(Sol
> rIndexSearcher.java:215) at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearche
> r.java:1600) at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher
> .java:1417) at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java
> :584) at 
> org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSea
> rch(QueryComponent.java:1435) at 
> org.apache.solr.handler.component.QueryComponent.process(QueryComponen
> t.java:375) at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(Sear
> chHandler.java:295) at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandle
> rBase.java:177) at 
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> .java:382) at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> .java:326) at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletH
> andler.java:1751) at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:
> 582) at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.ja
> va:143) at 
> org.eclipse.jetty.secu

field was indexed without position data; cannot run SpanTermQuery

2018-08-22 Thread Hanjan, Harinder
Hello!

I am doing wildcard queries to satisfy our search type-ahead requirement for
both single-word and multi-word (phrase) queries.
I just noticed this error in the logs.

2018-08-22 16:36:48.433 INFO  (qtp1654589030-18) [   x:suggestions] 
o.a.s.c.S.Request [suggestions]  webapp=/solr path=/select 
params={q={!complexphrase}"traffic+c*"} status=500 QTime=7
2018-08-22 16:36:48.433 ERROR (qtp1654589030-18) [   x:suggestions] 
o.a.s.s.HttpSolrCall null:java.lang.IllegalStateException: field "suggestion" 
was indexed without position data; cannot run SpanTermQuery (term=traffic)

The error is only thrown when I search for phrases but works fine for single 
word queries.
This causes the error: q={!complexphrase}"traffic c*"
This runs fine: q={!complexphrase}"traffic*"

<field name="suggestion" type="string" indexed="true" stored="true" required="true" />
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
How can I troubleshoot this? The field 'suggestion' is stored as a StrField;
could that be the reason?


Full stack trace:
2018-08-22 16:55:49.164 INFO  (qtp1654589030-17) [   x:suggestions] 
o.a.s.c.S.Request [suggestions]  webapp=/solr path=/select 
params={q={!complexphrase}"traffic+c*"} status=500 QTime=6
2018-08-22 16:55:49.164 ERROR (qtp1654589030-17) [   x:suggestions] 
o.a.s.s.HttpSolrCall null:java.lang.IllegalStateException: field "suggestion" 
was indexed without position data; cannot run SpanTermQuery (term=traffic)
at 
org.apache.lucene.search.spans.SpanTermQuery$SpanTermWeight.getSpans(SpanTermQuery.java:119)
at 
org.apache.lucene.search.spans.SpanNearQuery$SpanNearWeight.getSpans(SpanNearQuery.java:214)
at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:121)
at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:38)
at org.apache.lucene.search.Weight.bulkScorer(Weight.java:147)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:657)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:462)
at 
org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:215)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1600)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1417)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
at 
org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 

RE: Type ahead functionality using complex phrase query parser

2018-08-15 Thread Hanjan, Harinder
Keeping the field as string so that no analysis is done on it has yielded 
promising results.  
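Concretely, that amounts to something like the following (a reconstructed shape; exact attributes may differ from the original config):

```xml
<field name="suggestion" type="string" indexed="true" stored="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
```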



I will test more tomorrow and report back.

-Original Message-
From: Hanjan, Harinder [mailto:harinder.han...@calgary.ca] 
Sent: Wednesday, August 15, 2018 5:01 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Type ahead functionality using complex phrase query parser

Hello!

I can't get Solr to give the results I would expect; I would appreciate it if
someone could point me in the right direction here.

/select?q={!complexphrase}"gar*"
shows me the following terms

-garages

-garburator

-gardening

-gardens

-garage

-garden

-garbage

-century gardens

-community gardens

I was not expecting to see the bottom two.

--- schema.xml ---
  
  
  
   


--- query ---
/select?q={!complexphrase}"gar*"

--- solrconfig.xml ---

   
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">suggestion</str>
  </lst>
</requestHandler>
   


Thanks!
Harinder




Type ahead functionality using complex phrase query parser

2018-08-15 Thread Hanjan, Harinder
Hello!

I can't get Solr to give the results I would expect; I would appreciate it if
someone could point me in the right direction here.

/select?q={!complexphrase}"gar*"
shows me the following terms

-garages

-garburator

-gardening

-gardens

-garage

-garden

-garbage

-century gardens

-community gardens

I was not expecting to see the bottom two.

--- schema.xml ---



  
  
   


--- query ---
/select?q={!complexphrase}"gar*"

--- solrconfig.xml ---

   
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">suggestion</str>
  </lst>
</requestHandler>
   


Thanks!
Harinder




RE: [EXT] Re: Extracting top level URL when indexing document

2018-06-13 Thread Hanjan, Harinder
Thank you Alex.  I have managed to get this to work via
URLClassifyProcessorFactory. If anyone is interested, it can be easily done
with the following solrconfig.xml:
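A sketch of the chain is below. The parameter names are best-effort reconstructions (the visible values in the original config were enabled=true, urlFieldName=SolrId, and an output field named hostname); the factory's documentation is sparse, so verify the exact names against its source:

```xml
<updateRequestProcessorChain name="urlProcessor">
  <processor class="solr.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="urlFieldName">SolrId</str>
    <!-- output field for the extracted host; the exact parameter
         name should be checked in the factory source -->
    <str name="canonicalUrlFieldName">hostname</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">urlProcessor</str>
  </lst>
</requestHandler>
```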



  true
  SolrId
  hostname
  





 urlProcessor
   
  

I will look at how to submit a patch to the Java doc.

Thanks!
Harinder

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Wednesday, June 13, 2018 12:13 AM
To: solr-user 
Subject: [EXT] Re: Extracting top level URL when indexing document

Try URLClassifyProcessorFactory in the processing chain instead, configured in
solrconfig.xml.

There is very little documentation for it, so check the source for exact 
params. Or search for the blog post introducing it several years ago.

Documentation patches would be welcome.

Regards,
Alex

On Wed, Jun 13, 2018, 01:02 Hanjan, Harinder, 
wrote:

> Hello!
>
> I am indexing web documents and have a need to extract their top-level
> URL to be stored in a different field. I have had some success with
> the PatternTokenizerFactory (relevant schema bits at the bottom) but
> the behavior appears to be inconsistent. Most of the time, the top-level
> URL is extracted just fine, but for some documents it is being cut off.
>
> Examples:
> URL
>
> Extracted URL
>
> Comment
>
> http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf
>
> http://www.calgaryarb.ca
>
> Success
>
> http://www.calgarymlc.ca/about-cmlc/
>
> http://www.calgarymlc.ca
>
> Success
>
> http://www.calgarypolicecommission.ca/reports.php
>
> http://www.calgarypolicecommissio
>
> Fail
>
> https://attainyourhome.com/
>
> https://attai
>
> Fail
>
> https://liveandplay.calgary.ca/DROPIN/page/dropin
>
> https://livea
>
> Fail
>
>
>
>
> Relevant schema:
> 
>
>  multiValued="false"/>
>
>  sortMissingLast="true">
> 
> 
> class="solr.PatternTokenizerFactory"
>
> pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
> group="0"/>
> 
> 
>
>
> I have tested the regex and it matches fine. Please see
> https://regex101.com/r/wN6cZ7/358.
> So it appears that I have a gap in my understanding of how Solr's
> PatternTokenizerFactory works. I would appreciate any insight on the issue.
> The hostname field will be used in facet queries.
>
> Thank you!
> Harinder
>
> 

Extracting top level URL when indexing document

2018-06-12 Thread Hanjan, Harinder
Hello!

I am indexing web documents and have a need to extract their top-level URL to
be stored in a different field. I have had some success with the
PatternTokenizerFactory (relevant schema bits at the bottom) but the behavior
appears to be inconsistent. Most of the time, the top-level URL is extracted
just fine, but for some documents it is being cut off.

Examples:
URL

Extracted URL

Comment

http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf

http://www.calgaryarb.ca

Success

http://www.calgarymlc.ca/about-cmlc/

http://www.calgarymlc.ca

Success

http://www.calgarypolicecommission.ca/reports.php

http://www.calgarypolicecommissio

Fail

https://attainyourhome.com/

https://attai

Fail

https://liveandplay.calgary.ca/DROPIN/page/dropin

https://livea

Fail




Relevant schema:











I have tested the regex and it matches fine. Please see
https://regex101.com/r/wN6cZ7/358.
So it appears that I have a gap in my understanding of how Solr's
PatternTokenizerFactory works. I would appreciate any insight on the issue.
The hostname field will be used in facet queries.
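The failures line up with the letter 'n': each extracted URL is cut off right before the first 'n' in the host (calgarypolicecommissio, attai, livea). In the pattern, the character classes contain a bare n, most likely a \n that lost its backslash, so 'n' is excluded from the match. A quick check outside Solr reproduces it (Python's re standing in for the java.util.regex engine PatternTokenizerFactory uses; the fixed pattern is a suggested correction, not the original):

```python
import re

# Pattern exactly as it appears in the schema. The character classes
# [^@/n] and [^:/n] exclude the literal letter 'n', so matching stops
# just before the first 'n' in the host name.
broken = re.compile(r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)")

# Escaping the newline (and the dot in www\.) restores the intended match.
fixed = re.compile(r"^https?://(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)")

for url in ("https://attainyourhome.com/",
            "https://liveandplay.calgary.ca/DROPIN/page/dropin",
            "http://www.calgarypolicecommission.ca/reports.php"):
    print(broken.match(url).group(0), "->", fixed.match(url).group(0))
# https://attai -> https://attainyourhome.com
# https://livea -> https://liveandplay.calgary.ca
# http://www.calgarypolicecommissio -> http://www.calgarypolicecommission.ca
```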

Thank you!
Harinder




RE: Search Analytics Help

2018-04-26 Thread Hanjan, Harinder
This seems promising

https://github.com/lucidworks/banana


-Original Message-
From: Ennio Bozzetti [mailto:ebozze...@thorlabs.com]
Sent: Thursday, April 26, 2018 1:39 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Search Analytics Help

Hello,

I'm setting up Solr on an internal website for my company and I would like to
know if anyone can recommend an analytics tool so that I can see what users are
searching for. Does the log in Solr give me that information?

Thank you,
Ennio Bozzetti





RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Oh this is great! Saves me a whole bunch of manual work.

Thanks!

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Monday, April 09, 2018 2:15 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML 
document instead of Solr's MostlyPassthroughHtmlMapper ?

As a bonus, here's a Dropwizard Tika wrapper that gives you a Tika web service, 
written by a colleague of mine at Flax: 
https://github.com/mattflax/dropwizard-tika-server
Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder <harinder.han...@calgary.ca>
wrote:

> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
>
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 09, 2018 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from 
> HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> +1
>
>
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
>
>
> We should add a chatbot to the list that includes Charlie's advice and 
> the link to Erick's blog post whenever Tika is used. 
>
>
>
>
>
> -Original Message-
>
> From: Charlie Hull [mailto:char...@flax.co.uk]
>
> Sent: Monday, April 9, 2018 12:44 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML 
> document instead of Solr's MostlyPassthroughHtmlMapper ?
>
>
>
> I'd recommend you run Tika externally to Solr, which will allow you to 
> catch this kind of problem and prevent it bringing down your Solr 
> installation.
>
>
>
> Cheers
>
>
>
> Charlie
>
>
>
> On 9 April 2018 at 16:59, Hanjan, Harinder 
> <harinder.han...@calgary.ca>
>
> wrote:
>
>
>
> > Hello!
>
> >
>
> > Solr (i.e. Tika) throws a "zip bomb" exception with certain 
> > documents
>
> > we have in our Sharepoint system. I have used the tika-app.jar
>
> > directly to extract the document in question and it does _not_ throw
>
> > an exception and extracts the contents just fine. So it would seem 
> > Solr
>
> > is doing something different than a Tika standalone installation.
>
> >
>
> > After some Googling, I found out that Solr uses its custom 
> > HtmlMapper
>
> > (MostlyPassthroughHtmlMapper) which passes through all elements in 
> > the
>
> > HTML document to Tika. As Tika limits nested elements to 100, this
>
> > causes Tika to throw an exception: Suspected zip bomb: 100 levels of
>
> > XML element nesting. This is mentioned in TIKA-2091
> > (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The
>
> > "solution" is to use Tika's default parsing/mapping mechanism but no
>
> > details have been provided on how to configure this at Solr.
>
> >
>
> > I'm hoping some folks here have the knowledge on how to configure 
> > Solr
>
> > to effectively bypass its built-in MostlyPassthroughHtmlMapper and
>
> > use Tika's implementation.
>
> >
>
> > Thank you!
>
> > Harinder
>
> >
>
> >
>
> > 
>
>
> >
>
>


RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Thank you Charlie, Tim.
I will integrate Tika in my Java app and use SolrJ to send data to Solr. 


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, April 09, 2018 11:24 AM
To: solr-user@lucene.apache.org
Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from HTML 
document instead of Solr's MostlyPassthroughHtmlMapper ?

+1



https://lucidworks.com/2012/02/14/indexing-with-solrj/



We should add a chatbot to the list that includes Charlie's advice and the link 
to Erick's blog post whenever Tika is used. 





-Original Message-

From: Charlie Hull [mailto:char...@flax.co.uk] 

Sent: Monday, April 9, 2018 12:44 PM

To: solr-user@lucene.apache.org

Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?



I'd recommend you run Tika externally to Solr, which will allow you to catch 
this kind of problem and prevent it bringing down your Solr installation.



Cheers



Charlie



On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca>

wrote:



> Hello!

>

> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents 

> we have in our Sharepoint system. I have used the tika-app.jar 

> directly to extract the document in question and it does _not_ throw 

> an exception and extract the contents just fine. So it would seem Solr 

> is doing something different than a Tika standalone installation.

>

> After some Googling, I found out that Solr uses its custom HtmlMapper

> (MostlyPassthroughHtmlMapper) which passes through all elements in the 

> HTML document to Tika. As Tika limits nested elements to 100, this 

> causes Tika to throw an exception: Suspected zip bomb: 100 levels of 

> XML element nesting. This is mentioned in TIKA-2091 
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The 

> "solution" is to use Tika's default parsing/mapping mechanism but no 

> details have been provided on how to configure this at Solr.

>

> I'm hoping some folks here have the knowledge on how to configure Solr 

> to effectively bypass its built-in MostlyPassthroughHtmlMapper and 

> use Tika's implementation.

>

> Thank you!

> Harinder

>

>

> 


>



How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Hello!

Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we have 
in our SharePoint system. I have used the tika-app.jar directly to extract the 
document in question, and it does _not_ throw an exception; it extracts the 
contents just fine. So it would seem Solr is doing something different than a 
standalone Tika installation.

After some Googling, I found out that Solr uses its custom HtmlMapper 
(MostlyPassthroughHtmlMapper), which passes through all elements in the HTML 
document to Tika. As Tika limits nested elements to 100, this causes Tika to 
throw an exception: Suspected zip bomb: 100 levels of XML element nesting. This 
is mentioned in TIKA-2091 
(https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
 The "solution" is to use Tika's default parsing/mapping mechanism, but no 
details have been provided on how to configure this in Solr.

I'm hoping some folks here know how to configure Solr to effectively bypass 
its built-in MostlyPassthroughHtmlMapper and use Tika's implementation.
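[One possible direction, sketched here as an untested assumption: Solr Cell's ExtractingRequestHandler accepts a parseContext.config file, through which a different HtmlMapper implementation can be requested. Whether DefaultHtmlMapper can be instantiated this way in your Solr version needs verifying.]

```xml
<!-- solrconfig.xml (sketch, not verified): point Solr Cell at a parse-context file -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="parseContext.config">parseContext.xml</str>
  </lst>
</requestHandler>

<!-- parseContext.xml (in the core's conf/ directory): ask for Tika's default
     HTML mapping instead of Solr's MostlyPassthroughHtmlMapper -->
<entries>
  <entry class="org.apache.tika.parser.html.HtmlMapper"
         impl="org.apache.tika.parser.html.DefaultHtmlMapper"/>
</entries>
```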

Thank you!
Harinder





Getting "zip bomb" exception while sending HTML document to solr

2018-04-05 Thread Hanjan, Harinder
Hello!

I'm sending an HTML document to Solr and Tika is throwing the "Zip bomb 
detected!" exception back. It looks like Tika has an arbitrary limit of 100 
levels of XML element nesting 
(https://github.com/apache/tika/blob/9130bbc1fa6d69419b2ad294917260d6b1cced08/tika-core/src/main/java/org/apache/tika/sax/SecureContentHandler.java#L72-L75).
 Luckily, the variable (maxDepth) does have a public setter, but I am not sure 
whether it's possible to set it from Solr. Is it possible? If so, how would I 
set maxDepth to a higher number?

Thanks!

Here is the full stack trace:
2018-04-05 16:47:48.034 ERROR (qtp1654589030-15) [   x:aconn] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
        at ca.calgary.csc.wds.solr.GsaAconnRequestHandler.handleRequestBody(GsaAconnRequestHandler.java:84)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:534)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
        at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:138)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
        ... 35 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
        at