It looks like the stop words (in, and, on) are what is breaking it. The regex itself looks correct.
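One way to narrow this down outside Solr is to run the pattern from the quoted schema against the example URLs. The sketch below is only a rough check, not a reproduction of Solr's analysis chain. It uses Python's `re` as a stand-in for Java's regex engine, and it assumes the pattern reaches Solr exactly as it appears in the quoted message, with a bare `n` inside the character classes (it is possible that a backslash from `\n` was lost somewhere along the way). The `ESCAPED` variant is a hypothetical alternative for comparison, not something taken from the original schema.

```python
import re

# Pattern copied verbatim from the quoted schema. ASSUMPTION: the bare "n"
# in [^@/n] and [^:/n] is what Solr actually receives.
AS_QUOTED = re.compile(r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)")

# Hypothetical variant (not from the original message): "\n" and "\." escaped.
ESCAPED = re.compile(r"^https?://(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)")

URLS = [
    "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
    "http://www.calgarymlc.ca/about-cmlc/",
    "http://www.calgarypolicecommission.ca/reports.php",
    "https://attainyourhome.com/",
    "https://liveandplay.calgary.ca/DROPIN/page/dropin",
]


def extract(url, pattern=AS_QUOTED):
    """Return the whole match (group 0, as group="0" selects in the schema)."""
    match = pattern.search(url)
    return match.group(0) if match else None


for url in URLS:
    print(f"{url}\n  as quoted: {extract(url)}\n  escaped:   {extract(url, ESCAPED)}")
```

Under that assumption, the as-quoted pattern stops just before the first `n` in the hostname, which reproduces the three failing values from the quoted table. That would also explain why the cuts seem to line up with "in", "and" and "on": each of them contains an `n`.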
Kevin Risden

On Tue, Jun 12, 2018, 18:02 Hanjan, Harinder <harinder.han...@calgary.ca> wrote:

> Hello!
>
> I am indexing web documents and have a need to extract their top-level URL
> to be stored in a different field. I have had some success with the
> PatternTokenizerFactory (relevant schema bits at the bottom) but the
> behavior appears to be inconsistent. Most of the times, the top level URL
> is extracted just fine but for some documents, it is being cut off.
>
> Examples:
>
> URL                                                Extracted URL                      Comment
> http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf  http://www.calgaryarb.ca           Success
> http://www.calgarymlc.ca/about-cmlc/               http://www.calgarymlc.ca           Success
> http://www.calgarypolicecommission.ca/reports.php  http://www.calgarypolicecommissio  Fail
> https://attainyourhome.com/                        https://attai                      Fail
> https://liveandplay.calgary.ca/DROPIN/page/dropin  https://livea                      Fail
>
> Relevant schema:
>
> <copyField dest="hostname" source="SolrId"/>
>
> <field name="hostname" type="hostnameType" stored="true" indexed="false"
>        multiValued="false"/>
>
> <fieldType name="hostnameType" class="solr.TextField" sortMissingLast="true">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory"
>                pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
>                group="0"/>
>   </analyzer>
> </fieldType>
>
> I have tested the Regex and it is matching things fine. Please see
> https://regex101.com/r/wN6cZ7/358.
> So it appears that I have a gap in my understanding of how Solr
> PatternTokenizerFactory works. I would appreciate any insight on the issue.
> hostname field will be used in facet queries.
>
> Thank you!
> Harinder