It looks like the stop words (in, and, on) are what is breaking it. The regex
itself looks correct.
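
One way to test that outside Solr is to run the pattern, exactly as it
appears in the quoted schema, against the sample URLs. The sketch below uses
Python's re module rather than Solr, on the assumption that it treats these
particular constructs (anchor, character classes, non-capturing groups) the
same way as the java.util.regex engine behind PatternTokenizerFactory. The
names POSTED, ESCAPED and extract are made up for illustration. As posted,
the classes [^@/n] and [^:/n] exclude a literal n, so the match stops before
the first n in the host name, which lines up with the three failing values.
If the real schema contains \n (the backslash may simply have been lost in
the mail), this would not apply.

```python
import re

# Pattern exactly as it appears in the quoted schema (note "/n", not "/\n").
POSTED = r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"

# Hypothetical variant: assumes "\n" was intended, and also escapes the dot
# in "www." so it only matches a literal dot.
ESCAPED = r"^https?://(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)"

# Sample URLs taken from the original message.
URLS = [
    "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
    "http://www.calgarymlc.ca/about-cmlc/",
    "http://www.calgarypolicecommission.ca/reports.php",
    "https://attainyourhome.com/",
    "https://liveandplay.calgary.ca/DROPIN/page/dropin",
]


def extract(pattern, url):
    """Return the whole match (group 0, as in the schema), or None."""
    m = re.search(pattern, url)
    return m.group(0) if m else None


if __name__ == "__main__":
    for url in URLS:
        print(url)
        print("  posted :", extract(POSTED, url))
        print("  escaped:", extract(ESCAPED, url))
```

If the escaped variant gives the expected host names here, the same change
in the schema would be the first thing to try.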

Kevin Risden

On Tue, Jun 12, 2018, 18:02 Hanjan, Harinder <harinder.han...@calgary.ca>
wrote:

> Hello!
>
> I am indexing web documents and need to extract their top-level URL
> to be stored in a different field. I have had some success with the
> PatternTokenizerFactory (relevant schema bits at the bottom) but the
> behavior appears to be inconsistent. Most of the time, the top-level URL
> is extracted just fine, but for some documents it is being cut off.
>
> Examples:
>
> URL                                                | Extracted URL                     | Comment
> ---------------------------------------------------+-----------------------------------+--------
> http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf  | http://www.calgaryarb.ca          | Success
> http://www.calgarymlc.ca/about-cmlc/               | http://www.calgarymlc.ca          | Success
> http://www.calgarypolicecommission.ca/reports.php  | http://www.calgarypolicecommissio | Fail
> https://attainyourhome.com/                        | https://attai                     | Fail
> https://liveandplay.calgary.ca/DROPIN/page/dropin  | https://livea                     | Fail
>
> Relevant schema:
> <copyField dest="hostname" source="SolrId"/>
>
> <field name="hostname" type="hostnameType" stored="true" indexed="false"
> multiValued="false"/>
>
> <fieldType name="hostnameType" class="solr.TextField" sortMissingLast="true">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory"
>                pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
>                group="0"/>
>   </analyzer>
> </fieldType>
>
>
> I have tested the regex and it matches fine. Please see
> https://regex101.com/r/wN6cZ7/358.
> So it appears that I have a gap in my understanding of how Solr's
> PatternTokenizerFactory works. I would appreciate any insight on the issue.
> The hostname field will be used in facet queries.
>
> Thank you!
> Harinder
>
