Re: PatternReplaceFilterFactory problem

2019-01-29 Thread Chris Wareham
Thanks for the help - changing the field type of the destination for the
copy fields to "text_en" solved the problem. I'd foolishly assumed that
the analysis of the source fields was applied first and the resulting
tokens were then passed to the copy field, which doesn't really make
sense now that I think about it!


So the indexing process is:

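+-----------+ +----------------+ +-------------+
|companyName| |  companyName   | | companyName |
|input data |>|text_en analysis|>|    index    |
+-----------+ +----------------+ +-------------+
  |
  |   +----------------+ +-----+
  +-->|      text      |>|text |
      |text_en analysis| |index|
      +----------------+ +-----+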

Rather than:

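+-----------+ +----------------+   +-------------+
|companyName| |  companyName   |   | companyName |
|input data |>|text_en analysis|-->|    index    |
+-----------+ +----------------+   +-------------+
                                 |
                      +---------------------+ +-----+
                      |        text         |>|text |
                      |text_general analysis| |index|
                      +---------------------+ +-----+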
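
For anyone finding this thread later, here is a rough Python sketch of the
difference between the two diagrams. It is only a toy model, not Solr code:
the two analyzer functions, the whitespace tokenizing and index_document()
are all assumptions made up for illustration. The one real point it shows
is that the copy field receives the raw input value and runs it through the
destination field's own analysis.

```python
import re

# Toy stand-ins for the two field types (assumptions, not the real Solr chains).
def text_general(value):
    """Lowercase and split on whitespace; leaves '.com' in place."""
    return value.lower().split()

def text_en(value):
    """Same as text_general, then apply the pattern replace from this thread."""
    return [re.sub(r"([-a-z])\.com", r"\1", tok) for tok in text_general(value)]

def index_document(doc, field_types, copy_fields):
    """Return {field: tokens}. The copy step moves the RAW value, not tokens."""
    raw = dict(doc)
    for source, dest in copy_fields:
        if source in doc:
            raw[dest] = (raw.get(dest, "") + " " + doc[source]).strip()
    # Each field, including the copy destination, is analysed by its own type.
    return {field: field_types[field](value) for field, value in raw.items()}

doc = {"companyName": "MyDomain.com"}
copy_fields = [("companyName", "text")]

# Destination typed as text_general: the '.com' survives in the "text" index.
before = index_document(doc, {"companyName": text_en, "text": text_general}, copy_fields)
# Destination typed as text_en: the '.com' is stripped there as well.
after = index_document(doc, {"companyName": text_en, "text": text_en}, copy_fields)

print(before["text"])  # ['mydomain.com']
print(after["text"])   # ['mydomain']
```

In the "before" case a search for mydomain against "text" has nothing to
match, which is the symptom I was seeing.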


On 28/01/2019 12:37, Scott Stults wrote:
> Hi Chris,
>
> You've included the field definition of type text_en, but in your queries
> you're searching the field "text", which is of type text_general. That may
> be the source of your problem, but if looking into that doesn't help, send
> the definition of text_general as well.
>
> Hope that helps!
>
> -Scott
>
> On Mon, Jan 28, 2019 at 6:02 AM Chris Wareham <
> chris.ware...@graduate-jobs.com> wrote:
>
>> I'm trying to index some data which often includes domain names. I'd
>> like to remove the .com TLD, so I have modified the text_en field type
>> by adding a PatternReplaceFilterFactory filter. However, it doesn't
>> appear to be working as a search for "text:(mydomain.com)" matches
>> records but "text:(mydomain)" does not.


Re: PatternReplaceFilterFactory problem

2019-01-28 Thread Alexandre Rafalovitch
In Admin UI, there is an Analysis screen. You can enter your text and
your query there and see what happens to it at every step of the
processing pipeline.

This should tell you whether the problem is in indexing, query, or
somewhere else entirely (e.g. you are querying a different field as
Scott suggests).
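
If it helps to picture what that screen shows, here is a small Python
sketch that records the tokens after every step. It is an illustration
only: trace() and the three steps are hypothetical stand-ins, the
tokenizer is a plain whitespace split, and Python's re module is used in
place of Java regex (so the replacement is written \1 rather than $1).

```python
import re

def trace(text, steps):
    """Apply each (name, step) in order; return the tokens after every step."""
    tokens = [text]
    history = []
    for name, step in steps:
        tokens = step(tokens)
        history.append((name, list(tokens)))
    return history

# Hypothetical three-step chain; the real text_en chain has more filters.
steps = [
    ("tokenize", lambda toks: [t for tok in toks for t in tok.split()]),
    ("lowercase", lambda toks: [t.lower() for t in toks]),
    ("pattern replace",
     lambda toks: [re.sub(r"([-a-z])\.com", r"\1", t) for t in toks]),
]

history = trace("Visit MyDomain.com", steps)
for name, tokens in history:
    print(f"{name:16} {tokens}")
```

Note that the pattern only accepts a lowercase letter or a hyphen before
".com", so in this sketch it has to run after the lowercase step.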

Regards,
   Alex.
P.S. (Semi-)random tip of the day: if you copyField the content, it is
indexed and searched by the rules of the _target_ field. A field's own
analysis chain is invoked only when you search on that field directly.

On Mon, 28 Jan 2019 at 06:02, Chris Wareham wrote:
>
> I'm trying to index some data which often includes domain names. I'd
> like to remove the .com TLD, so I have modified the text_en field type
> by adding a PatternReplaceFilterFactory filter. However, it doesn't
> appear to be working as a search for "text:(mydomain.com)" matches
> records but "text:(mydomain)" does not.
>
> positionIncrementGap="100">
>  
>
> ignoreCase="true" synonyms="synonyms.txt"/>
> ignoreCase="true"/>
>
> pattern="([-a-z])\.com" replacement="$1"/>
>
> protected="protwords.txt"/>
>
>  
>  
>
> ignoreCase="true" synonyms="synonyms.txt"/>
> ignoreCase="true"/>
>
> pattern="([-a-z])\.com" replacement="$1"/>
>
> protected="protwords.txt"/>
>
>  
>
>
> The actual field definitions are as follows:
>
> stored="true"  required="true" />
> stored="true"  required="true" />
> stored="false" />
>
>
>


Re: PatternReplaceFilterFactory problem

2019-01-28 Thread Scott Stults
Hi Chris,

You've included the field definition of type text_en, but in your queries
you're searching the field "text", which is of type text_general. That may
be the source of your problem, but if looking into that doesn't help, send
the definition of text_general as well.

Hope that helps!

-Scott



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com