Re: PreAnalyzed URP and SchemaRequest API

2018-04-13 Thread David Smiley
Yes, I could imagine big gains from this strategy if OpenNLP is in the
analysis chain ;-)

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


RE: PreAnalyzed URP and SchemaRequest API

2018-04-13 Thread Markus Jelsma
Hello David,

If JSON serialization is too bulky, we could also opt for SimplePreAnalyzed,
right? At least as a FieldType it is possible; if not with the URP, it just
needs some work.
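
To make the size difference concrete, here is a minimal sketch of the simple serialization, assuming the grammar described in the SimplePreAnalyzedParser Javadoc: a version number, an optional `=stored=` part, then whitespace-separated tokens with `,name=value` attributes (`s` and `e` for offsets, `i` for position increment). The class and method names are hypothetical, the escaping covers only the delimiter characters, and nothing here touches Solr or Lucene.

```java
/** Sketch only: emits the "simple" pre-analyzed form by hand (assumed grammar, see above). */
final class SimplePreAnalyzedSketch {

    /** Backslash-escapes the characters that delimit tokens and attributes. */
    static String escapeTerm(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == ' ' || c == ',' || c == '=' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** terms[i] spans offsets starts[i]..ends[i]; every position increment is 1 here. */
    static String toSimple(String stored, String[] terms, int[] starts, int[] ends) {
        StringBuilder sb = new StringBuilder("1 ");
        // Stored part: '=' delimits it, so backslashes and '=' inside must be escaped.
        sb.append('=')
          .append(stored.replace("\\", "\\\\").replace("=", "\\="))
          .append('=');
        for (int i = 0; i < terms.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(escapeTerm(terms[i]))
              .append(",s=").append(starts[i])
              .append(",e=").append(ends[i])
              .append(",i=1");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSimple("Hello World",
                new String[] {"hello", "world"},
                new int[] {0, 6},
                new int[] {5, 11}));
    }
}
```

Each token costs only its text plus a few short attributes, without the quoted keys and braces the JSON form repeats per token.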

Regarding results: we haven't done it yet, and won't for some time, but we will
when we reintroduce OpenNLP in the analysis chain. We tried to introduce
POS-tagging on our own two years ago, but it wasn't suited for production
because it was too heavy on the CPU. Indexing data suddenly took eight to ten
times longer in a SolrCloud environment with three replicas.

If we offload our current chains without OpenNLP, it will only help when
large fields pass through a regex, and for decompounding the Germanic languages
we ingest. Offloading just this cost is a micro-optimization; offloading the
various OpenNLP char and token filters would be really beneficial.

Regarding a dependency on Lucene core and analysis-common, it would be helpful, 
but we'll manage.

Thanks again,
Markus
 


Re: PreAnalyzed URP and SchemaRequest API

2018-04-12 Thread David Smiley
Ah ok.
I've wondered how much value there is in pre-analysis.  The serialization
of the analyzed form in JSON is bulky.  If you can share any results, I'd
be interested to hear how it went.  It's an optimization so you should be
able to know how much better it is.  Of course it isn't for everybody --
only when the analysis chain is sufficiently complex.



RE: PreAnalyzed URP and SchemaRequest API

2018-04-09 Thread Markus Jelsma
Hello David,

The remote client has everything on the class path, but just calling
setTokenStream is not going to work. Remotely, all I get from the SchemaRequest
API is an AnalyzerDefinition. I haven't found any Solr code that allows me to
transform that directly into an analyzer. If I had that, it would make things
easy.

As far as I can see, I need to reconstruct a real Analyzer using
AnalyzerDefinition's information. It won't be a problem, but it is cumbersome.

Thanks anyway,
Markus
 


Re: PreAnalyzed URP and SchemaRequest API

2018-04-05 Thread David Smiley
Is this really a problem when you could easily enough create a TextField
and call setTokenStream?

Does your remote client have Solr-core and all its dependencies on the
classpath?   That's one way to do it... and presumably the direction you
are going because you're asking how to work with PreAnalyzedParser which is
in solr-core.  *Alternatively*, only bring in Lucene core and construct
things yourself in the right format.  You could copy PreAnalyzedParser into
your codebase so that you don't have to reinvent any wheels, even though
that's awkward.  Perhaps that ought to be in SolrJ?  But no, we don't want
SolrJ depending on Lucene-core, though it'd make a fine "optional"
dependency.
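
As a rough illustration of "construct things yourself in the right format", here is a minimal sketch that builds the JSON pre-analyzed form with plain JDK classes. It assumes the key names from the Solr Reference Guide's PreAnalyzedField section (`v` for version, `str` for the stored value, and per-token `t`, `s`, `e`, `i` for term, start offset, end offset and position increment). The class name, the `Token` holder and the whitespace `analyze` method are hypothetical stand-ins; a real client would run its actual analysis chain there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch only: builds the JSON pre-analyzed form by hand, with no Solr or Lucene classes. */
final class PreAnalyzedJsonSketch {

    /** One analyzed token: term text, character offsets, position increment. */
    static final class Token {
        final String term;
        final int start;
        final int end;
        final int posInc;

        Token(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    /** Hypothetical stand-in for the real chain: lowercase, split on whitespace. */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(term, start, i, 1));
            }
        }
        return tokens;
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Serializes the stored value and its tokens as one JSON object. */
    static String toJson(String stored, List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"v\":\"1\",\"str\":\"").append(escape(stored)).append("\",\"tokens\":[");
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"t\":\"").append(escape(t.term))
              .append("\",\"s\":").append(t.start)
              .append(",\"e\":").append(t.end)
              .append(",\"i\":").append(t.posInc)
              .append('}');
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        String text = "Hello World";
        System.out.println(toJson(text, analyze(text)));
    }
}
```

The resulting string would be sent as the value of a pre-analyzed field; whether the server accepts it depends on the field type or URP being configured for the JSON parser.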

On Wed, Apr 4, 2018 at 4:53 AM Markus Jelsma wrote:

> Hello,
>
> We intend to move to PreAnalyzed URP for analysis offloading. Browsing the
> Javadocs I came across the SchemaRequest API, looking for a way to get a
> Field object remotely, which I seem to need for
> JsonPreAnalyzedParser.toFormattedString(Field f). But all I can get from
> the SchemaRequest API is FieldTypeRepresentation, which offers me
> getIndexAnalyzer() but won't allow me to construct a Field object.
>
> So, to analyze remotely I do need an index-time analyzer. I can get it,
> but not turn it into a Field object, which the PreAnalyzedParser for some
> reason wants.
>
> Any hints here? I must be looking the wrong way.
>
> Many thanks!
> Markus
>