Re: EmbeddedSolrServer removed quietly in 7.0?

2018-01-09 Thread Robert Krüger
On Tue, Jan 9, 2018 at 11:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/9/2018 2:42 AM, Robert Krüger wrote:
>
>> I am looking to upgrade an application that still uses version 4.6.1 to
>> latest solr (7.2) but realized that support for EmbeddedSolrServer seems
>> to
>> have vanished with 7.0 but I could find no mention of it in the release
>> notes, which strikes me as odd for such a disruptive change. Is it
>> somewhere else or has it just gone silently? The class wasn't deprecated
>> in
>> 6.x as far as I can see. What am I missing?
>>
>> I have no problem updating to 6.6.2 for now to delay having to worry about
>> a bigger migration but just don't want to do it, because I am missing
>> something obvious.
>>
>> Btw. I do not want to start yet another thread about the merits of using
>> an
>> embedded solr server ;-).
>>
>
> It's still there.  Here's the 7.2 javadoc:
>
> https://lucene.apache.org/solr/7_2_0/solr-core/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.html


Duh, thanks so much. I was looking in the solrj javadoc, not solr-core! My
bad.


>
>
> Since 5.0, it is a descendant of SolrClient, and the SolrServer abstract
> class was removed in 6.0.
>
> Regarding whether you should use the embedded server or not:  For
> production usage, I would not recommend it, because redundancy is not
> possible.  But there are certain situations (mostly dev/testing) where it's
> absolutely the right solution.
>

We're using it in a classic embedded situation in a desktop app where
redundancy is not an issue and are super-happy with it.


>
> I am not expecting EmbeddedSolrServer to be removed.  It is used
> extensively in Solr tests and by many users in the wild.
>
>
That is very good to know. Thanks again!
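(For anyone landing on this thread later: a minimal sketch of embedded use
against the 7.x API, since EmbeddedSolrServer has been a SolrClient since
5.0. The solr home path and core name are placeholders, not taken from this
thread.)

    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class EmbeddedExample {
        public static void main(String[] args) throws Exception {
            // EmbeddedSolrServer is a SolrClient, so the usual
            // add/commit/query calls apply directly
            try (EmbeddedSolrServer solr = new EmbeddedSolrServer(
                    Paths.get("/path/to/solr/home"), "mycore")) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "1");
                solr.add(doc);
                solr.commit();
                System.out.println(
                        solr.query(new SolrQuery("*:*")).getResults().getNumFound());
            }
        }
    }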


Re: Efficiency of integer storage/use

2015-10-21 Thread Robert Krüger
Thanks everyone, for your answers. I will probably put together a simple
parametric test that pumps a Solr index full of integers with a very
limited range and then sorts by vector distances, to see what the
performance characteristics are.
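(A rough sketch of what such a parametric test could look like with SolrJ.
The core name, the metric*_i field names — which assume the default dynamic
field conventions — and the sort function are invented for illustration;
the constructor is the 5.x-era HttpSolrClient API.)

    import java.util.Random;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IntMetricBench {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/test");
            Random rnd = new Random(42);
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                // 50 metrics per document, each restricted to 0-255
                for (int m = 0; m < 50; m++) {
                    doc.addField("metric" + m + "_i", rnd.nextInt(256));
                }
                solr.add(doc);
            }
            solr.commit();
            // time a query sorted by a simple distance-like function expression
            SolrQuery q = new SolrQuery("*:*");
            q.setSort(SolrQuery.SortClause.asc(
                    "sum(abs(sub(metric0_i,128)),abs(sub(metric1_i,128)))"));
            long start = System.nanoTime();
            solr.query(q);
            System.out.println("sorted query: "
                    + (System.nanoTime() - start) / 1000000 + " ms");
            solr.close();
        }
    }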

On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Robert,
> From what I know, both the inverted index and docvalues compress content
> substantially, and even stored fields are compressed. So I think you have a
> good chance of experimenting successfully. You might need to tweak the
> schema to disable storing unnecessary info in the index.
>
> On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger <krue...@lesspain.de>
> wrote:
>
> > Thanks for the feedback.
> >
> > What I am trying to do is to "abuse" integers to store 8bit (or even
> lower)
> > values of metrics I use for content-based image/video search (such as
> > statistical values regarding color distribution) and then implement
> > similarity calculations based on formulas using vector distances. The
> Index
> > can become large (tens of millions of documents each with say 50-100
> > integers  describing the image metrics). I am looking at using a part of
> > those metrics for selecting a subset of images using range queries and
> then
> > more for sorting the result set by relevance.
> >
> > I was first looking at implementing those metrics as binary fields (see
> > other posting) and then use a custom function for the distance
> calculation
> > but so far I got the impression that way is not supported really well by
> > Solr. Base64-En/Decoding would kill performance and implementing a custom
> > field type with all that is probably required for that to work properly
> is
> > currently beyond my Solr knowledge. Besides, using built-in Solr features
> > makes it easier to finetune/experiment with different approaches,
> because I
> > can just play around with different queries and see what works best,
> > without each time adjusting a custom function.
> >
> > I hope that provides a better picture of what I am trying to achieve.
> >
> > Best,
> >
> > Robert
> >
> > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> > > Under the covers, Lucene stores ints in a packed format, so I'd just
> > count
> > > on that for a first pass.
> > >
> > > What is "a lot of integer values"? Hundreds of millions? Billions?
> > > Trillions?
> > >
> > > Unless you give us some indication of scale, it's hard to say anything
> > > helpful. But unless you have some evidence that you're going to blow out
> > > memory I'd just ignore the "wasted" bits. Especially if you can use
> > > docValues,
> > > that option holds much of the underlying data in MMapDirectory
> > > that uses swappable OS memory
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
> > > wrote:
> > > > Hi,
> > > >
> > > > I have a data model where I would store and index a lot of integer
> > values
> > > > with a very restricted range (e.g. 0-255), so theoretically the 32
> bits
> > > of
> > > > Solr's integer fields are complete overkill. I want to be able to do
> > > things
> > > > like vector distance calculations on those fields. Should I worry
> about
> > > the
> > > > "wasted" bits or will Solr compress/organize the index in a way that
> > > > compensates for this if there are only 256 (or even fewer) distinct
> > > values?
> > > >
> > > > Any recommendations on how my fields should be defined to make things
> > > like
> > > > numeric functions work as fast as technically possible?
> > > >
> > > > Thanks in advance,
> > > >
> > > > Robert
> > >
> >
> >
> >
> > --
> > Robert Krüger
> > Managing Partner
> > Lesspain GmbH & Co. KG
> >
> > www.lesspain-software.com
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>
>



-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Efficiency of integer storage/use

2015-10-17 Thread Robert Krüger
Thanks for the feedback.

What I am trying to do is to "abuse" integers to store 8bit (or even lower)
values of metrics I use for content-based image/video search (such as
statistical values regarding color distribution) and then implement
similarity calculations based on formulas using vector distances. The Index
can become large (tens of millions of documents each with say 50-100
integers  describing the image metrics). I am looking at using a part of
those metrics for selecting a subset of images using range queries and then
more for sorting the result set by relevance.

I was first looking at implementing those metrics as binary fields (see
other posting) and then use a custom function for the distance calculation
but so far I got the impression that way is not supported really well by
Solr. Base64-En/Decoding would kill performance and implementing a custom
field type with all that is probably required for that to work properly is
currently beyond my Solr knowledge. Besides, using built-in Solr features
makes it easier to finetune/experiment with different approaches, because I
can just play around with different queries and see what works best,
without each time adjusting a custom function.

I hope that provides a better picture of what I am trying to achieve.

Best,

Robert

On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Under the covers, Lucene stores ints in a packed format, so I'd just count
> on that for a first pass.
>
> What is "a lot of integer values"? Hundreds of millions? Billions?
> Trillions?
>
> Unless you give us some indication of scale, it's hard to say anything
> helpful. But unless you have some evidence that you're going to blow out
> memory I'd just ignore the "wasted" bits. Especially if you can use
> docValues,
> that option holds much of the underlying data in MMapDirectory
> that uses swappable OS memory....
>
> Best,
> Erick
>
> On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
> wrote:
> > Hi,
> >
> > I have a data model where I would store and index a lot of integer values
> > with a very restricted range (e.g. 0-255), so theoretically the 32 bits
> of
> > Solr's integer fields are complete overkill. I want to be able to do
> things
> > like vector distance calculations on those fields. Should I worry about
> the
> > "wasted" bits or will Solr compress/organize the index in a way that
> > compensates for this if there are only 256 (or even fewer) distinct
> values?
> >
> > Any recommendations on how my fields should be defined to make things
> like
> > numeric functions work as fast as technically possible?
> >
> > Thanks in advance,
> >
> > Robert
>



-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Efficiency of integer storage/use

2015-10-16 Thread Robert Krüger
Hi,

I have a data model where I would store and index a lot of integer values
with a very restricted range (e.g. 0-255), so theoretically the 32 bits of
Solr's integer fields are complete overkill. I want to be able to do things
like vector distance calculations on those fields. Should I worry about the
"wasted" bits or will Solr compress/organize the index in a way that
compensates for this if there are only 256 (or even fewer) distinct values?

Any recommendations on how my fields should be defined to make things like
numeric functions work as fast as technically possible?

Thanks in advance,

Robert


Problem with custom function on binary content

2015-10-15 Thread Robert Krüger
Hi,

I am trying to implement a custom function that evaluates fields stored as
type "binary" (BinaryField).

I have my ValueSourceParser set up and I can retrieve the arguments of my
function correctly and have a reference to the SchemaField. Now I am a bit
stuck, how I can retrieve the binary content from a ValueSource?

Let's make this easier with an example. Let's say I have a field named
"fingerprint" defined as binary and I want to implement a custom function
"norm" that computes an integer from that binary content that can be used
in sorting or whatever.

The field is declared like this:

<field name="fingerprint" type="binary" stored="true" />

I use "norm(fingerprint)" as my sorting expression and it arrives fine in
my ValueSourceParser.

To obtain a value source to use for the field value I tried two things:
1) parse the value source using parseValueSource()
2) parse the field name using parseId() and then something like:

SchemaField schemaField = fp.getReq().getSchema().getField(arg1);
ValueSource fieldValueSource =
schemaField.getType().getValueSource(schemaField, fp);

I thought I would just use that value source in my custom
FingerprintNormValueSource to retrieve the byte array via the
method org.apache.lucene.queries.function.FunctionValues#objectVal,
however, that always returns null for my documents.

Now I noticed that the ValueSource I get for the field is an instance
of org.apache.solr.schema.StrFieldSource and there objectVal is implemented
as strVal, which obviously will never return my fingerprint byte array.

Do I have to encode my byte arrays as strings or is there a direct way to
retrieve the binary data for use in my custom ValueSource?

Thanks a lot,

Robert


Initializing core takes very long at times

2015-08-05 Thread Robert Krüger
Hi,

for months/years, I have been experiencing occasional very long (30s+)
hangs when programmatically initializing a solr container in Java. The
application has worked for years in production with this setup without any
problems apart from this.

The code I have is this here:

 public void initContainer(File solrConfig) throws Exception {
     logger.debug("initializing solr container with config {}", solrConfig);
     Preconditions.checkNotNull(solrConfig);
     Stopwatch stopwatch = Stopwatch.createStarted();
     container = CoreContainer.createAndLoad(
             solrConfig.getParentFile().getAbsolutePath(), solrConfig);
     containerInitialized = true;
     logger.debug("initializing solr container took {}", stopwatch);
     if (stopwatch.elapsed(TimeUnit.MILLISECONDS) > 1000) {
         logger.warn("initializing solr container took very long ({})",
                 stopwatch);
     }
 }

So it is obviously the createAndLoad call. I posted about this a long time
ago and people suggested checking for uncommitted soft commits but now I
realized that I had these hangs in a test setup, where the index is created
from scratch, so that cannot be the problem.

Any ideas anyone?

My config is rather simple. Is there something wrong with my locking
options that might cause this?

<config>

  <indexConfig>
    <lockType>native</lockType>
    <unlockOnStartup>true</unlockOnStartup>
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <maxBufferedDocs>1000</maxBufferedDocs>
  </indexConfig>

  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>1</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false"
        multipartUploadLimitInKB="2048" />
  </requestDispatcher>

  <requestHandler name="standard" class="solr.StandardRequestHandler"
      default="true" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin/"
      class="org.apache.solr.handler.admin.AdminHandlers" />

  <filterCache class="solr.LRUCache" size="512" initialSize="512"
      autowarmCount="256"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
      autowarmCount="256"/>
  <documentCache class="solr.LRUCache" size="1" initialSize="512"
      autowarmCount="0"/>

  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>

</config>

Thank you in advance,

Robert


Embedded Solr now deprecated?

2015-08-05 Thread Robert Krüger
Hi,

I tried to upgrade my application from solr 4 to 5 and just now realized
that embedded use of solr seems to be on the way out. Is that correct or is
there just a new API to use for that?

Thanks in advance,

Robert


Re: Initializing core takes very long at times

2015-08-05 Thread Robert Krüger
OK, now that I had a reproducible setup I could debug where it hangs:

public SystemInfoHandler(CoreContainer cc) {
    super();
    this.cc = cc;
    init();
}

private void init() {
    try {
        InetAddress addr = InetAddress.getLocalHost();
        hostname = addr.getCanonicalHostName(); // <-- this is where it hangs
    } catch (UnknownHostException e) {
        // default to null
    }
}


So it depends on my current network setup even for the embedded case. Any
idea how I can stop Solr from making that call? InetAddress.getLocalHost()
in this case returns some local VPN address and thus the reverse lookup
times out after 30 seconds. This actually happens twice, once when
initializing the container and again when initializing the core, so in my
case it costs a minute per restart. Looking at the code, I don't see how I
can work around this other than patching Solr, which I am trying to avoid
like hell.
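(A tiny standalone probe for anyone who wants to check whether their machine
is affected; it just times the exact call named above:)

    import java.net.InetAddress;

    public class ReverseLookupProbe {
        public static void main(String[] args) throws Exception {
            long start = System.nanoTime();
            InetAddress addr = InetAddress.getLocalHost();
            // the call SystemInfoHandler makes; it blocks while the reverse
            // DNS lookup runs and times out slowly on broken network setups
            String name = addr.getCanonicalHostName();
            long ms = (System.nanoTime() - start) / 1000000;
            System.out.println(name + " resolved in " + ms + " ms");
        }
    }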

On Wed, Aug 5, 2015 at 1:54 PM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 for months/years, I have been experiencing occasional very long (30s+)
 hangs when programmatically initializing a solr container in Java. The
 application has worked for years in production with this setup without any
 problems apart from this.

 The code I have is this here:

  public void initContainer(File solrConfig) throws Exception {
      logger.debug("initializing solr container with config {}", solrConfig);
      Preconditions.checkNotNull(solrConfig);
      Stopwatch stopwatch = Stopwatch.createStarted();
      container = CoreContainer.createAndLoad(
              solrConfig.getParentFile().getAbsolutePath(), solrConfig);
      containerInitialized = true;
      logger.debug("initializing solr container took {}", stopwatch);
      if (stopwatch.elapsed(TimeUnit.MILLISECONDS) > 1000) {
          logger.warn("initializing solr container took very long ({})",
                  stopwatch);
      }
  }

 So it is obviously the createAndLoad call. I posted about this a long time
 ago and people suggested checking for uncommitted soft commits but now I
 realized that I had these hangs in a test setup, where the index is created
 from scratch, so that cannot be the problem.

 Any ideas anyone?

 My config is rather simple. Is there something wrong with my locking
 options that might cause this?

 <config>

   <indexConfig>
     <lockType>native</lockType>
     <unlockOnStartup>true</unlockOnStartup>
     <ramBufferSizeMB>64</ramBufferSizeMB>
     <maxBufferedDocs>1000</maxBufferedDocs>
   </indexConfig>

   <luceneMatchVersion>LUCENE_43</luceneMatchVersion>

   <updateHandler class="solr.DirectUpdateHandler2">
     <autoCommit>
       <maxDocs>1000</maxDocs>
       <maxTime>1</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>
   </updateHandler>

   <requestDispatcher handleSelect="true">
     <requestParsers enableRemoteStreaming="false"
         multipartUploadLimitInKB="2048" />
   </requestDispatcher>

   <requestHandler name="standard" class="solr.StandardRequestHandler"
       default="true" />
   <requestHandler name="/update" class="solr.UpdateRequestHandler" />
   <requestHandler name="/admin/"
       class="org.apache.solr.handler.admin.AdminHandlers" />

   <filterCache class="solr.LRUCache" size="512" initialSize="512"
       autowarmCount="256"/>
   <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
       autowarmCount="256"/>
   <documentCache class="solr.LRUCache" size="1" initialSize="512"
       autowarmCount="0"/>

   <admin>
     <defaultQuery>solr</defaultQuery>
   </admin>

 </config>

 Thank you in advance,

 Robert




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Initializing core takes very long at times

2015-08-05 Thread Robert Krüger
I am shipping solr as a local search engine with our software, so I have no
way of controlling that environment. Many other software packages (rdbmss,
nosql engines etc.) work well in such a setup (as does solr except this
problem). The problem is that in this case (AFAICS) the host cannot be
overridden in any way (by config or system property or whatever), because
that handler is coded as it is. It is in no way a natural limitation of the
type of software or my use case. But I understand that this is probably not
frequently a problem for people, because by far most solr use is classic
server-based use.

I may suggest a patch on the devel mailing list.


On Wed, Aug 5, 2015 at 5:42 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 8/5/2015 7:56 AM, Robert Krüger wrote:
  OK, now that I had a reproducible setup I could debug where it hangs:
 
  public SystemInfoHandler(CoreContainer cc) {
      super();
      this.cc = cc;
      init();
  }

  private void init() {
      try {
          InetAddress addr = InetAddress.getLocalHost();
          hostname = addr.getCanonicalHostName(); // <-- this is where it hangs
      } catch (UnknownHostException e) {
          // default to null
      }
  }
 
 
  So it depends on my current network setup even for the embedded case. Any
  idea how I can stop Solr from making that call? InetAddress.getLocalHost()
  in this case returns some local VPN address and thus the reverse lookup
  times out after 30 seconds. This actually happens twice, once when
  initializing the container and again when initializing the core, so in my
  case it costs a minute per restart. Looking at the code, I don't see how I
  can work around this other than patching Solr, which I am trying to avoid
  like hell.

 Because almost all users are using Solr in a mode where it requires the
 network, that code cannot be eliminated from Solr.

 It is critical that your machine's local network is set up completely
 right when you are running applications that are (normally)
 network-aware.  Generally that means having relevant entries for all
 interfaces in your hosts file and making sure that the DNS resolver code
 included with the operating system is not buggy.

 If you're dealing with a VPN or something else where the address is
 acquired from elsewhere, then you need to make sure that the machine has
 at least two DNS servers configured, that one of them is working, and
 that the forward and reverse DNS on those servers are completely set up
 for that interface's IP address.  Bugs in the DNS resolver code can
 complicate this.

 Thanks,
 Shawn




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Embedded Solr now deprecated?

2015-08-05 Thread Robert Krüger
I just saw lots of deprecation warnings in my current code and a method
that was removed, which is why I asked.

Regarding the use case, I am embedding it with a desktop application just
as others use java-based no-sql or rdbms engines and that makes sense
architecturally in my case and is just simpler than deploying a separate
little tomcat instance. API-wise, I know it is the same and it would be
doable to do it that way. The embedded option is just the logical and
simpler choice in terms of delivery, packaging, installation, automated
testing in this case. The network option just doesn't add anything here
apart from overhead (probably negligible in our case) and complexity.

So our use is in production but in a desktop way, not what people normally
think about when they hear production use.

Thanks everyone for the quick feedback! I am very relieved to hear it is
not on its way out and I will look at the api changes more closely and try
to get our application running on 5.2.1.

Best regards,

Robert

On Wed, Aug 5, 2015 at 4:34 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Where did you see that? Maybe I missed something yet
 again. This is unrelated to whether we ship a WAR if that's
 being conflated here.

 I rather doubt that embedded is on its way out, although
 my memory isn't what it used to be.

 For starters, MapReduceIndexerTool uses it, so it gets
 regular exercise from that, and anything removing it would
 require some kind of replacement.

 How are you using it that you care? Wondering what
 alternatives exist...

 Best,
 Erick


 On Wed, Aug 5, 2015 at 9:09 AM, Robert Krüger krue...@lesspain.de wrote:
  Hi,
 
  I tried to upgrade my application from solr 4 to 5 and just now realized
  that embedded use of solr seems to be on the way out. Is that correct or
 is
  there just a new API to use for that?
 
  Thanks in advance,
 
  Robert




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Initializing core takes very long at times

2015-08-05 Thread Robert Krüger
I just posted on lucene-dev. I think just replacing getCanonicalHostName
with getHostName might do the job. At least that's exactly what Logback does
for this purpose:

http://logback.qos.ch/xref/ch/qos/logback/core/util/ContextUtil.html
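(The suggested change, sketched against the init() method quoted below —
getHostName() returns the locally configured name without forcing a reverse
DNS lookup, which is the Logback approach linked above:)

    // in SystemInfoHandler.init(), roughly:
    InetAddress addr = InetAddress.getLocalHost();
    hostname = addr.getHostName(); // instead of addr.getCanonicalHostName(),
                                   // so no reverse lookup and no 30s timeout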

On Wed, Aug 5, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com
wrote:

 All patches welcome!

 On Wed, Aug 5, 2015 at 12:40 PM, Robert Krüger krue...@lesspain.de
 wrote:
  I am shipping solr as a local search engine with our software, so I have
 no
  way of controlling that environment. Many other software packages
 (rdbmss,
  nosql engines etc.) work well in such a setup (as does solr except this
  problem). The problem is that in this case (AFAICS) the host cannot be
  overridden in any way (by config or system property or whatever), because
  that handler is coded as it is. It is in no way a natural limitation of
 the
  type of software or my use case. But I understand that this is probably
 not
  frequently a problem for people, because by far most solr use is classic
  server-based use.
 
  I may suggest a patch on the devel mailing list.
 
 
  On Wed, Aug 5, 2015 at 5:42 PM, Shawn Heisey apa...@elyograg.org
 wrote:
 
  On 8/5/2015 7:56 AM, Robert Krüger wrote:
   OK, now that I had a reproducible setup I could debug where it hangs:
  
   public SystemInfoHandler(CoreContainer cc) {
       super();
       this.cc = cc;
       init();
   }

   private void init() {
       try {
           InetAddress addr = InetAddress.getLocalHost();
           hostname = addr.getCanonicalHostName(); // <-- this is where it hangs
       } catch (UnknownHostException e) {
           // default to null
       }
   }
  
  
   so it depends on my current network setup even for the embedded case.
 any
   idea how I can stop solr from making that call?
  InetAddress.getLocalHost()
   in this case returns some local vpn address and thus the reverse
 lookup
   times out after 30 seconds. This actually happens twice, once when
   initializing the container, again when initializing the core, so in my
  case
   a minute per restart and looking at the code, I don't see how I can
 work
   around this other than patching solr, which I am trying to avoid like
  hell.
 
  Because almost all users are using Solr in a mode where it requires the
  network, that code cannot be eliminated from Solr.
 
  It is critical that your machine's local network is set up completely
  right when you are running applications that are (normally)
  network-aware.  Generally that means having relevant entries for all
  interfaces in your hosts file and making sure that the DNS resolver code
  included with the operating system is not buggy.
 
  If you're dealing with a VPN or something else where the address is
  acquired from elsewhere, then you need to make sure that the machine has
  at least two DNS servers configured, that one of them is working, and
  that the forward and reverse DNS on those servers are completely set up
  for that interface's IP address.  Bugs in the DNS resolver code can
  complicate this.
 
  Thanks,
  Shawn
 
 
 
 
  --
  Robert Krüger
  Managing Partner
  Lesspain GmbH & Co. KG
 
  www.lesspain-software.com




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Example of sorting by custom function

2015-04-04 Thread Robert Krüger
Thanks a lot!

Robert

On Fri, Apr 3, 2015 at 5:34 PM, david.w.smi...@gmail.com <
david.w.smi...@gmail.com> wrote:

 ValueSourceParser — yes.  You’ll find a ton of them in Solr to get ideas
 from.

 In your example you forgot the “asc” or “desc”.

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley

 On Fri, Apr 3, 2015 at 9:44 AM, Robert Krüger krue...@lesspain.de wrote:

  Hi,
 
  I have been looking around on the web for information on sorting by a
  custom function but the results are inconclusive to me and some of it
 seems
  so old that I suspect it's outdated. What I want to do is the following:
 
  I have a field "fingerprint" in my schema that contains binary data
 (e.g.
  64 bytes) that can be used to compute a distance/similarity between two
  records.
 
  Now I want to be able to execute a query and sort its result by the
  distance of the matching records from a given reference value, so the
 query
  would look something like this
 
  q=*:*&sort=my_distance_func(fingerprint, 0xadet54786eguizgig)
 
  where
 
  my_distance_func is my custom function
  fingerprint is the field in my schema (type binary)
  0xadet54786eguizgig is a reference value (byte array encoded in whatever
  way) to which the distance shall be computed, which differs with each
  query.
 
  Is ValueSourceParser the right way to look here? Is there a source code
  example someone can point me to?
 
  Thanks in advance,
 
  Robert
 



Example of sorting by custom function

2015-04-03 Thread Robert Krüger
Hi,

I have been looking around on the web for information on sorting by a
custom function but the results are inconclusive to me and some of it seems
so old that I suspect it's outdated. What I want to do is the following:

I have a field "fingerprint" in my schema that contains binary data (e.g.
64 bytes) that can be used to compute a distance/similarity between two
records.

Now I want to be able to execute a query and sort its result by the
distance of the matching records from a given reference value, so the query
would look something like this

q=*:*&sort=my_distance_func(fingerprint, 0xadet54786eguizgig)

where

my_distance_func is my custom function
fingerprint is the field in my schema (type binary)
0xadet54786eguizgig is a reference value (byte array encoded in whatever
way) to which the distance shall be computed, which differs with each query.

Is ValueSourceParser the right way to look here? Is there a source code
example someone can point me to?

Thanks in advance,

Robert
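(For later readers: a bare-bones sketch of the ValueSourceParser route
confirmed in the reply above, registered in solrconfig.xml under the
function name via a valueSourceParser element. All My*/my_* names are
invented, and the distance computation is left as a placeholder since the
thread never settles how the binary value is decoded.)

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.queries.function.FunctionValues;
    import org.apache.lucene.queries.function.ValueSource;
    import org.apache.lucene.queries.function.docvalues.DoubleDocValues;
    import org.apache.solr.search.FunctionQParser;
    import org.apache.solr.search.SyntaxError;
    import org.apache.solr.search.ValueSourceParser;

    public class MyDistanceFuncParser extends ValueSourceParser {
        @Override
        public ValueSource parse(FunctionQParser fp) throws SyntaxError {
            ValueSource field = fp.parseValueSource(); // the fingerprint field
            String reference = fp.parseArg();          // encoded reference value
            return new MyDistanceValueSource(field, reference);
        }

        static class MyDistanceValueSource extends ValueSource {
            private final ValueSource field;
            private final String reference;

            MyDistanceValueSource(ValueSource field, String reference) {
                this.field = field;
                this.reference = reference;
            }

            @Override
            public FunctionValues getValues(Map context,
                    LeafReaderContext readerContext) throws IOException {
                final FunctionValues vals = field.getValues(context, readerContext);
                return new DoubleDocValues(this) {
                    @Override
                    public double doubleVal(int doc) {
                        // placeholder: a real implementation would decode the
                        // per-document bytes and compute a vector distance to
                        // the decoded reference fingerprint
                        Object v = vals.objectVal(doc);
                        return v == null ? Double.MAX_VALUE : 0.0;
                    }
                };
            }

            @Override
            public String description() {
                return "my_distance_func(" + field.description() + "," + reference + ")";
            }

            @Override
            public boolean equals(Object o) {
                return o instanceof MyDistanceValueSource
                        && field.equals(((MyDistanceValueSource) o).field)
                        && reference.equals(((MyDistanceValueSource) o).reference);
            }

            @Override
            public int hashCode() {
                return 31 * field.hashCode() + reference.hashCode();
            }
        }
    }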


Re: Easiest way to embed solr in a desktop application

2015-01-15 Thread Robert Krüger
Hi Ahmet,

at first glance, I'm not sure. Need to look at it more carefully.

Thanks,

Robert

On Thu, Jan 15, 2015 at 3:53 PM, Ahmet Arslan iori...@yahoo.com.invalid
wrote:

 Hi Robert,

 Never used by myself but is solr-packager useful in your case?

 http://sourcesense.github.io/solr-packager/

 Ahmet


 On Thursday, January 15, 2015 4:45 PM, Robert Krüger krue...@lesspain.de
 wrote:
 Hi Andrea,

 you are assuming correctly. It is a local, non-distributed index that is
 only accessed by the containing desktop application. Do you know if there
 is a possibility to run the Solr admin UI on top of an embedded instance
 somehow?

 Thanks a lot,

 Robert

 On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini a.gazzar...@gmail.com
 wrote:

  Hi Robert,
  I've used the EmbeddedSolrServer in a scenario like that and I never had
  problems.
  I assume you're talking about a standalone application, where the whole
  index resides locally and you don't need any cluster / cloud /
 distributed
  feature.
 
  I think the usage of EmbeddedSolrServer is discouraged in a (distributed)
  service scenario, because it is a direct connection to a SolrCore
  instance...but this is not a problem in the situation you described (as
 far
  as I know)
 
  Best,
  Andrea
 
 
  On 01/15/2015 03:10 PM, Robert Krüger wrote:
 
  Hi,
 
  I have been using an embedded instance of solr in my desktop application
  for a long time and it works fine. At the time when I made that decision
  (vs. firing up a solr web application within my swing application) I got
  the impression embedded use is somewhat unsupported and I should expect
  problems.
 
  My first question is, is this still the case now (4 years later), that
  embedded solr is discouraged?
 
  The one limitation I am running into is that I cannot use the solr admin
  UI
  for debugging purposes (mainly for running queries). Is there any other
  way
  to do this other than no longer using embedded solr and programmatically
  firing up a web application (e.g. using jetty)? Should I do the latter
  anyway?
 
  Any insights/advice greatly appreciated.
 
  Best regards,

 
  Robert
 
 
 


 --
 Robert Krüger
 Managing Partner
 Lesspain GmbH & Co. KG

 www.lesspain-software.com




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Easiest way to embed solr in a desktop application

2015-01-15 Thread Robert Krüger
I was considering the programmatic Jetty option, but then I read that Solr 5
no longer supports being run with an external servlet container; maybe they
still support programmatic Jetty use in some way. At the moment I am using
Solr 4.x, so this would work. No idea if this gets messy classloader-wise in
any way.

I have been using exactly the approach you described in the past, i.e. I
built a really, really simple Swing dialog to input queries and display
results in a table, but was just guessing that the built-in UI was far
superior. Maybe I should just live with it for the time being.
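(For reference, the programmatic Jetty option for the 4.x/WAR era could look
roughly like this; port and paths are placeholders:)

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.webapp.WebAppContext;

    public class SolrWebappLauncher {
        public static void main(String[] args) throws Exception {
            System.setProperty("solr.solr.home", "/path/to/solr/home");
            Server server = new Server(8983);
            WebAppContext ctx = new WebAppContext();
            ctx.setContextPath("/solr");
            ctx.setWar("/path/to/solr.war"); // the 4.x distribution still ships a WAR
            server.setHandler(ctx);
            server.start(); // admin UI then reachable at http://localhost:8983/solr
            server.join();
        }
    }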

On Thu, Jan 15, 2015 at 3:56 PM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 It’d certainly be easiest to just embed Jetty into your application.  You
 don’t need to have Jetty as a separate process, you could launch it through
 its friendly Java API, configured to use solr.war.

 If all you needed was to make HTTP(-like) queries to Solr instead of the
 full admin UI, your application could stick to using EmbeddedSolrServer and
 also provide a UI that takes in a Solr query string (or builds one up) and
 then sends it to the embedded Solr and displays the result.

 Erik

  On Jan 15, 2015, at 9:44 AM, Robert Krüger krue...@lesspain.de wrote:
 
  Hi Andrea,
 
  you are assuming correctly. It is a local, non-distributed index that is
  only accessed by the containing desktop application. Do you know if there
  is a possibility to run the Solr admin UI on top of an embedded instance
  somehow?
 
  Thanks a lot,
 
  Robert
 
  On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini a.gazzar...@gmail.com
 
  wrote:
 
  Hi Robert,
  I've used the EmbeddedSolrServer in a scenario like that and I never had
  problems.
  I assume you're talking about a standalone application, where the whole
  index resides locally and you don't need any cluster / cloud /
 distributed
  feature.
 
  I think the usage of EmbeddedSolrServer is discouraged in a
 (distributed)
  service scenario, because it is a direct connection to a SolrCore
  instance...but this is not a problem in the situation you described (as
 far
  as I know)
 
  Best,
  Andrea
 
 
  On 01/15/2015 03:10 PM, Robert Krüger wrote:
 
  Hi,
 
  I have been using an embedded instance of solr in my desktop
 application
  for a long time and it works fine. At the time when I made that
 decision
  (vs. firing up a solr web application within my swing application) I
 got
  the impression embedded use is somewhat unsupported and I should expect
  problems.
 
  My first question is, is this still the case now (4 years later), that
  embedded solr is discouraged?
 
  The one limitation I am running into is that I cannot use the solr
 admin
  UI
  for debugging purposes (mainly for running queries). Is there any other
  way
  to do this other than no longer using embedded solr and
 programmatically
  firing up a web application (e.g. using jetty)? Should I do the latter
  anyway?
 
  Any insights/advice greatly appreciated.
 
  Best regards,
 
  Robert
 
 
 
 
 
  --
  Robert Krüger
  Managing Partner
  Lesspain GmbH & Co. KG
 
  www.lesspain-software.com




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Easiest way to embed solr in a desktop application

2015-01-15 Thread Robert Krüger
Hi Andrea,

you are assuming correctly. It is a local, non-distributed index that is
only accessed by the containing desktop application. Do you know if there
is a possibility to run the Solr admin UI on top of an embedded instance
somehow?

Thanks a lot,

Robert

On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini a.gazzar...@gmail.com
wrote:

 Hi Robert,
 I've used the EmbeddedSolrServer in a scenario like that and I never had
 problems.
 I assume you're talking about a standalone application, where the whole
 index resides locally and you don't need any cluster / cloud / distributed
 feature.

 I think the usage of EmbeddedSolrServer is discouraged in a (distributed)
 service scenario, because it is a direct connection to a SolrCore
 instance...but this is not a problem in the situation you described (as far
 as I know)

 Best,
 Andrea


 On 01/15/2015 03:10 PM, Robert Krüger wrote:

 Hi,

 I have been using an embedded instance of solr in my desktop application
 for a long time and it works fine. At the time when I made that decision
 (vs. firing up a solr web application within my swing application) I got
 the impression embedded use is somewhat unsupported and I should expect
 problems.

 My first question is, is this still the case now (4 years later), that
 embedded solr is discouraged?

 The one limitation I am running into is that I cannot use the solr admin
 UI
 for debugging purposes (mainly for running queries). Is there any other
 way
 to do this other than no longer using embedded solr and programmatically
 firing up a web application (e.g. using jetty)? Should I do the latter
 anyway?

 Any insights/advice greatly appreciated.

 Best regards,

 Robert





-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Easiest way to embed solr in a desktop application

2015-01-15 Thread Robert Krüger
Hi,

I have been using an embedded instance of solr in my desktop application
for a long time and it works fine. At the time when I made that decision
(vs. firing up a solr web application within my swing application) I got
the impression embedded use is somewhat unsupported and I should expect
problems.

My first question is, is this still the case now (4 years later), that
embedded solr is discouraged?

The one limitation I am running into is that I cannot use the solr admin UI
for debugging purposes (mainly for running queries). Is there any other way
to do this other than no longer using embedded solr and programmatically
firing up a web application (e.g. using jetty)? Should I do the latter
anyway?

Any insights/advice greatly appreciated.

Best regards,

Robert


Re: Performance/scaling with custom function queries

2014-06-12 Thread Robert Krüger
Thanks for the info. I will look at that.

On Wed, Jun 11, 2014 at 3:47 PM, Joel Bernstein joels...@gmail.com wrote:
 In Solr 4.9 there is a feature called RankQueries that allows you to
 plug in your own ranking collector. So, if you wanted to write a
 ranking/sorting collector that used a thread per segment, you could cleanly
 plug it in.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Wed, Jun 11, 2014 at 9:39 AM, david.w.smi...@gmail.com 
 david.w.smi...@gmail.com wrote:

 On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger krue...@lesspain.de
 wrote:

  Or will I have to set up distributed search to achieve that?


 Yes — you have to shard it to achieve that.  The shards could be on the
 same node.

 There were some discussions this year in JIRA about being able to do
 thread-per-segment but it’s not quite there yet.  FWIW I think it would be
 a nice option for some use-cases (like yours).

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Performance/scaling with custom function queries

2014-06-11 Thread Robert Krüger
Would Solr use multithreading to process the records of a function
query as described above? In my scenario concurrent searches are not
the issue, rather the speed of one query will be the optimization
target. Or will I have to set up distributed search to achieve that?

Thanks,

Robert

On Tue, Jun 10, 2014 at 10:11 AM, Robert Krüger krue...@lesspain.de wrote:
 Great, I was hoping for that. In my case I will have to deal with the
 worst case scenario, i.e. all documents matching the query, because
 the only criterion is the fingerprint and the result of the
 distance/similarity function which will have to be executed for every
 document. However, I am dealing with a scenario where there will not
 be many concurrent users.

 Thank you.

 On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein joels...@gmail.com wrote:
 You only need to have fast access to the fingerprint field so only that
 field needs to be in memory. You'll want to review how Lucene DocValues and
 FieldCache work. Sorting is done with a PriorityQueue so only the top N
 docs are kept in memory.

 You'll only need to access the fingerprint field values for documents that
 match the query, so it won't be a full table scan unless all the docs match
 the query.

 Sounds like an interesting project. Please keep us posted.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 let's say I have an index that contains a field of type BinaryField
 called fingerprint that stores a few (let's say 100) bytes that are
 some kind of digital fingerprint-like thing.

 Let's say I want to perform queries on that field to achieve sorting
 or filtering based on a kind of custom distance function
 customDistance, i.e. I input a reference fingerprint and Solr
 returns either all documents sorted by
 customDistance(referenceFingerprint,documentFingerprint) or use
 that in an frange expression for filtering.

 I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
 do understand that using function queries with a custom function is
 definitely an expensive thing as it will result in what is called a
 full table scan in the sql world, i.e. data from all documents needs
 to be touched to select the correct documents or sort by a function's
 result.

 Given all that and provided, I have to use a custom function for my
 needs, I would like to know a few more details about solr architecture
 to understand what I have to look out for.

 I will have potentially millions of records. Does the data contained
 in other index fields play a role when I only use the fingerprint
 field for sorting and searching when it comes to RAM usage? I am
 hoping to calculate that my RAM should be able to accommodate the
 fingerprint data of all available documents for the queries to be fast
 but not fingerprint data and all other indexed or stored data.

 Example: My fingerprint data needs 100bytes per document, my other
 indexed field data needs 900 bytes per document. Will I need 100MB or
 1GB to fit all data that is needed to process one query in memory?

 Are there other things to be aware of?

 Thanks,

 Robert




 --
 Robert Krüger
 Managing Partner
  Lesspain GmbH & Co. KG

 www.lesspain-software.com



-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Performance/scaling with custom function queries

2014-06-10 Thread Robert Krüger
Great, I was hoping for that. In my case I will have to deal with the
worst case scenario, i.e. all documents matching the query, because
the only criterion is the fingerprint and the result of the
distance/similarity function which will have to be executed for every
document. However, I am dealing with a scenario where there will not
be many concurrent users.

Thank you.

On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein joels...@gmail.com wrote:
 You only need to have fast access to the fingerprint field so only that
 field needs to be in memory. You'll want to review how Lucene DocValues and
 FieldCache work. Sorting is done with a PriorityQueue so only the top N
 docs are kept in memory.

 You'll only need to access the fingerprint field values for documents that
 match the query, so it won't be a full table scan unless all the docs match
 the query.

 Sounds like an interesting project. Please keep us posted.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 let's say I have an index that contains a field of type BinaryField
 called fingerprint that stores a few (let's say 100) bytes that are
 some kind of digital fingerprint-like thing.

 Let's say I want to perform queries on that field to achieve sorting
 or filtering based on a kind of custom distance function
 customDistance, i.e. I input a reference fingerprint and Solr
 returns either all documents sorted by
 customDistance(referenceFingerprint,documentFingerprint) or use
 that in an frange expression for filtering.

 I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
 do understand that using function queries with a custom function is
 definitely an expensive thing as it will result in what is called a
 full table scan in the sql world, i.e. data from all documents needs
 to be touched to select the correct documents or sort by a function's
 result.

 Given all that and provided, I have to use a custom function for my
 needs, I would like to know a few more details about solr architecture
 to understand what I have to look out for.

 I will have potentially millions of records. Does the data contained
 in other index fields play a role when I only use the fingerprint
 field for sorting and searching when it comes to RAM usage? I am
 hoping to calculate that my RAM should be able to accommodate the
 fingerprint data of all available documents for the queries to be fast
 but not fingerprint data and all other indexed or stored data.

 Example: My fingerprint data needs 100bytes per document, my other
 indexed field data needs 900 bytes per document. Will I need 100MB or
 1GB to fit all data that is needed to process one query in memory?

 Are there other things to be aware of?

 Thanks,

 Robert




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Performance/scaling with custom function queries

2014-06-08 Thread Robert Krüger
Hi,

let's say I have an index that contains a field of type BinaryField
called fingerprint that stores a few (let's say 100) bytes that are
some kind of digital fingerprint-like thing.

Let's say I want to perform queries on that field to achieve sorting
or filtering based on a kind of custom distance function
"customDistance", i.e. I input a reference fingerprint and Solr
either returns all documents sorted by
customDistance(referenceFingerprint, documentFingerprint) or uses
that in an frange expression for filtering.

I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
do understand that using function queries with a custom function is
definitely an expensive thing as it will result in what is called a
full table scan in the sql world, i.e. data from all documents needs
to be touched to select the correct documents or sort by a function's
result.

Given all that and provided, I have to use a custom function for my
needs, I would like to know a few more details about solr architecture
to understand what I have to look out for.

I will have potentially millions of records. Does the data contained
in other index fields play a role when I only use the fingerprint
field for sorting and searching when it comes to RAM usage? I am
hoping to calculate that my RAM should be able to accommodate the
fingerprint data of all available documents for the queries to be fast
but not fingerprint data and all other indexed or stored data.

Example: My fingerprint data needs 100bytes per document, my other
indexed field data needs 900 bytes per document. Will I need 100MB or
1GB to fit all data that is needed to process one query in memory?

Are there other things to be aware of?

Thanks,

Robert
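(A quick back-of-the-envelope on the example above, assuming — per the
replies earlier in this thread — that only the fingerprint field has to stay
memory-resident for this query pattern; the document count is illustrative:)

    public class FingerprintRamEstimate {
        public static void main(String[] args) {
            long docs = 1000000L;     // illustrative: one million documents
            long fingerprint = 100L;  // bytes of fingerprint data per document
            long otherFields = 900L;  // bytes of other indexed/stored data per doc
            // raw payload only; index structures and JVM overhead come on top
            System.out.printf("fingerprints only: %d MB%n",
                    docs * fingerprint / (1024 * 1024));                 // ~95 MB
            System.out.printf("all field data:    %d MB%n",
                    docs * (fingerprint + otherFields) / (1024 * 1024)); // ~953 MB
        }
    }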


Set up embedded Solr container and cores programmatically to read their configs from the classpath

2014-02-11 Thread Robert Krüger
Hi,

I have an application with an embedded Solr instance (and I want to
keep it embedded) and so far I have been setting up my Solr
installation programmatically using folder paths to specify where the
specific container or core configs are.

I have used the CoreContainer methods createAndLoad and create using
File arguments and this works fine. However, now I want to change this
so that all configuration files are loaded from certain locations
using the classloader but I have not been able to get this to work.

E.g. I want to have my solr config located in the classpath at

my/base/package/solr/conf

and the core configs at

my/base/package/solr/cores/core1/conf,
my/base/package/solr/cores/core2/conf

etc..

Is this possible at all? Looking through the source code it seems that
specifying classpath resources in such a qualified way is not
supported but I may be wrong.

I could get this to work for the container by supplying my own
implementation of SolrResourceLoader that allows a base path to be
specified for the resources to be loaded (I first thought that would
happen already when specifying instanceDir accordingly but looking at
the code it does not. for resources loaded through the classloader,
instanceDir is not prepended). However then I am stuck with the
loading of the cores' resources as the respective code (see
org.apache.solr.core.CoreContainer#createFromLocal) instantiates a
SolrResourceLoader internally.

Thanks for any help with this (be it a clarification that it is not possible).

Robert
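(A sketch of the workaround described above: a resource loader that resolves
config resources under a classpath base package. The class name and base
path are invented, and the exact SolrResourceLoader hooks vary between 4.x
releases, so treat this as a starting point:)

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.solr.core.SolrResourceLoader;

    public class PackagedResourceLoader extends SolrResourceLoader {
        private final String basePackage; // e.g. "my/base/package/solr/conf/"

        public PackagedResourceLoader(String instanceDir, String basePackage) {
            super(instanceDir);
            this.basePackage = basePackage.endsWith("/") ? basePackage : basePackage + "/";
        }

        @Override
        public InputStream openResource(String resource) throws IOException {
            // try the qualified classpath location first, then fall back to
            // the default resolution (instanceDir/conf, classloader root, ...)
            InputStream in = getClass().getClassLoader()
                    .getResourceAsStream(basePackage + resource);
            return in != null ? in : super.openResource(resource);
        }
    }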


Programatic instantiation of solr container and cores with config loaded from a jar

2013-07-22 Thread Robert Krüger
Hi,

I use solr embedded in a desktop app and I want to change it to no
longer require the configuration for the container and core to be in
the filesystem but rather be distributed as part of a jar file.

Could someone kindly point me to the right docs?

So far my impression is, I need to instantiate CoreContainer with a
custom SolrResourceLoader with properties parsed via some other API
but from the javadocs alone I feel a bit lost (why does it have to
have an instance directory at all?) and googling did not give me many
results. What would be ideal would be to have something like this
(pseudocode with partly imagined names, which hopefully illustrates
what I am trying to achieve):

ContainerConfig containerConfig =
ContainerConfigParser.parse(InputStream from Classloader);
CoreContainer  container = new CoreContainer(containerConfig);

CoreConfig coreConfig = CoreConfigParser.parse(container, InputStream
from Classloader);
container.register(name, coreConfig);

Ideally I would like to keep XML format to reuse my current solr.xml
and solrconfig.xml but that is just a nice-to-have.

Does such a way exist and if so, what are the real API classes and calls to use?

Thank you in advance,

Robert


Re: Programatic instantiation of solr container and cores with config loaded from a jar

2013-07-22 Thread Robert Krüger
Great, thank you!

On Jul 22, 2013 1:35 PM, Alan Woodward a...@flax.co.uk wrote:

 Hi Robert,

 The upcoming 4.4 release should make this a bit easier (you can check out
the release branch now if you like, or wait a few days for the official
version).  CoreContainer now takes a SolrResourceLoader and a ConfigSolr
object as constructor parameters, and you can create a ConfigSolr object
from a string representation of solr.xml using the ConfigSolr.fromString()
static method.

 Alan Woodward
 www.flax.co.uk


 On 22 Jul 2013, at 11:41, Robert Krüger wrote:

  Hi,
 
  I use solr embedded in a desktop app and I want to change it to no
  longer require the configuration for the container and core to be in
  the filesystem but rather be distributed as part of a jar file.
 
  Could someone kindly point me to the right docs?
 
  So far my impression is, I need to instantiate CoreContainer with a
  custom SolrResourceLoader with properties parsed via some other API
  but from the javadocs alone I feel a bit lost (why does it have to
  have an instance directory at all?) and googling did not give me many
  results. What would be ideal would be to have something like this
  (pseudocode with partly imagined names, which hopefully illustrates
  what I am trying to achieve):
 
  ContainerConfig containerConfig =
  ContainerConfigParser.parse(InputStream from Classloader);
  CoreContainer  container = new CoreContainer(containerConfig);
 
  CoreConfig coreConfig = CoreConfigParser.parse(container, InputStream
  from Classloader);
  container.register(name, coreConfig);
 
  Ideally I would like to keep XML format to reuse my current solr.xml
  and solrconfig.xml but that is just a nice-to-have.
 
  Does such a way exist and if so, what are the real API classes and
calls to use?
 
  Thank you in advance,
 
  Robert
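(Putting Alan's description into a sketch — the exact signatures shifted
between 4.x releases (later ones take a SolrResourceLoader argument in
fromString()), so treat this as illustrative rather than copy-paste. The
resource path and instance dir are placeholders.)

    import java.io.InputStream;
    import java.util.Scanner;

    import org.apache.solr.core.ConfigSolr;
    import org.apache.solr.core.CoreContainer;
    import org.apache.solr.core.SolrResourceLoader;

    public class JarConfiguredContainer {
        public static CoreContainer create() throws Exception {
            // read solr.xml off the classpath
            String solrXml;
            try (InputStream in = JarConfiguredContainer.class
                    .getResourceAsStream("/my/base/package/solr/solr.xml")) {
                solrXml = new Scanner(in, "UTF-8").useDelimiter("\\A").next();
            }
            SolrResourceLoader loader = new SolrResourceLoader("/path/to/instanceDir");
            CoreContainer container =
                    new CoreContainer(loader, ConfigSolr.fromString(solrXml));
            container.load();
            return container;
        }
    }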



Is Overlapping onDeckSearchers=2 really a problem?

2013-06-27 Thread Robert Krüger
Hi,

I have a desktop application where I am abusing Solr as an embedded
database and I am quite happy with everything.
Performance is more than good enough for my use case and Solr's query
capabilities match the requirements of my app quite well. However, I
have the well-known performance warnings (see subject) in the log
whenever I index a lot of documents, although I never experience any
performance problems (might be hidden, though). The properties of my
app are:

- I (soft-)commit after every indexed item because I need the changes
to be visible immediately
- The commits are serialized
- I do not have any warming queries configured

I have read the FAQ but don't see anything that helps in my case. As I
said, I am happy with everything as it is but the warning makes me a
bit nervous (and maybe at some point my customers when their logs are
full of those warnings). What could I do to eliminate it? Can I
configure only one searcher to be used or anything like that?

Thanks for any hints,

Robert


Re: Is Overlapping onDeckSearchers=2 really a problem?

2013-06-27 Thread Robert Krüger
Hi,

On Thu, Jun 27, 2013 at 12:23 PM, Robert Krüger krue...@lesspain.de wrote:
 Hi,

 I have a desktop application where I am abusing Solr as an embedded
 database and I am quite happy with everything.
 Performance is more than good enough for my use case and Solr's query
 capabilities match the requirements of my app quite well. However, I
 have the well-known performance warnings (see subject) in the log
 whenever I index a lot of documents, although I never experience any
 performance problems (might be hidden, though). The properties of my
 app are:

 - I (soft-)commit after every indexed item because I need the changes
 to be visible immediately
 - The commits are serialized
 - I do not have any warming queries configured

 I have read the FAQ but don't see anything that helps in my case. As I
 said, I am happy with everything as it is but the warning makes me a
 bit nervous (and maybe at some point my customers when their logs are
 full of those warnings). What could I do to eliminate it? Can I
 configure only one searcher to be used or anything like that?

 Thanks for any hints,

 Robert

Sometimes forcing oneself to describe a problem is the first step to a
solution. I just realized that I also had an autocommit statement in
my config with the exact same interval that seemed to lie between
the warnings.

I removed that, because I don't think I really need it, and now the
warnings are gone. So it seems it happened whenever my manual commits
overlapped with an autocommit, which, of course, was more likely when
many commits were issued in sequence.


Re: Is Overlapping onDeckSearchers=2 really a problem?

2013-06-27 Thread Robert Krüger
Shawn,

On Thu, Jun 27, 2013 at 5:03 PM, Shawn Heisey s...@elyograg.org wrote:
 On 6/27/2013 5:59 AM, Robert Krüger wrote:
 Sometimes forcing oneself to describe a problem is the first step to a
 solution. I just realized that I also had an autocommit statement in
 my config with the exact same interval that seemed to lie between
 the warnings.

 I removed that, because I don't think I really need it, and now the
 warnings are gone. So it seems it happened whenever my manual commits
 overlapped with an autocommit, which, of course, was more likely when
 many commits were issued in sequence.

 If all you are doing is soft commits, your transaction logs are going to
 grow out of control.

you are absolutely right. I was shooting myself in the foot with that change.

 http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup

 My recommendation:

 1) Remove all commits from your indexing application.
 2) Configure autoCommit with values similar to that wiki page.
 3) Configure autoSoftCommit to happen often.

 The autoCommit must have openSearcher set to false.  For autoSoftCommit,
 include a maxTime between 1000 and 5000 (milliseconds) and leave maxDocs
 out.

I did that but without autoSoftCommit because I need control over when
the commits happen and soft-commit in my application.

Thank you so much,

Robert
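(The resulting setup from the client side, sketched with the SolrJ 4.x
commit overload — the three booleans are waitFlush, waitSearcher and
softCommit:)

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SoftCommitIndexer {
        private final SolrServer server; // e.g. an EmbeddedSolrServer

        public SoftCommitIndexer(SolrServer server) {
            this.server = server;
        }

        public void index(SolrInputDocument doc) throws Exception {
            server.add(doc);
            // hard commits are left to autoCommit (openSearcher=false);
            // visibility is driven explicitly via a soft commit
            server.commit(true, true, true);
        }
    }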


Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Robert Krüger
On Wed, Jun 5, 2013 at 9:12 PM, Jack Krupansky j...@basetechnology.com wrote:
 Look in the Solr log - the error message should tell you what the multiple
 values are. For example,

 95484 [qtp2998209-11] ERROR org.apache.solr.core.SolrCore  –
 org.apache.solr.common.SolrException: ERROR: [doc=doc-1] multiple values
 encountered for non multiValued field content_s: [def, abc]

 One of the values should be the value of the field that is the source of the
 copyField. Maybe the other value will give you a clue as to where it came
 from.

 Check your SolrJ code - maybe you actually do try to initialize a value in
 the field that is the copyField target.

I see the values in the stack trace:

org.apache.solr.common.SolrException: ERROR:
[doc=8f60d040-3462-4b28-998f-fd05a64f1cd8:/] multiple values
encountered for non multiValued field name2: [rename, rename]

It is just twice the value of source-field and I am not referencing
that field in my java code.


Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Robert Krüger
I don't know what I have to do to use the atomic update feature but I
am not aware of using it. But the way you describe it, it means that
the copyField directive does not overwrite the existing field content
and that's an easy explanation to what is happening in my case. Then
the second update (which I do manually, i.e. read current state,
manipulate fields and then add the document with the same id) will
lead to this. That was not so obvious to me from the docs.

Thanks,

Robert

On Thu, Jun 6, 2013 at 12:18 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : I updated the Index using SolrJ and got the exact same error message

 there aren't a lot of specifics provided in this thread, so this may not
 be applicable, but if you mean you are actually using the atomic updates
 feature to update an existing document, then the problem is that you still
 have the existing value in your name2 field, as well as another copy of
 the name field evaluated by copyField after the updates are applied...

 http://wiki.apache.org/solr/Atomic_Updates#Stored_Values


 -Hoss


Re: copyField generates multiple values encountered for non multiValued field

2013-06-06 Thread Robert Krüger
On Thu, Jun 6, 2013 at 1:52 PM, Jack Krupansky j...@basetechnology.com wrote:
 read current state, manipulate fields and then add the document with the
 same id)

 Ahh... then you have an IMPLICIT reference to the field in your Java code -
 you explicitly told Solr that you wanted to start with all existing field
 values. Just because a field is the target of a copyField doesn't make it
 any different from any other field when reading. Although, it does beg the
 question of whether or not this field should be stored - that's a
 data modeling question that only you can resolve. Do queries need to
 retrieve this field?
you're right. in my concrete use case it does not need to be stored.



 Be sure to null out any values for any fields that are sourced by copy
 fields. Otherwise, yes, duplicated values would be exactly what you should
 expect.
yes, I will do that.


 Is there any reason that you can't simply use atomic update - create a new
 document with the same document id but with only set values for the fields
 to be changed? There is also "add" for multivalued fields.

 There isn't great doc for this. Basically, the value for every non-ID field
 would be a Map object (HashMap) with a "set" key whose value is the new
 field value.

 Here's a code fragment for setting one field:

SolrInputDocument doc2 = new SolrInputDocument();
Map<String, String> fpValue2 = new HashMap<String, String>();
fpValue2.put("set", fp2);
doc2.setField("FACTURES_PRODUIT", fpValue2);

 You need a separate Map object for each field to be set or added for
 appending to a multivalued field. And you need a simple (non-Map) value for
 your ID field.

thanks for the info! the code is a lot older than solr 4.0, so that
option was not available at the time of its writing. I will check if
it makes sense to use that feature. most likely yes.

Robert
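
For completeness, a self-contained sketch of such an atomic update with
SolrJ 4.x (the field names and id value are made up for illustration):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Sketch only: "server" is assumed to be an initialized SolrServer.
  void setNameAtomically(SolrServer server) throws Exception {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "some-existing-id");        // plain (non-Map) value for the ID
      Map<String, Object> update = new HashMap<String, Object>();
      update.put("set", "new name");                 // "set" replaces the stored value
      doc.addField("name", update);                  // a Map value marks an atomic update
      server.add(doc);
      server.commit();
  }

Note that with an atomic update there is no read-then-re-add step, so the
duplicate-values problem from the copyField target discussed above does not
arise in the first place.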


copyField generates multiple values encountered for non multiValued field

2013-06-05 Thread Robert Krüger
I have the exact same problem as the guy here:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201105.mbox/%3C3A2B3E42FCAA4BF496AE625426C5C6E4@Wurstsemmel%3E

AFAICS he did not get an answer. Is this a known issue? What can I do
other than doing in my application what copyField is supposed to do?

I am using solr 4.0.0.

Thanks,

Robert


Re: copyField generates multiple values encountered for non multiValued field

2013-06-05 Thread Robert Krüger
OK, I have two fields defined as follows:

  <field name="name" type="string" indexed="true" stored="true"
         multiValued="false" />
  <field name="name2" type="string_ci" indexed="true"
         stored="true" multiValued="false" />

and this copyField directive

 <copyField source="name" dest="name2"/>

I updated the index using SolrJ and got the exact same error message
that is in the subject. However, while waiting for feedback I built a
workaround at the application level, and now, reconstructing the
original state to be able to answer you, I see different behaviour.
What happens now is that the field name2 is populated with multiple
values although it is not defined as multiValued (see above).

Although this is strange, it is consistent with the earlier problem in
that copyField does not seem to overwrite the existing field values. I
may be using it incorrectly (it's the first time I am using copyField)
but the docs in the wiki did not say anything about an overwrite
option.

Cheers,

Robert


On Wed, Jun 5, 2013 at 5:16 PM, Jack Krupansky j...@basetechnology.com wrote:
 Try describing your own symptom in your own words - because his issue
 related to Solr 1.4. I mean, where exactly are you setting
 allowDuplicates=false?? And why do you think it has anything to do with
 adding documents to Solr? Solr 1.4 did not have atomic update, so sending
 the exact same document twice would not result in a change in the index
 (unless you had a date field with a value of NOW.) Copy field only uses
 values from the current document.

 -- Jack Krupansky

 -Original Message- From: Robert Krüger
 Sent: Wednesday, June 05, 2013 10:37 AM
 To: solr-user@lucene.apache.org
 Subject: copyField generates multiple values encountered for non
 multiValued field


 I have the exact same problem as the guy here:

 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201105.mbox/%3C3A2B3E42FCAA4BF496AE625426C5C6E4@Wurstsemmel%3E

 AFAICS he did not get an answer. Is this a known issue? What can I do
 other than doing in my application what copyField is supposed to do?

 I am using solr 4.0.0.

 Thanks,

 Robert


Re: uniqueKey not enforced

2012-10-24 Thread Robert Krüger
On Tue, Oct 23, 2012 at 2:37 PM, Erick Erickson erickerick...@gmail.com wrote:
 From left field:

 Try looking at your admin/schema browser page for the ID in question.
 That actually
 gets stuff out of your index (the actual indexed terms). See if you
 have two values
I'm running embedded, so I don't have that. However I have a simple UI
for performing queries and the duplicate records are displayed issuing
a *:* query.

 for that ID. In which case you _might_ have spaces before or after the value
 somehow. I notice your comment says something about computed, so... Since
 String types are totally unanalyzed, spaces would count.
No, the way the id is computed can not lead to leading or trailing whitespace.


 you can also use the TermsComponent to see what's there, see:
 http://wiki.apache.org/solr/TermsComponent
I'll take a look.


 Best
 Erick

Thanks,

Robert


Re: uniqueKey not enforced

2012-10-24 Thread Robert Krüger
On Wed, Oct 24, 2012 at 2:03 PM, Erick Erickson erickerick...@gmail.com wrote:
 Robert:

 But you do have an index somewhere, so the alternative for
 looking at it low-level would be
 1) get a copy of Luke and point it at your index. Very useful tool

I will do that, next time I have that condition. Unfortunately I
didn't back up the index files when that happened.

Thanks for the advice!

Robert


uniqueKey not enforced

2012-10-22 Thread Robert Krüger
Hi,

I noticed a duplicate entry in my index and I am wondering how that
can be, because I have a uniqueKey defined.

I have the following defined in my schema.xml:

<?xml version="1.0" ?>

<schema name="main core" version="1.1">
  <types>
   <fieldtype name="string" class="solr.StrField"
              sortMissingLast="true" omitNorms="true"/>
   <!-- other field types omitted here ... -->
  </types>

 <fields>
  <!-- general -->
  <!-- id computed as a combination of media id and path -->
  <field name="id" type="string" indexed="true" stored="true"
         multiValued="false" />
  <!-- other fields omitted here ... -->
 </fields>

 <!-- field to use to determine and enforce document uniqueness. -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent -->
 <defaultSearchField>name</defaultSearchField>

 <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
 <solrQueryParser defaultOperator="OR"/>
</schema>

And now I have two records which both have the value
"4b34b883-a9d9-428a-92c3-ba1a69d96a70:/Düsendrögl" in their id field. Is
it the non-ASCII chars that cause the uniqueness enforcement to fail?

I am using Solr 3.6.1.

Any ideas what's going on?

Thanks,

Robert


Re: uniqueKey not enforced

2012-10-22 Thread Robert Krüger
On Mon, Oct 22, 2012 at 2:08 PM, Jack Krupansky j...@basetechnology.com wrote:
 Which release of Solr?
3.6.1


 Is this a single node Solr or distributed or cloud?
single node, actually embedded in an application.


 Is it possible that you added documents with the overwrite=false
 attribute? That would suppress the uniqueness test.
no, I just used SolrServer.add(Collection<SolrInputDocument> docs)


 Is it possible that you added those documents before adding the uniqueKey
 element to your schema, or added uniqueKey but did not restart Solr before
 adding those documents?
no, the element has been there for months, the index has been created
from scratch just before the test


 One minor difference from the Solr example schema is that your id field does
 not have required="true". I don't think that should matter (Solr will
 force the uniqueKey field to be required in documents), but I am curious how
 you managed to get an id field different from the Solr example.
so am I ;-). I will add the required attribute, though. It cannot hurt.


Re: uniqueKey not enforced

2012-10-22 Thread Robert Krüger
On Mon, Oct 22, 2012 at 6:01 PM, Jack Krupansky j...@basetechnology.com wrote:
 And, are you using UUID's or providing specific key values?
specific key values


Re: Config parameters to tweak for update performance

2012-10-18 Thread Robert Krüger
On Tue, Oct 16, 2012 at 4:13 PM, Shawn Heisey s...@elyograg.org wrote:
 On 10/16/2012 5:38 AM, Robert Krüger wrote:

 I use solr embedded in a desktop app and due to the consistency
 requirements of the application I have to commit rather often. Are
 there some best practices on how to optimize commit performance via
 the configuration? I could easily live with slower queries or more
 memory use as my index is rather small and queries are more than fast
 enough for my application.


 As I understand it, the new softCommit in Solr 4.0 is designed to address
 this exact issue -- near realtime search.  Basically new data is committed
 to RAM, which happens super-fast.  On a longer interval, you issue a hard
 commit, which writes the data in RAM to disk.

 For consistency through failures, you may need to have your application keep
 track of the last document successfully hard committed.  Solr 4.0 has a new
 updateLog capability that might make this unnecessary; someone with more
 experience will have to confirm or deny that.

 Thanks,
 Shawn

Great info! Thanks! I'll take a look at 4.0 then.
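
For anyone following along: explicit soft commits are also available via
SolrJ 4.x. A minimal sketch, assuming an initialized SolrServer named server:

  // commit(waitFlush, waitSearcher, softCommit) - true for the third
  // argument requests a soft commit instead of a hard one
  server.commit(false, true, true);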


Config parameters to tweak for update performance

2012-10-16 Thread Robert Krüger
Hi,

I use solr embedded in a desktop app and due to the consistency
requirements of the application I have to commit rather often. Are
there some best practices on how to optimize commit performance via
the configuration? I could easily live with slower queries or more
memory use as my index is rather small and queries are more than fast
enough for my application.

Thanks in advance,

Robert


Re: Can I rely on correct handling of interrupted status of threads?

2012-10-12 Thread Robert Krüger
On Tue, Oct 2, 2012 at 11:48 AM, Robert Krüger krue...@lesspain.de wrote:
 Hi,

 I'm using Solr 3.6.1 in an application embedded directly, i.e. via
 EmbeddedSolrServer, not over an HTTP connection, which works
 perfectly. Our application uses Thread.interrupt() for canceling
 long-running tasks (e.g. through Future.cancel). A while (and a few
 Solr versions) back a colleague of mine implemented a workaround
 because he said that Solr didn't handle the thread's interrupted
 status correctly, i.e. after catching an InterruptedException it neither
 restored the interrupted status nor rethrew the exception, thus losing the
 information that an interrupt has been requested, which breaks
 libraries relying on that.
 in mailing list or forum archives on the web. Is that still or was it
 ever the case? What does one have to watch out for when interrupting a
 thread that is doing anything within Solr/Lucene?

 Any advice would be appreciated.

 Regards,

 Robert

Just in case anyone else has the same question: You cannot. Thread
interruption is not handled properly so you should not use Solr in
code that you plan to interrupt via Thread.interrupt. You will get
stuff like IOExceptions so you cannot cleanly tell the difference
between real errors and stuff caused by interruption. Reading old
Jira issues, it looks as if this was a conscious decision because of
the amount of work it would cause in API design changes and lots of
checked exceptions the caller would have to handle.
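
For context, the contract being violated here is the standard Java idiom:
code that catches InterruptedException should either rethrow it or restore
the flag. A sketch of the correct pattern (doBlockingWork is hypothetical):

  try {
      doBlockingWork();                      // any call that may block
  } catch (InterruptedException e) {
      Thread.currentThread().interrupt();    // restore the interrupted status
      throw new RuntimeException("operation cancelled", e);
  }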


Re: Can I rely on correct handling of interrupted status of threads?

2012-10-04 Thread Robert Krüger
On Tue, Oct 2, 2012 at 8:50 PM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
 I remember a bug in EmbeddedSolrServer in 1.4.1 where an exception bypassed
 request closing, which led to a searcher leak and OOM. It was fixed about two
 years ago.

You mean InterruptedException?


Can I rely on correct handling of interrupted status of threads?

2012-10-02 Thread Robert Krüger
Hi,

I'm using Solr 3.6.1 in an application embedded directly, i.e. via
EmbeddedSolrServer, not over an HTTP connection, which works
perfectly. Our application uses Thread.interrupt() for canceling
long-running tasks (e.g. through Future.cancel). A while (and a few
Solr versions) back a colleague of mine implemented a workaround
because he said that Solr didn't handle the thread's interrupted
status correctly, i.e. after catching an InterruptedException it neither
restored the interrupted status nor rethrew the exception, thus losing the
information that an interrupt has been requested, which breaks
libraries relying on that.
in mailing list or forum archives on the web. Is that still or was it
ever the case? What does one have to watch out for when interrupting a
thread that is doing anything within Solr/Lucene?

Any advice would be appreciated.

Regards,

Robert


How to Index and query URLs as fields

2011-03-08 Thread Robert Krüger
Hi,

I've run into problems trying to achieve a seemingly simple thing. I'm indexing 
a bunch of files (local ones and potentially some accessible via other 
protocols like http or ftp) and have an index field with the url to the file, 
e.g. file:/home/foo/bar.pdf. Now I want to perform two simple types of 
queries on this, i.e. retrieve all file records located under a certain path 
(e.g. file:/home/foo/*) or find the file record for an exact URL.

What I naively tried was to index the file URL in a field (fileURL) of type 
string and simply perform queries like

fileURL:file\:/home/foo/*

and 

fileURL:file\:/home/foo/bar.pdf

and neither one returned results.

the type is defined as

<fieldtype name="string" class="solr.StrField" sortMissingLast="true" 
omitNorms="true"/>

and the field as

<field name="fileURL" type="string" indexed="true" stored="true" 
multiValued="false" /> 

I am using solr 1.4.1 and use solrj to do the indexing and querying.

This seems like a rather basic requirement and obviously I am doing something 
wrong. I didn't find anything in the docs or the mailing list archive so far.

Any help, hints, pointers would be appreciated.

Robert
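
For reference, a SolrJ sketch of the two query types described above, with
the colon escaped by hand ("server" is assumed to be an initialized
SolrServer; the follow-up below confirms the approach works):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  void queryByUrl(SolrServer server) throws Exception {
      // exact match: escape the colon so the parser keeps it inside the term
      QueryResponse exact =
          server.query(new SolrQuery("fileURL:file\\:/home/foo/bar.pdf"));
      System.out.println(exact.getResults().getNumFound());
      // prefix match: all records located under a path
      QueryResponse under =
          server.query(new SolrQuery("fileURL:file\\:/home/foo/*"));
      System.out.println(under.getResults().getNumFound());
  }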



Re: How to Index and query URLs as fields

2011-03-08 Thread Robert Krüger

My mistake. The error turned out to be somewhere else and the described 
approach seems to work.

Sorry for the wasted bandwidth.


On Mar 8, 2011, at 11:06 AM, Robert Krüger wrote:

 Hi,
 
 I've run into problems trying to achieve a seemingly simple thing. I'm 
 indexing a bunch of files (local ones and potentially some accessible via 
 other protocols like http or ftp) and have an index field with the url to the 
 file, e.g. file:/home/foo/bar.pdf. Now I want to perform two simple types 
 of queries on this, i.e. retrieve all file records located under a certain 
 path (e.g. file:/home/foo/*) or find the file record for an exact URL.
 
 What I naively tried was to index the file URL in a field (fileURL) of type 
 string and simply perform queries like
 
 fileURL:file\:/home/foo/*
 
 and 
 
 fileURL:file\:/home/foo/bar.pdf
 
 and neither one returned results.
 
 the type is defined as
 
 <fieldtype name="string" class="solr.StrField" sortMissingLast="true" 
 omitNorms="true"/>
 
 and the field as
 
 <field name="fileURL" type="string" indexed="true" stored="true" 
 multiValued="false" /> 
 
 I am using solr 1.4.1 and use solrj to do the indexing and querying.
 
 This seems like a rather basic requirement and obviously I am doing something 
 wrong. I didn't find anything in the docs or the mailing list archive so far.
 
 Any help, hints, pointers would be appreciated.
 
 Robert
 



Pragmatic more or less high availability option on 2 servers

2010-02-16 Thread Robert Krüger

Hi,

I have to set up a SOLR cluster with some availability concept (it is allowed to 
require manual intervention on failure; however, if there is a better way, I'd be 
interested in recommendations).

I have two servers (A and B for the example) at my disposal.

What I was thinking about was the following setup:

Set up 2 SOLR Instances on each server, i.e.

on A)
- One master (active)
- One slave (always active) replicating index from the master using the 
mechanisms described in http://wiki.apache.org/solr/SolrReplication

on B)
- One master (active but not used for index updates) replicating the index from the 
master on A
- One slave (always active) replicating index from the master on A

I'll write the configs with placeholders for properties for the address of the 
master server so I can start the slaves with the master on A or on B as the 
master to replicate from.
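
For illustration, the slave side of such a config might look like this
sketch (the master URL property and poll interval are placeholders):

 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="slave">
     <str name="masterUrl">http://${master.host}:8983/solr/replication</str>
     <str name="pollInterval">00:00:60</str>
   </lst>
 </requestHandler>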

When B goes down, nothing happens as the slave on A is still there to serve 
queries as well as the master to serve update requests.

When A goes down, updates will begin to fail but queries will still be served 
from the slave on B. A restart of B with the until-then-inactive master configured 
as the master will bring index update functionality up again.

For simplicity's sake I did not mention any load-balancing or IP address 
switching stuff. I assume that I have to load-balance the access to the two 
slaves and make sure that when the master is switched to B, either it 
assumes the IP address or the clients updating the index are reconfigured 
(the former probably being simpler).

Now, my questions are:
- Will this work?
- Can this be simplified, i.e. does it make sense to have separate master/slave 
instances when they are on the same hardware, or is it enough (or even better) to 
write the config of the slave on B so that it can be restarted in master mode 
based on the same index?

Thanks in advance,

Robert




Re: Boosting by date when only some records have one

2008-12-17 Thread Robert Krüger

Hi Lance,

thanks for your feedback!

The problem with your suggestion is that we do not want to exclude the
fields without a date but boost them differently, which Chris has
provided a solution for (see other posts in this thread).

Cheers,

Robert



Norskog, Lance wrote:
 This query: field:[* TO *]
 Gives the entries which have a value for that field. Documents with nothing 
 in 'field' will not be returned.  To get the opposite set:
 
 *:* -field:[* TO *]
 
 Does this help?
 Lance
 
 -Original Message-
 From: Robert Krüger [mailto:krue...@signal7.de] 
 Sent: Saturday, December 13, 2008 4:22 AM
 To: solr-user@lucene.apache.org
 Subject: Boosting by date when only some records have one
 
 
 Hi,
 
 I'm looking for a way to boost queries by date, but not all documents have a 
 date associated with them. My goal is to have something like a default boost 
 for documents (e.g. 1.0) with a function for documents with dates that 
 distributes the boosts between 1.0 - x and 1.0 + x based on a valid date range.
 
 So far I have not been able to formulate a query that does that. The query I 
 tried returned documents which, without the boost function, weren't 
 returned, which renders it useless.
 
 I have criteria in the schema based on which I know whether a document is supposed 
 to be boosted by date or not. So in pseudocode what I wanted to have is:
 
 if(document.dateField != null){
   boost = somefunction(document.dateField);
 } else {
   boost = 1;
 }
 
 or (using another field as criterion)
 
 if(document.hasDateField == 1){
   boost = somefunction(document.dateField);
 } else {
   boost = 1;
 }
 
 Any suggestions or sample queries anyone?
 
 Many thanks in advance,
 
 Robert
 
 



Re: Boosting by date when only some records have one

2008-12-16 Thread Robert Krüger

Hi,

thanks a lot! Looks like what I need except that I cannot use dismax
because I need to be able to do prefix queries. I'm new to Solr, so I
don't know if there's a way to formulate this in a standard query. If
not, I could extend DismaxRequestHandler so it doesn't escape the *s,
right?

Robert


Chris Hostetter wrote:
 : if(document.hasDateField == 1){
 :  boost = somefunction(document.dateField);
 : } else{
 :  boost = 1;
 : }
 
  bq = ( ( +hasDateField:true   _val_:"somefunc(dateField)" )
         ( -hasDateField:true   *:*^1 )
       )
 
 That covers the possibility that hasDateField is not set for some docs.  
 The query gets simpler if you can concretely know that hasDateField will 
 always have a value of true or false...
 
  bq = ( hasDateField:false^1 ( +hasDateField:true _val_:"somefunc(dateField)" ) )
 
 
 
 
 -Hoss
 



Is making removal of wildcards configurable planned for DisMaxRequestHandler

2008-12-16 Thread Robert Krüger

Hi,

I'm rather new to Solr and for my current projects came to the
conclusion that DisMaxRequestHandler is exactly the tool I need, except
that it doesn't allow prefix queries.

I found a thread in the archive where someone mentioned the idea of
making this behaviour configurable (which characters to strip from the
query parameter). Is someone working on this or is my best option
currently to implement this behaviour by copying code from
DisMaxRequestHandler and modify the code that strips illegal operators?

Thanks in advance,

Robert



Re: Boosting by date when only some records have one

2008-12-16 Thread Robert Krüger
Chris Hostetter wrote:
 : thanks a lot! Looks like what I need except that I cannot use dismax
 : because I need to be able to do prefix queries. I'm new to Solr, so I
 
 there's nothing dismax related in that syntax, i just suggested using it 
 in a bq param because i assumed that's what you were using.
 
 q = +pre:fix* (hasDateField:false^1 (+hasDateField:true 
 _val_:"somefunc(dateField)"))
 
 :   bq = ( hasDateField:false^1 ( +hasDateField:true 
 _val_:"somefunc(dateField)" ) )
 
 
 
 
 -Hoss
 
perfect, thanks!!

Robert
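
For readers looking for a concrete somefunc: a common Solr idiom of that
era for recency boosting was recip over rord (the constants are
illustrative and need tuning for the actual date range):

 _val_:"recip(rord(dateField),1,1000,1000)"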



Boosting by date when only some records have one

2008-12-13 Thread Robert Krüger

Hi,

I'm looking for a way to boost queries by date, but not all documents
have a date associated with them. My goal is to have something like a
default boost for documents (e.g. 1.0) with a function for documents
with dates that distributes the boosts between 1.0 - x and 1.0 + x based
on a valid date range.

So far I have not been able to formulate a query that does that. The
query I tried returned documents which, without the boost function,
weren't returned, which renders it useless.

I have criteria in the schema based on which I know whether a document is
supposed to be boosted by date or not. So in pseudocode what I wanted to
have is:

if(document.dateField != null){
 boost = somefunction(document.dateField);
} else{
 boost = 1;
}

or (using another field as criterion)

if(document.hasDateField == 1){
 boost = somefunction(document.dateField);
} else{
 boost = 1;
}

Any suggestions or sample queries anyone?

Many thanks in advance,

Robert