Re: custom field type plugin

Smiley, David W. Wed, 24 Jul 2013 07:58:58 -0700

To eliminate the possibility of errors, you need to buffer the query as
indicated in the wiki.  If you don't and you use a super-small maxDistErr
as you tell me you are doing, then you are merely making the probability
of hitting an error small (perhaps even very very small), but not
nonexistent.  I wish there was a field type that wrapped all this up so
that users wouldn't have to concern themselves with these tricky details.
I created an issue to track it:
https://issues.apache.org/jira/browse/SOLR-5072


~ David
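
The buffering advice can be sketched in Python as follows. This is only an illustration under assumptions not stated in the thread: regions are taken to be indexed as x=start, y=end (which is what the Intersects queries quoted further down suggest), the helper name buffered_intersects is hypothetical, and the 0.5 buffer is just a plausible pad for whole-number coordinates. The wiki remains the authority on how much padding to use.

```python
def buffered_intersects(field, point, world_min, world_max, buffer=0.5):
    """Build an Intersects() query for regions that contain `point`.

    Assumption: each region is indexed as the 2D point (start, end), so a
    region contains `point` when start <= point and end >= point.
    """
    max_x = point + buffer  # region start may be anything up to the point
    min_y = point - buffer  # region end may be anything from the point upward
    return f'{field}:"Intersects({world_min} {min_y} {max_x} {world_max})"'


# Hypothetical usage with a field name and bounds taken from the thread.
print(buffered_intersects("humanCoordinate", 7831329, 0, 10000000000))
```

The pad widens the query rectangle on both inner edges, so an indexed value sitting exactly on the boundary does not depend on floating-point rounding to match.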

On 7/24/13 9:26 AM, "Kevin Stone" <kevin.st...@jax.org> wrote:

>I tried reducing the maxDistErr to "0.01", just to test making it smaller.
>I got maxLevels down to 45, and slightly better query times (Indexing time
>was about the same). However, my queries are not accurate anymore. I need
>to pad by 2 or 3 whole numbers to get a hit now, which won't work in real
>use. I can play with the number a bit more, but I didn't see anything
>wrong when I had it at "0.000000009". I do know about using a small
>decimal value to pad around my coordinates, and I'll probably do that for
>the real implementation, but for testing, whole numbers were working for
>all my edge cases.
>
>-Kevin
>
>On 7/23/13 10:45 PM, "Smiley, David W." <dsmi...@mitre.org> wrote:
>
>>Kevin,
>>
>>Those are some good query response times but they could be better.  You've
>>configured the field type sub-optimally.  Look again at
>>http://wiki.apache.org/solr/SpatialForTimeDurations and note in particular
>>maxDistErr.  You've left it at the value that comes pre-configured with
>>Solr, 0.000000009, which is ~1 meter measured in degrees, and this value
>>makes no sense when your numeric range is in whole numbers.  I suspect you
>>inherited this value from Hoss's slides.  **Instead use 1.** (as shown on
>>the wiki). This affects performance in a big way since you've configured
>>the prefixTree to hold 2.22e18 values (calculated via (max-min) /
>>maxDistErr) as opposed to "just" 2e10.  Your log shows maxLevels is 50 for
>>quad tree.  The comments in QuadPrefixTree (and I put them there once)
>>indicate maxLevels of 50 is about as much as is supported.  But again, I'm
>>not certain what the limit really is without validating.  Hopefully you
>>can stay clear of 50.  To do some tests, try querying just on the edge on
>>either side of an indexed value to make sure you match the point and then
>>don't match the indexed point as you would expect based on the
>>instructions.  Also, be sure to read more of the details on "Search" on
>>this wiki page in which you are advised to buffer the query shape
>>slightly; you didn't do this in your examples below.  This is all a bit of
>>a hack when using a field that internally is using floating point instead
>>of fixed precision.
>>
>>~ David Smiley
>>
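
The relationship between maxDistErr and maxLevels can be estimated with a few lines of Python. This is a rough sketch, not Solr's actual code: it assumes each quad tree level halves the cell size along an axis, that levels are added until a cell is smaller than maxDistErr, and that 50 is the cap mentioned above. The function name quad_levels is hypothetical.

```python
def quad_levels(extent, max_dist_err, cap=50):
    """Count halvings of `extent` until a cell is smaller than max_dist_err.

    Assumption: one level halves the cell along an axis; `cap` stands in
    for the maxLevels ceiling of about 50 discussed in the thread.
    """
    levels = 0
    cell = float(extent)
    while cell >= max_dist_err and levels < cap:
        cell /= 2.0
        levels += 1
    return levels


# Width of the worldBounds from the posted field type configuration.
extent = 100000000000 - (-100000000000)
for err in (0.000000009, 0.01, 1):
    print(err, quad_levels(extent, err))
```

Under these assumptions the sketch gives 50 (capped) for 0.000000009 and 45 for 0.01, which agrees with the maxLevels values reported in this thread, and 38 for a maxDistErr of 1.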
>>On 7/23/13 9:32 PM, "Kevin Stone" <kevin.st...@jax.org> wrote:
>>
>>>Sorry for the late response. I needed to find the time to load a lot of
>>>extra data (closer to what we're anticipating). I have an index with close
>>>to 220,000 documents, each with at least two coordinate regions anywhere
>>>between -10 billion and +10 billion, but could potentially have up to maybe
>>>half a dozen regions in one document. The reason for the negatives is that
>>>you can read a chromosome either backwards or forwards, so many
>>>coordinates can be minus.
>>>
>>>Here is the schema field definition:
>>>
>>>        <fieldType name="geneticLocation"
>>>         class="solr.SpatialRecursivePrefixTreeFieldType"
>>>         multiValued="true"
>>>         geo="false"
>>>         worldBounds="-100000000000 -100000000000 100000000000 100000000000"
>>>         distErrPct="0"
>>>         maxDistErr="0.000000009"
>>>         units="degrees"
>>>         />
>>>
>>>
>>>Here is the first query in the log:
>>>
>>>INFO: geneticLocation{class=org.apache.solr.schema.SpatialRecursivePrefixTreeFieldType,
>>>analyzer=org.apache.solr.schema.FieldType$DefaultAnalyzer,args={distErrPct=0,
>>>geo=false, multiValued=true, worldBounds=-100000000000 -100000000000
>>>100000000000 100000000000, maxDistErr=0.000000009, units=degrees}} strat:
>>>RecursivePrefixTreeStrategy(prefixGridScanLevel:46,SPG:(QuadPrefixTree(maxLevels:50,
>>>ctx:SpatialContext{geo=false, calculator=CartesianDistCalc,
>>>worldBounds=Rect(minX=-1.0E11,maxX=1.0E11,minY=-1.0E11,maxY=1.0E11)})))
>>>maxLevels: 50
>>>Jul 23, 2013 9:11:45 PM org.apache.solr.core.SolrCore execute
>>>INFO: [testIndex] webapp=/solr path=/select
>>>params={wt=xml&q=humanCoordinate:"Intersects(0+60330+6033041244+10000000000)"&rows=100}
>>>hits=81112 status=0 QTime=122
>>>
>>>
>>>
>>>
>>>
>>>Here are some other queries to give different timings (the one above
>>>brings back quite a lot):
>>>
>>>INFO: [testIndex] webapp=/solr path=/select
>>>params={wt=xml&q=humanCoordinate:"Intersects(0+6000000000+6900000000+10000000000)"&rows=100}
>>>hits=6031 status=0 QTime=10
>>>Jul 23, 2013 9:13:43 PM org.apache.solr.core.SolrCore execute
>>>INFO: [testIndex] webapp=/solr path=/select
>>>params={wt=xml&q=humanCoordinate:"Intersects(0+0+10000000+10000000000)"&rows=100}
>>>hits=500 status=0 QTime=15
>>>Jul 23, 2013 9:14:14 PM org.apache.solr.core.SolrCore execute
>>>INFO: [testIndex] webapp=/solr path=/select
>>>params={wt=xml&q=humanCoordinate:"Intersects(0+7831329+7831329+10000000000)"&rows=100}
>>>hits=4 status=0 QTime=17
>>>INFO: [testIndex] webapp=/solr path=/select
>>>params={wt=xml&q=humanCoordinate:"Intersects(-10000000000+-1051057963+-1001057963+0)"&rows=100}
>>>hits=661 status=0 QTime=8
>>>
>>>
>>>
>>>The query times look pretty fast to me. Certainly I'm pretty impressed.
>>>Our other backup solutions (involving SQL) likely wouldn't even touch this
>>>in terms of speed.
>>>
>>>
>>>
>>>We will be testing this more in depth in the coming month. I am sort of
>>>jumping ahead of our team to research possible solutions, since this is
>>>something that worried us. Looks like it might work!
>>>
>>>Thanks,
>>>-Kevin
>>>
>>>On 7/23/13 1:47 PM, "David Smiley (@MITRE.org)" <dsmi...@mitre.org>
>>>wrote:
>>>
>>>>Oh cool!  I'm glad it at least seemed to work.  Can you post your
>>>>configuration of the field type and report from Solr's logs what
>>>>"maxLevels" value is used for this field, which is logged the first time
>>>>you use the field type?
>>>>
>>>>Maybe there isn't a limit under 10B after all.  Some quick'n'dirty
>>>>calculations I just did indicate there shouldn't be a problem but
>>>>real-world usage will be a better proof.  Indexing probably won't be
>>>>terribly slow, but queries could get pretty slow if the amount of indexed
>>>>data is really high.
>>>>I'd love to hear how it works out for you.  Your use-case would benefit a
>>>>lot from an improved prefix tree implementation.
>>>>
>>>>I don't gather how a 3rd dimension would play into this.  Support for
>>>>multi-dimensional spatial is on the drawing board.
>>>>
>>>>~ David
>>>>
>>>>
>>>>Kevin Stone wrote
>>>>> What are the dangers of trying to use a range of 10 billion? Simply a
>>>>> slower index time? Or will I get inaccurate results?
>>>>> I have tried it on a very small sample of documents, and it seemed to
>>>>> work. I could spend some time this week trying to get a more robust (and
>>>>> accurate) dataset loaded to play around with. The reason for the 10
>>>>> billion is to support being able to query for a region on a chromosome.
>>>>> 
>>>>> A user might want to know what genes overlap a point on a specific
>>>>> chromosome. Unless I can use 3 dimensional coordinates (which gave an
>>>>> error when I tried it), I'll need to multiply the coordinates by some
>>>>> offset for each chromosome to be able to normalise the data (at both index
>>>>> and query time). The largest chromosome (chr 1) has almost 250,000,000
>>>>> base pairs. I could probably squeeze the rest a bit smaller, but I'd
>>>>> rather use one size for all chromosomes, since we have more than just
>>>>> human data to deal with. It would get quite messy otherwise.
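
One way to read the per-chromosome normalisation described above is to give every chromosome a fixed-size slot on a single number line and shift each position into its slot. The Python sketch below makes several assumptions that are not in the thread: the names CHROMOSOME_SPAN, to_global and to_global_region are hypothetical, chromosomes are numbered from zero, the slot size is an illustrative 250,000,000, and negative (reverse-strand) coordinates are not handled.

```python
CHROMOSOME_SPAN = 250_000_000  # hypothetical slot size, one size for all chromosomes


def to_global(chromosome_index, position, span=CHROMOSOME_SPAN):
    """Shift a position on one chromosome into that chromosome's slot."""
    if not 0 <= position < span:
        raise ValueError("position does not fit in one chromosome slot")
    return chromosome_index * span + position


def to_global_region(chromosome_index, start, end, span=CHROMOSOME_SPAN):
    """Apply the same shift to both ends of a region."""
    return (to_global(chromosome_index, start, span),
            to_global(chromosome_index, end, span))


# The same local region lands in different slots for different chromosomes.
print(to_global_region(0, 10000, 16090))
print(to_global_region(2, 10000, 16090))
```

The same function would have to be applied at both index and query time, so that a query point and the indexed regions share one coordinate system.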
>>>>> 
>>>>> 
>>>>> On 7/22/13 11:50 AM, "David Smiley (@MITRE.org)" <DSMILEY@> wrote:
>>>>> 
>>>>>>Like Hoss said, you're going to have to solve this using
>>>>>>http://wiki.apache.org/solr/SpatialForTimeDurations
>>>>>>Using PointType is *not* going to work because your durations are
>>>>>>multi-valued per document.
>>>>>>
>>>>>>It would be useful to create a custom field type that wraps the capability
>>>>>>outlined on the wiki to make it easier to use without requiring the user to
>>>>>>think spatially.
>>>>>>
>>>>>>You mentioned that these numeric ranges extend upwards of 10 billion or so.
>>>>>>Unfortunately, the current "prefix tree" implementation under the hood for
>>>>>>non-geodetic spatial, the QuadTree, is unlikely to scale to numbers that
>>>>>>big.  I don't know where the boundary is, but I doubt 10B.  You could try
>>>>>>and see what happens.  I'm working (very slowly on very little spare time)
>>>>>>on improving the PrefixTree implementations to scale to such large numbers;
>>>>>>I hope something will be available this fall.
>>>>>>
>>>>>>~ David Smiley
>>>>>>
>>>>>>
>>>>>>Kevin Stone wrote
>>>>>>> I have a particular use case that I think might require a custom field
>>>>>>> type, however I am having trouble getting the plugin to work.
>>>>>>> My use case has to do with genetics data, and we are running into several
>>>>>>> situations where we need to be able to query multiple regions of a
>>>>>>> chromosome (or gene, or other object types). All that really boils down to
>>>>>>> is being able to give a number, e.g. 10234, and return documents that have
>>>>>>> regions containing the number. So you'd have a document with a list like
>>>>>>> ["10000:16090","400:8000","40123:43564"], and it should come back because
>>>>>>> 10234 falls within "10000:16090". If there is a better or easier way to
>>>>>>> do this please speak up. I'd rather not have to use a "join" on another
>>>>>>> index, because 1) it's more complex to set up, and 2) we might need to
>>>>>>> join against something else and you can only do one join at a time.
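
The matching rule described above can be written down as a small brute-force check in Python, which may be handy as a reference when verifying search results on test data. It is only a sketch of the intended semantics, not a Solr field type; the helper names parse_regions and contains_point are hypothetical, and region bounds are assumed to be inclusive.

```python
def parse_regions(regions):
    """Turn ["10000:16090", ...] into [(10000, 16090), ...]."""
    return [tuple(int(part) for part in region.split(":")) for region in regions]


def contains_point(regions, point):
    """True when any region of the document contains the point (inclusive)."""
    return any(start <= point <= end for start, end in parse_regions(regions))


doc = ["10000:16090", "400:8000", "40123:43564"]
print(contains_point(doc, 10234))  # True: 10234 lies within 10000:16090
print(contains_point(doc, 9000))   # False: no region covers 9000
```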
>>>>>>>
>>>>>>> Anyway... I tried creating a field type similar to a PointType just to see
>>>>>>> if I could get one working. I added the following jars to get it to
>>>>>>> compile:
>>>>>>>
>>>>>>> apache-solr-core-4.0.0,lucene-core-4.0.0,lucene-queries-4.0.0,apache-solr-solrj-4.0.0.
>>>>>>> I am running solr 4.0.0 on jetty, and put my jar file in a sharedLib
>>>>>>> folder, and specified it in my solr.xml (I have multiple cores).
>>>>>>>
>>>>>>> After starting up solr, I got the line that it picked up the jar:
>>>>>>> INFO: Adding 'file:/blah/blah/lib/CustomPlugins.jar' to classloader
>>>>>>>
>>>>>>> But I get this error about it not being able to find the
>>>>>>> AbstractSubTypeFieldType class.
>>>>>>> Here is the first bit of the trace:
>>>>>>>
>>>>>>> SEVERE: null:java.lang.NoClassDefFoundError:
>>>>>>> org/apache/solr/schema/AbstractSubTypeFieldType
>>>>>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>>>>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
>>>>>>> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>>>>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>>>>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>>>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>>>>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>>>>> ...etc...
>>>>>>>
>>>>>>>
>>>>>>> Any hints as to what I did wrong? I can provide source code, or a fuller
>>>>>>> stack trace, config settings, etc.
>>>>>>>
>>>>>>> Also, I did try to unpack the solr.war, stick my jar in WEB-INF/lib, then
>>>>>>> repack. However, when I did that, I get a NoClassDefFoundError for my
>>>>>>> plugin itself.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kevin
>>>>>>>
>>>>>>> The information in this email, including attachments, may be confidential
>>>>>>> and is intended solely for the addressee(s). If you believe you received
>>>>>>> this email by mistake, please notify the sender by return email as soon as
>>>>>>> possible.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-----
>>>>>> Author:
>>>>>>http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>>>>>>--
>>>>>>View this message in context:
>>>>>>http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079494.html
>>>>>>Sent from the Solr - User mailing list archive at Nabble.com.
>>>>> 
>>>>> 
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>-----
>>>> Author: 
>>>>http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>>>>--
>>>>View this message in context:
>>>>http://lucene.472066.n3.nabble.com/custom-field-type-plugin-tp4079086p4079822.html
>>>>Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>
>
