gui for solr index
Is there a standard solution in Apache Solr (from trunk) for the following: a GUI to view the Solr index?
Re: how to index data in solr from database automatically
How about having a delta-import and a cron job to trigger the post? -- Anshum Gupta http://ai-cafe.blogspot.com On Fri, Jun 24, 2011 at 11:13 AM, Romi romijain3...@gmail.com wrote: I have a MySQL database for my application. I implemented Solr search and used the DataImportHandler (DIH) to index data from the database into Solr. My question is: is there any way that, when the database gets updated, my Solr indexes automatically get updated with the new data added to the database? That is, I should not have to run the indexing process manually every time the database tables change. If yes, please tell me how I can achieve this. - Thanks Regards Romi
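A minimal sketch of the suggested setup, assuming DIH is registered at /dataimport and a deltaQuery is defined in data-config.xml (the host, core path, and schedule here are illustrative):

    # crontab entry: trigger a DIH delta-import every 10 minutes
    */10 * * * * curl -s "http://localhost:8983/solr/dataimport?command=delta-import&clean=false" > /dev/null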
Re: gui for solr index
Please use Luke, the Lucene index toolbox: www.getopt.org/luke/ On 6/24/2011 1:29 PM, Алексей Цой wrote: is there a standard solution in Apache Solr (from trunk) for the following: a GUI to view the Solr index?
Re: Re: DIH Scheduling
On Thu, Jun 23, 2011 at 9:13 PM, simon mtnes...@gmail.com wrote: The Wiki page describes a design for a scheduler, which has not been committed to Solr yet (I checked). I did see a patch the other day (see https://issues.apache.org/jira/browse/SOLR-2305) but it didn't look well tested. I think that you're basically stuck with something like cron at this time. If your application is written in Java, take a look at the Quartz scheduler - http://www.quartz-scheduler.org/ It was considered and decided against. -Simon -- Noble Paul
Re: how to index data in solr from database automatically
Yeah, I am using data-import to get data from the database and indexing it, but what is cron? Can you please provide a link for it? - Thanks Regards Romi
Re: Updating the data-config file
Ahh! That's interesting! I understand what you mean. Since RSS and Atom feeds have the same structure, parsing them would be the same, but I can do the same for each of the different URLs. These URLs can be obtained from a db, a file or through the request parameters, right? Exactly. You can register multiple dataSources with different names, and then in each entity you can select the appropriate data source with the dataSource=... attribute. For a db, data-config.xml would be something like:

    <dataSource type="HttpDataSource" name="http"/>
    <dataSource type="JdbcDataSource" name="db" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb" batchSize="-1"/>
    <entity name="urls" dataSource="db" query="SELECT url FROM urls">
      <entity name="slashdot" dataSource="http" pk="link" url="${urls.url}"
              processor="XPathEntityProcessor" forEach="/RDF/channel | /RDF/item"
              transformer="DateFormatTransformer"
Re: how to index data in solr from database automatically
Cron is a time-based job scheduler in Unix-like computer operating systems. en.wikipedia.org/wiki/Cron Pranav Prakash - temet nosce - Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Fri, Jun 24, 2011 at 12:26, Romi romijain3...@gmail.com wrote: Yeah, I am using data-import to get data from the database and indexing it, but what is cron? Can you please provide a link for it? - Thanks Regards Romi
Re: Query time noun, verb boosting
2011/6/23 Anshum ansh...@gmail.com Pooja, You could use UIMA (or any other) Parts of Speech tagger. You can read a little more about it here: http://uima.apache.org/downloads/sandbox/hmmTaggerUsersGuide/hmmTaggerUsersGuide.html#sandbox.tagger.annotatorDescriptor This would help you annotate and segregate nouns from verbs in the input. You could then aptly form the query. Perhaps this would take some effort, but I'm assuming it'd work reasonably well. I've done this recently using the UIMA POS tagger and other annotators within a TokenFilter to add a TypeAttribute and PayloadAttribute to each token, and eventually filter/boost when searching. Regards, Tommaso -- Anshum Gupta http://ai-cafe.blogspot.com On Thu, Jun 23, 2011 at 11:18 AM, Pooja Verlani pooja.verl...@gmail.com wrote: Hi, Say for example, with a query like "manmohan singh dancing", I prefer to make a compulsory condition on nouns to be searched, but any verb isn't important to me: I prefer to extract results for "manmohan singh" and not for "dancing". If I can extract the noun/verb, or can get to know that in my index I have a concept of "manmohan singh" (or an identity, if not a concept), I would like to define rules for doing a strict (compulsory) match of the noun (concept) and a loose match (non-compulsory boosting) for the verb. Basically, I want to avoid getting zero results from a compulsory match of all 3 tokens of the query (in this case "manmohan singh dancing"); instead I want to do a compulsory match on "manmohan singh", since that exists in my index, while "dancing" shouldn't be a compulsory match, so that I get a non-zero number of results. Hope this explains. Any suggestions? Regards, Pooja On Thu, Jun 23, 2011 at 11:07 AM, Anshum ansh...@gmail.com wrote: What would you mean by 'noun or some concept'? It would be better if you could give a rather concrete example. For detecting parts of speech you could use a lot of libraries, but I didn't get the part about boosting terms from the index. -- Anshum Gupta http://ai-cafe.blogspot.com On Thu, Jun 23, 2011 at 11:02 AM, Pooja Verlani pooja.verl...@gmail.com wrote: Hi, At query time, I want to build the Lucene query such that it boosts only the noun from the query, or some concept existing in the index. Are there any possibilities or any possible ideas that can be worked around? Regards, Pooja
multicore and replication cause OOM
Hi, I have a Solr instance with 7 cores (~150MB each). All cores replicate at the same time from a Solr master instance. Every time the replication happens I get an OOM after experiencing long response times. This Solr used to have 4 cores before, and I never got an OOM with that configuration (replication occurs on a daily basis). My questions are: could the 3 new cores be the cause of the OOM? Does Solr require considerable extra heap for performing the replication? Should I avoid replicating all the cores at the same time? I'm using Solr 1.4 with the following memory configuration: -Xms512m -Xmx512m -XX:NewSize=128M -XX:MaxNewSize=128M Appreciate any help. Regards, Esteban
Re: Understanding query explain information
Is it possible that synonyms are being added (synonym expansion), or at least changing the field length? I've seen this before. Check exactly what terms have been added. On 23 June 2011 22:50, Alexander Ramos Jardim alexander.ramos.jar...@gmail.com wrote: Yes, I am using synonyms at index time. 2011/6/22 lee carroll lee.a.carr...@googlemail.com Hi, are you using synonyms? On 22 June 2011 10:30, Alexander Ramos Jardim alexander.ramos.jar...@gmail.com wrote: Hi guys, I have some doubts about how to correctly understand the debugQuery output. I have a field named itemName in my index. This is a text field, just that. When I query a simple ?q=itemName:iPad, I end up with the following query result. I am simply trying to understand why these strings generated such scores; as far as I can understand, the only difference between them is the field norms, as all the other factors stay the same. Now, how do I get these field norm values? Field norm is the result of this formula, right: 1/sqrt(terms), where terms is the number of terms in my field after it is indexed? Well, if this is true, the field norm for my first document should be 0.5 (1/sqrt(4)), as "Livro - IPAD - O Guia do Profissional" ends up with the tokens livro|ipad|guia|profissional. What am I forgetting to take into account?

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3</int>
        <lst name="params">
          <str name="debugQuery">on</str>
          <str name="start">0</str>
          <str name="rows">10</str>
          <arr name="indent"><str>on</str><str>on</str></arr>
          <str name="fl">itemName,score</str>
          <str name="version">2.2</str>
          <str name="q">itemName:ipad</str>
        </lst>
      </lst>
      <result name="response" numFound="161" start="0" maxScore="3.6808658">
        <doc><float name="score">3.6808658</float><str name="itemName">Livro - IPAD - O Guia do Profissional</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Leitor de Cartão para Ipad - Mobimax</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Sleeve para iPad</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Sleeve de Neoprene para iPad</str></doc>
        <doc><float name="score">3.1550279</float><str name="itemName">Carregador de parede para iPad</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case Envelope para iPad - Black - Built NY</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case Protetora p/ IPad de Silicone Duo - Browm - Iskin</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case Protetora p/ IPad de Silicone Duo - Clear - Iskin</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Case p/ iPad Sleeve - Black - Built NY</str></doc>
        <doc><float name="score">2.6291897</float><str name="itemName">Bolsa de Proteção p/ iPad Preta - Geonav</str></doc>
      </result>
      <lst name="debug">
        <str name="rawquerystring">itemName:ipad</str>
        <str name="querystring">itemName:ipad</str>
        <str name="parsedquery">itemName:ipad</str>
        <str name="parsedquery_toString">itemName:ipad</str>
        <lst name="explain">
          <str name="7369507">3.6808658 = (MATCH) fieldWeight(itemName:ipad in 102507), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.4375 = fieldNorm(field=itemName, doc=102507)</str>
          <str name="739">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226401), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226401)</str>
          <str name="7356941">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226409), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226409)</str>
          <str name="7356931">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226447), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226447)</str>
          <str name="7360321">3.1550279 = (MATCH) fieldWeight(itemName:ipad in 226583), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.375 = fieldNorm(field=itemName, doc=226583)</str>
          <str name="7428354">2.6291897 = (MATCH) fieldWeight(itemName:ipad in 223178), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.3125 = fieldNorm(field=itemName, doc=223178)</str>
          <str name="7366074">2.6291897 = (MATCH) fieldWeight(itemName:ipad in 223196), product of: 1.0 = tf(termFreq(itemName:ipad)=1) 8.413407 = idf(docFreq=165, maxDocs=275239) 0.3125 = fieldNorm(field=itemName, doc=223196)</str>
          <str name="7366068">2.6291897 = (MATCH) fieldWeight(itemName:ipad in 223831), product of: 1.0 = tf(termFreq(itemName:ipad)=1)
Re: Garbage Collection: I have given bad advice in the past!
If possible, can you please share some details of your setup, like the number of shards, how big they are size/doc_count wise, and what the user load per second is. On Fri, Jun 24, 2011 at 1:39 AM, Shawn Heisey s...@elyograg.org wrote: In the past I have told people on this list and in the IRC channel #solr what I use for Java GC settings. A couple of days ago, I cleaned up my testing methodology to more closely mimic real production queries, and discovered that my GC settings were woefully inadequate. Here's what I was using on a virtual machine with 9GB of RAM. I've been using this for several months, and chose it because I had read several things praising it. I should have done more research. -Xms512M -Xmx2048M -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode On my backup servers, I am in the process of getting 3.2.0 ready to replace our 1.4.1 index. I ran into a situation where committing a delta-import of only a few thousand records took longer than 3 minutes (the Perl LWP default timeout) on every shard, where normally in production on 1.4.1 it only takes a few seconds. This was shortly after I had hit the distributed index pretty hard with my improved benchmarking. Using jstat, I found that while under benchmarking load, the system was spending 10-15% of its time doing garbage collection, and that most of the garbage collections were from the young generation. First I tried increasing the young generation size with the -XX:NewSize=1024M parameter. This helped on the total GC count, but didn't really help with how much time was spent doing them. A good command to see these statistics on Linux, and an Oracle link explaining what it all means: jstat -gc -t `pgrep java` 5000 http://download.oracle.com/javase/6/docs/technotes/tools/share/jstat.html I've learned that Solr will keep most of its data in the young generation (eden), unless that memory pool is too small, in which case it will move data to the tenured generation. The key to good performance seems to be creating a large enough young generation. You do need to have a good chunk of tenured space available, unless the Solr instance has no index itself and exists only to distribute queries to shards living on other Solr instances; in that case, it hardly uses the tenured generation. It turns out that CMSIncrementalMode causes more young generation collections and makes them take longer, which is exactly what Solr does NOT need. After messing around with it for quite a while, I came up with the following settings, which included an increase in heap size: -Xms3072M -Xmx3072M -XX:NewSize=1536M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled With these settings, it spends very little time doing garbage collections. One of my shards has been up for nearly 24 hours, has been hit with the benchmarking script repeatedly, and it has only done 62 young generation collections and zero full collections, with 6.8 seconds total GC time. I am thinking of increasing NewSize yet again, because the tenured generation (1.5GB in size) is only one third utilized after nearly 24 hours. My settings will probably not work for everyone, but I hope this post will make it easier for others to find the right solution for themselves. Thanks, Shawn -- Regards, Dmitry Kan
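For reference, a minimal sketch of applying those final settings when starting Solr under the bundled Jetty (the start.jar invocation is illustrative; adjust for your container):

    java -Xms3072M -Xmx3072M -XX:NewSize=1536M \
         -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
         -jar start.jar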
Re: testing subscription.
passed. On Thu, Jun 23, 2011 at 10:38 PM, Esteban Donato esteban.don...@gmail.com wrote: -- Regards, Dmitry Kan
Question about optimization
Hi, I saw this in the Solr wiki: "An un-optimized index is going to be *at least* 10% slower for un-cached queries." Is this still true? I read somewhere that recent versions of Lucene are less sensitive to un-optimized indexes than older ones... With 50,000 new (or updated) documents coming into my index every day, would a once-a-day optimization be sufficient? Thanks in advance, Marc.
Re: multicore and replication cause OOM
On Fri, Jun 24, 2011 at 1:41 PM, Esteban Donato esteban.don...@gmail.com wrote: I have a Solr instance with 7 cores (~150MB each). All cores replicate at the same time from a Solr master instance. Every time the replication happens I get an OOM after experiencing long response times. This Solr used to have 4 cores before, and I never got an OOM with that configuration (replication occurs on a daily basis). My question is: could the 3 new cores be the cause of the OOM? Does Solr require considerable extra heap for performing the replication? Yes and no. Replication itself does not consume a lot of heap (I'd guess a couple of MB per ongoing replication). However, when the searchers are re-opened on the newly installed index, auto-warming can cause memory usage to double for a core. Should I avoid replicating all the cores at the same time? You should try that, especially if you are so constrained for heap space. I'm using Solr 1.4 with the following mem configuration: -Xms512m -Xmx512m -XX:NewSize=128M -XX:MaxNewSize=128M That seems to be a small amount of RAM for indexing/querying seven 150MB indexes in parallel. -- Regards, Shalin Shekhar Mangar.
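A sketch of one way to stagger the replication, assuming automatic polling is disabled on the slave (core names and times are illustrative; fetchindex is a standard replication handler command):

    # crontab on the slave: pull each core's index at offset times instead of all at once
    0  3 * * * curl -s "http://localhost:8983/solr/core1/replication?command=fetchindex"
    20 3 * * * curl -s "http://localhost:8983/solr/core2/replication?command=fetchindex"
    40 3 * * * curl -s "http://localhost:8983/solr/core3/replication?command=fetchindex"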
Query may only contain [a-z][0-9]
Hello, Is it possible to configure Solr so that only numbers and letters are accepted ([a-z][0-9])? When a user enters a term like + or - I get some Solr errors. How can I exclude these characters?
Re: Weird issue with solr and jconsole/jmx
I just encountered the same bug - JMX-registered beans don't survive Solr core reloads. I believe the reason is that when you do a core reload: * when the new core is created, it overwrites/over-registers beans in the registry (in the MBeanServer) * when the new core is ready, in the core register phase, CoreContainer closes the old core, which results in unregistering its JMX beans. As a result, after a core reload there is only one bean left in the registry, id=org.apache.solr.search.SolrIndexSearcher,type=Searcher@33099cc main, because this is the only new (dynamically named) bean that is created by the new core and not un-registered in oldCore.close. I'll try to reproduce that in a test and file a bug in Jira. On Tue, Mar 16, 2010 at 4:25 AM, Andrew Greenburg agreenb...@gmail.com wrote: On Tue, Mar 9, 2010 at 7:44 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I connected to one of my solr instances with Jconsole today and : noticed that most of the mbeans under the solr hierarchy are missing. : The only thing there was a Searcher, which I had no trouble seeing : attributes for, but the rest of the statistics beans were missing. : They all show up just fine on the stats.jsp page. : : In the past this always worked fine. I did have the core reload due to : config file changes this morning. Could that have caused this? possibly... reloading the core actually causes a whole new SolrCore object (with its own registry of SolrInfoMBeans) to be created and then swapped in place of the previous core ... so perhaps you are still looking at the stats of the old core which is no longer in use (and hasn't been garbage collected because the JMX manager still had a reference to it for you? ... I'm guessing at this point). did disconnecting from jconsole and reconnecting show you the correct stats? Disconnecting and reconnecting didn't help. The queryCache and documentCache and some others started showing up after I did a commit and opened a new searcher, but the whole tree never did fill in. I'm guessing that the request handler stats stayed associated with the old, no-longer-visible core in JMX, since new instances weren't created when the core reloaded. Does that make sense? The stats on the web stats page continued to be fresh.
Re: how to index data in solr from database automatically
Would you please tell me how I can use cron to auto-index my database tables in Solr? - Thanks Regards Romi
Re: Query may only contain [a-z][0-9]
Probably the best place to do this is in the application layer. Also, if the problem is with parsing errors, have you tried the dismax or edismax query parsers? On Fri, Jun 24, 2011 at 7:15 AM, roySolr royrutten1...@gmail.com wrote: Hello, Is it possible to configure Solr so that only numbers and letters are accepted ([a-z][0-9])? When a user enters a term like + or - I get some Solr errors. How can I exclude these characters?
Re: Query may only contain [a-z][0-9]
Yes, I use the dismax handler, but I will fix this in my application layer. Thanks
Advice wanted on approach/architecture
Hi List, I'm looking into some options on what technology to adopt for building a specific logfile search solution. At first glance it looks like Solr is the tool I'm looking for. I intend to write a web-based front end for end users. What would be a possible approach to tackle the following requirements? In other words, how could these requirements be translated into Solr on a high level? I'm not asking for solutions, just pointers, approaches, tips, Solr features to look at, possible pitfalls, ... - A query results in a set of results. - Individual records from this query should have the ability to be marked so that (although they match the query) those specific records no longer show up when the same query is rerun. - I don't want to delete data from the db/index. - I want to avoid having my application take care of excluding parts of the returned data by keeping track of which record ids to exclude. - A query should exclude the records which have a match in a possibly large, growing list of regexes. Thanks! Jelle
intersecting map extent with solr spatial documents
The following describes a Solr query filter that determines whether axis-aligned geographic bounding boxes intersect. It is used to determine which documents in a Solr repository containing spatial data are relevant to a given rectangular geographic region that corresponds to a displayed map. I haven't seen this described before. I thought it might be useful to others, and I might get some pointers on how to improve it. OpenGeoPortal (http://geoportal-demo.atech.tufts.edu/) is a web application supporting the rapid discovery of GIS layers. It uses Solr to combine spatial, keyword, date and GIS-datatype based searching. As a user manipulates its map, OpenGeoPortal automatically computes and displays relevant search results. This spatial searching requires the application to determine which GIS layers are relevant given the current extent of the map. Each Solr document includes spatial information about a single GIS layer. Specifically, it contains the center of the layer (in degrees latitude and longitude, stored as tdoubles) as well as the half-width and half-height of the layer (in degrees, stored as tdoubles). These values are precomputed from the bounding boxes of the layers during ingest. To identify relevant layers, our search algorithm looks for a separating axis (http://en.wikipedia.org/wiki/Separating_axis_theorem) between the current bounds of the map and the bounds of each layer. If a horizontal or vertical separating axis exists, then the layer does not contain any information in the geographic area defined by the map. If neither separating axis exists, then the layer intersects the map and is included in the result set. Identifying whether separating axes exist is relatively straightforward given two axis-aligned bounding boxes. In our case, one bounding box is defined by the map's current extent and the other bounding box by a GIS layer. To determine whether a vertical separating axis exists, one must determine whether the difference between the center longitude of the map and the center longitude of the layer is greater than the sum of the half-width of the map and the half-width of the layer. If so, a vertical separating axis exists. If not, a vertical separating axis does not exist. (See http://www.gamasutra.com/view/feature/3383/simple_intersection_tests_for_games.php?page=3 for a diagram.) Similarly, the presence of a horizontal separating axis can be computed using center latitudes and half-heights. It is possible to generate a Solr filter query that selects the layers that have neither a horizontal nor a vertical separating axis with respect to a specific map. Naturally, this query is somewhat complicated. The query essentially counts the number of separating axes and, using !frange, eliminates layers that have a separating axis. In the following example, the map was centered on latitude 42.3, longitude -71.0 and had a width and height of 0.3 degrees. The schema defines the fields CenterX, CenterY, HalfWidth and HalfHeight.

    fq={!frange l=1 u=2} map(sum(
      map(sub(abs(sub(-71.0,CenterX)),sum(0.3,HalfWidth)),0,360,1,0),
      map(sub(abs(sub(42.3,CenterY)),sum(0.3,HalfHeight)),0,90,1,0)),
      0,0,1,0)

The clauses that check for a separating axis (e.g., sub(abs(sub(-71.0,CenterX)),sum(0.3,HalfWidth)) and sub(abs(sub(42.3,CenterY)),sum(0.3,HalfHeight))) return a positive number if a separating axis exists. Using a map function, this value is mapped to 1 if the separating axis exists and 0 if it does not. The Solr query checks for both separating axes and computes the number of such axes using sum.
The total number of separating axes (which is 0, 1 or 2) is then mapped to the values 0 and 1. This final map returns 1 if there are no separating axes (that is, the bounding boxes intersect) or 0 if there is at least one separating axis (that is, the bounding boxes do not intersect). The outermost clause applies frange to eliminate those layers that do not intersect the current map. Ranking the layers that intersect the map is a separate issue. This is done with several query clauses. One clause determines how the area of the map compares to the area of the layer. The other determines how the center of the map compares to the center of the layer. These clauses are used in conjunction with keyword-based queries and date-based filters to create search results based on spatial, keyword and temporal constraints.
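To make the arithmetic concrete, here is a hypothetical worked example (the layer values are invented for illustration): given the map above (center longitude -71.0, half-width 0.3), a layer with CenterX=-70.5 and HalfWidth=0.1 yields abs(sub(-71.0,CenterX)) = 0.5, which exceeds sum(0.3,HalfWidth) = 0.4; the subtraction is positive, the inner map yields 1, a vertical separating axis exists, and the layer is filtered out. A layer with CenterX=-70.9 instead yields a difference of 0.1, the subtraction is negative, the map yields 0, and no vertical separating axis exists on that dimension.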
Re: how to index data in solr from database automatically
First write a script in Python (or Java or PHP or any language) which reads the data from the database and indexes it into Solr. Then set up this script as a cron job to run automatically at a certain interval. On 24 June 2011 17:23, Romi romijain3...@gmail.com wrote: Would you please tell me how I can use cron to auto-index my database tables in Solr? - Thanks Regards Romi -- Thanks and Regards Mohammad Shariq
Re: Query may only contain [a-z][0-9]
I think another alternative is to use a phrase query and then a PatternReplaceFilterFactory at query time to remove the unwanted characters. I don't know if phrase query behavior meets your requirements, though. On Fri, Jun 24, 2011 at 9:39 AM, roySolr royrutten1...@gmail.com wrote: Yes, I use the dismax handler, but I will fix this in my application layer. Thanks
Re: how to index data in solr from database automatically
Why don't you use the DataImportHandler? We use DIH; we have a wget-based bash script that is run by cron about every 2 minutes, and DIH is called in delta-import mode. The bash script works this way: 1) first call wget on the DIH status: /dataimport?command=status 2) analyze the wget DIH status 2.1) if the status is *busy*, do nothing and exit (because DIH is already running) 2.2) if the status is *idle*, do /dataimport?command=delta-import&clean=false 3) exit On 24/06/11 15:20, Mohammad Shariq wrote: First write a script in Python (or Java or PHP or any language) which reads the data from the database and indexes it into Solr. Then set up this script as a cron job to run automatically at a certain interval. -- Renato Eschini Inera srl http://www.inera.it
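A minimal sketch of such a script (the host and core path are illustrative; the grep test assumes the stock DIH status XML, which reports the status as idle or busy):

    #!/bin/bash
    # run from cron every ~2 minutes: start a delta-import only if DIH is idle
    DIH="http://localhost:8983/solr/dataimport"
    if wget -q -O - "$DIH?command=status" | grep -q '>idle<'; then
        wget -q -O /dev/null "$DIH?command=delta-import&clean=false"
    fi
    # if the status is busy, fall through and exit without doing anything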
Re: Query may only contain [a-z][0-9]
You should escape those characters: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters On 6/24/11 3:15 AM, roySolr wrote: Hello, Is it possible to configure Solr so that only numbers and letters are accepted ([a-z][0-9])? When a user enters a term like + or - I get some Solr errors. How can I exclude these characters?
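For example, following that page, a user query for 1+1 would be sent to Solr as 1\+1, with a backslash before each special character.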
Query Results Differ
Hi, I am trying to understand why the two queries return different results. To me they look similar; can someone help me understand the difference in the results? Query 1: facet=true&q=time&fq=supplierid:1001&start=0&rows=10&sort=published_on desc Query 2: facet=true&q=time&fq=supplierid:1001+published_on:[* TO NOW]&start=0&rows=10&sort=published_on desc The first query returns only 44 rows while the second one returns 200,000 rows. When I don't have the filter for published_on, I am assuming that Solr should return all the results with supplier id 1001, so Query 1 should have returned more results (or at least the same number of results) than the second query. Thanks.
Re: Query Results Differ
+ is an urlencoded whitespace, so your filter query says either supplierid or published_on. What you could do is: 1) use a second fq= param 2) combine them both into one like this: fq=foo+%2Bbar (%2B is an urlencoded + character) HTH, Regards Stefan On Fri, Jun 24, 2011 at 4:27 PM, jyn7 jyotsna.namb...@gmail.com wrote: Hi, I am trying to understand why the two queries return different results. To me they look similar; can someone help me understand the difference in the results? Query 1: facet=true&q=time&fq=supplierid:1001&start=0&rows=10&sort=published_on desc Query 2: facet=true&q=time&fq=supplierid:1001+published_on:[* TO NOW]&start=0&rows=10&sort=published_on desc The first query returns only 44 rows while the second one returns 200,000 rows. When I don't have the filter for published_on, I am assuming that Solr should return all the results with supplier id 1001, so Query 1 should have returned more results (or at least the same number of results) than the second query. Thanks.
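Applied to the queries above, the two corrected forms would look like this (a sketch of Stefan's two options; in a real request the spaces in the range would also be urlencoded):

    fq=supplierid:1001&fq=published_on:[* TO NOW]          (two fq params, implicitly ANDed)
    fq=%2Bsupplierid:1001+%2Bpublished_on:[* TO NOW]       (one fq requiring both clauses)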
Re: Garbage Collection: I have given bad advice in the past!
On 6/24/2011 2:19 AM, Dmitry Kan wrote: If possible, can you please share some details of your setup, like the number of shards, how big they are size/doc_count wise, and what the user load per second is. Each full chain (there are two) consists of two servers with 2 quad-core processors and 32GB of RAM. There are 9 VMs contained on those two servers. Six of them house large shards (9GB of RAM each) with about 9.5 million rows each, taking up about 17.5GB of disk space. One of them houses a small shard (3GB RAM) that contains the newest data, usually about 1GB and 400,000 rows. There is a VM (512MB) for running haproxy and a VM (3GB) with a Solr instance that serves as a broker - no index, one core with the shards parameter in solrconfig.xml. The small shard is updated every two minutes. Every ten minutes, deletes are run against all shards. Once an hour, the small shard is optimized. Once a night, data older than 7 days is distributed among the large shards, deleted from the small shard, and one large shard is optimized. Normally data is replicated between the two chains, but right now the primary chain is running 1.4.1 and the backup chain is running 3.2.0. According to Solr stats, the average queries per second in production is well below 1. I don't know what it is during the day when it peaks ... but it's certainly not very large. We do maintain statistics on every search in a database; I just haven't worked out yet how to turn that into usable numbers. The usual statistical functions don't seem to be enough, so I'll probably have to write something myself. If anyone knows an easy way to turn a series of timestamps and QTimes into per-second statistics on arbitrary timeframes (hourly, daily, a 10 second span, etc.), I'm all ears. On my newly tuned 3.2.0 index, I can get near 100 queries per second if I run the benchmarking script a few times in a row. It uses 8 threads, each pounding out 1024 queries as fast as they can. Running it against the old index with the old GC settings, I can only get about 25 queries per second. Both of these numbers are well above what I really need. If I ever need more performance, I can increase the system memory so more of the index fits into RAM, which would also let me increase the Java heap size. I actually hope one day to add servers, decrease the number of large shards, and run without virtualization ... but the funding just isn't there. Shawn
Re: Updating the data-config file
Thanks. I will look into this and see how it goes.
Do unused indexes affect performance?
Hi, As a proof of concept I have imported around ~11 million documents into a Solr index. My schema file has multiple fields defined:

    <dynamicField name="*_id" type="text" indexed="true" stored="true"/>
    <dynamicField name="*_start" type="tdate" indexed="true" stored="true"/>
    <dynamicField name="*_end" type="tdate" indexed="true" stored="true"/>
    <dynamicField name="*" type="string" indexed="true" stored="true"/>

The above are the most important for my question. The average document has around 40 attributes. Each document has:
* a minimum of 2 tdate fields (max of 10)
* a minimum of 2 *_id fields, each containing a space-delimited list of ids (i.e. 4de5656 q23ew9h)
The final dynamicField causes all fields within a document to be indexed. This was done firstly to show the flexibility of Solr and also because I did not know what fields we would use to query/filter on. The total size of my index is ~18GB. However... we now know the fields we will be querying on. I have 3 questions: 1) Do unused indexes on the same dynamicField affect Solr's performance? Our query will always be (type:book book_id:*). Will the presence of 4 million documents (type:location store_id:*) affect Solr's performance? The answer sounds obviously yes, but it may not be the case. 2) Do unused dynamicField indexes affect Solr's performance? All documents have an attribute version which is indexed as text yet never used in any queries. Does its existence (in 11 million documents) affect performance? 3) How does one improve query times against an index? Once an index is built, is there a method to optimise the query analyzers, or a method of removing unused indexes without rebuilding the entire index? The latter is a very important one. We want to replace the current schema with a more restrictive version. Most importantly,

    <dynamicField name="*" type="string" indexed="true" stored="true"/>

becomes

    <dynamicField name="*" type="string" indexed="false" stored="true"/>

But this change alone does not cause the index to shrink. It would be lovely if there was a method to re-analyze an index post-import. More than happy to be referred to related documentation. I have read and considered http://wiki.apache.org/solr/SolrPerformanceFactors and http://wiki.apache.org/lucene-java/ImproveSearchingSpeed but there may be some fluid knowledge held here which is undocumented. Thank you in advance for any answers.
Re: Query Results Differ
So if I use a second fq parameter, will Solr apply an AND on both the fq parameters? I have multiple indexed values, so when I search for q=time, does Solr return results with "time" in any of the indexed values? Sorry for the silly questions.
Re: Query Results Differ
On Fri, Jun 24, 2011 at 5:11 PM, jyn7 jyotsna.namb...@gmail.com wrote: So if I use a second fq parameter, will SOLR apply an AND on both the fq parameters? Yes :) On Fri, Jun 24, 2011 at 5:11 PM, jyn7 jyotsna.namb...@gmail.com wrote: I have multiple indexed values, so when I search for q=time, does SOLR return results with Time in any of the indexed values ? Sorry for the silly questions No. Read here http://wiki.apache.org/solr/SchemaXml#The_Default_Search_Field and afterwards here http://wiki.apache.org/solr/SchemaXml#Copy_Fields Regards Stefan
Call indexer after action on website
People can add advertisements on my website. What I do now is run a scheduled task on my Windows server every night at 3AM. But I want to do a delta-import as soon as the user saves a new advertisement on my website. Now, from the server, doing the delta-import is as easy as calling: http://localhost:8983/solr/dataimport?command=delta-import But, as you can see, that is on localhost, which I can't call from my frontend website. How can I do a delta-import after a visitor action on the front end?
Re: Call indexer after action on website
On 6/24/2011 10:55 AM, PeterKerk wrote: Now, from the server doing the delta import is as easy as calling: http://localhost:8983/solr/dataimport?command=delta-import But, as you can see, that is from localhost, which I cant call from my frontend website. How can I do a delta import after a visitor action on the front? The port number 8983 suggests that you are using the included Jetty. Unless you have taken steps in the container configuration to lock it down so only localhost has access, it should be accessible from anywhere that can reach it, so your application code running on the webserver can just request the following URL, which it could even do with the IP address instead of my example hostname: http://host.example.com:8983/solr/dataimport?command=delta-import It can even check for an error or success status using a similar URL: http://host.example.com:8983/solr/dataimport Thanks, Shawn
Re: multiple spatial values
Yonik Seeley-2-2 wrote: On Tue, Sep 21, 2010 at 12:12 PM, dan sutton danbsut...@gmail.com wrote: I was looking at the LatLonType and how it might represent multiple lon/lat values ... it looks to me like the lat would go in {latlongfield}_0_LatLon and the long in {latlongfield}_1_LatLon ... how then, if we have multiple lat/long points for a doc, when filtering for example, do we choose the correct points? e.g. if thinking in cartesian coords and we have P1(3,4), P2(6,7) ... x is stored with 3,6 and y with 4,7 ... then how does it ensure we're not erroneously picking (3,7) or (6,4) whilst filtering with the spatial query? That's why it's a single-valued field only for now... don't we have to store both values together? what am i missing here? The problem is that we don't have a way to query both values together, so we must index them separately. The basic LatLonType uses numeric queries on the lat and lon fields separately. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8 I have in my index two different fields like you say, Yonik (location_1, location_2), but the problem is when I want to filter results that have d=50 for location_1 and d=50 for location_2. I really don't know how to build the query... For example, this works perfectly: q={!geofilt}&sfield=location_1&pt=36.62288966,-6.23211272&d=25 but how do I add the sfield location_2? I tried nested queries but it doesn't work. Is it possible to do from the URL?
Re: multiple spatial values
On Fri, Jun 24, 2011 at 2:11 PM, marthinal jm.rodriguez.ve...@gmail.com wrote: Yonik Seeley-2-2 wrote: On Tue, Sep 21, 2010 at 12:12 PM, dan sutton danbsut...@gmail.com wrote: I was looking at the LatLonType and how it might represent multiple lon/lat values ... it looks to me like the lat would go in {latlongfield}_0_LatLon and the long in {latlongfield}_1_LatLon ... how then, if we have multiple lat/long points for a doc, when filtering for example, do we choose the correct points? e.g. if thinking in cartesian coords and we have P1(3,4), P2(6,7) ... x is stored with 3,6 and y with 4,7 ... then how does it ensure we're not erroneously picking (3,7) or (6,4) whilst filtering with the spatial query? That's why it's a single-valued field only for now... don't we have to store both values together? what am i missing here? The problem is that we don't have a way to query both values together, so we must index them separately. The basic LatLonType uses numeric queries on the lat and lon fields separately. -Yonik I have in my index two different fields like you say, Yonik (location_1, location_2), but the problem is when I want to filter results that have d=50 for location_1 and d=50 for location_2. I really don't know how to build the query... For example, this works perfectly: q={!geofilt}&sfield=location_1&pt=36.62288966,-6.23211272&d=25 but how do I add the sfield location_2? I tried nested queries but it doesn't work. Is it possible to do from the URL? sfield, pt and d can all be specified directly in the spatial functions/filters too, and that will override the global params. Unfortunately one must currently use Lucene query syntax to do an OR. It just makes it look a bit messier. q=_query_:"{!geofilt}" _query_:"{!geofilt sfield=location_2}" -Yonik http://www.lucidimagination.com
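For instance, a sketch of the fully inlined form (reusing the point from the question and a d of 50; the parameter values are illustrative):

    q=_query_:"{!geofilt sfield=location_1 pt=36.62288966,-6.23211272 d=50}" _query_:"{!geofilt sfield=location_2 pt=36.62288966,-6.23211272 d=50}"

With the default OR operator this matches documents within d of either point; prefixing each clause with + would require both.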
RE: Garbage Collection: I have given bad advice in the past!
Hi Shawn, Thanks for sharing this information. I also found that in our use case, for some reason the default settings for the concurrent garbage collector seem to size the young generation way too small (At least for heap sizes of 1GB or larger.) Can you also let us know what version of the JVM you are using? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search
Re: Garbage Collection: I have given bad advice in the past!
On 6/24/2011 12:53 PM, Burton-West, Tom wrote: Thanks for sharing this information. I also found that in our use case, for some reason, the default settings for the concurrent garbage collector seem to size the young generation way too small (at least for heap sizes of 1GB or larger). Can you also let us know what version of the JVM you are using? Sure. This is running under CentOS 5.6, with the epel, rpmforge, and jpackage repositories added. [root@idxst0-b ~]# java -version java version 1.6.0_25 Java(TM) SE Runtime Environment (build 1.6.0_25-b06) Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode) I used java-1.6.0-sun-1.6.0.25-1.0.cf.nosrc.rpm to make the java RPMs, as explained here: http://www.city-fan.org/tips/SunJava6OnFedora Looks like I can go to 1.6.0.26 now. Thanks, Shawn P.S. Tom, thanks for all the good info on the HathiTrust blog.
Re: Query Results Differ
Thanks Stefan.
Re: intersecting map extent with solr spatial documents
Very cool! What you've essentially described is a way of indexing and searching lat-lon box shapes, and the cool thing is that you were able to do this without custom coding / hacking of Solr. Sweet! I do have some observations about this approach: 1. It doesn't support a variable number of shapes per document. (LatLonType doesn't either, by the way.) 2. The use of function queries on CenterX, CenterY, HalfWidth, and HalfHeight means that all these values (just the distinct ones) will be put into RAM in Lucene's FieldCache. Not a big deal, but something to be noted. 3. The function query is going to be evaluated on every document matching the keyword search. That will probably perform okay; I'm not so sure for large indexes with a *:* query. Again, nice job. Could you please share an example of your ranking query? ~ David Smiley - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
Reject URL requests unless from localhost for dataimport
Hi all, My Solr server is currently set up at www.mysite.com:8983/solr. I would like to keep this for the time being, but I would like to restrict users from going to www.mysite.com:8983/solr/dataimport. In that case, I would only want to be able to do localhost:8983/solr/dataimport. Is this possible? If so, where should I look for a guide? Thanks, Brian Lamb
Re: Reject URL requests unless from localhost for dataimport
Firewall? It's easy to set up and the most low-level option. You can also use a proxy, or perhaps manage it in your servlet container. Hi all, My Solr server is currently set up at www.mysite.com:8983/solr. I would like to keep this for the time being, but I would like to restrict users from going to www.mysite.com:8983/solr/dataimport. In that case, I would only want to be able to do localhost:8983/solr/dataimport. Is this possible? If so, where should I look for a guide? Thanks, Brian Lamb
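A sketch of the firewall route on Linux (the iptables rules are illustrative; note a firewall works at the port level, so this locks down all of Solr on 8983 from non-local clients - path-level filtering of just /dataimport needs a reverse proxy or a servlet-container security constraint):

    # allow port 8983 from localhost only, drop everything else
    iptables -A INPUT -p tcp --dport 8983 -s 127.0.0.1 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8983 -j DROP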
Solr integration with Oracle Coherence caching
Is it possible? If so, then how? Any steps would be good! By the way, I have Java versions of both available for integration; I just need to push the plug in!