Re: HTML Indexing error
On 18 April 2012 00:41, Chambeda chamb...@gmail.com wrote: Hi All, I am trying to parse some text that contains embedded HTML elements and am getting the following error: [...] According to the documentation the <br> should be removed correctly. Anything I am missing?

How are you indexing the XML documents? Using DIH? If so, please show us the DIH configuration file. Regards, Gora
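For context, index-time HTML stripping is normally configured with a char filter in schema.xml. A minimal sketch, where the type name and tokenizer are illustrative rather than the poster's actual schema:

<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- removes tags such as <br> before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

If the <br> survives into the index, the first thing to check is that the field's type actually includes such a char filter.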
Re: searching and text highlighting
rpc29y wrote: Good afternoon: I would like to know whether Word documents or PDFs can be indexed with Solr.

Yes; you may first look at the Tika processor in Solr.

rpc29y wrote: If so, how do I modify solrconfig.xml to search these documents and highlight the found text?

I suggest you first follow the Solr tutorial to learn more about it: how the query parsers work and how to define your schema. Then you can use highlighting the right way. http://wiki.apache.org/solr/HighlightingParameters
Solr Core not able to access latest data indexed by multiple servers
Hi, I am using Solr's multicore approach in my app. We have two different servers (ServerA1 and ServerA2) for load balancing; both servers access the same index repository, and a request can go to either server per the load-balancing algorithm. The problem occurs as follows (note that both servers access the same physical index location):

- An ADD TO INDEX request for File1 goes to ServerA1 for core CR1; core CR1 is loaded in ServerA1 and indexing is done.
- An ADD TO INDEX request for File2 goes to ServerA2 for core CR1; core CR1 is loaded in ServerA2 and indexing is done.
- A SEARCH request for File2 goes to ServerA1. Core CR1 is already loaded there, so it accesses the index directly, but File2, added by ServerA2, is not found in the core loaded by ServerA1.

So this is the problem: File2, indexed by core CR1 loaded in ServerA2, is not visible in core CR1 loaded by ServerA1. I have searched and found that one solution is to reload the core; after a reload it sees the latest indexed data. But reloading the core for every request is a very heavy and time-consuming process. Please let me know if anyone has a solution for this. Waiting for your expert advice. Thanks, Paresh
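For reference, the reload the poster describes is exposed through the CoreAdmin API; the host and port below are placeholders:

http://ServerA1:8080/solr/admin/cores?action=RELOAD&core=CR1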
Re: need help to integrate SolrJ with my web application.
Hi Vijaya, Why not just make standard HTTP calls to Solr, as if it were a RESTful service? Just use an HTTP/REST client in Spring, ask Solr to return JSON responses, and get rid of all those war dependencies of SolrJ. --- Marcelo

On Monday, April 16, 2012, Ben McCarthy ben.mccar...@tradermedia.co.uk wrote: Hello, When I have seen this it usually means the Solr you are trying to connect to is not available. Do you have it installed at http://localhost:8080/solr ? Try opening that address in your browser. If you're running the example Solr using the embedded Jetty you won't be on 8080 :D Hope that helps

-Original Message- From: Vijaya Kumar Tadavarthy [mailto:vijaya.tadavar...@ness.com] Sent: 16 April 2012 12:15 To: 'solr-user@lucene.apache.org' Subject: need help to integrate SolrJ with my web application.

Hi All, I am trying to integrate Solr with my Spring application. I have performed the following steps:

1) Added the below list of jars to my webapp lib folder:

apache-solr-cell-3.5.0.jar
apache-solr-core-3.5.0.jar
apache-solr-solrj-3.5.0.jar
commons-codec-1.5.jar
commons-httpclient-3.1.jar
lucene-analyzers-3.5.0.jar
lucene-core-3.5.0.jar

2) I have added Tika jar files for processing binary files:

tika-core-0.10.jar
tika-parsers-0.10.jar
pdfbox-1.6.0.jar
poi-3.8-beta4.jar
poi-ooxml-3.8-beta4.jar
poi-ooxml-schemas-3.8-beta4.jar
poi-scratchpad-3.8-beta4.jar

3) I have modified web.xml, adding the setup below:

<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>SolrRequestFilter</filter-name>
  <url-pattern>/dataimport</url-pattern>
</filter-mapping>
<servlet>
  <servlet-name>SolrServer</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
</servlet>
<servlet>
  <servlet-name>SolrUpdate</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrUpdateServlet</servlet-class>
  <load-on-startup>2</load-on-startup>
</servlet>
<servlet>
  <servlet-name>Logging</servlet-name>
  <servlet-class>org.apache.solr.servlet.LogLevelSelection</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>SolrUpdate</servlet-name>
  <url-pattern>/update/*</url-pattern>
</servlet-mapping>
<servlet-mapping>
  <servlet-name>Logging</servlet-name>
  <url-pattern>/admin/logging</url-pattern>
</servlet-mapping>

I am trying to test this setup by running a simple Java program which extracts the content of an MS Excel file, as below:

public SolrServer createNewSolrServer() {
  try {
    // setup the server...
    String url = "http://localhost:8080/solr";
    CommonsHttpSolrServer s = new CommonsHttpSolrServer(url);
    s.setConnectionTimeout(100); // 1/10th sec
    s.setDefaultMaxConnectionsPerHost(100);
    s.setMaxTotalConnections(100);
    // where the magic happens
    s.setParser(new BinaryResponseParser());
    s.setRequestWrit

--
Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786
DIH + JNDI
Hi All, I'm new to Solr and I don't have much experience in Java. I'm trying to set up two environments with configuration files that mirror each other, so that it's easy to copy files across after changes have been made. The problem is that they both access different SQL servers, so I want to separate the data source from the data-import.xml. I'm trying to do that with JNDI, following this doc: http://tomcat.apache.org/tomcat-6.0-doc/jndi-datasource-examples-howto.html

I put the datasource as a resource in my /etc/tomcat6/Catalina/localhost/solr.xml (Context):

<Resource name="jdbc/DATABASENAME" auth="Container" type="JdbcDataSource"
          driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
          url="jdbc:sqlserver://SQLSERVERNAME;databaseName=DATABASENAME;responseBuffering=adaptive;"
          user="USERNAME" password="PASSWORD" />

and the resource-ref in /var/lib/tomcat6/webapps/solr/WEB-INF/web.xml:

<resource-ref>
  <description>DB Connection</description>
  <res-ref-name>jdbc/DATABASENAME</res-ref-name>
  <res-type>JdbcDataSource</res-type>
  <res-auth>Container</res-auth>
</resource-ref>

Then I changed the data-config.xml to:

<dataSource jndiName="java:comp/env/jdbc/DATABASENAME" type="JdbcDataSource" user="" password=""/>

I restart the server, try to do a delta import, and get the following:

SEVERE: Delta Import Failed org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select 1 as report_id Processing Document # 1
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextModifiedRowKey(SqlEntityProcessor.java:84)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextModifiedRowKey(EntityProcessorWrapper.java:262)
    at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:893)
    at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:285)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:179)
    at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:390)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:429)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: javax.naming.NamingException: Cannot create resource instance
    at org.apache.naming.factory.ResourceFactory.getObjectInstance(ResourceFactory.java:143)
    at javax.naming.spi.NamingManager.getObjectInstance(NamingManager.java:321)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:793)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:140)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:781)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:140)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:781)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:140)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:781)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:153)
    at org.apache.naming.SelectorContext.lookup(SelectorContext.java:152)
    at javax.naming.InitialContext.lookup(InitialContext.java:409)
    at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:140)
    at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:128)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:363)
    at org.apache.solr.handler.dataimport.JdbcDataSource.access$200(JdbcDataSource.java:39)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:240)
    ... 11 more

I've tried a couple of different alterations; I've only really succeeded in changing the error I get. Anyone know how to fix this issue? I'm kind of lost here. Stephen
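One likely culprit, judging by "Cannot create resource instance" coming from Tomcat's naming layer rather than from Solr: in a Tomcat <Resource>, the type attribute must name the Java type of the resource (normally javax.sql.DataSource, not DIH's JdbcDataSource), and the driver attribute is spelled driverClassName. A sketch of the declaration following the Tomcat 6 JNDI how-to, with credentials and names kept as the poster's placeholders:

<Resource name="jdbc/DATABASENAME" auth="Container"
          type="javax.sql.DataSource"
          driverClassName="com.microsoft.sqlserver.jdbc.SQLServerDriver"
          url="jdbc:sqlserver://SQLSERVERNAME;databaseName=DATABASENAME;responseBuffering=adaptive;"
          username="USERNAME" password="PASSWORD"
          maxActive="8" maxIdle="4"/>

The <res-type> in web.xml would then also be javax.sql.DataSource; the type="JdbcDataSource" attribute belongs only on the <dataSource> element in data-config.xml.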
property substitution not working with multicore
Hi, I cannot seem to get the configuration right for using a properties file for cores (with 3.6.0). In the Solr 3 Enterprise Search Server book they say this: "This property substitution works in solr.xml, solrconfig.xml, schema.xml, and DIH configuration files." So my solr.xml is like this:

<cores adminPath="/admin/cores">
  <core name="core0" instanceDir="core0" dataDir="${config.datadir:/tmp/solr_data}" properties="core0.properties"/>
</cores>

core0.properties is in multicore/core0 (I tried with an absolute path too, but that does not work either). And my properties file has:

config.datadir=c:\\tmp\\core0\\data
config.db-data.jdbcUrl=jdbc:mysql:localhost\\...
config.db-data.username=root
config.db-data.password=

None of those values are taken into account. I think I read in JIRA that DIH does not support properties, but as they say in the book that it does, I just tried. The path to the data dir should work, right? But not even that one; I always get the index in ./tmp/solr_data. Any hints? xab
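One arrangement to try, offered only as a sketch since I cannot confirm at what point 3.6.0 makes per-core properties visible to solr.xml itself: keep core0.properties in the core0 instance directory and reference the property from core0's own solrconfig.xml instead of from the dataDir attribute in solr.xml:

<!-- in multicore/core0/conf/solrconfig.xml -->
<dataDir>${config.datadir:/tmp/solr_data}</dataDir>

If that substitution works while the solr.xml one does not, it narrows the problem down to where in the load order the per-core properties file is read.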
Re: Populating a filter cache by means other than a query
I guess my question is: what advantage are you trying to get here? At the start, this feels like an XY problem. How are you intending to use the fq after you've built it? Because if there's any way to just create an fq clause, Solr will take care of it for you: caching it, autowarming it when searchers are re-opened, etc. Otherwise, you're going to be re-inventing a bunch of stuff, it seems to me; you'll have to intercept the queries coming in in order to apply the filter from the cache, etc.

Which may also be another way of asking: how big is this set of document IDs? If it's in the 100s, I'd just go with an fq. If it's more than that, I'd index some kind of set identifier that you could create for your fqs. And if this is gibberish, ignore me <G>.

Best Erick

On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins ch...@geekychris.com wrote: Hi, I am a long-time Lucene user but new to Solr. I would like to use something like the filterCache, but build such a cache not from a query but from custom code. I will ask my question using techniques and vocab I am familiar with; I'm not sure it's actually the right way, so I apologize if it's just the wrong approach. The scenario is that I would like to filter a result set by a set of labeled documents, which I will call set L. L contains app-specific document IDs that are indexed as literals in the Lucene field "myid". I imagine I could build an OpenBitSet by enumerating the term docs and looking for the intersecting IDs in my label set. Then I have my bitset that I assume I could use in a filter. Another approach would be to implement a hit collector, compute a field cache from that myid field, and look for the intersection in a hashtable of L at scoring time, throwing out results that are not contained in the hashtable. Of course I am working within the confines / concepts that Solr has laid out. Without going completely off the reservation, is there a neat way of doing such a thing with Solr? Glad to clarify if my question makes absolutely no sense. Best, C
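For concreteness, the "just go with an fq" route Erick describes would look like the following, with the filterCache doing the caching and autowarming; the document IDs here are hypothetical:

q=*:*&fq=myid:(id1 OR id2 OR id3)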
Re: How solrcloud distribute data among shards of the same cluster?
Try looking at DistributedUpdateProcessor; there's a hash(cmd) method in there. Best Erick

On Tue, Apr 17, 2012 at 4:45 PM, emma1023 smile.emma1...@gmail.com wrote: Thanks for your reply. In Solr 3.x, we need to manually hash the doc ID to the server. How does solrcloud do this instead? I am working on a project using solrcloud, but we need to monitor how solrcloud distributes the data. I cannot find which part of the source code this is in. Is it in the cloud part? Thanks.

On Tue, Apr 17, 2012 at 3:16 PM, Mark Miller-3 [via Lucene] ml-node+s472066n3918192...@n3.nabble.com wrote: On Apr 17, 2012, at 9:56 AM, emma1023 wrote: It hashes the id. The doc distribution is fairly even - but sizes may be fairly different.

How does solrcloud manage distributing data among shards of the same cluster when you query? Does it distribute the data equally? What is the basis? In which part of the code can I find this? Thank you so much!

- Mark Miller lucidimagination.com
Solr hanging
Hi, I am using Solr trunk and have 7 Solr instances running with 28 leaders and 28 replicas for a single collection. After indexing for a while (a couple of days) the Solr instances start hanging, and doing a thread dump on the JVM I see blocked threads like the following:

Thread 2369: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399 (Compiled frame)
 - java.util.concurrent.ExecutorCompletionService.take() @bci=4, line=164 (Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.checkResponses(boolean) @bci=27, line=350 (Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.finish() @bci=18, line=98 (Compiled frame)
 - org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish() @bci=4, line=299 (Compiled frame)
 - org.apache.solr.update.processor.DistributedUpdateProcessor.finish() @bci=1, line=817 (Compiled frame)
 ...
 - org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=582 (Interpreted frame)

I read the stack trace as: my indexing client has indexed a document, and this Solr is now waiting for the replica(?) to respond before returning an answer to the client. The other Solrs have similar blocked threads. Any ideas on how I can get closer to the problem? Am I reading the stack trace correctly? Is there any further information that would be relevant for commenting on this problem? Thanks for any comments. Best regards, Trym
Re: SOLR 4 / Date Query: Spurious Results: Is it me or ... ?
Your schema didn't come through, but...

1. Why terms=-1? I don't know. I have a build from this morning and it's fine. When's yours?

2. date vs. tdate. Yes, that's kind of confusing, but the Trie types inject some extra stuff in the field that allows the faster range queries; I think of it as navigation data. These get displayed as 1970 dates (i.e. the epoch). Ignore them.

3. I don't quite understand here. If you're still talking about a tdate field, could the navigation data account for it? That data shouldn't belong to any document and isn't really putting multi-values in any doc. Changing the schema type to not be multivalued should show whether this is the case.

Best Erick

On Tue, Apr 17, 2012 at 7:18 PM, vybe3142 vybe3...@gmail.com wrote: I wrote a custom handler that uses externally injected metadata (bypassing Tika et al.). WRT dates, I see them associated with the correct docs when retrieving all docs. BUT: looking at the schema analyzer, things look weird:

1. Top terms = -1
2. The dates are all mixed up, with some spurious 1970 dates thrown in (I can get rid of the 1970 dates if I use type date vs. tdate)
3. Multi-valued values (there should only be one per doc, as per the input data, even though the schema allows more).

Any ideas what, if anything, I'm doing wrong? See pic http://lucene.472066.n3.nabble.com/file/n3918636/Capture.jpg Here's my SOLR schema:
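For reference, the stock example schema defines the two types roughly like this; precisionStep > 0 is what generates the extra coarse-grained "navigation" terms Erick describes, while precisionStep="0" indexes only the exact value:

<fieldType name="date"  class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>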
Solr file size limit?
Dear fellow Solr users, I've been using Solr for a very short time now and I'm stuck. I'm trying to index a Drupal website consisting of 1.2 million smaller nodes and 300k larger nodes (~400kb avg). I'm using Solr 3.5 on a dedicated Ubuntu 10.04 box with 3TB of disk space and 16GB of memory. I've tried using the Sun JRE and OpenJDK, both resulting in the same problem. Indexing works great until my .fdt file reaches a size of 4.9GB (5,217,987,319 bytes). At that point, when Solr starts merging, it just keeps on merging, starting over and over. Java uses all the available memory even though Xmx is set at 8G. When I restart Solr everything looks fine until merging is triggered. Whenever it hangs, the server load averages 3; searching is possible but slow, and the Solr admin interface is reachable, but sending new documents leads to a time-out. I've tried several different settings for MergePolicy and started reindexing a couple of times, but the behavior stays the same. My current solrconfig.xml can be found here: http://pastebin.com/NXDT0B8f. I'm unable to find errors in the log, which makes it really difficult to debug. Could anyone point me in the right direction? I've already asked my question on Stack Overflow without receiving a solution: http://stackoverflow.com/questions/9993633/apache-solr-3-5-hangs-when-indexing. Maybe it can provide you with some more information. Kind regards! Bram Rongen
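For anyone experimenting along these lines, the merge-related knobs in a 3.x solrconfig.xml look like the following; the values are illustrative starting points, not a recommendation for this particular case:

<indexDefaults>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexDefaults>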
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
I'm curious how on the fly updates are handled as a new shard is added to an alias. Eg, how does the system know to which shard to send an update? On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hi, speaking about ES I think it would be fair to mention that one has to specify number of shards upfront when the index is created - that is correct, however, it is possible to give index one or more aliases which basically means that you can add new indices on the fly and give them same alias which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly I think there is a way how to handle growing data set with ease. If anyone is interested such scenario has been discussed in detail in ES mail list. Regards, Lukas On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: One of big weaknesses of Solr Cloud (and ES?) is the lack of the ability to redistribute shards across servers. Meaning, as a single shard grows too large, splitting the shard, while live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex though are at a singular level of maturity. Also Cassandra [successfully] implements multiple data center support, is that available in SC or ES? On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello Ali, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. Yup, OK. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. 
Here is a summary of all of them:

* Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good.
* Lily - data stored in an HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a many-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again not integrated.

If I were you and I had to pick today - I'd pick ElasticSearch if I were completely open. If I had a Solr bias I'd give SolrCloud a try first.

Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale?

I don't know off the top of my head, but I'm guessing several hundred
Re: Multiple document structure
On 18 April 2012 10:05, abhijit bashetti bashettiabhi...@rediffmail.com wrote: Hi, Is it possible to have 2 document structures in Solr? [...]

I do not think so, but why do you need that? Use two separate indices, either in a multi-core setup or in separate Solr instances. Regards, Gora
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
AFAIK it cannot. You can only add new shards by creating a new index, and you will then need to index new data into that new index. Index aliases are useful mainly for the searching part. So it means that you need to plan for this when you implement your indexing logic. On the other hand, the query logic does not need to change, as you only add new indices and give them all the same alias.

I am not an expert on this, but I think that index splitting and re-sharding can be expensive for a [near] real-time search system, and the point is that you can probably use different techniques to support your large-scale needs. Index aliasing and routing in elasticsearch can help a lot in supporting various large-scale data scenarios; check the following thread on the ES ML for some examples: https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ

Just to sum it up: the fact that elasticsearch has a fixed number of shards per index and does not support resharding and index splitting does not mean you cannot scale your data easily. (I was not following this whole thread in every detail, so maybe you have specific needs that can be solved only by splitting or resharding; in such a case I would recommend you ask on the ES ML with further questions. I do not want to run into a system X vs. system Y flame here...)

Regards, Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I'm curious how on the fly updates are handled as a new shard is added to an alias. Eg, how does the system know to which shard to send an update?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hi, speaking about ES I think it would be fair to mention that one has to specify number of shards upfront when the index is created - that is correct, however, it is possible to give index one or more aliases which basically means that you can add new indices on the fly and give them same alias which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly I think there is a way how to handle growing data set with ease. If anyone is interested such scenario has been discussed in detail in ES mail list. Regards, Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: One of big weaknesses of Solr Cloud (and ES?) is the lack of the ability to redistribute shards across servers. Meaning, as a single shard grows too large, splitting the shard, while live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex though are at a singular level of maturity. Also Cassandra [successfully] implements multiple data center support, is that available in SC or ES?

On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello Ali, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. Needless to mention, the search index needs to scale to 5Billion pages.
It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. Yup, OK. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank
pushing updates to solr from postgresql
i have a setup right this instant where the dataimporthandler is being used to pull data for an index from a postgresql server. i'd like to switch over to push, and am looking for some validation of my approach. i have perl installed as an untrusted language on my postgresql server and am planning to set up triggers on the tables where insert/update/delete operations should cause an update of the relevant solr indexes. the trigger functions will build xml in the format for UpdateXmlMessages and notify Solr via http requests. is this sensible, or am i missing something easier? also, does anyone have any thoughts about coordinating initial indexing/full reindexing via dataimporthandler with the trigger based push operations? thanks, richard
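For reference, the UpdateXmlMessages format the trigger functions would emit is small; something like the following for an insert/update and for a delete, where the field names are illustrative rather than taken from the poster's schema:

<add>
  <doc>
    <field name="id">12345</field>
    <field name="title">example row</field>
  </doc>
</add>

<delete><id>12345</id></delete>

These are POSTed to Solr's /update handler, with a <commit/> sent separately or batched.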
hierarchical faceting?
I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is a TextField with PathHierarchyTokenizerFactory as the tokenizer. Given these two documents:

Doc1: red
Doc2: red/pink

I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Problems with edismax parser and solr3.6
I just looked through my logs of Solr 3.6 and saw several 0-hit queries which were not seen with Solr 3.5. While tracing this down, it turned out that edismax doesn't like queries of the type ...q=(text:ide)... any more. If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd
Re: hierarchical faceting?
Put the parent term in all the child documents at index time, and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly.

On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
How to add/remove/customize search tabs
I have Apache Solr installed with my Drupal 7 site and noticed some default tabs available (Content, Site, Users). Is there a way to add/change that tabs section?
Re: How to add/remove/customize search tabs
This question is probably better asked on the Drupal groups page for Apache Solr, http://groups.drupal.org/lucene-nutch-and-solr, as this is more of a Drupal issue than a Solr issue.

On 18 Apr 2012, at 16:11, Valentin, AJ wrote: I have Apache Solr installed with my Drupal 7 site and noticed some default tabs available (Content, Site, Users). Is there a way to add/change that tabs section?

David Stuart
M +44(0) 778 854 2157
T +44(0) 845 519 5465
www.axistwelve.com
Axis12 Ltd | 7 Wynford Road | London | N1 9QN | UK
AXIS12 - Enterprise Web Solutions
Re: hierarchical faceting?
Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query-time tokenizer that tokenizes at /:

?q=colors:red == Doc1, Doc2
?q=colors:redfoobar ==
?q=colors:red/foobarasdfoaijao == Doc1, Doc2

On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfectly.

On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
minimum match and not matched words / term frequency in query result
Hi, I have a dismax query with a minimum-match setting; this allows some terms to be missing from query results. I would like to give feedback to the user, highlighting the words that were not matched. It would be interesting also to show the words with a very low frequency. For instance, searching for "purple pendrive", I would highlight that the results ignore the term "purple", because we don't have any. Can you suggest how to approach the problem? I was thinking about the debugQuery output, but since I will not get details about all the results I will probably miss something. I am trying to write a new SearchComponent, but I don't know how to get term frequency data from a ResponseBuilder object... I am new to Solr/Lucene programming. Thanks a lot
Solr 3.6 parsing and extraction files
Could someone possibly provide me with a list of jars that I need to extract from the apache-solr-3.6.0.tgz file to enable the parsing and remote streaming of office style documents? I assume (for a multicore configuration) they would go into ./tomcat/webapps/solr/WEB-INF/lib - correct? Thanks - Tod
Re: pushing updates to solr from postgresql
Hi Richard, One thing to think about here is what you will do when Solr is unavailable to take a new document, for whatever reason. If you send docs to Solr from PG, docs either get indexed or not, so you may have to catch errors and then mark documents in PG as not indexed. You may want to keep track of the initial and/or last index attempt and the total number of indexing attempts (new DB columns), and you will probably want to use DIH to pick up unindexed documents from PG and get them indexed. Also keep in mind that sending docs to Solr one by one will not be as efficient as sending batches of them, or as efficient as getting a batch of them via DIH. If your data volume is low this likely won't be a problem, but if it is high or growing, you'll want to keep this in mind.

Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

From: Welty, Richard rwe...@ltionline.com To: solr-user@lucene.apache.org Sent: Wednesday, April 18, 2012 10:48 AM Subject: pushing updates to solr from postgresql

i have a setup right this instant where the dataimporthandler is being used to pull data for an index from a postgresql server. i'd like to switch over to push, and am looking for some validation of my approach. i have perl installed as an untrusted language on my postgresql server and am planning to set up triggers on the tables where insert/update/delete operations should cause an update of the relevant solr indexes. the trigger functions will build xml in the format for UpdateXmlMessages and notify Solr via http requests. is this sensible, or am i missing something easier? also, does anyone have any thoughts about coordinating initial indexing/full reindexing via dataimporthandler with the trigger based push operations? thanks, richard
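To illustrate the batching point: a single <add> message can carry many <doc> elements, so the triggers (or a periodic sweeper reading a "not yet indexed" flag) can group rows rather than POSTing one document per request. Field names here are illustrative:

<add>
  <doc><field name="id">1</field></doc>
  <doc><field name="id">2</field></doc>
  <doc><field name="id">3</field></doc>
</add>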
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
The main point being made is that established NoSQL solutions (e.g., Cassandra, HBase, et al.) solved the update problem, among many other scalability issues, several years ago. If an update is being performed and it is not known where the record exists, the update capability of the system is inefficient. In addition, in a production system, the mere possibility of losing data or of inaccurate updates is usually a red flag.

On Wed, Apr 18, 2012 at 6:40 AM, Lukáš Vlček lukas.vl...@gmail.com wrote: AFAIK it can not. You can only add new shards by creating a new index and you will then need to index new data into that new index. Index aliases are useful mainly for searching part. So it means that you need to plan for this when you implement your indexing logic. On the other hand the query logic does not need to change as you only add new indices and give them all the same alias. I am not an expert on this but I think that index splitting and re-sharding can be expensive for [near] real-time search system and the point is that you can probably use different techniques to support your large scale needs. Index aliasing and routing in elasticsearch can help a lot in supporting various large scale data scenarios, check the following thread in ES ML for some examples: https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ Just to sum it up, the fact that elasticsearch does have fixed number of shards per index and does not support resharding and index splitting does not mean you can not scale your data easily. (I was not following this whole thread in every detail. So may be you may have specific needs that can be solved only by splitting or resharding, in such case I would recommend you to ask on ES ML with further questions, I do not want to run into system X vs system Y flame here...) Regards, Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I'm curious how on the fly updates are handled as a new shard is added to an alias. Eg, how does the system know to which shard to send an update?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hi, speaking about ES I think it would be fair to mention that one has to specify number of shards upfront when the index is created - that is correct, however, it is possible to give index one or more aliases which basically means that you can add new indices on the fly and give them same alias which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly I think there is a way how to handle growing data set with ease. If anyone is interested such scenario has been discussed in detail in ES mail list. Regards, Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: One of big weaknesses of Solr Cloud (and ES?) is the lack of the ability to redistribute shards across servers. Meaning, as a single shard grows too large, splitting the shard, while live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex though are at a singular level of maturity. Also Cassandra [successfully] implements multiple data center support, is that available in SC or ES?
On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello Ali, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. Yup, OK. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to
[Job] Search Engineer Lead at Sematext International
Hello, If you've always wanted a full-time job working with Solr, ElasticSearch, or Lucene, we have a position that is all about that, offers a path to team leadership, and will expose a person to a healthy mixture of engineering and business. If you are interested, please send your resume to j...@sematext.com . Otis

Sematext International is looking for a strong Search Engineer with interest and ability to interact with clients, and with the potential to build and lead local and/or remote development teams. By "client-facing" we really mean primarily email, phone, Skype. A person in this role needs to be able to:

* design large scale search systems
* have solid knowledge of either Solr or ElasticSearch or both
* efficiently troubleshoot performance, relevance, and other search-related issues
* speak and interact with clients

Pluses - beyond pure engineering:

* ability and desire to expand and lead development/consulting teams
* ability to think both business and engineering
* ability to build products based on observed client needs
* ability to present in public, at meetups, conferences, etc.
* ability to contribute to blog.sematext.com
* active participation in online search communities
* attention to detail
* desire to share knowledge and teach
* positive attitude, humor, agility

Location:
* New York

Travel:
* Minimal

Relevant pointers:
* http://sematext.com/about/jobs.html
* http://sematext.com/about/jobs.html#advantages
* http://sematext.com/engineering/index.html
solr stats component
Hello, I am using the stats component and I wanted help with range-like functionality (as in the facet component). To be more clear: we would like functionality similar to facet.range (i.e., with a gap and so on) for the statistics component. That is, with one call we would like the stats component to return stats only for a specified range, broken down into several buckets (based on the gap). We know that this functionality is not available in Solr, but wanted to see if there's any indirect way of doing it. Any thoughts would be highly appreciated. Thanks
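One indirect approach, at the cost of one request per bucket: issue the same stats call several times, each with a range filter query carving out a bucket. The field name and ranges below are hypothetical:

...&stats=true&stats.field=price&fq=price:[0 TO 99]
...&stats=true&stats.field=price&fq=price:[100 TO 199]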
Maximum Open Cursors using JdbcDataSource and cacheImpl
After upgrading from 3.5.0 to 3.6.0 we have noticed that when we use a cacheImpl on a nested JdbcDataSource entity, the database runs out of cursors. It does not matter what transactionIsolation, autoCommit, or holdability setting we use. I have only been using Solr for a few months, but after looking at EntityProcessorBase, DIHCacheSupport, and JdbcDataSource.ResultSetIterator, it may be that the ResultSet or Statement is never closed. In EntityProcessorBase.getNext(), if there is no cacheSupport, it likely immediately closes the resources it was using; whereas with caching it might leave them open, because the rowIterator is never set to null. Since it has a reference to the resultSet and stmt, it holds onto them and neither is ever closed.

On a related note, there appear to be other possible leaks in JdbcDataSource.ResultSetIterator. The close() method attempts to close both the resultSet and the stmt; however, if it fails closing the resultSet, it will not close the stmt. They should probably be wrapped in separate try/catch blocks. It will also not close the stmt or resultSet if the ResultSetIterator throws an exception in its constructor. In my experience one cannot count on the closing of the connection to clean up those resources consistently.

2012-04-18 12:02:22,017 ERROR [org.apache.solr.handler.dataimport.DataImporter] Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select distinct DISPLAY_NAME from dimension where dimension.DIMENSION_ID = 'M' Processing Document # 11
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select distinct DISPLAY_NAME from dimension where dimension.DIMENSION_ID = 'M' Processing Document # 11
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select distinct DISPLAY_NAME from dimension where dimension.DIMENSION_ID = 'M' Processing Document # 11
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    ... 5 more
Caused by: java.sql.SQLException: ORA-01000: maximum open cursors exceeded
    at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:112)
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:331)
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:288)
    at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:745)
    at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:210)
    at oracle.jdbc.driver.T4CStatement.executeForDescribe(T4CStatement.java:804)
    at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1049)
    at oracle.jdbc.driver.T4CStatement.executeMaybeDescribe(T4CStatement.java:845)
    at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1146)
    at
RE: Maximum Open Cursors using JdbcDataSource and cacheImpl
Keith, Can you supply your data-config.xml ?

James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311

-Original Message- From: Keith Naas [mailto:keithn...@dswinc.com] Sent: Wednesday, April 18, 2012 11:43 AM To: solr-user@lucene.apache.org Subject: Maximum Open Cursors using JdbcDataSource and cacheImpl

After upgrading from 3.5.0 to 3.6.0 we have noticed that when we use a cacheImpl on a nested JdbcDataSource entity, the database runs out of cursors. [...]
Re: SOLR 4 / Date Query: Spurious Results: Is it me or ... ?
Thanks for clarifying. I figured out the (terms=-1) issue; it was my fault. I attempted to truncate the index in my test case setup by issuing a delete query, and I think the subsequent commit might not have taken effect by the time the index queries that followed started. -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-4-Date-Query-Spurious-Results-Is-it-me-or-tp3918636p3920652.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: hierarchical faceting?
It looks like TextField is the problem. This fixed it:

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I am assuming the text_path fields won't include whitespace characters.

?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!)

Is there a tokenizer that tokenizes the whole string as one token? I tried to extend Tokenizer myself but it fails:

public class AsIsTokenizer extends Tokenizer {
    @Override
    public boolean incrementToken() throws IOException {
        return true; // or false;
    }
}

On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query time tokenizer that tokenizes at /

?q=colors:red == Doc1, Doc2
?q=colors:redfoobar ==
?q=colors:red/foobarasdfoaijao == Doc1, Doc2

On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfect. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But, with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
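On the single-token question above: Solr ships solr.KeywordTokenizerFactory, which emits the entire field value as one token, so no custom Tokenizer should be needed. A minimal query-side sketch, under sam's own assumption that the paths contain no whitespace:

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>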
Can you suggest a method or pattern to consistently promote a document with any query?
Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
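For reference, the component reads its rules from an elevate.xml file in the conf directory; a minimal sketch, with hypothetical query text and document id:

<elevate>
  <query text="promoted query">
    <doc id="DOC-1"/>
  </query>
</elevate>

Each query element lists the documents to pin to the top of the results for that query text.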
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Chris, I haven't checked if Elevate Component has an easy way to push a specific doc for *all* queries, but have a look http://wiki.apache.org/solr/QueryElevationComponent Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - From: Chris Warner chris_war...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 1:16 PM Subject: Can you suggest a method or pattern to consistently promote a document with any query? Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Thanks, Jeevanandam and Otis, I'll take another look at Elevate. My first attempts did not yield success, as I was not able to find a way to elevate a document with a *:* query. Perhaps I'll try a * query to see what happens. Cheers, Chris - Original Message - From: Jeevanandam Madanagopal je...@myjeeva.com To: solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:21 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris --
Re: Can you suggest a method or pattern to consistently promote a document with any query?
That is not a useful test. Users don't look for *:*. Test with real queries. wunder On Apr 18, 2012, at 10:27 AM, Chris Warner wrote: Thanks, Jeevanandam and Otis, I'll take another look at Elevate. My first attempts did not yield success, as I was not able to find a way to elevate a document with a *:* query. Perhaps I'll try a * query to see what happens. Cheers, Chris - Original Message - From: Jeevanandam Madanagopal je...@myjeeva.com To: solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:21 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- -- Walter Underwood wun...@wunderwood.org
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Browsing all documents and all facets, skipper. Cheers, Chris - Original Message - From: Walter Underwood wun...@wunderwood.org To: solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 10:29 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? That is not a useful test. Users don't look for *:*. Test with real queries. wunder On Apr 18, 2012, at 10:27 AM, Chris Warner wrote: Thanks, Jeevanandam and Otis, I'll take another look at Elevate. My first attempts did not yield success, as I was not able to find a way to elevate a document with a *:* query. Perhaps I'll try a * query to see what happens. Cheers, Chris - Original Message - From: Jeevanandam Madanagopal je...@myjeeva.com To: solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:21 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- -- Walter Underwood wun...@wunderwood.org
Re: hierarchical faceting?
I don't use any of that stuff in my app, so not sure how it works. I just manage my taxonomy outside of Solr at index time and don't need any special fields or tokenizers. I use a string field type, insert the proper field at index time, and query it normally. Nothing special required. On Wed, 2012-04-18 at 13:00 -0400, sam ” wrote: It looks like TextField is the problem. This fixed it:

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I am assuming the text_path fields won't include whitespace characters.

?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!)

Is there a tokenizer that tokenizes the whole string as one token? I tried to extend Tokenizer myself but it fails:

public class AsIsTokenizer extends Tokenizer {
    @Override
    public boolean incrementToken() throws IOException {
        return true; // or false;
    }
}

On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query time tokenizer that tokenizes at /

?q=colors:red == Doc1, Doc2
?q=colors:redfoobar ==
?q=colors:red/foobarasdfoaijao == Doc1, Doc2

On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfect. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But, with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Thanks to those who responded. After a more thorough reading of the wiki, I see the need for forceElevation=true in the elevate query. Cheers, Chris - Original Message - From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:23 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris, I haven't checked if Elevate Component has an easy way to push a specific doc for *all* queries, but have a look http://wiki.apache.org/solr/QueryElevationComponent Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - From: Chris Warner chris_war...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 1:16 PM Subject: Can you suggest a method or pattern to consistently promote a document with any query? Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
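As an illustration of the parameter Chris mentions, assuming the /elevate handler from the example solrconfig.xml (query text and sort field are hypothetical), forceElevation keeps the elevated document on top even when an explicit sort is applied:

http://localhost:8983/solr/elevate?q=foo&sort=price+asc&enableElevation=true&forceElevation=true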
Date granularity
A query search on a particular date returns 1 valid result (as expected). How can I alter the granularity of the search, for example, to match everything on that particular DAY? Reading through various docs, I attempted to append /DAY, but this doesn't seem to work (in fact I get 0 results back when querying). What am I neglecting? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3920890.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: hierarchical faceting?
The PathHierarchyTokenizerFactory is intended for file paths and therefore assumes that all documents should be indexed with all of the paths to their parent folders, but you are trying to use it for a taxonomy, so you can't simply use the PathHierarchyTokenizerFactory. Use the analysis page (http://localhost:8983/solr/admin/analysis.jsp) so that you can see what's happening with the content both at index and query time:

Field (Type): text_path
Field value (Index): red/pink
Field value (Query): red/pink

You'd notice that the result of both is identical, therefore explaining why both documents are retrieved:

Index Analyzer: red red/pink
Query Analyzer: red red/pink

Carlos -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Wednesday, April 18, 2012 8:10 AM To: solr-user@lucene.apache.org Subject: Re: hierarchical faceting? Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfect. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But, with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Suggester
Using Solr 3.6, I am trying to get suggestions for phrases. I managed to get prefix suggestions, but not suggestions for the middle of a phrase. Can this be achieved with the built-in Solr suggest, or do I need to create a special core for this purpose? Thanks in advance.
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Chris - If you have defined 'last-components' in the search handler, forceElevation=true may not be required. It gets invoked in the search life cycle:

<arr name="last-components">
  <str>elevator</str>
</arr>

-Jeevanandam On Apr 18, 2012, at 11:37 PM, Chris Warner wrote: Thanks to those who responded. A more thorough reading of the wiki and I see the need for forceElevation=true in the elevate query. Cheers, Chris - Original Message - From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:23 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris, I haven't checked if Elevate Component has an easy way to push a specific doc for *all* queries, but have a look http://wiki.apache.org/solr/QueryElevationComponent Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - From: Chris Warner chris_war...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 1:16 PM Subject: Can you suggest a method or pattern to consistently promote a document with any query? Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
Re: Solr file size limit?
On 4/18/2012 6:17 AM, Bram Rongen wrote: I'm using Solr 3.5 on a dedicated Ubuntu 10.04 box with 3TB of disk space and 16GB of memory. I've tried using the Sun JRE and OpenJDK, both resulting in the same problem. Indexing works great until my .fdt file reaches the size of 4.9GB/5217987319b. At this point, when Solr starts merging, it just keeps on merging, starting over and over.. Java is using all the available memory even though Xmx is set at 8G. When I restart Solr everything looks fine until merging is triggered. Whenever it hangs, the server load averages 3, searching is possible but slow, the Solr admin interface is reachable, but sending new documents leads to a time-out. Solr 3.5 works a little differently than previous versions (it MMaps all the index files), so if you look at the memory usage as reported by the OS, it's going to look all wrong. I've got my max heap set to 8192M, but this is what top looks like:

Mem:  64937704k total, 58876376k used,  6061328k free,   379400k buffers
Swap:  8388600k total,    77844k used,  8310756k free, 47080172k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22798 ncindex  20   0 75.6g  21g  12g S  1.0 34.3 14312:55 java

If you add up the 47GB it says it's using for the disk cache, the 6GB that it says is free, and the 21GB it says that Java has resident, you end up with considerably more than the 64GB total RAM the machine has, even if you include the 77MB of swap that's used. You can use the jstat command to get a better idea of how much RAM Java really is using:

jstat -gc -t <pid> 5000

Add up the S0C, S1C, EC, OC, and PC columns. The alignment is often wrong on this output, so you'll have to count the columns. If I do this for my system, I end up with 8462972 KB. Alternatively, if you have a GUI installed on the server or you have set up remote JMX, you can use JConsole to very easily get a correct number. The extra memory reported by the OS is not really being used; it is a side effect of the memory mapping used by the Lucene indexes. I've tried using several different settings for MergePolicy and started reindexing a couple of times but the behavior stays the same. My current solrconf.xml can be found here: http://pastebin.com/NXDT0B8f. I'm unable to find errors in the log, which makes it really difficult to debug.. Could anyone point me in the right direction? A mergeFactor of 4 is extremely low and will result in very frequent merging. The default is 10. I use a value of 36, but that is unusually high. Looking at one of my indexes on that machine, the largest fdt file is 7657412 KB; the other three are tiny - 9880, 12160, and 28 KB. That index was recently optimized. The total index size is over 20GB. I have three indexes that size running in different cores on that machine. You're definitely not running into any limits as far as Solr is concerned. You might be running into I/O issues. Are you relying on autoCommit, or explicitly committing your updates and waiting for the commit to finish before doing more updates? When there is segment merging, commits can take a really long time. If you are using autoCommit, or not waiting for manual commits to finish, it might get bad enough that one commit has not yet finished when another is ready to take place. I don't know what this would actually do, but it would not be a good situation. How have you created your 3TB of disk space? If you are using RAID5 or RAID6, you can run into very serious and unavoidable performance problems with writes.
If it is a single disk, it may not provide enough IOPS for good performance. My servers also have 3TB of disk space, using six 1TB SATA drives in RAID10. The worst-case scenario for your merges is equivalent to an optimize. An optimize of one of my 20GB indexes takes 15 minutes even on RAID10, so I optimize only one large index per day, which means each large index gets optimized every six days. I hope this helps, but I'll be happy to try and offer more, within my skill set. Thanks, Shawn
Difference between Search result from Admin console and solr/browse
I have imported my XML documents from an Oracle database and indexed them. When I search *:* in the *admin console* I do get results. My XML format is not close to what Solr expects, but still, when I search for any word that is part of my XML document, Solr displays the whole XML document. For example, if I search for the word voicemail, Solr displays the XML documents that have the word voicemail. Now when I go to solr/browse and give *:* I do see something, but each result is like below (no data); even if I search for the same word voicemail I get the below. Can somebody please advise!

Price:
Features:
In Stock

There are only two things I can think of; one is the settings in solrconfig.xml (like below):

<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="title">Solritas</str>
    <str name="df">text</str>
    <str name="defType">edismax</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="mlt.qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    </str>
    <str name="mlt.fl">text,features,name,sku,id,manu,cat</str>
    <int name="mlt.count">3</int>
    <str name="qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    </str>

-- View this message in context: http://lucene.472066.n3.nabble.com/Difference-between-Search-result-from-Admin-console-and-solr-browse-tp3921323p3921323.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr file size limit?
On 4/18/2012 6:17 AM, Bram Rongen wrote: I've been using Solr for a very short time now and I'm stuck. I'm trying to index a drupal website consisting of 1.2 million smaller nodes and 300k larger nodes (~400kb avg).. A followup to my previous reply: Your ramBufferSizeMB is only 32, the default in the example config. I have seen recommendations indicating that going beyond 128MB is not usually helpful. With such large input documents, that may not apply to you - try setting it to 512 or 1024. That will result in far fewer index segments being created. They will be larger, so merges will be much less frequent but take longer. Thanks, Shawn
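For reference, a sketch of the setting Shawn suggests; in Solr 3.x it sits in the indexDefaults (or mainIndex) section of solrconfig.xml, with 512 here following his suggested value:

<ramBufferSizeMB>512</ramBufferSizeMB>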
Re: Populating a filter cache by means other than a query
Great question. The set could be in the millions. I oversimplified the use case somewhat to protect the innocent :-}. If a user is querying a large set of documents (for the sake of argument, let's say it's high tens of millions but could be in the small billions), they want to potentially mark a result set, or a subset of those docs, with a label/tag and use that label/tag later. Now let's throw in that it's a multi-tenant system and we don't want to keep re-indexing documents to add these tags. Really what I would want to do is to execute a query filtering by this labeled set; the server fetches the labeled set out of local cache, or over the wire, or off disk, and then incorporates it by one means or another as a filter (docset, or hashtable in the hit collector). Personally I think the dictionary approach wouldn't be a good one. It may produce the most optimal filter mechanism, but it will cost a bunch to construct the OpenBitSet. In a prior company I built a more generic version of this, for not only filtering but for sorting, aggregate stats, etc. We didn't use Solr. I was curious if there was any methodology for plugging in such a scheme without taking a branch of Solr and hacking at it. This was a multi-tenant system where we were producing aggregate graphs, filtering and ranking by things such as entity-level sentiment, so we produced a rather generic solution here that, as you pointed out, reinvented perhaps some things that smell similar. It was about 7B docs and was multi-tenant. Users were able to override these features on a document level, which was necessary so their counts, sorts, etc. worked correctly. Seeing how long it took me to build and debug that, if I can take something close off the shelf... well, you know the rest of the story :-} C On Apr 18, 2012, at 4:38 AM, Erick Erickson wrote: I guess my question is what advantage are you trying to get here? At the start, this feels like an XY problem. How are you intending to use the fq after you've built it? Because if there's any way to just create an fq clause, Solr will take care of it for you. Caching it, autowarming it when searchers are re-opened, etc. Otherwise, you're going to be re-inventing a bunch of stuff it seems to me, you'll have to intercept the queries coming in in order to apply the filter from the cache, etc. Which also may be another way of asking How big is this set of document IDs? If it's in the 100s, I'd just go with an fq. If it's more than that, I'd index some kind of set identifier that you could create for your fqs. And if this is gibberish, ignore me G.. Best Erick On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins ch...@geekychris.com wrote: Hi, I am a long time Lucene user but new to Solr. I would like to use something like the filterCache, but build such a cache not from a query but from custom code. I guess I will ask my question by using techniques and vocab I am familiar with.
Not sure it's actually the right way, so I apologize if it's just the wrong approach. The scenario is that I would like to filter a result set by a set of labeled documents; I will call that set L. L contains app-specific document IDs that are indexed as literals in the Lucene field myid. I would imagine I could build an OpenBitSet by enumerating the termdocs and looking for the intersecting ids in my label set. Now I have my bitset that I assume I could use in a filter. Another approach would be to implement a hits collector, compute a fieldcache from that myid field, and look for the intersection in a hashtable of L at scoring time, throwing out results that are not contained in the hashtable. Of course I am working within the confines / concepts that Solr has laid out. Without going completely off the reservation, is there a neat way of doing such a thing with Solr? Glad to clarify if my question makes absolutely no sense. Best C
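As a rough sketch of the termdocs enumeration Chris describes, against the Lucene 3.x API (the field name myid is from his message; the class and method names are hypothetical):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

public final class LabelSetFilterBuilder {
    // Build a bitset of Lucene doc ids whose "myid" term is in the label set L.
    // The resulting OpenBitSet could then back a Filter or be cached per label.
    public static OpenBitSet build(IndexReader reader, Set<String> labelSet)
            throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
            for (String id : labelSet) {
                termDocs.seek(new Term("myid", id)); // field holding app-specific ids
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            }
        } finally {
            termDocs.close();
        }
        return bits;
    }
}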
RE: Changing precisionStep without a re-index
In case anyone tries to do this... If you facet on a TrieField and change the precisionStep to 0, you'll need to re-index. Changing precisionStep to 0 changes the prefix returned by TrieField.getMainValuePrefix(FieldType), which then causes facets with a value of 0 to be returned. -Michael
Re: Date granularity
you could use a filter query like: fq=datefield:[NOW/DAY-1DAY TO NOW/DAY+1DAY] *replace datefield with your field that contains the time info On Wed, Apr 18, 2012 at 11:11 AM, vybe3142 vybe3...@gmail.com wrote: A query search on a particular date: returns 1valid result (as expected). How can I alter the granularity of the search for example , to all matches on the particular DAY? Reading through various docs, I attempt to append /DAY but this doesn't seem to work (in fact I get 0 results back when querying). What am I neglecting? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3920890.html Sent from the Solr - User mailing list archive at Nabble.com.
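Note that the range above spans from the start of yesterday through today. If the goal is matches on one specific day only, a single-day window is the usual pattern (field name assumed, as above):

fq=datefield:[NOW/DAY TO NOW/DAY+1DAY]

or, anchored to an explicit date rather than NOW:

fq=datefield:[2012-04-18T00:00:00Z TO 2012-04-18T00:00:00Z+1DAY]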
Re: Difference between Search result from Admin console and solr/browse
Hi, The /browse Request Handler is built to showcase the xml documents in solr/example/exampledata and if you want to use it for your own data and schema you must modify the templates in solr/example/conf/velocity/ to display whatever you want to display. Given that you use an unmodified example schmema, you should be able to get more or less the same results as in Admin console (which uses the Lucene query parser on default field text ootb) by querying for text:voicemail. If you then click the enable debug link at the bottom of the page and then click the toggle all fields links below each result hit, you will see what is contained in each and every field. What you probably *should* do is to transform your oracle XMLs into XML that corresponds with Solr's schema, and you should tweak your schema and Velocity templates to match what you'd like to output in the reults. A simple way to prototype transforms is to write an XSL and using the XSLTUpdateRequestHandler at solr/update/xslt instead of the XML handler. See http://wiki.apache.org/solr/XsltUpdateRequestHandler -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 22:49, srini wrote: I have imported my xml documents from oracle database and indexed them. When I search *:* in *admin console *I do get results. My xml format is not close to what solr expects. but still when I search for any word that is part of my xml document Solr displays whole xml document. for example if I search for word voicemail solr displays xml documents that has word voicemail Now when I go to solr/browse and give *:* I do see some thing but each result is like below (no data) even if i search for same word voicemail I am getting below. Can some body !!please Advice! Price: Features: In Stock there are only two things I can think off, one is settings in solrconfig.xml(like below). requestHandler name=/browse class=solr.SearchHandler lst name=defaults str name=echoParamsexplicit/str str name=wtvelocity/str str name=v.templatebrowse/str str name=v.layoutlayout/str str name=titleSolritas/str str name=dftext/str str name=defTypeedismax/str str name=q.alt*:*/str str name=rows10/str str name=fl*,score/str str name=mlt.qf text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 /str str name=mlt.fltext,features,name,sku,id,manu,cat/str int name=mlt.count3/int str name=qf text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 /str -- View this message in context: http://lucene.472066.n3.nabble.com/Difference-between-Search-result-from-Admin-console-and-solr-browse-tp3921323p3921323.html Sent from the Solr - User mailing list archive at Nabble.com.
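To illustrate the XSLT route Jan mentions: the stylesheet goes in the core's conf/xslt/ directory and is selected with the tr parameter. A sketch with hypothetical file names:

curl "http://localhost:8983/solr/update/xslt?tr=oracle2solr.xsl&commit=true" \
  -H "Content-Type: text/xml" --data-binary @oracle-records.xml

Here oracle2solr.xsl would transform the Oracle XML into Solr's add/doc/field update format before indexing.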
Re: minimum match and not matched words / term frequency in query result
Hi, Which query terms match may of course vary from document to document, so it would be hard to globally print non-matching terms. But for each individual document match, you could deduce which terms do not match by enumerating the terms that DO match - using the explain output, for instance. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 17:34, giovanni.bricc...@banzai.it wrote: Hi I have a dismax query with a minimum match setting; this allows some terms to be missing in query results. I would like to give feedback to the user, highlighting the not-matched words. It would be interesting also to show the words with a very low frequency. For instance, searching for purple pendrive I would highlight that the results ignore the term purple, because we don't have any. Can you suggest how to approach the problem? I was thinking about the debugQuery output, but since I will not get details about all the results I will probably miss something. I am trying to write a new SearchComponent but I don't know how to get term frequency data from a ResponseBuilder object... I am new to solr/lucene programming. Thanks a lot
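For example, a dismax request that tolerates missing terms and exposes per-document explain output might look like this (parameter values are illustrative):

http://localhost:8983/solr/select?defType=dismax&qf=text&mm=1&q=purple+pendrive&debugQuery=on

Here mm=1 requires only one of the query clauses to match, and debugQuery=on adds the explain section from which the matching terms for each hit can be read off.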
Re: Solr 3.6 parsing and extraction files
Hi, I suppose you want to POST office docs into Solr for text extraction using the Extracting RequestHandler (SolrCell). Have you read this page? http://wiki.apache.org/solr/ExtractingRequestHandler You basically need all libs provided by contrib/extraction. You can see in the example solr/conf/solrconfig.xml which lib ../ directives are included near the top of the file, this should give you a hint of how to configure your own solrconfig.xml depending on where you put those libs. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 17:36, Tod wrote: Could someone possibly provide me with a list of jars that I need to extract from the apache-solr-3.6.0.tgz file to enable the parsing and remote streaming of office style documents? I assume (for a multicore configuration) they would go into ./tomcat/webapps/solr/WEB-INF/lib - correct? Thanks - Tod
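As a sketch, the directives near the top of the example solrconfig.xml look like the following; the relative paths assume the stock example layout and should point at wherever you actually put the contrib jars:

<lib dir="../../contrib/extraction/lib" />
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />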
Re: Populating a filter cache by means other than a query
Pesky users. Life would be so much easier if they'd just leave devs alone G Right. Well, you can certainly create your own SearchComponent and attach your custom filter at that point, note how I'm skimping on the details here. From left field, you might create a custom FunctionQuery that returns 0 in the case of excluded documents. Since that gets multiplied into the score, the resulting score is 0. Returning 1 for docs that should be kept wouldn't change the score. But other than that, I'll leave it to the folks in the code. Chris, you there? G.. Best Erick On Wed, Apr 18, 2012 at 5:14 PM, Chris Collins ch...@geekychris.com wrote: Great question. The set could be in the millions. I over simplified the use case somewhat to protect the innocent :-}. If a user is querying a large set of documents (for the sake of argument lets say its high tens of millions but could be in the small billions), they want to potentially mark a result set or subset of those docs with a label/tag and use that label /tag later. Now lets throw in its multi tenant system and we dont want to keep re-indexing documents to add these tags. Really what I would want todo is to execute a query filtering by this labeled set, the server fetches the labeled set out of local cache or over the wire or off disk and then incorporates it by one means or another as a filter (docset or hashtable in the hitcollector). Personally I think the dictionary approach wouldnt be a good one. It may produce the most optimal filter mechanism but will cost a bunch to construct the OpenBitSet. In a prior company I built a more generic version of this for not only filtering but for sorting, aggregate stats, etc. We didn't use Solr. I was curious if there was any methodology for plugging in such a scheme without taking a branch of solr and hacking at it. This was a multi tenant system where we were producing aggregate graphs, filtering and ranking by things such as entity level sentiment so we produced a rather generic solution here that as you pointed out reinvented perhaps some things that smell similar. It was about 7B docs and was multi tenant. Users were able to overide these features on a document level which was necessary so their counts, sorts etc worked correctly. Saying how long it took me to build and debug it if I can take something close off the shelf.well you know the rest of the story :-} C On Apr 18, 2012, at 4:38 AM, Erick Erickson wrote: I guess my question is what advantage are you trying to get here? At the start, this feels like an XY problem. How are you intending to use the fq after you've built it? Because if there's any way to just create an fq clause, Solr will take care of it for you. Caching it, autowarming it when searchers are re-opened, etc. Otherwise, you're going to be re-inventing a bunch of stuff it seems to me, you'll have to intercept the queries coming in in order to apply the filter from the cache, etc. Which also may be another way of asking How big is this set of document IDs? If it's in the 100s, I'd just go with an fq. If it's more than that, I'd index some kind of set identifier that you could create for your fqs. And if this is gibberish, ignore me G.. Best Erick On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins ch...@geekychris.com wrote: Hi, I am a long time Lucene user but new to solr. I would like to use something like the filterCache but build a such a cache not from a query but custom code. I guess I will ask my question by using techniques and vocab I am familiar with. 
Not sure its actually the right way so I appologize if its just the wrong approach. The scenario is that I would like to filter a result set by a set of labeled documents, I will call that set L. L contains app specific document IDs that are indexed as literals in the lucenefield myid. I would imagine I could build a OpenBitSet from enumerating the termdocs and look for the intersecting ids in my label set. Now I have my bitset that I assume I could use in a filter. Another approach would be to implement a hits collector, compute a fieldcache from that myid field and look for the intersection in a hashtable of L at scoring time, throwing out results that are not contained in the hashtable. Of course I am working within the confines / concepts that SOLR has layed out. Without going completely off the reservation is their a neat way of doing such a thing with SOLR? Glad to clarify if my question makes absolutely no sense. Best C
Re: Multiple document structure
Solr does not enforce anything about documents conforming to the schema except:

1. a field specified in a doc must be present in the schema
2. any field in the schema with required="true" must be present in the doc

Additionally, there is no penalty for NOT putting all the fields defined in the schema into a particular document. What this means: just create your schema with all the fields you'll need for both types of documents, probably along with a type field to distinguish the two (a sketch follows below). Now just index the separate document types in the same index. Best Erick On Wed, Apr 18, 2012 at 9:28 AM, Gora Mohanty g...@mimirtech.com wrote: On 18 April 2012 10:05, abhijit bashetti bashettiabhi...@rediffmail.com wrote: Hi, Is it possible to have 2 document structures in solr? [...] Do not think so, but why do you need it? Use two separate indices, either in a multi-core setup, or in separate Solr instances. Regards, Gora
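A minimal sketch of the type-field idea Erick describes (field and value names are hypothetical):

<field name="doctype" type="string" indexed="true" stored="true"/>

At query time, restrict results to one document structure with a filter query, e.g. fq=doctype:book.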
Re: Date granularity
If Peter's suggestion doesn't work, please post the results of adding debugQuery=on to your query. The date math stuff is sensitive to spaces, for instance and it's impossible to tell whether you're making a simple error like that without seeing what you're actually doing. Best Erick On Wed, Apr 18, 2012 at 6:46 PM, Peter Markey sudoma...@gmail.com wrote: you could use a filter query like: fq=datefield:[NOW/DAY-1DAY TO NOW/DAY+1DAY] *replace datefield with your field that contains the time info On Wed, Apr 18, 2012 at 11:11 AM, vybe3142 vybe3...@gmail.com wrote: A query search on a particular date: returns 1valid result (as expected). How can I alter the granularity of the search for example , to all matches on the particular DAY? Reading through various docs, I attempt to append /DAY but this doesn't seem to work (in fact I get 0 results back when querying). What am I neglecting? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3920890.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Core not able access latest data indexed by multiple server.
I think you're trying to do something that you shouldn't. The trunk SolrCloud stuff will address this issue, but for the 3.x code line, having multiple servers opening up a shared index and writing to it will produce unpredictable results. This is really bad practice. You'd be far ahead setting up one of these machines as a master and the other as a slave, and always indexing to the master. Best Erick On Wed, Apr 18, 2012 at 1:17 AM, Paresh Modi pm...@asite.com wrote: Hi, I am using Solr multicore approach in my app. we have two different servers (ServerA1 and ServerA2) for load balancing, both the server accessing the same index repository and request will go to any server as per load balance algorithm. Problem occurs in following way [Note that both the servers accessing the same physical location(index)]. - ADD TO INDEX request for File1 go to ServerA1 for core CR1, core CR1 loaded in ServerA1 and indexing done. - ADD TO INDEX request for File2 go to ServerA2 for core CR1, core CR1 loaded in ServerA2 and indexing done. - SEARCH request for File2 go to ServerA1, now here core CR1 is already loaded so it directly access the index but File2 added by ServerA2 is not found in core loaded by ServerA1. So this is the problem, File2 indexed by core CR1 loaded in ServerA2 is not available in core CR1 loaded by ServerA1. I have searched and found that the solution to this problem is reload the CORE. when you reload the core, it will have latest indexed data. but reloading the Core for every request is very heavy and time consuming process. Please let me know if anyone has any solution for this. Waiting for your expert advice. Thanks Paresh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Core-not-able-access-latest-data-indexed-by-multiple-server-tp3919113p3919113.html Sent from the Solr - User mailing list archive at Nabble.com.
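A rough sketch of the master/slave wiring Erick suggests, using solr.ReplicationHandler in each server's solrconfig.xml (the URL, core name, and poll interval are placeholders drawn from the question). On the master, the server all indexing goes to:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the slave, which then serves searches from its own replicated copy instead of a shared directory:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://ServerA1:8983/solr/CR1/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>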
Re: Problems with edismax parser and solr3.6
Happened to see that Jan confirms this as a bug, see: https://issues.apache.org/jira/browse/SOLR-3377 On Wed, Apr 18, 2012 at 11:00 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: I just looked through my logs of Solr 3.6 and saw several 0 hits which were not seen with Solr 3.5. While tracing this down, it turned out that edismax doesn't like queries of type ...q=(text:ide)... any more. If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd
Re: Problems with edismax parser and solr3.6
Hi, Thanks for reporting this. I've created a bug ticket for this at https://issues.apache.org/jira/browse/SOLR-3377 -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 17:00, Bernd Fehling wrote: I just looked through my logs of Solr 3.6 and saw several 0 hits which were not seen with Solr 3.5. While tracing this down, it turned out that edismax doesn't like queries of type ...q=(text:ide)... any more. If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd