Re: Multiple servers support

2011-09-26 Thread Raja Ghulam Rasool
Erick,

Many thanks for your suggestions and pointers. I am proceeding with my study
and looking forward to doing a POC with Solr.

Thanks again.


On Sun, Sep 25, 2011 at 7:40 PM, Erick Erickson erickerick...@gmail.com wrote:

 Well, this is not a neutral forum <g>...

 A common use-case for Solr is exactly to replace
 database searches because, as you say, search
 performance in a database is often slow and limited.
 RDBMSs do very complex stuff very well, but they
 are not designed for text searching.

 Scaling is accomplished by either replication or
 sharding. Replication is used when the entire index
 fits on a single machine and you can get
 reasonable responses. I've seen 40-50M docs fit
 quite comfortably on one machine. But 150TB
 *probably* indicates that this isn't reasonable in your
 case.

 If you can't fit the entire index on one machine, then
 you shard, which splits up the single logical index
 into multiple slices, and Solr will automatically query
 all the shards and assemble the parts into a single
 response.
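
 For illustration, a sharded request looks something like this (hostnames
 made up; the shards param is the stock distributed-search mechanism):

 http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr

 Solr fans the query out to each shard listed and merges the results.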

 But you absolutely cannot guess the hardware
 requirements ahead of time. It's like answering
 "How big is a Java program?" There are too
 many variables. But Solr is free, right? So you
 absolutely have to get a copy and put your 2.5M
 docs on it and test (Solrmeter or jMeter are
 good options). If you get adequate throughput, add
 another 1M docs to the machine. Keep on until
 your QPS rate drops and you'll have a good idea how
 many documents you can put on a single machine.
 There's really no other way to answer that question.

 Best
 Erick

 On Sun, Sep 25, 2011 at 5:55 AM, Raja Ghulam Rasool the.r...@gmail.com
 wrote:
  Hi,
 
  I am new to Solr and am currently studying it. We are planning to
  implement Solr in our production setup. We have 15 servers where we are
  getting the data. The data is huge: we are supposed to keep 150 terabytes
  of data across all servers combined (in terms of documents, around
  2,592,000 documents per server). We have the
  necessary storage capacity. Can anyone let me know whether Solr will be a
  good solution for our text search needs? We are required to provide text
  searches on a certain limited number of fields.
 
  1- Does Solr support such an architecture, i.e. multiple servers? What
  specific areas in Solr do I need to explore (shards, cores, etc.)?
  2- Any idea whether we will really benefit from a Solr implementation for
  text searches vs., say, Oracle Text Search? Currently our Oracle Text
  search is giving very bad performance and we are looking to somehow
  improve our text search performance.
  Any high-level pointers or help will be greatly appreciated.
 
  thanks in advance guys
 
  --
  Regards,
  Raja
 




-- 
Regards,
Ghulam Rasool.

Blog: http://ghulamrasool.blogspot.com
Mobile: +971506141872


Unique Key error on trunk

2011-09-26 Thread Viswa S
Hello,

We use solr.UUIDField to generate unique ids. Using the latest trunk (change
list 1163767) seems to throw an error "Document is missing mandatory uniqueKey
field: id". The schema is set up to generate an id field on updates:

   <field name="id" type="uuid" indexed="true" stored="true" default="NEW" />

Thanks
Viswa

SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory 
uniqueKey field: id
at 
org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:80)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:145)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at 
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1406)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)



error while replication

2011-09-26 Thread shinkanze
hi , 
I am replicating Solr and getting this error. I am unable to make out the
cause, so please kindly help.


26 Sep, 2011 8:00:14 AM org.slf4j.impl.JDK14LoggerAdapter fillCallerData
SEVERE: Error during auto-warming of
key:org.apache.solr.search.QueryResultKey@150f0455:java.lang.NullPointerException
at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
at org.apache.lucene.index.Term.<init>(Term.java:38)
at
org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.next(NumericRangeQuery.java:530)
at
org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.<init>(NumericRangeQuery.java:476)
at
org.apache.lucene.search.NumericRangeQuery.getEnum(NumericRangeQuery.java:307)
at
org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:160)
at
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.<init>(ConstantScoreQuery.java:116)
at
org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:81)
at
org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
at
org.apache.lucene.search.IndexSearcher.searchWithFilter(IndexSearcher.java:268)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:258)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at
org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1101)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:880)
at
org.apache.solr.search.SolrIndexSearcher.access$000(SolrIndexSearcher.java:51)
at
org.apache.solr.search.SolrIndexSearcher$3.regenerateItem(SolrIndexSearcher.java:332)
at org.apache.solr.search.LRUCache.warm(LRUCache.java:194)
at
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1130)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)


regards 
rajat rastogi

--
View this message in context: 
http://lucene.472066.n3.nabble.com/error-while-replication-tp3368783p3368783.html
Sent from the Solr - User mailing list archive at Nabble.com.


multiple dateranges/timeslots per doc: modeling openinghours.

2011-09-26 Thread britske
Sorry for the somewhat lengthy post; I would like to make clear that I covered
my bases here and am looking for an alternative solution, because the more
trivial solutions don't seem to work for my use-case.

Consider bars, museums, etc.

These places have multiple opening hours that can depend on:
REQ 1. day of week
REQ 2. special days on which they are closed, or in some other way have
different opening hours than their related 'day of week'

Now, I want to model these 'places' in a way that lets me do temporal
queries like:
- which bars are open NOW (and stay open for at least another 3 hours)
- which museums are (already) open at 25-12-2011 10AM and stay open until
(at least) 3PM.

I believe having opening/closing hours available for each day at least gives
me the data needed to query the above. (Note that having
dayOfWeek*openinghours is not enough, because of the special cases in REQ 2.)

Okay, knowing I need openinghours*dates for each place, how would I format
this in documents? 

OPTION A) 
---
Considering granularity: I want documents to represent Places and not
Places*dates. Although the latter would trivially allow me to do the querying
mentioned above, it has these disadvantages:

- the same place is returned multiple times (each with a different date) when
queries are not constrained to date.
- Lots of data needs to be duplicated, all for the conceptually 'simple'
functionality of needing multiple date-ranges. It feels bad, and a simpler
solution should exist.
- Exploding the result set (documents = say, 100 dates * 1,000,000 =
100,000,000) suddenly moves the size of the result set from 'easily doable'
to 'hmm, I have to think about this'. Given that places also have some other
fields to sort on, Lucene fieldcache memory usage would explode by a factor
of 100.

OPTION B)
--
Another, faulty, option would be to model opening/closing hours in 2
multivalued date-fields, i.e. open and close, and insert open/close for each
day, e.g.:

open: 2011-11-08:1800 - close: 2011-11-09:0300
open: 2011-11-09:1700 - close: 2011-11-10:0500
open: 2011-11-10:1700 - close: 2011-11-11:0300

And queries would be of the form:

'open < now AND close > now+3h'

But since there is no way to indicate that 'open' and 'close' are pairwise
related, I will get a lot of false positives, e.g. the above document would be
returned for:

open < 2011-11-09:0100 AND close > 2011-11-09:0600

because SOME open date is before 2011-11-09:0100 (i.e. 2011-11-08:1800) and
SOME close date is after 2011-11-09:0600 (for example 2011-11-11:0300), but
these open and close dates are not pairwise related.

OPTION C) The best of what I have now:
---
I have been thinking about a totally different approach using Solr dynamic
fields, in which each and every opening and closing date gets its own
dynamic field, e.g.:

_date_2011-11-08_open: 1800
_date_2011-11-09_close: 0300
_date_2011-11-09_open: 1700
_date_2011-11-10_close: 0500
_date_2011-11-10_open: 1700
_date_2011-11-11_close: 0300

Then, the client should know the date to query, and thus the correct fields
to query. This would solve the problem, since startdate/enddate are now
pairwise related, but I fear this can be a big issue from a performance
standpoint (especially memory consumption of the Lucene fieldcache).
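
For concreteness, a rough sketch of what OPTION C could look like (a
catch-all dynamic field; with string values, zero-padded HHMM compares
correctly in range queries):

<dynamicField name="_date_*" type="string" indexed="true" stored="false"/>

and the 'open at 10AM, until at least 3PM, on 2011-12-25' query, ignoring
the past-midnight wrinkle, would be something like:

_date_2011-12-25_open:[* TO 1000] AND _date_2011-12-25_close:[1500 TO *]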


IDEAL OPTION D) 

I'm pretty sure this does not exist out-of-the-box, but it might be built as
an extension. Okay, Solr has a fieldtype 'date', but what if it also had a
fieldtype 'Daterange'? A Daterange would be modeled as <DateTimeA,DateTimeB>
or <DateTimeA,Delta DateTimeA>.

Then this problem would be easily modelled as a multivalued field
'openinghours' of type 'Daterange'.
However, I have the feeling that the standard range-query implementation
can't be used on this fieldtype, or perhaps it would have to be run for each
of the N daterange-values in 'openinghours'.

To make matters worse (I didn't want to introduce this above):
REQ 3: It may be possible that certain places have multiple opening hours /
timeslots each day. Consider a museum in Spain which closes around noon for
siesta.
OPTION D would be able to handle this natively; all the other options can't.

I would very much appreciate any pointers on:
 - how to start with option D, and whether this approach is at all feasible.
 - whether option C would suffice (excluding REQ 3), and whether I'm likely to
run into performance / memory troubles.
 - any other possible solutions I haven't thought of to tackle this.

Thanks a lot. 

Cheers,
Geert-Jan






--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiple-dateranges-timeslots-per-doc-modeling-openinghours-tp3368790p3368790.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr stopword problem in Query

2011-09-26 Thread Isan Fulia
Hi all,

I have a text field named *textForQuery*.
The following content has been indexed into Solr in the field textForQuery:
*Coke Studio at MTV*

When I fired the query
*textForQuery:(coke studio at mtv)*, the results showed 0 documents.

After running the same query in debug mode I got the following results:

<result name="response" numFound="0" start="0"/>
<lst name="debug">
<str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
<str name="querystring">textForQuery:(coke studio at mtv)</str>
<str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
<str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>

Why did the query not match any document even though there is a document
with the value of textForQuery as *Coke Studio at MTV*?
Is this because of the stopword *at* present in the stopword list?



-- 
Thanks & Regards,
Isan Fulia.


RE: How to map database table for faceted search?

2011-09-26 Thread Chorherr Nikolaus
Thx for your response, we will try dynamic fields for this

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, September 24, 2011 9:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to map database table for faceted search?

In general, you flatten the data when you put things into Solr. I know
that's anathema
to DB training, but this is searching <g>...

If you have a reasonable number of distinct column names, you could
just define your
schema to have an entry for each and index the associated values that way. Then
your facets become easy, you're just faceting on the facet_hobby
field in your example.

If that's impractical (say you can add arbitrary columns), you can do
something very similar
with dynamic fields.

You could also create a field with the column/name pairs (watch your
tokenizer!) in a single field and facet by prefix, where the prefix is the
column name (e.g. index tokens like
hobby_sailing hobby_camping interest_reading, then facet with
facet.prefix=hobby_).
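
A request along those lines might look like this (the attrs field name is
made up):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=attrs&facet.prefix=hobby_

which returns only the attrs facet values that start with hobby_, i.e. the
per-hobby counts.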

There are tradeoffs for each that you'll have to experiment with.

Note that there is no penalty in Solr for defining fields in your
schema but not using
them.

Best
Erick

On Fri, Sep 23, 2011 at 12:06 AM, Chorherr Nikolaus
nikolaus.chorh...@umweltbundesamt.at wrote:
 Hi All!

 We are working with Solr for the first time and have a simple data model

 Entity Person(column surname) has 1:n Attribute(column name) has 1:n 
 Value(column text)

 We need faceted search on the content of Attribute:name, not on Attribute:name
 itself. E.g., if an Attribute of a person has name=hobby, we would like to have
 something like ...facet=true&facet.name=hobby and get back
 all related Values with counts. (We do not need facet.name=name, getting back
 all distinct values of the name column of Attribute.)

 How do we have to map our database, define our document, and/or define our
 schema?

 Any help is highly appreciated - Thx in advance

 Niki





Re: Seek your wisdom for implementing 12 million docs..

2011-09-26 Thread Toke Eskildsen
On Sun, 2011-09-25 at 22:00 +0200, Ikhsvaku S wrote:
 Documents: We have close to ~12 million XML docs of varying sizes, average
 size 20 KB. These documents have 150 fields, which should be searchable &
 indexed. [...] Approximately ~6000 such documents are updated & 400-800 new
 ones are added each day

 Queries: [...] Also each one would want to grab as many result rows as 
 possible
 (we are limiting this to 2000). The output shall contain only 1-5 fields.

Except for the result rows (which I guess is equal to returned documents
in Solr-world), nothing you say raises any alarms. It actually sounds
very much like our local index (~10M documents, ~100 fields, 10.000+
updates/day) at the State and University Library, Denmark.

 Available hardware:
 Some of the existing hardware we could find consists of ~300GB SAN each
 on 4 boxes with ~96GB each. We also have a couple of older HP DL380s (mainly
 to use for offline indexing). All of this is on 10G Ethernet.

Yikes! We only use two mirrored machines for fallback, not performance.
They have 16GB each and handle index updates as well as searches. The
indexes (~60GB) reside on local SSDs.

 Questions:
 Our priority is to provide results fast, [...]

What is "fast" in milliseconds, and how many queries/second do you
anticipate? From what you're telling us, your hardware looks like overkill.
However, as Erick says, your mileage may vary: try stuffing all your data
into your mock-up and see what happens - it shouldn't take long, and you
might discover that your test machine is perfectly capable of handling
it all alone.



SOLR Index Speed

2011-09-26 Thread Lord Khan Han
Hi,

We have 500K web documents and are using Solr (trunk) to index them. We have
a special analyzer which is a little CPU-heavy.
Our machine config:

32 x CPU
32 GB RAM
SAS HD

We are sending documents with 16 reduce clients (from Hadoop) to the
standalone Solr server. The problem is we couldn't get faster than 500 docs
per sec; 500K documents took 7-8 hours to index :(

While indexing, the Solr server CPU load is around 5-6 (32 max), which
means about 20% of the total CPU power. We have plenty of RAM...

I turned off auto commit and gave an 8198 RAM buffer... there is no I/O wait.

How can I make it faster?

PS: solr streamindex is not an option because we need to submit javabin...

thanks..


Re: NRT and commit behavior

2011-09-26 Thread Vadim Kisselmann
Tirthankar,

are you indexing 1. smaller docs or 2. books?
If 1: your caches are too big for your memory, as Erick already said.
Try to allocate 10GB for the JVM, leave 14GB for your HDD cache, and make
your caches smaller.

If 2: read the blog posts on hathitrust.org:
http://www.hathitrust.org/blogs/large-scale-search

Regards
Vadim


2011/9/24 Erick Erickson erickerick...@gmail.com

 No <g>. The problem is that the number of documents isn't a reliable
 indicator of resource consumption. Consider the difference between
 indexing a twitter message and a book. I can put a LOT more docs
 of 140 chars on a single machine of size X than I can books.

 Unfortunately, the only way I know of is to test. Use something like
 jMeter or SolrMeter to fire enough queries at your machine to
 determine when you're over-straining resources and shard at that
 point (or get a bigger machine <g>).

 Best
 Erick

 On Wed, Sep 21, 2011 at 8:24 PM, Tirthankar Chatterjee
 tchatter...@commvault.com wrote:
  Okay, but is there any number such that, if we reach it on the index size,
 total docs in the index, or the size of physical memory, sharding should be
 considered?
 
  I am trying to find the winning combination.
  Tirthankar
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Friday, September 16, 2011 7:46 AM
  To: solr-user@lucene.apache.org
  Subject: Re: NRT and commit behavior
 
  Uhm, you're putting  a lot of index into not very much memory. I really
 think you're going to have to shard your index across several machines to
 get past this problem. Simply increasing the size of your caches is still
 limited by the physical memory you're working with.
 
  You really have to put a profiler on the system to see what's going on.
 At that size there are too many things that it *could* be to definitively
 answer it with e-mails
 
  Best
  Erick
 
  On Wed, Sep 14, 2011 at 7:35 AM, Tirthankar Chatterjee 
 tchatter...@commvault.com wrote:
  Erick,
   Also, here is our solrconfig, where we have tried increasing the
  caches. Setting the autowarmCount values below to 0 helps the
  commit call return within a second, but that will slow us down on searches:
 
  <filterCache
    class="solr.FastLRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="4096"/>

  <!-- Cache used to hold field values that are quickly accessible
   by document id.  The fieldValueCache is created by default
   even if not configured here.
    <fieldValueCache
      class="solr.FastLRUCache"
      size="512"
      autowarmCount="128"
      showItems="32"
    />
  -->

  <!-- queryResultCache caches results of searches - ordered lists of
   document ids (DocList) based on a query, a sort, and the range
   of documents requested.  -->
  <queryResultCache
    class="solr.LRUCache"
    size="16384"
    initialSize="4096"
    autowarmCount="4096"/>

  <!-- documentCache caches Lucene Document objects (the stored fields
   for each document). Since Lucene internal document ids are transient,
   this cache will not be autowarmed.  -->
  <documentCache
    class="solr.LRUCache"
    size="512"
    initialSize="512"
    autowarmCount="512"/>
 
  -Original Message-
  From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com]
  Sent: Wednesday, September 14, 2011 7:31 AM
  To: solr-user@lucene.apache.org
  Subject: RE: NRT and commit behavior
 
  Erick,
   Here are the answers to your questions:
   Our index is 267 GB.
   We are not optimizing...
   No, we have not profiled yet to check the bottleneck, but logs indicate
  opening the searchers is taking time...
   Nothing except Solr.
   Total memory is 16GB; Tomcat has 8GB allocated. Everything is 64-bit: OS,
   JVM, and Tomcat.
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Sunday, September 11, 2011 11:37 AM
  To: solr-user@lucene.apache.org
  Subject: Re: NRT and commit behavior
 
   Hmm, OK. You might want to look at the non-cached filter query stuff;
  it's quite recent.
   The point here is that it is a filter that is applied only after all of
  the less expensive filter queries are run. One of its uses is exactly ACL
  calculations. Rather than calculate the ACL for the entire doc set, it only
  calculates access for docs that have made it past all the other elements of
  the query. See SOLR-2429 and note that it is 3.4 (currently being
  released) only.
 
  As to why your commits are taking so long, I have no idea given that you
 really haven't given us much to work with.
 
  How big is your index? Are you optimizing? Have you profiled the
 application to see what the bottleneck is (I/O, CPU, etc?). What else is
 running on your machine? It's quite surprising that it takes that long. How
 much memory are you giving the JVM? etc...
 
  You might want to review:
  http://wiki.apache.org/solr/UsingMailingLists
 
  Best
  Erick
 
 
  On Fri, Sep 9, 2011 at 9:41 AM, 

RE: Best Solr escaping?

2011-09-26 Thread Bob Sandiford
I won't guarantee this is the 'best algorithm', but here's what we use.  (This 
is in a final class with only static helper methods):

// Set of characters / Strings SOLR treats as having special meaning in a
// query, and the corresponding Escaped versions.
// Note that the actual operators '&&' and '||' don't show up here - we'll
// just escape the characters '&' and '|' wherever they occur.
private static final String[] SOLR_SPECIAL_CHARACTERS = new String[] {"+",
"-", "&", "|", "!", "(", ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?",
":", "\\"};
private static final String[] SOLR_REPLACEMENT_CHARACTERS = new String[]
{"\\+", "\\-", "\\&", "\\|", "\\!", "\\(", "\\)", "\\{", "\\}", "\\[", "\\]",
"\\^", "\\\"", "\\~", "\\*", "\\?", "\\:", ""};


/**
 * Escapes all special characters from the Search Terms, so they don't get
 * confused with
 * the Solr query language special characters.
 * @param value - Search Term to escape
 * @return - escaped Search value, suitable for a Solr q parameter
 */
public static String escapeSolrCharacters(String value)
{
return StringUtils.replaceEach(value, SOLR_SPECIAL_CHARACTERS, 
SOLR_REPLACEMENT_CHARACTERS);
}
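
As a usage sketch, taking one of the problem strings from Bill's post:

String escaped = escapeSolrCharacters("(Dr. Holstein");
// escaped is now \(Dr. Holstein - the parenthesis no longer unbalances the query

Note that whitespace is deliberately not on the list, and the final entry
strips backslashes instead of escaping them.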

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

 -Original Message-
 From: Bill Bell [mailto:billnb...@gmail.com]
 Sent: Sunday, September 25, 2011 12:22 AM
 To: solr-user@lucene.apache.org
 Subject: Best Solr escaping?
 
 What is the best algorithm for escaping strings before sending to Solr?
 Does
 someone have some code?
 
  A few things I have witnessed in q using the DIH handler:
  * Double quotes - " that are not balanced can cause several issues, from an
  error (strip the double quote?) to no results.
  * Should we use + or %20 - and what cases make sense:
   * "Dr. Phil Smith" or Dr.+Phil+Smith or Dr.%20Phil%20Smith - also
  what is the impact of double quotes?
  * Unmatched parentheses, i.e. opening ( and not closing:
   * (Dr. Holstein
   * Cardiologist+(Dr. Holstein
  Regular encoding of strings does not always work for the whole string due to
  several issues like white space:
  * White space works better when we use a backslash (Bill\ Bell),
  especially when using facets.
 
 Thoughts? Code? Ideas? Better Wikis?
 
 




Re: email - DIH

2011-09-26 Thread jb
Hi Alonso, Gora,

I ran into the same problem with the MailEntityProcessor.
I have an email folder called Test. Inside there are only two messages.
When I run the DIH everything looks fine, except that the two emails don't
get indexed.

Is there any additional information on this problem?

I'm using Solr 3.4.0 (earlier versions have the same problem)

Here is my config:


<dataConfig>
  <document>
    <entity name="email" transformer="TemplateTransformer"
            processor="MailEntityProcessor" user="s...@zahn-gmbh.de" password="SHI-Test"
            host="mail.zahn-gmbh.de" protocol="imap" folders="*"
            fetchMailsSince="2000-01-01 00:00:00"
            deltaFetch="false" processAttachement="false"
            batchSize="100" fetchSize="1024" recurse="true">
      <field column="id" template="email-${email.messageId}"/>
      <field column="quelle" template="Email"/>
      <field column="title" template="${email.subject}"/>
      <field column="author" template="${email.from}"/>
      <field column="last_modified" template="${email.sentDate}"
             dateTimeFormat="yyyy-MM-dd hh:mm:ss"/>
      <field column="text" template="${email.content}"/>
      <field column="content_type" template="Email"/>
      <field column="quelle" template="Comunigate"/>
      <field column="doctype" template="Email"/>
    </entity>
  </document>
</dataConfig>


And here is my response (using the command
http://localhost:8080/apache-solr-3.4.0/dataimport-mail?command=full-import&commit=true):


26.09.2011 15:52:53 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-3.4.0 path=/dataimport-mail
params={commit=true&command=full-import} status=0 QTime=16 
26.09.2011 15:52:53 org.apache.solr.handler.dataimport.DataImporter
doFullImport
INFO: Starting Full Import
26.09.2011 15:52:53 org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
INFO: Read dataimport-mail.properties
26.09.2011 15:52:53 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
26.09.2011 15:52:53 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1

commit{dir=H:\_Projekt.lfd\zahn\solr_home_34\data\index,segFN=segments_4,version=1317035795833,generation=4,filenames=[segments_4]
26.09.2011 15:52:53 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1317035795833
26.09.2011 15:52:53 org.apache.solr.handler.dataimport.MailEntityProcessor
logConfig
INFO: user : s...@zahn-gmbh.de
pwd : SHI-Test
protocol : imap
host : mail.zahn-gmbh.de
folders : Test
recurse : true
exclude : []
include : []
batchSize : 20
fetchSize : 1024
read timeout : 6
conection timeout : 3
custom filter : 
fetch mail since : Sat Jan 01 00:00:00 CET 2000

26.09.2011 15:52:54 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start
commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
26.09.2011 15:52:54 org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2

commit{dir=H:\_Projekt.lfd\zahn\solr_home_34\data\index,segFN=segments_4,version=1317035795833,generation=4,filenames=[segments_4]

commit{dir=H:\_Projekt.lfd\zahn\solr_home_34\data\index,segFN=segments_5,version=1317035795834,generation=5,filenames=[segments_5]
26.09.2011 15:52:54 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1317035795834
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@17af46e main
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@17af46e main from Searcher@5e8d7d main

fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=1,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0,item_doctype={field=doctype,memSize=4224,tindexSize=32,time=0,phase1=0,nTerms=0,bigTerms=0,termInstances=0,uses=2}}
26.09.2011 15:52:54 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@17af46e main

fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@17af46e main from Searcher@5e8d7d main

filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=2,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@17af46e main

filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
26.09.2011 15:52:54 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming 

Re: mlt content stream help

2011-09-26 Thread dan whelan

On 9/24/11 12:17 PM, Erick Erickson wrote:

What version of Solr?

I am using solr 3.2


When you copied the default, did you set up
default values for MLT?


This is what I need help with.

How should the request handler / solrconfig be setup?



Showing us the request you used


The request is exactly the same as the url in the wiki using the example 
solr / exampledocs



and the relevant portions of your
solrconfig file would help a lot, you might want to review:

http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Thu, Sep 22, 2011 at 9:08 AM, dan whelan d...@adicio.com wrote:

I would like to use MLT and the content stream feature in solr like on this
page:

http://wiki.apache.org/solr/MoreLikeThisHandler

How should the request handler / solrconfig be setup?

I enabled streaming and I set a requestHandler up by copying the default
request handler and I changed the name to:

name="/mlt"

but when accessing the url like the example on the wiki I get a NPE because
q is not supplied

I'm sure I am just doing it wrong just not sure what.

Thanks,

dan





Re: Update ingest rate drops suddenly

2011-09-26 Thread eks dev
Just to bring closure to this one: we were slurping data from the
wrong DB (hardly a desktop-class machine)...

Solr did not cough on 41Mio records @ 34k updates/sec, single threaded.
Great!



On Sat, Sep 24, 2011 at 9:18 PM, eks dev eks...@yahoo.co.uk wrote:
 just looking for hints where to look for...

 We were testing single-threaded ingest rate on Solr, trunk version, on
 an atypical collection (a lot of small documents), and we noticed
 something we are not able to explain.

 Setup:
 We use defaults for index settings; Windows 64-bit, JDK 7 u2, on SSD,
 a machine with enough memory and 8 cores. The schema has 5 stored fields,
 4 of them indexed with no positions and no norms.
 Average net document size (optimized index size / number of documents)
 is around 100 bytes.

 On a test with 40 Mio documents:
 - we had an update ingest rate on the first 4.4Mio documents @ an incredible
 34k records / second...
 - then it dropped suddenly to 20k records per second, and this rate
 remained stable (variance 1k) until...
 - we hit 13Mio, where the ingest rate dropped again really hard, from one
 instant in time to another, to 10k records per second.

 It stayed there until we reached the end @40Mio (slightly decreasing, to
 ca. 9k, but this was not long enough to see a trend).

 Nothing unusual was happening with JVM memory (sawtooth 200-450M, fully
 regular). CPU in turn was following the ingest rate trend, indicating
 that we were waiting on something. No searches, no commits, nothing.

 autoCommit was turned off. Updates were streaming directly from the database.

 -
 I did not expect something like this, knowing Lucene merges in the
 background. Also, having such sudden drops in ingest rate indicates
 that we are not leaking something (the drop would have been
 much more gradual). It is some cache, but why two really significant
 drops? 34k/sec to 20k and then to 10k... We would love to keep it
 @34k/second :)

 I am not really acquainted with the new MergePolicy and flushing
 settings, but I suspect this is something there we could tweak.

 Could it be that Windows is somehow, hmm, quirky with the Solr default
 directory on win64/jvm (I think it is MMAP by default)... We did not
 saturate IO with such small documents, I guess; it is just a couple
 of gigs over 1-2 hours.

 All in all, it works well, but are such hard drops in update ingest
 rate normal?

 Thanks,
 eks.



Re: Solr stopword problem in Query

2011-09-26 Thread Bill Bell
This is a pretty serious issue.

Bill Bell
Sent from mobile


On Sep 26, 2011, at 4:09 AM, Isan Fulia isan.fu...@germinait.com wrote:

 Hi all,
 
 I have a text field named *textForQuery*.
 The following content has been indexed into Solr in the field textForQuery:
 *Coke Studio at MTV*
 
 When I fired the query
 *textForQuery:(coke studio at mtv)*, the results showed 0 documents.
 
 After running the same query in debug mode I got the following results:
 
 <result name="response" numFound="0" start="0"/>
 <lst name="debug">
 <str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
 <str name="querystring">textForQuery:(coke studio at mtv)</str>
 <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
 <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>
 
 Why did the query not match any document even though there is a document
 with the value of textForQuery as *Coke Studio at MTV*?
 Is this because of the stopword *at* present in the stopword list?
 
 
 
 -- 
 Thanks & Regards,
 Isan Fulia.


Re: mlt content stream help

2011-09-26 Thread Erick Erickson
Please don't say "it's just like the example". If it was,
then it would most likely be working.

If you don't take the time to show us what you've tried,
and the results you get back, then there's not much we
can do to help.

Best
Erick

On Mon, Sep 26, 2011 at 7:18 AM, dan whelan d...@adicio.com wrote:
 On 9/24/11 12:17 PM, Erick Erickson wrote:

 What version of Solr?

 I am using solr 3.2

 When you copied the default, did you set up
 default values for MLT?

 This is what I need help with.

 How should the request handler / solrconfig be setup?


 Showing us the request you used

 The request is exactly the same as the url in the wiki using the example
 solr / exampledocs

 and the relevant portions of your
 solrconfig file would help a lot, you might want to review:

 http://wiki.apache.org/solr/UsingMailingLists

 Best
 Erick

 On Thu, Sep 22, 2011 at 9:08 AM, dan whelan d...@adicio.com wrote:

 I would like to use MLT and the content stream feature in solr like on
 this
 page:

 http://wiki.apache.org/solr/MoreLikeThisHandler

 How should the request handler / solrconfig be setup?

 I enabled streaming and I set a requestHandler up by copying the default
 request handler and I changed the name to:

 name="/mlt"

 but when accessing the url like the example on the wiki I get a NPE
 because
 q is not supplied

 I'm sure I am just doing it wrong just not sure what.

 Thanks,

 dan





drastic performance decrease with 20 cores

2011-09-26 Thread Bictor Man
Hi everyone,

Sorry if this issue has been discussed before, but I'm new to the list.

I have a solr (3.4) instance running with 20 cores (around 4 million docs
each).
The instance has 13GB allocated on a 16GB RAM server. If I run several sets
of queries sequentially in each of the cores, the I/O access goes very high,
as does the system load, while the CPU percentage remains low.
It takes almost 1 hour to complete the set of queries.

If I stop solr and restart it with 6GB allocated and 10 cores, after a bit
the I/O access goes down and the CPU goes up, taking only around 5 minutes
to complete all sets of queries.

Meaning that for me it is MUCH more performant to have 2 Solr instances
running with half the data and half the memory than a single instance with
all the data and memory.

It would be even way faster to have 1 instance with half the cores/memory,
run the queries, shut it down, start a new instance, and repeat the process
than to have one big instance running everything.

Furthermore, if I take the 20cores/13GB instance, unload 10 of the cores,
trigger the garbage collector and run the sets of queries again, the
behavior still remains slow taking like 30 minutes.

Am I missing something here? Does Solr change its caching policy depending
on the number of cores at startup, or something similar?

Any hints will be very appreciated.

Thanks


Solr Cloud Number of Shard Limitation?

2011-09-26 Thread Jamie Johnson
Is there any limitation, be it technical or for sanity reasons, on the
number of shards that can be part of a solr cloud implementation?


Re: mlt content stream help

2011-09-26 Thread dan whelan

OK. This is exactly what I did.

With a fresh download of solr 3.2

unpack and go to example directory

start solr: java -jar start.jar

then go to exampledocs and run: ./post.sh *.xml

Then go here:

http://localhost:8983/solr/mlt?stream.body=electronics%20memory&mlt.fl=manu,cat&mlt.interestingTerms=list&mlt.mintf=0

Problem accessing /solr/mlt. Reason:
NOT_FOUND


The page gives no instructions on setting up mlt, or the URL is incorrect.





On 9/26/11 8:25 AM, Erick Erickson wrote:

Please don't say "it's just like the example". If it was,
then it would most likely be working.

If you don't take the time to show us what you've tried,
and the results you get back, then there's not much we
can do to help.

Best
Erick

On Mon, Sep 26, 2011 at 7:18 AM, dan whelan d...@adicio.com wrote:

On 9/24/11 12:17 PM, Erick Erickson wrote:

What version of Solr?

I am using solr 3.2


When you copied the default, did you set up
default values for MLT?

This is what I need help with.

How should the request handler / solrconfig be setup?



Showing us the request you used

The request is exactly the same as the url in the wiki using the example
solr / exampledocs


and the relevant portions of your
solrconfig file would help a lot, you might want to review:

http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Thu, Sep 22, 2011 at 9:08 AM, dan whelan d...@adicio.com wrote:

I would like to use MLT and the content stream feature in solr like on
this
page:

http://wiki.apache.org/solr/MoreLikeThisHandler

How should the request handler / solrconfig be setup?

I enabled streaming and I set a requestHandler up by copying the default
request handler and I changed the name to:

name="/mlt"

but when accessing the url like the example on the wiki I get a NPE
because
q is not supplied

I'm sure I am just doing it wrong just not sure what.

Thanks,

dan







Re: Solr stopword problem in Query

2011-09-26 Thread Rahul Warawdekar
Hi Isan,

Does your search return any documents when you remove the 'at' keyword and
just search for "Coke studio MTV"?
Also, can you please provide the snippet of your schema.xml file where you
have defined this field name and its type?

On Mon, Sep 26, 2011 at 6:09 AM, Isan Fulia isan.fu...@germinait.com wrote:

 Hi all,

 I have a text field named *textForQuery*.
 The following content has been indexed into Solr in the field textForQuery:
 *Coke Studio at MTV*

 When I fired the query
 *textForQuery:(coke studio at mtv)*, the results showed 0 documents.

 After running the same query in debug mode I got the following results:

 <result name="response" numFound="0" start="0"/>
 <lst name="debug">
 <str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
 <str name="querystring">textForQuery:(coke studio at mtv)</str>
 <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
 <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>

 Why did the query not match any document even though there is a document
 with the value of textForQuery as *Coke Studio at MTV*?
 Is this because of the stopword *at* present in the stopword list?



 --
 Thanks & Regards,
 Isan Fulia.




-- 
Thanks and Regards
Rahul A. Warawdekar


solr DIH for mongodb

2011-09-26 Thread Kiwi de coder
hi,

do we have any DIH plugin for MongoDB?

regards,
kiwi


Re: drastic performance decrease with 20 cores

2011-09-26 Thread Shawn Heisey

On 9/26/2011 9:33 AM, Bictor Man wrote:

Hi everyone,

Sorry if this issue has been discussed before, but I'm new to the list.

I have a solr (3.4) instance running with 20 cores (around 4 million docs
each).
The instance has allocated 13GB in a 16GB RAM server. If I run several sets
of queries sequentially in each of the cores, the I/O access goes very high,
so does the system load, while the CPU percentage remains low.
It takes almost 1 hour to complete the set of queries.

If I stop solr and restart it with 6GB allocated and 10 cores, after a bit
the I/O access goes down and the CPU goes up, taking only around 5 minutes
to complete all sets of queries.


With 13 of your 16GB of RAM being gobbled up by the Java process running 
Solr, and some of your memory taken up by the OS itself, you've probably 
only got about 2GB of free RAM left for the OS disk cache.  Not knowing 
what kind of data you're indexing, I can only guess how big your indexes 
are, but with around 80 million total documents, I imagine that it is 
MUCH larger than 2GB.


If I'm right, this means that your Solr server is unable to keep index 
data in RAM, so it ends up going out to the disk every time it needs to 
make a query, and that is SLOW.  The ideal situation is to have enough 
free memory so that the OS can put all index data into its disk cache, 
making access to it nearly instantaneous.  You may never reach that 
ideal with your setup, but if you can get between a third and half the 
index into RAM, it'll probably still perform well.


Do you really need to allocate 13GB to Solr?  If it crashes when you 
allocate less, you may have very large Solr caches in in solrconfig.xml 
that you can reduce.  You do want to take advantage of Solr caching, but 
if you have to choose between disk caching and Solr caching, go for disk.


It's unusual, but not necessarily wrong, to have so many large cores on 
one machine.  Why are things set up that way?  Are you using a 
distributed index, or do you have 20 separate indexes?


The bottom line - you need more memory.  Running with 32GB or even 64GB 
would probably serve you very well.  You probably also need more 
machines.  For redundancy purposes, you'll want to have two complete 
copies of your index on separate hardware and some kind of load balancer 
with failover capability.  You may also want to look into increasing 
your I/O speed, with 15k RPM SAS drives, RAID10, or even SSD.


Depending on the needs of your application, you may be able to decrease 
your index size by changing your schema and re-indexing, especially in 
the area of stored fields.  Typically what you want to do is store only 
the data required to construct a search results grid, and go to the 
original data source for full details when someone opens a specific 
result.  You can also look into changing the field types on your index 
to remove Lucene features you don't need.


The needs of every Solr installation are different, and even my advice 
might be wrong for your particular setup, but you can rarely go wrong by 
adding memory.


Thanks,
Shawn



Re: drastic performance decrease with 20 cores

2011-09-26 Thread François Schiettecatte
You have not said how big your index is, but I suspect that allocating 13GB for
your 20 cores is starving the OS of memory for caching file data. Have you
tried 6GB with 20 cores? I suspect you will see the same performance as 6GB &
10 cores.

Generally it is better to allocate just enough memory to SOLR to run optimally 
rather than as much as possible. 'Just enough' depends as well. You will need 
to try out different allocations and see where the sweet spot is.

Cheers

François


On Sep 26, 2011, at 9:53 AM, Bictor Man wrote:

 Hi everyone,
 
 Sorry if this issue has been discussed before, but I'm new to the list.
 
 I have a solr (3.4) instance running with 20 cores (around 4 million docs
 each).
 The instance has allocated 13GB in a 16GB RAM server. If I run several sets
 of queries sequentially in each of the cores, the I/O access goes very high,
 so does the system load, while the CPU percentage remains always low.
 It takes almost 1 hour to complete the set of queries.
 
 If I stop solr and restart it with 6GB allocated and 10 cores, after a bit
 the I/O access goes down and the CPU goes up, taking only around 5 minutes
 to complete all sets of queries.
 
 Meaning that for me is MUCH more performant having 2 solr instances running
 with half the data and half the memory than a single instance will all the
 data and memory.
 
 It would be even way faster to have 1 instance with half the cores/memory,
 run the queues, shut it down, start a new instance and repeat the process
 than having a big instance running everything.
 
 Furthermore, if I take the 20cores/13GB instance, unload 10 of the cores,
 trigger the garbage collector and run the sets of queries again, the
 behavior still remains slow taking like 30 minutes.
 
 am I missing something here? does solr change its caching policy depending
 on the number of cores at startup or something similar?
 
 Any hints will be very appreciated.
 
 Thanks,
 Victor



how to implemente a query like like '%pattern%'

2011-09-26 Thread libnova
Hi all.

how can we do a query similar to 'like'?


if I have this phrase as a single token in the index: "This phrase has
various words" (using KeywordTokenizerFactory)
and I'd like an exact match of "phrase has various" or "various words", for
instance...

How can I do this??

Thanks a lot.

Rode.






Re: SOLR error with custom FacetComponent

2011-09-26 Thread Chris Hostetter
: 
: Unfortunately the facet fields are not static. The field are dynamic SOLR
: fields and are generated by different applications.
: The field names will be populated into a data store (like memcache) and
: facets have to be driven from that data store.
: 
: I need to write a Custom FacetComponent which picks up the facet fields from
: the data store.

It sounds like you don't need custom facet *code*, you just need to 
dynamically decide what fields to facet on -- i would suggest in that case 
that instead of subclassing FacetComponent you instead write a standalone 
SearchComponent that you configure to run before the 
FacetComponent, which would modify the request params to add the new 
facet.field (and any f.*.facet.field.*) params you decide you want to 
use at run time -- the more you can decouple your custom code from the 
existing code, the fewer maintenance headaches you are likely to have.
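
For illustration, a rough sketch of such a component (the class name and the
lookupFacetFields() helper are made up; it would be registered in
solrconfig.xml and listed in first-components for your handler so it runs
before the stock FacetComponent):

import java.io.IOException;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class DynamicFacetFieldsComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
    params.set("facet", true);
    // pull the field names from the external data store (memcache, etc.)
    for (String f : lookupFacetFields()) {
      params.add("facet.field", f);
    }
    rb.req.setParams(params);  // the stock FacetComponent now sees them
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // nothing to do here; FacetComponent does the actual faceting
  }

  private String[] lookupFacetFields() {
    return new String[] { "facet_hobby" };  // placeholder for the data-store call
  }

  @Override
  public String getDescription() { return "adds facet.field params at runtime"; }
  @Override
  public String getSource() { return "$URL$"; }
  @Override
  public String getSourceId() { return "$Id$"; }
  @Override
  public String getVersion() { return "1.0"; }
}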

as for your original problem

:  I'm getting an error saying Error instantiating SearchComponent My Custom
:  Class is not a org.apache.solr.handler.component.SearchComponent.
: 
:  My custom class inherits from *FacetComponent* which extends from *
:  SearchComponent*.

...this sounds like it is likely a problem with the classloaders -- even 
though you subclass FacetComponent, if a different branch of 
the classloader tree loads your custom code, it may not recognize that the 
FacetComponent class instance you subclass is the same as the 
FacetComponent class it already knows about.

where exactly did you put the class/jar containing your subclass? did you 
specify a <lib/> directive in your solrconfig.xml for it?

if you added/moved/copied *any* jars into example/lib that's a good tip 
off that you made a mistake...

https://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins


-Hoss


aggregate functions in Solr?

2011-09-26 Thread Esteban Donato
Hello guys,

  I need to implement functionality that requires something similar
to aggregate functions in SQL. My Solr schema looks like this:

-doc_id: integer
-date: date
-value1: integer
-value2: integer

  Basically the index contains some numerical values (value1, value2,
etc.) per doc and date. Given a date range query, I need to return
some stats consolidated by doc for that date range. A typical
response could be something like this:

doc_id, sum(value1), avg(value2), sum(value1)/sum(value2)

  I checked StatsComponent using stats.facet=doc_id, but it seems it
doesn't cover my needs (especially for complex stats like
sum(value1)/sum(value2)). I also checked FieldCollapsing, but I couldn't
find a way to configure an aggregate function there.
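
For reference, such a StatsComponent request would look something like this
(the date range is made up and would need URL-encoding):

http://localhost:8983/solr/select?q=date:[2011-01-01T00:00:00Z TO 2011-06-30T23:59:59Z]&rows=0&stats=true&stats.field=value1&stats.field=value2&stats.facet=doc_id

It returns sum, mean, count, etc. per doc_id for each stats.field, so a
ratio like sum(value1)/sum(value2) can at least be derived client-side from
the two sums.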

  Is there any way to implement this, or will I have to resolve it outside of Solr?

Regards,
Esteban


Re: Unique Key error on trunk

2011-09-26 Thread Viswa S

You can replicate it with the example app by replacing the id definition in
schema.xml with

<field name="id" type="uuid" indexed="true" stored="true" default="NEW" />

removing the id field from one of the example doc XML files, and posting it to Solr.
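
For example (assuming the stock example setup), delete the id element from
exampledocs/monitor.xml and run

./post.sh monitor.xml

and the add should fail with the SolrException quoted below.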

Thanks
Viswa

On Sep 26, 2011, at 12:15 AM, Viswa S wrote:

 Hello,
 
 We use solr.UUIDField to generate unique ids. Using the latest trunk (change 
 list 1163767) seems to throw an error "Document is missing mandatory 
 uniqueKey field: id". The schema is set up to generate an id field on updates: 
 
<field name="id" type="uuid" indexed="true" stored="true" default="NEW" />
 
 Thanks
 Viswa
 
 SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory 
 uniqueKey field: id
   at 
 org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:80)
   at 
 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:145)
   at 
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
   at 
 org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:127)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
   at 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
   at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1406)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
   at 
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
   at 
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
   at 
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at 
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
   at 
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
   at 
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
   at 
 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at 
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
   at org.mortbay.jetty.Server.handle(Server.java:326)
   at 
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
   at 
 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
   at 
 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
   at 
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
 



Re: mlt content stream help

2011-09-26 Thread Chris Hostetter

Dan:

The disconnect here seems to be that these examples urls on the 
MoreLikeThisHandler wiki page assume a /mlt request handler exists, but 
no handler by that name has ever actually existed in the solr example 
configs.  (the wiki page doesn't explicitly state that those URLs will 
work with the example configs, but it certainly suggests it) 

Instead of copying the *default* request handler config (using 
SearchHandler) verbatim, you need to create a handler declaration that 
uses the MoreLikeThisHandler class, ala...

  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  </requestHandler>

...you can add more configuration (to specify things like default params 
and what not) but that's the minimum config that you need to get the MLT 
Handler up and running.
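
For instance, default params could be declared like this (the values shown
here are just an example):

  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
    <lst name="defaults">
      <str name="mlt.fl">manu,cat</str>
      <int name="mlt.mintf">1</int>
      <int name="mlt.mindf">1</int>
    </lst>
  </requestHandler>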

I've updated the wiki page to reflect this -- thanks for helping to catch 
the mistake

: http://wiki.apache.org/solr/MoreLikeThisHandler
: 
: How should the request handler / solrconfig be setup?
: 
: I enabled streaming and I set a requestHandler up by copying the
: default
: request handler and I changed the name to:
: 
: name="/mlt"
: 
: but when accessing the url like the example on the wiki I get a NPE
: because
: q is not supplied


-Hoss


RE: SOLR Index Speed

2011-09-26 Thread Jaeger, Jay - DOT
500 / second would be 1,800,000 per hour (much more than 500K documents).

1)  how big is each document?
2)  how big are your index files?
3)  as others have recently written, make sure you don't give your JRE so much 
memory that your OS is starved for memory to use for file system cache.

JRJ

-Original Message-
From: Lord Khan Han [mailto:khanuniver...@gmail.com] 
Sent: Monday, September 26, 2011 6:09 AM
To: solr-user@lucene.apache.org
Subject: SOLR Index Speed

Hi,

We have 500K web documents and are using Solr (trunk) to index them. We have
a special analyzer which is a little CPU-heavy.
Our machine config:

32 x CPU
32 GB RAM
SAS HD

We are sending documents with 16 reduce clients (from Hadoop) to the
standalone Solr server. The problem is we couldn't get faster than 500 docs
per sec; 500K documents took 7-8 hours to index :(

While indexing, the Solr server CPU load is around 5-6 (32 max), which
means about 20% of the total CPU power. We have plenty of RAM...

I turned off auto commit and gave an 8198 RAM buffer... there is no I/O wait.

How can I make it faster?

PS: solr streamindex is not an option because we need to submit javabin...

thanks..


How to reserve ids?

2011-09-26 Thread Gabriele Kahlout
Hello,

While indexing there are certain urls/ids I'd never want to appear in the
search results (so never be indexed). Is there already a 'supported by design'
mechanism for that to point me to, or should I just create this blacklist
as a processor in the update chain?

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Boost Exact matches on Specific Fields

2011-09-26 Thread balaji
Hi all

I am new to SOLR and have a doubt on boosting the exact terms to the top
on a particular field.

For example:

 I have a text field named ts_category and I want to give more boost to
this field than to other fields, so in my query I pass the following in
the qf params: qf=body^4.0 title^5.0 ts_category^21.0, and also sort on
SCORE desc.

 When I do a search for Hospitals, I get Hospitalization
Management, Hospital Equipment & Supplies on top rather than the exact
matches for Hospitals.

  So it would be great if I could be helped over here.


Thanks
Balaji

Thanks in Advance
Balaji

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-Exact-matches-on-Specific-Fields-tp3370513p3370513.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to implemente a query like like '%pattern%'

2011-09-26 Thread Tomás Fernández Löbbe
If you need those kinds of searches then you should probably not be using
the KeywordTokenizerFactory. Is there any reason why you can't switch to a
WhitespaceTokenizer, for example? Then you could use a simple phrase query
for your search case. If you need everything as a single token, you could use a
copyField to duplicate the field and have them both.

Are those acceptable options for you?
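
A minimal sketch of what that could look like in schema.xml (field and type
names are illustrative, not from the original message):

  <!-- whitespace-tokenized field for phrase queries -->
  <field name="phrase" type="text_ws" indexed="true" stored="true"/>
  <!-- keyword-tokenized copy, kept as one token -->
  <field name="phrase_exact" type="string" indexed="true" stored="false"/>
  <copyField source="phrase" dest="phrase_exact"/>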

Tomás

2011/9/26 Rode González (libnova) r...@libnova.es

 Hi all.

 how can we do a query similar to 'like' ?


 if I have this phrase as a single token in the index: "This phrase has
 various words" (using KeywordTokenizerFactory)
 and I'd like an exact match of "phrase has various" or "various words", for
 instance...

 How can I do this?

 Thanks a lot.

 Rode.







RE: A fieldType for a address street

2011-09-26 Thread Jaeger, Jay - DOT
We used copyField to copy the address to two fields:

1.  Which contains just the first token up to the first whitespace
2.  Which copies all of it, but translates to lower case.

Then our users can enter either a street number, a street name, or both.   We 
copied all of it to the second field because it is not, in general, possible to 
distinguish between a house number and something else: a house number is not 
always present, and when present is not always numeric.  Both are 
solr.TextField:


<fieldType name="streetnumber" class="solr.TextField">
   <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory"
                 pattern="(^\S+)"
                 group="1"
      />
      <filter class="solr.LowerCaseFilterFactory" />
   </analyzer>
</fieldType>
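
The accompanying field and copyField wiring might look roughly like this (a
sketch; the field and type names are assumptions, not from the original
message):

  <field name="address" type="text_lower" indexed="true" stored="true"/>
  <field name="address_number" type="streetnumber" indexed="true" stored="false"/>
  <!-- the streetnumber analyzer keeps only the first token of the address -->
  <copyField source="address" dest="address_number"/>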


JRJ

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, September 23, 2011 9:27 AM
To: solr-user@lucene.apache.org
Subject: Re: A fieldType for a address street

Nicolas,

A text or ngram field should do it.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


- Original Message -
 From: Nicolas Martin nmar...@doyousoft.com
 To: solr-user@lucene.apache.org
 Cc: 
 Sent: Friday, September 23, 2011 5:55 AM
 Subject: A fieldType for a address street
 
 Hi solR users!
 
 I'd like to make research on my client database, in particular, i need to 
 find client by their address (ex : 100 avenue des champs élysée)
 
 Does anyone know a good fieldType to store my addresses to enable me to 
 search 
 client by address easily ?
 
 
 thank you all



RE: SOLR Index Speed

2011-09-26 Thread Lan
Are you batching the documents before sending them to the solr server (see the
sketch below)? Are you doing a commit only at the end? Also, since you have 32
cores, you can try upping the number of concurrent updaters from 16 to 32.
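
A rough sketch of batched indexing with SolrJ (assuming SolrJ 3.x; the URL,
field names and batch size are illustrative):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 500000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);      // illustrative fields
        doc.addField("body", "... document body ...");
        batch.add(doc);
        if (batch.size() == 1000) {          // send in batches, not one by one
          server.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        server.add(batch);
      }
      server.commit();                        // single commit at the end
    }
  }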



Jaeger, Jay - DOT wrote:
 
 500 / second would be 1,800,000 per hour (much more than 500K documents).
 
 1)  how big is each document?
 2)  how big are your index files?
 3)  as others have recently written, make sure you don't give your JRE so
 much memory that your OS is starved for memory to use for file system
 cache.
 
 JRJ
 
 -Original Message-
 From: Lord Khan Han [mailto:khanuniver...@gmail.com] 
 Sent: Monday, September 26, 2011 6:09 AM
 To: solr-user@lucene.apache.org
 Subject: SOLR Index Speed
 
 Hi,
 
 We have 500K web document and usind solr (trunk) to index it. We have
 special anaylizer which little bit heavy cpu .
 Our machine config:
 
 32 x cpu
 32 gig ram
 SAS HD
 
 We are sending document with 16 reduce client (from hadoop) to the stand
 alone solr server. the problem is we couldnt get speedier than the 500 doc
 /
 per sec. 500K document tooks 7-8 hours to index :(
 
 While indexin the the solr server cpu load is around : 5-6  (32 max)  it
 means  %20 of the cpu total power. We have plenty ram ...
 
 I turned of auto commit  and give 8198 rambuffer .. there is no io wait ..
 
 How can I make it faster ?
 
 PS: solr streamindex  is not option because we need to submit javabin...
 
 thanks..
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-Index-Speed-tp3368945p3370765.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to implemente a query like like '%pattern%'

2011-09-26 Thread Chris Hostetter

: References:
: cafwsjvnqkaufwspqrkm4sckb-0gvak-vktkfrnmfwgzwltm...@mail.gmail.com
: In-Reply-To:
: cafwsjvnqkaufwspqrkm4sckb-0gvak-vktkfrnmfwgzwltm...@mail.gmail.com
: Subject: how to implemente a query like  like '%pattern%' 

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss


Re: Unique Key error on trunk

2011-09-26 Thread Chris Hostetter

: Subject: Re: Unique Key error on trunk
: 
: 
: You can replicate it with the example app by replacing the id definition in 
schema.xml with
:  
<field name="id" type="uuid" indexed="true" stored="true" default="NEW" />


thanks for reporting this Viswa, I've filed a bug to track it...

https://issues.apache.org/jira/browse/SOLR-2796


-Hoss


Searching multiple fields

2011-09-26 Thread Mark
I have a use case where I would like to search across two fields but I 
do not want to weight a document that has a match in both fields higher 
than a document that has a match in only 1 field.


For example.

Document 1
 - Field A: Foo Bar
 - Field B: Foo Baz

Document 2
 - Field A: Foo Blarg
 - Field B: Something else

Now when I search for Foo I would like documents 1 and 2 to be 
similarly scored; however, document 1 will be scored much higher in this 
use case because it matches in both fields. I could create a third field 
and use the copyField directive to search across that, but I was wondering if 
there is an alternative way. It would be nice if we could search across 
some sort of virtual field that uses both underlying fields but 
does not actually increase the size of the index.


Thanks


Re: How to apply filters to stored data

2011-09-26 Thread Chris Hostetter

: Hi Erick, The problem I am trying to solve is to filter invalid entities.
: Users might mispell or enter a new entity name. This new/invalid entities
: need to pass through a KeepWordFilter so that it won't pollute our
: autocomplete result. 

how are you doing autocomplete?

if you are using the Suggest feature of solr, then that's based on the 
indexed terms anyway (last time i checked) so you don't need to manipulate 
the stored field values.

In general, the only way to manipulate the stored field values is to do it 
in an update processor -- which can mutate the documents long before the 
schema is ever even consulted.

-Hoss
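
For illustration, such an update processor might look roughly like this (a
sketch assuming Solr 3.x APIs; the field name "entity" and the keep list are
assumptions, not from the thread):

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  // Sketch: drop values of the "entity" field that are not on a keep list,
  // before the document is stored/indexed.
  public class KeepEntityProcessorFactory extends UpdateRequestProcessorFactory {
    private static final Set<String> KEEP =
        new HashSet<String>(Arrays.asList("hospital", "school"));

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
        SolrQueryResponse rsp, UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          Object value = doc.getFieldValue("entity");
          if (value != null && !KEEP.contains(value.toString().toLowerCase())) {
            doc.removeField("entity"); // mutate the document before indexing
          }
          super.processAdd(cmd); // continue the chain
        }
      };
    }
  }

It would then be registered in an updateRequestProcessorChain in solrconfig.xml
and referenced from the update handler.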


Re: drastic performance decrease with 20 cores

2011-09-26 Thread Bictor Man
Hi guys,

thanks for your replies. Indeed the filesystem caching seems to be the
difference. Sadly I can't add more memory and the 6GB/20-core combination
doesn't work, so I'll just try to tweak it as much as I can.

thanks a lot.


2011/9/26 François Schiettecatte fschietteca...@gmail.com

 You have not said how big your index is but I suspect that allocating 13GB
 for your 20 cores is starving the OS of memory for caching file data. Have
 you tried 6GB with 20 cores? I suspect you will see the same performance as
 6GB & 10 cores.

 Generally it is better to allocate just enough memory to SOLR to run
 optimally rather than as much as possible. 'Just enough' depends as well.
 You will need to try out different allocations and see where the sweet spot
 is.

 Cheers

 François


 On Sep 26, 2011, at 9:53 AM, Bictor Man wrote:

  Hi everyone,
 
  Sorry if this issue has been discussed before, but I'm new to the list.
 
  I have a solr (3.4) instance running with 20 cores (around 4 million docs
  each).
  The instance has allocated 13GB in a 16GB RAM server. If I run several
 sets
  of queries sequentially in each of the cores, the I/O access goes very
 high,
  so does the system load, while the CPU percentage remains always low.
  It takes almost 1 hour to complete the set of queries.
 
  If I stop solr and restart it with 6GB allocated and 10 cores, after a
 bit
  the I/O access goes down and the CPU goes up, taking only around 5
 minutes
  to complete all sets of queries.
 
  Meaning that for me it is MUCH more performant having 2 solr instances
  running with half the data and half the memory than a single instance
  with all the data and memory.
 
  It would be even way faster to have 1 instance with half the
 cores/memory,
  run the queries, shut it down, start a new instance and repeat the process
  than having a big instance running everything.
 
  Furthermore, if I take the 20cores/13GB instance, unload 10 of the cores,
  trigger the garbage collector and run the sets of queries again, the
  behavior still remains slow taking like 30 minutes.
 
  am I missing something here? does solr change its caching policy
 depending
  on the number of cores at startup or something similar?
 
  Any hints will be very appreciated.
 
  Thanks,
  Victor




Re: How to apply filters to stored data

2011-09-26 Thread Jithin
Is the UpdateProcessor triggered when updating an existing document, or for new
documents as well?

On Tue, Sep 27, 2011 at 6:00 AM, Chris Hostetter-3 [via Lucene] 
ml-node+s472066n3371110...@n3.nabble.com wrote:


 : Hi Erick, The problem I am trying to solve is to filter invalid entities.

 : Users might mispell or enter a new entity name. This new/invalid entities

 : need to pass through a KeepWordFilter so that it won't pollute our
 : autocomplete result.

 how are you doing autocomplete?

 if you are using the Suggest feature of solr, then thta's based on the
 indexed terms anyway (last time i checked) so you don't need to manipulate
 the stored field values.

 In general, the only way to manipluate the stored field values is to do it
 in an update processor -- which can mutate the documents long before the
 schema is ever even consulted.

 -Hoss







-- 
Thanks
Jithin Emmanuel


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-apply-filters-to-stored-data-tp3366230p3371200.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: external file field partial data match in key field

2011-09-26 Thread abhayd
I found the answer to my question:
basically it works only with a complete match.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/external-file-field-partial-data-match-in-key-field-tp3368547p3371328.html
Sent from the Solr - User mailing list archive at Nabble.com.


Any plans to support function queries on score?

2011-09-26 Thread Way Cool
Hi, guys,

Do you have any plans to support function queries on score field? for
example, sort=floor(product(score, 100)+0.5) desc?

So far I am getting the following error:
undefined field score

I can't use a subquery in this case because I am trying to use secondary
sorting; however, I will be open to that if someone has successfully used
another field to boost the results.

Thanks,

YH
http://thetechietutorials.blogspot.com/


Re: Searching multiple fields

2011-09-26 Thread Otis Gospodnetic
Hi Mark,

Eh, I don't have the Lucene/Solr source code handy, but I *think* for that you'd 
need to write a custom Lucene similarity.
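
For what it's worth, one piece of that could be a custom Similarity that
disables the coordination factor (a sketch against the Lucene 3.x API; this
only removes one of the factors that reward matching more clauses/fields, so
it is not a complete solution):

  import org.apache.lucene.search.DefaultSimilarity;

  // Sketch: return 1.0 so matching more query clauses does not scale the
  // score up; DefaultSimilarity returns overlap / (float) maxOverlap.
  public class NoCoordSimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
      return 1.0f;
    }
  }

It would be registered in schema.xml via <similarity class="..."/>. If the
dismax handler is in play, the tie parameter may also be worth a look
(tie=0.0 scores each document by its best-matching field only).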

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: Mark static.void@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 8:12 PM
Subject: Searching multiple fields

I have a use case where I would like to search across two fields but I do not 
want to weight a document that has a match in both fields higher than a 
document that has a match in only 1 field.

For example.

Document 1
- Field A: Foo Bar
- Field B: Foo Baz

Document 2
- Field A: Foo Blarg
- Field B: Something else

Now when I search for Foo I would like document 1 and 2 to be similarly 
scored however document 1 will be scored much higher in this use case because 
it matches in both fields. I could create a third field and use copyField 
directive to search across that but I was wondering if there is an alternative 
way. It would be nice if we could search across some sort of virtual field 
that will use both underlying fields but not actually increase the size of the 
index.

Thanks




Re: How to reserve ids?

2011-09-26 Thread Otis Gospodnetic
Hi Gabriele,

Either the latter option, or just treat them as stop words if you just want to 
remove those urls/ids from indexed docs (may still get highlighted).
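
For illustration, the stop-word variant could be wired into the field's index
analyzer like this (a sketch; the file name is an assumption):

  <filter class="solr.StopFilterFactory" words="url_blacklist.txt"
          ignoreCase="true"/>

where url_blacklist.txt lists the urls/ids to drop, one per line.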

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: Gabriele Kahlout gabri...@mysimpatico.com
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 3:33 PM
Subject: How to reserve ids?

Hello,

While indexing there are certain urls/ids I'd never want to appear in the
search results (so never be indexed). Is there already a 'supported by design'
mechanism for that to point me to, or should I just create this blacklist
as a processor in the update chain?

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).




Re: Boost Exact matches on Specific Fields

2011-09-26 Thread Way Cool
If I were you, probably I will try defining two fields:
1. ts_category as a string type
2. ts_category1 as a text_en type
Make sure to copy ts_category to ts_category1.

You can use the following as qf in your dismax:
qf=body^4.0 title^5.0 ts_category^10.0 ts_category1^5.0
or something like that.

YH
http://thetechietutorials.blogspot.com/


On Mon, Sep 26, 2011 at 2:06 PM, balaji mcabal...@gmail.com wrote:

 Hi all

I am new to SOLR and have a doubt on Boosting the Exact Terms to the top
 on a Particular field

 For ex :

 I have a text field names ts_category and I want to give more boost to
 this field rather than other fields, SO in my Query I pass the following in
 the QF params qf=body^4.0 title^5.0 ts_category^21.0 and also sort on
 SCORE desc

 When I do a search against Hospitals . I get Hospitalization
 Management , Hospital Equipment & Supplies  on Top rather than the exact
 matches of Hospitals

  So It would be great , If I could be helped over here


 Thanks
 Balaji







 Thanks in Advance
 Balaji

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Boost-Exact-matches-on-Specific-Fields-tp3370513p3370513.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr DIH for mongodb

2011-09-26 Thread Otis Gospodnetic
Hi,

Here is a 1 month old thread I found on search-lucene -- didn't even have to do 
a search, I got it as a suggestion from AutoComplete when I started typing the 
word mongodb :)

http://search-lucene.com/m/8AEE31AaTd32


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: Kiwi de coder kiwio...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 11:58 AM
Subject: solr DIH for mongodb

hi,

do we got any DIH plugin which is for mongodb?

regards,
kiwi




Re: Update ingest rate drops suddenly

2011-09-26 Thread Otis Gospodnetic
Aha!  See, it was the DB after all! ;)  Thanks for following up, I was curious.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: eks dev eks...@yahoo.co.uk
To: solr-user solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 10:21 AM
Subject: Re: Update ingest rate drops suddenly

Just to bring closure on this one, we were slurping data from the
wrong DB (hardly desktop class machine)...

Solr did not cough on 41Mio records @34k updates / sec.,  single threaded.
Great!



On Sat, Sep 24, 2011 at 9:18 PM, eks dev eks...@yahoo.co.uk wrote:
 just looking for hints where to look for...

 We were testing single threaded ingest rate on solr, trunk version on
 atypical collection (a lot of small documents), and we noticed
 something we are not able to explain.

 Setup:
 We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD,
 machine with enough memory and 8 cores.   Schema has 5 stored fields,
 4 of them indexed no positions no norms.
 Average net document size (optimized index size / number of documents)
 is around 100 bytes.

 On a test with 40 Mio document:
 - we had update ingest rate  on first 4,4Mio documents @  incredible
 34k records / second...
 - then it dropped, suddenly to 20k records per second and this rate
 remained stable (variance 1k) until...
 - we hit 13Mio, where ingest rate dropped again really hard, from one
 instant in time to another to 10k records per second.

 it stayed there until we reached the end @40Mio (slightly reducing, to
 ca 9k, but this is not long enough to see trend).

 Nothing unusual happening with jvm memory (saw-tooth 200-450M, fully
 regular). CPU in turn was following the ingest rate trend, indicating
 that we were waiting on something. No searches, no commits, nothing.

 autoCommit was turned off. Updates were streaming directly from the database.

 -
 I did not expect something like this, knowing lucene merges in
 background. Also, having such sudden drops in ingest rate is
 indicative that we are not leaking something. (drop would have been
 much more gradual). It is some caches, but why two really significant
 drops? 33k/sec to 20k and then to 10k... We would love to keep it @34
 k/second :)

 I am not really acquainted with the new MergePolicy and flushing
 settings, but I suspect this is something there we could tweak.

 Could it be windows is somehow, hmm, quirky with solr default
 directory on win64/jvm (I think it is MMAP by default)... We did not
 saturate IO with such a small documents I guess, It is a just couple
 of Gig over 1-2 hours.

 All in all, it works good, but is having such hard update ingest rate
 drops normal?

 Thanks,
 eks.





Re: SOLR Index Speed

2011-09-26 Thread Otis Gospodnetic
Hello,

 PS: solr streamindex  is not option because we need to submit javabin...


If you are referring to StreamingUpdateSolrServer, then the above statement 
makes no sense and you should give SUSS a try.

Are you sure your 16 reducers produce more than 500 docs/second?
I think somebody already suggested increasing the number of reducers to ~32.
What happens to your CPU load and indexing speed then?
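
For reference, SUSS usage might look roughly like this (a sketch assuming
SolrJ 3.x; the URL, queue size, thread count and field names are illustrative):

  import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class SussExample {
    public static void main(String[] args) throws Exception {
      // queue up to 10000 docs and drain them with 4 background threads
      StreamingUpdateSolrServer server =
          new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);
      server.setRequestWriter(new BinaryRequestWriter()); // submit javabin
      for (int i = 0; i < 500000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);   // illustrative fields
        doc.addField("body", "... document body ...");
        server.add(doc);                  // returns quickly; sent in background
      }
      server.blockUntilFinished();        // wait for the queue to drain
      server.commit();
    }
  }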


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: Lord Khan Han khanuniver...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 7:09 AM
Subject: SOLR Index Speed

Hi,

We have 500K web document and usind solr (trunk) to index it. We have
special anaylizer which little bit heavy cpu .
Our machine config:

32 x cpu
32 gig ram
SAS HD

We are sending document with 16 reduce client (from hadoop) to the stand
alone solr server. the problem is we couldnt get speedier than the 500 doc /
per sec. 500K document tooks 7-8 hours to index :(

While indexin the the solr server cpu load is around : 5-6  (32 max)  it
means  %20 of the cpu total power. We have plenty ram ...

I turned of auto commit  and give 8198 rambuffer .. there is no io wait ..

How can I make it faster ?

PS: solr streamindex  is not option because we need to submit javabin...

thanks..




Re: error while replication

2011-09-26 Thread Otis Gospodnetic
Rajat,

What version?  If < 3.4.0, I'd try 3.4.0 first.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: shinkanze rajatrastogi...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 5:45 AM
Subject: error while replication

hi,
I am replicating Solr and getting this error. I am unable to make out the
cause, so please kindly help:


26 Sep, 2011 8:00:14 AM org.slf4j.impl.JDK14LoggerAdapter fillCallerData
SEVERE: Error during auto-warming of
key:org.apache.solr.search.QueryResultKey@150f0455:java.lang.NullPointerException
        at org.apache.lucene.util.StringHelper.intern(StringHelper.java:36)
        at org.apache.lucene.index.Term.<init>(Term.java:38)
        at
org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.next(NumericRangeQuery.java:530)
        at
org.apache.lucene.search.NumericRangeQuery$NumericRangeTermEnum.<init>(NumericRangeQuery.java:476)
        at
org.apache.lucene.search.NumericRangeQuery.getEnum(NumericRangeQuery.java:307)
        at
org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:160)
        at
org.apache.lucene.search.ConstantScoreQuery$ConstantScorer.<init>(ConstantScoreQuery.java:116)
        at
org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:81)
        at
org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
        at
org.apache.lucene.search.IndexSearcher.searchWithFilter(IndexSearcher.java:268)
        at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:258)
        at org.apache.lucene.search.Searcher.search(Searcher.java:171)
        at
org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1101)
        at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:880)
        at
org.apache.solr.search.SolrIndexSearcher.access$000(SolrIndexSearcher.java:51)
        at
org.apache.solr.search.SolrIndexSearcher$3.regenerateItem(SolrIndexSearcher.java:332)
        at org.apache.solr.search.LRUCache.warm(LRUCache.java:194)
        at
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1481)
        at org.apache.solr.core.SolrCore$2.call(SolrCore.java:1130)
        at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)


regards 
rajat rastogi

--
View this message in context: 
http://lucene.472066.n3.nabble.com/error-while-replication-tp3368783p3368783.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: matching reponse and request

2011-09-26 Thread Otis Gospodnetic
Hi Roland,

Have a look at hit #1 
here: http://search-lucene.com/?q=manifoldcffc_project=Solr

I think this is what you are after.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: Roland Tollenaar rwatollen...@gmail.com
To: solr-user@lucene.apache.org
Sent: Sunday, September 25, 2011 4:24 AM
Subject: Re: matching reponse and request

Hi Otis,

this is absolutely brilliant! I did not think it was possible.

It opens up a new possibility.

If I insert device IDs in this manner (as in a unique identifier of the 
device sending the request), might it be possible to control (at least 
block or permit) the permissions of the user?

It seems like something of the sort is possible but I only come up with 
this:

http://search-lucene.com/m/Yuib11zCeYN

No redirect to where the permissions can be set (in the schema) or how the 
requests are identified as coming from a particular user/device.

Thanks for your help.

Kind regards,

Roland


Otis Gospodnetic wrote:
 Hi Roland,
 
 Check this:
 
 <response>
 <lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
 <str name="indent">on</str>
 <str name="start">0</str>
 <str name="q">solr</str>
 <str name="foo">1</str>            <=== from foo=1
 <str name="version">2.2</str>
 <str name="rows">10</str>
 </lst>
  
 I added foo=1 to the request to Solr and got the above back.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 
 
 From: Roland Tollenaar rwatollen...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Saturday, September 24, 2011 4:07 AM
 Subject: matching reponse and request

 Hi,

 sorry for this question but I am hoping it has a quick solution.

 I am sending multiple get request queries to solr but solr is not returning 
 the responses in the sequence I send the requests.

 The shortest responses arrive back first

 I am wondering whether I can add a tag to the request which will be given 
 back to me in the response so that when the response comes I can connect it 
 to the original request and handle it in the appropriate manner.

 If this is possible, how?

 Help appreciated!

 Regards,

 Roland.







Re: solr DIH for mongodb

2011-09-26 Thread Kiwi de coder
wow, this search engine is powerful!

too bad, after looking through it, there is still no solution.

seems like I need to get my hands dirty and make one :)

kiwi


On Tue, Sep 27, 2011 at 12:08 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hi,

 Here is a 1 month old thread I found on search-lucene -- didn't even have
 to do a search, I got it as a suggestion from AutoComplete when I started
 typing the word mongodb :)

 http://search-lucene.com/m/8AEE31AaTd32


 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/


 
 From: Kiwi de coder kiwio...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Monday, September 26, 2011 11:58 AM
 Subject: solr DIH for mongodb
 
 hi,
 
 do we got any DIH plugin which is for mongodb?
 
 regards,
 kiwi
 
 
 



Re: Boost Exact matches on Specific Fields

2011-09-26 Thread Balaji S
Hi

   You mean to say copy the String field to a Text field, or the reverse?
This is the approach I am currently following:

Step 1: Created a FieldType


 <fieldType name="string_lower" class="solr.TextField"
            sortMissingLast="true" omitNorms="true">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.TrimFilterFactory" />
   </analyzer>
 </fieldType>

Step 2: <field name="str_category" type="string_lower" indexed="true"
stored="true"/>

Step 3: <copyField source="ts_category" dest="str_category"/>

And in the SOLR query I am planning to use q=hospitals&qf=body^4.0 title^5.0
ts_category^10.0 str_category^8.0


The one question I have here is: all the above mentioned fields will have
Hospital present in them. Will the above approach work to get the exact
match on the top and bring Hospitalization below it in the results?


Thanks
Balaji


On Tue, Sep 27, 2011 at 9:38 AM, Way Cool way1.wayc...@gmail.com wrote:

 If I were you, probably I will try defining two fields:
 1. ts_category as a string type
 2. ts_category1 as a text_en type
 Make sure copy ts_category to ts_category1.

 You can use the following as qf in your dismax:
 qf=body^4.0 title^5.0 ts_category^10.0 ts_category1^5.0
 or something like that.

 YH
 http://thetechietutorials.blogspot.com/


 On Mon, Sep 26, 2011 at 2:06 PM, balaji mcabal...@gmail.com wrote:

  Hi all
 
 I am new to SOLR and have a doubt on Boosting the Exact Terms to the
 top
  on a Particular field
 
  For ex :
 
  I have a text field names ts_category and I want to give more boost
 to
  this field rather than other fields, SO in my Query I pass the following
 in
  the QF params qf=body^4.0 title^5.0 ts_category^21.0 and also sort on
  SCORE desc
 
  When I do a search against Hospitals . I get Hospitalization
  Management , Hospital Equipment & Supplies  on Top rather than the exact
  matches of Hospitals
 
   So It would be great , If I could be helped over here
 
 
  Thanks
  Balaji
 
 
 
 
 
 
 
  Thanks in Advance
  Balaji
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Boost-Exact-matches-on-Specific-Fields-tp3370513p3370513.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: drastic performance decrease with 20 cores

2011-09-26 Thread Otis Gospodnetic
The following should help with size estimation:

http://search-lucene.com/?q=estimate+memoryfc_project=Solr

http://issues.apache.org/jira/browse/LUCENE-3435

I'll just add that with that much RAM you'll be more than fine.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: François Schiettecatte fschietteca...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 12:43 PM
Subject: Re: drastic performance decrease with 20 cores

You have not said how big your index is but I suspect that allocating 13GB for 
your 20 cores is starving the OS of memory for caching file data. Have you 
tried 6GB with 20 cores? I suspect you will see the same performance as 6GB & 
10 cores.

Generally it is better to allocate just enough memory to SOLR to run optimally 
rather than as much as possible. 'Just enough' depends as well. You will need 
to try out different allocations and see where the sweet spot is.

Cheers

François


On Sep 26, 2011, at 9:53 AM, Bictor Man wrote:

 Hi everyone,
 
 Sorry if this issue has been discussed before, but I'm new to the list.
 
 I have a solr (3.4) instance running with 20 cores (around 4 million docs
 each).
 The instance has allocated 13GB in a 16GB RAM server. If I run several sets
 of queries sequentially in each of the cores, the I/O access goes very high,
 so does the system load, while the CPU percentage remains always low.
 It takes almost 1 hour to complete the set of queries.
 
 If I stop solr and restart it with 6GB allocated and 10 cores, after a bit
 the I/O access goes down and the CPU goes up, taking only around 5 minutes
 to complete all sets of queries.
 
 Meaning that for me it is MUCH more performant having 2 solr instances running
 with half the data and half the memory than a single instance with all the
 data and memory.
 
 It would be even way faster to have 1 instance with half the cores/memory,
 run the queries, shut it down, start a new instance and repeat the process
 than having a big instance running everything.
 
 Furthermore, if I take the 20cores/13GB instance, unload 10 of the cores,
 trigger the garbage collector and run the sets of queries again, the
 behavior still remains slow taking like 30 minutes.
 
 am I missing something here? does solr change its caching policy depending
 on the number of cores at startup or something similar?
 
 Any hints will be very appreciated.
 
 Thanks,
 Victor





Re: solr DIH for mongodb

2011-09-26 Thread Otis Gospodnetic
From: Kiwi de coder kiwio...@gmail.com

wow, this search engine is powerful !


Thanks, glad it helps.

too bad, after looking through it, there is still no solution.

seems like I need to get my hands dirty and make one :)


:)
Please consider contributing: http://wiki.apache.org/solr/HowToContribute

Otis


kiwi



On Tue, Sep 27, 2011 at 12:08 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

Hi,

Here is a 1 month old thread I found on search-lucene -- didn't even have to 
do a search, I got it as a suggestion from AutoComplete when I started typing 
the word mongodb :)

http://search-lucene.com/m/8AEE31AaTd32


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



From: Kiwi de coder kiwio...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, September 26, 2011 11:58 AM
Subject: solr DIH for mongodb


hi,

do we got any DIH plugin which is for mongodb?

regards,
kiwi








Re: How to implement Spell Checker using Solr?

2011-09-26 Thread anupamxyz
I have been able to set up the Solr spell checker on my web application. It is a
file-based spell checker that I have implemented. I would like to add that
it isn't that accurate, since I haven't applied any specific algorithm
for getting the most relevant search result. Kindly do let me know in case
you have any issues implementing the same at your end.

regards,
Anupam

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-implement-Spell-Checker-using-Solr-tp3268450p3371563.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to implement Spell Checker using Solr?

2011-09-26 Thread tamanjit.bin...@yahoo.co.in
Firstly, just to make it clear: the dictionary is made out of already indexed
terms, rather it is built upon them, if you are using <str
name="classname">solr.IndexBasedSpellChecker</str> -- which you are.

Next, a lot of changes are required in your solrconfig.xml:

1. <str name="field">spell</str> is the name of the field which will be used
to create your dictionary. Does it exist in schema.xml?

2. <str name="queryAnalyzerFieldType">textSpell</str> is the name of the
FieldType used for your dictionary building, as in the <str
name="field">spell</str> field should be of type textSpell in schema.xml. Is it
so?

Now for your internal error from crawling. This is most probably because your
solrconfig.xml/schema.xml has been changed. I assume this because, as you
say, this was working before you tried to implement spellcheck.

 Also, I am not too sure how I can make my search work based on the
 search control in my application. Like how can I search with the word and
 have the suggestion at the same time, since when the search item is, say,
 form/formm, then I should essentially have a separate URL created. Does
 the Solr spell checker component take care of it on its own? If so, how, and
 exactly how should the solrconfig and schema XMLs be configured for the
 same?

 Please note: I would prefer to use a file-based dictionary for the search, so
 kindly suggest on those lines.

If you are looking for file-based spellchecking, you are going in the wrong
direction. You are trying to use the IndexBasedSpellChecker class when actually
what you need is:

<lst name="spellchecker">
  <str name="name">file</str>
  <str name="classname">solr.FileBasedSpellChecker</str>
  <str name="sourceLocation">spellings.txt</str>
  <str name="characterEncoding">UTF-8</str>
  <str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>

Kindly read more about the spellchecker.
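
For illustration, a request with spellcheck enabled against that file-based
dictionary might look like this (a sketch; host, core and query are
illustrative):

  http://localhost:8983/solr/select?q=formm&spellcheck=true&spellcheck.dictionary=file&spellcheck.collate=true

The suggestions then come back in the same response as the normal search
results, so no separate URL should be needed.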




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-implement-Spell-Checker-using-Solr-tp3268450p3371620.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr stopword problem in Query

2011-09-26 Thread Isan Fulia
Hi Rahul,

I also tried searching Coke Studio MTV but no documents were returned.

Here is the snippet of my schema file.

 <fieldType name="text" class="solr.TextField"
     positionIncrementGap="100" autoGeneratePhraseQueries="true">

   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords_en.txt"
             enablePositionIncrements="true"
     />
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>

   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory"
             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords_en.txt"
             enablePositionIncrements="true"
     />
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>

 </fieldType>


<field name="content" type="text" indexed="false" stored="true"
multiValued="false"/>
<field name="title" type="text" indexed="false" stored="true"
multiValued="false"/>

<field name="textForQuery" type="text" indexed="true" stored="false"
multiValued="true" omitTermFreqAndPositions="true"/>

<copyField source="content" dest="textForQuery"/>
<copyField source="title" dest="textForQuery"/>


Thanks,
Isan Fulia.


On 26 September 2011 21:19, Rahul Warawdekar rahul.warawde...@gmail.comwrote:

 Hi Isan,

 Does your search return any documents when you remove the 'at' keyword and
 just search for Coke studio MTV ?
 Also, can you please provide the snippet of schema.xml file where you have
 mentioned this field name and its type description ?

 On Mon, Sep 26, 2011 at 6:09 AM, Isan Fulia isan.fu...@germinait.com
 wrote:

  Hi all,
 
  I have a text field named* textForQuery* .
  Following content has been indexed into solr in field textForQuery
  *Coke Studio at MTV*
 
  when i fired the query as
  *textForQuery:(coke studio at mtv)* the results showed 0 documents
 
  After runing the same query in debugMode i got the following results
 
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
  <str name="rawquerystring">textForQuery:(coke studio at mtv)</str>
  <str name="querystring">textForQuery:(coke studio at mtv)</str>
  <str name="parsedquery">PhraseQuery(textForQuery:"coke studio ? mtv")</str>
  <str name="parsedquery_toString">textForQuery:"coke studio ? mtv"</str>
 
  Why did the query not match any document even when there is a document
  with the value of textForQuery as *Coke Studio at MTV*?
  Is this because of the stopword *at* present in the stopword list?
 
 
 
  --
  Thanks  Regards,
  Isan Fulia.
 



 --
 Thanks and Regards
 Rahul A. Warawdekar




-- 
Thanks  Regards,
Isan Fulia.


Re: what is delata query and how to write?

2011-09-26 Thread Gora Mohanty
On Tue, Sep 27, 2011 at 10:51 AM, nagarjuna nagarjuna.avul...@gmail.com wrote:
 Hi everybody.

 right now I have a little bit of an idea about the Solr query, but I am not
 clear about the delta query:
 what is it, and how do I write one? Any sample delta query?

http://lmgtfy.com/?q=solr+delta+query

There are many useful links among the first several.

Regards,
Gora
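
For context: in Solr's DataImportHandler, the deltaQuery selects the rows
changed since the last import, and the deltaImportQuery fetches those rows by
primary key. A minimal data-config.xml sketch (table and column names are
illustrative):

  <entity name="item" pk="id"
          query="SELECT id, name FROM item"
          deltaQuery="SELECT id FROM item
                      WHERE last_modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT id, name FROM item
                            WHERE id = '${dataimporter.delta.id}'">
  </entity>

Running /dataimport?command=delta-import then indexes only the changed rows.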


Re: what is delata query and how to write?

2011-09-26 Thread nagarjuna
Hi Gora, can you please stop giving answers like these?
I may get the perfect answer from anybody but not you, so kindly
please be quiet.
I already googled and saw many links; as a beginner I was unable to grasp the
main intention behind using the delta query, given that we already have query.
And I didn't find the samples -- that's why I posted this thread...
If you really want to help me, then try to find the samples and send me the
link. I will also try -- you know, I am still googling; if I find it I will
post the answer to my thread; if anybody finds it, I will get the answer.
That's my intention.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/what-is-delata-query-and-how-to-write-tp3371639p3371681.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to reserve ids?

2011-09-26 Thread Gabriele Kahlout
I'm interested in the stopwords solution as it sounds like less work, but I'm 
not sure I understand how it works. By having msn.com as a stopword it doesn't 
mean I won't get msn.com as a result for, say, 'hotmail'. My understanding is that 
msn.com will never make it to the similarity function and thus never affect the score 
calculation. But seldom does the url anyway (in my searches on content)!