Re: Updating FAQ for International Characters?

2010-03-11 Thread Eric Pugh
So I am using Sunspot to post over, which means an extra layer of
indirection between me and my XML!  I will look tomorrow.



On Mar 10, 2010, at 7:21 PM, Chris Hostetter wrote:



: Any time a character like that was indexed, Solr threw an unknown
: entity error.

: But if converted to &#192; or &Agrave; then everything works great.
:
: I tried out using Tomcat versus Jetty and got the same results.   
Before I edit


Uh, you mean like the characters in exampledocs/utf8-example.xml ?

It contains literal UTF-8 characters, and it works fine.

Based on your &#192; comment I assume you are posting XML ... are you
sure you are using the UTF-8 charset?

-Hoss



-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from 
http://www.packtpub.com/solr-1-4-enterprise-search-server
Free/Busy: http://tinyurl.com/eric-cal


Facet pagination

2010-03-11 Thread Avlesh Singh
Is there a way to get *total count of facets* per field?

Meaning, if my facets are -
<lst name="facet_fields">
  <lst name="first_char">
    <int name="s">305807</int>
    <int name="d">264748</int>
    <int name="p">181084</int>
    <int name="m">130546</int>
    <int name="r">98544</int>
    <int name="b">82741</int>
    <int name="k">77157</int>
  </lst>
</lst>

Then, is something like the following possible?
<lst name="first_char" totalFacetCount="7">
where 7 is the count of all facet values available for the field; in this example s, d, p, m,
r, b and k.

I need this to fetch paginated facets of a field for a given query; not by
doing next-previous.
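
(For reference, I currently page through the facet values themselves with facet.limit and
facet.offset, roughly like the sketch below; the host, core and handler in the URL are just
placeholders. What I am missing is the total number of distinct values, so that I know when
to stop paging.)

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=first_char&facet.mincount=1&facet.limit=20&facet.offset=40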

Cheers
Avlesh


Advance Search

2010-03-11 Thread Suram


How can I achieve advanced search in Solr?

I need to search books like (e.g. title = The Book of Three, author = Lloyd
Alexander, price = 99.00).

How can I query this?
-- 
View this message in context: 
http://old.nabble.com/Advance-Search-tp27861279p27861279.html
Sent from the Solr - User mailing list archive at Nabble.com.



Cant commit on 125 GB index

2010-03-11 Thread Frederico Azeiteiro
Hi, 

I'm having timeouts committing on a 125 GB index with about 2200
docs.

 

I'm inserting new docs every 5m and committing after that.

 

I would like to try the autocommit option and see if I can get better
results. I need the docs indexed available for searching in about 10
minutes after the insert.

 

I was thinking of using something like

 

<autoCommit>
  <maxDocs>5000</maxDocs>
  <maxTime>86000</maxTime>
</autoCommit>
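
(If I am reading the example solrconfig.xml correctly, this block sits inside the update
handler, i.e. something along these lines; the handler class below is just the stock one.)

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>5000</maxDocs>
    <maxTime>86000</maxTime>
  </autoCommit>
</updateHandler>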

 

I update about 4000 docs every 15m.

 

Can you share your thoughts on this config?

Do you think this will solve my commits timeout problem?

 

Thanks,

Frederico



Multiple SOLR queries on same index

2010-03-11 Thread Kranti™ K K Parisa
Hi,

Is it possible to execute multiple Solr queries (basically the same
structure/fields, but because of the header-size limitations on long query URLs
I am thinking of splitting them into multiple Solr queries)
against a single index, like a batch?

Best Regards,
Kranti K K Parisa


Re: index merge

2010-03-11 Thread Mark Fletcher
Hi All,

Thank you for the very valuable suggestions.
I am planning to try using the Master - Slave configuration.

Best Rgds,
Mark.

On Mon, Mar 8, 2010 at 11:17 AM, Mark Miller markrmil...@gmail.com wrote:

 On 03/08/2010 10:53 AM, Mark Fletcher wrote:

 Hi Shalin,

 Thank you for the mail.
 My main purpose of having 2 identical cores
 COREX - always serves user request
 COREY - every day once, takes the updates/latest data and passess it on to
 COREX.
 is:-

 Suppose say I have only one COREY and suppose a request comes to COREY
 while
 the update of the latest data is happening on to it. Wouldn't it degrade
 performance of the requests at that point of time?


 Yes - but you're not going to help anything by using two indexes - the best you
 can do is use two boxes. Two indexes on the same box will actually
 be worse than one if they are identical and you are swapping between them.
 Writes on an index will not affect reads in the way you are thinking - only
 in that they use IO and CPU that the read process can't. That's going to
 happen with two indexes on the same box too - except now you have way more
 data to cache and flip between, and you can't take any advantage of things
 just being written possibly being in the cache for reads.

 Lucene indexes use a write once strategy - when writing new segments, you
 are not touching the segments being read from. Lucene is already doing the
 index juggling for you at the segment level.


 So I was planning to keep COREX and COREY always identical. Once COREY has
 the latest it should somehow sync with COREX so that COREX also now has
 the
 latest. COREY keeps on getting the updates at a particular time of day and
 it will again pass it on to COREX. This process continues everyday.

 What is the best possible way to implement this?

 Thanks,

 Mark.


 On Mon, Mar 8, 2010 at 9:53 AM, Shalin Shekhar Mangar
 shalinman...@gmail.com  wrote:



 Hi Mark,

  On Mon, Mar 8, 2010 at 7:38 PM, Mark Fletcher
 mark.fletcher2...@gmail.com  wrote:



 I ran the SWAP command. Now:-
 COREX has the dataDir pointing to the updated dataDir of COREY. So COREX
 has the latest.
 Again, COREY (on which the update regularly runs) is pointing to the old
 index of COREX. So this now doesn't have the most updated index.

 Now shouldn't I update the index of COREY (now pointing to the old
 COREX)
 so that it has the latest footprint as in COREX (having the latest COREY
 index)so that when the update again happens to COREY, it has the latest
 and
 I again do the SWAP.

 Is a physical copying of the index named COREY (the latest, and now the dataDir
 of COREX after SWAP) to the index COREX (now the dataDir of COREY... the
 original non-updated index of COREX) the best way to do this, or is there any
 other better option?

 Once again, later when COREY is again updated with the latest, I will
 run
 the SWAP again and it will be fine with COREX again pointing to its
 original
 dataDir (now the updated one).So every even SWAP command run will point
 COREX back to its original dataDir. (same case with COREY).

 My only concern is after the SWAP is done, updating the old index (which
 was serving previously and now replaced by the new index). What is the
 best
 way to do that? Physically copy the latest index to the old one and make
 it
 in sync with the latest one so that by the time it is to get the latest
 updates it has the latest in it so that the new ones can be added to
 this
 and it becomes the latest and is again swapped?



 Perhaps it is best if we take a step back and understand why you need two
 identical cores?

 --
 Regards,
 Shalin Shekhar Mangar.







 --
 - Mark

 http://www.lucidimagination.com






field length normalization

2010-03-11 Thread muneeb

Hi,

In my schema, the document title field has omitNorms=false, which, if I am
not wrong, causes length of titles to be counted in the scoring. 
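
For reference, the field is declared more or less like this in my schema.xml (the type
name is just what I happen to use):

<field name="title" type="text" indexed="true" stored="true" omitNorms="false"/>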

But when I query with: word1 word2 word3, I don't know why the top two
documents' titles have these words plus other words, whereas the document
whose title has exactly and only these query words comes in third place.

Setting omitNorms to false should bring the titles with the exact words to the top,
shouldn't it?

Also I noticed, when debugging the query, that all three top documents have the same
score; shouldn't this differ, since they have different title lengths?

Thanks very much.
-A
-- 
View this message in context: 
http://old.nabble.com/field-length-normalization-tp27862618p27862618.html
Sent from the Solr - User mailing list archive at Nabble.com.



mincount doesn't work with FacetQuery

2010-03-11 Thread Steve Radhouani
I'm faceting with a query range (with addFacetQuery) and setting mincount to
10 (with setFacetMinCount(10)), but Solr is not respecting this mincount;
it's still giving me all responses, even those having less than 10 retrieved
documents.

I'm wondering whether there's another way to define the mincount while using
addFacetQuery. Actually, when I use this same mincount with addFacetField,
it works perfectly.

Any ideas?

Thanks


Re: Architectural help

2010-03-11 Thread Constantijn Visinescu
Assuming you create the view in such a way that it returns 1 row for each
solrdocument you want indexed: yes
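
A minimal data-config along these lines worked for me (the view, column and field
names below are only placeholders for your own):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/mydb"
              user="..." password="..."/>
  <document>
    <entity name="item" query="select * from my_view">
      <field column="ID" name="id"/>
      <field column="TITLE" name="title"/>
    </entity>
  </document>
</dataConfig>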

On Wed, Mar 10, 2010 at 7:54 PM, blargy zman...@hotmail.com wrote:


 So I can just create a view  (or temporary table) and then just have a
 simple
 select * from (view or table) in my DIH config?


 Constantijn Visinescu wrote:
 
  Try making a database view that contains everything you want to index,
 and
  then just use the DIH.
 
  Worked when i tested it ;)
 
  On Wed, Mar 10, 2010 at 1:56 AM, blargy zman...@hotmail.com wrote:
 
 
  I was wondering if someone could be so kind to give me some
 architectural
  guidance.
 
  A little about our setup. We are RoR shop that is currently using Ferret
  (no
  laughs please) as our search technology. Our indexing process at the
  moment
  is quite poor as well as our search results. After some deliberation we
  have
  decided to switch to Solr to satisfy our search requirements.
 
  We have about 5M records ranging in size all coming from a DB source
  (only
  2
  tables). What will be the most efficient way of indexing all of these
  documents? I am looking at DIH but before I go down that road I wanted
 to
  get some guidance. Are there any pitfalls I should be aware of before I
  start? Anything I can do now that will help me down the road?
 
  I have also been exploring the Sunspot rails plugin
  (http://outoftime.github.com/sunspot/) which so far seems amazing.
 There
  is
  an easy way to reindex all of your models like Model.reindex but I doubt
  this is the most efficient. Has anyone had any experience using Sunspot
  with
  their rails environment and if so should I bother with the DIH?
 
  Please let me know of any suggestions/opinions you may have. Thanks.
 
 
  --
  View this message in context:
  http://old.nabble.com/Architectural-help-tp27844268p27844268.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://old.nabble.com/Architectural-help-tp27844268p27854256.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Advance Search

2010-03-11 Thread Erick Erickson
Have you looked at dismax?

Erick

On Thu, Mar 11, 2010 at 4:40 AM, Suram reactive...@yahoo.com wrote:



 How can i achieve the advance search in solr .

 i need search books like (eg title = The Book of Three,author= Lloyd
 Alexander, price = 99.00)

 How can i querying this
 --
 View this message in context:
 http://old.nabble.com/Advance-Search-tp27861279p27861279.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Apache Solr module with drupal - where to change key word in context?

2010-03-11 Thread llobash

I am using the apache solr module with our Drupal site.
Our data is not clean enough to use the key word in context blurb under the
title in the result set.
I would like to change it to the first N characters in the body of the node.
Can anyone direct me to the file and line(s) where I would do this?
thanks!
Lynn
-- 
View this message in context: 
http://old.nabble.com/Apache-Solr-module-with-drupal---where-to-change-key-word-in-context--tp27863711p27863711.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Distributed search fault tolerance

2010-03-11 Thread Shawn Heisey
I guess I must be including too much information in my questions, 
running into tl;dr with them.  Later today when I have more time I'll 
try to make it more bite-size.


On 3/9/2010 2:28 PM, Shawn Heisey wrote:
I attended the Webinar on March 4th.  Many thanks to Yonik for putting 
that on.  That has led to some questions about the best way to bring 
fault tolerance to our distributed search.  High level question: 
Should I go with SolrCloud, or stick with 1.4 and use load balancing?  
I hope the rest of this email isn't too disjointed for understanding.




Re: mincount doesn't work with FacetQuery

2010-03-11 Thread Erik Hatcher

Steve -

I'm a bit confused... each facet.query (using HTTP parameter  
nomenclature) only adds a single value to the response, the number of  
docs within the current constraints that match that query.   
facet.mincount is specifically for facet.field, which adds a name/ 
value pair for each value in the field, and that's where you want to  
limit the number of values returned.
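
In SolrJ terms, a rough sketch (the field names are made up):

  import org.apache.solr.client.solrj.SolrQuery;

  SolrQuery q = new SolrQuery("*:*");
  q.addFacetField("category");          // facet.field - facet.mincount applies here
  q.setFacetMinCount(10);               // drops field values with fewer than 10 docs
  q.addFacetQuery("price:[0 TO 100]");  // facet.query - a single count, mincount ignored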


Perhaps you could provide a concrete example with the Solr response
(XML, JSON, or some other readable format) for the facet data that isn't
making sense to you.


Or maybe SolrJ has some faults in presenting the response properly?

Erik

On Mar 11, 2010, at 8:11 AM, Steve Radhouani wrote:

I'm faceting with a query range (with addFacetQuery) and setting  
mincount to
10 (with setFacetMinCount(10)), but Solr is not respecting this  
mincount;
it's still giving me all responses, even those having less than 10  
retrieved

documents.

I'm wondering wether there's another way to define the mincount  
while using
addFacetQuery. Actually, when I use this same mincount with  
addFacetField,

it works perfectly.

Any ideas?

Thanks




Aggregate functions on faceted result

2010-03-11 Thread Marcus Herou
Hi.

We would like to be able to create trend graphs which have date on the
X-axis and sum(pagerank) on the Y-axis. We have the field pageRank stored as
an external field (since it is updated all the time).

I have started to build a SearchComponent which will be named something like
FacetFunctionComponent but felt that I should drop a mail here asking if it
is already possible.

Is it even remotely possible to create this function in SOLR ?

Cheers

//Marcus Herou


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: distinct on my result

2010-03-11 Thread stocki

Okay.
We have a lot of products and I just imported the name of each product into a
core.
I apply an EdgeNGram to this and my autocompletion works.

But what I want is auto-suggestion:

Example:

autocompletion --   input: harry, output: harry potter...
but when the input is potter, the output is empty.

So what I want is to get harry potter ... when I type potter
into my search field!

Any idea?

I think the solution is a mix of TermsComponent and EdgeNGram, or not?

I am a little bit desperate, and there is too much information about
it in this forum =(


gwk-4 wrote:
 
 Hi,
 
 The autosuggest core is filled by a simple script (written in PHP) which 
 request facet values for all the possible strings one can search for and 
 adds them one by one as a document. Our case has some special issues due 
 to the fact that we search in multiple languages (Typing España will 
 suggest Spain and the other way around when on the Spanish site). We 
 have about 97500 documents yeilding approximately 12500 different 
 documents in our autosuggest-core and the autosuggest-update script 
 takes about 5 minutes to do a full re-index (all this is done on a 
 separate server and replicated so the indexing has no impact on the 
 performance of the site).
 
 Regards,
 
 gwk
 
 On 3/10/2010 3:09 PM, stocki wrote:
 okay. thx

 my suggestion run in another core;)

 do you distinct during the import with DIH ?

 
 
 

-- 
View this message in context: 
http://old.nabble.com/distinct-on-my-result-tp27849951p27864088.html
Sent from the Solr - User mailing list archive at Nabble.com.



Index size on disk

2010-03-11 Thread Tomas
Hello, I needed an easy way to see the index size (the actual size on disk, not
just the number of documents indexed) and as I didn't find anything for doing
that in the documentation or on the list, I coded a quick solution.

I added the Index size as a statistic of the searcher, that way the value can 
be seen on the statistics page of the Solr admin. To do this I modified the 
method 

public NamedList getStatistics() {... 

on the class 

org.apache.solr.search.SolrIndexSearcher

by adding the line

lst.add("indexSize", this.calculateIndexSize(reader.directory()).toString() + " MB");

and added the methods: 

  // Sum the length of every file in the index directory and report it in MB.
  private BigDecimal calculateIndexSize(Directory directory) {
    long size = 0L;
    try {
      for (String filePath : directory.listAll()) {
        size += directory.fileLength(filePath);
      }
    } catch (IOException e) {
      return new BigDecimal(-1);
    }
    return getSizeInMB(size, 2);
  }

  // Convert a size in bytes to megabytes with the given scale.
  private BigDecimal getSizeInMB(long size, int scale) {
    BigDecimal divisor = new BigDecimal(1024);
    BigDecimal sizeKb = new BigDecimal(size).divide(divisor, scale + 1, BigDecimal.ROUND_HALF_UP);
    return sizeKb.divide(divisor, scale, BigDecimal.ROUND_HALF_UP);
  }

I'm running Solr 1.4 on JBoss 4.0.5 with Java 1.5 and this worked just fine.
Does anyone see a potential problem with this?

I'm assuming that the Solr index will never have directories inside (that's why
I'm just looping over the index parent directory); is there any case where this is
not true?

Tomás




Call for presentations - Berlin Buzzwords - Summer 2010

2010-03-11 Thread Isabel Drost
Call for Presentations Berlin Buzzwords
 http://buzzwordsberlin.de
  Berlin Buzzwords 2010 - Search, Store, Scale
   7/8 June 2010


This is to announce the Berlin Buzzwords 2010. The first conference on scalable 
and open search, data processing and data storage in Germany, taking place in 
Berlin.

The event will comprise presentations on scalable data processing. We invite 
you 
to submit talks on the topics:

Information retrieval / Search - Lucene, Solr, katta or comparable solutions
NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking 
for presentations on the implementation of the systems themselves, real world 
applications and case studies. 

Important Dates (all dates in GMT +2):

Submission deadline: April 17th 2010, 23:59
Notification of accepted speakers: May 1st, 2010. 
Publication of final schedule: May 9th, 2010. 
Conference: June 7/8. 2010.

High quality, technical submissions are called for, ranging from principles to 
practice. We are looking for real world use cases, background on the 
architecture of specific projects and a deep dive into architectures built on 
top of e.g. Hadoop clusters. 

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp no later 
than April 17th, 2010. Acceptance notifications will be sent out on May 1st. 
Please include your name, bio and email, the title of the talk, a brief 
abstract 
in English language. Please indicate whether you want to give a short (30min) 
or 
long (45min) presentation and indicate the level of experience with the topic 
your audience should have (e.g. whether your talk will be suitable for newbies 
or is targeted for experienced users.)

The presentation format is short: either 30 or 45 minutes including questions. 
We will be enforcing the schedule rigorously. 

If you are interested in sponsoring the event (e.g. we would be happy to 
provide 
videos after the event, free drinks for attendees as well as an after-show 
party), please contact us. 

Follow @hadoopberlin on Twitter for updates. News on the conference will be 
published on our website at http://berlinbuzzwords.de

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Schedule and further updates on the event will be published on 
http://berlinbuzzwords.de Please re-distribute this CfP to people who might be 
interested.

Contact us at: 
newthinking communications GmbH
Schönhauser Allee 6/7
10119 Berlin, Germany
Andreas Gebhard a...@newthinking.de
Isabel Drost i...@newthinking.de
+49(0)30-9210 596




Solr Performance Issues

2010-03-11 Thread Siddhant Goel
Hi everyone,

I have an index corresponding to ~2.5 million documents. The index size is
43GB. The configuration of the machine which is running Solr is - Dual
Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache, 8GB
RAM, and 250 GB HDD.

I'm observing a strange trend in the queries that I send to Solr. The query
times for queries that I send earlier are much lower than for the queries I send
afterwards. For instance, if I write a script to query Solr 5000 times (with
5000 distinct queries, most of them containing not more than 3-5 words) with
10 threads running in parallel, the average time for queries goes from
~50ms in the beginning to ~6000ms. Is this expected, or is there something
wrong with my configuration? Currently I've configured the queryResultCache
and the documentCache to contain 2048 entries (hit ratios for both are close
to 50%).
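
(The relevant pieces of my solrconfig.xml look roughly like this; the size of 2048 is
the part I actually changed, the other attributes are whatever defaults I left in place:)

<queryResultCache class="solr.LRUCache" size="2048" initialSize="512" autowarmCount="256"/>
<documentCache class="solr.LRUCache" size="2048" initialSize="512"/>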

Apart from this, a general question I want to ask is whether such
hardware is enough for this scenario. I'm aiming at achieving around 20 queries
per second with the hardware mentioned above.

Thanks,

Regards,

-- 
- Siddhant


Re: mincount doesn't work with FacetQuery

2010-03-11 Thread Chris Hostetter

: I'm faceting with a query range (with addFacetQuery) and setting mincount to
: 10 (with setFacetMinCount(10)), but Solr is not respecting this mincount;
: it's still giving me all responses, even those having less than 10 retrieved
: documents.

If by all responses you mean all facet queries then that is the
correct behavior -- facet.mincount is a param that affects facet.field,
not facet.query.

The documentation notes this, in that all of the params are divided by 
section...

   http://wiki.apache.org/solr/SimpleFacetParameters

...if you'd like to open a feature request, it would be fairly easy to 
make facet.query (and facet.date) consider facet.mincount as well.


-Hoss



Re: Solr Performance Issues

2010-03-11 Thread Erick Erickson
How many outstanding queries do you have at a time? Is it possible
that when you start, you have only a few queries executing concurrently
but as your test runs you have hundreds?

This really is a question of how your load test is structured. You might
get a better sense of how it works if your tester had a limited number
of threads running so the max concurrent requests SOLR was serving
at once were capped (30, 50, whatever).

But no, I wouldn't expect SOLR to bog down the way you're describing
just because it was running for a while.

HTH
Erick

On Thu, Mar 11, 2010 at 9:39 AM, Siddhant Goel siddhantg...@gmail.comwrote:

 Hi everyone,

 I have an index corresponding to ~2.5 million documents. The index size is
 43GB. The configuration of the machine which is running Solr is - Dual
 Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache, 8GB
 RAM, and 250 GB HDD.

 I'm observing a strange trend in the queries that I send to Solr. The query
 times for queries that I send earlier is much lesser than the queries I
 send
 afterwards. For instance, if I write a script to query solr 5000 times
 (with
 5000 distinct queries, most of them containing not more than 3-5 words)
 with
 10 threads running in parallel, the average times for queries goes from
 ~50ms in the beginning to ~6000ms. Is this expected or is there something
 wrong with my configuration. Currently I've configured the queryResultCache
 and the documentCache to contain 2048 entries (hit ratios for both is close
 to 50%).

 Apart from this, a general question that I want to ask is that is such a
 hardware enough for this scenario? I'm aiming at achieving around 20
 queries
 per second with the hardware mentioned above.

 Thanks,

 Regards,

 --
 - Siddhant



Content Highlighting

2010-03-11 Thread Lee Smith
With the highlighting options, will Solr highlight the found text, something like
Google search does?

I can't seem to get this working.

Hope someone can advise.




Re: Content Highlighting

2010-03-11 Thread Erick Erickson
Please see:
http://wiki.apache.org/solr/UsingMailingLists

http://wiki.apache.org/solr/UsingMailingListsand repost with additional
information.

Best
Erick

On Thu, Mar 11, 2010 at 10:10 AM, Lee Smith l...@weblee.co.uk wrote:

 With the highlighting options will Solr highlight the found text something
 like google search does ?

 I cant seem to get this working ?

 Hope someone can advise.





release schedule?

2010-03-11 Thread Harold Ship
Hello

 

I'm new to this list, so please excuse me if I'm asking in the wrong
place. 

 

I have been tasked with planning the next release of our software.
Today, we are using Solr 1.4.0, and we plan to release a new version of
our software later this year.

 

I would like to know, if possible:

-  Are there any planned Solr releases for this year?

-  What are the planned release dates/contents, etc.?

-  Are there any beta releases to work with in the meantime?

 

Thank you,

Harold Ship

NGSoft



Re: Call for presentations - Berlin Buzzwords - Summer 2010

2010-03-11 Thread Isabel Drost
On 11.03.2010 Isabel Drost wrote:
 Call for Presentations Berlin Buzzwords

It should have been http://berlinbuzzwords.de of course...

Isabel




Re: distinct on my result

2010-03-11 Thread stocki

Hey,

okay, I'll show you my settings ;)
I use an extra core with the standard request handler.


SCHEMA.XML
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text" indexed="true" stored="true" required="true"/>
<field name="suggest" type="autocomplete" indexed="true" stored="true" multiValued="true"/>
<copyField source="name" dest="suggest"/>

So I copy my names to the field suggest and use the EdgeNGramFilter and some
other filters:

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>


So with this config I get the results above ...

Maybe I have too many filters ;) ?!



gwk-4 wrote:
 
 Hi,
 
 I'm no expert on the full-text search features of Solr but I guess that 
 has something to do with your fieldtype, or query. Are you using the 
 standard request handler or dismax for your queries? And what analysers 
 are you using on your product name field?
 
 Regards,
 
 gwk
 
 On 3/11/2010 3:24 PM, stocki wrote:
 okay.
 we have a lot of products and i just importet the name of each product to
 a
 core.
 make an edgengram to this and my autoCOMPLETION runs.

 but i want an auto-suggestion:

 example.

 autoCompletion--I: harry O: harry potter...
 but when the input ist --  I. potter -- O: /

 so what i want is, that i get harry potter ... when i tipping potter
 into my search field!

 any idea ?

 i think the solution is a mixe of termsComponent and EdgeNGram or not ?

 i am a little bit despair, and in this forum are too many information
 about
 it =(


 gwk-4 wrote:

 Hi,

 The autosuggest core is filled by a simple script (written in PHP) which
 request facet values for all the possible strings one can search for and
 adds them one by one as a document. Our case has some special issues due
 to the fact that we search in multiple languages (Typing España will
 suggest Spain and the other way around when on the Spanish site). We
 have about 97500 documents yeilding approximately 12500 different
 documents in our autosuggest-core and the autosuggest-update script
 takes about 5 minutes to do a full re-index (all this is done on a
 separate server and replicated so the indexing has no impact on the
 performance of the site).

 Regards,

 gwk

 On 3/10/2010 3:09 PM, stocki wrote:

 okay. thx

 my suggestion run in another core;)

 do you distinct during the import with DIH ?






 
 
 

-- 
View this message in context: 
http://old.nabble.com/distinct-on-my-result-tp27849951p27865058.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Performance Issues

2010-03-11 Thread Siddhant Goel
Hi Erick,

The way the load test works is that it picks up 5000 queries, splits them
according to the number of threads (so if we have 10 threads, it schedules
10 threads - each one sending 500 queries). So it might be possible that the
number of queries at a point later in time is greater than the number of
queries earlier in time. I'm not very sure about that though. It's a simple
Ruby script that starts up threads, calls the search function in each
thread, and then waits for each of them to exit.

How many queries per second can we expect Solr to serve, given this kind of
hardware? If what you suggest is true, then is it possible that while Solr
is serving a query, another query hits it, which increases the response time
even further? I'm not sure about it. But yes I can observe the query times
going up as I increase the number of threads.

Thanks,

Regards,

On Thu, Mar 11, 2010 at 8:30 PM, Erick Erickson erickerick...@gmail.comwrote:

 How many outstanding queries do you have at a time? Is it possible
 that when you start, you have only a few queries executing concurrently
 but as your test runs you have hundreds?

 This really is a question of how your load test is structured. You might
 get a better sense of how it works if your tester had a limited number
 of threads running so the max concurrent requests SOLR was serving
 at once were capped (30, 50, whatever).

 But no, I wouldn't expect SOLR to bog down the way you're describing
 just because it was running for a while.

 HTH
 Erick

 On Thu, Mar 11, 2010 at 9:39 AM, Siddhant Goel siddhantg...@gmail.com
 wrote:

  Hi everyone,
 
  I have an index corresponding to ~2.5 million documents. The index size
 is
  43GB. The configuration of the machine which is running Solr is - Dual
  Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache,
 8GB
  RAM, and 250 GB HDD.
 
  I'm observing a strange trend in the queries that I send to Solr. The
 query
  times for queries that I send earlier is much lesser than the queries I
  send
  afterwards. For instance, if I write a script to query solr 5000 times
  (with
  5000 distinct queries, most of them containing not more than 3-5 words)
  with
  10 threads running in parallel, the average times for queries goes from
  ~50ms in the beginning to ~6000ms. Is this expected or is there something
  wrong with my configuration. Currently I've configured the
 queryResultCache
  and the documentCache to contain 2048 entries (hit ratios for both is
 close
  to 50%).
 
  Apart from this, a general question that I want to ask is that is such a
  hardware enough for this scenario? I'm aiming at achieving around 20
  queries
  per second with the hardware mentioned above.
 
  Thanks,
 
  Regards,
 
  --
  - Siddhant
 




-- 
- Siddhant


Re: distinct on my result

2010-03-11 Thread gwk

Hi,

Try replacing KeywordTokenizerFactory with a WhitespaceTokenizerFactory 
so it'll create separate terms per word. After a reindex it should work.
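
In other words, only the tokenizer line of your analyzers changes, something like
(the rest of your filters stay as they are):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100"/>
  <!-- ... remaining filters unchanged ... -->
</analyzer>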


Regards,

gwk

On 3/11/2010 4:33 PM, stocki wrote:

hey,

okay i show your my settings ;)
i use an extra core with the standard requesthandler.


SCHEMA.XML
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text" indexed="true" stored="true" required="true"/>
<field name="suggest" type="autocomplete" indexed="true" stored="true" multiValued="true"/>
<copyField source="name" dest="suggest"/>

so i copy my names to the field suggest and use the EdgeNGramFilter and some
others

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>


so with this konfig i get the results above ...

maybe i have t many filters ;) ?!



gwk-4 wrote:
   

Hi,

I'm no expert on the full-text search features of Solr but I guess that
has something to do with your fieldtype, or query. Are you using the
standard request handler or dismax for your queries? And what analysers
are you using on your product name field?

Regards,

gwk

On 3/11/2010 3:24 PM, stocki wrote:
 

okay.
we have a lot of products and i just importet the name of each product to
a
core.
make an edgengram to this and my autoCOMPLETION runs.

but i want an auto-suggestion:

example.

autoCompletion-- I: harry O: harry potter...
but when the input ist --   I. potter -- O: /

so what i want is, that i get harry potter ... when i tipping potter
into my search field!

any idea ?

i think the solution is a mixe of termsComponent and EdgeNGram or not ?

i am a little bit despair, and in this forum are too many information
about
it =(


gwk-4 wrote:

   

Hi,

The autosuggest core is filled by a simple script (written in PHP) which
request facet values for all the possible strings one can search for and
adds them one by one as a document. Our case has some special issues due
to the fact that we search in multiple languages (Typing España will
suggest Spain and the other way around when on the Spanish site). We
have about 97500 documents yeilding approximately 12500 different
documents in our autosuggest-core and the autosuggest-update script
takes about 5 minutes to do a full re-index (all this is done on a
separate server and replicated so the indexing has no impact on the
performance of the site).

Regards,

gwk

On 3/10/2010 3:09 PM, stocki wrote:

 

okay. thx

my suggestion run in another core;)

do you distinct during the import with DIH ?






Re: Snapshot / Distribution Process

2010-03-11 Thread Bill Au
Have you started rsyncd on the master?  Make sure that it is enabled before
you start:

http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

You can also tried running snappuller with the -V option to et more
debugging info.

Bill

On Wed, Mar 10, 2010 at 4:09 PM, Lars R. Noldan l...@sixfeetup.com wrote:

 Is anyone aware of a comprehensive guide for setting up the Snapshot
 Distribution process on Solr 1.3?

 I'm working through:
 http://wiki.apache.org/solr/CollectionDistribution#The_Snapshot_and_Distribution_Process

 And have run into a roadblock where the solr/bin/snappuller finds the
 appropriate snapshot, but rsync fails.  (according to the logs.)

 Any guidance you can provide, even if it's asking for additional
 troubleshooting information is welcome and appreciated.

 Thanks
 Lars
 --
 l...@sixfeetup.com | +1 (317) 861-5948 x609
 six feet up presents INDIGO : The Help Line for Plone
 More info at http://sixfeetup.com/indigo or call +1 (866) 749-3338


What does means ~2, ~3, ~4 in DisjunctionMaxQuery?

2010-03-11 Thread Marc Sturlese

I am debugging a two-word query built using dismax. So it's built from
DisjunctionMaxQueries, with minShouldMatch 100% and tie-breaker
multiplier = 0.3:

+((DisjunctionMaxQuery((content:john | title:john)~0.3)
DisjunctionMaxQuery((content:malone | title:malone)~0.3))~2)

And a three-word one (with the same tie and mm):
+((DisjunctionMaxQuery((content:john^3.0 | region:john)~0.3)
DisjunctionMaxQuery((content:malone^3.0 | region:malone)~0.3)
DisjunctionMaxQuery((content:lawyer^3.0 | region:lawyer)~0.3))~3)

I have tried to read the Lucene documentation of DisjunctionMaxQuery carefully,
but can't find what ~2 (for the first query) and ~3 (for the second query)
do. If I search for 4 words it will be ~4.

I know ~ is used to specify the slop in phrase queries. Does it mean any
sort of slop here in the DisjunctionMaxQueries?

Thanks in advance
-- 
View this message in context: 
http://old.nabble.com/What-does-means-%7E2%2C-%7E3%2C-%7E4-in-DisjunctionMaxQuery--tp27866033p27866033.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: What does means ~2, ~3, ~4 in DisjunctionMaxQuery?

2010-03-11 Thread Erik Hatcher


On Mar 11, 2010, at 11:42 AM, Marc Sturlese wrote:

I am debuggin a 2 words query build using dismax. So it's build from
DisjunctionMaxQueries being the minShouldMatch 100% and tie breaker
multiplier = 0.3

+((DisjunctionMaxQuery((content:john | title:john~0.3)
DisjunctionMaxQuery((content:malone | title:malone)~0.3))~2)


the ~2 is BooleanQuery's way of printing the minimum number should
match value.
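
A rough stand-alone illustration of where that number lives (plain Lucene API,
not the dismax plugin code itself):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.DisjunctionMaxQuery;
  import org.apache.lucene.search.TermQuery;

  // one DisjunctionMaxQuery per user term, tie breaker 0.3
  DisjunctionMaxQuery john = new DisjunctionMaxQuery(0.3f);
  john.add(new TermQuery(new Term("content", "john")));
  john.add(new TermQuery(new Term("title", "john")));

  BooleanQuery bq = new BooleanQuery();
  bq.add(john, BooleanClause.Occur.SHOULD);
  // ... one SHOULD clause per user term ...
  bq.setMinimumNumberShouldMatch(2);  // this is what toString() prints as ~2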




And a 3 words one (with same tie and mm):
+((DisjunctionMaxQuery((content:john^3.0 | region:john)~0.3)
DisjunctionMaxQuery((content:malone^3.0 | region:malone)~0.3)
DisjunctionMaxQuery((content:lawyer^3.0 | region:lawyer)~0.3))~3)


And likewise for ~3 here.  It's being computed based on the mm  
parameter you're providing, which is 100%.


I know ~ its used to specify the slop in phrase queries. Does it  
means any

sort of slope here in the DisjunctionMaxQueries??


It's actually purely on the BooleanQuery for that factor.

Erik



Re: Snapshot / Distribution Process

2010-03-11 Thread Chris Hostetter

: Subject: Snapshot / Distribution Process
: In-Reply-To: 27854256.p...@talk.nabble.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss



Re: HTMLStripTransformer not working with data importer

2010-03-11 Thread James Ostheimer
Hi-

I can't seem to make any of the transformers work. I am using the
DataImporter to pull in data from a WordPress instance (see below).  Neither
REGEX nor HTMLStrip seems to do anything to my content.

Do I have to include a separate jar with the transformers?  Are the
transformers in 1.4 (particularly the HTMLStrip)?

James

On Wed, Mar 10, 2010 at 10:47 PM, James Ostheimer james.osthei...@gmail.com
 wrote:

 HI-

 I am working a contract to index some wordpress data.  For the posts I of
 course have html in the content of the column, I'd like to strip it out.
  Here is my data importer config

 <dataConfig>
   <dataSource driver="com.mysql.jdbc.Driver"
               url="jdbc:mysql://localhost:3306/econetsm" user="***" password="***"/>
   <document>
     <entity name="post" transformer="HTMLStripTransformer"
             query="SELECT id, post_content, post_title FROM elinstmkting_posts e"
             onError="abort"
             deltaQuery="SELECT * FROM elinstmkting_posts e where post_modified_gmt &gt; '${dataimporter.last_index_time}'">
       <field column="POST_TITLE" name="post_title" stripHTML="false"/>
       <field column="POST_CONTENT" name="post_content" stripHTML="true"/>
     </entity>
   </document>
 </dataConfig>

 Looks perfect according to the wiki docs, but the HTML is found when I
 search for strong (the <strong> tag) and HTML is returned in the field.

 I assume I am doing something stupid wrong, I am using the latest stable
 solr (1.4.0).

 Does it matter that the post data is not a complete HTML document (it
 doesn't have an <html> start tag or a <body> tag)?

 James



How to edit / compile the SOLR source code

2010-03-11 Thread JavaGuy84

Hi,

Sorry for asking this very simple question but I am very new to SOLR and I
want to play with its source code.

As an initial step I have a requirement to enable wildcard search (*text) in
SOLR. I am trying to figure out a way to import the complete SOLR build to
Eclipse and edit QueryParsing.java file but I am not able to import (I tried
to import with ant project in Eclipse and selected the build.xml file and
got an error stating javac is not present in the build.xml file).

Can someone help me out with the initial steps on how to import / edit /
compile / test the SOLR source?

Thanks a lot for your help!!!

Thanks,
B
-- 
View this message in context: 
http://old.nabble.com/How-to-edit---compile-the-SOLR-source-code-tp27866410p27866410.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Architectural help

2010-03-11 Thread Chris Hostetter

: We have about 5M records ranging in size all coming from a DB source (only 2
: tables). What will be the most efficient way of indexing all of these
: documents? I am looking at DIH but before I go down that road I wanted to

The main question to ask yourself is what your indexing freshness 
requirements are.  

If you have a small amount of data, or if a large percentage of your data
is changing all the time, and you can tolerate lag in how quickly updates
to your data make it into the index, then doing complete re/full-builds
(with DIH or anything else) periodically is certainly the simplest way to
go.

If you have a lot of data, or a small percentage of your data is changing 
within the largest interval of time you are willing to wait before your 
index is updated, then a batch delta indexing approach like DIH's 
deltaQuery provides is only a little bit more effort on top of 
implementing fullbuilds.
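
In DIH terms that's roughly an entity like the following (a sketch: the table and column
names are placeholders, and it assumes the Solr 1.4 deltaImportQuery attribute):

<entity name="item"
        query="select * from item"
        deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where id='${dataimporter.delta.id}'"/>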

if you really need your index to be updated as soon as the authoritative 
data changes, then having your publishing flow immediately make changes to 
the index by pushing it over HTTP to the /update API is probably your best 
bet.



-Hoss



Re: Scaling indexes with high document count

2010-03-11 Thread Chris Hostetter

: I wonder if anyone might have some insight/advice on index scaling for high
: document count vs size deployments...

Your general approach sounds reasonable, although specifics of how you'll 
need to tune the caches and how much hardware you'll need will largely 
depend on the specifics of the data and the queries.

I'm not sure what you mean by this though...


: As searching would always be performed on replicas - the indexing cores
: wouldn't be tuned with much autowarming/read cache, but have loads of
: 'maxdocs' cache. The searchers would be the other way 'round - lots of

what do you mean by 'maxdocs' cache ?



-Hoss



issue with delete index

2010-03-11 Thread muneeb

Hi,

I have made some changes to my schema, including setting of omitNorms to
false for a few fields. I am using Solr1.4 with SolrJ client. I deleted my
index using the client:

solrserver.deleteByQuery("*:*");
solrserver.optimize();

But after reindexing and running the queries I don't see any difference in
query results, as if it didn't take the 'omitNorms' setting into consideration.

Can anyone tell me how to delete the index entirely so that new changes can
take place?

Thanks!!
-M
-- 
View this message in context: 
http://old.nabble.com/issue-with-delete-index-tp27866630p27866630.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: embedded server / servlet container

2010-03-11 Thread Chris Hostetter

: I am trying to provide an embedded server to a web application deployed in a
: servlet container (like tomcat).

If you are trying to use Solr inside another webapp, my suggestion would
just be to incorporate the existing Solr servlets, JSPs, dispatch filter,
and web.xml specifics from Solr into your app, and let them do their own
thing -- it's going to make your life much easier from an upgrade
standpoint.

Better still: run solr.war as it's own webapp in the same servlet 
container.


-Hoss



Highlighting Results

2010-03-11 Thread Lee Smith
Hi All

I'm not sure where I'm going wrong, but highlighting does not seem to work for me.

I have indexed around 5000 PDF documents which went well.

Running normal queries against the attr_content works well.

When adding any hl code it does not seem to make a bit of difference.

Here is an example query:
?q=attr_content:Some Name&hl=true&hl.fl=attr_content&hl.fragsize=50&rows=5

If I am correct, fragsize should limit the returned content for
attr_content, and the keywords found in attr_content should be surrounded with
the <em> tags?

The attr_content field is stored, if this helps.

Hope someone can point me in the right direction.

Thank you if you can !




Re: How to edit / compile the SOLR source code

2010-03-11 Thread Trey
Yep, as you've discovered, the import from ant build file doesn't work for
the solr build.xml in eclipse.

There is an excellent how-to for getting Solr up and running in Eclipse for
debugging purposes here:
http://www.lucidimagination.com/developers/artiicles/setting-up-apache-solr-in-eclipse

Once you have the setup in place from the above tutorial, you can then go to
any of the Solr Jar files and attach the source, which will allow you to
debug into and modify the Solr code.  If you need to step into any of the
lucene code you'll have to pull it down separately, but you can attach the
same way.  The last step (after you've made your changes) is that you would
just need to rebuild with Ant (run ant from the directory containing the
build.xml file to see the build options for Solr).  I think that just
running ant example there should do the trick.

-Trey

On Thu, Mar 11, 2010 at 12:07 PM, JavaGuy84 bbar...@gmail.com wrote:


 Hi,

 Sorry for asking this very simple question but I am very new to SOLR and I
 want to play with its source code.

 As a initial step I have a requirement to enable wildcard search (*text) in
 SOLR. I am trying to figure out a way to import the complete SOLR build to
 Eclipse and edit QueryParsing.java file but I am not able to import (I
 tried
 to import with ant project in Eclipse and selected the build.xml file and
 got an error stating javac is not present in the build.xml file).

 Can someone help me out with the initial steps on how to import / edit /
 compile / test the SOLR source?

 Thanks a lot for your help!!!

 Thanks,
 B
 --
 View this message in context:
 http://old.nabble.com/How-to-edit---compile-the-SOLR-source-code-tp27866410p27866410.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: field length normalization

2010-03-11 Thread Siddhant Goel
Did you reindex after setting omitNorms to false? I'm not sure whether or
not it is needed, but it makes sense.

On Thu, Mar 11, 2010 at 5:34 PM, muneeb muneeba...@hotmail.com wrote:


 Hi,

 In my schema, the document title field has omitNorms=false, which, if I
 am
 not wrong, causes length of titles to be counted in the scoring.

 But when I query with: word1 word2 word3 I dont know why still the top
 two
 documents title have these words and other words, where as the document
 which has exact and only these query words is coming on third place.

 Setting omitNorms to false, should bring the titles with exact words on top
 shouldn't it?

 Also I realized when debugged query, that all three top documents have same
 score, shouldn't this be different as they have different title lengths?

 Thanks very much.
 -A
 --
 View this message in context:
 http://old.nabble.com/field-length-normalization-tp27862618p27862618.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
- Siddhant


Re: field length normalization

2010-03-11 Thread muneeb


: 
: Did you reindex after setting omitNorms to false? I'm not sure whether or
: not it is needed, but it makes sense.

Yes, I deleted the old index and reindexed it.
Just to add another fact: the titles' length is less than 10. I am not
sure if Solr has pre-set values for length normalization, because for
titles with 3 as well as 4 terms the fieldNorm is coming up as 0.5 (in the
debugQuery section).


-- 
View this message in context: 
http://old.nabble.com/field-length-normalization-tp27862618p27867025.html
Sent from the Solr - User mailing list archive at Nabble.com.



dismax and WordDelimiterFilterFactory with PreserveOriginal = 1

2010-03-11 Thread Ya-Wen Hsu
Hi all,

I'm facing the same issue as this previous post:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html. Since no
one answered that post, I thought I'd ask again. In my case, I use the setting below
for indexing:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="1"/>
and
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="1"/> for the query.

When I query with the word ain't, no result is returned. When I turned on the
logging, I found the word is interpreted as (ain't ain) t.

0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
0.0 = no match on required clause ((description:(ain't ain) t^2.0 | 
name:(ain't ain) t^3.0 | search_keywords:(ain't ain) t)~0.1)

Does anyone know why ain't is parsed as (ain't ain) t, and how to fix it so it
can match documents that include ain't in the name? Thanks in advance!

Wen



Profiling Solr

2010-03-11 Thread Jean-Sebastien Vachon
Hi,

I'm trying to identify the bottleneck, to get acceptable performance out of a single
shard containing 4.7 million documents, using my own machine (Mac Pro - Quad
Core with 8Gb of RAM, 4Gb of it allocated to the JVM).

I tried using YourKit but I don't get anything about Solr classes. I'm new to
YourKit so I might be doing something wrong, but it seems pretty
straightforward.

I am running Solr within a Tomcat instance within Eclipse. Does anyone have an 
idea about what could be wrong in my setup?

I'm making individual requests (one at a time) and the response times are 
horrible (about 15 sec on average). I need to bring this way below 1 second.

Here is a sample query:

http://localhost:8983/jobs_part3/select/?q=*:*&collapse=true&collapse.field=hash_id&facet=true&facet.field=county_id&facet.field=advertiser_id&facet.field=county_id&sort=county_id+asc&rows=100&collapse.type=adjacent

I know that collapsing results has a big hit on performance but it is a must 
have for us.

Thanks for any hints.

= JVM Parameters =

-Xms4g -Xmx4g -d64 -server


Re: Profiling Solr

2010-03-11 Thread Yonik Seeley
On Thu, Mar 11, 2010 at 1:11 PM, Jean-Sebastien Vachon
js.vac...@videotron.ca wrote:
 Hi,

 I'm trying to identify the bottleneck to get acceptable performance of a 
 single shard containing 4.7 millions of documents using my own machine (Mac 
 Pro - Quad Core with 8Gb of RAM with 4Gb allocated to the JVM).

 I tried using YourKit but I don't get anything about Solr classes.

Sometimes org.apache.* can be in the ignore list by default along
with java.*, I guess because people are looking for bottlenecks in
their own code and don't want to look into other libraries.

-Yonik
http://www.lucidimagination.com


Multi valued fields

2010-03-11 Thread Jean-Sebastien Vachon
Hi All,

I'd like to know if it is possible to do the following on a multi-value field:

Given the following data:

document A:  field1 = [A B C D]
document B:  field1 = [A B]
document C:  field1 = [A]

Can I build a query such as:

-field1:A

which will return all documents that do not have exclusively A in their
field's values. By exclusive I mean that I don't want documents that only have
A in their list of values. In my sample case, the query would return docs A and
B, because they both have other values in field1.

Is this kind of query possible with Solr/Lucene?

Thanks





Re: issue with delete index

2010-03-11 Thread Yonik Seeley
On Thu, Mar 11, 2010 at 12:22 PM, muneeb muneeba...@hotmail.com wrote:
 I have made some changes to my schema, including setting of omitNorms to
 false for a few fields. I am using Solr1.4 with SolrJ client. I deleted my
 index using the client:

 solrserver.deleteByQuery("*:*");
 solrserver.optimize();

Solr implements a *:* delete by removing the whole index, so this should have been fine.

 But after reindexing and running the queries i don't see any difference in
 query results, as if it didn't take 'omitNorms' settings into consideration.

Did you restart Solr so that the schema was re-read?

-Yonik
http://www.lucidimagination.com


Re: dismax and WordDelimiterFilterFactory with PreserveOriginal = 1

2010-03-11 Thread Yonik Seeley
On Thu, Mar 11, 2010 at 1:07 PM, Ya-Wen Hsu y...@eline.com wrote:
 Hi all,

 I'm facing the same issue as previous post here: 
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html. Since 
 no one answers this post, I thought I'll ask again. In my case, I use below 
 setting for index
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
  generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
  splitOnCaseChange="0" preserveOriginal="1"/>
  and
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
  generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
  splitOnCaseChange="0" preserveOriginal="1"/> for query.

 When I use query with word ain't, no result is returned. When I turned on 
 the logging, I found the word is interpreted as (ain't ain) t.


The problem is preserving the original in the query analyzer - try
removing that.  And if you aren't doing prefix or wildcard queries,
preserveOriginal doesn't buy you anything but wasted index space.
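
i.e. on the query side something like this (your own filter, just with preserveOriginal
dropped):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>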

It's the same issue of why you can't generate and catenate at the same
time with the query parser.

-Yonik
http://www.lucidimagination.com


Re: How to edit / compile the SOLR source code

2010-03-11 Thread Erick Erickson
See Trey's comment, but before you go there.

What about SOLR's wildcard searching capabilities aren't
working for you now? There are a couple of tricks for making
leading wildcard searches work quickly, but this is a solved
problem. Although whether the existing solutions work in
your situation may be an open question...

Or do you have to hack into the parser for other reasons?

Best
Erick

On Thu, Mar 11, 2010 at 12:07 PM, JavaGuy84 bbar...@gmail.com wrote:


 Hi,

 Sorry for asking this very simple question but I am very new to SOLR and I
 want to play with its source code.

 As a initial step I have a requirement to enable wildcard search (*text) in
 SOLR. I am trying to figure out a way to import the complete SOLR build to
 Eclipse and edit QueryParsing.java file but I am not able to import (I
 tried
 to import with ant project in Eclipse and selected the build.xml file and
 got an error stating javac is not present in the build.xml file).

 Can someone help me out with the initial steps on how to import / edit /
 compile / test the SOLR source?

 Thanks a lot for your help!!!

 Thanks,
 B
 --
 View this message in context:
 http://old.nabble.com/How-to-edit---compile-the-SOLR-source-code-tp27866410p27866410.html
 Sent from the Solr - User mailing list archive at Nabble.com.




RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
Thanks Robert,

I've been thinking about this since you suggested it on another thread.  One 
problem is that it would also remove real words. Apparently 40-60% of the words 
in large corpora occur only once 
(http://en.wikipedia.org/wiki/Hapax_legomenon.)  

There are a couple of use cases where removing words that occur only once might 
be a problem.  

One is for genealogical searches where a user might want to retrieve a document 
if their relative is only mentioned once in the document.  We have quite a few 
government documents and other resources such as the Lineage Book of the 
Daughters of the American Revolution.  

Another use case is humanities researchers doing phrase searching for quotes.  
In this case, if we remove one of the words in the quote because it occurs only 
once in a document, then the phrase search would fail.  For example if someone 
were searching Macbeth and entered the phrase query: Eye of newt and toe of 
frog it would fail if we had removed newt from the index because newt 
occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found out of 
about 5,000 unique words about 3,000 occurred only once.  Of these about 1,800 
were in the unix dictionary, so at least 1800 words that would be removed would 
be real words as opposed to OCR errors (a spot check of the words not in the 
unix /usr/share/dict/words file revealed most of them also as real words rather 
than OCR errors.)

I also ran a quick check against a document with bad OCR and out of about 
30,000 unique words, 20,000 occurred only once.  Of those 20,000 only about 300 
were in the unix dictionary so your intuition that a lot of OCR errors will 
occur only once seems spot on.  A quick look at the words not in the dictionary 
revealed a mix of technical terms, common names, and obvious OCR nonsense such 
as ffll.lj'slall'lm 

I guess the question I need to determine is whether the benefit of removing 
words that occur only once outweighs the costs in terms of the two use cases 
outlined above.   When we get our new test server set up, sometime in the next 
month, I think I will go ahead and prune a test index of 500K docs and do some 
performance testing just to get an idea of the potential performance gains of 
pruning the index.

I have some other questions about index pruning, but I want to do a bit more 
reading and then I'll post a question to either the Solr or Lucene list.  Can 
you suggest which list I should post an index pruning question to?

Tom








-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

 Can anyone suggest any practical solutions to removing some fraction of the 
 tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.


-- 
Robert Muir
rcm...@gmail.com


Re: dismax and WordDelimiterFilterFactory with PreserveOriginal = 1

2010-03-11 Thread Erick Erickson
Kind of a shot in the dark here, but your parameters for index and query on
WordDelimiterFilterFactory are different, especially suspicious is
catenateWords.

You could test this by looking in your index with the SOLR admin page and/or
Luke to see what your actual terms are.

And don't forget you'll have to re-index after restarting SOLR for any
index
changes to take effect

HTH
Erick

On Thu, Mar 11, 2010 at 2:20 PM, Ya-Wen Hsu y...@eline.com wrote:

 Yonik, thank you for your reply. When I don't use PreserveOriginal = 1 for
 WordDelimiterFilterFactory, the query ain't is parsed as ain t and no
 match is found in this case too. If I remove ' from the query, then I can
 get results. I used the analysis tool and see the term ain't is processed as
 ain t, and get matches when the title includes ain't. But I got no
 result when using ain't query with dismax.

 The debug output looks like:
 (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
 +(long_description:ain t^2.0 | name:ain t^3.0 | search_keywords:ain
 t)~0.1 (long_description:save^2.0 | name:save^3.0 |
 search_keywords:saved)~0.1) ()


 Below is my configuration for text field type.

 fieldType name=text class=solr.TextField positionIncrementGap=100
  analyzer type=index
tokenizer class=solr.WhitespaceTokenizerFactory/
filter class=solr.SynonymFilterFactory synonyms=synonyms.txt
 ignoreCase=true expand=false/
filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt enablePositionIncrements=true /
filter class=solr.WordDelimiterFilterFactory
 generateWordParts=1 generateNumberParts=1 catenateWords=1
 catenateNumbers=1 catenateAll=0 splitOnCaseChange=0/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.EnglishPorterFilterFactory
 protected=protwords.txt/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
  /analyzer
  analyzer type=query
tokenizer class=solr.WhitespaceTokenizerFactory/
!--filter class=solr.SynonymFilterFactory
 synonyms=synonyms.txt ignoreCase=true expand=true/--
filter class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt/
filter class=solr.WordDelimiterFilterFactory
 generateWordParts=1 generateNumberParts=1 catenateWords=0
 catenateNumbers=0 catenateAll=0 splitOnCaseChange=0/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.EnglishPorterFilterFactory
 protected=protwords.txt/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
  /analyzer
/fieldType


 I get results back when I tried to use solr.LowerCaseTokenizerFactory
 instead of solr.WhitespaceTokenizerFactory. However, the concern here is that
 this might reduce the relevance quality of search results. Does anyone have a better
 idea on what to try next? Thanks!

 Wen
 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Thursday, March 11, 2010 10:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: dismax and WordDelimiterFilterFactory with PreserveOriginal =
 1

 On Thu, Mar 11, 2010 at 1:07 PM, Ya-Wen Hsu y...@eline.com wrote:
  Hi all,
 
  I'm facing the same issue as previous post here:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg19511.html.
 Since no one answers this post, I thought I'll ask again. In my case, I use
 below setting for index
  filter class=solr.WordDelimiterFilterFactory generateWordParts=1
 generateNumberParts=1 catenateWords=1 catenateNumbers=1
 catenateAll=0 splitOnCaseChange=0 preserveOriginal=1/
  and
  filter class=solr.WordDelimiterFilterFactory generateWordParts=1
 generateNumberParts=1 catenateWords=0 catenateNumbers=0
 catenateAll=0 splitOnCaseChange=0 preserveOriginal=1/ for query.
 
  When I use query with word ain't, no result is returned. When I turned
 on the logging, I found the word is interpreted as (ain't ain) t.


 The problem is preserving the original in the query analyzer - try
 removing that.  And if you aren't doing prefix or wildcard queries,
 preserveOriginal doesn't buy you anything but wasted index space.

 It's the same issue of why you can't generate and catenate at the same
 time with the query parser.

 -Yonik
 http://www.lucidimagination.com



Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Robert,

 I've been thinking about this since you suggested it on another thread.  One 
 problem is that it would also remove real words. Apparently 40-60% of the 
 words in large corpora occur only once 
 (http://en.wikipedia.org/wiki/Hapax_legomenon.)


You are correct. I really hate recommending you 'remove data', but at
the same time, as perhaps an intermediate step, this could be a
brutally simple approach to move you along.


 I guess the question I need to determine is whether the benefit of removing 
 words that occur only once outweighs the costs in terms of the two use cases 
 outlined above.   When we get our new test server set up, sometime in the 
 next month, I think I will go ahead and prune a test index of 500K docs and 
 do some performance testing just to get an idea of the potential performance 
 gains of pruning the index.

Well, one thing I did with Andrzej's patch is immediately
relevance-test this approach against some corpora I had. The results
are on the JIRA issue, and the test collection itself is in
openrelevance.

In my opinion the P@N is probably overstated, and the MAP values are
probably understated (due to it being a pooled relevance collection),
but I think it's fair to say for that specific large text collection,
pruning terms that only appear in the document a single time does not
hurt relevance.

At the same time I will not dispute that it could actually help P@N, I
am just saying I'm not sold :)

Either way it's extremely interesting: cut your index size in half, and
get the same relevance!


 I have some other questions about index pruning, but I want to do a bit more 
 reading and then I'll post a question to either the Solr or Lucene list.  Can 
 you suggest which list I should post an index pruning question to?


I would recommend posting it to the JIRA issue:
http://issues.apache.org/jira/browse/LUCENE-1812

This way someone who knows more (Andrzej) could see it, too.


-- 
Robert Muir
rcm...@gmail.com


Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

Thanks Simon,

We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
looking for unlikely mixes of unicode character blocks.  For example some of
the CJK material ends up with Cyrillic characters. (except we would have to
watch out for any Russian-Chinese dictionaries:)
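
Something along these lines might be a starting point (a rough, untested
sketch; the thresholds are arbitrary):

    // Reject tokens with long punctuation runs or an unlikely mix of
    // letters, digits and punctuation. Thresholds are arbitrary guesses.
    public class OcrNoiseHeuristic {
      public static boolean looksLikeOcrNoise(String token) {
        if (token.matches(".*\\p{Punct}{3,}.*")) {
          return true;                              // e.g. runs of quotes or dots
        }
        int letters = 0, digits = 0, punct = 0;
        for (int i = 0; i < token.length(); i++) {
          char c = token.charAt(i);
          if (Character.isLetter(c)) letters++;
          else if (Character.isDigit(c)) digits++;
          else if (!Character.isWhitespace(c)) punct++;
        }
        // letters, digits and several punctuation marks mixed in one token
        return letters > 0 && digits > 0 && punct > 1;
      }
    }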

Tom



 
 
 There wasn't any completely satisfactory solution; there were a large
 number
 of two and three letter n-grams so we were able to use a dictionary
 approach
 to eliminate those (names tend to be longer).  We also looked for runs of
 punctuation,  unlikely mixes of alpha/numeric/punctuation, and also
 eliminated longer words which consisted of runs of not-ocurring-in-English
 bigrams.
 
 Hope this helps
 
 -Simon
 

 --

 
 

-- 
View this message in context: 
http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27869940.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West tburtonw...@gmail.com wrote:

 Thanks Simon,

 We can probably implement your suggestion about runs of punctuation and
 unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
 looking for unlikely mixes of unicode character blocks.  For example some of
 the CJK material ends up with Cyrillic characters. (except we would have to
 watch out for any Russian-Chinese dictionaries:)


Ok this is a new one for me, I am just curious, have you figured out
why this is happening?

Separately, I would love to know some sort of character frequency data
for your non-English text. Are you OCR'ing that data too? Are you
using Unicode normalization or anything to prevent an explosion of terms
that are really the same?

-- 
Robert Muir
rcm...@gmail.com


Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter

: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
: looking for unlikely mixes of unicode character blocks.  For example some of
: the CJK material ends up with Cyrillic characters. (except we would have to
: watch out for any Russian-Chinese dictionaries:)

Since you are dealing with multiple languages, and multiple variant usages 
of languages (ie: olde English), I wonder if one way to try and generalize 
the idea of unlikely letter combinations into a math problem (instead of a 
grammar/spelling problem) would be to score all the hapax legomenon 
words in your index based on the frequency of (character) N-grams in 
each of those words, relative to the entire corpus, and then eliminate any of 
the hapax legomenon words whose score is below some cut-off threshold 
(that you'd have to pick arbitrarily, probably by eyeballing the sorted 
list of words and their contexts to decide if they are legitimate)

?


-Hoss



Re: Cleaning up dirty OCR

2010-03-11 Thread Walter Underwood
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:

 I wonder if one way to try and generalize 
 the idea of unlikely letter combinations into a math problem (instead of 
 grammer/spelling problem) would be to score all the hapax legomenon 
 words in your index


Hmm, how about a classifier? Common words are the yes training set, hapax 
legomenons are the no set, and n-grams are the features.

But why isn't the OCR program already doing this?

wunder






RE: Scaling indexes with high document count

2010-03-11 Thread Peter S

Hi,

 

Thanks for your reply (and apologies for the orig msg being sent multiple times 
to the list - googlemail problems).

 

I actually meant to put 'maxBufferedDocs'. I admit I'm not that familiar with 
this parameter, but as I understand it, it is the number of documents that are 
held in RAM before flushing to disk. I've noticed that ramBufferSizeMB is a 
similar parameter, but using memory as the threshold rather than the number of docs.
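
For context, these knobs live in the indexDefaults section of solrconfig.xml;
the values below are placeholders rather than recommendations:

    <indexDefaults>
      <ramBufferSizeMB>64</ramBufferSizeMB>          <!-- flush by buffered RAM -->
      <!-- <maxBufferedDocs>1000</maxBufferedDocs>        or flush by doc count -->
      <mergeFactor>25</mergeFactor>
    </indexDefaults>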

 

Is it best not to set these too high on indexers?

 

In my environment, all writes are done via SolrJ, where documents are placed in 
a SolrDocumentList and commit()ed when the list reaches 1000 (default value), 
or a configured commit thread interval is reached (default is 20s, whichever 
comes first). I suppose this is a SolrJ-side version of 'maxBufferedDocs', so 
maybe I don't need to set maxBufferedDocs in solrconfig? (the SolrJ 'client' is 
on the same machine as the index)
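
For reference, a rough SolrJ sketch of the batch-and-commit pattern I mean
(the client class and field names are illustrative, not my exact code):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
      public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

        for (int i = 0; i < 10000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          doc.addField("name", "example document " + i);
          batch.add(doc);

          if (batch.size() >= 1000) {   // send and commit in batches, not per doc
            server.add(batch);
            server.commit();
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {         // flush whatever is left over
          server.add(batch);
          server.commit();
        }
      }
    }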

 

For the indexer cores (essentially write-only indexes), I wasn't planning on 
configuring extra memory for read cache (Lucene value cache or filter cache), 
as no queries would/should be received on these. Should I reconsider this? 
There'll be plenty of RAM available for indexers to use and still leave enough 
for the OS file system cache to do its thing. Do you have any suggestions as to 
what would be the best way to use this memory to achieve optimal indexing 
speed? 

The main things I do now to tune for fast indexing are: 

 * committing lists of docs rather than each one separately

 * not optimizing too often

 * bump up the mergeFactor (I use a value of 25)

 

 

Many Thanks!

Peter

 

 

 
 Date: Thu, 11 Mar 2010 09:19:12 -0800
 From: hossman_luc...@fucit.org
 To: solr-user@lucene.apache.org
 Subject: Re: Scaling indexes with high document count
 
 
 : I wonder if anyone might have some insight/advice on index scaling for high
 : document count vs size deployments...
 
 Your general approach sounds reasonable, although specifics of how you'll 
 need to tune the caches and how much hardware you'll need will largely 
 depend on the specifics of the data and the queries.
 
 I'm not sure what you mean by this though...
 
 
 : As searching would always be performed on replicas - the indexing cores
 : wouldn't be tuned with much autowarming/read cache, but have loads of
 : 'maxdocs' cache. The searchers would be the other way 'round - lots of
 
 what do you mean by 'maxdocs' cache ?
 
 
 
 -Hoss
 
  

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

Interesting.  I wonder though if we have 4 million English documents and 250
in Urdu, if the Urdu words would score badly when compared to ngram
statistics for the entire corpus.  


hossman wrote:
 
 
 
 Since you are dealing with multiple langugaes, and multiple varient usages 
 of langauges (ie: olde english) I wonder if one way to try and generalize 
 the idea of unlikely letter combinations into a math problem (instead of 
 grammer/spelling problem) would be to score all the hapax legomenon 
 words in your index based on the frequency of (character) N-grams in 
 each of those words, relative the entire corpus, and then eliminate any of 
 the hapax legomenon words whose score is below some cut off threshold 
 (that you'd have to pick arbitrarily, probably by eyeballing the sorted 
 list of words and their contexts to deide if they are legitimate)
 
   ?
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871353.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West

We've been thinking about running some kind of a classifier against each book
to select books with a high percentage of dirty OCR for some kind of special
processing.  Haven't quite figured out a multilingual feature set yet other
than the punctuation/alphanumeric and character block ideas mentioned above.   

I'm not sure I understand your suggestion. Since real word hapax legomenons
are generally pretty common (maybe 40-60% of unique words) wouldn't  using
them as the no set provide mixed signals to the classifier?

Tom


Walter Underwood-2 wrote:
 
 
 Hmm, how about a classifier? Common words are the yes training set,
 hapax legomenons are the no set, and n-grams are the features.
 
 But why isn't the OCR program already doing this?
 
 wunder
 
 
 
 
 
 

-- 
View this message in context: 
http://old.nabble.com/Cleaning-up-dirty-OCR-tp27840753p27871444.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: field length normalization

2010-03-11 Thread Jay Hill
The fieldNorm is computed like this: fieldNorm = lengthNorm * documentBoost
* documentFieldBoosts

and the lengthNorm is: lengthNorm  =  1/(numTermsInField)**.5
[note that the value is encoded as a single byte, so there is some precision
loss]

So the values are not pre-set for the lengthNorm, but for some term counts the
lengthNorm value winds up being the same because of the precision loss. Here
is a list of lengthNorm values for 1 to 10 term fields:

# of terms    lengthNorm
   1  1.0
   2 .625
   3 .5
   4 .5
   5 .4375
   6 .375
   7 .375
   8 .3125
   9 .3125
  10 .3125

That's why, in your example, the lengthNorm for 3 and 4 is the same.
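
If you want to verify the single-byte encoding for yourself, here is a quick
sketch against the Lucene 2.9-era API (the field name is arbitrary):

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Similarity;

    public class LengthNormTable {
      public static void main(String[] args) {
        DefaultSimilarity sim = new DefaultSimilarity();
        for (int terms = 1; terms <= 10; terms++) {
          float raw = sim.lengthNorm("title", terms);                        // 1/sqrt(numTerms)
          float stored = Similarity.decodeNorm(Similarity.encodeNorm(raw));  // single-byte precision
          System.out.printf("%2d terms: raw=%.4f  stored=%.4f%n", terms, raw, stored);
        }
      }
    }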

-Jay
http://www.lucidimagination.com





On Thu, Mar 11, 2010 at 9:50 AM, muneeb muneeba...@hotmail.com wrote:



 :
 : Did you reindex after setting omitNorms to false? I'm not sure whether or
 : not it is needed, but it makes sense.

 Yes, I deleted the old index and reindexed it.
 Just to add another fact: the titles' length is less than 10. I am not
 sure if solr has pre-set values for length normalizations, because for
 titles with 3 as well as 4 terms the fieldNorm is coming up as 0.5 (in the
 debugQuery section).


 --
 View this message in context:
 http://old.nabble.com/field-length-normalization-tp27862618p27867025.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solr Performance Issues

2010-03-11 Thread Mike Malloy

I don't mean to turn this into a sales pitch, but there is a tool for Java app
performance management that you may find helpful. It's called New Relic
(www.newrelic.com) and the tool can be installed in 2 minutes. It can give
you very deep visibility inside Solr and other Java apps. (Full disclosure: I
work at New Relic.)
Mike

Siddhant Goel wrote:
 
 Hi everyone,
 
 I have an index corresponding to ~2.5 million documents. The index size is
 43GB. The configuration of the machine which is running Solr is - Dual
 Processor Quad Core Xeon 5430 - 2.66GHz (Harpertown) - 2 x 12MB cache, 8GB
 RAM, and 250 GB HDD.
 
 I'm observing a strange trend in the queries that I send to Solr. The
 query
 times for queries that I send earlier is much lesser than the queries I
 send
 afterwards. For instance, if I write a script to query solr 5000 times
 (with
 5000 distinct queries, most of them containing not more than 3-5 words)
 with
 10 threads running in parallel, the average times for queries goes from
 ~50ms in the beginning to ~6000ms. Is this expected or is there something
 wrong with my configuration. Currently I've configured the
 queryResultCache
 and the documentCache to contain 2048 entries (hit ratios for both is
 close
 to 50%).
 
 Apart from this, a general question that I want to ask is that is such a
 hardware enough for this scenario? I'm aiming at achieving around 20
 queries
 per second with the hardware mentioned above.
 
 Thanks,
 
 Regards,
 
 -- 
 - Siddhant
 
 

-- 
View this message in context: 
http://old.nabble.com/Solr-Performance-Issues-tp27864278p27872139.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to edit / compile the SOLR source code

2010-03-11 Thread Erick Erickson
Leaving aside some historical reasons, the root of
the issue is that any search has to identify all the
terms in a field that satisfy it. Let's take a normal
non-leading wildcard case first.

Finding all the terms like 'some*' will have to
deal with many fewer terms than 's*'. Just dealing with
that many terms will decrease performance, regardless
of the underlying mechanisms used. Imagine you're
searching down an ordered list of all the terms for
a field, comparing each one against your pattern and
assembling the list of terms that match.

So, pure wildcard searches, i.e. just *, would have to
handle all the terms in the index for the field.

The situation with leading wildcards is worse than
trailing, since all the terms in the index have to be
examined. Even doing something as bad as
a* will examine only terms starting in a. But looking
for *a has to examine each and every term in the index
because australia and zebra both qualify, there aren't
any good shortcuts if you think of having an ordered
list of terms in a field.

So performance can degrade pretty dramatically when
you allow this kind of thing and the original writers
(my opinion here, I wasn't one of them) decided it was
much better to disallow it by default and require users
to dig around for the why rather than have them
crash and burn a lot by something that seems innocent
if you aren't familiar with the issues involved.

A better approach, and this isn't very obvious,
is to index your terms reversed, and do leading wildcard
searches on the *reversed* field as trailing wildcards.
E.g. 'some' gets indexed as 'emos' and the wildcard
search '*me' gets searched in the reversed field as
'em*'.

There may still be performance issues if you allow
single-letter wildcards, e.g. s* or *s, although a lot of
work has been done in this area in the last few years.
You'll have to measure in your situation. And beware
that a really common problem when deciding how many
real letters to allow is that it all works fine in your test
data, but when you load your real corpus and suddenly
SOLR/Lucene has to deal with 100,000 terms that
might match rather than the 1,000 in your test set, response
time changes for the worse.

So I'd look around for the reversed idea (See SOLR-1321
in the JIRA), and at least one of the schema examples
has it.
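
For reference, the SOLR-1321 approach boils down to an index-time filter on
the field type, roughly like this (the attribute values here are only
illustrative; check the example schema for the exact names and defaults):

    <fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- index each term both as-is and reversed, so *me can be run as em* -->
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>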

One hurdle for me was asking the question does it
really help the user to allow one or two leading
characters in a wildcard search?. Surprisingly often,
that's of no use to real users because so many
terms match that it's overwhelming. YMMV, but it's
a good question to ask if you find yourself in a
quagmire because you allow a* type of queries.

There are other strategies too, but that seems easiest.

Now, all that said, SOLR has done significant work
to make wildcards work well, these are just general
things to look out for when thinking about wildcards...

I really think hacking the parser will come back to bite
you as both as a maintenance and performance issue,
I wouldn't go there without a pretty exhaustive look at
other options.

HTH
Erick

On Thu, Mar 11, 2010 at 6:29 PM, JavaGuy84 bbar...@gmail.com wrote:


 Eric,

 Thanks a lot for your reply.

 I was able to successfully hack the query parser and enabled the leading
 wild card search.

 As of today I hacked the code for this reason only, I am not sure how to
 make the leading wild card search to work without hacking the code and this
 type of search is the preferred type of search in our organization.

 I had previously searched all over the web to find out 'why' that feature
 was disabled as default but couldn't find any solid answer stating the
 reason. In one of the posting in nabble it was mentioned that it might take
 a performance hit if we enable the leading wild card search, can you please
 let me know your comments on that?

 But I am very much interested in contributing some new stuff to SOLR group
 so I consider this as a starting point..


 Thanks,
 Barani

 Erick Erickson wrote:
 
  See Trey's comment, but before you go there.
 
  What about SOLR's wildcard searching capabilities aren't
  working for you now? There are a couple of tricks for making
  leading wildcard searches work quickly, but this is a solved
  problem. Although whether the existing solutions work in
  your situation may be an open question...
 
  Or do you have to hack into the parser for other reasons?
 
  Best
  Erick
 
  On Thu, Mar 11, 2010 at 12:07 PM, JavaGuy84 bbar...@gmail.com wrote:
 
 
  Hi,
 
  Sorry for asking this very simple question but I am very new to SOLR and
  I
  want to play with its source code.
 
  As a initial step I have a requirement to enable wildcard search (*text)
  in
  SOLR. I am trying to figure out a way to import the complete SOLR build
  to
  Eclipse and edit QueryParsing.java file but I am not able to import (I
  tried
  to import with ant project in Eclipse and selected the build.xml file
 and
  got an error stating 

Re: How to edit / compile the SOLR source code

2010-03-11 Thread JavaGuy84

Erick,

That was a wonderful explanation; I hope many folks in this forum will
benefit from the explanation you have given here. 

Actually I Googled and found the solution after you earlier mentioned
that I can do a leading wildcard without hacking the code. 

I found the patch that was already available to resolve this issue
(using ReversedWildcardFilterFactory) and I have started to implement
that idea.


Thanks a lot for your valuable time..

SOLR rocks

Thanks,
Barani



Erick Erickson wrote:
 
 Leaving aside some historical reasons, the root of
 the issue is that any search has to identify all the
 terms in a field that satisfy it. Let's take a normal
 non-leading wildcard case first.
 
 Finding all the terms like 'some*' will have to
 deal with many fewer terms than 's*'. Just dealing with
 that many terms will decrease performance, regardless
 of the underlying mechanisms used. Imagine you're
 searching down an ordered list of all the terms for
 a field, assembling a list, and then comparing that list with
 all the terms in that field with your list.
 
 So, pure wildcard serches, i.e. just *, would have to
 handle all the terms in the index for the field.
 
 The situation with leading wildcards is worse than
 trailing, since all the terms in the index have to be
 examined. Even doing something as bad as
 a* will examine only terms starting in a. But looking
 for *a has to examine each and every term in the index
 because australia and zebra both qualify, there aren't
 any good shortcuts if you think of having an ordered
 list of terms in a field.
 
 So performance can degrade pretty dramatically when
 you allow this kind of thing and the original writers
 (my opinion here, I wasn't one of them) decided it was
 much better to disallow it by default and require users
 to dig around for the why rather than have them
 crash and burn a lot by something that seems innocent
 if you aren't familiar with the issues involved.
 
 A better approach is, and this isn't very obvious,
 is to index your terms reversed, and do leading wildcard
 searches on the *reversed* field as trailing wildcards.
 E.g. 'some' gets indexed as 'emos' and the wildcard
 search '*me' gets searched in the reversed field as
 'em*'.
 
 There may still be performance issues if you allow
 single-letter wildcards, e.g. s* or *s, although a lot of
 work has been done in this area in the last few years.
 You'll have to measure in your situation. And beware
 that a really common problem when deciding how many
 real letters to allow is that it all works fine in your test
 data, but when you load your real corpus and suddenly
 SOLR/Lucene has to deal with 100,000 terms that
 might match rather than the 1,000 in your test set, response
 time changesfor the worse.
 
 So I'd look around for the reversed idea (See SOLR-1321
 in the JIRA), and at least one of the schema examples
 has it.
 
 One hurdle for me was asking the question does it
 really help the user to allow one or two leading
 characters in a wildcard search?. Surprisingly often,
 that's of no use to real users because so many
 terms match that it's overwhelming. YMMV, but it's
 a good question to ask if you find yourself in a
 quagmire because you allow a* type of queries.
 
 There are other strategies too, but that seems easiest
 
 Now, all that said, SOLR has done significant work
 to make wildcards work well, these are just general
 things to look out for when thinking about wildcards...
 
 I really think hacking the parser will come back to bite
 you as both as a maintenance and performance issue,
 I wouldn't go there without a pretty exhaustive look at
 other options.
 
 HTH
 Erick
 
 On Thu, Mar 11, 2010 at 6:29 PM, JavaGuy84 bbar...@gmail.com wrote:
 

 Eric,

 Thanks a lot for your reply.

 I was able to successfully hack the query parser and enabled the leading
 wild card search.

 As of today I hacked the code for this reason only, I am not sure how to
 make the leading wild card search to work without hacking the code and
 this
 type of search is the preferred type of search in our organization.

 I had previously searched all over the web to find out 'why' that feature
 was disabled as default but couldn't find any solid answer stating the
 reason. In one of the posting in nabble it was mentioned that it might
 take
 a performance hit if we enable the leading wild card search, can you
 please
 let me know your comments on that?

 But I am very much interested in contributing some new stuff to SOLR
 group
 so I consider this as a starting point..


 Thanks,
 Barani

 Erick Erickson wrote:
 
  See Trey's comment, but before you go there.
 
  What about SOLR's wildcard searching capabilities aren't
  working for you now? There are a couple of tricks for making
  leading wildcard searches work quickly, but this is a solved
  problem. Although whether the existing solutions work in
  your situation may be an open question...
 
  Or do you 

Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter

: Interesting.  I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.  

Well it doesn't have to be a strict ratio cutoff .. you could look at the 
average frequency of all character Ngrams in your index, and then 
consider any Ngram whose frequency is more than X stddevs below the average 
to be suspicious, and eliminate any word that contains Y or more 
suspicious Ngrams.

Or you could just start really simple and eliminate any word that contains 
an Ngram that doesn't appear in *any* other word in your corpus.
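
A back-of-the-envelope sketch of the really simple variant, just to make it
concrete (untested, and the trigram padding and threshold are arbitrary
choices):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SuspiciousHapax {

      // character trigrams of a word, padded so short words still produce grams
      static Set<String> grams(String w) {
        Set<String> g = new HashSet<String>();
        String s = "_" + w + "_";
        for (int i = 0; i + 3 <= s.length(); i++) {
          g.add(s.substring(i, i + 3));
        }
        return g;
      }

      /** vocab maps each term to its frequency in the document/corpus. */
      public static List<String> suspicious(Map<String, Integer> vocab, int maxRareGrams) {
        // how many distinct words contain each trigram
        Map<String, Integer> gramDf = new HashMap<String, Integer>();
        for (String w : vocab.keySet()) {
          for (String g : grams(w)) {
            Integer c = gramDf.get(g);
            gramDf.put(g, c == null ? 1 : c + 1);
          }
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : vocab.entrySet()) {
          if (e.getValue() > 1) continue;            // only look at hapax legomena
          int rare = 0;
          for (String g : grams(e.getKey())) {
            if (gramDf.get(g) <= 1) rare++;          // trigram seen in no other word
          }
          if (rare >= maxRareGrams) out.add(e.getKey());
        }
        return out;
      }
    }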

I don't deal with a lot of multi-lingual stuff, but my understanding is 
that this sort of thing gets a lot easier if you can partition your docs 
by language -- and even if you can't, doing some language detection on the 
(dirty) OCRed text to get a language guess (and then partitioning by language 
and attempting to find the suspicious words in each partition) might help.


-Hoss



Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir

 I don't deal with a lot of multi-lingual stuff, but my understanding is
 that this sort of thing gets a lot easier if you can partition your docs
 by language -- and even if you can't, doing some langauge detection on the
 (dirty) OCRed text to get a language guess (and then partition by language
 and attempt to find the suspicious words in each partition)


and if you are really OCR'ing Urdu text and trying to search it automatically,
then this is your last priority.

-- 
Robert Muir
rcm...@gmail.com


Re: embedded server / servlet container

2010-03-11 Thread Dennis Gearon
How would that work in a PHP environment? I've already come to my own 
conclusion that using the JSON output would be safer (definitely) and faster 
(probably) than using PHP output and eval(); 

So what to do when it gets to the PHP process is no problem. But it's setting 
up an embedded server on a shared host that I'm working on. I assume I'd use 
PHP to access the localhost port for SOLR once I get it all going.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 3/11/10, Chris Hostetter hossman_luc...@fucit.org wrote:

 From: Chris Hostetter hossman_luc...@fucit.org
 Subject: Re: embedded server / servlet container
 To: solr-user@lucene.apache.org
 Date: Thursday, March 11, 2010, 9:24 AM
 
 : I am trying to provide an embedded server to a web
 application deployed in a
 : servlet container (like tomcat).
 
 If you are trying to use Solr inside another webapp, my
 suggestion would 
 just be to incorporate the existing Solr servlets, jsps,
 dispatch filter, 
 and web.xml specifics from solr inot your app, and let them
 do their own 
 thing -- it's going to make your life much easier from an
 upgrade 
 standpoint.
 
 Better still: run solr.war as it's own webapp in the same
 servlet 
 container.
 
 
 -Hoss
 
 


Re: Architectural help

2010-03-11 Thread Dennis Gearon
What is DIH? I feel like I'm saying, Duh . . ., sorry.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 3/11/10, Constantijn Visinescu baeli...@gmail.com wrote:

 From: Constantijn Visinescu baeli...@gmail.com
 Subject: Re: Architectural help
 To: solr-user@lucene.apache.org
 Date: Thursday, March 11, 2010, 5:25 AM
 Assuming you create the view in such
 a way that it returns 1 row for each
 solrdocument you want indexed: yes
 
 On Wed, Mar 10, 2010 at 7:54 PM, blargy zman...@hotmail.com
 wrote:
 
 
  So I can just create a view  (or temporary table)
 and then just have a
  simple
  select * from (view or table) in my DIH config?
 
 
  Constantijn Visinescu wrote:
  
   Try making a database view that contains
 everything you want to index,
  and
   then just use the DIH.
  
   Worked when i tested it ;)
  
   On Wed, Mar 10, 2010 at 1:56 AM, blargy zman...@hotmail.com
 wrote:
  
  
   I was wondering if someone could be so kind
 to give me some
  architectural
   guidance.
  
   A little about our setup. We are RoR shop
 that is currently using Ferret
   (no
   laughs please) as our search technology. Our
 indexing process at the
   moment
   is quite poor as well as our search results.
 After some deliberation we
   have
   decided to switch to Solr to satisfy our
 search requirements.
  
   We have about 5M records ranging in size all
 coming from a DB source
   (only
   2
   tables). What will be the most efficient way
 of indexing all of these
   documents? I am looking at DIH but before I
 go down that road I wanted
  to
   get some guidance. Are there any pitfalls I
 should be aware of before I
   start? Anything I can do now that will help
 me down the road?
  
   I have also been exploring the Sunspot rails
 plugin
   (http://outoftime.github.com/sunspot/) which so far
 seems amazing.
  There
   is
   an easy way to reindex all of your models
 like Model.reindex but I doubt
   this is the most efficient. Has anyone had
 any experience using Sunspot
   with
   their rails environment and if so should I
 bother with the DIH?
  
   Please let me know of any
 suggestions/opinions you may have. Thanks.
  
  
   --
   View this message in context:
   http://old.nabble.com/Architectural-help-tp27844268p27844268.html
   Sent from the Solr - User mailing list
 archive at Nabble.com.
  
  
  
  
 
  --
  View this message in context:
  http://old.nabble.com/Architectural-help-tp27844268p27854256.html
  Sent from the Solr - User mailing list archive at
 Nabble.com.
 
 



How to get Facet results only on a range of search results documents

2010-03-11 Thread Shishir Jain
Hi,

I would like to return facet results only on a range of search results
(say 1-100), not on the whole set of search results. Any idea how I can do
it?

Here is the reason I want to do it:

My document set is quite huge: About 100 Million documents. When a query is
run, the returned results are on average about 1 or so. And I want to do
faceting on the defined window of 100 documents around the result set the
user is looking at, as the faceting is most relevant only around the result
document the user is looking at.

Thanks & Regards,
Shishir Jain


local solr geo_distance

2010-03-11 Thread wicketnewuser

Hi, I'm getting geo_distance as str even though I've defined the field as
tdouble. My search looks like
/solr/select?qt=geo&lat=xx.xx&long=yy.yy&q=*&radius=10
Is there any way I can get it as
double instead of str?
-- 
View this message in context: 
http://old.nabble.com/local-solr-geo_distance-tp27873810p27873810.html
Sent from the Solr - User mailing list archive at Nabble.com.



Best Practices for Runtime Index Updates

2010-03-11 Thread Kranti™ K K Parisa
Hi,

What are the best practices for runtime index updates? Meaning, we have an index
and users may add some data like tags, notes, etc. to each Solr document.
In this scenario, how quickly could we update the index, and how quickly could
we show the updates to the end user in the UI?

Best Regards,
Kranti K K Parisa


DIH field options

2010-03-11 Thread blargy

How can you simply add a static value, like field name=id value=123/?
How does one add a static multi-valued field, like field name=category_ids
values=123, 456/?

Is there any documentation on all the options for the field tag in
data-config.xml?

Thanks for the help
-- 
View this message in context: 
http://old.nabble.com/DIH-field-options-tp27873996p27873996.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH field options

2010-03-11 Thread Tommy Chheng

 The wiki page has most of the info you need:
http://wiki.apache.org/solr/DataImportHandler

To use multi-valued fields, your schema.xml must define the field with 
multiValued=true
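
For a static single value, the TemplateTransformer described on that page
should do it. A rough sketch (entity and column names are made up):

    <!-- hypothetical data-config.xml fragment; names are illustrative -->
    <entity name="item" transformer="TemplateTransformer"
            query="select id, name from items">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="source" template="db_import"/>  <!-- constant value on every row -->
    </entity>

For a static multi-valued field I don't think there is a single attribute for
it; a ScriptTransformer that puts a java.util.List into the row, or templating
a delimited string and splitting it with RegexTransformer's splitBy, are the
workarounds I'd try first (untested).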



On 3/11/10 10:58 PM, blargy wrote:

How can you simply add a static value like?field name=id value=123/
How does one add a static multi-value field?field name=category_ids
values=123, 456/

Is there any documentation on all the options for the field tag in
data-config.xml?

Thanks for the help


--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com