RE: java GC overhead limit exceeded
Hi, which version do you use? 1.4.1 is highly recommended, since previous versions contained some bugs related to memory usage that could lead to memory leaks. I had this GC overhead limit in my setup as well. The only workaround that helped was a daily restart of all instances. With 1.4.1 this issue seems to be fixed. -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Tuesday, 27 July 2010 01:18 To: solr-user@lucene.apache.org Subject: java GC overhead limit exceeded I am now occasionally getting a Java "GC overhead limit exceeded" error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a 'commit', after deleting all documents in my index, or in other cases. Has anyone run into this, and have suggestions as to how to set my Java options to eliminate it? I'm not sure this simply means that my heap size needs to be bigger; it seems to be something else. Any advice appreciated. Googling didn't get me much I trusted. Jonathan
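For reference, the overhead limit can be disabled (or the heap raised) via JVM options; a minimal illustration, with the heap size just an example value to tune for your setup:

    java -Xmx2048m -XX:-UseGCOverheadLimit -jar start.jar

Disabling the limit only hides the symptom, though: if the JVM spends nearly all its time in GC, the heap is usually too small for the working set (large caches, warming queries, etc.).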
Re: Design questions/Schema Help
Hi, IMHO you can do this with date range queries and (date) facets. The DateMathParser will allow you to normalize dates on min/hours/days. If you hit a limit there, then just add a field with an integer for either min/hour/day. This way you'll lose the month information - which is sometimes what you want. You probably want the document entity to be a query with fields: query, user (id? if you have that), sessionid, date. The most popular query within a date range is the query that was logged most times? Do a search on the date range: q=date:[start TO end] with a facet on the query field, which gives you the count - similar to the GROUP BY/COUNT aggregation functionality in an RDBMS. You can do multiple facets at the same time, but be careful what you are querying for - it will impact the facet count. You can use functions to change the base of each facet. http://wiki.apache.org/solr/SimpleFacetParameters Cheers, Chantal On Tue, 2010-07-27 at 01:43 +0200, Mark wrote: We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to an RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What are the n most popular queries and their counts within the last x (mins/hours/days/etc)? Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0? Same as above but with an extra where clause. - For session id x give me all their other queries. - What are all the session ids that searched for 'foos'? We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra and then using Hadoop to retrieve it? Thanks for your suggestions
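As an illustration of the facet approach (field names "query" and "date" and the range bounds are assumed here), the "top n queries in a time range" question maps to a single request like:

    q=date:[2010-07-01T00:00:00Z TO 2010-07-27T00:00:00Z]&facet=true&facet.field=query&facet.limit=10&facet.mincount=1&rows=0

The facet counts on the query field are the per-query totals within the range; adding fq=hits:0 would restrict the counts to zero-hit searches.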
Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?
I would use the string version, as Drupal will probably populate it with a URL-like thing - something that may not validate as type url. On 27 Jul 2010, at 04:00, Savannah Beckett wrote: I am trying to merge the schema.xml that is in the solr/nutch setup with the one from the Drupal apachesolr module. I encountered a field that is not mergeable. From the Drupal module: <field name="url" type="string" indexed="true" stored="true"/> From the solr/nutch setup: <field name="url" type="url" stored="true" indexed="true" required="true"/> I am not sure if there is any more stuff like this that is not mergeable. Is there an easy way to deal with schema.xml? Thanks. From: David Stuart david.stu...@progressivealliance.co.uk To: solr-user@lucene.apache.org Sent: Mon, July 26, 2010 1:46:58 PM Subject: Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml? Hi Savannah, I have just answered this question over on drupal.org: http://drupal.org/node/811062 Responses number 5 and 11 will help you. On the solrconfig.xml side of things you will only really need Drupal's version. Although still in alpha, my Nutch module will help you out with integration: http://drupal.org/project/nutch Regards, David Stuart On 26 Jul 2010, at 21:37, Savannah Beckett wrote: I am using the Drupal ApacheSolr module to integrate Solr with Drupal. I already integrated Solr with Nutch. I already moved Nutch's solrconfig.xml and schema.xml to Solr's example directory, and it works. I tried to append the Drupal ApacheSolr module's own solrconfig.xml and schema.xml into the same XML files, but I got the following error when I ran java -jar start.jar: Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log SEVERE: Exception during parsing file: solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed. at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249) at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124) at org.apache.solr.core.Config.<init>(Config.java:110) at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:130) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) Why? Does solrconfig.xml allow two config sections? Does schema.xml allow two schema sections? Thanks.
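A merged definition along David's suggestion might look like this (a sketch - whether required="true" is safe depends on whether Drupal always supplies a url value):

    <field name="url" type="string" indexed="true" stored="true" required="true"/>

Note that appending a second <config> or <schema> root element is exactly what produces the "markup following the root element" SAXParseException above: each file must have a single root element, so the contents have to be merged inside one root rather than concatenated.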
Any tips/guidelines to turning the Solr/luence performance in a master/slave/sharding environment
How can we reduce the index file size, decrease the sync time between nodes, and decrease the index create/update time? Thanks.
Russian stemmer
Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks.
Re: Russian stemmer
All of your examples stem to ковров:

assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
    new String[] { "ковров", "ковров", "ковров", "ковров" });

Are you sure you enabled this at *both* index and query time? 2010/7/27 Oleg Burlaca o...@burlaca.com Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks. -- Robert Muir rcm...@gmail.com
Spellchecking and frequency
Hi, I've recently been looking into spellchecking in Solr, and was struck by how limited the usefulness of the tool was. Like most corpora, ours contains lots of different spelling mistakes for the same word, so 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly, why is there no other frequency parameter that could enter into the score? I.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wondering what steps/algorithms they have used to overcome these limitations? Cheers, Dan
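A minimal sketch of the idea (the names and weighting below are invented for illustration, not a Solr API - an inverse edit distance damped by log frequency):

    // score a candidate correction: prefer close edits, break ties by corpus frequency
    static float suggestionScore(int editDistance, int docFreq) {
        float similarity = 1.0f / (1 + editDistance);      // 1.0 for exact match, decays with distance
        float popularity = (float) Math.log(1 + docFreq);  // damp raw frequency
        return similarity * popularity;
    }

The log damping keeps one extremely common word from outranking a much closer but rarer candidate.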
Re: Russian stemmer
On another look, your problem is ковров itself... it's mapped to ковр. A workaround might be to use the protected words functionality to keep ковров and any other problematic people/geo names as-is. Separately, in trunk there is an alternative Russian stemmer (RussianLightStemFilterFactory), which might give you fewer problems on average, but I noticed it has this same problem with the example you gave. On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir rcm...@gmail.com wrote: All of your examples stem to ковров: assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове", new String[] { "ковров", "ковров", "ковров", "ковров" }); Are you sure you enabled this at *both* index and query time? 2010/7/27 Oleg Burlaca o...@burlaca.com Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
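For reference, the protected words hook is the protected attribute on the stemmer factory; a sketch (protwords.txt is the conventional file name, and since the filter usually runs after lowercasing, entries should be in the lowercased form the stemmer actually sees):

    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="protwords.txt"/>

protwords.txt takes one term per line, e.g. ковров; a reindex is needed for the change to affect already-indexed documents.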
Re: Russian stemmer
Yes, I'm sure I've enabled SnowballPorterFilterFactory both at index and query time, because the search works OK, except for names and geo locations. I've noticed that searching by Коврова also shows documents that contain Коврову, Коврове. Search by Ковров, 7 results: http://www.sova-center.ru/search/?q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2 Search by Коврова, 26 results: http://www.sova-center.ru/search/?lg=1&q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2%D0%B0 Adding such words to a protected words file will be a tedious task, as there are 7 million Russian names :) Kind Regards, Oleg Burlaca On Tue, Jul 27, 2010 at 11:35 AM, Robert Muir rcm...@gmail.com wrote: On another look, your problem is ковров itself... it's mapped to ковр. A workaround might be to use the protected words functionality to keep ковров and any other problematic people/geo names as-is. Separately, in trunk there is an alternative Russian stemmer (RussianLightStemFilterFactory), which might give you fewer problems on average, but I noticed it has this same problem with the example you gave. On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir rcm...@gmail.com wrote: All of your examples stem to ковров: assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове", new String[] { "ковров", "ковров", "ковров", "ковров" }); Are you sure you enabled this at *both* index and query time? 2010/7/27 Oleg Burlaca o...@burlaca.com Hello, I'm using SnowballPorterFilterFactory with language="Russian". The stemming works OK except for people's names and geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the Russian language that can handle this? If not, what are the options? A simple solution may be to use wildcard queries in Standard mode instead of the DisMaxQueryHandler: Ковров* - but I'd like to avoid that. Thanks. -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Re: Russian stemmer
A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
Re: Russian stemmer
Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
Re: Russian stemmer
2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression. In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* In your case then some pattern like [А-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com
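A minimal sketch of such a filter, assuming a Lucene version with KeywordAttribute (trunk/3.1+, where the bundled stemmers skip tokens marked as keywords); the class name and pattern are invented for illustration:

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

    /** Marks tokens matching a pattern as keywords so downstream stemmers leave them as-is. */
    public final class PatternProtectFilter extends TokenFilter {
      private final Pattern pattern;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

      public PatternProtectFilter(TokenStream input, Pattern pattern) {
        super(input);
        this.pattern = pattern;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        // CharTermAttribute is a CharSequence, so it can be matched directly
        if (pattern.matcher(termAtt).matches()) {
          keywordAtt.setKeyword(true);
        }
        return true;
      }
    }

The filter would sit between the tokenizer and the stemmer; if it runs after lowercasing, the pattern has to match lowercase forms, e.g. Pattern.compile("[а-я]+ов").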
clustering component
Hi, I'm attempting to get the carrot based clustering component (in trunk) to work. I see that the clustering contrib has been disabled for the time being. Does anyone know if this will be re-enabled soon, or even better, know how I could get it working as it is? Thanks, Matt
Re: clustering component
Hi Matt, I'm attempting to get the carrot based clustering component (in trunk) to work. I see that the clustering contrib has been disabled for the time being. Does anyone know if this will be re-enabled soon, or even better, know how I could get it working as it is? I've recently created a patch to update the clustering algorithms in branch_3x: https://issues.apache.org/jira/browse/SOLR-1804 The patch should also work with trunk, but I haven't verified it yet. S.
Re: slave index is bigger than master index
We have three dedicated servers for Solr, two for slaves and one for the master, all with Linux/Debian packages installed. I understand that replication always copies over the index in the exact form it has in the master index directory (or it is supposed to do that at least), and if the master index was optimized after indexing, one doesn't need to run an optimize call again to optimize the slave's index. But in our case that's what fixed it, and I agree it is even more confusing now :s Another problem is, we are serving live services using the slave nodes, so I don't want to affect the live search while playing with the slave nodes' indices. We will be running the indexing on the master node today over the night. Let's see if it does it again.
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Mitch, thanks for that suggestion. I wasn't aware of that. I've already added a temporary field in my ScriptTransformer that does basically the same. However, with this approach indexing time went up from 20min to more than 5 hours. The new approach is to query the Solr index for that other database that I've already set up. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) As with the db entity before, for every document a request is sent to the Solr core even if it is useless because the input variable is empty. It seems that once an entity processor kicks in you cannot avoid the initial request to its data source? Thanks, Chantal On Mon, 2010-07-26 at 16:22 +0200, MitchK wrote: Hi Chantal, did you try to write a custom DIH function (http://wiki.apache.org/solr/DIHCustomFunctions)? If not, I think this will be a solution. Just check whether ${prog.vip} is an empty string or null. If so, you need to replace it with a value that can never match anything, so the vip field will always be empty for such queries. Maybe that helps? Hopefully, the variable resolver is able to resolve something like ${dih.functions.getReplacementIfNeeded(prog.vip)}. Kind regards, - Mitch Chantal Ackermann wrote: Hi, my use case is the following: In a sub-entity I request rows from a database for an input list of strings:

<entity name="prog" ...>
  <field name="vip" .../> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc"/>
  </entity>
</entity>

The root entity is prog and it has an optional multivalued field called vip. When the list of vip values is empty, the SQL for the sub-entity above throws an SQLException. (Working with Oracle, which does not allow an empty expression in the in-clause.) Two things: (A) best would be not to run the query whenever ${prog.vip} is null or empty. (B) From the documentation, it is not clear that onError is only checked in the transformer runs but not when the SQL for the entity throws an exception. (Trunk version JdbcDataSource lines 250pp.) IMHO, (A) is the better fix, and if so, (B) is the right decision. (If (A) is not easily fixable, making (B) work would be helpful.) Looking through the code, I've realized that the replacement of the variables is done in a very generic way. I've not yet seen an appropriate way to check on those variables in order to stop the processing of the entity if the variable is empty. Is there a way to do this? Or maybe there is a completely different way to get my use case working. Any help most appreciated! Thanks, Chantal
LucidWorks 1.4 compilation
Good Morning, afternoon or evening... If someone installed Solr using the LucidWorks.jar (1.4) installation, how can one make a small change and recompile? Is there a LucidWorks (Tomcat) build somewhere? Regards ericz
Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
Hi Jon, During the last days we faced the same problem. Using Solr 1.4.1 classic (Tika 0.4), from some PDF files we can't extract content, and from others Solr throws an exception during the indexing process. You must: Update the Tika libraries (in /contrib/extraction/lib) with tika-core 0.8-SNAPSHOT and tika-parsers 0.8. Update PDFBox and all related libraries. After that you have to patch Solr 1.4.1 following this patch: https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel This is the first way to solve the problem. Using Solr 1.4.1 (with Tika 0.8-SNAPSHOT and PDFBox updated) no exception is thrown during the indexing process, but no content is extracted. Using the latest Solr trunk (with Tika 0.8-SNAPSHOT and PDFBox updated) all sounds good, but we don't know how stable it is! I hope you now have a clear vision of this issue. Best Regards 2010/7/26 Sharp, Jonathan jsh...@coh.org Every so often I need to index new batches of scanned PDFs, and occasionally Adobe's OCR can't recognize the text in a couple of these documents. In these situations I would like to type in a small amount of text onto the document and have it be extracted by Solr CELL. Adobe Pro 9 has a number of different ways to add text directly to a PDF file: *Typewriter *Sticky Note *Callout boxes *Text boxes I tried indexing documents with each of these text additions with Solr 1.4.1 + Solr CELL but can't extract the text in any of these boxes. If someone has modified their Solr CELL installation to use more recent versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can comment on whether newer versions can pull the text out of any of these various text boxes, I'd appreciate that very much. -Jon -- -- Benedetti Alessandro Personal Page: http://tigerbolt.altervista.org Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
DIH $deleteDocByQuery
Hi, I have been using DIH to index documents from a database. I am now hoping to use DIH to delete documents from the index as well. I searched the wiki and found the special commands in DIH to do so: http://wiki.apache.org/solr/DataImportHandler#Special_Commands But there is no example of how to use them. I tried searching the web but couldn't find any samples. Any help regarding this would be most welcome. Thanks, Maddy.
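The special commands are picked up as pseudo-fields in the rows an entity returns, so one hedged sketch (the table and column names are invented, and the exact alias quoting depends on your database) is to alias a column of a delete-tracking table in its own entity:

    <entity name="removed_items"
            query="SELECT item_id AS '$deleteDocById' FROM item_deletes"/>

Each returned row then deletes the document with that unique id instead of adding one; $deleteDocByQuery works the same way but takes a Lucene query string as the value.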
Re: NullPointerException with CURL, but not in browser
Ouch! Absolutely correct - quoting the URL fixed it. Thanks for saving me a sleepless night! cheers - rene 2010/7/26 Chris Hostetter hossman_luc...@fucit.org : However, when I'm trying this very URL with curl within my (perl) script, I : receive a NullPointerException: : CURL-COMMAND: curl -sL : http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard it appears you aren't quoting the URL, so the first & character is causing the shell to think you are done with the command and want it backgrounded (although I'm not certain, since it depends on how you are having Perl execute curl). I would suggest that you avoid exec/system calls to curl from Perl, and use an LWP::UserAgent instead. -Hoss
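For the record, the shell-safe form just wraps the URL in quotes so the & characters are passed to curl instead of being interpreted by the shell:

    curl -sL "http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard"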
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, However, with this approach indexing time went up from 20min to more than 5 hours. This is 15x slower than the initial solution... wow. From MySQL I know that IN ()-clauses are the embodiment of endlessness - they perform very, very badly. New idea: create a method which returns the query string:

returnString(theVIP) {
  if (theVIP != null && theVIP != "") {
    return a query-string to find the vip
  } else {
    return "SELECT 1" // you need to modify this, so that it matches your field-definition
  }
}

The main idea is to perform a blazing fast query instead of a complex IN-clause query. Does this sound like a solution? The new approach is to query the solr index for that other database that I've already setup. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) Unfortunately I cannot follow you. You are querying a Solr index for a database? Kind regards, - Mitch
Re: LucidWorks 1.4 compilation
I did not realize the LucidWorks.jar comes with an option to install the sources :-) On Tue, Jul 27, 2010 at 10:59 AM, Eric Grobler impalah...@googlemail.com wrote: Good Morning, afternoon or evening... If someone installed Solr using the LucidWorks.jar (1.4) installation, how can one make a small change and recompile? Is there a LucidWorks (Tomcat) build somewhere? Regards ericz
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Mitch, New idea: create a method which returns the query string: returnString(theVIP) { if (theVIP != null && theVIP != "") { return a query-string to find the vip } else { return "SELECT 1" // you need to modify this, so that it matches your field-definition } } The main idea is to perform a blazing fast query instead of a complex IN-clause query. Does this sound like a solution? I was using "in" because it's a multivalued input that results in multivalued output (not necessarily, but it's most probable - it's either empty or multiple values). I don't understand how I can make your solution work with multivalued input/output? The new approach is to query the solr index for that other database that I've already setup. This is only a bit slower than the original query (20min). (I'm using URLDataSource to be 1.4.1 conform.) Unfortunately I cannot follow you. You are querying a Solr index for a database? Yes, because I've already put one up (second core) and used SolrJ to get what I want later on, but it would be better to compute the relation between the two indexes at index time instead of at query time. (If it had worked with the db entity, the second index wouldn't have been required anymore.) But now that it works well with the url entity, I'm fine with maintaining that second index. It's not that much effort. I've subclassed URLDataSource to add a check whether the list of input values is empty, and to only proceed when this is not the case. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. Thanks! Chantal
Re: slave index is bigger than master index
We have three dedicated servers for Solr, two for slaves and one for the master, all with Linux/Debian packages installed. I understand that replication always copies over the index in the exact form it has in the master index directory (or it is supposed to do that at least), and if the master index was optimized after indexing, one doesn't need to run an optimize call again to optimize the slave's index. But in our case that's what fixed it, and I agree it is even more confusing now :s That's why I said: try it on the slaves too ;-) In our case it also helped to shrink 2*index to 1*index. I think the data necessary for the replication won't be cleaned up before the next replication or before an optimize. For us it was crucial to shrink the size because of limited disc resources, and to make sure that the next replication does not increase the index to 3 times its initial size. @muneeb So I think optimization is not necessary - or do you have disc limitations too? @Hoss or others: does this explanation sound logical? Another problem is, we are serving live services using slave nodes, so I don't want to affect the live search while playing with the slave nodes' indices. What do you mean here? Optimizing is too CPU expensive? We will be running the indexing on the master node today over the night. Let's see if it does it again. Do you mean increase to double size?
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Chantal, instead of:

<entity name="prog" ...>
  <field name="vip" .../> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" name="vip_ssc"/>
  </entity>
</entity>

you do:

<entity name="prog" ...>
  <field name="vip" .../> <!-- multivalued, not required -->
  <entity name="ssc_entry" dataSource="ssc" onError="continue"
          query="${yourCustomFunctionToReturnAQueryString(prog.vip, ..., ...)}">
    <field column="SSC_VALUE" name="vip_ssc"/>
  </entity>
</entity>

The function:

yourCustomFunctionToReturnAQueryString(vip, querystring1, querystring2) {
  if (vip != null && !vip.equals("")) {
    StringBuilder sb = new StringBuilder(50);
    sb.append(querystring1); // "SELECT SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in ("
    sb.append(vip);          // VIP-value
    sb.append(querystring2); // just the closing ")"
    return sb.toString();
  } else {
    return "SELECT '' AS yourFieldName";
  }
}

I expect that this method is called for every vip-value, if there is one. Solr DIH uses the returned query string to query the database. So, if the vip-value is empty or null, you can use a different query that is blazing fast (i.e. SELECT '' AS yourFieldName - just an example to show the logic). This query should return a row with an empty string, so Solr fills the current field with an empty string. I don't know how to prevent Solr from calling your ssc_entry entity when vip is null or empty, but this would be a solution to handle empty vip-strings as efficiently as possible. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: can you show how to make a method throw an exception that is accepted by the onError attribute? I hope we do not talk past each other here. :-) Kind regards, - Mitch
question: solrCloud with multiple cores on each machine
Hi, I am using SolrCloud. Suppose I have a total of 4 machines dedicated to Solr. I want to have 2 machines as replication (slaves) and 2 as masters, but I want to work with 8 logical cores rather than 2, i.e. each master (and each slave) will have 4 cores on it. The reason is that I can optimize the cores one at a time, so the IO intensity at any given moment will be low and will not degrade the online performance. Is there a way to configure my solr.xml so that when I am doing a distributed search (distrib=true) it will know to query all 8 cores? Thanks Yatir
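Short of SolrCloud doing it automatically, plain distributed search lets you enumerate the cores explicitly via the shards parameter; a sketch with invented host and core names:

    shards=master1:8983/solr/core0,master1:8983/solr/core1,master1:8983/solr/core2,master1:8983/solr/core3,master2:8983/solr/core0,master2:8983/solr/core1,master2:8983/solr/core2,master2:8983/solr/core3

This can also be baked into the request handler defaults in solrconfig.xml so clients don't have to pass it on every request.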
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Hi Mitch, thanks for the code. Currently I've got a different solution running, but it's always good to have examples. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: can you show how to make a method throw an exception that is accepted by the onError attribute? The catch clause looks for Exception, so it's actually easy. :-D Anyway, I've found a cleaner way. It is better to subclass the XPathEntityProcessor and put it in a state that prevents it from calling initQuery, which triggers the dataSource.getData() call. I have overridden the initContext() method, setting a go/no-go flag that I use in the overridden nextRow() to find out whether to delegate to the superclass or not. This way I can also avoid the code that fills the tmp field with an empty value if there is no value to query on. Cheers, Chantal
RE: Spellcheck help
Thanks for the input, I'll check it out! Marc Subject: RE: Spellcheck help Date: Fri, 23 Jul 2010 13:12:04 -0500 From: james.d...@ingrambook.com To: solr-user@lucene.apache.org In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):

final static String PATTERN = "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";

and remove the |\\d+ to make it:

final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

My testing shows this solves your problem. The caution is to test it against all your use cases, because obviously someone thought we should ignore leading digits from keywords. Surely there's a reason why, although I can't think of it. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: dekay...@hotmail.com [mailto:dekay...@hotmail.com] Sent: Saturday, July 17, 2010 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck help Can anybody help me with this? :( -Original Message- From: Marc Ghorayeb Sent: Thursday, July 08, 2010 9:46 AM To: solr-user@lucene.apache.org Subject: Spellcheck help Hello, I've been trying to get rid of a bug when using the spellcheck, but so far with no success :( When searching for a word that starts with a number, for example 3dsmax, I get the results that I want, BUT the spellcheck says it is not correctly spelled AND the collation gives me 33dsmax. Further investigation shows that the spellcheck is actually only checking dsmax, which it considers does not exist, and gives me 3dsmax for better results. But since I have spellcheck.collate = true, the collation that I show is 33dsmax, with the first 3 being the one discarded by the spellchecker... Otherwise, the spellcheck works correctly for normal words... any ideas? :( My spellcheck field is fairly classic: whitespace tokenizer with lowercase filter... Any help would be greatly appreciated :) Thanks, Marc
Re: Russian stemmer
Thanks Robert for all your help. The idea of protecting [A-Z].* words is ideal for the English language, although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом. I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned it's more accurate). Once again thanks, Oleg Burlaca On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir rcm...@gmail.com wrote: 2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression. In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* In your case then some pattern like [А-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com
Re: Russian stemmer
right, but your problem is that this is the current output:

Ковров -> Ковр
Коврову -> Ковров
Ковровом -> Ковров
Коврове -> Ковров

so, if Ковров was simply left alone, all your forms would match... 2010/7/27 Oleg Burlaca o...@burlaca.com Thanks Robert for all your help. The idea of protecting [A-Z].* words is ideal for the English language, although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом. I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned it's more accurate). Once again thanks, Oleg Burlaca On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir rcm...@gmail.com wrote: 2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is OK; I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use a wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression. In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* In your case then some pattern like [А-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info. I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов. Немцова: 14 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
Highlighting parameters wiki
The wiki entry for hl.highlightMultiTerm: http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm doesn't appear to be correct. It says: If the SpanScorer is also being used, enables highlighting for range/wildcard/fuzzy/prefix queries. Default is false. But the code in DefaultSolrHighlighter (both on the 1.4 branch that I'm using and in the trunk) does:

Boolean highlightMultiTerm = request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, true);
if (highlightMultiTerm == null) {
  highlightMultiTerm = false;
}

which looks to me like it's going to default to true, since getBool will never return null, and if it gets a null value from the parameters internally, it will return true. Shall I file a Jira on this one? Perhaps it's easier just to fix the Wiki page? Steve -- Stephen Green http://thesearchguy.wordpress.com
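In other words, the null check is dead code: SolrParams.getBool(name, defaultValue) already substitutes the default when the parameter is absent. If the intent were the documented default of false, the call would presumably read like this (a sketch of the fix, not the committed code):

    Boolean highlightMultiTerm = request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, false);

As written, with true passed as the default, the effective default is true, and the wiki is what disagrees with the code.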
RE: Spellcheck help
If you could, let me know how your testing goes with this change. I too am interested in having the collate work as well as it can. It looks like the code would be better with this change, but then again I don't know what the original author was thinking when this was put in. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: Marc Ghorayeb [mailto:dekay...@hotmail.com] Sent: Tuesday, July 27, 2010 8:07 AM To: solr-user@lucene.apache.org Subject: RE: Spellcheck help Thanks for the input, I'll check it out! Marc Subject: RE: Spellcheck help Date: Fri, 23 Jul 2010 13:12:04 -0500 From: james.d...@ingrambook.com To: solr-user@lucene.apache.org In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):

final static String PATTERN = "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";

and remove the |\\d+ to make it:

final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

My testing shows this solves your problem. The caution is to test it against all your use cases, because obviously someone thought we should ignore leading digits from keywords. Surely there's a reason why, although I can't think of it. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: dekay...@hotmail.com [mailto:dekay...@hotmail.com] Sent: Saturday, July 17, 2010 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck help Can anybody help me with this? :( -Original Message- From: Marc Ghorayeb Sent: Thursday, July 08, 2010 9:46 AM To: solr-user@lucene.apache.org Subject: Spellcheck help Hello, I've been trying to get rid of a bug when using the spellcheck, but so far with no success :( When searching for a word that starts with a number, for example 3dsmax, I get the results that I want, BUT the spellcheck says it is not correctly spelled AND the collation gives me 33dsmax. Further investigation shows that the spellcheck is actually only checking dsmax, which it considers does not exist, and gives me 3dsmax for better results. But since I have spellcheck.collate = true, the collation that I show is 33dsmax, with the first 3 being the one discarded by the spellchecker... Otherwise, the spellcheck works correctly for normal words... any ideas? :( My spellcheck field is fairly classic: whitespace tokenizer with lowercase filter... Any help would be greatly appreciated :) Thanks, Marc
RE: Querying throws java.util.ArrayList.RangeCheck
Hi Yonik, I am using the Solr 1.4 release dated Feb 9, 2010. There is no custom code. I am using the regular out-of-the-box dismax request handler. The query is a simple one with 4 filter queries (fq's) and one sort query. During the index generation, I delete a set of rows based on a date filter, then add new rows to the index. Then another process queries the index, generates some stats and updates the index again. Not sure if during this process something is going wrong with the index. Thanks Kalyan -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, July 27, 2010 12:15 AM To: solr-user@lucene.apache.org Subject: Re: Querying throws java.util.ArrayList.RangeCheck Do you have any custom code, or is this stock Solr (and which version, and what is the request)? -Yonik http://www.lucidimagination.com On Tue, Jul 27, 2010 at 12:30 AM, Manepalli, Kalyan kalyan.manepa...@orbitz.com wrote: Hi, I am stuck at this weird problem during querying. While querying the Solr index I am getting the following error. Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at During debugging I found that the SolrIndexReader is trying to read a document which doesn't exist in the index. I tried optimizing the index and restarting the server, but still no luck. Any help in resolving this issue will be appreciated. Thanks Kalyan
Is it possible to get keyword/match's position?
According to SO: http://stackoverflow.com/questions/1557616/retrieving-per-keyword-field-match-position-in-lucene-solr-possible it is not possible, but that was a year ago - is it still true now? Thanks.
Re: java GC overhead limit exceeded
Look into -XX:-UseGCOverheadLimit On 7/26/10, Jonathan Rochkind rochk...@jhu.edu wrote: I am now occasionally getting a Java "GC overhead limit exceeded" error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a 'commit', after deleting all documents in my index, or in other cases. Has anyone run into this, and have suggestions as to how to set my Java options to eliminate it? I'm not sure this simply means that my heap size needs to be bigger; it seems to be something else. Any advice appreciated. Googling didn't get me much I trusted. Jonathan -- Sent from my mobile device
RE: Total number of terms in an index?
Hi Jason, Are you looking for the total number of unique terms or the total number of term occurrences? CheckIndex reports both, but does a bunch of other work, so it is probably not the fastest. If you are looking for the total number of term occurrences, you might look at contrib/org/apache/lucene/misc/HighFreqTerms.java. If you are just looking for the total number of unique terms, I wonder if there is some low-level API that would allow you to just access the in-memory representation of the tii file and then multiply the number of terms in it by your indexDivisor (default 128). I haven't dug into the code, so I don't actually know how the tii file gets loaded into a data structure in memory. If there is API access, it seems like this might be the quickest way to get the number of unique terms. (Of course you would have to do this for each segment.) Tom -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Monday, July 26, 2010 8:39 PM To: solr-user@lucene.apache.org Subject: Re: Total number of terms in an index? : Sorry, like the subject, I mean the total number of terms. it's not stored anywhere, so the only way to fetch it is to actually iterate all of the terms and count them (that's why LukeRequestHandler is so slow to compute this particular value) If I remember right, someone mentioned at one point that flex would let you store data about stuff like this in your index as part of the segment writing, but frankly I'm still not sure how that will help -- because unless your index is fully optimized, you still have to iterate the terms in each segment to 'de-dup' them. -Hoss
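For completeness, the brute-force count with the pre-flex API is short, just slow on big indexes; a sketch (IndexReader.terms() returns a merged, de-duplicated enumeration across segments):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    static long countUniqueTerms(IndexReader reader) throws IOException {
        TermEnum terms = reader.terms();
        long count = 0;
        try {
            while (terms.next()) {
                count++; // each next() lands on a distinct term
            }
        } finally {
            terms.close();
        }
        return count;
    }

This is essentially the iteration the LukeRequestHandler has to do, which is why computing this value there is slow.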
SpatialSearch: sorting by distance
Hi, I'm trying to sort by distance like this: sort=dist(2,lat,lon,55.755786,37.617633) asc In general results are sorted, but some documents are not in the right order. I'm using DistanceUtils.getDistanceMi(...) from Lucene spatial to calculate the real distance after reading documents from Solr. Solr version from trunk.

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<field name="lat" type="double" indexed="true" stored="true"/>
<field name="lon" type="double" indexed="true" stored="true"/>

Thanks. -- Pavel Minchenkov
does this indicate a commit happened for every add?
I'm adding lots of small docs with several threads to Solr, and the adds start fast but then slow down. I didn't do any explicit commits and autocommit is turned off, but the logs show lots of commit activity on this core, and restarting this Solr core logged the line below. Where did all these commits come from - the exact same number as my adds? I'm stumped... Jul 27, 2010 10:07:17 AM org.apache.solr.update.DirectUpdateHandler2 close INFO: closed DirectUpdateHandler2{commits=456389,autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=456393,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
Re: Spellchecking and frequency
Hi, I found the suggestions returned from the standard Solr spellcheck not to be that relevant. By contrast, aspell, given the same dictionary and misspelled words, gives much more accurate suggestions. I therefore wrote an implementation of SolrSpellChecker that wraps Jazzy, the Java aspell library. I also extended the SpellCheckComponent to take the matrix of suggested words and query the corpus to find the first combination of suggestions which returned a match. This works well for my use case, where term frequency is irrelevant to spelling or scoring. I'd like to publish the code in case someone finds it useful (although it's a bit crude at the moment and will need a decent tidy-up). Would it be appropriate to open up a Jira issue for this? Cheers, ~mark On 27 July 2010 09:33, dan sutton danbsut...@gmail.com wrote: Hi, I've recently been looking into spellchecking in Solr, and was struck by how limited the usefulness of the tool was. Like most corpora, ours contains lots of different spelling mistakes for the same word, so 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly, why is there no other frequency parameter that could enter into the score? I.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wondering what steps/algorithms they have used to overcome these limitations? Cheers, Dan
RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox
Alessandro & all, I was having the same issue with Tika crashing on certain PDFs. I also noticed the bug where no content was extracted after upgrading Tika. When I went to the SOLR issue you link to below, I applied all the patches, downloaded the Tika 0.8 jars, restarted Tomcat, posted a file via curl, and got the following error: SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader; at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859) at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579) at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555) at java.lang.Thread.run(Thread.java:619) This is really weird because I DID apply the SolrResourceLoader patch that adds the getClassLoader method. I even verified by opening up the JARs and looking at the class file in Eclipse... I can see the SolrResourceLoader.getClassLoader() method. Does anyone know why it can't find the method? After patching the source I did ant clean dist in the base directory of the Solr source tree and everything looked like it compiled (BUILD SUCCESSFUL). Then I copied all the jars from dist/ and all the library dependencies from contrib/extraction/lib/ into my SOLR_HOME. Restarting Tomcat, everything in the logs looked good. I'm stumped. It would be very nice to have a Solr implementation using the newest versions of PDFBox & Tika and actually have content being extracted... =) Best, Dave -Original Message- From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com] Sent: Tuesday, July 27, 2010 6:09 AM To: solr-user@lucene.apache.org Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox Hi Jon, During the last days we faced the same problem. Using Solr 1.4.1 classic (Tika 0.4), from some PDF files we can't extract content, and from others Solr throws an exception during the indexing process. You must: Update the Tika libraries (in /contrib/extraction/lib) with tika-core 0.8-SNAPSHOT and tika-parsers 0.8. Update PDFBox and all related libraries.
After that You have to patch Solr 1.4.1 following this patch : https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel This is the firts way to solve the problem. Using Solr 1.4.1 (with tika 0.8 snapshot and pdfbox updated) no exception is thrown during the Indexing process, but no content is extracted. Using last Solr trunk (with tika 0.8 snapshot and pdfbox updated) all sounds good but we don't know how stableit is! I hope you have now a clear vision of this issue, Best Regards 2010/7/26 Sharp, Jonathan jsh...@coh.org Every so often I need to index new batches of scanned PDFs and occasionally Adobe's OCR can't recognize the text in a couple of these documents. In these situations I would like to type in a small amount of text onto the document and have it be extracted by Solr CELL. Adobe Pro 9 has a number of different ways to add text directly to a PDF file: *Typewriter *Sticky Note *Callout boxes *Text boxes I tried indexing documents with each of these text additions with Solr 1.4.1 + Solr CELL but can't extract the text in any of these boxes. If someone has modified their Solr CELL installation to use more recent versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can can comment on whether newer versions can pull the text out of any of these various text boxes I'd appreciate that very much. -Jon - SECURITY/CONFIDENTIALITY WARNING: This
Re: Total number of terms in an index?
In trunk (flex) you can ask each segment for its unique term count. But to compute the unique term count across all segments is necessarily costly (requires merging them, to de-dup), as Hoss described. Mike On Tue, Jul 27, 2010 at 12:27 PM, Burton-West, Tom tburt...@umich.edu wrote: Hi Jason, Are you looking for the total number of unique terms or total number of term occurrences? Checkindex reports both, but does a bunch of other work so is probably not the fastest. If you are looking for total number of term occurrences, you might look at contrib/org/apache/lucene/misc/HighFreqTerms.java. If you are just looking for the total number of unique terms, I wonder if there is some low level API that would allow you to just access the in-memory representation of the tii file and then multiply the number of terms in it by your indexDivisor (default 128). I haven't dug in to the code so I don't actually know how the tii file gets loaded into a data structure in memory. If there is api access, it seems like this might be the quickest way to get the number of unique terms. (Of course you would have to do this for each segment). Tom -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Monday, July 26, 2010 8:39 PM To: solr-user@lucene.apache.org Subject: Re: Total number of terms in an index? : Sorry, like the subject, I mean the total number of terms. it's not stored anywhere, so the only way to fetch it is to actually iteate all of the terms and count them (that's why LukeRequestHandler is slow slow to compute this particular value) If i remember right, someone mentioned at one point that flex would let you store data about stuff like this in your index as part of the segment writing, but frankly i'm still not sure how that iwll help -- because you unless your index is fully optimized, you still have to iterate the terms in each segment to 'de-dup' them. -Hoss
RE: Spellchecking and frequency
Mark, I'd like to see your code if you open a JIRA for this. I recently opened SOLR-2010 with a patch that does something similar to the second part only of what you describe (find combinations that actually return a match). But I'm not sure if my approach is the best one so I would like to see yours to compare. James Dyer E-Commerce Systems Ingram Book Company (615) 213-4311 -Original Message- From: Mark Holland [mailto:mark.holl...@zoopla.co.uk] Sent: Tuesday, July 27, 2010 1:04 PM To: solr-user@lucene.apache.org Subject: Re: Spellchecking and frequency Hi, I found the suggestions returned from the standard solr spellcheck not to be that relevant. By contrast, aspell, given the same dictionary and mispelled words, gives much more accurate suggestions. I therefore wrote an implementation of SolrSpellChecker that wraps jazzy, the java aspell library. I also extended the SpellCheckComponent to take the matrix of suggested words and query the corpus to find the first combination of suggestions which returned a match. This works well for my use case, where term frequency is irrelevant to spelling or scoring. I'd like to publish the code in case someone finds it useful (although it's a bit crude at the moment and will need a decent tidy up). Would it be appropriate to open up a Jira issue for this? Cheers, ~mark On 27 July 2010 09:33, dan sutton danbsut...@gmail.com wrote: Hi, I've recently been looking into Spellchecking in solr, and was struck by how limited the usefulness of the tool was. Like most corpora , ours contains lots of different spelling mistakes for the same word, so the 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly why was there no other frequency parameter that could enter into the score? i.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wonding what steps/algorithms they have used to overcome these limitations? Cheers, Dan
Re: Timeout in distributed search
: Is there any way to have time out support in distributed search? I : searched https://issues.apache.org/jira/browse/SOLR-502 but it looks like it is : not in the main release of solr1.4 note that issue is marked Fix Version/s: 1.3 ... that means it was fixed in Solr 1.3, well before 1.4 came out. You should also take a look at the functionality added in SOLR-850, which explicitly deals with hard timeouts in distributed searching... https://issues.apache.org/jira/browse/SOLR-850 ...that was first included in Solr 1.4 -Hoss
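If it helps: the timeout those issues added is exposed as the timeAllowed request parameter (in milliseconds), and a search that hits the limit is flagged with partialResults in the response header. A made-up example (hosts and value are placeholders):

    q=ipod&timeAllowed=1000&shards=solr1:8983/solr,solr2:8983/solr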
Re: SolrCore has a large number of SolrIndexSearchers retained in infoRegistry
: : I was wondering if anyone has found any resolution to this email thread? As Grant asked in his reply when this thread was first started (December 2009)... It sounds like you are either using embedded mode or you have some custom code. Are you sure you are releasing your resources correctly? ...there was no response to his question for clarification. the problem, given the info we have to work with, definitely seems to be that the custom code utilizing the SolrCore directly is not releasing the resources that it is using in every case. if you are calling the execute method, that means you have a SolrQueryRequest object -- which means you somehow got an instance of a SolrIndexSearcher (every SolrQueryRequest has one associated with it) and you are somehow not releasing that SolrIndexSearcher (probably because you are not calling close() on your SolrQueryRequest) But it really all depends on how you got ahold of that SolrQueryRequest/SolrIndexSearcher pair in the first place ... every method in SolrCore that gives you access to a SolrIndexSearcher is documented very clearly on how to release it when you are done with it so the ref count can be decremented. -Hoss
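For anyone hitting this from embedded or custom code, the usual fix is to close the request in a finally block so the searcher's reference count is always decremented. A minimal sketch (LocalSolrQueryRequest is just one way to build a request; core, handler, params and rsp are assumed to already exist):

    import org.apache.solr.request.LocalSolrQueryRequest;
    import org.apache.solr.request.SolrQueryRequest;

    SolrQueryRequest req = new LocalSolrQueryRequest(core, params);
    try {
      core.execute(handler, req, rsp);
    } finally {
      req.close(); // releases the SolrIndexSearcher ref held by this request
    }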
Re: help finding illegal chars in XML doc
: Thanks for your reply. I could not find in the log files any mention of : that. By the way I only have _MM_DD.request.log files in my directory. : : Do I have to enable any specific log or level to catch those errors? if you are using that java -jar start.jar command for the example jetty instance then the log messages i'm referring to are written directly to your console. if you are running solr in some other servlet container, then it all depends on the servlet container... http://wiki.apache.org/solr/SolrLogging http://wiki.apache.org/solr/LoggingInDefaultJettySetup -Hoss
Difficulties with Highlighting
I'm a relative beginner at SOLR, indexing and searching Unicode Tibetan texts. I am trying to use the highlighter but it just returns empty elements, such as: lst name=highlighting lst name=kt-d-0103-text-v4p262a/ /lst What am I doing wrong? The query that generated that is: http://www.thlib.org:8080/thdl-solr/thdl-texts/select?indent=onversion=2.2q=%E0%BD%91%E0%BD%84%E0%BD%B4%E0%BD%A3%E0%BC%8B%E0%BD%98%E0%BD%81%E0%BD%93%E0%BC%8B+AND+type%3Atextstart=0rows=10fl=*%2Cscoreqt=standardwt=standardhl=truehl.fl=pg_bohl.snippets=50 The hit is in the multivalued field named pg_bo and in a doc with that id #. I've looked at the various highlighting parameters (not that I fully understand them) and tried fiddling with those but nothing helped. I did notice that if you change to hl.fl=*, then you get the type field highlighted: lst name=highlighting lst name=kt-d-0103-text-v4p262a arr name=type stremtext/em/str /arr /lst /lst But that's not much help. We are using a custom Tibetan tokenizer for the Unicode Tibetan text fields. Would this have something to do with it? Any suggestions would be appreciated! Thanks for your help, Than Grove -- Nathaniel Grove Research Associate Technical Director Tibetan Himalayan Library University of Virginia http://www.thlib.org
Re: SolrCore has a large number of SolrIndexSearchers retained in infoRegistry
On Jul 27, 2010, at 12:21pm, Chris Hostetter wrote: : : I was wondering if anyone has found any resolution to this email thread? As Grant asked in his reply when this thread was first started (December 2009)... It sounds like you are either using embedded mode or you have some custom code. Are you sure you are releasing your resources correctly? ...there was no response to his question for clarification. the problem, given the info we have to work with, definitely seems to be that the custom code utilizing the SolrCore directly is not releasing the resources that it is using in every case. if you are calling the execute method, that means you have a SolrQueryRequest object -- which means you somehow got an instance of a SolrIndexSearcher (every SolrQueryRequest has one associated with it) and you are somehow not releasing that SolrIndexSearcher (probably because you are not calling close() on your SolrQueryRequest) One thing that bit me previously with using APIs in this area of Solr is that if you call CoreContainer.getCore(), this increments the open count, so you have to balance each getCore() call with a close() call. The naming here could be better - I think it's common to have an expectation that calls to get something don't change any state. Maybe openCore()? -- Ken But it really all depends on how you got ahold of that SolrQueryRequest/SolrIndexSearcher pair in the first place ... every method in SolrCore that gives you access to a SolrIndexSearcher is documented very clearly on how to release it when you are done with it so the ref count can be decremented. -Hoss Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
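A tiny sketch of the balanced pattern Ken describes ("mycore" is a placeholder name):

    SolrCore core = coreContainer.getCore("mycore"); // increments the open count
    try {
      // ... use the core ...
    } finally {
      core.close(); // decrements it again
    }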
Re: Difficulties with Highlighting
Than - Looks like maybe your text_bo field type isn't analyzing how you'd like? Though that's just a hunch. I pasted the value of that field returned in the link you provided into your analysis.jsp page and it chunked tokens by whitespace. Though I could be experiencing a copy/paste/i18n issue. Also looks like you're on Solr 1.3 - so it's likely quite worth upgrading to 1.4.1 (don't know if that directly affects this highlighting issue, just a general recommendation). Erik On Jul 27, 2010, at 3:43 PM, Nathaniel Grove wrote: I'm a relative beginner at SOLR, indexing and searching Unicode Tibetan texts. I am trying to use the highlighter but it just returns empty elements, such as: lst name=highlighting lst name=kt-d-0103-text-v4p262a/ /lst What am I doing wrong? The query that generated that is: http://www.thlib.org:8080/thdl-solr/thdl-texts/select?indent=onversion=2.2q=%E0%BD%91%E0%BD%84%E0%BD%B4%E0%BD%A3%E0%BC%8B%E0%BD%98%E0%BD%81%E0%BD%93%E0%BC%8B+AND+type%3Atextstart=0rows=10fl=*%2Cscoreqt=standardwt=standardhl=truehl.fl=pg_bohl.snippets=50 The hit is in the multivalued field named pg_bo and in a doc with that id #. I've looked at the various highlighting parameters (not that I fully understand them) and tried fiddling with those but nothing helped. I did notice that if you change to hl.fl=*, then you get the type field highlighted: lst name=highlighting lst name=kt-d-0103-text-v4p262a arr name=type stremtext/em/str /arr /lst /lst But that's not much help. We are using a custom Tibetan tokenizer for the Unicode Tibetan text fields. Would this have something to do with it? Any suggestions would be appreciated! Thanks for your help, Than Grove -- Nathaniel Grove Research Associate Technical Director Tibetan Himalayan Library University of Virginia http://www.thlib.org
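One more thing worth ruling out (my suggestion, not something raised in the thread): the highlighter can only build snippets from fields that are stored, so the pg_bo definition needs stored=true -- something along these lines in schema.xml, where the termVectors/termPositions/termOffsets attributes are optional but speed up highlighting of large fields:

    <field name="pg_bo" type="text_bo" indexed="true" stored="true"
           multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>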
Re: Querying throws java.util.ArrayList.RangeCheck
I am getting a similar error with today's nightly build: HTTP Status 500 - Index: 54, Size: 24 java.lang.IndexOutOfBoundsException: Index: 54, Size: 24 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:264) at I'm adding and deleting a batch of documents. Currently, during indexing, there is a commit for each document. In some cases the document is deleted just before it is added, with a commit for the delete and a commit for the add. It appears that if I wait to commit until the end of all indexing, I avoid this error. Jason On Tue, Jul 27, 2010 at 10:25 AM, Manepalli, Kalyan kalyan.manepa...@orbitz.com wrote: Hi Yonik, I am using Solr 1.4 release dated Feb-9 2010. There is no custom code. I am using the regular out of box dismax requesthandler. The query is a simple one with 4 filter queries (fq's) and one sort query. During the index generation, I delete a set of rows based on a date filter, then add new rows to the index. Then another process queries the index and generates some stats and updates the index again. Not sure if during this process something is going wrong with the index. Thanks Kalyan -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, July 27, 2010 12:15 AM To: solr-user@lucene.apache.org Subject: Re: Querying throws java.util.ArrayList.RangeCheck Do you have any custom code, or is this stock solr (and which version, and what is the request)? -Yonik http://www.lucidimagination.com On Tue, Jul 27, 2010 at 12:30 AM, Manepalli, Kalyan kalyan.manepa...@orbitz.com wrote: Hi, I am stuck at this weird problem during querying. While querying the solr index I am getting the following error. Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at During debugging I found that the SolrIndexReader is trying to read a document which doesn't exist in the index. I tried optimizing the index and restarting the server but still no luck. Any help in resolving this issue will be appreciated. Thanks Kalyan
min/max, StatsComponent, performance
I thought I asked a variation of this before, but I don't see it on the list; apologies if this is a duplicate, but I have new questions. So I need to find the min and max value of a result set, which can be several million documents. One way to do this is the StatsComponent. One problem is that I'm having performance problems with StatsComponent across so many documents; adding the stats component on the field I'm interested in is adding 10s to my query response time. So one question is if there's any way to increase StatsComponent performance. Does it use any caches, or does it operate without caches? My Solr is running near the top of its heap size, although I'm not currently getting any OOM errors; perhaps not enough free memory is somehow hurting StatsComponent performance. Or any other ideas for increasing StatsComponent performance? But it also occurs to me that the StatsComponent is doing a lot more than I need. I just need min/max. And the cardinality of this field is a couple orders of magnitude lower than the total number of documents. But StatsComponent is also doing a bunch of other things, like sum, median, etc. Perhaps if there were a way to _just_ get min/max, it would be faster. Is there any way to get min/max values in a result set other than StatsComponent? Jonathan
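One workaround worth trying (my suggestion; it didn't come up in the thread): if min and max are all you need, two cheap queries with rows=1 sorted on the field avoid StatsComponent entirely. Assuming the field (called price here as a stand-in) is indexed and sortable:

    q=*:*&fq=<your filters>&sort=price asc&rows=1&fl=price    -> the one row holds the min
    q=*:*&fq=<your filters>&sort=price desc&rows=1&fl=price   -> the one row holds the max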
Indexing Problem: Where's my data?
Hi, (The first version of this was rejected for spam). I'm setting up a test instance of Solr, and keep running into the problem of having Solr not work the way I think it should work. Specifically, the data I want to go into the index isn't there after indexing. I'm extracting the data from MSSQL via DataImportHandler, JDBC 4.0. My data is set up so that for every product ID there is one category (hierarchical, but I'm not dealing with that ATM), a family, and a set of attributes (which includes name, etc). After indexing, I get Category, Family, and Product ID - but nothing from my attribute values (STRING_VALUE, below) - which is the most useful data. Is there something wrong with my schema? I thought it might be that the schema.xml file wasn't respecting the names I assigned via the DataImportHandler; when I changed to the column names in the schema.xml, I picked up Family and Category (previously, it was only product ID). I'm really banging my head against the wall at this point, so I'd appreciate any help. My next step will probably be to do a considerably more complicated denormalization (in terms of the SQL), which would make the Solr end simpler (but that has problems of its own). Config information below. Any help appreciated. Thanks, Michael Data Config: dataConfig dataSource driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://localhost\DEVELOPMENT/Databases/data:1433 / document name=products entity onError=continue name=product query=select Product_ID,Category_ID from TB_Product field column=PRODUCT_ID name=pid / field column=CATEGORY_ID name=cid / entity name=facets query=select * from TB_PROD_SPECS where PRODUCT_ID=${product.Product_ID} field column=STRING_VALUE / field column=NUMERIC_VALUE / entity name=attributes query=select ATTRIBUTE_NAME,ATTRIBUTE_TYPE from TB_ATTRIBUTE where ATTRIBUTE_ID=${facets.ATTRIBUTE_ID} field column=Attribute_Name name=Attribute Name / /entity /entity entity name=category query=select CATEGORY_NAME,PARENT_CATEGORY from TB_CATEGORY where CATEGORY_ID='${product.Category_ID}' field column=Category_Name name=Category / field column=Parent_Category name=Parent Category / /entity entity name=family_id query=select FAMILY_ID from TB_PROD_FAMILY where Product_ID = ${product.Product_ID} entity name=family query=select FAMILY_Name,PARENT_FAMILY_ID,ROOT_FAMILY,CATEGORY_ID from TB_Family where Family_ID = ${family_id.FAMILY_ID} field column=FAMILY_NAME name=Family / field column=ROOT_FAMILY name=Root Family / field column=PARENT_FAMILY name=Parent Family / field column=Category_id name=Category ID / /entity /entity /entity /document /dataConfig Schema: fields field name=Product_ID type=int indexed=true stored=true required=true / field name=Family_NAME type=textTight indexed=true stored=false multivalued=true/ field name=Category_Name type=textTight indexed=true stored=true multiValued=true omitNorms=true / field name=STRING_VALUE type=textTight indexed=true stored=false multivalued=true/ field name=ATTRIBUTE_NAME type=textTight indexed=true stored=false multivalued=true/ field name=text type=text indexed=true stored=false multiValued=true/ dynamicField name=*_i type=stringindexed=true stored=true multivalued=true/ /fields uniqueKeyProduct_ID/uniqueKey defaultSearchFieldtext/defaultSearchField solrQueryParser defaultOperator=OR/ copyField source=* dest=text/
RE: Querying throws java.util.ArrayList.RangeCheck
Yonik, One more update on this. I used the filter query that was throwing the error and used it to delete a subset of results. After that, the queries started working correctly, which indicates that the particular docId was present in the index somewhere, but Lucene was not able to find it. -Kalyan -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, July 27, 2010 4:46 PM To: solr-user@lucene.apache.org Subject: Re: Querying throws java.util.ArrayList.RangeCheck I haven't been able to reproduce anything... But if you guys are sure you're not running any custom code, then there definitely seems to be a bug somewhere. Can anyone reproduce this in something you can share? -Yonik http://www.lucidimagination.com
Re: Indexing Problem: Where's my data?
For STRING_VALUE, I assume there is a property in the 'select *' results called string_value? If so, I'm not sure why it wouldn't work. If not, then that's why: it doesn't have anything to put there. For ATTRIBUTE_NAME, is it possibly a case issue? You called it 'Attribute_Name' in your query, but ATTRIBUTE_NAME in your schema... just something to check I guess. Also, not sure why you are using name= in your fields, for example, field column=PARENT_FAMILY name=Parent Family / I thought 'column' was the source field name and 'name' was supposed to be the schema field name, and that if 'name' is absent it would fall back to the 'column' name. You don't have a schema field called Parent Family, so it looks like it's defaulting to the column name too, which is lucky for you I suppose. But you may want to either remove 'name=' or make it match the schema. (And I may be completely wrong on this, it's been a while since I got DIH going.) -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Problem-Where-s-my-data-tp1000660p1000843.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highlighting parameters wiki
(10/07/27 23:16), Stephen Green wrote: The wiki entry for hl.highlightMultiTerm: http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm doesn't appear to be correct. It says: If the SpanScorer is also being used, enables highlighting for range/wildcard/fuzzy/prefix queries. Default is false. But the code in DefaultSolrHighlighter (both on the 1.4 branch that I'm using and in the trunk) does:

    Boolean highlightMultiTerm =
        request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, true);
    if (highlightMultiTerm == null) {
      highlightMultiTerm = false;
    }

which looks to me like it's going to default to true, since getBool will never return null, and if it gets a null value from the parameters internally, it will return true. Shall I file a Jira on this one? Perhaps it's easier just to fix the Wiki page? Steve Hi Steve, Please just fix the wiki page. Thank you for reporting this! Koji -- http://www.rondhuit.com/en/
How to 'filter' facet results
Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I run this query: fq=keyword:man OR keyword:bear OR keyword:pig facet=on facet.field=keyword then I only want it to return the facet counts for man, bear, and pig. The resulting docs might have a number of different values for keyword, in addition to those specified in the filter, because keyword is a multiValued field. How can I tell it to only return the facet values for man, bear, and pig? On the client side I could programmatically remove the other facets that I don't care about, except that the resulting docs could return hundreds of different values. If I were faceting on a single value, I could say facet.prefix=man, and that would work, but mostly I need this to work for more than one filter value. Is there a way to set multiple facet.prefix values? Any ideas? -dKt
RE: How to 'filter' facet results
Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I run this query: fq=keyword:man OR keyword:bear OR keyword:pig facet=on facet.field=keyword then I only want it to return the facet counts for man, bear, and pig. The resulting docs might have a number of different values for keyword, in addition For the general case of filtering facet values, I've wanted to do that too in more complex situations, and there is no good way I've found. For your very specific use case though, yeah, you can do it with facet.query. Leave out the facet.field, but instead: facet.query=keyword:man facet.query=keyword:bear facet.query=keyword:pig You'll get three facet.query results in the response, one each for man, bear, pig. Solr behind the scenes will kind of do three separate 'sub-queries', one for each facet.query, but since the query itself should be cached, you shouldn't notice much difference. Especially if you have a warming query that facets on the keyword field (I'm never entirely sure when caches created by warming queries will be used by a facet.query, or if it depends on the facet method in use, but it can't hurt). Jonathan
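Putting that together, the whole request would look something like this (URL-encoding of spaces omitted for readability); each facet.query then comes back with its own count under facet_counts/facet_queries in the response:

    q=*:*&fq=keyword:man OR keyword:bear OR keyword:pig&facet=true
      &facet.query=keyword:man&facet.query=keyword:bear&facet.query=keyword:pig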
Re: Tika, Solr running under Tomcat 6 on Debian
I would start over from the Solr 1.4.1 binary distribution and follow the instructions on the wiki: http://wiki.apache.org/solr/ExtractingRequestHandler (Java classpath stuff is notoriously difficult, especially when dynamically configured and loaded. I often cannot tell if Java cannot load the class it prints, or if that class requires others.) On Sat, Jul 24, 2010 at 11:21 PM, Tim AtLee timat...@gmail.com wrote: Hello, I desperately hope someone can help me here... I'm a bit out of my league. I am trying to implement content extraction using Tika and Solr as part of a search package for a product I am using. I have been successful in getting Solr to work so far as indexing text, and returning search results, however I am hitting a wall when I try to use Tika for content extraction. I add the following configuration to solrconfig.xml: requestHandler name=/extract/tika class=org.apache.solr.handler.extraction.ExtractingRequestHandler lst name=defaults /lst !-- This path only extracts - never updates -- lst name=invariants bool name=extractOnlytrue/bool /lst /requestHandler During a test, I receive the following error: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' The full text of this error is listed below. So, as I indicated in the subject line, I am using Debian Linux Squeeze (testing). Tomcat is at version 6.0.26 and is installed by apt. Solr is also installed from apt, and is at version: 1.4.0.2010.04.24.07.20.22. Java -version looks like this: java version 1.6.0_20 Java(TM) SE Runtime Environment (build 1.6.0_20-b02) The JDK is also at the same version, and also from apt. I have built Tika from source (nightly build) using mvn2, and placed the compiled jars in /lib. /lib is located at /var/solr/site/lib, along with /var/solr/site/conf and /var/solr/site/data. Hopefully this is the right place to put the jars. I also tried building solr from source (also the nightly build), and was able to get solr sort of working (not Tika). I could run a single instance, but getting multiple instances running didn't seem to be in the cards. I didn't pursue this any further. If this is the route I should go down, if anyone can direct me on how to install a built Solr war and configure it so I can use multiple instances, I'll gladly try it out. I found a similar issue to mine at http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200911.mbox/%3cd2b0462d72664840b72118cb4437cbd403e2a...@ndhamrexm22.amer.pfizer.com%3e. From that email, I tried copying the built Solr jars into the Solr site's lib directory, then realized that the likelihood of that working was pretty slim - jars built from a nightly build trying to work with a .war from 1.4.0 was probably not going to work. As you might have guessed, it didn't. This is when I tried building Solr from source (thinking that if all the Solr stuff was at the same revision, it might work). I have not tried all of this under Jetty. It's my understanding that Jetty won't let me do multiple instances, and since this is a requirement for what I'm doing, I'm more or less constrained to Tomcat. I have also seen some other references to using OpenJDK instead of Sun JDK. This resulted in the same error (don't recall the site where I saw this referenced). Any help would be greatly appreciated.
I am new to Tomcat and Solr, so I may have some dumb follow-up questions that will be googled thoroughly first. Sorry in advance.. Tim -- org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373) at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:414) at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:450) at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:152) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:557) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422) at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3838) at
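Incidentally, Solr 1.4's solrconfig.xml can point at extra jar directories explicitly, which takes some of the guesswork out of where the Tika and extraction jars must live -- e.g., using the directory from Tim's setup:

    <lib dir="/var/solr/site/lib" />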
Re: Spellchecking and frequency
Yonik's Law of Patches reads: A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all. It'd be perfectly appropriate, IMO, for you to post an outline of what your enhancements do over on the SOLR dev list and get a reaction from the folks over there as to whether it should be a Jira or not... see solr-...@lucene.apache.org Best Erick On Tue, Jul 27, 2010 at 2:04 PM, Mark Holland mark.holl...@zoopla.co.uk wrote: Hi, I found the suggestions returned from the standard solr spellcheck not to be that relevant. By contrast, aspell, given the same dictionary and misspelled words, gives much more accurate suggestions. I therefore wrote an implementation of SolrSpellChecker that wraps jazzy, the java aspell library. I also extended the SpellCheckComponent to take the matrix of suggested words and query the corpus to find the first combination of suggestions which returned a match. This works well for my use case, where term frequency is irrelevant to spelling or scoring. I'd like to publish the code in case someone finds it useful (although it's a bit crude at the moment and will need a decent tidy up). Would it be appropriate to open up a Jira issue for this? Cheers, ~mark On 27 July 2010 09:33, dan sutton danbsut...@gmail.com wrote: Hi, I've recently been looking into Spellchecking in solr, and was struck by how limited the usefulness of the tool was. Like most corpora, ours contains lots of different spelling mistakes for the same word, so the 'spellcheck.onlyMorePopular' is not really that useful unless you click on it numerous times. I was thinking that since most of the time people spell words correctly why was there no other frequency parameter that could enter into the score? i.e. something like: spell_score ~ edit_dist * freq I'm sure others have come across this issue and was wondering what steps/algorithms they have used to overcome these limitations? Cheers, Dan
Re: Solr 3.1 and ExtractingRequestHandler resulting in blank content
There are two different datasets that Solr (Lucene really) saves from a document: raw storage and the indexed terms. I don't think the ExtractingRequestHandler ever automatically stored the raw data; in fact Lucene works in Strings internally, not raw byte arrays (this is changing). It should be indexed - that means if you search 'text' with a word from the document, it will find those documents and bring back the file name. Your app has to then use the file name. Solr/Lucene is not intended as a general-purpose content store, only an index. The ERH wiki page doesn't quite say this. It describes what the ERH does rather than what it does not do :) On Mon, Jul 26, 2010 at 12:00 PM, David Thibault dthiba...@esperion.com wrote: Hello all, I’m working on a project with Solr. I had 1.4.1 working OK using ExtractingRequestHandler except that it was crashing on some PDFs. I noticed that Tika bundled with 1.4.1 was 0.4, which was kind of old. I decided to try updating to 0.7 as per the directions here: http://wiki.apache.org/solr/ExtractingRequestHandler but it was giving me errors (I forget what they were specifically). Then I tried downloading Solr 3.1 from the source repository, which I noticed came with Tika 0.7. I figured this would be an easier route to get working. Now I’m testing with 3.1 and 0.7 and I’m noticing my documents are going into Solr OK, but they all have blank content (no document text stored in Solr). I did see that the default “text” field is not stored. Changing that to stored=true didn’t help. Changing to fmap.content=attr_content&uprefix=attr_content didn’t help either. I have attached all relevant info here. Please let me know if someone sees something I don’t (it’s entirely possible as I’m relatively new to Solr). Schema.xml: ?xml version=1.0 encoding=UTF-8 ?
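For reference, the usual way to exercise the handler and keep the extracted text is to map the content onto a stored field and post a file with curl -- roughly like this, where the id value and file name are placeholders and the target field (text here) must be stored=true in schema.xml for the content to come back in search results:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=text&commit=true" -F "myfile=@test.pdf"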
schema name=example version=1.3 types fieldType name=string class=solr.StrField sortMissingLast=true omitNorms=true/ fieldType name=boolean class=solr.BoolField sortMissingLast=true omitNorms=true/ fieldtype name=binary class=solr.BinaryField/ fieldType name=int class=solr.TrieIntField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=float class=solr.TrieFloatField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=long class=solr.TrieLongField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=double class=solr.TrieDoubleField precisionStep=0 omitNorms=true positionIncrementGap=0/ fieldType name=tint class=solr.TrieIntField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=tfloat class=solr.TrieFloatField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=tlong class=solr.TrieLongField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=tdouble class=solr.TrieDoubleField precisionStep=8 omitNorms=true positionIncrementGap=0/ fieldType name=date class=solr.TrieDateField omitNorms=true precisionStep=0 positionIncrementGap=0/ fieldType name=tdate class=solr.TrieDateField omitNorms=true precisionStep=6 positionIncrementGap=0/ fieldType name=pint class=solr.IntField omitNorms=true/ fieldType name=plong class=solr.LongField omitNorms=true/ fieldType name=pfloat class=solr.FloatField omitNorms=true/ fieldType name=pdouble class=solr.DoubleField omitNorms=true/ fieldType name=pdate class=solr.DateField sortMissingLast=true omitNorms=true/ fieldType name=sint class=solr.SortableIntField sortMissingLast=true omitNorms=true/ fieldType name=slong class=solr.SortableLongField sortMissingLast=true omitNorms=true/ fieldType name=sfloat class=solr.SortableFloatField sortMissingLast=true omitNorms=true/ fieldType name=sdouble class=solr.SortableDoubleField sortMissingLast=true omitNorms=true/ fieldType name=random class=solr.RandomSortField indexed=true / fieldType name=text_ws class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ /analyzer /fieldType fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter
Re: slave index is bigger than master index
Ah! You have junk files piling up in the slave index directory. When this happens, you may have to remove data/index entirely. I'm not sure if Solr replication will handle that, or if you have to copy the whole index to reset it. You said the slaves time out - maybe the files are so large that the master and slave need socket timeouts changed? In solrconfig.xml, these two lines control that. Maybe they need to be increased. str name=httpConnTimeout5000/str str name=httpReadTimeout1/str On Tue, Jul 27, 2010 at 3:59 AM, Peter Karich peat...@yahoo.de wrote: We have three dedicated servers for solr, two for slaves and one for master, all with linux/debian packages installed. I understand that replication always copies over the index in an exact form as in the master index directory (or it is supposed to do that at least), and if the master index was optimized after indexing, one doesn't need to run an optimize call again on master to optimize the slave's index. But in our case that's what fixed it and I agree it is even more confusing now :s That's why I said: try it on the slaves too ;-) In our case it helped too to shrink 2*index to 1*index. I think the data which is necessary for the replication won't be cleaned up before the next replication or before an optimize. For us it was crucial to shrink the size because of limited disc-resources and to make sure that the next replication does not increase the index to 3*times of the initial size. @muneeb so I think, optimization is not necessary or do you have disc limitations too? @Hoss or others: does this explanation sound logical? Another problem is, we are serving live services using slave nodes, so I don't want to affect the live search while playing with slave nodes' indices. What do you mean here? Optimizing is too CPU expensive? We will be running the indexing on master node today over the night. Let's see if it does it again. Do you mean increase to double size? -- Lance Norskog goks...@gmail.com
Re: Indexing Problem: Where's my data?
Solr respects case for field names. Database fields are supplied in lower-case, so it should be 'attribute_name' and 'string_value'. Also 'product_id', etc. It is easier if you carefully emulate every detail in the examples, for example lower-case names. On Tue, Jul 27, 2010 at 2:59 PM, kenf_nc ken.fos...@realestate.com wrote: For STRING_VALUE, I assume there is a property in the 'select *' results called string_value? If so, I'm not sure why it wouldn't work. If not, then that's why: it doesn't have anything to put there. For ATTRIBUTE_NAME, is it possibly a case issue? You called it 'Attribute_Name' in your query, but ATTRIBUTE_NAME in your schema... just something to check I guess. Also, not sure why you are using name= in your fields, for example, field column=PARENT_FAMILY name=Parent Family / I thought 'column' was the source field name and 'name' was supposed to be the schema field name, and that if 'name' is absent it would fall back to the 'column' name. You don't have a schema field called Parent Family, so it looks like it's defaulting to the column name too, which is lucky for you I suppose. But you may want to either remove 'name=' or make it match the schema. (And I may be completely wrong on this, it's been a while since I got DIH going.) -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Problem-Where-s-my-data-tp1000660p1000843.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
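Following that advice, the facets sub-entity from the original config would be rewritten along these lines (a sketch based on the config posted earlier; the lower-case column keys are exactly the assumption being tested):

    <entity name="facets" query="select * from TB_PROD_SPECS where PRODUCT_ID='${product.product_id}'">
      <field column="string_value" name="STRING_VALUE"/>
    </entity>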
Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)
Should this go into the trunk, or does it only solve problems unique to your use case? On Tue, Jul 27, 2010 at 5:49 AM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi Mitch, thanks for the code. Currently, I've got a different solution running but it's always good to have examples. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: Can you show how to make a method throwing an exception that is accepted by the onError-attribute? The catch clause looks for Exception so it's actually easy. :-D Anyway, I've found a cleaner way. It is better to subclass the XPathEntityProcessor and put it in a state that prevents it from calling initQuery, which triggers the dataSource.getData() call. I have overridden the initContext() method, setting a go/no go flag that I am using in the overridden nextRow() to find out whether to delegate to the superclass or not. This way I can also avoid the code that fills the tmp field with an empty value if there is no value to query on. Cheers, Chantal -- Lance Norskog goks...@gmail.com
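A rough, untested sketch of the subclass approach Chantal describes -- the hook names vary between DIH versions (she mentions initContext(); init(Context) is the public entry point in 1.4), and ${parent.someField} stands in for whatever variable the sub-entity depends on:

    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.XPathEntityProcessor;

    public class GuardedXPathEntityProcessor extends XPathEntityProcessor {
      private boolean hasInput;

      @Override
      public void init(Context context) {
        // resolve the variable this entity depends on; skip work if it's empty
        String key = context.replaceTokens("${parent.someField}");
        hasInput = key != null && key.trim().length() > 0;
        if (hasInput) {
          super.init(context); // only then let the superclass prepare the query
        }
      }

      @Override
      public Map<String, Object> nextRow() {
        // returning null means "no rows", so dataSource.getData() is never reached
        return hasInput ? super.nextRow() : null;
      }
    }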
Re: Russian stemmer
I have studied some Russian. I kind of got the picture from the texts that all the exceptions had already been 'found', and were listed in the book. I do know that languages are living, changing organisms, but Russian has got to be more regular than English I would think, even WITH all six cases and 3 genders. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Tue, 7/27/10, Robert Muir rcm...@gmail.com wrote: From: Robert Muir rcm...@gmail.com Subject: Re: Russian stemmer To: solr-user@lucene.apache.org Date: Tuesday, July 27, 2010, 7:12 AM right, but your problem is this is the current output: Ковров - Ковр Коврову - Ковров Ковровом - Ковров Коврове - Ковров so, if Ковров was simply left alone, all your forms would match... 2010/7/27 Oleg Burlaca o...@burlaca.com Thanks Robert for all your help, The idea of [A-Z].* stopwords is ideal for the English language, although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned it's more accurate). Once again thanks, Oleg Burlaca On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir rcm...@gmail.com wrote: 2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is ok, I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use wildcard query: Немцов* Well, here is one idea of a more general solution. The problem with protected words is you must have a complete list. One idea would be to add a filter that protects any words from stemming that match a regular expression: In English maybe someone wants to avoid any capitalized words to reduce trouble: [A-Z].* in your case then some pattern like [A-Я].*ов might prevent problems. Robert, thanks for the RussianLightStemFilterFactory info, I've found this page http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html that somehow describes it. Where can I read more about RussianLightStemFilterFactory ? Here is the link: http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf Regards, Oleg 2010/7/27 Oleg Burlaca o...@burlaca.com A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов Немцова: 14 articles http://www.sova-center.ru/search/?lg=1q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles http://www.sova-center.ru/search/?lg=1q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2 -- Robert Muir rcm...@gmail.com -- Robert Muir rcm...@gmail.com
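For what it's worth, Robert's protect-by-regex idea can be sketched as a TokenFilter against the Lucene 3.x analysis API (my sketch; as far as I know no stock filter did this at the time). Tokens matching the pattern are marked as keywords so that KeywordAttribute-aware stemmers leave them alone; it would sit in the analyzer chain just before the stemmer, built with e.g. Pattern.compile("[А-Я].*ов"):

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

    public final class PatternProtectFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
      private final Pattern pattern;

      public PatternProtectFilter(TokenStream in, Pattern pattern) {
        super(in);
        this.pattern = pattern;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        // CharTermAttribute is a CharSequence, so it can be matched directly
        if (pattern.matcher(termAtt).matches()) {
          keywordAtt.setKeyword(true); // stemmers that honor this skip the token
        }
        return true;
      }
    }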