RE: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Yuval Feinstein
Thanks, Ahmet.
Yes, my solrconfig.xml file is very similar to what you wrote.
When I use echoparams=all and defType=myqp, I get:

<lst name="params">
<str name="q">hi</str>
<str name="echoparams">all</str>
<str name="defType">myqp</str>
</lst>

However, when I do not use the defType (hoping it will be automatically
inserted from solrconfig), I get:

<lst name="params">
<str name="q">hi</str>
<str name="echoparams">all</str>
</lst>

Can you see what I am doing wrong?
Thanks,
Yuval


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Tuesday, June 08, 2010 3:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Making my QParserPlugin the default one, with cores

It appears that the defType parameter is not being set by the request 
handler. 

What do you get when you append echoParams=all to your search url?

So you have something like this entry in solrconfig.xml

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">myqp</str>
  </lst>
</requestHandler>



 





  


Re: question about the fieldCollapseCache

2010-06-09 Thread Martijn v Groningen
The fieldCollapseCache should not be used as it is now; it uses too
much memory. It stores all information relevant for a field collapse
search: document collapse counts, collapsed document ids /
fields, the collapsed docset and the uncollapsed docset (everything per unique
search). So the memory usage will grow for each unique query (and fast,
with all this information). So I think it's best to disable this cache
for now.
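
For reference, disabling it just means removing (or commenting out) the cache entry that the field-collapsing patch adds to solrconfig.xml; a minimal sketch, assuming the entry is named fieldCollapseCache as in this thread:

<!--
<fieldCollapseCache
    class="solr.FastLRUCache"
    size="5000"
    initialSize="512"
    autowarmCount="0"/>
-->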

Martijn

On 8 June 2010 19:05, Jean-Sebastien Vachon js.vac...@videotron.ca wrote:
 Hi All,

 I've been running some tests using 6 shards, each one containing about 1 
 million documents.
 Each shard is running in its own virtual machine with 7 GB of ram (5GB 
 allocated to the JVM).
 After about 1100 unique queries the shards start to struggle and run out of 
 memory. I've reduced all
 other caches without significant impact.

 When I remove completely the fieldCollapseCache, the server can keep up for 
 hours
 and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)

 The size of the fieldCollapseCache was set to 5000 items. How can 5000 items 
 eat 3 GB of ram?

 Can someone tell me what is put in this cache? Has anyone experienced this 
 kind of problem?

 I am running Solr 1.4.1 with patch 236. All requests are collapsing on a 
 single field (pint) and
 collapse.maxdocs set to 200 000.

 Thanks for any hints...




Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Here's my config for the updateProcessor. It now uses another signature method,
but I've used TextProfileSignature as well and it works, sort of.


  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">sig</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">content</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>


Of course, you must define the updateProcessor in your requestHandler; it's
commented out in mine at the moment.


  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <!--
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  -->
  </requestHandler>


Also, I see you define minTokenLen = 3. Where does that come from? I haven't
seen anything on the wiki specifying such a parameter.


On Tuesday 08 June 2010 19:45:35 Neeb wrote:
 Hey Andrew,
 
 Just wondering if you ever managed to run TextProfileSignature based
 deduplication. I would appreciate it if you could send me the code fragment
 for it from  solrconfig.
 
 I have currently something like this, but not sure if I am doing it right:
 
 <updateRequestProcessorChain name="dedupe">
   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">signature</str>
     <bool name="overwriteDupes">true</bool>
     <str name="fields">title,author,abstract</str>
     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
     <str name="minTokenLen">3</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
 
 --
 
 Thanks in advance,
 -Ali
 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Well, it got me too! KMail didn't properly order this thread. Can't seem to 
find Hatcher's reply anywhere. ??!!?


On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:
 Andrew Clegg wrote:
  Re. your config, I don't see a minTokenLength in the wiki page for
  deduplication, is this a recent addition that's not documented yet?
 
 Sorry about this -- stupid question -- I should have read back through the
 thread and refreshed my memory.
 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: [Blacklight-development] facet data cleanup

2010-06-09 Thread Erik Hatcher


On Jun 8, 2010, at 1:57 PM, Naomi Dushay wrote:

Missing Facet Values:
---

to find how many documents are missing values:
	facet.missing=true&facet.mincount=<really big>

	http://your.solr.baseurl/select?rows=0&facet.field=ffldname&facet.mincount=1000&facet.missing=true

to find the documents with missing values:
	http://your.solr.baseurl/select?qt=standard&q=+uniquekey:[* TO *] -ffldname:[* TO *]


You could shorten that query to just q=-field_name:[* TO *]

Solr's lucene query parser supports top-level negative clauses.

And I'm assuming every doc has a unique key, so you could use *:*  
instead of uniquekey:[* TO *] - but I doubt one is really better than  
the other.
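
For example (hypothetical base URL, mirroring the ones above):

	http://your.solr.baseurl/select?rows=0&q=-ffldname:[* TO *]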


Erik



Re: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Erik Hatcher
Yuval - my only hunch is that you're hitting a different request  
handler than the one where you configured the default defType.  Send us the  
URL you're hitting Solr with, and the full request handler mapping.   
And are you sure the core you're hitting (since you  
mention multicore) is the one you think it is?  Look at Solr's admin to see  
where the solr home directory is and ensure you're looking at the  
right solrconfig.xml.


Erik

On Jun 9, 2010, at 12:52 AM, Yuval Feinstein wrote:


Thanks, Ahmet.
Yes, my solrconfig.xml file is very similar to what you wrote.
When I use echoparams=all and defType=myqp, I get:

<lst name="params">
<str name="q">hi</str>
<str name="echoparams">all</str>
<str name="defType">myqp</str>
</lst>

However, when I do not use the defType (hoping it will be  
automatically

inserted from solrconfig), I get:

<lst name="params">
<str name="q">hi</str>
<str name="echoparams">all</str>
</lst>

Can you see what I am doing wrong?
Thanks,
Yuval


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Tuesday, June 08, 2010 3:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Making my QParserPlugin the default one, with cores

It appears that the defType parameter is not being set by the  
request handler.


What do you get when you append echoParams=all to your search url?

So you have something like this entry in solrconfig.xml

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">myqp</str>
  </lst>
</requestHandler>














Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Andrew Clegg


Markus Jelsma wrote:
 
 Well, it got me too! KMail didn't properly order this thread. Can't seem
 to 
 find Hatcher's reply anywhere. ??!!?
 

Whole thread here:

http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881797.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index search optimization for fulltext remote streaming

2010-06-09 Thread Danyal Mark

We have the following Solr configuration:

java -Xms512M -Xmx1024M -Dsolr.solr.home=<solr home directory> -jar
start.jar

in SolrConfig.xml

 <indexDefaults>
   <useCompoundFile>false</useCompoundFile>
   <mergeFactor>4</mergeFactor>
   <maxBufferedDocs>20</maxBufferedDocs>
   <ramBufferSizeMB>1024</ramBufferSizeMB>
   <maxFieldLength>1</maxFieldLength>
   <writeLockTimeout>1000</writeLockTimeout>
   <commitLockTimeout>1</commitLockTimeout>
   <lockType>native</lockType>
 </indexDefaults>


<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <ramBufferSizeMB>1024</ramBufferSizeMB>
  <mergeFactor>4</mergeFactor>
  <!-- Deprecated -->
  <!--<maxBufferedDocs>10</maxBufferedDocs>-->
  <!--<maxMergeDocs>2147483647</maxMergeDocs>-->
  <unlockOnStartup>false</unlockOnStartup>
  <reopenReaders>true</reopenReaders>
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="maxCommitsToKeep">1</str>
    <str name="maxOptimizedCommitsToKeep">0</str>
  </deletionPolicy>
  <infoStream file="INFOSTREAM.txt">false</infoStream>
</mainIndex>


Also, we have set autoCommit=false. Our PC spec:

Core2-Duo
2GB RAM
Solr Server running in localhost
Index Directory is also in local FileSystem
Input Fulltext files using remoteStreaming from another PC


Here, when we indexed 10 fulltext documents, the total time taken was
40 minutes. We want to reduce this time. We have been studying the
UpdateRequestProcessorChain section:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

How can we use this UpdateRequestProcessorChain in /update/extract/ to run
indexing in multiple chains (i.e. multiple threads)? Can you suggest whether we
can optimize the process by changing any of these configurations?

with regards,
Danyal Mark 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-search-optimization-for-fulltext-remote-streaming-tp828274p881809.html
Sent from the Solr - User mailing list archive at Nabble.com.


how to get multicore to work?

2010-06-09 Thread xdzgor

Hi - I can't seem to get multicores to work. I have a Solr installation
which does not have a solr.xml file - I assume this means it is not
multicore.

If I create a solr.xml, as described on
http://wiki.apache.org/solr/CoreAdmin, my solr installation fails - for
example I get 404 errors when trying to search, and solr/admin does not
work.

Is there more than simply making solr.xml to get multicores to work?

Thanks,
Peter
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-work-tp881826p881826.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Neeb

Thanks guys.
I will try this with some test documents, fingers crossed.
And by the way, I got the minTokenLen parameter from one of the thread
replies (from Erik).

Cheerz,
Ali


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881840.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to get multicore to work?

2010-06-09 Thread Chris Rode
If you take a look in the examples directory there is a directory called
multicore. This is an example of the solr home of a multicore setup.

Otherwise take a look at the logged output of Solr itself. It should tell
you what is wrong with the setup.

On 9 June 2010 11:08, xdzgor p...@alphasolutions.dk wrote:


 Hi - I can't seem to get multicores to work. I have a solr installtion
 which does not have a solr.xml file - I assume this means it is not
 multicore.

 If I create a solr.xml, as described on
 http://wiki.apache.org/solr/CoreAdmin, my solr installation fails - for
 example I get 404 errors when trying to search, and solr/admin does not
 work.

 Is there more than simply making solr.xml to get multicores to work?

 Thanks,
 Peter
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-work-tp881826p881826.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Ahmet Arslan

 Thanks, Ahmet.
 Yes, my solrconfig.xml file is very similar to what you
 wrote.
 When I use echoparams=all and defType=myqp, I get:
 
 <lst name="params">
 <str name="q">hi</str>
 <str name="echoparams">all</str>
 <str name="defType">myqp</str>
 </lst>
 
 However, when I do not use the defType (hoping it will be
 automatically 
 inserted from solrconfig), I get:
 
 <lst name="params">
 <str name="q">hi</str>
 <str name="echoparams">all</str>
 </lst>
 

In echoParams=all the P should be capitalized. Just use echoParams=all and don't 
include defType explicitly. echoParams=all will display the default parameters 
that you specify in solrconfig.xml. You can debug this way.

http://wiki.apache.org/solr/CoreQueryParameters#echoParams

If you don't see <str name="defType">myqp</str> listed under <lst 
name="params"> then it is not written in solrconfig.xml.

Maybe you forgot to restart the core after editing solrconfig.xml?
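
For example, a request like the following (hypothetical host and core, just to illustrate the check) should echo the handler's defaults, including defType, under params:

http://localhost:8983/solr/core0/select?q=hi&echoParams=all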






Copyfield multi valued to single value

2010-06-09 Thread Marc Ghorayeb

Hello,
Is there a way to copy a multivalued field to a single-valued field by taking, for 
example, the first entry of the multivalued field?
I am actually trying to sort my index by title, and my index contains Tika 
extracted titles which come in as multivalued, hence why my title field is 
multivalued. However, when I do a sort on the title field, it crashes because 
it cannot compare two arrays, I guess, which is logical. So my thought was 
to copy only one value from the array to another field.
Maybe there is another way to do that? Can anyone help me?
Thanks in advance!
Marc  
_
Vous voulez regarder la TV directement depuis votre PC ? C'est très simple avec 
Windows 7
http://clk.atdmt.com/FRM/go/229960614/direct/01/

requesthandler, variable ...

2010-06-09 Thread stockii

Hello.

I want to call the TermsComponent with this request:
http://host/solr/app/select/?q=har

and get the same result as when I use this request:
http://host/solr/app/terms/?q=har&terms.prefix=har
<lst name="terms">
  <lst name="suggest">
    <int name="hardcore">9</int>
    <int name="hardcore evo">9</int>
    <int name="hardcore evo 2010">9</int>
...


This is my solrconfig.xml requestHandler:

<searchComponent name="termsComponent"
    class="org.apache.solr.handler.component.TermsComponent"/>

<requestHandler name="standard"
    class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="qt">terms</str>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>

<!-- qt=terms -->
<requestHandler name="terms"
    class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <str name="terms.fl">suggest</str>
    <str name="terms.sort">index</str>
    <str name="terms.prefix"><str name="q"/></str>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
  </arr>
</requestHandler>


Is this possible: <str name="terms.prefix"><str name="q"/></str>?

Or how can I put the q value in the place of terms.prefix?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/requesthandler-variable-tp881906p881906.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Yuval Feinstein
Thanks again Ahmet and Erik.
Turns out that Solr was calling the correct query parser all along.
The real problem was a combination of the query cache and my hacking the query 
to enable BM25 scoring.
When I used a standard BooleanQuery, this behaved as documented.
Now I have to understand how to tweak my Lucene query data structure so that 
the query caching works like it does for the standard Lucene queries.
Cheers,
Yuval

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, June 09, 2010 1:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Making my QParserPlugin the default one, with cores


 Thanks, Ahmet.
 Yes, my solrconfig.xml file is very similar to what you
 wrote.
 When I use echoparams=all and defType=myqp, I get:
 
 <lst name="params">
 <str name="q">hi</str>
 <str name="echoparams">all</str>
 <str name="defType">myqp</str>
 </lst>
 
 However, when I do not use the defType (hoping it will be
 automatically 
 inserted from solrconfig), I get:
 
 <lst name="params">
 <str name="q">hi</str>
 <str name="echoparams">all</str>
 </lst>
 

In echoParams=all the P should be capitalized. Just use echoParams=all and don't 
include defType explicitly. echoParams=all will display the default parameters 
that you specify in solrconfig.xml. You can debug this way.

http://wiki.apache.org/solr/CoreQueryParameters#echoParams

If you don't see <str name="defType">myqp</str> listed under <lst 
name="params"> then it is not written in solrconfig.xml.

Maybe you forgot to restart the core after editing solrconfig.xml?



  


how to test solr's performance?

2010-06-09 Thread Li Li
Are there any built-in tools for performance testing? Thanks.


AW: how to get multicore to work?

2010-06-09 Thread Markus.Rietzler
- solr.xml has to reside in the solr home dir. You can set this up with the
  java option
  -Dsolr.solr.home=
- admin is per core, so solr/CORENAME/admin will work

It is quite simple to set up.
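
For reference, a minimal solr.xml sketch along the lines of the example multicore setup that ships with Solr (core names and instanceDirs are illustrative):

<solr persistent="false">
  <cores adminPath="/admin/cores">
    <!-- each core gets its own instanceDir containing its conf/ directory -->
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>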

 -----Original Message-----
 From: xdzgor [mailto:p...@alphasolutions.dk] 
 Sent: Wednesday, 9 June 2010 12:08
 To: solr-user@lucene.apache.org
 Subject: how to get multicore to work?
 
 
 Hi - I can't seem to get multicores to work. I have a solr 
 installtion
 which does not have a solr.xml file - I assume this means it is not
 multicore.
 
 If I create a solr.xml, as described on
 http://wiki.apache.org/solr/CoreAdmin, my solr installation 
 fails - for
 example I get 404 errors when trying to search, and 
 solr/admin does not
 work.
 
 Is there more than simply making solr.xml to get multicores to work?
 
 Thanks,
 Peter
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-work-tp881826p881826.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


Re: question about the fieldCollapseCache

2010-06-09 Thread Jean-Sebastien Vachon
ok great.

I believe this should be mentioned in the wiki.

Later

On 2010-06-09, at 4:06 AM, Martijn v Groningen wrote:

 The fieldCollapseCache should not be used as it is now, it uses too
 much memory. It stores any information relevant for a field collapse
 search. Like document collapse counts, collapsed document ids /
 fields, collapsed docset and uncollapsed docset (everything per unique
 search). So the memory usage will grow for each unique query (and fast
 with all this information). So its best I think to disable this cache
 for now.
 
 Martijn
 
 On 8 June 2010 19:05, Jean-Sebastien Vachon js.vac...@videotron.ca wrote:
 Hi All,
 
 I've been running some tests using 6 shards each one containing about 1 
 millions documents.
 Each shard is running in its own virtual machine with 7 GB of ram (5GB 
 allocated to the JVM).
 After about 1100 unique queries the shards start to struggle and run out of 
 memory. I've reduced all
 other caches without significant impact.
 
 When I remove completely the fieldCollapseCache, the server can keep up for 
 hours
 and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)
 
 The size of the fieldCollapseCache was set to 5000 items. How can 5000 items 
 eat 3 GB of ram?
 
 Can someone tell me what is put in this cache? Has anyone experienced this 
 kind of problem?
 
 I am running Solr 1.4.1 with patch 236. All requests are collapsing on a 
 single field (pint) and
 collapse.maxdocs set to 200 000.
 
 Thanks for any hints...
 
 



Re: question about the fieldCollapseCache

2010-06-09 Thread Martijn v Groningen
I agree. I'll add this information to the wiki.

On 9 June 2010 14:32, Jean-Sebastien Vachon js.vac...@videotron.ca wrote:
 ok great.

 I believe this should be mentioned in the wiki.

 Later

 On 2010-06-09, at 4:06 AM, Martijn v Groningen wrote:

 The fieldCollapseCache should not be used as it is now, it uses too
 much memory. It stores any information relevant for a field collapse
 search. Like document collapse counts, collapsed document ids /
 fields, collapsed docset and uncollapsed docset (everything per unique
 search). So the memory usage will grow for each unique query (and fast
 with all this information). So its best I think to disable this cache
 for now.

 Martijn

 On 8 June 2010 19:05, Jean-Sebastien Vachon js.vac...@videotron.ca wrote:
 Hi All,

 I've been running some tests using 6 shards each one containing about 1 
 millions documents.
 Each shard is running in its own virtual machine with 7 GB of ram (5GB 
 allocated to the JVM).
 After about 1100 unique queries the shards start to struggle and run out of 
 memory. I've reduced all
 other caches without significant impact.

 When I remove completely the fieldCollapseCache, the server can keep up for 
 hours
 and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)

 The size of the fieldCollapseCache was set to 5000 items. How can 5000 
 items eat 3 GB of ram?

 Can someone tell me what is put in this cache? Has anyone experienced this 
 kind of problem?

 I am running Solr 1.4.1 with patch 236. All requests are collapsing on a 
 single field (pint) and
 collapse.maxdocs set to 200 000.

 Thanks for any hints...







-- 
Met vriendelijke groet,

Martijn van Groningen


Solr spellcheck config

2010-06-09 Thread Bogdan Gusiev
Hi everyone,

I am trying to build the spellcheck index with *IndexBasedSpellChecker*

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
</lst>

And I want to specify the dynamic field *_text as the field option:

<dynamicField name="*_text" stored="false" type="text"
    multiValued="true" indexed="true"/>

How can it be done?

Thanks, Bogdan

-- 
Bogdan Gusiev.
agre...@gmail.com


Issue with response header in SOLR running on Linux instance

2010-06-09 Thread bbarani

Hi,

I have been using SOLR for some time now and had no issues while I was using
it on Windows. Yesterday I moved the SOLR code to Linux servers and started
to index the data. Indexing completed successfully on the Linux servers, but
when I queried the index, the response header returned (by the SOLR instance
running on the Linux server) is different from the response header returned by
the SOLR instance that is running on a Windows machine.

Response header returned by SOLR instance running in windows machine

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">2219</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">credit</str>
    <str name="version">2.2</str>
    <str name="rows">10</str>
  </lst>
</lst>


Response header returned by SOLR instance running in Linux machine

<response>
<responseHeader>
  <status>0</status>
  <QTime>26</QTime>
  <lst name="params">
    <str name="q">credit</str>
  </lst>
</responseHeader>

Any idea why this happens?

Thanks,
Barani

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Issue-with-response-header-in-SOLR-running-on-Linux-instance-tp882181p882181.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Core Unload

2010-06-09 Thread abhatna...@vantage.com

Refering

http://lucene.472066.n3.nabble.com/unloading-a-solr-core-doesn-t-free-any-memory-td501246.html#a501246


Do we have any solution to free up memory after Solr Core Unload?


Ankit
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Core-Unload-tp882187p882187.html
Sent from the Solr - User mailing list archive at Nabble.com.


custom scorer in Solr

2010-06-09 Thread Fornoville, Tom
Hi all,

 

We are currently working on a proof-of-concept for a client using Solr
and have been able to configure all the features they want except the
scoring.

 

Problem is that they want scores that make results fall in buckets:

*   Bucket 1: exact match on category (score = 4)
*   Bucket 2: exact match on name (score = 3)
*   Bucket 3: partial match on category (score = 2)
*   Bucket 4: partial match on name (score = 1)

 

First thing we did was develop a custom similarity class that would
return the correct score depending on the field and an exact or partial
match.

 

The only problem now is that when a document matches on both the
category and name the scores are added together.

Example: searching for restaurant returns documents in the category
restaurant that also have the word restaurant in their name and thus get
a score of 5 (4+1) but they should only get 4.

 

I assume for this to work we would need to develop a custom Scorer class
but we have no clue on how to incorporate this in Solr.

Maybe there is even a simpler solution that we don't know about.

 

All suggestions welcome!

 

Thanks,

Tom



Re: Issue with response header in SOLR running on Linux instance

2010-06-09 Thread Markus Jelsma
Hi,


Check your requestHandler. It may preset some values that you don't see. Your 
echoParams setting may be explicit instead of all [1]. Alternatively, you 
could add the echoParams parameter to your query if it isn't set as an 
invariant in your requestHandler.

[1]: http://wiki.apache.org/solr/CoreQueryParameters
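
A minimal sketch of the kind of handler entry to check (handler name and values are illustrative); if echoParams is preset to explicit like this, the handler's defaults are not echoed back unless you override it on the request:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- change "explicit" to "all" (or pass echoParams=all on the request)
         to see the handler's default parameters echoed in the response -->
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>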

Cheers,
 
On Wednesday 09 June 2010 15:25:09 bbarani wrote:
 Hi,
 
 I have been using SOLR for sometime now and had no issues till I was using
 it in windows. Yesterday I moved the SOLR code to Linux servers and started
 to index the data. Indexing completed successfully in the linux severs but
 when I queried the index, the response header returned (by the SOLR
  instance running in Linux server) is different from the response header
  returned in SOLR instance that is running on windows instance.
 
 Response header returned by SOLR instance running in windows machine
 
 <lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">2219</int>
   <lst name="params">
     <str name="indent">on</str>
     <str name="start">0</str>
     <str name="q">credit</str>
     <str name="version">2.2</str>
     <str name="rows">10</str>
   </lst>
 </lst>
 
 
 Response header returned by SOLR instance running in Linux machine
 
 <response>
 <responseHeader>
   <status>0</status>
   <QTime>26</QTime>
   <lst name="params">
     <str name="q">credit</str>
   </lst>
 </responseHeader>
 
 Any idea why this happens?
 
 Thanks,
 Barani
 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Anyone using Solr spatial from trunk?

2010-06-09 Thread Rob Ganly
... but decided not to use it anyway?

that's pretty much correct.  the huge commercial scale of the project
dictates that we need as much system stability as possible from the outset;
thus the tools we are use must be established, community-tested and trusted
versions.  we also noticed that some of the regular non-geospatial queries
seemed to run slightly slower than on 1.4, with only a fraction of the total
amount of records we'd be searching in production (but that wasn't the main
reason for our decision).

i would perhaps use it for a much smaller [private] project where speed,
scaling and reliability weren't such critical issues.

future proofing was also a consideration:

"With all the changes currently occurring with Solr, I would go so far as
to say that users should continue to use Solr 1.4. However, if you need
access to one of the many new features introduced in Solr 1.5+ or Lucene
3.x, then give Solr 3.1 a shot, and report back your experiences." (from
http://blog.jteam.nl/2010/04/14/state-of-solr/)

On 8 June 2010 21:09, Darren Govoni dar...@ontrenet.com wrote:

 So let me understand what you said. You went through the trouble to
 implement a geospatial
 solution using Solr 1.5, it worked really well. You saw no signs of
 instability, but decided not to use it anyway?

 Did you put it through a routine of tests and witness some stability
 problem? Or just guessing it had them?

 I'm just curious the reasoning behind your comment.

 On Tue, 2010-06-08 at 09:05 +0100, Rob Ganly wrote:

  i used the 1.5 build a few weeks ago, implemented the geospatial
  functionality and it worked really well.
 
  however due to the unknown quantity in terms of stability (and the
 uncertain
  future of 1.5) etc. we decided not to use it in production.
 
  rob ganly
 
  On 8 June 2010 03:50, Darren Govoni dar...@ontrenet.com wrote:
 
   I've been experimenting with it, but haven't quite gotten it to work as
   yet.
  
   On Mon, 2010-06-07 at 17:47 -0700, Jay Hill wrote:
  
I was wondering about the production readiness of the new-in-trunk
   spatial
functionality. Is anyone using this in a production environment?
   
-Jay
  
  
  





Re: AW: XSLT for JSON

2010-06-09 Thread stockii

help me please =(
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/XSLT-for-JSON-tp845386p882319.html
Sent from the Solr - User mailing list archive at Nabble.com.


How Solr Manages Connected Database Updates

2010-06-09 Thread Sumit Arora
Hey All,

I am new to the Solr area; I just started exploring it and have done the basic
stuff. Now I am stuck on the logic:

How does Solr manage connected database updates?

Scenario:

-- I wrote an indexing program which runs on Tomcat; when run, it reads
data from a connected MySQL database and then performs indexing.

Use case: the database is not fixed. It is the database of a web application
where users keep inserting data, so the database has frequent updates,
almost every minute.

How should Solr automatically grab those changes and update the index?

Do I need to write a cron-job kind of thing? Or use the Data Import Handler?
(There could be several ways?)

Is there anyone who can provide comments or share their experience if they
have gone through a similar situation?

Thanks,
-Sumit


Diagnosing solr timeout

2010-06-09 Thread Paul
Hi all,

In my app, it seems like solr has become slower over time. The index
has grown a bit, and there are probably a few more people using the
site, but the changes are not drastic.

I notice that when a solr search is made, CPU and RAM usage
spike precipitously.

I notice in the solr log, a bunch of entries in the same second that end in:

status=0 QTime=212
status=0 QTime=96
status=0 QTime=44
status=0 QTime=276
status=0 QTime=8552
status=0 QTime=16
status=0 QTime=20
status=0 QTime=56

and then:

status=0 QTime=315919
status=0 QTime=325071

My questions: How do I figure out what to fix? Do I need to start java
with more memory? How do I tell what is the correct amount of memory
to use? Is there something particularly inefficient about something
else in my configuration, or the way I'm formulating the solr request,
and how would I narrow down what it could be? I can't tell, but it
seems like it happens after solr has been running unattended for a
little while. Should I have a cron job that restarts solr every day?
Could the solr process be starved by something else on the server
(although -- the only other thing that is particularly running is
apache/passenger/rails app)?

In other words, I'm at a total loss about how to fix this.

Thanks!

P.S. In case this helps, here's the exact log entry for the first item
that failed:

Jun 9, 2010 1:02:52 PM org.apache.solr.core.SolrCore execute
INFO: [resources] webapp=/solr path=/select
params={hl.fragsize=600facet.missing=truefacet=falsefacet.mincount=1ids=http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.44,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.06.xml;chunk.id%3Ddiv.ww.shelleyworks.v6.67,http://pm.nlx.com/xtf/view?docId%3Dtennyson_c/tennyson_c.02.xml;chunk.id%3Ddiv.tennyson.v2.1115,http://pm.nlx.com/xtf/view?docId%3Dmarx/marx.39.xml;chunk.id%3Ddiv.marx.engels.39.325,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.80,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.116,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.115,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.75,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.76,http://pm.nlx.com/xtf/view?docId%3Demerson/emerson.05.xml;chunk.id%3Dralph.waldo.v5.d083,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.31,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.88,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.03.xml;chunk.id%3Ddiv.eliot.romola.48facet.limit=-1hl.fl=texthl.maxAnalyzedChars=512000wt=javabinhl=truerows=30version=1fl=uri,archive,date_label,genre,source,image,thumbnail,title,alternative,url,role_ART,role_AUT,role_EDT,role_PBL,role_TRL,role_EGR,role_ETR,role_CRE,freeculture,is_ocr,federation,has_full_text,source_xml,uristart=0q=(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES)+OR+(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES+-genre:Citation)^5facet.field=genrefacet.field=archivefacet.field=freeculturefacet.field=has_full_textfacet.field=federationisShard=truefq=year:1882}
status=0 QTime=315919


Dataimport in debug mode store a last index date

2010-06-09 Thread Marc Emery
Hi,

When using the Data Import Handler and clicking on 'Debug now', it stores the
current date as 'last_index_time' in the dataimport.properties file.
Is that the right behaviour, given that debug doesn't do a commit?

Thanks
marc


Re: Diagnosing solr timeout

2010-06-09 Thread Jean-Sebastien Vachon
Have you looked at the garbage collector statistics? I've experienced this kind 
of issue in the past,
and I was getting huge spikes when the GC was doing its job.

On 2010-06-09, at 10:52 AM, Paul wrote:

 Hi all,
 
 In my app, it seems like solr has become slower over time. The index
 has grown a bit, and there are probably a few more people using the
 site, but the changes are not drastic.
 
 I notice that when a solr search is made, the amount of cpu and ram
 spike precipitously.
 
 I notice in the solr log, a bunch of entries in the same second that end in:
 
 status=0 QTime=212
 status=0 QTime=96
 status=0 QTime=44
 status=0 QTime=276
 status=0 QTime=8552
 status=0 QTime=16
 status=0 QTime=20
 status=0 QTime=56
 
 and then:
 
 status=0 QTime=315919
 status=0 QTime=325071
 
 My questions: How do I figure out what to fix? Do I need to start java
 with more memory? How do I tell what is the correct amount of memory
 to use? Is there something particularly inefficient about something
 else in my configuration, or the way I'm formulating the solr request,
 and how would I narrow down what it could be? I can't tell, but it
 seems like it happens after solr has been running unattended for a
 little while. Should I have a cron job that restarts solr every day?
 Could the solr process be starved by something else on the server
 (although -- the only other thing that is particularly running is
 apache/passenger/rails app)?
 
 In other words, I'm at a total loss about how to fix this.
 
 Thanks!
 
 P.S. In case this helps, here's the exact log entry for the first item
 that failed:
 
 Jun 9, 2010 1:02:52 PM org.apache.solr.core.SolrCore execute
 INFO: [resources] webapp=/solr path=/select
 params={hl.fragsize=600facet.missing=truefacet=falsefacet.mincount=1ids=http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.44,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.06.xml;chunk.id%3Ddiv.ww.shelleyworks.v6.67,http://pm.nlx.com/xtf/view?docId%3Dtennyson_c/tennyson_c.02.xml;chunk.id%3Ddiv.tennyson.v2.1115,http://pm.nlx.com/xtf/view?docId%3Dmarx/marx.39.xml;chunk.id%3Ddiv.marx.engels.39.325,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.80,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.116,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.115,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.75,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.76,http://pm.nlx.com/xtf/view?docId%3Demerson/emerson.05.xml;chunk.id%3Dralph.waldo.v5.d083,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.31,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.88,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.03.xml;chunk.id%3Ddiv.eliot.romola.48facet.limit=-1hl.fl=texthl.maxAnalyzedChars=512000wt=javabinhl=truerows=30version=1fl=uri,archive,date_label,genre,source,image,thumbnail,title,alternative,url,role_ART,role_AUT,role_EDT,role_PBL,role_TRL,role_EGR,role_ETR,role_CRE,freeculture,is_ocr,federation,has_full_text,source_xml,uristart=0q=(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES)+OR+(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES+-genre:Citation)^5facet.field=genrefacet.field=archivefacet.field=freeculturefacet.field=has_full_textfacet.field=federationisShard=truefq=year:1882}
 status=0 QTime=315919



Re: Diagnosing solr timeout

2010-06-09 Thread Paul
Have you looked at the garbage collector statistics? I've experienced this 
kind of issues in the past
and I was getting huge spikes when the GC was doing its job.

I haven't, and I'm not sure what a good way to monitor this is. The
problem occurs maybe once a week on a server. Should I run jstat the
whole time and redirect the output to a log file? Is there another way
to get that info?

Also, I was suspecting GC myself. So, if it is the problem, what do I
do about it? It seems like increasing RAM might make the problem worse
because it would wait longer to GC, then it would have more to do.


TrieRange for storage of dates

2010-06-09 Thread Jason Rutherglen
What is the best practice? Perhaps we can amend the article at
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
to include the recommendation (ie, dates are commonly unique).
I'm assuming using a long is the best choice.
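
For reference, a trie-based date field along the lines of the example schema shipped with Solr 1.4 (the precisionStep shown is the example's value; adjust to taste):

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
           precisionStep="6" positionIncrementGap="0"/>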


Re: Tomcat startup script

2010-06-09 Thread Sixten Otto
On Tue, Jun 8, 2010 at 4:18 PM,  cbenn...@job.com wrote:
 The following should work on centos/redhat, don't forget to edit the paths,
 user, and java options for your environment. You can use chkconfig to add it
 to your startup.

Thanks, Colin.

Sixten


Some questions about ability of solr.

2010-06-09 Thread Vitaliy Avdeev
I am keeping some data in JSON format in an HBase table.
I would like to index this data with Solr.
Are there any examples of indexing an HBase table?

Every node in HBase has an attribute that stores the date when it was written to
the table.
Is there any option to search not only by text but also to search the data
by the period of time in which it was written into HBase?


Re: general debugging techniques?

2010-06-09 Thread Jim Blomo
On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 : That is still really small for 5MB documents. I think the default solr
 : document cache is 512 items, so you would need at least 3 GB of memory
 : if you didn't change that and the cache filled up.

 that assumes that the extracted text tika extracts from each document is
 the same size as the original raw files *and* that he's configured that
 content field to be stored ... in practice if you only stored=true the

Most times the extracted text is much smaller, though there are
occasional zip files that may expand in size (and in an unrelated
note, multifile zip archives cause tika 0.7 to hang currently).

 fast, 128MB is really, really, really small for a typical Solr instance.

In any case I bumped up the heap to 3G as suggested, which has helped
stability.  I have found that in practice I need to commit every
extraction because a crash or error will wipe out all extractions
after the last commit.

 if you are only seeing one log line per request, then you are just looking
 at the request log ... there should be more logs with messages from all
 over the code base with various levels of severity -- and using standard
 java log level controls you can turn these up/down for various components.

Unfortunately, I'm not very familiar with java deploys so I don't know
where the standard controls are yet.  As a concrete example, I do see
INFO level logs, but haven't found a way to move up DEBUG level in
either solr or tomcat.  I was hopeful debug statements would point to
where extraction/indexing hangs were occurring.  I will keep poking
around, thanks for the tips.

Jim


Re: Diagnosing solr timeout

2010-06-09 Thread Jean-Sebastien Vachon
I use the following article as a reference when dealing with GC related issues

http://www.petefreitag.com/articles/gctuning/

I suggest you activate the verbose option and send GC stats to a file. I don't 
remember exactly what the option was, but you should find the information easily.
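
For reference (standard HotSpot JVM options, stated here as an assumption about the JVM in use rather than taken from the thread), GC logging can be turned on with something like:

java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Dsolr.solr.home=<solr home directory> -jar start.jar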

Good luck

On 2010-06-09, at 11:35 AM, Paul wrote:

 Have you looked at the garbage collector statistics? I've experienced this 
 kind of issues in the past
 and I was getting huge spikes when the GC was doing its job.
 
 I haven't, and I'm not sure what a good way to monitor this is. The
 problem occurs maybe once a week on a server. Should I run jstat the
 whole time and redirect the output to a log file? Is there another way
 to get that info?
 
 Also, I was suspecting GC myself. So, if it is the problem, what do I
 do about it? It seems like increasing RAM might make the problem worse
 because it would wait longer to GC, then it would have more to do.



Re: AW: how to get multicore to work?

2010-06-09 Thread xdzgor

Thanks for the comments. I still can't get this multicore thing to work!

Here is my directory structure:

d:
__apachesolr
lucidworks
__lucidworks
solr
__bin
__conf
__lib
tomcat

There is no solr.xml, and solr.solr.home points to
d:\apachesolr\lucidworks\lucidworks\solr

As it stands, solr works fine, and pages like
http://localhost:8983/solr/admin also work.

As soon as I put a solr.xml in the solr directory and restart the Tomcat
service, it all stops working.

<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="." />
  </cores>
</solr>

Any idea where I can look?
Where is the solr startup log written?

Thanks,
Peter
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-work-tp881826p883780.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: general debugging techniques?

2010-06-09 Thread Lance Norskog
https://issues.apache.org/jira/browse/LUCENE-2387

There is a memory leak that causes the last PDF binary file image to
stick around while working on the next binary image. When you commit
after every extraction, you clear up this memory leak.

This is fixed in trunk and should make it into a 'bug fix' Solr 1.4.1
if such a thing happens.

Lance

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo jim.bl...@pbworks.com wrote:
 On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 : That is still really small for 5MB documents. I think the default solr
 : document cache is 512 items, so you would need at least 3 GB of memory
 : if you didn't change that and the cache filled up.

 that assumes that the extracted text tika extracts from each document is
 the same size as the original raw files *and* that he's configured that
 content field to be stored ... in practice if you only stored=true the

 Most times the extracted text is much smaller, though there are
 occasional zip files that may expand in size (and in an unrelated
 note, multifile zip archives cause tika 0.7 to hang currently).

 fast, 128MB is really, really, really small for a typical Solr instance.

 In any case I bumped up the heap to 3G as suggested, which has helped
 stability.  I have found that in practice I need to commit every
 extraction because a crash or error will wipe out all extractions
 after the last commit.

 if you are only seeing one log line per request, then you are just looking
 at the request log ... there should be more logs with messages from all
 over the code base with various levels of severity -- and using standard
 java log level controls you can turn these up/down for various components.

 Unfortunately, I'm not very familiar with java deploys so I don't know
 where the standard controls are yet.  As a concrete example, I do see
 INFO level logs, but haven't found a way to move up DEBUG level in
 either solr or tomcat.  I was hopeful debug statements would point to
 where extraction/indexing hangs were occurring.  I will keep poking
 around, thanks for the tips.

 Jim




-- 
Lance Norskog
goks...@gmail.com


Re: How Solr Manages Connected Database Updates

2010-06-09 Thread Lance Norskog
The DataImportHandler has a tool for fetching recent updates in the
database and indexing only those new/changed records. It has no
scheduler. You would set up the DIH configuration and then write a
cron job to run it at regular intervals.
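
A minimal sketch of what the delta side of that DIH configuration can look like in data-config.xml (the table and column names here are hypothetical, not from the thread); the delta import is then triggered periodically, e.g. from cron, by requesting /dataimport?command=delta-import:

<entity name="item" pk="id"
        query="SELECT id, title FROM items"
        deltaQuery="SELECT id FROM items
                    WHERE updated_at > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM items
                          WHERE id = '${dataimporter.delta.id}'">
  <!-- map database columns onto schema fields -->
  <field column="id"    name="id"/>
  <field column="title" name="title"/>
</entity>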

Lance

On Wed, Jun 9, 2010 at 7:51 AM, Sumit Arora sumit1...@gmail.com wrote:
 Hey All,

 I am new to Solr Area, and just started exploring it and done basic stuff,
 now I am stuck with logic :

 How Solr Manages Connected Database Updates

 Scenario :

 -- Wrote one Indexing Program which runs on Tomcat , and by running this
 program, it reads  data from connected MySql Database and then perform
 Indexing.

 Use Case - Database is not fixed, Its a data base for a web application,
 from where user keep on inserting data, so database have frequent updates.
 almost every minute.

 How automatically solr should grab those changes and perform Index updation
 ?


 Do I need to Write a Cron Job kind of stuff ? Or Use Data Import Handler ?
 (Several ways could be ?)

 Is there any one who can provide his comments or share his experience If
 some one gone though from similar situation ?

 Thanks,
 -Sumit




-- 
Lance Norskog
goks...@gmail.com


Master master?

2010-06-09 Thread Glen Stampoultzis
Does Solr handle having two masters that are also slaves to each other (i.e.
in a cycle)?


Regards,

Glen


Re: Index-time vs. search-time boosting performance

2010-06-09 Thread Lance Norskog
Is it necessary that a document 1 year old be more relevant than one
that's 1 year and 1 hour old? In other words, can the boosting be
logarithmic wrt time instead of linear?

A schema design tip: you can store a separate date field which is
rounded down to the hour. This will make for a much smaller term
dictionary and therefore faster searching and range queries.
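
For illustration only (the field name and weights below are hypothetical), the range-boost idea from earlier in the thread could then be expressed as dismax bq clauses over such a rounded date field:

bq=pub_date_hour:[NOW/MONTH TO *]^3
bq=pub_date_hour:[NOW/MONTH-1MONTH TO NOW/MONTH]^2
bq=pub_date_hour:[NOW/MONTH-2MONTHS TO NOW/MONTH-1MONTH]^1.5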

On Mon, Jun 7, 2010 at 4:08 AM, Asif Rahman a...@newscred.com wrote:
 I still need a relatively precise boost.  No less precise than hourly.  I
 think that would make for a pretty messy field query.


 On Mon, Jun 7, 2010 at 2:15 AM, Lance Norskog goks...@gmail.com wrote:

 If you are unhappy with the performance overhead of a function boost,
 you can push it into a field query by boosting date ranges.

 You would group in date ranges: documents in September would be
 boosted 1.0, October 2.0, November 3.0 etc.


 On 6/5/10, Asif Rahman a...@newscred.com wrote:
  Thanks everyone for your help so far.  I'm still trying to get to the
 bottom
  of whether switching over to index-time boosts will give me a performance
  improvement, and if so if it will be noticeable.  This is all under the
  assumption that I can achieve the scoring functionality that I need with
  either index-time or search-time boosting (given the loss of precision.
  I
  can always dust off the old profiler to see what's going on with the
  search-time boosts, but testing the index-time boosts will require a full
  reindex, which could take days with our dataset.
 
  On Sat, Jun 5, 2010 at 9:17 AM, Robert Muir rcm...@gmail.com wrote:
 
  On Fri, Jun 4, 2010 at 7:50 PM, Asif Rahman a...@newscred.com wrote:
 
   Perhaps I should have been more specific in my initial post.  I'm
 doing
   date-based boosting on the documents in my index, so as to assign a
  higher
   score to more recent documents.  Currently I'm using a boost function
 to
   achieve this.  I'm wondering if there would be a performance
 improvement
  if
   instead of using the boost function at search time, I indexed the
  documents
   with a date-based boost.
  
  
  Asif, without knowing more details, before you look at performance you
  might
  want to consider the relevance impacts of switching to index-time
 boosting
  for your use case too.
 
  You can read more about the differences here:
  http://lucene.apache.org/java/3_0_1/scoring.html
 
  But I think the most important for this date-influenced use case is:
 
  Indexing time boosts are preprocessed for storage efficiency and
 written
  to
  the directory (when writing the document) in a single byte (!)
 
  If you do this as an index-time boost, your boosts will lose lots of
  precision for this reason.
 
  --
  Robert Muir
  rcm...@gmail.com
 
 
 
 
  --
  Asif Rahman
  Lead Engineer - NewsCred
  a...@newscred.com
  http://platform.newscred.com
 


 --
 Lance Norskog
 goks...@gmail.com




 --
 Asif Rahman
 Lead Engineer - NewsCred
 a...@newscred.com
 http://platform.newscred.com




-- 
Lance Norskog
goks...@gmail.com


Re: Need help with document format

2010-06-09 Thread Lance Norskog
This is what Field Collapsing does. It is a complex feature and is not
in the Solr trunk yet.

On Tue, Jun 8, 2010 at 9:15 AM, Moazzam Khan moazz...@gmail.com wrote:
 How would I do a facet search if I did this and not get duplicates?

 Thanks,
 Moazzam

 On Mon, Jun 7, 2010 at 10:07 AM, Israel Ekpo israele...@gmail.com wrote:
 I think you need a 1:1 mapping between the consultant and the company, else
 how are you going to run your queries for let's say consultants that worked
 for Google or AOL between March 1999 and August 2004?

 If the mapping is 1:1, your life would be easier and you would not need to
 do extra parsing of the results your retrieved.

 Unfortunately, it looks like you are going to have a lot of records.

 With an RDBMS, it is easier to do joins but with Lucene and Solr you have to
 denormalize all the relationships.

 Hence in this particular scenario, if you have 5 consultants that worked for
 4 distinct companies you will have to send 20 documents to Solr

 On Mon, Jun 7, 2010 at 10:15 AM, Moazzam Khan moazz...@gmail.com wrote:

 Thanks for the replies guys.


 I am currently storing consultants like this ..

  <doc>
    <id>123</id>
    <FirstName>tony</FirstName>
    <LastName>marjo</LastName>
    <Company>Google</Company>
    <Company>AOL</Company>
  </doc>

 I have a few multi valued fields so if I do it the way Israel
 suggested it, I will have tons of records. Do you think it will be
 better if I did this instead ?


  <doc>
    <id>123</id>
    <FirstName>tony</FirstName>
    <LastName>marjo</LastName>
    <Company>Google_StartDate_EndDate</Company>
    <Company>AOL_StartDate_EndDate</Company>
  </doc>

 Or is what you guys said better?

 Thanks for all the help.

 Moazzam


 On Mon, Jun 7, 2010 at 1:10 AM, Lance Norskog goks...@gmail.com wrote:
  And for 'present', you would pick some time far in the future:
  2100-01-01T00:00:00Z
 
  On 6/5/10, Israel Ekpo israele...@gmail.com wrote:
  You need to make each document added to the index a 1 to 1 mapping for
 each
  company and consultant combo
 
  <schema>

  <fields>
      <!-- Concatenation of company and consultant id -->
      <field name="consultant_id_company_id" type="string" indexed="true"
             stored="true" required="true"/>
      <field name="consultant_firstname" type="string" indexed="true"
             stored="true" multiValued="false"/>
      <field name="consultant_lastname" type="string" indexed="true"
             stored="true" multiValued="false"/>

      <!-- The name of the company the consultant worked for -->
      <field name="company" type="text" indexed="true" stored="true"
             multiValued="false"/>
      <field name="start_date" type="tdate" indexed="true" stored="true"
             multiValued="false"/>
      <field name="end_date" type="tdate" indexed="true" stored="true"
             multiValued="false"/>
  </fields>

  <defaultSearchField>text</defaultSearchField>

  <copyField source="consultant_firstname" dest="text"/>
  <copyField source="consultant_lastname" dest="text"/>
  <copyField source="company" dest="text"/>

  </schema>
 
  <!--
 
  So for instance, you have 2 consultants
 
  Michael Davis and Tom Anderson who worked for AOL and Microsoft, Yahoo,
  Google and Facebook.
 
  Michael Davis = 1
  Tom Anderson = 2
 
  AOL = 1
  Microsoft = 2
  Yahoo = 3
  Google = 4
  Facebook = 5
 
  This is how you would add the documents to the index
 
  -->
 
  <doc>
      <consultant_id_company_id>1_1</consultant_id_company_id>
      <consultant_firstname>Michael</consultant_firstname>
      <consultant_lastname>Davis</consultant_lastname>
      <company>AOL</company>
      <start_date>2006-02-13T15:26:37Z</start_date>
      <end_date>2008-02-13T15:26:37Z</end_date>
  </doc>

  <doc>
      <consultant_id_company_id>1_4</consultant_id_company_id>
      <consultant_firstname>Michael</consultant_firstname>
      <consultant_lastname>Davis</consultant_lastname>
      <company>Google</company>
      <start_date>2006-02-13T15:26:37Z</start_date>
      <end_date>2009-02-13T15:26:37Z</end_date>
  </doc>

  <doc>
      <consultant_id_company_id>2_3</consultant_id_company_id>
      <consultant_firstname>Tom</consultant_firstname>
      <consultant_lastname>Anderson</consultant_lastname>
      <company>Yahoo</company>
      <start_date>2001-01-13T15:26:37Z</start_date>
      <end_date>2009-02-13T15:26:37Z</end_date>
  </doc>

  <doc>
      <consultant_id_company_id>2_4</consultant_id_company_id>
      <consultant_firstname>Tom</consultant_firstname>
      <consultant_lastname>Anderson</consultant_lastname>
      <company>Google</company>
      <start_date>1999-02-13T15:26:37Z</start_date>
      <end_date>2010-02-13T15:26:37Z</end_date>
  </doc>
 
 
  Then you can search as:
 
  q=company:X AND start_date:[X TO *] AND end_date:[* TO Z]
 
  On Fri, Jun 4, 2010 at 4:58 PM, Moazzam Khan moazz...@gmail.com
 wrote:
 
  Hi guys,
 
 
  I have a list of consultants and the users (people who work for the
  company) are supposed to be able to search for consultants based on
  the time frame they worked for, for a company. For example, I should
  be able to search for all consultants who worked for Bear Stearns in
  the month of july. What is the best of accomplishing this?
 
  I was thinking of formatting the document like this
 
  company
    name 

Indexing HTML

2010-06-09 Thread Blargy

What is the preferred way to index html using DIH (my html is stored in a
blob field in our database)? 

I know there is the built in HTMLStripTransformer but that doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
to first tidy up the html using JTidy then I pass it to the
HTMLStripTransformer like so:

<field column="description" name="description" tidy="true"
    ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>

However this method isn't fool-proof as you can see by my ignoreErrors
option. 

I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete  html? Thanks






-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
Sent from the Solr - User mailing list archive at Nabble.com.


Can query boosting be used with a custom request handlers?

2010-06-09 Thread Andy
I want to try out the bobo plugin for Solr, which is a custom request  handler  
(http://code.google.com/p/bobo-browse/wiki/SolrIntegration).

At the same time I want to use BoostQParserPlugin to boost my queries, 
something like {!boost b=log(popularity)}foo

Can I use the {!boost} feature in conjunction with an external custom request 
handler like the bobo plugin, or does {!boost} only work with the standard 
request handler?


  


Re: Diagnosing solr timeout

2010-06-09 Thread Lance Norskog
Every time you reload the index it has to rebuild the cached facet
data. Could that be it?

Also, how big are the fields being highlighted? And are they indexed
with term vectors? (If not, the text is re-analyzed on the fly.)

How big are the caches? Are they growing and growing?

On Wed, Jun 9, 2010 at 11:12 AM, Jean-Sebastien Vachon
js.vac...@videotron.ca wrote:
 I use the following article as a reference when dealing with GC related issues

 http://www.petefreitag.com/articles/gctuning/

 I suggest you activate the verbose option and send GC stats to a file. I 
 don't remember exactly what
 was the option but you should find the information easily

 Good luck

 On 2010-06-09, at 11:35 AM, Paul wrote:

 Have you looked at the garbage collector statistics? I've experienced this 
 kind of issues in the past
 and I was getting huge spikes when the GC was doing its job.

 I haven't, and I'm not sure what a good way to monitor this is. The
 problem occurs maybe once a week on a server. Should I run jstat the
 whole time and redirect the output to a log file? Is there another way
 to get that info?

 Also, I was suspecting GC myself. So, if it is the problem, what do I
 do about it? It seems like increasing RAM might make the problem worse
 because it would wait longer to GC, then it would have more to do.





-- 
Lance Norskog
goks...@gmail.com


Re: Indexing HTML

2010-06-09 Thread Lance Norskog
The HTMLStripChar variants are newer and might work better.
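
A sketch of what that can look like as an index-time analyzer (the field type name is illustrative; the char filter strips the markup before tokenization):

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- strip HTML/XML markup from the raw value before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>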

On Wed, Jun 9, 2010 at 8:38 PM, Blargy zman...@hotmail.com wrote:

 What is the preferred way to index html using DIH (my html is stored in a
 blob field in our database)?

 I know there is the built in HTMLStripTransformer but that doesn't seem to
 work well with malformed/incomplete HTML. I've created a custom transformer
 to first tidy up the html using JTidy then I pass it to the
 HTMLStripTransformer like so:

 <field column="description" name="description" tidy="true"
     ignoreErrors="true" propertiesFile="config/tidy.properties"/>
 <field column="description" name="description" stripHTML="true"/>

 However this method isn't fool-proof as you can see by my ignoreErrors
 option.

 I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
 Is this something I should look into? Are there any alternatives that deal
 with malformed/incomplete  html? Thanks






 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


how to have shards parameter by default

2010-06-09 Thread Scott Zhang
Hi. I am running distributed search on Solr.
I have 70 Solr instances, so each time I want to search I need to use
?shards=localhost:7500/solr,...,localhost:7620/solr

It is a very long URL.

So how can I put the shards parameter into the config file so that I don't need
to type it each time?


thanks.
Scott


Re: how to have shards parameter by default

2010-06-09 Thread Scott Zhang
I tried putting shards into the default request handler.
But now every time I search, Solr hangs forever.
So what's the correct solution?

Thanks.

  <requestHandler name="standard" class="solr.SearchHandler"
      default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="fl">*</str>
      <str name="version">2.1</str>
      <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
      <!--  -->
    </lst>
  </requestHandler>
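
One workaround often suggested for this (not stated in this thread): keep the shards default out of the handler that the per-shard sub-requests themselves hit, i.e. register a separate, non-default handler for distributed queries and leave the standard handler without a shards parameter. A sketch, with a hypothetical handler name and a shortened shard list:

<!-- query this handler with qt=distrib; the standard handler, which the
     shard sub-requests use, has no shards default and so does not recurse -->
<requestHandler name="distrib" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr</str>
  </lst>
</requestHandler>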

On Thu, Jun 10, 2010 at 11:48 AM, Scott Zhang macromars...@gmail.comwrote:

 Hi. I am running distributed search on solr.
 I have 70 solr instances. So each time I want to search I need to use
 ?shards=localhost:7500/solr,localhost..7620/solr

 It is very long url.

 so how can I encode shards into config file then i don't need to type each
 time.


 thanks.
 Scott



Re: Indexing HTML

2010-06-09 Thread Blargy

Does the HTMLStripChar variant apply at index time or query time? Would it matter
to use one over the other?

As a side question, if I want to perform highlighter summaries against this
field do I need to store the whole field or just index it with
TermVector.WITH_POSITIONS_OFFSETS? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884579.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing HTML

2010-06-09 Thread Blargy

Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at
index time?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing HTML

2010-06-09 Thread Ken Krugler


On Jun 9, 2010, at 8:38pm, Blargy wrote:



What is the preferred way to index html using DIH (my html is stored  
in a

blob field in our database)?

I know there is the built in HTMLStripTransformer but that doesn't  
seem to
work well with malformed/incomplete HTML. I've created a custom  
transformer

to first tidy up the html using JTidy then I pass it to the
HTMLStripTransformer like so:

<field column="description" name="description" tidy="true"
ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>

However this method isn't fool-proof as you can see by my ignoreErrors
option.

I quickly took a peek at Tika and I noticed that it has its own  
HtmlParser.
Is this something I should look into? Are there any alternatives  
that deal

with malformed/incomplete  html? Thanks


Actually the Tika HtmlParser just wraps TagSoup - that's a good option  
for cleaning up busted HTML.


-- Ken


http://ken-blog.krugler.org
+1 530-265-2225





Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g