Re: Use a different folder for schema.xml

2012-08-22 Thread Ravish Bhagdev
You can include one xml file into another, something like


   <?xml version='1.0' encoding='utf-8'?>
   <!DOCTYPE document [ <!ENTITY resourcedb SYSTEM
   'file:/some/absolute/path/a.xml'> ]>
   <resource>
   <childofb>&resourcedb;</childofb>
   </resource>
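
Alternatively, recent Solr versions can pull a fragment in with XInclude (a
sketch, assuming XInclude support is available in your Solr version; the
path and fragment name are illustrative):

   <schema name="example" version="1.4"
           xmlns:xi="http://www.w3.org/2001/XInclude">
     <xi:include href="file:///some/absolute/path/fields.xml"/>
   </schema>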


- Ravish

On Wed, Aug 22, 2012 at 8:56 AM, Alexander Cougarman acoug...@bwc.org wrote:

 Thanks, Lance. Please forgive my ignorance, but what do you mean by soft
 links/XML include feature? Can you provide an example? Thanks again.

 Sincerely,
 Alex

 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: 22 August 2012 9:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Use a different folder for schema.xml

 It is possible to store the entire conf/ directory somewhere else. To store
 only the schema.xml file, try soft links or the XML include feature:
 conf/schema.xml can include content from somewhere else.

 On Tue, Aug 21, 2012 at 11:31 PM, Alexander Cougarman acoug...@bwc.org
 wrote:
  Hi. For our Solr instance, we need to put the schema.xml file in a
 different location than where it resides now. Is this possible? Thanks.
 
  Sincerely,
  Alex
 



 --
 Lance Norskog
 goks...@gmail.com



Re: Solr - case-insensitive search does not work

2012-08-22 Thread Ravish Bhagdev
 <filter class="solr.LowerCaseFilterFactory"/> is already present in your
field type definition (it's there twice now).

Are you adding quotes around your query by any chance?

Ravish

On Wed, Aug 22, 2012 at 11:31 AM, meghana meghana.rav...@amultek.com wrote:

 I want to apply case-insensitive search for the field *myfield* in Solr.

 I googled a bit and found that I need to apply *LowerCaseFilterFactory* to
 the field type, and that the field should be of type solr.TextField.

 I applied that in my *schema.xml* and re-indexed the data, but my search
 still seems to be case-sensitive.

 Below is the search that I perform:

 *http://localhost:8080/solr/select?q=myfield:"cloud university"&hl=on&hl.snippets=99&hl.fl=myfield*

 Below is the definition for the field type:

  <fieldType name="text_en_splitting" class="solr.TextField"
      positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords_en.txt"
              enablePositionIncrements="true"
              />
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1" catenateWords="1"
              catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"
              protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords_en.txt"
              enablePositionIncrements="true"
              />
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1" catenateWords="0"
              catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"
              protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

 and below is my field definition

  <field name="myfield" type="text_en_splitting" indexed="true"
         stored="true"/>

 Not sure what is wrong with this. Please help me resolve it.

 Thanks




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr - case-insensitive search does not work

2012-08-22 Thread Ravish Bhagdev
OK.  Try without quotes like myfield:cloud+university and see if it has any
effect.

Also, try both queries with debugging turned on and post the output of the
same ( http://wiki.apache.org/solr/CommonQueryParameters#Debugging )
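
For example (host, port and field taken from your earlier message; adjust to
your setup):

http://localhost:8080/solr/select?q=myfield:"cloud university"&debugQuery=on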

It must be either a field configuration issue, or the double quotes are
causing some analyzers not to work on your query.

Hope this helps.

Ravish

On Wed, Aug 22, 2012 at 12:11 PM, meghana meghana.rav...@amultek.com wrote:

 @Ravish Bhagdev, yes, I am adding double quotes around my search, as shown
 in my post. Like:

 myfield:"cloud university"





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002610.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr - case-insensitive search does not work

2012-08-22 Thread Ravish Bhagdev
Also, try comparing your field configuration to Solr's default text field
and see if you can spot any differences.

Ravish

On Wed, Aug 22, 2012 at 1:09 PM, Ravish Bhagdev ravish.bhag...@gmail.com wrote:

 OK.  Try without quotes like myfield:cloud+university and see if it has
 any effect.

 Also, try both queries with debugging turned on and post the output of the
 same ( http://wiki.apache.org/solr/CommonQueryParameters#Debugging )

 It must be either a field configuration issue, or the double quotes are
 causing some analyzers not to work on your query.

 Hope this helps.

 Ravish

 On Wed, Aug 22, 2012 at 12:11 PM, meghana meghana.rav...@amultek.com wrote:

 @Ravish Bhagdev, yes, I am adding double quotes around my search, as shown
 in my post. Like:

 myfield:"cloud university"





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002610.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr - case-insensitive search does not work

2012-08-22 Thread Ravish Bhagdev
Did you see my message about debugging parameters?  Try that and see what's
happening behind the scenes.

I can confirm that by default the queries are NOT case sensitive.

Ravish

On Wed, Aug 22, 2012 at 2:45 PM, meghana meghana.rav...@amultek.com wrote:

 Hi Ravish, the definition for text_en_splitting in the Solr default schema
 and mine are the same... still it's not working... any idea?




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-case-insensitive-search-do-not-work-tp4002605p4002645.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Score threshold 'reasonably', independent of results returned

2012-08-22 Thread Ravish Bhagdev
Commercial solutions often report a percentage that is meant to signify the
quality of a match.  Solr's score is relative, and you cannot tell just by
looking at the value whether a result is relevant enough to be on the first
page.  The score depends on what else is in the index, so it is not easy to
normalize it in the way you suggest.
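
If you really need a hard cutoff for specific queries, one trick (a sketch,
assuming the frange function-range parser and the query() function are
available in your Solr version; the 0.4 threshold is arbitrary) is to filter
on the score of the main query:

  fq={!frange l=0.4}query($q)

But the same caveat applies: a raw-score threshold that is sensible for one
query will be wrong for another.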

Ravish

On Wed, Aug 22, 2012 at 4:03 PM, Mou mouna...@gmail.com wrote:

 Hi,
 I think this totally depends on your requirements and is thus applicable
 only to a particular user scenario. Score does not have any absolute
 meaning; it is always relative to the query. If you want to watch some
 particular queries and show results with a score above a previously set
 threshold, you can use this.

 If I always have that x% threshold in place, there may be many queries
 which would not return anything, and I certainly do not want that.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Score-threshold-reasonably-independent-of-results-returned-tp4002312p4002673.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Effect of omitNorms and omitTermFreqAndPositions when using MLT handler?

2012-05-21 Thread Ravish Bhagdev
Ahh, this is because I have to override DefaultSimilarity to turn off
tf/idf scoring?  But this will apply to all the fields and general search
on text fields as well?  Is there a way to apply custom similarity to
specific field types or fields only?  Is there no way of turning TF/IDF off
without this?

Thanks,
Ravish

On Mon, May 21, 2012 at 10:24 AM, Ravish Bhagdev
ravish.bhag...@gmail.com wrote:

 Hi All,

 I was wondering if omitNorms will have any effect on MLT handler at all?

 I'm using schema version 1.2 with Solr 1.4 and have defined a couple of
 fields which I want to use for MLT lookup, and I don't want factors like
 field length or TF/IDF to affect the scores.  The definitions are as below:

  <fieldType name="lowercase" class="solr.TextField"
      positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_nonorms" class="solr.TextField"
      positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="0" catenateNumbers="0"
          catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- and the fields that use the above field types are -->
  <field name="PROFILE_TAGS" type="lowercase" indexed="true"
      stored="true" multiValued="true" termVectors="true"/>
  <field name="PROFILE_TAGS_TXT" type="text_nonorms" indexed="true"
      stored="true" multiValued="true" termVectors="true"/>

 In My solrconfig.xml I have defined following for my MLT request handler:

  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
    <lst name="defaults">
      <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
      <str name="mlt.qf">PROFILE_TAGS^10.0 PROFILE_TAGS_TXT^2.0</str>
      <int name="mlt.mindf">1</int>
      <int name="mlt.mintf">1</int>
      <str name="fl">id,score</str>
      <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
    </lst>
  </requestHandler>


 However, when I run my query as follows:

 http://localhost:9090/solr/mlt?fl=*,score&start=0&q=id:4417454.matchRecord&qt=/mlt&fq=targetDB:ConnectMeDB&rows=1000&debugQuery=on

 the debug scoring info shows the following:

 <str name="5042172.matchRecord">
 0.17156276 = (MATCH) product of:
   1.4296896 = (MATCH) sum of:
 0.24737607 = (MATCH) weight(PROFILE_TAGS_TXT:system^5.0 in 1472),
 product of:
   0.06376338 = queryWeight(PROFILE_TAGS_TXT:system^5.0), product of:
 5.0 = boost
 3.8795946 = idf(docFreq=538, maxDocs=9598)
 0.0032871156 = queryNorm
   3.8795946 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:system in 1472),
 product of:
 1.0 = tf(termFreq(PROFILE_TAGS_TXT:system)=1)
 3.8795946 = idf(docFreq=538, maxDocs=9598)
 1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
 0.65193653 = (MATCH) weight(PROFILE_TAGS_TXT:adapt^5.0 in 1472),
 product of:
   0.10351306 = queryWeight(PROFILE_TAGS_TXT:adapt^5.0), product of:
 5.0 = boost
 6.298109 = idf(docFreq=47, maxDocs=9598)
 0.0032871156 = queryNorm
   6.298109 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:adapt in 1472),
 product of:
 1.0 = tf(termFreq(PROFILE_TAGS_TXT:adapt)=1)
 6.298109 = idf(docFreq=47, maxDocs=9598)
 1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
 0.530377 = (MATCH) weight(PROFILE_TAGS_TXT:optic^5.0 in 1472), product
 of:
   0.093365155 = queryWeight(PROFILE_TAGS_TXT:optic^5.0), product of:
 5.0 = boost
 5.6806736 = idf(docFreq=88, maxDocs=9598)
 0.0032871156 = queryNorm
   5.6806736 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:optic in 1472),
 product of:
 1.0 = tf(termFreq(PROFILE_TAGS_TXT:optic)=1)
 5.6806736 = idf(docFreq=88, maxDocs=9598)
 1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
   0.12 = coord(3/25)
 </str>

 Which seems to suggest that TF/IDF scoring is still being performed on these fields!
  Also, does it make any difference if I specify omitNorms in field
 definition vs specifying in fieldType definition?

 I will appreciate any help with this.

 Thanks,
 Ravish



Re: No Effect of omitNorms and omitTermFreqAndPositions when using MLT handler?

2012-05-21 Thread Ravish Bhagdev
I found this:

https://issues.apache.org/jira/browse/LUCENE-2236

So, it seems this feature is not supported in Solr 1.4 at all.  Is there
any possible workaround?  If not, I'll have to consider splitting my
schema into two, which will be quite a big change :(
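
In the meantime, the only global workaround I can see is a flat Similarity
registered in schema.xml (a rough sketch against the Lucene 2.9 API bundled
with Solr 1.4; the class name is mine, and note it applies to every field,
which is exactly the problem):

  import org.apache.lucene.search.DefaultSimilarity;

  // Flattens tf and idf so term frequency and rarity stop influencing scores.
  public class FlatSimilarity extends DefaultSimilarity {
      @Override
      public float tf(float freq) {
          return freq > 0 ? 1.0f : 0.0f; // count a term once, however often it occurs
      }
      @Override
      public float idf(int docFreq, int numDocs) {
          return 1.0f; // ignore how rare the term is across the index
      }
  }

It would be registered near the end of schema.xml with something like
<similarity class="FlatSimilarity"/>.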

- Ravish

On Mon, May 21, 2012 at 11:03 AM, Ravish Bhagdev
ravish.bhag...@gmail.com wrote:

 Ahh, this is because I have to override DefaultSimilarity to turn off
 tf/idf scoring?  But this will apply to all the fields and general search
 on text fields as well?  Is there a way to apply custom similarity to
 specific field types or fields only?  Is there no way of turning TF/IDF off
 without this?

 Thanks,
 Ravish


 On Mon, May 21, 2012 at 10:24 AM, Ravish Bhagdev ravish.bhag...@gmail.com
  wrote:

 Hi All,

 I was wondering if omitNorms will have any effect on MLT handler at all?

 I'm using schema version 1.2 with Solr 1.4 and have defined a couple of
 fields which I want to use for MLT lookup, and I don't want factors like
 field length or TF/IDF to affect the scores.  The definitions are as below:

  <fieldType name="lowercase" class="solr.TextField"
      positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_nonorms" class="solr.TextField"
      positionIncrementGap="100" omitNorms="true" omitTermFreqAndPositions="true">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="0" catenateNumbers="0"
          catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- and the fields that use the above field types are -->
  <field name="PROFILE_TAGS" type="lowercase" indexed="true"
      stored="true" multiValued="true" termVectors="true"/>
  <field name="PROFILE_TAGS_TXT" type="text_nonorms" indexed="true"
      stored="true" multiValued="true" termVectors="true"/>

 In My solrconfig.xml I have defined following for my MLT request handler:

  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
    <lst name="defaults">
      <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
      <str name="mlt.qf">PROFILE_TAGS^10.0 PROFILE_TAGS_TXT^2.0</str>
      <int name="mlt.mindf">1</int>
      <int name="mlt.mintf">1</int>
      <str name="fl">id,score</str>
      <str name="mlt.fl">PROFILE_TAGS,PROFILE_TAGS_TXT</str>
    </lst>
  </requestHandler>


 However, when I run my query as follows:

 http://localhost:9090/solr/mlt?fl=*,score&start=0&q=id:4417454.matchRecord&qt=/mlt&fq=targetDB:ConnectMeDB&rows=1000&debugQuery=on

 the debug scoring info shows the following:

 <str name="5042172.matchRecord">
 0.17156276 = (MATCH) product of:
   1.4296896 = (MATCH) sum of:
 0.24737607 = (MATCH) weight(PROFILE_TAGS_TXT:system^5.0 in 1472),
 product of:
   0.06376338 = queryWeight(PROFILE_TAGS_TXT:system^5.0), product of:
 5.0 = boost
 3.8795946 = idf(docFreq=538, maxDocs=9598)
 0.0032871156 = queryNorm
   3.8795946 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:system in 1472),
 product of:
 1.0 = tf(termFreq(PROFILE_TAGS_TXT:system)=1)
 3.8795946 = idf(docFreq=538, maxDocs=9598)
 1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
 0.65193653 = (MATCH) weight(PROFILE_TAGS_TXT:adapt^5.0 in 1472),
 product of:
   0.10351306 = queryWeight(PROFILE_TAGS_TXT:adapt^5.0), product of:
 5.0 = boost
 6.298109 = idf(docFreq=47, maxDocs=9598)
 0.0032871156 = queryNorm
   6.298109 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:adapt in 1472),
 product of:
 1.0 = tf(termFreq(PROFILE_TAGS_TXT:adapt)=1)
 6.298109 = idf(docFreq=47, maxDocs=9598)
 1.0 = fieldNorm(field=PROFILE_TAGS_TXT, doc=1472)
 0.530377 = (MATCH) weight(PROFILE_TAGS_TXT:optic^5.0 in 1472),
 product of:
   0.093365155 = queryWeight(PROFILE_TAGS_TXT:optic^5.0), product of:
 5.0 = boost
 5.6806736 = idf(docFreq=88, maxDocs=9598)
 0.0032871156 = queryNorm
   5.6806736 = (MATCH) fieldWeight(PROFILE_TAGS_TXT:optic in 1472),
 product of:
 1.0 = tf(termFreq(PROFILE_TAGS_TXT:optic)=1)
 5.6806736 = idf(docFreq=88, maxDocs=9598)
 1.0

Re: A tool for frequent re-indexing...

2012-04-17 Thread Ravish Bhagdev
Thanks.  This is useful to know as well.

I was actually after SolrEntityProcessor
(http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor), which
I failed to notice until it was pointed out in a previous reply, because
I'm still on 1.4.

Cheers,
Ravish

On Fri, Apr 6, 2012 at 11:01 AM, Valeriy Felberg
valeri.felb...@gmail.com wrote:

 I've implemented something like described in
 https://issues.apache.org/jira/browse/SOLR-3246. The idea is to add an
 update request processor at the end of the update chain in the core
 you want to copy. The processor converts the SolrInputDocument to XML
 (there is some utility method for doing this) and dumps the XML into a
 file which can be fed into Solr again with curl. If you have many
 documents you will probably want to distribute the XML files into
 different directories using some common prefix in the id field.
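
 Roughly, the processor looks like this (a sketch from memory; the class
 name, directory handling and error handling are simplified, and it assumes
 SolrJ's ClientUtils.toXML for the serialization):

 import java.io.FileWriter;
 import java.io.IOException;
 import org.apache.solr.client.solrj.util.ClientUtils;
 import org.apache.solr.common.SolrInputDocument;
 import org.apache.solr.update.AddUpdateCommand;
 import org.apache.solr.update.processor.UpdateRequestProcessor;

 public class DumpToXmlProcessor extends UpdateRequestProcessor {
     private final String dumpDir;

     public DumpToXmlProcessor(String dumpDir, UpdateRequestProcessor next) {
         super(next);
         this.dumpDir = dumpDir;
     }

     @Override
     public void processAdd(AddUpdateCommand cmd) throws IOException {
         SolrInputDocument doc = cmd.solrDoc; // the incoming document
         String id = doc.getFieldValue("id").toString();
         // serialize the document and wrap it so curl can POST it to /update
         FileWriter out = new FileWriter(dumpDir + "/" + id + ".xml");
         try {
             out.write("<add>" + ClientUtils.toXML(doc) + "</add>");
         } finally {
             out.close();
         }
         super.processAdd(cmd); // continue the normal update chain
     }
 }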

 On Fri, Apr 6, 2012 at 5:18 AM, Ahmet Arslan iori...@yahoo.com wrote:
  I am considering writing a small tool that would read from one solr core
  and write to another as a means of quick re-indexing of data.  I have a
  large-ish set (hundreds of thousands) of documents that I've already
  parsed with Tika, and I keep changing bits and pieces in schema and
  config to try new things often.  Instead of having to go through the
  process of re-indexing from docs (and some DBs), I thought it may be
  much faster to just read from one core and write into a new core with
  new schema, analysers and/or settings.
 
  I was wondering if anyone else has done anything similar already?  It
  would be handy if I could use this sort of thing to spin off another
  core, write to it and then swap the two cores, discarding the older one.
 
  You might find these relevant:
 
  https://issues.apache.org/jira/browse/SOLR-3246
 
  http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
 
 



Re: pagerank??

2012-04-04 Thread Ravish Bhagdev
You might want to look into Nutch and its LinkRank instead of Solr for
this.  For obtaining such information, you need a crawler to crawl through
the links.  This is not what Solr is meant for.

Rav

On Wed, Apr 4, 2012 at 8:46 AM, Bing Li lbl...@gmail.com wrote:

 According to my knowledge, Solr cannot support this.

 In my case, I get data by keyword-matching from Solr and then rank the data
 by PageRank after that.

 Thanks,
 Bing

 On Wed, Apr 4, 2012 at 6:37 AM, Manuel Antonio Novoa Proenza 
 mano...@estudiantes.uci.cu wrote:

  Hello,
 
  I have many indexed documents in my Solr index.
 
  Is there any way or an efficient function to calculate the PageRank of
  the indexed websites?
 
 



Re: Incrementally updating a VERY LARGE field - Is this possible?

2012-04-04 Thread Ravish Bhagdev
Updating a single field is not possible in Solr.  The whole record has to
be rewritten.

300 MB is still not that big a file.  Have you tried doing the indexing (if
it's only a one-time thing) by giving it ~2 GB of -Xmx?

A single file of that size is strange!  May I ask what it is?

Rav

On Tue, Apr 3, 2012 at 7:32 PM, vybe3142 vybe3...@gmail.com wrote:


 Some days ago, I posted about an issue with SOLR running out of memory when
 attempting to index large text files (say 300 MB). Details at

 http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html

 Two things I need to point out:

 1. I don't need Tika for content extraction as the files are already in
 plain text format.
 2. The heap space error was caused by a futile Tika/SOLR attempt at
 creating
 the corresponding huge XML document in memory

 I've decided to develop a custom handler that
 1. reads the file text directly
 2. attempts to create a SOLR document and directly add the text data to the
 corresponding field.

 One approach I've taken is to read manageable chunks of text data
 sequentially from the file and process them. We've used this approach
 successfully with Lucene in the past, and I'm attempting to make it work
 with SOLR too. I got most of the work done yesterday, but need a bit of
 guidance w.r.t. point 2.

 How can I achieve updating the same field multiple times? Looking at the
 SOLR source, processor.addField() merely:
 a. adds to the in-memory field map, and
 b. attempts to write EVERYTHING to the index later on.

 In my situation, (a) eventually causes a heap space error.




 Here's part of the handler code.



 Thanks much

 Thanks

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Incremantally-updating-a-VERY-LARGE-field-Is-this-possibe-tp3881945p3881945.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Incrementally updating a VERY LARGE field - Is this possible?

2012-04-04 Thread Ravish Bhagdev
Yes, I think there are good reasons why it works like that.  The focus of a
search system is to be efficient on the query side, at the cost of being
less efficient on storage.

You must however also note that by default a field's length is limited to
10,000 tokens (maxFieldLength in solrconfig.xml), which you may also need to
modify.  But I guess if it's going out of memory you might have already done this?

Ravish

On Wed, Apr 4, 2012 at 1:34 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

 There is https://issues.apache.org/jira/browse/LUCENE-3837 but I suppose
 it's too far from completion.

 On Wed, Apr 4, 2012 at 2:48 PM, Ravish Bhagdev ravish.bhag...@gmail.com
 wrote:

  Updating a single field is not possible in Solr.  The whole record has to
  be rewritten.
 
  300 MB is still not that big a file.  Have you tried doing the indexing
  (if it's only a one-time thing) by giving it ~2 GB of -Xmx?
 
  A single file of that size is strange!  May I ask what it is?
 
  Rav
 
  On Tue, Apr 3, 2012 at 7:32 PM, vybe3142 vybe3...@gmail.com wrote:
 
  
   Some days ago, I posted about an issue with SOLR running out of memory
  when
   attempting to index large text files (say 300 MB ). Details at
  
  
 
 http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html
  
   Two things I need to point out:
  
   1. I don't need Tika for content extraction as the files are already in
   plain text format.
   2. The heap space error was caused by a futile Tika/SOLR attempt at
   creating
   the corresponding huge XML document in memory
  
   I've decided to develop a custom handler that
   1. reads the file text directly
   2. attempts to create a SOLR document and directly add the text data to
  the
   corresponding field.
  
   One approach I've taken is to read manageable chunks of text data
   sequentially from the file and process them. We've used this approach
   successfully with Lucene in the past, and I'm attempting to make it
   work with SOLR too. I got most of the work done yesterday, but need a
   bit of guidance w.r.t. point 2.
  
   How can I achieve updating the same field multiple times? Looking at
   the SOLR source, processor.addField() merely:
   a. adds to the in-memory field map, and
   b. attempts to write EVERYTHING to the index later on.
  
   In my situation, (a) eventually causes a heap space error.
  
  
  
  
   Here's part of the handler code.
  
  
  
   Thanks much
  
   Thanks
  
   --
   View this message in context:
  
 
 http://lucene.472066.n3.nabble.com/Incremantally-updating-a-VERY-LARGE-field-Is-this-possibe-tp3881945p3881945.html
   Sent from the Solr - User mailing list archive at Nabble.com.
  
 



 --
 Sincerely yours
 Mikhail Khludnev
 ge...@yandex.ru

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: Search for library returns 0 results, but search for marion library returns many results

2012-04-04 Thread Ravish Bhagdev
Yes, can you check whether the results you get with "marion library" match
on "marion" or on "library"?  By default Solr uses OR between words (the
default operator, set in schema.xml).  You can also easily check this by
enabling highlighting.

Ravish

On Wed, Apr 4, 2012 at 4:11 PM, Joshua Sumali jsum...@kobo.com wrote:

 Did you try to append debugQuery=on to get more information?

  -Original Message-
  From: Sean Adams-Hiett [mailto:s...@advantage-companies.com]
  Sent: Wednesday, April 04, 2012 10:43 AM
  To: solr-user@lucene.apache.org
  Subject: Search for library returns 0 results, but search for marion
 library
  returns many results
 
  This is cross posted on Drupal.org: http://drupal.org/node/1515046
 
  Summary: I have a fairly clean install of Drupal 7 with
  Apachesolr-1.0-beta18. I have created a content type called document with
  a number of fields. I am working with 30k+ records, most of which are
  related to Marion, IA in some way. A search for "library" (without the
  quotes) returns no results, while a search for "marion library" returns
  thousands of results. That doesn't make any sense to me at all.
 
  Details:
  - Drupal 7 (latest stable version)
  - Apachesolr-1.0-beta18
  - Custom content type with many fields
  - LAMP stack running on Centos Linode
  - PHP 5.2.x
 
  I also checked this through the Solr admin interface, running the same
  searches with similar results, so I can't rule out the possibility that
  something is configured wrong... but since I am using the solrconfig.xml
  and schema.xml files provided with the modules, it is also possible that
  the issue lies there. I have watched the logs, and during the searches
  that should produce results but don't, there is no output in the log
  besides the regular [INFO] lines about the query.
 
  I am stumped and I am past a deadline with this project, so any help
 would
  be greatly appreciated.
 
  --
  Sean Adams-Hiett
  Director of Development
  The Advantage Companies
  s...@advantage-companies.com
  www.advantage-companies.com



Re: Tags and Folksonomies

2012-04-03 Thread Ravish Bhagdev
Hi Hoss,

I am not sure why you suggest payloads for ranking documents with more
frequent tags above those with fewer tags.  Won't the term frequency part of
the relevancy score ensure this by default?  If you make tags a 'lowercase'
field (with full-value tokenisation), shouldn't the frequency of tags in the
multivalued field improve the score for doc A in the scenario below?

Payloads, I thought, would be more useful when you want some tags in a
record to be weighted more than others?  Or maybe I have missed some point.

Thanks,
Rav

On Tue, Apr 3, 2012 at 1:02 AM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Suppose I have content which has title and description. Users can tag
 content
 : and search content based on tag, title and description. Tag has more
 : weightage.
 :
 : Any inputs on how indexing and retrieval will work given there is content
 : and tags using Solr? Has anyone implemented search based on collaborative
 : tagging?

 simple stuff would be to have your 3 fields, and search them with a
 weighted boosting -- giving more importance to the tag field.

 where things get more complicated is when you want docA to score
 higher for the query "boat" than docB, because 100 users have tagged docA
 with "boat" but only 5 users have tagged docB with "boat".

 The canonical way to deal with this would be using payloads to boost the
 weight of a term -- the DelimitedPayloadTokenFilterFactory can help with
 this at index time, but off the top of my head i don't think any of the
 existing Solr QParsers will build the necessary PayloadTermQuery, so you
 might have to roll your own -- there are a few Jira issues with patches
 that you might be able to re-use or get inspired by...

 https://issues.apache.org/jira/browse/SOLR-1485
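
 For the index-time side, the payloads field type from the Solr example
 schema looks roughly like this (a sketch; the delimiter and encoder shown
 are the usual defaults):

   <fieldType name="payloads" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- a tag applied 100 times could be indexed once as "house|100.0" -->
       <filter class="solr.DelimitedPayloadTokenFilterFactory"
               encoder="float" delimiter="|"/>
     </analyzer>
   </fieldType>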




 -Hoss



Re: Apache solr not indexing complete pdf file using Tika

2012-04-03 Thread Ravish Bhagdev
I'd also suggest trying to extract text with tika-app (shipped with the Tika
distribution as an executable jar) on the PDF(s) in question, to see if the
problem is with extraction or with indexing.

Rav

On Mon, Apr 2, 2012 at 1:55 PM, Erick Erickson erickerick...@gmail.com wrote:

 You can index 2B tokens, so upping maxFieldLength should have
 fixed your problem at least as far as Solr is concerned. How
 many tokens get indexed? I'm not as familiar with Tika, but
 there may be some kind of parameter there (although I
 don't remember this coming up before)...

 Did you restart Solr after making the change to solrconfig.xml?

 If you're seeing 10,000 tokens or so, that's the default for
 maxFieldLength

 I'd recommend stopping Solr, rm -rf <solr home>/data/index,
 and restarting Solr just to be sure you're not seeing leftover
 junk; you'll have to re-index your docs after changing
 the maxFieldLength param.
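
 For reference, the setting lives in solrconfig.xml (under the index
 defaults in 3.x configs); a sketch raising it to the maximum:

   <maxFieldLength>2147483647</maxFieldLength>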


 Best
 Erick


 On Mon, Apr 2, 2012 at 7:19 AM, Manoj Saini manoj.sa...@stigasoft.com
 wrote:
  Hello Guys,
 
  I am using Apache Solr 3.3.0 with Tika 1.0.
 
  I have pdf files which I am pushing into Solr for content searching.
  Apache Solr is indexing the pdf files and I can see them in the Solr
  admin interface for search. But the issue is that Solr is not indexing
  the whole file content; it indexes only up to a limited size.
 
  Am I missing something, some configuration, or is this the behavior of
  Apache Solr?
 
  I have tried to update solrconfig.xml. I have updated ramBufferSizeMB,
  maxFieldLength.
 
  Thanks
  Manoj Saini
 
 
 
 
 
  Thanks,
 
  Best Regards,
 
 
 
  Manoj Saini | Sr. Software Engineer | Stigasoft
 
  m: +91 98 1034 1281
 
  e: manoj.sa...@stigasoft.com | w: www.stigasoft.com
 
 
 



Re: Tags and Folksonomies

2012-04-03 Thread Ravish Bhagdev
OK, yes that's true.  Although I'd expect term vectors to just increment the
term count when a tag is re-applied (if you have term vectors enabled),
increasing a boost stored as a payload each time an existing tag is
re-applied may be a more sensible approach if this is the case.
 You'll still have to rewrite the whole record for this, though, as it's not
possible to 'update' a specific field value in Solr, for efficiency reasons.

Rav

On Tue, Apr 3, 2012 at 4:50 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I am not sure why you suggest Payload for ranking documents with more
 : frequent tags above those with fewer tags.  Wont the term frequency part
 of
 : relevancy score ensure this by default?  If you make tags a 'lowercase'

 Sorry, yes ... absolutely - if you use omitNorms=false on the tags
 field, and add these two docs...

  { id: doc1; tags: [house, house, house, boat] }
  { id: doc2; tags: [house, boat, car, vegas] }

 ...then doc1 will score higher on a query for tags:house.

 my suggestion to use payloads was because sending the same value many many
 times (ie: if 100,000 users apply the tag "house" you would need to index
 that doc with the word "house" repeated 100,000 times) can be prohibitive.


 -Hoss



Re: ExtractingRequestHandler

2012-04-03 Thread Ravish Bhagdev
(Bit off-topic but...) I understand the fact that Solr isn't meant to
'store' everything, but because highlighting matches requires a field to be
stored, I would expect most people end up storing the full document content
in their indexes?  I can't think of any good workaround for this...

Rav

On Sun, Apr 1, 2012 at 6:15 PM, Erick Erickson erickerick...@gmail.com wrote:

 Yes, you can, but generally storing the raw input in Solr is
 not the best approach. The problem here is that pretty soon
 you get a huge index that contains *everything*. Solr was not
 intended to be a data store.

 Besides, you then need to store the binary form of the file. Solr
 only deals with text, not markup.

 Most people index the text in Solr, and enough information
 so the application knows where to go to fetch the original
 document when the user drills down (e.g. file path, database
 PK, etc). Would that work for your situation?

 Best
 Erick

 On Sat, Mar 31, 2012 at 3:55 PM,  spr...@gmx.eu wrote:
  Hi,
 
  I want to index various filetypes in Solr; this can easily be done with
  ExtractingRequestHandler. But I also need the extracted content back.
  I know ext.extract.only, but then nothing gets indexed, right?
 
  Can I index the document AND get the content back as with
 ext.extract.only?
  In a single request?
 
  Thank you
 
 



Re: Position Solr results

2012-04-03 Thread Ravish Bhagdev
Hi,

I don't believe Solr has anything built in that will do this for you.  You
will likely have to just get the IDs and look up the position at which the
ID you are referring to occurs (using Java or another programming
language/script).
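
Something along these lines with SolrJ would do it (a rough sketch; the URL,
query, field name and id value are all illustrative):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocumentList;

  public class FindPosition {
      public static void main(String[] args) throws Exception {
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
          SolrQuery q = new SolrQuery("university");
          q.setFields("id");
          q.setRows(100); // fetch enough rows to cover the positions you care about
          QueryResponse rsp = server.query(q);
          SolrDocumentList docs = rsp.getResults();
          int position = -1; // stays -1 if the site is not in the fetched rows
          for (int i = 0; i < docs.size(); i++) {
              if ("www.example.com".equals(docs.get(i).getFieldValue("id"))) {
                  position = i + 1; // 1-based rank in the result list
                  break;
              }
          }
          System.out.println("Position: " + position);
      }
  }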

Rav

On Sun, Apr 1, 2012 at 5:54 PM, Manuel Antonio Novoa Proenza 
mano...@estudiantes.uci.cu wrote:



 hi Marcelo

 In that sense I think the score does not help. The score is a number that
 does not tell me at what position in the generated results a given site
 appears.

 For example:

 I perform the following query: q = university

 Solr generates several results, among which is that of a certain website.
 Does Solr have some mechanism to let me know at what position this result
 appears?

 I reiterate that my English is very bad, so I am using a translator.

 thank you very much

 Manuel


 Regards...

 Manuel Antonio Novoa Proenza
 Universidad de las Ciencias Informáticas
 Email: mano...@estudiantes.uci.cu


 - Original Message -

 From: Marcelo Carvalho Fernandes mcf2...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Sunday, April 1, 2012 5:14:50
 Subject: Re: Position Solr results

 Try using the score field in the search results.

 ---
 Marcelo Carvalho Fernandes

 On Friday, March 30, 2012, Manuel Antonio Novoa Proenza 
 mano...@estudiantes.uci.cu wrote:
 
 
 
 
 
  Hi
 
  I'm not good with English, and for this reason I had to resort to a
 translator.
 
  I have the following question ...
 
   How can I get the position at which a certain website appears in the
  Solr results generated for a given search criterion?
 
  regards
 
  ManP
 
 
 
 
 
 

 --
 
 Marcelo Carvalho Fernandes
 +55 21 8272-7970
 +55 21 2205-2786






Highlighting matched interesting terms in MoreLikeThisHandler...

2012-03-19 Thread Ravish Bhagdev
Hi All,

I wonder if anyone else has had a requirement similar to this:

I'm using the MLT handler to return matching documents, matched on a
specific field, which works perfectly.  But I want to be able to show which
interesting terms matched for a given result set.  If there were a way of
listing these terms, or something like snippet highlighting, I would be able
to do this.  But it seems this is not supported at all, as far as I know?  I
came upon the following very old thread from 2009 when looking for a
solution:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3cf3ce8ddb0905010807l14f08470mf7dc961d872f7...@mail.gmail.com%3E

I wonder if there has been any resolution on this.  Has this been considered
as a new feature request yet?  Has anyone else had a similar requirement
that they could find a workaround for?  I believe Autonomy supports this
kind of "matching and what matched" functionality, so it must be a popular
requirement...

Thanks,
Ravish


Fwd: Using MLT Handler to find similar documents but also filter similar documents by a keyword.

2012-03-10 Thread Ravish Bhagdev
I will appreciate any comments or help on this. Thanks.

Rav

-- Forwarded message --
From: Ravish Bhagdev ravish.bhag...@gmail.com
Date: Fri, Mar 2, 2012 at 12:12 AM
Subject: Using MLT Handler to find similar documents but also filter
similar documents by a keyword.
To: solr-user@lucene.apache.org


Hi,

Apologies if this has been answered before, I tried searching for it and
didn't find anything answering this exactly.

I want to find similar documents using MLT Handler using some specified
fields but I want to filter down the returned matches with some keywords as
well.

I looked at the example provided at
http://wiki.apache.org/solr/MoreLikeThisHandler :

/solr/mlt?q=id:SP2514N&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&*fq=inStock:true*&mlt.interestingTerms=details

which is specifying a filter query using fq to filter (something).

I understand that the first document returned as a result of the query
(q=id:SP2514N) is used for performing the matching, and fq actually affects
this result rather than the matched documents returned by MLT.  Am I right
or wrong?

That is: is the fq in the above example going to filter the MLT match
results, or will it just affect the initial query used to get the first
document to match by?  If the former, that is what I want to do, but is fq
the way to do it?  Can I use this fq on any kind of text/string field?

I hope my question is making sense, it is a bit hard to explain so I am
sorry if not!

Thanks,
Ravish


Using MLT Handler to find similar documents but also filter similar documents by a keyword.

2012-03-01 Thread Ravish Bhagdev
Hi,

Apologies if this has been answered before, I tried searching for it and
didn't find anything answering this exactly.

I want to find similar documents using MLT Handler using some specified
fields but I want to filter down the returned matches with some keywords as
well.

I looked at the example provided at
http://wiki.apache.org/solr/MoreLikeThisHandler :

/solr/mlt?q=id:SP2514N&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&*fq=inStock:true*&mlt.interestingTerms=details

which is specifying a filter query using fq to filter (something).

I understand that the first document returned as a result of the query
(q=id:SP2514N) is used for performing the matching, and fq actually affects
this result rather than the matched documents returned by MLT.  Am I right
or wrong?

That is: is the fq in the above example going to filter the MLT match
results, or will it just affect the initial query used to get the first
document to match by?  If the former, that is what I want to do, but is fq
the way to do it?  Can I use this fq on any kind of text/string field?

I hope my question is making sense, it is a bit hard to explain so I am
sorry if not!

Thanks,
Ravish


Re: highlight issue

2011-12-02 Thread Ravish Bhagdev
Also, I'm not entirely sure wildcards are supported on text-based fields,
only on strings.  Things may have changed in recent versions of Solr; I am
not sure.

R

On Thu, Dec 1, 2011 at 3:55 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Suppose my search query is *Rak*. In my database I have the name *Rakesh
 Chaturvedi*.
 I am getting *<em>Rak</em><em>Rak</em>esh Chaturvedi* as the response.

 Same is the case with the following names.

 Search Dhar -- highlight <em>Dhar</em><em>Dhar</em>mesh Darshan
 Search Suda -- highlight <em>Suda</em><em>Suda</em>rshan Faakir

 Can someone help me?

 I am using the following filters for index and query.

 <fieldType name="text_autofill" class="solr.TextField"
     positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="1" preserveOriginal="1"/>
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
         maxGramSize="50" side="front"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="1" preserveOriginal="1"/>
   </analyzer>
 </fieldType>


 I don't think the Highlighter can support an n-gram field.
 Can you try commenting out EdgeNGramFilterFactory, re-indexing, and then
 highlighting?

 koji
 --
 Check out Query Log Visualizer for Apache Solr
 http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
 http://www.rondhuit.com/en/



Re: Solr messing up the UK GBP (pound) symbol in response, even though the Java environment variable for file encoding is set to UTF-8....

2011-09-28 Thread Ravish Bhagdev
Thanks Chris.  Yes, changing the connector settings not just in Solr but
also in all the webapps that were sending queries to it solved the problem!
Appreciate the help.
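
For anyone hitting the same thing: the relevant bit in Tomcat's server.xml
looks roughly like this (a sketch; the port and other attributes will vary
per install):

  <Connector port="8080" protocol="HTTP/1.1"
             connectionTimeout="20000" redirectPort="8443"
             URIEncoding="UTF-8"/>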

R

On Tue, Sep 13, 2011 at 6:11 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Any idea why solr is unable to return the pound sign as-is?
 :
 : I tried typing in £ 1 million in Solr admin GUI and got following
 response.
 ...
 : str name=q£ 1 million/str
...
 : Here is my Java Properties I got also from admin interface:
...
 : catalina.home =
 : /home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/

 Looks like you are using tomcat, so I suspect you are getting bit by
 this...

 https://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

 If that's not the problem, please try running the
 example/exampledocs/test_utf8.sh script against your Solr instance (you'll
 need to change the URL variable to match your host:port)


 -Hoss


Solr messing up the UK GBP (pound) symbol in response, even though the Java environment variable for file encoding is set to UTF-8....

2011-09-11 Thread Ravish Bhagdev
Any idea why Solr is unable to return the pound sign as-is?

I tried typing in £ 1 million in the Solr admin GUI and got the following response.

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">5</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">£ 1 million</str>
<str name="rows">10</str>
<str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>

Here are my Java properties, also taken from the admin interface:

java.runtime.name = Java(TM) SE Runtime Environment
sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64
java.vm.version = 20.1-b02
solr.data.dir = target/solr_data
shared.loader =
java.vm.vendor = Sun Microsystems Inc.
java.vendor.url = http://java.sun.com/
path.separator = :
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
tomcat.util.buf.StringCache.byte.enabled = true
file.encoding.pkg = sun.io
user.country = GB
sun.java.launcher = SUN_STANDARD
sun.os.patch.level = unknown
java.vm.specification.name = Java Virtual Machine Specification
user.dir = /home/rbhagdev/SCCRepos/SCC_Platform/search/solr
java.runtime.version = 1.6.0_26-b03
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
java.endorsed.dirs = /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/endorsed
os.arch = amd64
java.io.tmpdir = /tmp
line.separator =

java.vm.specification.vendor = Sun Microsystems Inc.
java.naming.factory.url.pkgs = org.apache.naming
os.name = Linux
classworlds.conf = /usr/share/maven2/bin/m2.conf
sun.jnu.encoding = UTF-8
java.library.path =
/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
java.specification.name = Java Platform API Specification
java.class.version = 50.0
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.38-11-generic
user.home = /home/rbhagdev
user.timezone = Europe/London
catalina.useNaming = true
java.awt.printerjob = sun.print.PSPrinterJob
java.specification.version = 1.6
file.encoding = UTF-8
solr.solr.home = src/test/resources/solr_home
catalina.home =
/home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/tomcat
user.name = rbhagdev
java.class.path = /usr/share/maven2/boot/classworlds.jar
java.naming.factory.initial = org.apache.naming.java.javaURLContextFactory
package.definition =
sun.,java.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.
java.vm.specification.version = 1.0
sun.arch.data.model = 64
java.home = /usr/lib/jvm/java-6-sun-1.6.0.26/jre
sun.java.command = org.codehaus.classworlds.Launcher tomcat:run-war
java.specification.vendor = Sun Microsystems Inc.
user.language = en
java.vm.info = mixed mode
java.version = 1.6.0_26
java.ext.dirs =
/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/ext:/usr/java/packages/lib/ext
securerandom.source = file:/dev/./urandom
sun.boot.class.path =
/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/modules/jdk.boot.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/classes
java.vendor = Sun Microsystems Inc.
server.loader =
maven.home = /usr/share/maven2
catalina.base = /home/rbhagdev/SCCRepos/SCC_Platform/search/solr/target/tomcat
file.separator = /
java.vendor.url.bug = http://java.sun.com/cgi-bin/bugreport.cgi
common.loader = ${catalina.home}/lib,${catalina.home}/lib/*.jar
sun.cpu.endian = little
sun.io.unicode.encoding = UnicodeLittle
package.access =
sun.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.,sun.beans.
sun.desktop = gnome
sun.cpu.isalist =

Thanks,

Ravish


Getting sum of all terms count in dataset instead of document count using TermsComponent....(and TermsComponent vs Facets)

2011-02-27 Thread Ravish Bhagdev
Hi Guys,

I need a bit of help.

I want to produce frequency analysis of all tokens inside my solr Index from
a specific (content) field.

When I use TermsComponent or facet counts, what I get is how many records or
documents each term appears in (which again confuses me as to what the
difference is: are facets restricted to terms in the result set, while
TermsComponent is not restricted by the query?).  Is there a way to get the
total term count (not per document, but across the whole index)?  I have
tried searching in the archives and across the web, but the closest match I
found is this: http://search-lucene.com/m/of5Fn1PUOHU/

It is suggested in this post that I can paste the mentioned lines of code
into TermsComponent.java and it should work.  However, the code seems to
have changed since then, and when I try this, the class TermDocs is not even
recognized.

I was wondering if there is any other way, using Lucene or Solr, to do
this.  I will be very grateful for any reply.  If it helps, below is the
code I am running right now, which gives me the document count and not the
term count.

String queryString = "document:*";

SolrQuery solrQuery = new SolrQuery();
solrQuery.setQuery(queryString);
solrQuery.setQueryType("/terms");
solrQuery.setTerms(true);
solrQuery.setTermsLimit(20);
solrQuery.setParam("terms.fl", "document");
solrQuery.setTermsSortString("count");

QueryResponse solrResp = conf._solr.executeQuery(solrQuery, 0, 10);

TermsResponse termsResp = solrResp.getTermsResponse();
List<TermsResponse.Term> terms = termsResp.getTerms("document");

Ignore the conf object and _solr variable; that's just my internal singleton
object.

Thanks,
Ravish Bhagdev


Re: Getting sum of all terms count in dataset instead of document count using TermsComponent....(and TermsComponent vs Facets)

2011-02-27 Thread Ravish Bhagdev
Yes, you are right.  Ignore the query (document:*); it won't matter for
TermsComponent, I guess.

I've compiled the current source from head, but have also tried 1.4.1.

Any idea how to go about finding a solution to this?
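
In case it helps, the direction I am exploring with raw Lucene looks like
this (a rough sketch against the Lucene 2.9 API that ships with Solr 1.4;
the index path and field name are mine):

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.index.TermEnum;
  import org.apache.lucene.store.FSDirectory;

  public class TotalTermFreqs {
      public static void main(String[] args) throws Exception {
          IndexReader reader = IndexReader.open(
                  FSDirectory.open(new File("/path/to/solr/data/index")));
          TermEnum terms = reader.terms(new Term("document", ""));
          do {
              Term t = terms.term();
              if (t == null || !"document".equals(t.field())) break;
              long total = 0;
              TermDocs docs = reader.termDocs(t);
              while (docs.next()) {
                  total += docs.freq(); // occurrences in this document
              }
              docs.close();
              // total occurrences of the term across the whole index
              System.out.println(t.text() + "\t" + total);
          } while (terms.next());
          terms.close();
          reader.close();
      }
  }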

Thanks,
Ravish

On Sun, Feb 27, 2011 at 1:56 PM, Ahmet Arslan iori...@yahoo.com wrote:

  I want to produce frequency analysis of all tokens inside my solr Index
  from a specific (content) field.
 
  When I use TermsComponent or FacetCounts, what I get is how many records
  or documents each term appears in (which again confuses me as to what
  the difference is, is it facets are restricted to terms in result set
  and termscomponent is not restricted by the query?).  Is there yet a way
  to get total terms count (not per document but across the whole index)?

 Terms Component does not respect the q= parameter. In other words, it is
 not restricted by the query.

  I have tried searching in archives and across the web, but the closest
  match I found is this: http://search-lucene.com/m/of5Fn1PUOHU/
 
  It is suggested in this post that I can paste the mentioned lines of
  code into TermsComponent.java and it should work.  However, the code
  seems to have changed since, and when I try this, the class TermDocs is
  not even recognized.

 What version of solr are you using?







very quick question that will help me greatly... OR query syntax when using fields for solr dataset....

2011-02-15 Thread Ravish Bhagdev
Hi Guys,

I've been trying various combinations but have been unable to perform an OR
query for a specific field in my Solr schema.

I have a string field called myfield and I want to return all documents
whose value for this field matches either abc or xyz.

So all records that have myfield=abc and all records that have myfield=xyz
should be returned (a union).

What should my query be?  I have tried (myfield:abc OR myfield:xyz), which
runs, but only returns the documents that contain xyz in that field, which I
find quite weird. I have tried running this as an fq query as well, but with
the same result!

It is such a simple thing, but I can't find the right syntax after going
through a lot of documentation and searching.

Will appreciate any quick reply or examples; thanks very much.

Ravish


Re: very quick question that will help me greatly... OR query syntax when using fields for solr dataset....

2011-02-15 Thread Ravish Bhagdev
Hi Jan,

Thanks for reply.

I have tried the first variation in your example (and again after reading
your reply).

It returns no results!

Note: it is not a multivalued field. I think when you use example 1 below,
it looks for both xyz and abc in the same field of the same document; what
I'm trying to get are all records that match either of the two.

I hope I am making sense.

Thanks,
Ravish

On Tue, Feb 15, 2011 at 1:47 PM, Jan Høydahl jan@cominvent.com wrote:

 http://wiki.apache.org/solr/SolrQuerySyntax

 Examples:
 q=myfield:(xyz OR abc)

 q={!lucene q.op=OR df=myfield}xyz abc

 q=xyz OR abc&defType=edismax&qf=myfield

 PS: If using type=string, you will not match individual words inside the
 field, only an exact, case-sensitive match of the whole field. Use some
 variant of text if this is not what you want.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 15. feb. 2011, at 14.39, Ravish Bhagdev wrote:

  Hi Guys,
 
  I've been trying various combinations but unable to perform an OR query
 for
  a specific field in my solr schema.
 
  I have a string field called myfield and I want to return all documents
 that
  have this field which either matches abc or  xyz
 
  So all records that have myfield=abc and all records that have
 myfield=xyz
  should be returned (union)
 
  What should my query be?  I have tried (myfield=abc OR myfield=xyz) which
  works, but only returns all the documents that contain xyz in that field,
  which I find quite weird. I have tried running this as fq query as well
 but
  same result!
 
  It is such a simple thing but I can't find right syntax after going
 through
  a lot of documentation and searching.
 
  Will appreciate any quick reply or examples, thanks very much.
 
  Ravish




Re: very quick question that will help me greatly... OR query syntax when using fields for solr dataset....

2011-02-15 Thread Ravish Bhagdev
Arghhh...

I think it's the regexp parser messing things up (I just looked at the
debugQuery output and it's parsing incorrectly some "/" kind of characters
I had).

I think I can clean the data of these characters, or maybe there is a
way to escape them...

Ravish

On Tue, Feb 15, 2011 at 1:54 PM, Ravish Bhagdev ravish.bhag...@gmail.com wrote:

 Hi Jan,

 Thanks for reply.

 I have tried the first variation in your example (and again after reading
 your reply).

 It returns no results!

  Note: it is not a multivalued field. I think when you use example 1
  below, it looks for both xyz and abc in the same field of the same
  document; what I'm trying to get are all records that match either of
  the two.
 I hope I am making sense.

 Thanks,
 Ravish


 On Tue, Feb 15, 2011 at 1:47 PM, Jan Høydahl jan@cominvent.com wrote:

 http://wiki.apache.org/solr/SolrQuerySyntax

 Examples:
 q=myfield:(xyz OR abc)

 q={!lucene q.op=OR df=myfield}xyz abc

  q=xyz OR abc&defType=edismax&qf=myfield

  PS: If using type=string, you will not match individual words inside the
  field, only an exact, case-sensitive match of the whole field. Use some
  variant of text if this is not what you want.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 15. feb. 2011, at 14.39, Ravish Bhagdev wrote:

  Hi Guys,
 
   I've been trying various combinations but unable to perform an OR query
 for
  a specific field in my solr schema.
 
  I have a string field called myfield and I want to return all documents
 that
  have this field which either matches abc or  xyz
 
  So all records that have myfield=abc and all records that have
 myfield=xyz
  should be returned (union)
 
  What should my query be?  I have tried (myfield=abc OR myfield=xyz)
 which
  works, but only returns all the documents that contain xyz in that
 field,
  which I find quite weird. I have tried running this as fq query as well
 but
  same result!
 
  It is such a simple thing but I can't find right syntax after going
 through
  a lot of documentation and searching.
 
  Will appreciate any quick reply or examples, thanks very much.
 
  Ravish





Re: very quick question that will help me greatly... OR query syntax when using fields for solr dataset....

2011-02-15 Thread Ravish Bhagdev
Hi Erick,

I've managed to fix the problem; it was to do with not escaping certain
characters.  Escaped them with \ and it all works fine now :).  Sorry, I was
just being insane; looking at the debugQuery output helped.

I know about the string field; this is kind of a uuid field that I am
storing, so it is desired that it always be an exact match, which is why I
was careful about choosing that type.

I am going to start looking at all that is available as Analyzers soon;
something that does string-distance matching would be cool.

Ravish

On Tue, Feb 15, 2011 at 2:30 PM, Erick Erickson erickerick...@gmail.com wrote:

 You might look at the analysis page from the admin console for the
 field in question, it'll show you what various parts of the analysis chain
 do.

 But I agree with Jan, having your field as a string type is a red flag.
 This
 field is NOT analyzed, parsed, or filtered. For instance, if a doc has
 a value for the field of: [My life], only [My life] will match. Not [my],
 not
 [life], not even [my life] (ignore all brackets, but quotes are often
 confused
 with phrases).

 It may well be that this is the exact behavior you want, but this is often
 a point of confusion.

 Best
 Erick

 On Tue, Feb 15, 2011 at 9:00 AM, Ravish Bhagdev
 ravish.bhag...@gmail.com wrote:
  Arghhh..
 
  I think it's the regexp parser messing things up (I just looked at the
  debugQuery output and it's incorrectly parsing some characters I had,
  like /).
 
  I think I can clean the data of these characters, or maybe there is a
  way to escape them...
 
  Ravish
 
  On Tue, Feb 15, 2011 at 1:54 PM, Ravish Bhagdev 
 ravish.bhag...@gmail.com wrote:
 
  Hi Jan,
 
  Thanks for reply.
 
  I have tried the first variation in your example (and again after
 reading
  your reply).
 
  It returns no results!
 
  Note: it is not a multivalued field. I think when you use example 1
  below, it looks for both xyz and abc in the same field of the same
  document; what I'm trying to get is all records that match either of
  the two.
 
  I hope I am making sense.
 
  Thanks,
  Ravish
 
 
  On Tue, Feb 15, 2011 at 1:47 PM, Jan Høydahl jan@cominvent.com
 wrote:
 
  http://wiki.apache.org/solr/SolrQuerySyntax
 
  Examples:
  q=myfield:(xyz OR abc)
 
  q={!lucene q.op=OR df=myfield}xyz abc
 
  q=xyz OR abc&defType=edismax&qf=myfield
 
  PS: If using type=string, you will not match individual words inside the
  field, only an exact, case-sensitive match of the whole field. Use some
  variant of text if this is not what you want.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
  On 15. feb. 2011, at 14.39, Ravish Bhagdev wrote:
 
   Hi Guys,
  
   I've been trying various combinations but have been unable to perform
   an OR query for a specific field in my solr schema.
  
   I have a string field called myfield and I want to return all documents
   where this field matches either abc or xyz.
  
   So all records that have myfield=abc and all records that have
   myfield=xyz should be returned (union).
  
   What should my query be?  I have tried (myfield=abc OR myfield=xyz),
   which works, but only returns the documents that contain xyz in that
   field, which I find quite weird. I have tried running this as an fq
   query as well, but with the same result!
  
   It is such a simple thing, but I can't find the right syntax after
   going through a lot of documentation and searching.
  
   Will appreciate any quick reply or examples, thanks very much.
  
   Ravish
 
 
 
 



Re: Are there any restrictions on what kind of or how many fields you can use in a Pivot Query? I get ClassCastException when I use some of my string fields, and don't when I use some other string fields

2011-02-15 Thread Ravish Bhagdev
Looks like it's a bug, is it not?

Ravish

On Tue, Feb 15, 2011 at 4:03 PM, Ravish Bhagdev ravish.bhag...@gmail.com wrote:

 When I include some of the fields in my search query, I get:

 SEVERE: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to
 [Lorg.apache.solr.common.util.ConcurrentLRUCache$CacheEntry;
  at
 org.apache.solr.common.util.ConcurrentLRUCache$PQueue.myInsertWithOverflow(ConcurrentLRUCache.java:377)
 at
 org.apache.solr.common.util.ConcurrentLRUCache.markAndSweep(ConcurrentLRUCache.java:329)
  at
 org.apache.solr.common.util.ConcurrentLRUCache.put(ConcurrentLRUCache.java:144)
 at org.apache.solr.search.FastLRUCache.put(FastLRUCache.java:131)
  at
 org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:904)
 at
 org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:121)
  at
 org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:126)
 at
 org.apache.solr.handler.component.PivotFacetHelper.process(PivotFacetHelper.java:85)
  at
 org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:84)
 at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
  at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)
  at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
  at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
  at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
 at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
  at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
 at java.lang.Thread.run(Thread.java:662)

 It works with some fields, not with others...

 What could be the problem?  It is hard to tell from just that exception,
 as it refers to Solr's internal files... any pointers will help me debug.

 Thanks,
 Ravish



Why solr relies on solr.solr.home???

2008-08-08 Thread Ravish Bhagdev
Hi,

This may be a naive question, but do we really need the solr.solr.home
variable for a Solr installation?  It is a bit annoying to modify Tomcat
settings in an automated install.  If I create a packaged application, how
do I ensure a normal user can install it without having to modify Tomcat
batch or shell files (or service settings, in the case of an MSI
installer)?  If that isn't possible, what would be the easiest way to
automate the process (cross-platform)?
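
The closest thing I have found so far is a per-application Tomcat context
fragment (e.g. conf/Catalina/localhost/solr.xml) that sets solr/home via
JNDI, so the startup scripts stay untouched -- something like this, if I
am reading the wiki correctly (paths made up):

  <Context docBase="/opt/myapp/solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String"
                 value="/opt/myapp/solr" override="true"/>
  </Context>

But I would still prefer not to depend on container-specific files at all.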

Also, is it possible to run solr without needing to host it in a http
container?  Why do we need webapp to index or query??

Ravi


Re: Incremental indexing of database

2008-07-22 Thread Ravish Bhagdev
Can't you write triggers for the database tables you want to index?
That way you can keep track of all kinds of changes and updates,
not just the addition of new records.

Ravish

On Tue, Jul 22, 2008 at 8:15 PM, anshuljohri [EMAIL PROTECTED] wrote:

 Hi,

 In my project I have to index a whole database, which contains text data
 only.  So if I follow an incremental indexing approach, my problem is how
 I will pick the delta data from the database.  Is there any utility in
 Solr to keep track of the last indexed record?  Or is there any other
 approach to solve this problem?

 Thanks,
 Anshul Johri
 --
 View this message in context: 
 http://www.nabble.com/Incremental-indexing-of-database-tp18596613p18596613.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Is it possible to add synonyms run time?

2008-01-25 Thread Ravish Bhagdev
Yes, I'm fairly new as well.

So do you mean adding words to the query, effectively doing an OR
between the synonymous terms?  That sounds like a simple way of doing it;
if it works, what makes indexing with synonyms useful?
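
(If so, I guess that means turning a query like myfield:hit into something
like myfield:(hit OR strike OR punch) behind the scenes?)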

Ravish

On Jan 25, 2008 2:42 PM, Jon Lehto [EMAIL PROTECTED] wrote:
 Hi Ravish,

 You may want to think about the synonym dictionary as being a tool on the 
 side, rather than each indexed document having a copy of the synonyms. At 
 indexing time, one might normalize synonyms to a single value, and at query 
 time do the same to get the match.

 Alternately, use the synonym dictionary at run-time to expand a
 user's query terms, like a thesaurus.

 That said, I'm new to the tool, and not clear on how synonyms are implemented.

 Jon
 =
 From: Ravish Bhagdev [EMAIL PROTECTED]
 Date: 2008/01/25 Fri AM 08:24:33 CST
 To: solr-user@lucene.apache.org
 Subject: Is it possible to add synonyms run time?


 As I understood from available documentation, synonyms need to be
 defined before starting the indexing process.  Is it possible to add
 synonyms at run time such that all index fields of all documents get
 updated?  Does it work for newly added documents at least?

 Also, how can I make each user of the application define his own set of
 synonyms that others are oblivious to (others get normal results,
 without synonyms considered)?

 Thanks,
 Ravish




Is it possible to add synonyms run time?

2008-01-25 Thread Ravish Bhagdev
As I understood from available documentation, synonyms need to be
defined before starting the indexing process.  Is it possible to add
synonyms at run time such that all index fields of all documents get
updated?  Does it work for newly added documents atleast?

Also, how to make each user of application define his own set of
synonyms that others should be oblivious to (others get normal results
without synon considered)

Thanks,
Ravish


Re: Is it possible to add synonyms run time?

2008-01-25 Thread Ravish Bhagdev
I see, thanks a lot for this, makes things clear now.

So just to make sure I understand this bit: by injecting synonyms at
query time, do you mean basically adding terms implicitly to the keywords
behind the scenes, before passing them to Solr?  Or is there a more
conventional method or interface being suggested?
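
If I follow the canonical-term idea below correctly, I suppose the
synonyms.txt mapping at index time would look something like this (just
my guess at the syntax):

  strike, popular, punch => hit

with the SynonymFilterFactory applied at both index and query time, so
that everything normalizes to hit.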

Thanks for all the help!

Ravish

On Jan 25, 2008 3:59 PM, Erick Erickson [EMAIL PROTECTED] wrote:
 To me, it's really a question of where the work should be done given your
 problem space. Injecting synonyms at index time allows the queries to be
 simpler/faster. Injecting the synonyms at query time gets complex but is
 more flexible.

 As always, it's a time/space tradeoff. If you're willing to pay the space
 penalty for increased query speed, inject at index time. Otherwise
 you can inject at query time.

 And the query-time injection performance hit may not be trivial. Consider,
 for instance, span queries. Do you want to pay the price at query time for,
 say a BooleanQuery that is composed of 5 SpanQueries where each
 term in each SpanQuery consists of several OR clauses because of
 synonym injection? Perhaps you do and perhaps you don't. It all depends
 upon what your data looks like and what your performance criteria are.

 And you can do other tricks. Consider, rather than indexing all the terms,
 only indexing the canonical term. That is, consider hit and the synonyms
 strike, popular, punch: you could index hit for any of the 4 terms,
 then do the same substitution for your query. That would make your
 index smaller *and* your queries faster.

 But you're right. Injecting synonyms at index time really requires a fixed
 synonym list that doesn't vary by user. So if you want synonym
 lists on a per-user basis, you're probably going to have to inject synonyms
 at query time.

 Best
 Erick


 On Jan 25, 2008 9:46 AM, Ravish Bhagdev [EMAIL PROTECTED] wrote:

  Yes, I'm fairly new as well.
 
   So do you mean adding words to the query, effectively doing an OR
   between the synonymous terms?  That sounds like a simple way of doing
   it; if it works, what makes indexing with synonyms useful?
 
  Ravish
 
  On Jan 25, 2008 2:42 PM, Jon Lehto [EMAIL PROTECTED] wrote:
   Hi Ravish,
  
   You may want to think about the synonym dictionary as being a tool on
  the side, rather than each indexed document having a copy of the synonyms.
  At indexing time, one might normalize synonyms to a single value, and at
  query time do the same to get the match.
  
   Alternately, use the synonym dictionary at run-time to expand a
   user's query terms, like a thesaurus.
  
   That said, I'm new to the tool, and not clear on how synonyms are
  implemented.
  
   Jon
   =
   From: Ravish Bhagdev [EMAIL PROTECTED]
   Date: 2008/01/25 Fri AM 08:24:33 CST
   To: solr-user@lucene.apache.org
   Subject: Is it possible to add synonyms run time?
  
  
   As I understood from available documentation, synonyms need to be
   defined before starting the indexing process.  Is it possible to add
   synonyms at run time such that all index fields of all documents get
   updated?  Does it work for newly added documents at least?
  
   Also, how can I make each user of the application define his own set of
   synonyms that others are oblivious to (others get normal results,
   without synonyms considered)?
  
   Thanks,
   Ravish
  
  
 



Re: SOLR X FAST

2007-12-11 Thread Ravish Bhagdev
Stability and better support (at great cost, obviously).

On Dec 11, 2007 10:20 PM, William Silva [EMAIL PROTECTED] wrote:
 Hi,
 Why use FAST and not SOLR, for example?
 What will FAST offer that would justify the investment?
 I would like a matrix comparing both.
 Thanks,
 William.


 On Dec 11, 2007 8:15 PM, Matthew Runo [EMAIL PROTECTED] wrote:

  I think it all depends, what do you want out of Solr or FAST?
 
 Thanks!
 
  Matthew Runo
  Software Developer
  702.943.7833
 
  On Dec 11, 2007, at 2:09 PM, William Silva wrote:
 
   Hi,
   How is the best way to compare SOLR and FAST Search ?
   Thanks,
   William.
 
 



Re: SOLR X FAST

2007-12-11 Thread Ravish Bhagdev
Could you please elaborate on what you mean by "ingestion pipeline" and
"horizontal scalability"?  I apologize if this is a stupid question that
everyone else on the forum knows the answer to.

Thanks,
Ravi

On Dec 12, 2007 1:09 AM, Nuno Leitao [EMAIL PROTECTED] wrote:
 Depends. If you are looking for a small-sized index (gigabytes rather
 than dozens or hundreds of gigabytes, or terabytes) with relatively
 simple requirements (a few facets, simple tokenization, English-only
 linguistics, etc.), Solr is likely to be appropriate for most cases.

 FAST however gives you great horizontal scalability, out of the box
 linguistics for many languages (including CJK), contextual and scope
 searching, a web, file and database crawler, programmable ingestion
 pipeline, etc.

 Regards.

 --Nuno


 On 11 Dec 2007, at 22:09, William Silva wrote:

  Hi,
  How is the best way to compare SOLR and FAST Search ?
  Thanks,
  William.




Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently

2007-11-28 Thread Ravish Bhagdev
Yup, I do remember that happening to me before.

Is this intentional?
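
As a stop-gap, would a catch-all dynamicField in schema.xml at least make
the behaviour explicit?  Something like the "ignored" pattern from the
example schema (untested on my side):

  <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField"/>
  <dynamicField name="*" type="ignored" multiValued="true"/>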

Ravish

On Nov 28, 2007 1:41 PM, Daniel Alheiros [EMAIL PROTECTED] wrote:
 Hi

 I experienced a very unpleasant problem recently, when my search indexing
 adaptor was changed to add some new fields. The problem is my schema didn't
 follow those changes (new fields added), and after that SOLR was silently
 ignoring all documents I sent.

 Neither the SOLR Java client nor the SOLR server returned an error code or
 log message.  On the server side, nothing was logged, and the client
 received a standard success return.

 Why didn't my documents get indexed, with the new fields simply ignored?
 That is what I think it was supposed to do.

 Please let me know your thoughts.

 Regards,
 Daniel


 http://www.bbc.co.uk/
 This e-mail (and any attachments) is confidential and may contain personal 
 views which are not the views of the BBC unless specifically stated.
 If you have received it in error, please delete it from your system.
 Do not use, copy or disclose the information in any way nor act in reliance 
 on it and notify the sender immediately.
 Please note that the BBC monitors e-mails sent or received.
 Further communication will signify your consent to this.




Re: index size

2007-10-11 Thread Ravish Bhagdev
Hi All,

I'm facing a similar problem.  I want to index the entire document as a
field, but I also want to be able to retrieve snippets (like Google/Nutch
return in the results page, below the links).

To achieve this, I have to keep the document field stored, right?  When I
do this, my index becomes huge (10 GB), because although I only have 10K
docs, each is a very lengthy HTML file.  Is there any better solution?
Why is the index created by Nutch so small in comparison (about 27 MB),
yet it still returns snippets?
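
For reference, what I mean is the stored flag on the field definition in
schema.xml, something like:

  <field name="doc" type="text" indexed="true" stored="true"/>   (snippets possible, huge index)
  <field name="doc" type="text" indexed="true" stored="false"/>  (small index, no snippets)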

Ravish

On 10/9/07, Kevin Lewandowski [EMAIL PROTECTED] wrote:
 Late reply on this but I just wanted to say thanks for the
 suggestions. I went through my whole schema and was storing things
 that didn't need to be stored and indexing a lot of things that didn't
 need to be indexed. Just completed a full reindex and it's a much more
 reasonable size now.

 Kevin

 On 8/20/07, Mike Klaas [EMAIL PROTECTED] wrote:
 
  On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote:
 
   Are there any tips on reducing the index size or what factors most
   impact index size?
  
   My index has 2.7 million documents and is 200 gigabytes and growing.
   Most documents are around 2-3kb and there are about 30 indexed fields.
 
  An ls -sh will tell you roughly where the space is being
  occupied.  There is something strange going on: 2.5kB * 2.7m is only
  6GB, and I have trouble imagining where the 30-fold index size
  expansion is coming from.
 
  -Mike
 



Fwd: solr, snippets and stored field in nutch...

2007-10-11 Thread Ravish Bhagdev
Hey guys,

Check out this thread I opened on the Nutch mailing list.  It looks like
Solr could benefit from reusing Nutch's segment-based storage strategy to
return snippets, summaries, etc. efficiently, without using Lucene
stored fields.

Was this considered before?

Ravish

-- Forwarded message --
From: Dennis Kubes [EMAIL PROTECTED]
Date: Oct 11, 2007 11:27 PM
Subject: Re: snippets and stored field in nutch...
To: [EMAIL PROTECTED]


The reason it is stored in the segments instead of the index is to allow
summarizers to be run on the content of hits to produce the summaries
that appear in the search results.  Summarizers are pluggable and the
actual content used to produce the summary can change.  And summaries
can be changed without re-fetching or re-indexing.  If a summary were
stored in the index, re-indexing would have to occur to make changes.

Also the way the search process works, Nutch returns hits (basically
document ids).  These hits are then sorted and deduped and the best x
number (usually 10) returned.  For only these 10 best hits, hit details
(fields in the index) and summaries are retrieved.  So there is
something to be said about the amount of data being pushed over the network.

Dennis Kubes

Ravish Bhagdev wrote:
 Ah, I see, didn't know that, Thanks!

 Interesting that Nutch stores it in a different structure (segments)
 and doesn't reuse Lucene's strategy of storing within the index.  Any
 particular reason why?  Is there any other use of the segments data
 structure except to return snippets?

 Cheers,
 Ravish

 On 10/11/07, John H. Lee [EMAIL PROTECTED] wrote:
 Hi Ravish.

 You are correct that Nutch does not store document content in the
 Lucene index. The content *is* stored in the Nutch segment, which is
 where snippets come from.

 Hope this helps.

 -J


 On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:

 Hey All,

 Am I right in believing that in Lucene/Nutch, to be able to return
 content or a snippet for a search query, the field to be returned has to
 be stored?

 AFAIK, by default, Nutch does not store the document field, am I
 right?  If so, how does it manage to return snippets?  Wouldn't the
 index be quite huge if Nutch were storing the document field by default?

 I will appreciate any help/comments as I'm bit lost with this.

 Ravi



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks all for the help.

Just to make sure I understand correctly, am I right in summarizing
things this way, then?

No significance to using HTML: unlike Nutch, Solr doesn't parse HTML,
so it ignores the anchors, titles, etc., and is not suited to
PageRank-esque indexing.

HTMLAnalyser (by which you probably mean HTMLStripWhitespaceTokenizer?):
its main purpose is to allow users to index HTML code; it will strip the
HTML tags and index the contents, but if used for getting snippets in
results, the em tags may end up in the wrong locations.

To avoid using HTMLAnalyser, strip out the tags yourself and send only
text to Solr for indexing, using one of the normal analysers.
Highlighting should be accurate in this case.
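
In case it is useful to anyone, this is roughly how I plan to strip the
tags myself before indexing, using the JDK's built-in HTML parser (a rough
sketch assuming reasonably well-formed HTML, not production code):

import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlStripper {

    // Returns only the text content of the HTML, with all markup dropped.
    public static String strip(String html) throws Exception {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString().trim();
    }
}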

(A query, especially for Adrian:)

If you are indexing XHTML, do you replace tags with entities before
giving it to Solr?  If so, when you get back snippets, do you get tags
or entities, or do you convert back to tags for presentation?  What's
the best way out?  It would help me a lot if you could briefly explain
your configuration.

Do let me know if my assumptions are wrong!

Cheers,
Ravish

On 10/5/07, Chris Hostetter [EMAIL PROTECTED] wrote:

 : In general, I don't recommend indexing HTML content straight to Solr.  None 
 of
 : the Solr contributors do this so the use case hasn't received a lot of love.

 I second that comment ... the HTML stripping code was never intended to be
 an HTML parser; it was designed to be a workaround for dealing with
 dirty data, where people had unwanted HTML tags in what should be plain
 text.  Indexing it as-is with some analyzers would result in words like
 script, strong, and class matching lots of docs where the words
 never really appear in the text.

 If you have well-formed HTML documents, use an HTML parser to extract the
 real content.



 -Hoss




Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks Adrian.  I'm very new to Solr myself, so I'm struggling a bit in
the initial stages...

One last one: when you send HTML to Solr, do you also replace special
chars and tags with named entities?  I did this, and the HTML stripper
doesn't seem to recognise the tags :-S  Whereas if I try to input HTML
as-is, the indexer throws exceptions (as having raw tags within the XML
is obviously not valid).  How do I do this part?

Ravish

On 10/5/07, Adrian Sutton [EMAIL PROTECTED] wrote:
 On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
  (Query esp. Adrian):
 
  If you are indexing XHTML, do you replace tags with entities before
  giving it to Solr?  If so, when you get back snippets, do you get tags
  or entities, or do you convert back to tags for presentation?  What's
  the best way out?  It would help me a lot if you could briefly explain
  your configuration.

 We happen to develop an HTML editor, so we know 100% for certain that
 the XHTML is valid XML.  Given that, we just throw the raw XHTML at
 Solr, which uses the HTMLStripWhitespaceTokenizer.  However, at this
 stage we haven't configured highlighting at all, so our index is used
 for search and retrieving a document ID.  At some point I'd like to
 add highlighting, and it sounds like the best way to do so would be to
 index the document text instead of the HTML.

 Beyond that, we also use Solr as an optimization for extracting
 information such as what content was most recently changed, which
 pages link to others etc. On the page linking, we actually identify
 what pages are linked to prior to indexing and store them as a
 separate field - Solr itself has no understanding of the linking.

 Oh and I should note, I'm very new to Solr so I'm probably not doing
 things the best way, but I'm getting great results anyway.

 Regards,

 Adrian Sutton
 http://www.symphonious.net




Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks all for the very valuable contributions; I understand these aspects
of Solr much better now.

But...

But a different use-case might be for the highlighting to encompass
the markup rather than just the text, e.g.
   <span class="highlighted"><topic type="location">Paris</topic></span>
which would have to be accomplished some other way.

Yes, exactly.  And I think Nutch handles this somehow, as I remember
using it for indexing HTML and then getting back snippets with the
highlighting accurately placed within the HTML.

Is there potential for code reuse from Nutch?  Maybe this is a topic for
the solr-dev list?  Or has it already been considered?

Bests,
Ravish


Re: Indexing HTML

2007-10-03 Thread Ravish Bhagdev
Hi Erik, All,

I escaped the HTML text into entities before sending it to Solr, and
indexing went fine.  The problem now is that when I get back a snippet
with highlighted text for the html field, it's not well formed, as the
highlighting sometimes doesn't include the entire tag.  For example:

<lst name="0008369D">
  <arr name="document">
    <str>
ound-color: #FF; text-align: left; text-indent: 0px;
<em>line-heigh</em>t: normal ; margin-top: 0px; margin-ri
    </str>
  </arr>
</lst>

<lst name="0008369B">
  <arr name="document">
    <str>
/TR&gt;
&lt;TR align=&quot;left<em>&quot;  va</em>lign=&quot;middle&quot;
style=&quot; height: 28.80px;&q
    </str>
  </arr>
</lst>
</lst>

Because of this, I cannot present the resulting HTML in a web page.  Is
it possible to strip out all HTML tags completely in the result set?
Would you recommend sending stripped-out text to Solr instead?  But
doesn't Solr use HTML features while searching (anchors/titles etc.)?

Why is there no documentation about indexing HTML specifically using
Solr?  How does Nutch do it?  Does it strip out the HTML in the snippets
it returns?

Any help will be appreciated.

Thanks,
Ravi

On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote:

 On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote:
  What's odd about this is that the error seems to indicate that I did.

 Actually the error message looks like you escaped too much.  You
 should _not_ escape <field>, only the contents of it.

 Erik



 
  The full text (minus the stack trace) was
 
  org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG
  or TEXT
  to read text (position: START_TAG seen ...lt;field
  name=linegt;lt;a
  href=foobargt;... @4:37)
 
  Or is that just a byproduct of how SOLR reports the errors back -
  always
  escaping them?
 
  Thanks guys - I'll have another crack at this tonight.
 
 
  On 8/27/07, Erik Hatcher [EMAIL PROTECTED] wrote:
 
  Michael,
 
  I think the issue is that you're not escaping the field values.
  Send something like this to Solr instead:
 
    <field name="line">&lt;a href="foobar"&gt;&lt;b&gt;&lt;i&gt;linktext&lt;/i&gt;&lt;/b&gt;&lt;/a&gt;</field>
 
  Erik
 
 
  On Aug 27, 2007, at 9:29 AM, Michael Kimsal wrote:
 
  Hello
 
  I'm trying to index individual lines of an HTML file, and I'm
  hitting this
  error:
 
  TEXT must be immediately followed by END_TAG and not START_TAG
 
  I've got something that looks like
 
   <add>
   <doc>
   <field name="id">4</field>
   <field name="line"><a href="foobar"><b><i>linktext</i></b></a></field>
   </doc>
   </add>
 
  Actually, that sample code above, as its own data file POSTed to
  SOLR,
  throws
 
   parser must be on START_TAG or TEXT to read text (position:
   START_TAG seen ...<field name="line"><a href="foobar">... @4:37
 
  as an error.
 
  Any clues as to how I can do this?  I'd like to keep the original
  copy of
  each line intact in the index.
 
  Thanks!
 
  --
  Michael Kimsal
  http://webdevradio.com
 
 
 
 
  --
  Michael Kimsal
  http://webdevradio.com




unable to figure out nutch type highlighting in solr....

2007-10-02 Thread Ravish Bhagdev
I have tried very hard to follow the documentation and forum threads that
explain how to return snippets with highlights for the searched terms
using Solr (as Nutch does with such ease).

I will be really grateful if someone can guide me through the basics.  I
have made sure that the field to be highlighted is stored in the index,
etc.  Still I can't figure out why it doesn't return a snippet and
instead returns the whole document.

I have tried all the different highlight parameters with variations, but
have no idea what's happening.  Can I test highlighting using the bundled
admin application's full search interface option?  At the moment it just
returns XML with the full document between the field tags.
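
For reference, this is the kind of request I have been trying (parameter
names as I understand them from the wiki, so treat this as a guess rather
than a known-good example):

  http://localhost:8983/solr/select?q=document:test&hl=true&hl.fl=document&hl.snippets=3&hl.fragsize=100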

Please find attached my conf files as well
<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<config>
  <!-- Set this to 'false' if you want solr to continue working after it has
   encountered a severe configuration error.  In a production environment,
   you may want solr to keep working even if one handler is mis-configured.

   You may also set this to false by setting the system property:
     -Dsolr.abortOnConfigurationError=false
  -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <!-- Used to specify an alternate directory to hold all index data
   other than the default ./data under the Solr home.
   If replication is in use, this should match the replication configuration. -->
  <!--
  <dataDir>./solr/data</dataDir>
  -->

  <indexDefaults>
   <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>100</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>5</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>

    <!-- If true, unlock any held write or commit locks on startup.
     This defeats the locking mechanism that allows multiple
     processes to safely access a lucene index, and should be
     used with care. -->
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <!-- the default high-performance update handler -->
  <updateHandler class="solr.DirectUpdateHandler2">

    <!-- A prefix of "solr." for class names is an alias that
     causes solr to search appropriate packages, including
     org.apache.solr.(search|update|request|core|analysis)
     -->

    <!-- autocommit pending docs if certain criteria are met
    <autoCommit>
      <maxDocs>1</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    -->
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>

    <!-- The RunExecutableListener executes an external command.
     exe - the name of the executable to run
     dir - dir to use as the current working directory. default="."
     wait - the calling thread waits until the executable returns. default="true"
     args - the arguments to pass to the program.  default=nothing
     env - environment variables to set.  default=nothing
      -->
    <!-- A postCommit event is fired after every commit or optimize command
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
      <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
      <arr name="env"> <str>MYVAR=val1</str> </arr>
    </listener>
    -->
    <!-- A postOptimize event is fired only after every optimize command, useful
     in conjunction with index distribution to only distribute optimized indicies
    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>
    -->

  </updateHandler>


  <query>
    <!-- Maximum number of clauses in a boolean query... can affect
    range or prefix queries that expand

Processing solr response....

2007-09-04 Thread Ravish Bhagdev
Hi,

Apologies if this has been asked before but I couldn't find anything
when I searched...

I have been looking at the SolJava examples.  I've been using Nutch/Lucene
before, which returns query results nicely in a class with the url, title
and snippet (summary), while Solr seems to return XML with the score and
other details, and just the url field.

Is there a way to avoid having to deal with XML after each query?  I
want to avoid parsing it; it would be much better if I could get results
directly into a Java data structure like a List or Map through the API.

Also, can anyone point me to an example or documentation that clarifies
the XML returned by Solr, and how to get variations of it, including
specifying exactly which fields I would see in the XML?  Hope I'm
making sense.
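
(For what it's worth, the kind of thing I'm hoping for is roughly the
following -- a sketch based on my reading of the solrj client code, so the
exact class names may be off:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class QueryExample {
    public static void main(String[] args) throws Exception {
        // Point the client at a running Solr instance.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("title:solr");
        query.setRows(10);

        // The client parses the response; no XML handling needed here.
        QueryResponse rsp = server.query(query);
        SolrDocumentList docs = rsp.getResults();  // behaves like a java.util.List
        for (SolrDocument doc : docs) {
            System.out.println(doc.getFieldValue("url"));
        }
    }
}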

Thanks,
Ravi


Indexing longer documents using Solr...memory issue after index grows to about 800 MB...

2007-09-04 Thread Ravish Bhagdev
Hi,

The problem:

- I have about 11K HTML documents to index.
- I'm trying to index these documents (along with 3 more small string
fields) so that when I search within the doc field (the field with the
HTML file content), I can get results with snippets or highlights, as I
do when using Nutch.
- While going through the wiki I noticed that if I need to do highlighting
on a particular field, I have to make sure it is both indexed and stored.

But when I try to do the above, after indexing about 3K files, which
creates an index of about 800 MB (which is fine, as the files are quite
lengthy), it keeps giving out-of-heap-space errors.

Things I've tried without much help:

- Increasing the memory of Tomcat
- Playing around with settings like autoCommit (documents and time)
- Reducing mergeFactor to 5
- Reducing maxBufferedDocs to 100
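
(By "increasing the memory of Tomcat" I mean raising the JVM heap before
startup, e.g. export JAVA_OPTS="-Xmx1024m" before running catalina.sh;
even so, the out-of-memory errors persist.)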

My question is also: if it is required to store fields in the index to be
able to do highlighting / return field content, how do Nutch/Lucene do it
without that?  (The index for the same documents created using Nutch is
much, much smaller.)

Also, when I query the partially added documents with highlighting turned
on (for a particular field), it doesn't seem to have any effect.

As you can see, I'm very confused about how to proceed.  I hope I'm being
clear though :-S

Thanks,
Ravi


Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Ravish Bhagdev
Hello,

Sorry for the stupid question.  I'm trying to index an HTML file as one of
the fields in Solr.  I've set up an appropriate analyzer in the schema,
but I'm not sure how to add the HTML content to Solr.  Encapsulating raw
HTML within the field element is obviously not valid XML.  How do I add
HTML content?  I hope the query is clear.
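
The closest I have got is escaping the markup into entities inside the
field element, something like this -- is that the right approach?

  <field name="doc">&lt;html&gt;&lt;body&gt;Some &lt;b&gt;bold&lt;/b&gt; text&lt;/body&gt;&lt;/html&gt;</field>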

Thanks,
Ravi