date:20091015

2009-10-15 Thread Noble Paul നോബിള്‍ नोब्ळ्

If the JavaScript support enables me to invoke a URL,  it's really OK with 
me.


Cheers,

- Bill

--
From: Avlesh Singh avl...@gmail.com
Sent: Wednesday, October 14, 2009 11:01 PM
To: solr-user@lucene.apache.org
Subject: Re: Adding callback url to data import handler...Is this possible?



But a callback url is a very specific requirement. We plan to extend
javascript support to the EventListener callback.


I would say the latter is more specific than the former.

People who are comfortable writing JAVA wouldn't need any of these but the
second best thing for others would be a capability to handle it in their 
own

applications. A url can be the simplest way to invoke things in respective
application. Doing it via javascript sounds like a round-about way of 
doing

it.

Cheers
Avlesh

2009/10/15 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com


I can understand the concern that you do not wish to write Java code .
But a callback url is a very specific requirement. We plan to extend
javascript support to the EventListener callback . Will it help?

On Wed, Oct 14, 2009 at 11:47 PM, Avlesh Singh avl...@gmail.com wrote:
 Hmmm ... I think this is a valid use case and it might be a good idea 
 to

 support it in someway.
 I will post this thread on the dev-mailing list to seek opinion.

 Cheers
 Avlesh

 On Wed, Oct 14, 2009 at 11:39 PM, William Pierce evalsi...@hotmail.com
wrote:

 Thanks, Avlesh.  Yes, I did take a look at the event listeners.  As I
 mentioned this would require us to write Java code.

 Our app(s) are entirely windows/asp.net/C# so while we could add Java
in a
 pinch,  we'd prefer to stick to using SOLR using its convenient
REST-style
 interfaces which makes no demand on our app environment.

 Thanks again for your suggestion!

 Cheers,

 Bill

 --
 From: Avlesh Singh avl...@gmail.com
 Sent: Wednesday, October 14, 2009 10:59 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Adding callback url to data import handler...Is this
possible?


  Had a look at EventListeners in
 DIH?http://wiki.apache.org/solr/DataImportHandler#EventListeners

 Cheers
 Avlesh

 On Wed, Oct 14, 2009 at 11:21 PM, William Pierce 
evalsi...@hotmail.com
 wrote:

  Folks:

 I am pretty happy with DIH -- it seems to work very well for my
 situation.
   Thanks!!!

 The one issue I see has to do with the fact that I need to keep
polling
 url/dataimport to check if the data import completed 
 successfully.

I
 need to know when/if the import is completed (successfully or
otherwise)
 so
 that I can update appropriate structures in our app.

 What I would like is something like what Google Checkout API 
 offers --

a
 callback URL.  That is, I should be able to pass along a URL to DIH.
Once
 it has completed the import, it can invoke the provided URL.  This
 provides
 a callback mechanism for those of us who don't have the liberty to
change
 SOLR source code.  We can then do the needful upon receiving this
 callback.

 If this functionality is already provided in some form/fashion, I'd
love
 to
 know.

 All in all, great functionality that has significantly helped me 
 out!


 Cheers,

 - Bill







--
-
Noble Paul | Principal Engineer| AOL | http://aol.com

Filtered search for subset of ids

2009-10-15 Thread Andrea D'Ippolito

Hi everybody,
I'm new here..and this is my last chance to find a solution for my problem.

I'm using acts_as_solr for Ruby On Rails.

I need to submit a query to a subset of documents which id belong to an
array of ids that I want to pass as parameter.

for istance, something like:

find_by_solr(query, id:[1,2,3,40,51,56])

or actually I'd just need a way in the option to filter a kind of sql IN
instead of RANGE.

I guess I need to override some methods..but first of all I want to know if
you consider this possibile, and if you have any hints about how to achieve
that.

I'm working on Articles repository, indexing title and content only (but
documents id is sincronized with the document id in the MySql database).

Thanks

(I hope this is not a duplicate..I've send it before to confirm subscription
:S )

Andrea

Re: Boosting of words

2009-10-15 Thread bhaskar chandrasekar

Hi,
 
I am able to see the results when i pass the values in the query browser.
 
When i pass the below query i am able to see the difference in output.
 
http://localhost:8983/solr/select/?q=java^100%20technology^1
 
Each time user cannot pass the values in the query browser to see the output.
 
But where exactly 
 
java^100 technology^1
 
this value should be set.In which file and which location to be precise?.
 
Please help me.
 
Regards
Bhaskar
 

--- On Wed, 10/14/09, AHMET ARSLAN iori...@yahoo.com wrote:


From: AHMET ARSLAN iori...@yahoo.com
Subject: Re: Boosting of words
To: solr-user@lucene.apache.org
Date: Wednesday, October 14, 2009, 6:41 AM



 Hi Clark,
  
 Thanks for your input. I have a query.
  
  
 I have my XML which contains the following:
  
 add
 doc
   field name=urlhttp://www.sun.com/field
   field name=titleinformation/field
   field name=descriptionjava plays a important
 role in computer industry for web users/field
 /doc
 doc
   field name=urlhttp://www.askguru.com/field
   field name=titlehomepage/field
   field name=descriptionInformation about
 technology is stored in the web sites/field
 /doc
 doc
   field name=urlhttp://www.techie.com/field
   field name=titlepost queries/field
   field name=descriptionThis web site have more
 java technology related to web/field
 /doc
 /add
  
 When I give “java technology” as my input in Solr admin
 page ,At present  I get output as 
  
 doc
   field name=urlhttp://www.techie.com/field
   field name=titlepost queries/field
   field name=descriptionThis web site have more
 java technology related to web/field
 /doc
  
 Now I need to get doc which has “technology” also
  
 When I give “java technology “
  
 I need to get output as,I need to give boosting to doc
 which has “technology”. It should display in the below
 order.The output should come as 
  
 doc
   field name=urlhttp://www.techie.com/field
   field name=titlepost queries/field
   field name=descriptionThis web site have more
 java technology related to web/field
 /doc
 doc
   field name=urlhttp://www.askguru.com/field
   field name=titlehomepage/field
   field name=descriptionInformation about
 technology is stored in the web sites/field
 /doc
 doc
   field name=urlhttp://www.sun.com/field
   field name=titleinformation/field
   field name=descriptionjava plays a important
 role in computer industry for web users/field
 /doc
  
 Let me know how to achieve the same?

The query :  java^1 OR technology^100   will do it. Results will be in this 
order:

1-)This web site have more java technology related to web
2-)Information about technology is stored in the web sites
3-)java plays a important role in computer industry for web users

1-) contains both java and technology
2-) contains only technology
3-) contains only java

Is that what you want? 

Note that there is no  quotes in the query above. And you can adjust boost 
factors (1 and 100) according to your needs. Use OR operator between terms. You 
set individual terms boost with ^ operator.

hope this helps.

Re: Adding callback url to data import handler...Is this possible?

It is not yet implemented .You may open an issue for the same

--Noble

On Thu, Oct 15, 2009 at 12:14 PM, William Pierce evalsi...@hotmail.com wrote:
 If the JavaScript support enables me to invoke a URL,  it's really OK with
 me.

 Cheers,

 - Bill

 --
 From: Avlesh Singh avl...@gmail.com
 Sent: Wednesday, October 14, 2009 11:01 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Adding callback url to data import handler...Is this possible?


 But a callback url is a very specific requirement. We plan to extend
 javascript support to the EventListener callback.

 I would say the latter is more specific than the former.

 People who are comfortable writing JAVA wouldn't need any of these but the
 second best thing for others would be a capability to handle it in their
 own
 applications. A url can be the simplest way to invoke things in respective
 application. Doing it via javascript sounds like a round-about way of
 doing
 it.

The eventhandler

 Cheers
 Avlesh

 2009/10/15 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 I can understand the concern that you do not wish to write Java code .
 But a callback url is a very specific requirement. We plan to extend
 javascript support to the EventListener callback . Will it help?

 On Wed, Oct 14, 2009 at 11:47 PM, Avlesh Singh avl...@gmail.com wrote:
  Hmmm ... I think this is a valid use case and it might be a good idea 
  to
  support it in someway.
  I will post this thread on the dev-mailing list to seek opinion.
 
  Cheers
  Avlesh
 
  On Wed, Oct 14, 2009 at 11:39 PM, William Pierce evalsi...@hotmail.com
 wrote:
 
  Thanks, Avlesh.  Yes, I did take a look at the event listeners.  As I
  mentioned this would require us to write Java code.
 
  Our app(s) are entirely windows/asp.net/C# so while we could add Java
 in a
  pinch,  we'd prefer to stick to using SOLR using its convenient
 REST-style
  interfaces which makes no demand on our app environment.
 
  Thanks again for your suggestion!
 
  Cheers,
 
  Bill
 
  --
  From: Avlesh Singh avl...@gmail.com
  Sent: Wednesday, October 14, 2009 10:59 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Adding callback url to data import handler...Is this
 possible?
 
 
   Had a look at EventListeners in
  DIH?http://wiki.apache.org/solr/DataImportHandler#EventListeners
 
  Cheers
  Avlesh
 
  On Wed, Oct 14, 2009 at 11:21 PM, William Pierce 
 evalsi...@hotmail.com
  wrote:
 
   Folks:
 
  I am pretty happy with DIH -- it seems to work very well for my
  situation.
    Thanks!!!
 
  The one issue I see has to do with the fact that I need to keep
 polling
  url/dataimport to check if the data import completed 
  successfully.
 I
  need to know when/if the import is completed (successfully or
 otherwise)
  so
  that I can update appropriate structures in our app.
 
  What I would like is something like what Google Checkout API 
  offers --
 a
  callback URL.  That is, I should be able to pass along a URL to DIH.
 Once
  it has completed the import, it can invoke the provided URL.  This
  provides
  a callback mechanism for those of us who don't have the liberty to
 change
  SOLR source code.  We can then do the needful upon receiving this
  callback.
 
  If this functionality is already provided in some form/fashion, I'd
 love
  to
  know.
 
  All in all, great functionality that has significantly helped me
   out!
 
  Cheers,
 
  - Bill
 
 
 
 



 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com






-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com

browse terms of index

2009-10-15 Thread jfmelian

Hi 

I use a sample embedded Apache Solr to create a Lucene index with few documents 
for tests purpose. 
Documents have text string, sint, sfloat, bool, and date fields, each of them 
are indexed. 
At this time they are also stored but only the ids documents will be stored at 
the end. 

I want to list the terms of index. I don't found a way this solr api so I made 
a try Apache Luke ( Lucene api.) 
Here the code of luke to see terms of index : 

public void terms(String field) throws CorruptIndexException, IOException { 
validateIndexSet(); 
validateOperationPossible(); 
SortedMapString,Integer termMap = new TreeMapString,Integer(); 
IndexReader reader = null; 
try { 
reader = IndexReader.open(indexName); 
TermEnum terms = reader.terms(); // return an enumeration of terms 
while (terms.next()) { 
Term term = terms.term(); 
if ((field.trim().length() == 0) || field.equals(term.field())) { 
termMap.put(term.field() + : + term.text(), 
new Integer((terms.docFreq(; 
} 
} 
int nkeys = 0; 
for (String key : termMap.keySet()) { 
Lucli.message(key + :  + termMap.get(key)); 
nkeys++; 
if (nkeys  Lucli.MAX_TERMS) { 
break; 
} 
} 
} finally { 
closeReader(reader); 
} 
} 

But for sfloat field (is the same for sint) I don't see the value of the term. 
The class Term of Lucene have just 2 fields of type String (name and value) 

Here values returned for the dynamic field f_float of type sfloat : 

f_float:┼?? 
f_float:┼?? 
f_float:┼?l 
f_float:┼?? 
f_float:┼?? 

So, 
have a way to convert term in the good type (int, date, float ) ? 
Or Have a way to see index terms with solr api ? 

Thanks for help 

Jean-François Melian

Re: browse terms of index


Have a look at http://wiki.apache.org/solr/TermsComponent

On Oct 15, 2009, at 5:43 AM, jfmel...@free.fr wrote:


Hi

I use a sample embedded Apache Solr to create a Lucene index with  
few documents for tests purpose.
Documents have text string, sint, sfloat, bool, and date fields,  
each of them are indexed.
At this time they are also stored but only the ids documents will be  
stored at the end.


I want to list the terms of index. I don't found a way this solr api  
so I made a try Apache Luke ( Lucene api.)

Here the code of luke to see terms of index :

public void terms(String field) throws CorruptIndexException,  
IOException {

validateIndexSet();
validateOperationPossible();
SortedMapString,Integer termMap = new TreeMapString,Integer();
IndexReader reader = null;
try {
reader = IndexReader.open(indexName);
TermEnum terms = reader.terms(); // return an enumeration of terms
while (terms.next()) {
Term term = terms.term();
if ((field.trim().length() == 0) || field.equals(term.field())) {
termMap.put(term.field() + : + term.text(),
new Integer((terms.docFreq(;
}
}
int nkeys = 0;
for (String key : termMap.keySet()) {
Lucli.message(key + :  + termMap.get(key));
nkeys++;
if (nkeys  Lucli.MAX_TERMS) {
break;
}
}
} finally {
closeReader(reader);
}
}

But for sfloat field (is the same for sint) I don't see the value of  
the term. The class Term of Lucene have just 2 fields of type String  
(name and value)


Here values returned for the dynamic field f_float of type sfloat :

f_float:┼??
f_float:┼??
f_float:┼?l
f_float:┼??
f_float:┼??

So,
have a way to convert term in the good type (int, date, float ) ?
Or Have a way to see index terms with solr api ?

Thanks for help

Jean-François Melian


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search

Using DIH's special commands....Help needed

Folks:

I see in the DIH wiki that there are special commands which according to the 
wiki 

Special commands can be given to DIH by adding certain variables to the row 
returned by any of the components . 

In my use case,  my db contains rows that are marked PendingDelete.   How do 
I use the $deleteDocByQuery special command to delete these rows using DIH?
In other words,  where/how do I specify this?  

Thanks,

- Bill

Re: 'Down' boosting shorter docs

2009-10-15 Thread Walter Underwood


Another approach is to change the document length normalization formula.

See Similarity.lengthNorm() in Lucene.

wunder

On Oct 15, 2009, at 12:45 AM, Andrea D'Ippolito wrote:


I've read (correct me if I'm wrong)
that a solution to achieve that is overboost all the other fields.
but I guess this works easily only if u have few fields indexed ;)

bye

2009/10/15 Simon Wistow si...@thegestalt.org


Our index has some items in it which basically contain a title and a
single word body.

If the user searches for a word in the title (especially if title  
is of

itself only oen word) then that doc will get scored quite highly,
despite the fact that, in this case, it's not really relevant.

I've tried something like

qf=title^2.0 content^0.5
bf=num_pages

but that disproportionally boosts long documents to the detriment of
relevancy

bf=product(num_pages,0.05)

has no effect but

bf=product(num_pages,0.06)


has a bunch of long documents which don't seem to return any  
highlighted
fields plus the short document with only the query in the title  
which is

progress in that it's almost exactly the opposite of what I want.

Any suggestions? Am I going to need to reindex and add the length in
bytes or characters of the document?

Simon

Limit occurences per page of items with same category

2009-10-15 Thread javier_uru


I was reading about field collapsing but I think is not what I'm looking for.

I have to resolve this problem. After a search, I need to show, for example,
3 items per page which have the same Category.

I will display 10 items per page.

Suppose the search returns 15 items in this order after priority of search
fields (cars and cycles are 4 in first page, so one of each should be moved
to 2nd page):

#id name category
page 1:
3 -- bmw -- car
2 -- honda  cycle
4 -- mercedes - car
14 - yamaha -- boat
13 - ferrari  car
10 - ktm -- cycle
15 - jaguar  car
12 - rolls royce - plane
1 -- aprilia - cycle
6 -- suzuki  cycle

page 2:
7 -- volvo - truck
8 -- scania  truck
5 -- boeing  plane
9 -- yamaha --- jetski
11 - toyota  car

What I want to know if it could be done with solr or some plugin is to limit
the occurences of items per page according to a category for example. So in
the first page, a car (the jaguar 15) and a cycle (the suzuki 6) will be
moved to 2nd page and both trucks will be moved to first.

Wanted result:

page 1:
3 -- bmw -- car
2 -- honda  cycle
4 -- mercedes - car
14 - yamaha -- boat
13 - ferrari  car
10 - ktm -- cycle
12 - rolls royce - plane
1 -- aprilia - cycle
7 -- volvo - truck
8 -- scania  truck

page 2:

15 - jaguar  car
6 -- suzuki  cycle
5 -- boeing  plane
9 -- yamaha --- jetski
11 - toyota  car

Thank you very much!
-- 
View this message in context: 
http://www.nabble.com/Limit-occurences-per-page-of-items-with-same-category-tp25909143p25909143.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Boosting of words

2009-10-15 Thread Michel Bottan

Hi Bhaskar,

The parameter you're looking for is the Boost Query. Remember using Dismax
Query Handler.

http://wiki.apache.org/solr/DisMaxRequestHandler#bq_.28Boost_Query.29

 http://localhost:8983/solr/select/?q=videoqt=dismaxbq=cat:electronics^5.0


Michel

On Thu, Oct 15, 2009 at 6:04 AM, bhaskar chandrasekar
bas_s...@yahoo.co.inwrote:

 Hi,

 I am able to see the results when i pass the values in the query browser.

 When i pass the below query i am able to see the difference in output.

 http://localhost:8983/solr/select/?q=java^100%20technology^1

 Each time user cannot pass the values in the query browser to see the
 output.

 But where exactly

 java^100 technology^1

 this value should be set.In which file and which location to be precise?.

 Please help me.

 Regards
 Bhaskar


 --- On Wed, 10/14/09, AHMET ARSLAN iori...@yahoo.com wrote:


 From: AHMET ARSLAN iori...@yahoo.com
 Subject: Re: Boosting of words
 To: solr-user@lucene.apache.org
 Date: Wednesday, October 14, 2009, 6:41 AM



  Hi Clark,
 
  Thanks for your input. I have a query.
 
 
  I have my XML which contains the following:
 
  add
  doc
field name=urlhttp://www.sun.com/field
field name=titleinformation/field
field name=descriptionjava plays a important
  role in computer industry for web users/field
  /doc
  doc
field name=urlhttp://www.askguru.com/field
field name=titlehomepage/field
field name=descriptionInformation about
  technology is stored in the web sites/field
  /doc
  doc
field name=urlhttp://www.techie.com/field
field name=titlepost queries/field
field name=descriptionThis web site have more
  java technology related to web/field
  /doc
  /add
 
  When I give “java technology” as my input in Solr admin
  page ,At present  I get output as
 
  doc
field name=urlhttp://www.techie.com/field
field name=titlepost queries/field
field name=descriptionThis web site have more
  java technology related to web/field
  /doc
 
  Now I need to get doc which has “technology” also
 
  When I give “java technology “
 
  I need to get output as,I need to give boosting to doc
  which has “technology”. It should display in the below
  order.The output should come as
 
  doc
field name=urlhttp://www.techie.com/field
field name=titlepost queries/field
field name=descriptionThis web site have more
  java technology related to web/field
  /doc
  doc
field name=urlhttp://www.askguru.com/field
field name=titlehomepage/field
field name=descriptionInformation about
  technology is stored in the web sites/field
  /doc
  doc
field name=urlhttp://www.sun.com/field
field name=titleinformation/field
field name=descriptionjava plays a important
  role in computer industry for web users/field
  /doc
 
  Let me know how to achieve the same?

 The query :  java^1 OR technology^100   will do it. Results will be in this
 order:

 1-)This web site have more java technology related to web
 2-)Information about technology is stored in the web sites
 3-)java plays a important role in computer industry for web users

 1-) contains both java and technology
 2-) contains only technology
 3-) contains only java

 Is that what you want?

 Note that there is no  quotes in the query above. And you can adjust
 boost factors (1 and 100) according to your needs. Use OR operator between
 terms. You set individual terms boost with ^ operator.

 hope this helps.

Re: Solr/Lucene keeps eating up memory while idling

On Oct 14, 2009, at 12:26 PM, nonrenewable wrote:

I'm curious why this is occurring and whether i can prevent it. This
is my

scenario:

Locally I have an idle running solr 1.3 service using lucene 2.4.1
which has
an index of ~330K documents containing ~10 fields each(total size
~12GB).

Did I read that right? 330K docs == 12 GB index.

Currently I've turned off all caching, lazy field loading, however i
do have

facet fields set for some request handlers.

exactly is this happening, considering no requests are being serviced?
Shouldn't the memory usage stabilise with a certain set of
information and
only be affected on requests? Additionally there is a full GC every
half
hour, which seems very unreasonable on a machine that isn't actually
being

used as a service.

Can you share the Solr logs and/or your config? Is this happening
around a commit or some warming process? After startup, with no
requests hitting it and no warming/commits/indexing, I don't see why
it would be growing. Do you have custom code?

I really hope there's just a certain setting that i've overlooked,
or a
concept i'm not understanding because otherwise this behaviour seems
very

unreasonable...

Thanks beforehand,
Tony
--
View this message in context:
http://www.nabble.com/Solr-Lucene-keeps-eating-up-memory-while-idling-tp25894357p25894357.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:

http://www.lucidimagination.com/search

Re: Boosting of words

2009-10-15 Thread AHMET ARSLAN

 Hi,
  
 I am able to see the results when i pass the values in the
 query browser.
  
 When i pass the below query i am able to see the difference
 in output.
  
 http://localhost:8983/solr/select/?q=java^100%20technology^1
  
 Each time user cannot pass the values in the query browser
 to see the output.
  
 But where exactly 
  
 java^100 technology^1
  
 this value should be set. In which file and which location
 to be precise?.
  
 Please help me.

Althought I do not understand you, you need to URL encode your parameter values 
before you invoke a HTTP GET.   paramater=urlencode(value,UTF-8) 

Try this url :
/select/?q=java%5E100+OR+technology%5E1version=2.2

Note that space is encoded into +.
Also ^ is encoded into %5E. 

What kind of solr client are you using? How are you accessing to solr? From 
java, php, rubby?

Re: Using DIH's special commands....Help needed

2009-10-15 Thread Shalin Shekhar Mangar

On Thu, Oct 15, 2009 at 6:25 PM, William Pierce evalsi...@hotmail.comwrote:

 Folks:

 I see in the DIH wiki that there are special commands which according to
 the wiki

 Special commands can be given to DIH by adding certain variables to the
 row returned by any of the components . 

 In my use case,  my db contains rows that are marked PendingDelete.   How
 do I use the $deleteDocByQuery special command to delete these rows using
 DIH?In other words,  where/how do I specify this?


The $deleteDocByQuery is for deleting Solr documents by a Solr query and not
DB rows.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Using DIH's special commands....Help needed

Thanks, Shalin.   I am sorry if I phrased it incorrectly.  Yes,  I want to 
know how to delete documents in the solr index using the $deleteDocByQuery 
special command.   I looked in the wiki doc and could not find out how to do 
this


Sorry if this is self-evident...

Cheers,

- Bill

--
From: Shalin Shekhar Mangar shalinman...@gmail.com
Sent: Thursday, October 15, 2009 10:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Using DIH's special commandsHelp needed

On Thu, Oct 15, 2009 at 6:25 PM, William Pierce 
evalsi...@hotmail.comwrote:



Folks:

I see in the DIH wiki that there are special commands which according to
the wiki

Special commands can be given to DIH by adding certain variables to the
row returned by any of the components . 

In my use case,  my db contains rows that are marked PendingDelete. 
How

do I use the $deleteDocByQuery special command to delete these rows using
DIH?In other words,  where/how do I specify this?


The $deleteDocByQuery is for deleting Solr documents by a Solr query and 
not

DB rows.

--
Regards,
Shalin Shekhar Mangar.

Re: Solr/Lucene keeps eating up memory while idling

2009-10-15 Thread nonrenewable


Did I read that right?  330K docs == 12 GB index.

Ops, missed the dot - 1.2GB, but i don't think that should really make the
difference in this case. Even if it was 12 GB it would just have some really
juicy documents, right? :)

Can you share the Solr logs and/or your config?  Is this happening  
around a commit or some warming process?  After startup, with no  
requests hitting it and no warming/commits/indexing, I don't see why  
it would be growing.  Do you have custom code?

There is custom code around the solrj API however it does not explain this
behaviour because of the lack of requests coming through it. There are no
indexing, commits or queries sent to the server after it's started up,
except for the initial 2 warming queries (can those be to blame for this
even with no caches present??). Here are these in the log (it's on it's
default verbosity so i'll refrain from posting the whole start up until
necessary) After the initial start up, what you see in the log is GC every
2.5 min and Full GC every 30min. No actual activity is present.

Oct 15, 2009 1:13:36 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={start=0q=fast_warmrows=10} hits=0
status=0 QTime=16853 
Oct 15, 2009 1:13:36 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
Oct 15, 2009 1:13:36 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null
params={q=static+firstSearcher+warming+query+from+solrconfig.xml} hits=0
status=0 QTime=204 
Oct 15, 2009 1:13:36 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done


here is the config on it: 

config
 
abortOnConfigurationError${solr.abortOnConfigurationError:true}/abortOnConfigurationError
  dataDir/r9/flare1.data/solr/data/dataDir
  indexDefaults
useCompoundFilefalse/useCompoundFile
mergeFactor10/mergeFactor
ramBufferSizeMB32/ramBufferSizeMB
maxMergeDocs2147483647/maxMergeDocs
maxFieldLength1/maxFieldLength
writeLockTimeout1000/writeLockTimeout
commitLockTimeout1/commitLockTimeout
lockTypesingle/lockType
  /indexDefaults

  mainIndex
useCompoundFilefalse/useCompoundFile
ramBufferSizeMB32/ramBufferSizeMB
mergeFactor10/mergeFactor
maxMergeDocs2147483647/maxMergeDocs
maxFieldLength1/maxFieldLength
unlockOnStartupfalse/unlockOnStartup
  /mainIndex
  jmx /

  updateHandler class=solr.DirectUpdateHandler2
  /updateHandler


  query
maxBooleanClauses1024/maxBooleanClauses
queryResultWindowSize50/queryResultWindowSize
queryResultMaxDocsCached200/queryResultMaxDocsCached
HashDocSet maxSize=3000 loadFactor=0.75/
listener event=newSearcher class=solr.QuerySenderListener
  arr name=queries
lst str name=qsolr/str str name=start0/str str
name=rows10/str /lst
lst str name=qrocks/str str name=start0/str str
name=rows10/str /lst
lststr name=qstatic newSearcher warming query from
solrconfig.xml/str/lst
  /arr
/listener
listener event=firstSearcher class=solr.QuerySenderListener
  arr name=queries
lst str name=qfast_warm/str str name=start0/str str
name=rows10/str /lst
lststr name=qstatic firstSearcher warming query from
solrconfig.xml/str/lst
  /arr
/listener
useColdSearcherfalse/useColdSearcher
maxWarmingSearchers2/maxWarmingSearchers
  /query

  requestDispatcher handleSelect=true 
requestParsers enableRemoteStreaming=false
multipartUploadLimitInKB=2048 /
  /requestDispatcher
  
  requestHandler name=standard class=solr.SearchHandler default=true
 lst name=defaults
   str name=echoParamsexplicit/str
 /lst
  /requestHandler

  requestHandler name=dismax class=solr.SearchHandler 
lst name=defaults
 str name=defTypedismax/str
 str name=echoParamsexplicit/str
 float name=tie0.01/float
 str name=qf
text^0.5 address_t^2.0 name^1.5 brand^1.1 airport_name_t^1.0
 /str
 str name=pf
text^0.2 address_t^1.1 name^1.5 brand^1.4 brand_exact^1.9
airport_name_t^1.0
 /str
 str name=fl
id,name,price,score
 /str
 int name=ps100/int
 str name=q.alt*:*/str
 str name=hl.fltext features name/str
 str name=f.name.hl.fragsize0/str
 str name=f.name.hl.alternateFieldname/str
 str name=f.text.hl.fragmenterregex/str !-- defined below --
 str name=spellchecktrue/str 
 str name=spellcheck.extendedResultstrue/str
 str name=spellcheck.collatetrue/str
 str name=spellcheck.count5/str
/lst
 arr name=last-components
  strspellcheck/str
/arr
  /requestHandler
  requestHandler name=partitioned class=solr.SearchHandler 
lst name=defaults
 str name=defTypedismax/str
 str name=echoParamsexplicit/str
 str name=qftext^0.5 features^1.0 name^1.2 id^10.0/str
 str name=mm2lt;-1 5lt;-2 6lt;90%/str
 str name=bqincubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2/str
/lst
lst name=appends
  str name=fqinStock:true/str

Re: Using DIH's special commands....Help needed

2009-10-15 Thread Shalin Shekhar Mangar

On Thu, Oct 15, 2009 at 10:42 PM, William Pierce evalsi...@hotmail.comwrote:

 Thanks, Shalin.   I am sorry if I phrased it incorrectly.  Yes,  I want to
 know how to delete documents in the solr index using the $deleteDocByQuery
 special command.   I looked in the wiki doc and could not find out how to do
 this


Sorry, I misunderstood your intent. These special flag variables can be
emitted by Transformers. So what you can do is write a Transformer which
checks if the current row contains PendingDelete in the column and add a
key/value pair to the Map. The key should be $deleteDocByQuery and value
should be the Solr query to be used for deletion. You can write the
transformer in Java as well as Javascript.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Using DIH's special commands....Help needed

Thanks for your help.  Here is my DIH config fileI'd appreciate any 
help/pointers you may give me.  No matter what I do the documents are not 
getting deleted from the index.  My db has rows whose 'IndexingStatus' field 
has values of either 1 (which means add it to solr), or 4 (which means 
delete the document with the primary key from SOLR index).  I have two 
transformers running.  Not sure what I am doing wrong.


dataConfig
 script![CDATA[
   function DeleteRow(row){
   var jis = row.get('IndexingStatus');
   var jid = row.get('Id');
   if ( jis == 4 ) {
row.put('$deleteDocById', jid);
}
   return row;
   }
   ]]/script

 dataSource type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
 url=jdbc:mysql://localhost/db
 user=**
 password=***/
 document
   entity name=post transformer=script:DeleteRow, RegexTransformer
   query= select  Id, a, b, c, IndexingStatus from  prod_table 
where (IndexingStatus = 1 or IndexingStatus = 4) 

field column=ptype splitBy=, sourceColName=a /
field column=wauth splitBy=,  sourceColName=b /
field column=miles splitBy=,  sourceColName=c /
   /entity
 /document
/dataConfig


Thanks,

- Bill

--
From: Shalin Shekhar Mangar shalinman...@gmail.com
Sent: Thursday, October 15, 2009 11:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Using DIH's special commandsHelp needed

On Thu, Oct 15, 2009 at 10:42 PM, William Pierce 
evalsi...@hotmail.comwrote:


Thanks, Shalin.   I am sorry if I phrased it incorrectly.  Yes,  I want 
to
know how to delete documents in the solr index using the 
$deleteDocByQuery
special command.   I looked in the wiki doc and could not find out how to 
do

this



Sorry, I misunderstood your intent. These special flag variables can be
emitted by Transformers. So what you can do is write a Transformer which
checks if the current row contains PendingDelete in the column and add a
key/value pair to the Map. The key should be $deleteDocByQuery and value
should be the Solr query to be used for deletion. You can write the
transformer in Java as well as Javascript.

--
Regards,
Shalin Shekhar Mangar.

Re: Conditional copyField

Nice find, Amhet,  I'd love to see this formalized in the Solr schema  
syntax, as it is something I've often wanted to.  Max Chars is OK,  
too, but would like to see max tokens as well.


On Oct 12, 2009, at 6:31 PM, AHMET ARSLAN wrote:


Hi,
I am pushing data to solr from two different sources nutch
and a cms. I have a data clash in that in nutch a copyField
is required to push the url field to the id field as it is
used as  the primary lookup in the nutch solr
intergration update. The other cms also uses the url field
but also populates the id field with a different value. Now
I can't really change either source definition so is there a
way in solrconfig or schema to check if id is empty and only
copy if true or is there a better way via the
updateprocessor?


copyField declaration has three attributes: source, dest and maxChars.
Therefore it can be concluded that there is no way to do it in  
schema.xml


Luckily, Wiki [1] has a quick example that implements a conditional  
copyField.


[1] http://wiki.apache.org/solr/UpdateRequestProcessor





--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search

Re: Solr/Lucene keeps eating up memory while idling

Please send a log covering at least the 2.5 minutes you discuss, but  
upwards of 5 minutes would be good.


On Oct 15, 2009, at 1:26 PM, nonrenewable wrote:




Did I read that right?  330K docs == 12 GB index.


Ops, missed the dot - 1.2GB, but i don't think that should really  
make the
difference in this case. Even if it was 12 GB it would just have  
some really

juicy documents, right? :)


Can you share the Solr logs and/or your config?  Is this happening
around a commit or some warming process?  After startup, with no
requests hitting it and no warming/commits/indexing, I don't see why
it would be growing.  Do you have custom code?


There is custom code around the solrj API however it does not  
explain this
behaviour because of the lack of requests coming through it. There  
are no

indexing, commits or queries sent to the server after it's started up,
except for the initial 2 warming queries (can those be to blame for  
this
even with no caches present??). Here are these in the log (it's on  
it's
default verbosity so i'll refrain from posting the whole start up  
until
necessary) After the initial start up, what you see in the log is GC  
every

2.5 min and Full GC every 30min. No actual activity is present.

Oct 15, 2009 1:13:36 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={start=0q=fast_warmrows=10}  
hits=0

status=0 QTime=16853
Oct 15, 2009 1:13:36 PM org.apache.solr.core.QuerySenderListener  
newSearcher

INFO: QuerySenderListener done.
Oct 15, 2009 1:13:36 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null
params={q=static+firstSearcher+warming+query+from+solrconfig.xml}  
hits=0

status=0 QTime=204
Oct 15, 2009 1:13:36 PM org.apache.solr.core.QuerySenderListener  
newSearcher

INFO: QuerySenderListener done


here is the config on it:

config

abortOnConfigurationError${solr.abortOnConfigurationError:true}/ 
abortOnConfigurationError

 dataDir/r9/flare1.data/solr/data/dataDir
 indexDefaults
   useCompoundFilefalse/useCompoundFile
   mergeFactor10/mergeFactor
   ramBufferSizeMB32/ramBufferSizeMB
   maxMergeDocs2147483647/maxMergeDocs
   maxFieldLength1/maxFieldLength
   writeLockTimeout1000/writeLockTimeout
   commitLockTimeout1/commitLockTimeout
   lockTypesingle/lockType
 /indexDefaults

 mainIndex
   useCompoundFilefalse/useCompoundFile
   ramBufferSizeMB32/ramBufferSizeMB
   mergeFactor10/mergeFactor
   maxMergeDocs2147483647/maxMergeDocs
   maxFieldLength1/maxFieldLength
   unlockOnStartupfalse/unlockOnStartup
 /mainIndex
 jmx /

 updateHandler class=solr.DirectUpdateHandler2
 /updateHandler


 query
   maxBooleanClauses1024/maxBooleanClauses
   queryResultWindowSize50/queryResultWindowSize
   queryResultMaxDocsCached200/queryResultMaxDocsCached
   HashDocSet maxSize=3000 loadFactor=0.75/
   listener event=newSearcher class=solr.QuerySenderListener
 arr name=queries
   lst str name=qsolr/str str name=start0/str str
name=rows10/str /lst
   lst str name=qrocks/str str name=start0/str str
name=rows10/str /lst
   lststr name=qstatic newSearcher warming query from
solrconfig.xml/str/lst
 /arr
   /listener
   listener event=firstSearcher class=solr.QuerySenderListener
 arr name=queries
   lst str name=qfast_warm/str str name=start0/str  
str

name=rows10/str /lst
   lststr name=qstatic firstSearcher warming query from
solrconfig.xml/str/lst
 /arr
   /listener
   useColdSearcherfalse/useColdSearcher
   maxWarmingSearchers2/maxWarmingSearchers
 /query

 requestDispatcher handleSelect=true 
   requestParsers enableRemoteStreaming=false
multipartUploadLimitInKB=2048 /
 /requestDispatcher

 requestHandler name=standard class=solr.SearchHandler  
default=true

lst name=defaults
  str name=echoParamsexplicit/str
/lst
 /requestHandler

 requestHandler name=dismax class=solr.SearchHandler 
   lst name=defaults
str name=defTypedismax/str
str name=echoParamsexplicit/str
float name=tie0.01/float
str name=qf
   text^0.5 address_t^2.0 name^1.5 brand^1.1 airport_name_t^1.0
/str
str name=pf
   text^0.2 address_t^1.1 name^1.5 brand^1.4 brand_exact^1.9
airport_name_t^1.0
/str
str name=fl
   id,name,price,score
/str
int name=ps100/int
str name=q.alt*:*/str
str name=hl.fltext features name/str
str name=f.name.hl.fragsize0/str
str name=f.name.hl.alternateFieldname/str
str name=f.text.hl.fragmenterregex/str !-- defined below  
--

str name=spellchecktrue/str
str name=spellcheck.extendedResultstrue/str
str name=spellcheck.collatetrue/str
str name=spellcheck.count5/str
   /lst
arr name=last-components
 strspellcheck/str
   /arr
 /requestHandler
 requestHandler name=partitioned class=solr.SearchHandler 
   lst name=defaults
str name=defTypedismax/str
str name=echoParamsexplicit/str
str name=qftext^0.5 features^1.0 name^1.2 id^10.0/str
str name=mm2lt;-1 5lt;-2 6lt;90%/str
str

Re: Using DIH's special commands....Help needed

2009-10-15 Thread Shalin Shekhar Mangar

On Fri, Oct 16, 2009 at 12:46 AM, William Pierce evalsi...@hotmail.comwrote:

 Thanks for your help.  Here is my DIH config fileI'd appreciate any
 help/pointers you may give me.  No matter what I do the documents are not
 getting deleted from the index.  My db has rows whose 'IndexingStatus' field
 has values of either 1 (which means add it to solr), or 4 (which means
 delete the document with the primary key from SOLR index).  I have two
 transformers running.  Not sure what I am doing wrong.

 dataConfig
  script![CDATA[
   function DeleteRow(row){
   var jis = row.get('IndexingStatus');
   var jid = row.get('Id');
   if ( jis == 4 ) {
row.put('$deleteDocById', jid);
}
   return row;
   }
   ]]/script

  dataSource type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
 url=jdbc:mysql://localhost/db
 user=**
 password=***/
  document
   entity name=post transformer=script:DeleteRow, RegexTransformer
   query= select  Id, a, b, c, IndexingStatus from  prod_table
 where (IndexingStatus = 1 or IndexingStatus = 4) 
field column=ptype splitBy=, sourceColName=a /
field column=wauth splitBy=,  sourceColName=b /
field column=miles splitBy=,  sourceColName=c /
   /entity
  /document
 /dataConfig


One thing I'd try is to use '4' for comparison rather than the number 4 (the
type would depend on the sql type). Also, for javascript transformers to
work, you must use JDK 6 which has javascript support. Rest looks fine to
me.

-- 
Regards,
Shalin Shekhar Mangar.

Re: (Solr 1.4 dev) Why solr.common.* packages are in solrj-*.jar ?


: BTW, is there some sort of transition guide for Solr 1.4?
: I see there are changes how classes are divided into JARs
: like above, and there are some incompatible API changes.
: It'll be greate if such information can be part of CHANGES.txt.

CHANGES.txt contains an Upgrading from Solr 1.3 section ... if there are 
incompatible API changes for plugins they *should* be identified there, if 
you know of something that isn't listed pelase let us know (the 
specifics).


-Hoss

Re: Facet query help

: the original pastie(http://pastie.org/650932). I tried the fq query body with
: quotes and without quotes.

the entire fq param shouldn't be in quotes ... just the value that you 
want to query on (since it's a string field and you want the whole field 
treated as a single string...

fq = Memory_s:1 GB

fq=Memory_s:%221+GB%22


-Hoss

Re: Using DIH's special commands....Help needed

2009-10-15 Thread Fergus McMenemie

Hi,

For example, my data-import.conf has the following. It allows me
to specify a parameter single=pathname on the url used to
invoke DIH. It allows a doc to be deleted from the index by,
in my case its pathname, which is stored in the field fileAbsolutePath.


  document
 !-- ### --
 entity name=single-delete
 dataSource=null
 processor=XPathEntityProcessor
 url=${dataimporter.request.single}
 rootEntity=true
 flatten=true
 stream=false
 forEach=/record
 transformer=TemplateTransformer

  field column=fileAbsolutePath
template=${dataimporter.request.single} / 
  field column=$deleteDocByQuery   
template=fileAbsolutePath:${dataimporter.functions.escapeQueryChars(dataimporter.request.single)}
 /   
  field column=solluckey   
template=${dataimporter.request.single} / 
  /entity
  /document

I feel sure this can be optimised!

Fergus.

On Thu, Oct 15, 2009 at 6:25 PM, William Pierce evalsi...@hotmail.comwrote:

 Folks:

 I see in the DIH wiki that there are special commands which according to
 the wiki

 Special commands can be given to DIH by adding certain variables to the
 row returned by any of the components . 

 In my use case,  my db contains rows that are marked PendingDelete.   How
 do I use the $deleteDocByQuery special command to delete these rows using
 DIH?In other words,  where/how do I specify this?


The $deleteDocByQuery is for deleting Solr documents by a Solr query and not
DB rows.

-- 
Regards,
Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===

Re: Solr/Lucene keeps eating up memory while idling

2009-10-15 Thread nonrenewable


Here is exactly half an hour from roughly the beginning of logging. There's
nothing to see really because no requests are sent, you just see the GC
behaviour:
[Full GC 211987K-208493K(432448K), 0.6273480 secs]
[GC 276333K-212269K(438720K), 0.0929710 secs]
[GC 289133K-216269K(439936K), 0.1019780 secs]
[GC 293133K-220205K(436672K), 0.1128410 secs]
[GC 304301K-224429K(441472K), 0.1358250 secs]
[GC 308525K-228685K(431744K), 0.1559950 secs]
[GC 317197K-233069K(437312K), 0.1642160 secs]
[GC 321581K-237613K(432832K), 0.1772830 secs]
[GC 329197K-242093K(435136K), 0.1896270 secs]
[GC 333677K-246701K(436352K), 0.2039880 secs]
[GC 274165K-247917K(437760K), 0.2022640 secs]
[Full GC 247917K-208726K(437760K), 0.7195200 secs]

The heap is set to 1400m so it'll take it awhile to hit the roof. I also
haven't tested to see if it stabilises but i'll leave it running now and see
what happens to it overnight. I assume that when(if) it reaches the heap
limit i'll just do full GCs more often. 


Grant Ingersoll-6 wrote:
 
 Please send a log covering at least the 2.5 minutes you discuss, but  
 upwards of 5 minutes would be good.
 

-- 
View this message in context: 
http://www.nabble.com/Solr-Lucene-keeps-eating-up-memory-while-idling-tp25894357p25916348.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr/Lucene keeps eating up memory while idling

2009-10-15 Thread Yonik Seeley

I just did some allocation profiling on the stock Solr example... it's
not completely idle when no requests are being made.

There's only one thing allocating memory: org.mortbay.util.Scanner.scanFiles()
That must be Jetty looking to see if any of the files under webapps has changed.

It's really nothing to worry about - there's no memory leaks, and the
activity is extremely minimal, but if you want to shut it off, it
would be a Jetty config option somewhere.

-Yonik
http://www.lucidimagination.com

On Wed, Oct 14, 2009 at 12:26 PM, nonrenewable nonrenewa...@gmail.com wrote:

I'm curious why this is occurring and whether i can prevent it. This is my
scenario:

Locally I have an idle running solr 1.3 service using lucene 2.4.1 which has
an index of ~330K documents containing ~10 fields each(total size ~12GB).
Currently I've turned off all caching, lazy field loading, however i do have
facet fields set for some request handlers.

What i'm seeing is heap space usage increasing by ~1.2MB per 2 sec (by
java.lang.String objects). I'm assuming they're being used by lucene but i
may be wrong about that, since i have no actual data to confirm it. Why
exactly is this happening, considering no requests are being serviced?
Shouldn't the memory usage stabilise with a certain set of information and
only be affected on requests? Additionally there is a full GC every half
hour, which seems very unreasonable on a machine that isn't actually being
used as a service.

I really hope there's just a certain setting that i've overlooked, or a
concept i'm not understanding because otherwise this behaviour seems very
unreasonable...

Re: Using mincount with date facet in Solr 1.4


:   But I was getting facets even with count 0. So I tried following
: combinations of mincount parameters, as none was specified in the
: wikihttp://wiki.apache.org/solr/SimpleFacetParameters,
: for date faceting.

mincount is not a date faceting option -- it only applies to field value 
faceting (ie: facet.field=foo) ... that's why it's in the Field Value 
Faceting Parameters section of the wiki page you listed, and not in the 
Date Faceting Parameters section.



-Hoss

RE: Right place to put my Tokenizer jars


: Actually, I meant to say I have my Tokenizer jars in solr/lib.
: I have the jars that my Tokenizer jars depend in lib/ext,
: as I wanted them to be loaded only once per container
: due to their internal description.  Bad idea?

unless there is something *really* hinky about those dependencies, i 
wouldn't worry about it -- just put them in solr/lib as well (or sharedLib 
if you use a solr.xml file)

:  This error can be fixed by putting another set of SLF4J jars 
:  in example/lib/ext, but I don't understand why.

In generaly what you are seeing is the security model of classloaders ... 
classes in Solr can access classes in the container's classloader, but 
classes in the container's loader can't see classes in solr.  Even if 
they are the same classes, they are differnet *instances* of those 
classes -- the Class obejcts themselves are distinct.

If you're really concerned about minimizing duplication and having a 
really tiny footprint, reconstructing the solr war to contain all of your 
classes 9and removing anything you *don't* need) is your best bet ... but 
for 99% of the worl hat's going to be major overkill.


-Hoss

Re: Customizing solr search: SpanQueries (revisited)