Distributed field collapsing
Hi, is there any patch available for distributed field collapsing? I need it in my app. If you have any ideas, please add them. Regards, V. Sriram
RE: How to search for special chars like ä from ae?
Hi Steve, thanks for the reply. I did not understand which file I need to rename. I'm working on Solr 1.4. The file in the examples/solr/conf directory is mapping-ISOLatin1Accent.txt. The schema.xml has the following commented-out entry: <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>. Do I need to replace mapping-ISOLatin1Accent.txt with mapping-FoldToASCII.txt (http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/mapping-FoldToASCII.txt) and change the charFilter mapping to <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>?
does copyField recurse?
Hello list, if I have a field title which is copied to text, and a field text that is copied to text.stemmed, am I going to get a copy from the field title into the field text.stemmed, or should I add that copy explicitly? Thanks in advance, Paul
Re: does copyField recurse?
Field values are copied before being analyzed. There is no cascading of analyzers.
Re: does copyField recurse?
And no cascading of copying either (as I verified by experiment). I just enriched the wiki at http://wiki.apache.org/solr/SchemaXml#Copy_Fields accordingly, with proof. Paul. On 8 Feb 2011 at 11:16, Markus Jelsma wrote: Field values are copied before being analyzed. There is no cascading of analyzers. [...]
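For reference, a minimal schema.xml sketch of the rules in this thread (field names taken from Paul's question); copyField copies the original source value before analysis and does not chain, so an explicit third rule is needed to get title's content into text.stemmed:

    <copyField source="title" dest="text"/>
    <copyField source="text" dest="text.stemmed"/>
    <!-- copies do not cascade: the two rules above do not imply title -> text.stemmed -->
    <copyField source="title" dest="text.stemmed"/>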
Re: Solr n00b question: writing a custom QueryComponent
I'm still not quite clear what you are attempting to achieve, and more so why you need to extend Solr rather than just wrap it. You have data with title, description and content fields. You make no mention of an ID field. Surely, if you want to store some in MySQL and some in Solr, you could make your Solr client code enhance the data it gets back after querying Solr with data extracted from MySQL. What is the issue here? Upayavira

On Mon, 07 Feb 2011 23:17 -0800, Ishwar ishwarsridha...@yahoo.com wrote:

Hi all, I've been a Solr user for a while now, and now I need to add some functionality to Solr, for which I'm trying to write a custom QueryComponent. I couldn't get much help from web search, so I'm turning to solr-user for help. I'm implementing search functionality for (micro)blog aggregation. We use Solr 1.4.1. In the current Solr config, the title and content fields are both indexed and stored in Solr. Storing takes up a lot of space, even with compression. I'd like to store the title and description fields in MySQL and retrieve these fields for results from MySQL with an id lookup. Using the DataImportHandler won't work because we store just the title and content fields in MySQL; the rest of the fields are in Solr itself.

I wrote a custom component by extending QueryComponent and overriding only the finishStage(ResponseBuilder) function, where I try to retrieve the necessary records from MySQL. This is how the new QueryComponent is specified in solrconfig.xml:

    <searchComponent name="query" class="org.apache.solr.handler.component.TestSolr"/>

I see that the component is getting loaded from the Solr debug output:

    <lst name="prepare">
      <double name="time">1.0</double>
      <lst name="org.apache.solr.handler.component.TestSolr">
        <double name="time">0.0</double>
      </lst>
      ...

But the strange thing is that the finishStage() function is not being called before returning results. What am I missing? Secondly, members like ResponseBuilder._responseDocs are visible only in the package org.apache.solr.handler.component. How do I access the results in my package? If you folks can give me links to a wiki or some sample custom QueryComponent, that'll be great. -- Thanks in advance, Ishwar. Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/

--- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: q.alt=*:* for every request?
I'm not sure what you mean, but you may be looking for debugQuery=true? On Tuesday 08 February 2011 08:28:12 Paul Libbrecht wrote: To be able to see this well, it would be lovely to have a switch that would activate logging of the query-expansion result. The dismax QParserPlugin is particularly powerful there, so it'd be nice to see what's happening. Any logging category I need to activate? Paul. On 8 Feb 2011 at 03:22, Markus Jelsma wrote: There is no measurable performance penalty when setting the parameter, except maybe the execution of the query with a high value for rows. To make things easy, you can define q.alt=*:* as a default in your request handler. No need to specify it in the URL. Hi, I use the dismax handler with Solr 1.4. Sometimes my request comes with q and fq, and sometimes it doesn't come with q (only fq and q.alt=*:*). Is it OK to send q.alt=*:* for every request? Does it have side effects on performance? -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
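As a hedged illustration of Markus's suggestion (handler name assumed; adjust to your own solrconfig.xml), the default can be declared once on the request handler so clients never have to send it:

    <requestHandler name="/search" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="q.alt">*:*</str> <!-- used whenever the client sends no q -->
      </lst>
    </requestHandler>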
Re: Solr n00b question: writing a custom QueryComponent
Hi Upayavira, apologies for the lack of clarity in the mail. The feeds have the following fields: id, url, title, content, refererurl, createdDate, author, etc. We need search functionality on title and content. As mentioned earlier, storing title and content in Solr takes up a lot of space. So we index title and content in Solr, and we wish to store title and content in MySQL, which has the fields id, title, content. I'm also looking at a Solr client (SolrJ) to query MySQL based on what Solr returns. But that means another component which needs to be maintained. I was wondering if it's a good idea to implement the functionality in Solr itself. -- Thanks, Ishwar. Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Sent: Tuesday, February 8, 2011 4:36 PM Subject: Re: Solr n00b question: writing a custom QueryComponent [...]
Re: Http Connection is hanging while deleteByQuery
Hi, at last the migration to Solr 1.4.1 did solve this issue :-). Cheers
Re: Solr n00b question: writing a custom QueryComponent
The conventional way to do it would be to index your title and content fields in Solr, along with the ID to identify the document. You could do a search against Solr and just return an ID field; your client code would then match that up with the title/content data from your database. And yes, SolrJ would be the obvious route to take here for your client application.

Yes, it does mean another component that needs to be maintained, but by using Solr's external interface you will be protected from changes to internals that could break your custom components, and you will likely be more able to take advantage of other Solr features that are also available via the standard interfaces.

My next question is: are you going to be using the data you're storing in MySQL for something other than just enhancing search results? If not, it may still make sense to store the data in Solr. It would mean you just have one index to manage, rather than an index and a database - after all, the words *have* to take up disk space somewhere :-). If you end up with so many documents indexed that performance grinds (over 10 million?) you can split your index across multiple shards. Once you get search results back from Solr, you would do a query against your database to return the additional fields.

Upayavira

On Tue, 08 Feb 2011 03:38 -0800, Ishwar ishwarsridha...@yahoo.com wrote: [...]

--- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
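A rough sketch of that wrapping approach with SolrJ 1.4 (the JDBC URL, table and column names are hypothetical): ask Solr for matching ids only, then hydrate title/content from MySQL with a single IN(...) lookup.

    import java.sql.*;
    import java.util.*;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchAndHydrate {
      public static void main(String[] args) throws Exception {
        // 1. Ask Solr for matching ids only (fl=id keeps the response small).
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("title:apple OR content:apple");
        q.setFields("id");
        q.setRows(10);
        QueryResponse rsp = solr.query(q);
        List<String> ids = new ArrayList<String>();
        for (SolrDocument doc : rsp.getResults()) {
          ids.add((String) doc.getFieldValue("id"));
        }
        if (ids.isEmpty()) return;

        // 2. Fetch title/content for those ids from MySQL in one query.
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < ids.size(); i++) in.append(i == 0 ? "?" : ",?");
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/feeds", "user", "pass"); // hypothetical
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, title, content FROM feed_docs WHERE id IN (" + in + ")");
        for (int i = 0; i < ids.size(); i++) ps.setString(i + 1, ids.get(i));
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
          System.out.println(rs.getString("id") + ": " + rs.getString("title"));
        }
        conn.close();
      }
    }

One caveat: rows come back from MySQL in arbitrary order, so re-order them by the id list if Solr's relevance ordering matters to the application.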
Re: Search for FirstName with first Char uppercase followed by * not giving result; getting result with all lowercase and *
What you are missing is that the analysis page shows what happens when the text is run through analysis. Wildcards ARE NOT ANALYZED, so you cannot assume that the analysis page shows you what the search terms will be in that case. Regardless of whether george* is shown in the analysis page, the term searched will be George*, capitalized, and not found. Pre-processing your wildcards to lowercase them all is the easiest solution, as Ahmet said. Best, Erick

On Tue, Feb 8, 2011 at 8:04 AM, Mark Fletcher mark.fletcher2...@gmail.com wrote: Hi Savvas, thank you for the reply. In the analysis screen (screenshots attached), *George* is finally stored as *george*. Also, the keyword which I use for search later, namely *George**, is finally analyzed as *george* and *george**. Both are depicted in the screenshots (index as well as query analyzers). If one of the index terms is finally *george* and one of the query terms is also finally *george*, why is it that a match is not found? I am not sending the mail to the group as I am not sure whether I am missing something basic which I am supposed to know here. I believe both the index and query analyzers have the same set of tokenizers and filters (please refer to the analysis attachment). Thanks for your time. BR, Mark.

On Sun, Jan 30, 2011 at 2:13 PM, Savvas-Andreas Moysidis savvas.andreas.moysi...@googlemail.com wrote: Hi Mark, regarding "When I indexed *George* it was also finally analyzed and stored as *george*. Then why is it that I don't get a match as per the analysis report I had attached?": your indexed term is george, but you search for George*, which does not go through the same analysis process as it did when it was indexed. So, since the terms you are searching for are not lowercased, you are trying to find something that starts with George (capital G), which doesn't exist in your index. If you are not hitting Solr directly, maybe you can lowercase your input text before feeding it to Solr?

On 30 January 2011 16:38, Mark Fletcher mark.fletcher2...@gmail.com wrote: Hi Ahmet, thanks for the reply. I had attached the analysis report of the query George*. It is found to be split into the terms *George** and *George* by the WordDelimiterFilterFactory, and the LowerCaseFilterFactory converts them to *george** and *george*. When I indexed *George* it was also finally analyzed and stored as *george*. Then why is it that I don't get a match as per the analysis report I had attached in my previous mail? Or am I missing something basic here? Many thanks, M

On Sun, Jan 30, 2011 at 4:34 AM, Ahmet Arslan iori...@yahoo.com wrote: : When I try george* I get results. Whereas George* fetches no results. Wildcard queries are not analyzed by QueryParser.
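A minimal sketch of that pre-processing, assuming your client code builds the query string (the name field is taken from the thread):

    // Hedged sketch: lowercase the raw input before it becomes a wildcard query,
    // so George* is searched as george*, matching the lowercased index terms.
    String userInput = "George*";
    String wildcardTerm = userInput.toLowerCase(java.util.Locale.ENGLISH); // "george*"
    String q = "name:" + wildcardTerm; // send this as the q parameter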
Re: Solr n00b question: writing a custom QueryComponent
Thanks for the detailed reply, Upayavira. To answer your question: our index is growing much faster than expected, and our performance is grinding to a halt. Currently it has over 150 million records. We're planning to split the index into multiple shards very soon and move index creation to Hadoop.

Our current situation is that we need to run optimize once every couple of days to keep the index in shape. Given the size (index + stored fields), it takes a long time to complete, during which we can't add new documents to the index. And because of the size of the stored fields, we need double the storage size of the current index to optimize. Since we're on EC2, this requires frequent increases in storage capacity. Even after sharding the index, the time taken to optimize it is going to be significant. That's the reason why we decided to store these fields in MySQL. If there's some easier solution that I've overlooked, please point it out.

On a related note, is there a way to 'automagically' split an existing index into multiple shards? -- Thanks, Ishwar. Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/

From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Sent: Tuesday, February 8, 2011 7:17 PM Subject: Re: Solr n00b question: writing a custom QueryComponent [...]
Re: Solr n00b question: writing a custom QueryComponent
Hi, I agree with Upayavira: it's probably better to create an external app that retrieves content from the db. Anyway, if I am not wrong, finishStage() is a method called by the coordinator only if you have a distributed search. If your Solr is on a single machine, every component should implement only the prepare() and process() methods. HTH, Edo

On Tue, Feb 8, 2011 at 7:17 AM, Ishwar ishwarsridha...@yahoo.com wrote: [...]
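Along those lines, a bare-bones sketch (package and class names hypothetical) that keeps the standard query behaviour and post-processes in process(), which does run on a single node; finishStage() only runs on the coordinating node of a distributed (sharded) search:

    package org.example.solr; // hypothetical

    import java.io.IOException;
    import org.apache.solr.handler.component.QueryComponent;
    import org.apache.solr.handler.component.ResponseBuilder;

    public class MySqlHydratingComponent extends QueryComponent {
      @Override
      public void process(ResponseBuilder rb) throws IOException {
        super.process(rb); // run the normal query first
        // On a non-distributed setup, enrich rb.rsp here (e.g. look up
        // title/content in MySQL for the returned ids) instead of relying
        // on finishStage(), which this node will never call.
      }
    }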
RequestHandler code within 1.4.0 dist
Hello list, I have been searching through the 1.4.0 source for a standard requestHandler plug-in example. I understand that for my purposes extending RequestHandlerBase is a starting point; however, I was wondering if there are any examples of plug-ins which I can view, such as those contained within /contrib. My experience with plug-ins so far relates to those contained within the /contrib folder in Solr, or the /plugins folder in Nutch, but the structure does not seem to be the same in Solr. Can anyone please help? Thank you, Lewis
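Pending a /contrib example, a minimal hedged sketch of such a plug-in (package, class and parameter names are invented); compile it against the Solr 1.4.0 jars, drop the jar in your core's lib directory, and register it in solrconfig.xml:

    package org.example.solr; // hypothetical

    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    public class HelloRequestHandler extends RequestHandlerBase {
      @Override
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
          throws Exception {
        // echo a request parameter back, just to prove the plug-in is wired up
        rsp.add("greeting", "hello " + req.getParams().get("name", "world"));
      }
      @Override public String getDescription() { return "example handler"; }
      @Override public String getSourceId()    { return "$Id$"; }
      @Override public String getSource()      { return "$URL$"; }
      @Override public String getVersion()     { return "1.0"; }
    }

registered with, for example: <requestHandler name="/hello" class="org.example.solr.HelloRequestHandler"/>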
difference between filter_queries and parsed_filter_queries
Hi everybody, please tell me the difference between these two things: after what processing on filter_queries are the parsed_filter_queries generated? Basically, when I search city as fq=city:'noida', then filter_queries and parsed_filter_queries are both 'noida', and in this case I do not get any results. But when I query like fq=city:noida, then filter_queries is noida, parsed_filter_queries is noida, it matches the city, and I get correct results. What processing goes on from filter_queries to parsed_filter_queries? My schema for city is:

    <fieldType name="facetstr_city" class="solr.TextField" sortMissingLast="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_city_facet.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Please advise.
Re: difference between filter_queries and parsed_filter_queries
Hi, parsed_filter_queries contains the value after it has passed through the analyzer. In this case it remains the same because it was already lowercased and no synonyms were used. You're also using single quotes; these have no special meaning, so you're searching for 'noida' in the first fq and noida in the second. Cheers,

On Tuesday 08 February 2011 15:52:23 Bagesh Sharma wrote: [...] -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
General question about Solr Caches
Hello, I am going through the wiki page related to cache configuration, http://wiki.apache.org/solr/SolrCaching, and I have a question regarding the general cache architecture and implementation. In my understanding, the current index searcher uses a cache instance, and when a new index searcher is registered, a new cache instance is used, which is also auto-warmed. However, what happens when the new index searcher is a view of an index which has been modified? If the entries contained in the old cache are copied to the new cache during auto-warming, wouldn't that new cache contain invalid entries? Thanks, - Savvas
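For reference, this is the kind of solrconfig.xml entry that wiki page describes (sizes here are illustrative only):

    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="128"/>

For query-based caches like the filter cache, autowarming re-executes the cached keys (the queries) against the new searcher rather than copying the old result values, so the warmed entries are computed from the new view of the index.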
Re: EdgeNgram Auto suggest - doubles ignore
Hi Erick, if you have time, can you please take a look and provide your comments or suggestions on this problem? Please let me know if you need any more information. Thanks, Johnny
Re: TermVector query using Solr Tutorial
Inline... On Feb 5, 2011, at 4:28 AM, Ryan Chan wrote: Hello all, I am following this tutorial: http://lucene.apache.org/solr/tutorial.html and playing with the TermVector component. Here are my steps:

1. Launch the example server: java -jar start.jar
2. Index monitor.xml: java -jar post.jar monitor.xml, which contains the following:

    <add><doc>
      <field name="id">3007WFP</field>
      <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
      <field name="manu">Dell, Inc.</field>
      <field name="cat">electronics</field>
      <field name="cat">monitor</field>
      <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>
      <field name="includes">USB cable</field>
      <field name="weight">401.6</field>
      <field name="price">2199</field>
      <field name="popularity">6</field>
      <field name="inStock">true</field>
    </doc></add>

3. Execute the query to search for 25 (as you can see, there are two occurrences of 25 in the features field):
http://localhost/solr/select/?q=25&version=2.2&start=0&rows=10&indent=on&qt=tvrh&tv.all=true

4. The term vector in the result does not make sense to me:

    <lst name="termVectors">
      <lst name="doc-2">
        <str name="uniqueKey">3007WFP</str>
        <lst name="includes">
          <lst name="cabl">
            <int name="tf">1</int>
            <lst name="offsets">
              <int name="start">4</int>
              <int name="end">9</int>
            </lst>
            <lst name="positions">
              <int name="position">1</int>
            </lst>
            <int name="df">1</int>
            <double name="tf-idf">1.0</double>
          </lst>
          <lst name="usb">
            <int name="tf">1</int>
            <lst name="offsets">
              <int name="start">0</int>
              <int name="end">3</int>
            </lst>
            <lst name="positions">
              <int name="position">0</int>
            </lst>
            <int name="df">1</int>
            <double name="tf-idf">1.0</double>
          </lst>
        </lst>
      </lst>
      <str name="uniqueKeyFieldName">id</str>
    </lst>

What I want to know is the relative position of the keywords within a field. Can anyone explain the above result to me?

It's a little hard to read due to the indentation, but AFAICT you have two terms, usb and cabl. usb appears at position 0 and cabl at position 1. Those are their positions relative to each other. Perhaps you can explain a bit more what you are trying to do? -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search
Cache size
Hi folks, is there any way to know the size *in bytes* occupied by a cache (filter cache, doc cache, ...)? I can't find such information on the stats page. Regards -- Mehdi BEN HAJ ABBES
Re: Cache size
You can dump the heap and analyze it with a tool like jhat. IBM's heap analyzer is also a very good tool, and if I'm not mistaken people also use one that comes with Eclipse.

On Tuesday 08 February 2011 16:35:35 Mehdi Ben Haj Abbes wrote: [...] -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
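For example, with the standard JDK tools (the pid and file names are placeholders):

    jmap -dump:live,format=b,file=solr-heap.hprof <solr-pid>
    jhat solr-heap.hprof    # then browse the report at http://localhost:7000/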
Re: EdgeNgram Auto suggest - doubles ignore
I'm afraid I'll have to pass; I'm absolutely swamped at the moment. Perhaps someone else can pick it up. I will say that you should be getting terms back when you pre-lowercase them, so look in your index via the admin page or Luke to see if what's really in your index for the name field is what you think. As for sorting, I haven't a clue. Start by backing out your custom sorting, verify that things are as you expect for everything *except* sorting, and then add it back in. Best, Erick

On Tue, Feb 8, 2011 at 10:11 AM, johnnyisrael johnnyi.john...@gmail.com wrote: [...]
RE: How to search for special chars like ä from ae?
Hi Anithya, yes, that sounds right. You will want to edit mapping-FoldToASCII.txt, and my suggestion is that you rename mapping-FoldToASCII.txt to reflect your changes (for example, if your target language is German, you could rename it to mapping-German-FoldToASCII.txt); otherwise it would be easy to mistake this file for the unchanged original. Steve

-----Original Message----- From: Anithya [mailto:surysha...@gmail.com] Sent: Monday, February 07, 2011 6:28 PM To: solr-user@lucene.apache.org Subject: RE: How to search for special chars like ä from ae? [...]
Re: Separating Index Reader and Writer
Just wanted to bump this topic. Regards

Em wrote: Hi Peter, I must jump into this discussion. From a logical point of view, what you are saying makes sense only if both instances do not run on the same machine, or at least not on the same drive. When both run on the same machine and the same drive, the overall memory used should be equal, and I do not understand why this setup should affect cache warming etc., since the process of re-warming should be the same. Well, my knowledge of the internals is not very deep, but from a purely logical point of view, to me, the same thing is happening as if I did it in a single Solr instance. So what is the difference; what am I overlooking? Another thing: while W is committing and writing to the index, is there any inconsistency in R, or isn't there any because W is writing a new segment, so nothing is different for R until the commit finishes? Are there problems during optimizing an index? How do you inform R about the finished commit? Thank you for your explanation; it's a really interesting topic! Regards, Em

Peter Sturge-2 wrote: Hi, we use this scenario in production, where we have one write-only Solr instance and one read-only instance pointing to the same data. We do this so we can optimize caching etc. for each instance for write/read. The main performance gain is in cache warming and associated parameters. For your index W, it's worth turning off cache warming altogether, so commits aren't slowed down by warming. Peter

On Sun, Feb 6, 2011 at 3:25 PM, Isan Fulia isan.fu...@germinait.com wrote: Hi all, I have set up two indexes, one for reading (R) and the other for writing (W). Index R refers to the same data dir as W (defined in solrconfig.xml via dataDir). To make sure the R index sees the indexed documents of W, I am firing an empty commit on R. With this, I am getting a performance improvement compared to using the same index for reading and writing. Can anyone help me understand why this performance improvement is taking place even though both indexes are pointing to the same data directory? -- Thanks & Regards, Isan Fulia.
Jira problem
Hi list, I wanted to create a Jira issue for the CSVUpdateHandler topic I started a few days ago. However, I cannot create a Jira account - I do not receive any confirmation mail or anything like that. Are there any problems with the Jira? Regards
Scoring: Precedent for a Rules-/Priority-based Approach?
Hey everyone, I have a question about Lucene/Solr scoring in general. There are many factors at play in the final score for each document, and very often one factor will completely dominate everything else when that may not be the intention. ** The question: might there be a way to enforce strict requirements that certain factors are higher priority than other factors, and/or certain factors shouldn't overtake other factors? Perhaps a set of rules where one factor is considered before even examining another factor? Tuning boost numbers around and hoping for the best seems imprecise and very fragile. ** To make this more concrete, an example: We previously added the scores of multi-field matches together via an OR, so: score(query apple) = score(field1:apple) + score(field2:apple). I changed that to be more in-line with DisMaxParser, namely a max: score(query apple) = max(score(field1:apple), score(field2:apple)). I also modified coord such that coord would only consider actual unique terms (apple vs. orange), rather than terms across multiple fields (field1:apple vs. field2:apple). This seemed like a good idea, but it actually introduced a bug that was previously hidden. Suddenly, documents matching apple in the title and *nothing* in the body were being boosted over documents matching apple in the title and apple in the body! I investigated, and it was due to lengthNorm: previously, documents matching apple in both title and body were getting very high scores and completely overwhelming lengthNorm. Now that they were no longer getting *such* high scores, which was beneficial in many respects, they were also no longer overwhelming lengthNorm. This allowed lengthNorm to dominate everything else. I'd love to hear your thoughts :) Tavi
Tokenization: How to Allow Multiple Strategies?
Hey everyone, tokenization seems inherently fuzzy and imprecise, yet Solr/Lucene does not appear to provide an easy mechanism to account for this fuzziness. Let's take an example, where the document I'm indexing is "v1.1.0 mr. jones www.gmail.com". I may want to tokenize this as follows: [v1.1.0, mr, jones, www.gmail.com] ...or I may want to tokenize this as follows: [v1, 1.0, mr, jones, www, gmail.com] ...or I may want to tokenize it another way. I would think that the best approach would be indexing using multiple strategies, such as: [v1.1.0, v1, 1.0, mr, jones, www.gmail.com, www, gmail.com]. However, this would destroy phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to deal with cases where you want to index a set of tokens at one position - nor does that even make sense. For instance, I can't index [www, gmail.com] in the same position as www.gmail.com. So: - Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best? - Might there be a way to use multiple strategies and *not* break phrase queries that I'm overlooking? Thanks! Tavi
Re: Scoring: Precedent for a Rules-/Priority-based Approach?
Hi Tavi, in my understanding the scoring formula Lucene (and therefore Solr) uses is based on a mathematical model which is proven to work for general-purpose full-text searching. The real challenge, as you mention, comes when you need to achieve high-quality scoring based on the domain you are working in. For example, a general search portal for songs might need to score songs based on search relevance, but a search application for a music publisher might need to score songs first by relevance, with matched documents boosted according to the revenue they have generated... and the ranking from that second scoring strategy could be widely different from the first one. Personally, I can't think of a generic scoring strategy that would come out of the box with Solr and allow for all the widely different use cases. I don't really agree that tuning Solr, and in general experimenting for better scoring quality, is something fragile or awkward. As the name suggests, it is a tuning process which targets your specific environment. :) Technically, in our case we were able to significantly improve scoring quality (as judged by our domain experts) by using the dismax search handler and by experimenting with different boost values, function queries, the mm parameter, and by setting omitNorms to true for the fields we were having problems with. Regards, - Savvas

On 8 February 2011 16:23, Tavi Nathanson tavi.nathan...@gmail.com wrote: [...]
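As a concrete illustration of that kind of tuning (field names, boosts and the date field are invented for the example), the dismax parameters might look like:

    defType=dismax
    qf=title^3.0 body^1.0
    pf=title^5.0
    mm=2<75%
    bf=recip(ms(NOW,publish_date),3.16e-11,1,1)

Here qf spreads the query across fields with per-field boosts, pf boosts documents where the terms appear as a phrase, mm requires 75% of the terms to match once there are more than two, and bf adds a recency boost; each knob can be adjusted independently against the domain experts' expectations.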
Re: Scoring: Precedent for a Rules-/Priority-based Approach?
Hi Tavi, could you please provide an example query for your problem and the debugQuery output? It confuses me that you write score(query apple) = max(score(field1:apple), score(field2:apple)). I think your problem could come from the norms of your request, but I am not sure. If you can, show us a piece of your schema.xml and the debugQuery output so that we can have a look at it. I have to agree with Savvas: tuning scoring for a specific domain is an exciting thing, and there are lots of approaches out there to make scoring good. Regards

Tavi Nathanson wrote: [...]
Re: HTTP ERROR 400 undefined field: *
So I re-indexed some of the content, but no dice. Per Hoss, I tried disabling the TVC and it worked great. We're not really using the TVC right now, since we decided to turn off highlighting for the moment, so this isn't a huge deal. I'll create a new Jira issue. FYI, here are my queries from the logs.

This one breaks (undefined field):
webapp=/solr path=/select params={explainOther=&fl=*,score&indent=on&start=0&q=bruce&hl.fl=&qt=standard&wt=standard&fq=&version=2.2&rows=10} hits=114 status=400 QTime=21

This one works:
webapp=/solr path=/select params={explainOther=&indent=on&hl.fl=&wt=standard&version=2.2&rows=10&fl=*,score&start=0&q=bruce&tv=false&qt=standard&fq=} hits=128 status=0 QTime=48

Though I'm not sure why, with the TVC disabled, there are more hits but the QTime is slower. That's a different issue, though, and something I can work through. Thanks for your help.

On 02/07/2011 11:38 AM, Chris Hostetter wrote: : The stack trace is attached. I also saw this warning in the logs, not sure

From your attachment...

SEVERE: org.apache.solr.common.SolrException: undefined field: score
  at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:142)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1357)

...this is one of the key pieces of info that was missing from your earlier email: that you are using the TermVectorComponent. It's likely that something changed in the TVC on 3x between the two versions you were using, and that change now freaks out on * or score in the fl. You still haven't given us an example of the full URLs you are using that trigger this error (it's possible there is something slightly off in your syntax - we don't know because you haven't shown us). All in all: this sounds like a newly introduced bug in TVC; please post the details into a new Jira issue.

As to the warning you asked about...

: Feb 3, 2011 8:14:10 PM org.apache.solr.core.Config getLuceneVersion
: WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24
: emulation. You should at some point declare and reindex to at least 3.0,
: because 2.4 emulation is deprecated and will be removed in 4.0. This parameter
: will be mandatory in 4.0.

If you look at the example configs on the 3x branch it should be explained. It's basically just a new feature that lets you specify which quirks of the underlying Lucene code you want (so on upgrading, you are in control of whether you eliminate old quirks or not). -Hoss
Re: Tokenization: How to Allow Multiple Strategies?
Hi Tavi, if you want to use multiple tokenization strategies (different tokenizers, so to speak), you have to use different fieldTypes. Maybe you have to create your own tokenizer to do what you want, or a PatternTokenizer might help you. However, your examples with the different positions of specific terms remind me of the WordDelimiterFilter (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory ). It does almost everything you wrote and is close to what you want, I think. Have a look at it. Regards
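A hedged sketch of a fieldType using that filter (parameter values chosen for illustration); for an input like www.gmail.com it emits the parts www/gmail/com plus the catenated wwwgmailcom, and preserveOriginal keeps www.gmail.com itself at the same starting position, which is close to the multi-strategy indexing described above:

    <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>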
Re: HTTP ERROR 400 undefined field: *
Here is the ticket: https://issues.apache.org/jira/browse/SOLR-2352

On 02/08/2011 11:27 AM, Jed Glazner wrote: [...]
relational db mapping for advanced search
Hi, I was just after some advice on how to map some relational metadata to a Solr index. The web application I'm working on is based around people, and the searching is based around properties of these people. Several properties are more complex - for example, a person's occupations have place, from/to dates and other descriptive text; texts about a person have authors, sources and publication dates. Despite the usefulness of facets and search-based navigation, an advanced search feature is a non-negotiable requirement of the application. An advanced search needs to be able to query a person on any set of attributes (e.g. gender, birth date, death date, place of birth, etc.), including the more complex search criteria described above (occupation, texts).

Taking occupation as an example: because occupation has its own metadata, and a person could have worked an arbitrary number of occupations throughout their lifetime, I was wondering how/if this information can be denormalised into a single person index document to support such a search. I can't use text concatenation in a multivalued field, as I need to be able to run date-based range queries (e.g. publication dates, occupation dates). And I'm not sure that resorting to multiple repeated fields based on the current limits (e.g. occ1, occ1startdate, occ1enddate, occ1place, occ2, etc.) is a good approach (although that would work).

If there isn't a sensible way to denormalise this, what is the best approach? For example, should I have an occupation document type, a person document type, and a text/source document type, each containing the relevant person id, and (in the advanced search context) run a query against each document type and then use the intersecting set of person ids as the result used by the application for its display/pagination? And if so, how do I ensure I capture all records? For example, if there are 100,000 hits on someone having worked in Australia in 1956, is there any way to ensure all 100,000 are returned in a query (similar to facet.limit=-1), other than specifying an arbitrarily high number in the rows parameter and hoping a query doesn't hit more than that and thus exclude records above the limit from the intersect processing? Or is there a single-query solution? Any advice/hints welcome. Scott.
RE: relational db mapping for advanced search
I have no great answer for you; this is, to me, a generally unanswered question - it's hard to do this sort of thing in Solr, and you seem to understand the problem properly. There ARE some interesting new features in trunk (not 1.4) that may be relevant, although from my perspective none of them provides a magic-bullet solution. But there is a 'join' feature which could be awfully useful with the setup you suggest of having different 'types' of documents all together in the same index. https://issues.apache.org/jira/browse/SOLR-2272

From: Scott Yeadon [scott.yea...@anu.edu.au] Sent: Tuesday, February 08, 2011 4:41 PM To: solr-user@lucene.apache.org Subject: relational db mapping for advanced search [...]
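For illustration, a query using that trunk-only join (all field names hypothetical: occupation documents carry a person_id pointing at a person document's id) to find people who held an occupation in Australia starting in 1956:

    q={!join from=person_id to=id}type:occupation AND place:Australia AND startdate:[1956-01-01T00:00:00Z TO 1956-12-31T23:59:59Z]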
Re: relational db mapping for advanced search
Yes, I saw something in the dev stream about compound types as well, which would also be useful (so in my example an occupation field could comprise multiple fields of different types), but these are up-and-coming features. I suspect using multiple document types is probably the best way for now, but thanks for the heads-up on the join - it looks like these issues will be better addressed in the future. An RDBMS in my context won't work well, as it requires lots of joins (and self-joins) for complex searches; in the old system these tend to lock up the DB as the temp table size grows exponentially. Scott. On 9/02/11 8:57 AM, Jonathan Rochkind wrote:
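On the side question of capturing all matching records: Solr has no rows=-1, but you can page through the full result set with start/rows rather than guessing an upper bound. A minimal SolrJ sketch under assumed names (the URL, query and person_id field are hypothetical; exception handling omitted):

import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery q = new SolrQuery("doctype:occupation AND place:Australia");
q.setFields("person_id"); // fetch only the id used for the intersection
q.setRows(1000);          // page size

Set<String> personIds = new HashSet<String>();
int start = 0;
SolrDocumentList page;
do {
    q.setStart(start);
    page = server.query(q).getResults();
    for (SolrDocument doc : page) {
        personIds.add((String) doc.getFieldValue("person_id"));
    }
    start += page.size();
} while (page.size() > 0 && start < page.getNumFound());

Repeating this per document type and intersecting the sets gives the combined result, at the cost of one round trip per page.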
RE: How to search for special chars like ä from ae?
Thanks for the help Steve, it worked!!!
RE: How to search for special chars like ä from ae?
Hi Anithya, That's good to hear. Again, please consider donating your work: http://wiki.apache.org/solr/HowToContribute#Making_Changes. Steve -----Original Message----- From: Anithya [mailto:surysha...@gmail.com] Sent: Tuesday, February 08, 2011 5:16 PM To: solr-user@lucene.apache.org Subject: RE: How to search for special chars like ä from ae? Thanks for the help Steve, it worked!!!
Re: How to search for special chars like ä from ae?
Hello, Quick question on Solr replication: what effect does the index reload after a replication have on search requests? Can the server still respond to user queries with the old index, especially during the following phase of replication on slaves? http://wiki.apache.org/solr/SolrReplication#How_does_the_slave_replicate.3F "After the download completes, all the new files are 'mov'ed to the slave's live index directory and the files' timestamps will match the timestamps in the master. A 'commit' command is issued on the slave by the Slave's ReplicationHandler and the new index is loaded." Thanks, Charan
RE: How to search for special chars like ä from ae?
So - how did you end up setting it up? In my reading of the thread, it seems you could have a search for 'mäcman' hit 'macman' or 'maecman', but not both, since it seems you could only map the ä to a single replacement. Or can it be mapped multiple times, generating multiple tokens? Thanks!
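For reference, a MappingCharFilter mapping file maps each input sequence to exactly one replacement, one rule per line, e.g.:

"ä" => "ae"
"ö" => "oe"

so a single field can only fold ä one way. As far as I can tell, to have a search for 'mäcman' hit both 'macman' and 'maecman' you would need two fieldTypes with different mapping files (e.g. one mapping "ä" => "a" and one mapping "ä" => "ae"), copyField the source text into a field of each type, and query across both fields.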
Re: Solr n00b question: writing a custom QueryComponent
Your observation regarding optimisation is an interesting one; it does at least make sense that reducing the size of a segment will speed up optimisation and reduce the disk space needed. In a situation where we had multiple shards, we had two 'rows', for redundancy purposes. In that situation, we could take one row offline while it optimised and allow the other to serve search during that time. If we offset optimisation by 12 hours for each of our rows, we can optimise daily and not have a problem with loss of up-to-date content or slow searches during an optimisation. As to splitting indexes, it isn't an easy task to do properly, and there's nothing in Solr to do it. However, there is a very clever class in Lucene contrib that you can use to split a Lucene index [1], and you can safely use it to split a Solr index so long as the index isn't in use while you're doing it. Upayavira [1] for example: http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/index/MultiPassIndexSplitter.html On Tue, 08 Feb 2011 06:24 -0800, Ishwar ishwarsridha...@yahoo.com wrote: Thanks for the detailed reply Upayavira. To answer your question, our index is growing much faster than expected and our performance is grinding to a halt. Currently, it has over 150 million records. We're planning to split the index into multiple shards very soon and move the index creation to hadoop. Our current situation is that we need to run optimize once every couple of days to keep it in shape. Given the size (index + stored), it takes a long time to complete, during which time we can't add new documents into the index. And because of the size of the stored fields, we need double the storage size of the current index to optimize. Since we're on EC2, this requires frequent increases in storage capacity. Even after sharding the index, the time taken to optimize the index is going to be significant. That's the reason why we decided to store these fields in MySQL. If there's some easier solution that I've overlooked, please point it out. On a related note, is there a way to 'automagically' split the existing index into multiple shards? -- Thanks, Ishwar Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Cc: Sent: Tuesday, February 8, 2011 7:17 PM Subject: Re: Solr n00b question: writing a custom QueryComponent The conventional way to do it would be to index your title and content fields in Solr, along with the ID to identify the document. You could do a search against solr, and just return an ID field; then your 'client code' would match that up with the title/content data from your database. And yes, SolrJ would be the obvious route to take here, for your client application. Yes, it does mean another component that needs to be maintained, but by using Solr's external interface you will be protected from changes to internals that could break your custom components, and you will likely be more able to take advantage of other Solr features that are also available via the standard interfaces. My next question is: are you going to be using the data you're storing in mysql for something other than just enhancing search results? If not, it may still make sense to store the data in Solr. It would mean you just have one index to manage, rather than an index and a database - after all, the words *have* to take up disk space somewhere :-).
If you end up with so many documents indexed that performance grinds (over 10 million??) you can split your index across multiple shards. Upayavira Once you get search results back from Solr, you would do a query against your database to return the additional On Tue, 08 Feb 2011 03:38 -0800, Ishwar ishwarsridha...@yahoo.com wrote: Hi Upayavira, Apologies for the lack of clarity in the mail. The feeds have the following fields: id, url, title, content, refererurl, createdDate, author, etc. We need search functionality on title and content. As mentioned earlier, storing title and content in solr takes up a lot of space. So, we index title and content in solr, and we wish to store title and content in MySQL, which has the fields id, title, content. I'm also looking at a solr client - solrj - to query MySQL based on what solr returns. But that means another component which needs to be maintained. I was wondering if it's a good idea to implement the functionality in solr itself. -- Thanks, Ishwar Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Cc: Sent: Tuesday, February 8, 2011 4:36 PM Subject: Re: Solr n00b question: writing a custom QueryComponent
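As an aside on the optimise mechanics discussed above: an optimise can be triggered remotely with an update message or via SolrJ - for example (URL hypothetical, exception handling omitted):

curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'

or, from SolrJ:

new CommonsHttpSolrServer("http://localhost:8983/solr").optimize();

Either way, plan for roughly double the index size in free disk while segments are rewritten, as noted in the thread.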
Re: How to search for special chars like ä from ae?
When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See: http://people.apache.org/~hossman/#threadhijack On Tue, Feb 8, 2011 at 5:59 PM, charan kumar charan.ku...@gmail.com wrote:
Re: Tokenization: How to Allow Multiple Strategies?
Thanks for the suggestions! Using a new field makes sense, except it would double the size of the index. I'd like to add additional terms, at my discretion, only when there's ambiguity. More specifically, do you know of any way to put multiple *token sets* at the same position of the same field? If I can tokenize "123-4567 apple" as: [Token(123), Token(-), Token(4567), Token(apple)] or [Token(123-4567), Token(apple)] ...might there be a way to put [Token(123), Token(-), Token(4567)] *and* [Token(123-4567)] in the index in such a way that the PhraseQuery "Token(123-4567) Token(apple)" would match the above string, *and* the PhraseQuery "Token(123) Token(-) Token(4567) Token(apple)" would also match it? Thanks! Tavi On Tue, Feb 8, 2011 at 10:34 AM, Em mailformailingli...@yahoo.de wrote: Hi Tavi, if you want to use multiple tokenization strategies (different tokenizers, so to speak) you have to use different fieldTypes. Maybe you have to create your own tokenizer for doing what you want, or a PatternTokenizer might help you. However, your examples for the different positions of specific terms remind me of the WordDelimiterFilter (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory ). It does almost everything you wrote and is close to what you want, I think. Have a look at it. Regards
Re: jndi datasource in dataimport
Hi, still no luck with this. Is the problem with the name attribute of the dataSource element in the data config? On 5 February 2011 10:48, lee carroll lee.a.carr...@googlemail.com wrote: ah, should this work, or am i doing something obviously wrong? in config: <dataSource jndiName="java:sourcepathName" type="JdbcDataSource" user="xxx" password="xxx"/> in dataimport config: <dataSource type="JdbcDataSource" name="java:sourcepathName"/> what am i doing wrong? On 5 February 2011 10:16, lee carroll lee.a.carr...@googlemail.com wrote: Hi list, It looks like you can use a jndi datasource in the data import handler; however, i can't find any syntax for this. Where is the best place to look? (and can anyone confirm jndi does work in DataImportHandler?)
Re: Tokenization: How to Allow Multiple Strategies?
A couple of things... First, you haven't provided any evidence that increasing the index size is actually a concern. If your index isn't all that large, it really doesn't matter, and conserving index size may not be worth the trouble. WordDelimiterFilterFactory (WDFF) will handle the use cases you outlined below, but don't get stuck on, for instance, having the '-' be a token unless you can say for certain that it has benefits over just indexing and searching on 123 followed by 4567, which is what would happen with WDFF. I recommend that you look at the analysis page (check the verbose box) to see the effects of tokenization with various analysis chains before making any firm decisions. Best Erick On Tue, Feb 8, 2011 at 6:24 PM, Tavi Nathanson tavi.nathan...@gmail.com wrote:
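On the token-stacking question specifically: WDFF can emit the original token at the same position as its split parts (a position increment of zero), so phrase queries against either form have a chance to match. A sketch of the kind of fieldType involved - untested, and the attribute choices here are my assumption:

<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal keeps "123-4567" stacked alongside "123" and "4567" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateNumbers="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>

Exact position assignment is subtle (and the query-side analyzer matters just as much), so verify the behaviour on the analysis page before committing to it.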
Does Distributed Search support {!boost }?
Is it possible to do a query like {!boost b=log(popularity)}foo over sharded indexes? I looked at the wiki on distributed search (http://wiki.apache.org/solr/DistributedSearch) and it has a list of components that are supported in distributed search. Just wondering which component {!boost} belongs to. Thanks.
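For what it's worth, {!boost} is a query parser rather than a separate search component, so the resulting query is executed by the standard QueryComponent - the component that carries distributed support. The request being asked about would look something like this (hosts hypothetical; the q parameter would need URL-encoding in practice):

http://host1:8983/solr/select?q={!boost b=log(popularity)}foo&shards=host1:8983/solr,host2:8983/solr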
Re: General question about Solr Caches
: In my understanding, the Current Index Searcher uses a cache instance and : when a New Index Searcher is registered a new cache instance is used which : is also auto-warmed. However, what happens when the New Index Searcher is a : view of an index which has been modified? If the entries contained in the : old cache are copied during auto warming to the new cache wouldn't that new : cache contain invalid entries? a) i'm not sure what you mean by "view of an index which has been modified" ... except for the first time an index is created, an Index Searcher always contains a view of an index which has been modified -- the view that the IndexSearcher represents is entirely consistent and doesn't change as documents are added/removed - that's why a new Searcher needs to be opened. b) entries are not copied during autowarming. the *keys* of the entries in the old cache are used to warm the new cache -- using the new searcher to generate new values. (caveat: if you have a custom cache, you could write a custom cache regenerator that did copy the values from the old cache verbatim -- i have done that in special cases where the type of object i was caching didn't vary based on the IndexSearcher -- or did vary, but in such a way that i could use the new Searcher to determine a cheap piece of information and, based on the result, either reuse an old value that was expensive to compute or recompute it using the new Searcher ... but none of the default cache regenerators for the stock solr caches work this way) : Thanks, : - Savvas -Hoss
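For context, autowarming is configured per cache in solrconfig.xml; autowarmCount is the number of keys from the old cache that are replayed against the new searcher. The sizes below are illustrative only:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>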
Re: jndi datasource in dataimport
: It looks like you can use a jndi datasource in the data import handler. : however i can't find any syntax on this. : : Where is the best place to look for this? (and confirm if jndi does work in : dataimporthandler) It's been a long time since i used JNDI on anything, and i've never tried it with DIH, but a google search for "JNDI DataImportHandler" pointed to... http://wiki.apache.org/solr/DataImportHandlerFaq#How_do_I_use_a_JNDI_DataSource.3F -Hoss
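For what it's worth, the pattern from that FAQ looks like the following (the JNDI name is illustrative; note that in many containers, e.g. Tomcat, JNDI resources live under java:comp/env/, which may be what the config earlier in this thread is missing):

<dataSource jndiName="java:comp/env/jdbc/myDataSource" type="JdbcDataSource"/>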
[WKT] Spatial Searching
I just came across a ~nudge post over in the SIS list on what the status is for that project. This got me looking more into spatial mods with Solr 4.0. I found this enhancement in Jira: https://issues.apache.org/jira/browse/SOLR-2155. In this issue, David mentions that he's already integrated JTS into Solr 4.0 for querying on polygons stored as WKT. It's relatively easy to get WKT strings into Solr, but does the field type exist yet? Is there a patch or something that I can test out? Here's how I would do it using GDAL/OGR and the already existing csv update handler. http://www.gdal.org/ogr/drv_csv.html

ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_WKT

This converts a shapefile to a csv with the geometries intact in the form of WKT. You can then get the data into Solr by running the following command.

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,attr1,attr2,attr3,geom&stream.file=C:\tmp\output.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

There are lots of flavors of geometries, so I suspect that this will be a daunting task, but because JTS recognizes each geometry type it should be possible to work with them. Does anyone know of a patch, or even when this functionality might be included in Solr 4.0? I need to query for polygons ;-) Thanks, Adam
Re: How to search for special chars like ä from ae?
Sorry for cross-posting, but that is the only way I could get my question posted - the Solr mailing server treats my question as SPAM: Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 552 552 spam score (5.1) exceeded threshold (FREEMAIL_FROM,FS_REPLICA, HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL (state 18). On Tue, Feb 8, 2011 at 3:17 PM, Erick Erickson erickerick...@gmail.com wrote:
Re: [WKT] Spatial Searching
+1 to David's patch from SOLR-2155. It would be great to see it implemented. Great job using GDAL to convert to WKT, Adam! Cheers, Chris On Feb 8, 2011, at 8:18 PM, Adam Estrada wrote: ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
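For anyone tracking this: the query style the SOLR-2155/JTS work points toward puts a WKT shape directly inside a field query. A hedged sketch, assuming a WKT-capable field named geom (syntax as it later took shape on trunk, so treat it as an approximation):

fq=geom:"Intersects(POLYGON((-80 35, -80 36, -79 36, -79 35, -80 35)))"

i.e. the polygon itself is expressed as WKT inside the filter query.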
Re: Solr n00b question: writing a custom QueryComponent
In the situation that you'd explained, I'm assuming one of the rows is the master and the other is the slave. How did you continue feeding documents while the master was down for optimisation? And thanks for the link to MultiPassIndexSplitter. I shall check it out. -- Thanks, Ishwar Just another resurrected Neozoic Archosaur comics. http://www.flickr.com/photos/mojosaurus/sets/72157600257724083/ From: Upayavira u...@odoko.co.uk To: solr-user@lucene.apache.org Sent: Wednesday, February 9, 2011 4:42 AM Subject: Re: Solr n00b question: writing a custom QueryComponent
Help migrating from Lucene
Hey guys, We're migrating from Lucene to Solr. So far the migration has been smooth; however, there is one feature I'm having issues adapting. Our calls to our indexing service are defined in a central interface. Here is an example of a query executed from a programmatically constructed Lucene query.

BooleanQuery query = new BooleanQuery();
BooleanQuery inputTerms = new BooleanQuery();
inputTerms.add(new TermQuery(new Term(FIELD_EMAIL, input)), Occur.SHOULD);
inputTerms.add(new TermQuery(new Term(FIELD_PHONE, getNumericString(input))), Occur.SHOULD);
query.add(inputTerms, Occur.MUST);
query.add(new TermQuery(new Term(FIELD_RESOLVED, String.valueOf(false))), Occur.MUST);
NumericRangeQuery time = NumericRangeQuery.newLongRange(FIELD_CREATETIME, null, endTime, true, true);
query.add(time, Occur.MUST);
SortField sort = new SortField(FIELD_CREATETIME, SortField.LONG, true);

CommonsHttpSolrServer client = getClient(indexName);
SolrQuery solrQuery = new SolrQuery();
// TODO how do I set the sort?
solrQuery.setQuery(query.toString());
QueryResponse response = client.query(solrQuery);

How can I set the sort in the Java client? Also, with the annotations of POJOs outlined here: http://wiki.apache.org/solr/Solrj#Directly_adding_POJOs_to_Solr - how are sets handled? For instance, how are Lists of other POJOs added to the document? Thanks, Todd
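On the sort question, a hedged sketch: SolrQuery exposes sorting directly, so the Lucene SortField doesn't need to be carried over. Assuming the same FIELD_CREATETIME constant:

solrQuery.setQuery(query.toString());
// reverse=true on the Lucene SortField corresponds to descending order here
solrQuery.addSortField(FIELD_CREATETIME, SolrQuery.ORDER.desc);
QueryResponse response = client.query(solrQuery);

On the POJO question: as far as I know, SolrJ's @Field annotation binds multivalued fields to collections of simple types (e.g. List<String>), but it does not serialise nested POJOs - a Solr document is flat, so a list of child objects has to be flattened into ordinary fields before indexing.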
Re: Solr n00b question: writing a custom QueryComponent
Actually, in that situation, we indexed twice, to both, so there was no master and no slave. Our testing showed that search was not slowed down unduly by indexing. Upayavira On Tue, 08 Feb 2011 22:34 -0800, Ishwar ishwarsridha...@yahoo.com wrote: