Boosting of search results
Hi, I want to boost/block search results, and I don't want to use the field/term boosting of the dismax request handler. I have seen a post that mentions setting a value for the key $docBoost via a transformer, but I am not sure how to use/set a doc boost via a transformer. http://www.nabble.com/Boosting-Code-td22119017.html#a22119017 Please let me know how to use docBoost, or whether there is another way to boost documents. Thanks, Prerna
importing lots of db data. specially formatted. what is the fastest approach?
Hi folks, I have around 50k documents that are reindexed now and then. The question is what would be the fastest approach to all this. The data is just text, ~20 fields or so. It comes from a database but is first specially formatted into a form suitable for passing to Solr. Currently an XML post is used, but I have the feeling this is not optimal speed-wise when it comes to bulk import/reindex. I see http://wiki.apache.org/solr/DataImportHandler but fail to see how to feed it this specially formatted data so Solr can make use of it. Are there any real examples or articles on how to use this?
Re: Boosting Code
Hi, I have to boost documents. Can someone help me understand how we can implement docBoost via a transformer? Thanks, Prerna

Marc Sturlese wrote:
If you mean at indexing time, you set a field boost via data-config.xml. That boost is parsed from there and set on the Lucene document going through DocBuilder.java, SolrInputDocument.java and DocumentBuilder.java. In case you want to set a full-document boost (not just a field boost) you can do it by setting a value for the key $docBoost via a transformer. That value is set using the same classes (DocBuilder.java, SolrInputDocument.java and DocumentBuilder.java).

dabboo wrote:
Hi, Can anyone please tell me where I can find the actual logic/implementation of field boosting in Solr? I am looking for the classes. Thanks, Amit Garg
Re: importing lots of db data. specially formatted. what is the fastest approach?
If each field from the db goes to a separate field in Solr as-is, then it is very simple. If you need to split/join fields before feeding them into Solr fields, you may need to apply transformers. An example of how your db fields look and how you wish them to look in Solr would be helpful.

--
Noble Paul | Principal Engineer | AOL | http://aol.com
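For the simple as-is case, a minimal data-config.xml sketch might look like the following (the JDBC URL, table and column names are made up for illustration):

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
    <document>
      <entity name="item" query="select id, name, description from item">
        <!-- each db column maps straight onto a schema field -->
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="description" name="description"/>
      </entity>
    </document>
  </dataConfig>

With the handler registered in solrconfig.xml, a full reindex is then just a request to /dataimport?command=full-import.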
DIH example explanation
Hi, I am looking at the Slashdot example and I am having a hard time understanding the following from the wiki:
==
You can use this feature for indexing from REST APIs such as rss/atom feeds, XML data feeds, other Solr servers or even well-formed xhtml documents. Our XPath support has its limitations (no wildcards, only full paths etc.) but we have tried to make sure that common use-cases are covered, and since it's based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XMLs. It does not support namespaces, but it can handle XMLs with namespaces. When you provide the xpath, just drop the namespace and give the rest (e.g. if the tag is 'dc:subject' the mapping should just contain 'subject'). Easy, isn't it? And you didn't need to write one line of code! Enjoy
==
How does dc:subject become the field subject, and why is its mapping xpath=/RDF/item/subject? What is the secret? I am trying to index Atom files and I need to understand the above because I have namespaces and am not sure how to proceed. Are there any Atom examples anywhere? Thanks again for any clarification. Anton
Re: Boosting Code
public Map<String, Object> transformRow(Map<String, Object> row, Context ctx) {
    row.put("$docBoost", 3445);
    return row;
}

--
Noble Paul | Principal Engineer | AOL | http://aol.com
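Filling that out into a complete class and wiring it into data-config.xml might look like the sketch below; the package, class, entity and column names are invented for illustration, and the class extends DIH's org.apache.solr.handler.dataimport.Transformer:

  package com.example;

  import java.util.Map;
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;

  // Sets a full-document boost: DIH treats the special key $docBoost
  // as the boost for the resulting Lucene document.
  public class DocBoostTransformer extends Transformer {
      @Override
      public Object transformRow(Map<String, Object> row, Context ctx) {
          row.put("$docBoost", 3445);
          return row;
      }
  }

and in data-config.xml:

  <entity name="item" query="select id, name from item"
          transformer="com.example.DocBoostTransformer">
    <field column="id" name="id"/>
    <field column="name" name="name"/>
  </entity>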
Re: All in one index, or multiple indexes?
Keep in mind that every time a commit is done, all the caches are thrown away. If updates for each of these indexes happen at different times, then the caches get invalidated on each commit, so in that case a smaller index helps.

On Wed, Jul 8, 2009 at 4:55 PM, Tim Sell trs...@gmail.com wrote:
Hi, I am wondering if it is common to have just one very large index, or multiple smaller indexes specialized for different content types. We currently have multiple smaller indexes, although one of them is much larger than the others. We are considering merging them, to allow the convenience of searching across multiple types at once and getting them back in one list. The largest of the current indexes has a couple of types that belong together; it has just one text field, which is usually quite short and is similar to product names (words like The matter). Another index I would merge with this one has multiple text fields (also quite short). We of course would still like to be able to get specific types. Is filtering on just one type a big performance hit compared to querying it from its own index? Bear in mind all these indexes run on the same machine (we replicate them all to three machines and do load balancing). There are a number of considerations. From an application standpoint, when querying across all types we may split the results out into the separate types anyway once we have the list back. If we always do this, is it silly to have them in one index rather than query multiple indexes at once? Are multiple http requests less significant than the time to post-split the results? In some ways it is easier to maintain a single index, although it has felt easier to optimize the results for the type of content if they are in separate indexes. My main concern about putting it all in one index is that we'll make it harder to work with. We will definitely want to filter on types sometimes, and if we go with a mashed-up index I'd prefer not to maintain separate specialized indexes as well. Any thoughts? ~Tim.

--
Noble Paul | Principal Engineer | AOL | http://aol.com
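For what it's worth, restricting a combined index to one type is normally done with a filter query, which is cached separately from the main query. A sketch, assuming a hypothetical 'type' field on every document:

  http://localhost:8983/solr/select?q=the+matter&fq=type:product

After the first request, the fq=type:product filter is served from the filter cache, so the per-query cost of filtering on type is usually small.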
Re: DIH example explanation
The point is that the namespace is ignored while DIH reads the XML. So just use the part after the colon (:) in your xpath expressions and it should just work.

--
Noble Paul | Principal Engineer | AOL | http://aol.com
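As a concrete sketch for an Atom feed inside data-config.xml (the feed URL and field names are made up), an element such as dc:subject is mapped with the namespace prefix simply dropped:

  <dataSource type="HttpDataSource"/>
  <document>
    <entity name="feed"
            processor="XPathEntityProcessor"
            url="http://example.com/feed.atom"
            forEach="/feed/entry">
      <field column="title" xpath="/feed/entry/title"/>
      <!-- the source element is dc:subject; the dc: prefix is dropped -->
      <field column="subject" xpath="/feed/entry/subject"/>
    </entity>
  </document>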
US/UK/CA/AU English support
Hi,
1) Out of US/UK/CA/AU, which English does Solr support?
2) PhoneticFilterFactory performs search for similar-sounding words. For example, a search on carat will give results for carat, caret and carrat. I also observed that PhoneticFilterFactory supports linguistic variation for US/UK/CA/AU. For example, a search on Optimize gives results for optimise and optimize.
Question: Does PhoneticFilterFactory support all characters/words of the linguistic variations for US/UK/CA/AU, or will linguistic search for US/UK/CA/AU be a subset of phonetic search?
Please suggest. Thanks, Prerna
Re: Word frequency count in the index
Hi Grant, thanks for your reply. I have one more doubt: if I use Luke's request handler in Solr for this, are the top terms I get sorted by term frequency or by highest document frequency? I would like to get terms that occur the most within a document, where those documents also form a good percentage of the total index. Kindly reply if any other option, straightforward or more elaborate, is available. Thank you, Pooja

On Thu, Jul 16, 2009 at 4:05 PM, Grant Ingersoll gsing...@apache.org wrote:
In the trunk version, the TermsComponent should give you this: http://wiki.apache.org/solr/TermsComponent. Also, you can use the LukeRequestHandler to get the top words in each field. Alternatively, you may just want to point Luke at your index.

On Jul 16, 2009, at 6:29 AM, Pooja Verlani wrote:
Hi, Is there any way in Solr to know the count of each word indexed in Solr? I want to find out the different word frequencies to figure out 'application-specific stop words'. Please let me know if it's possible. Thank you, Regards, Pooja

--
Grant Ingersoll http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
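On the question above: both the LukeRequestHandler top terms and the TermsComponent report document frequency (the number of documents containing a term), not the total number of occurrences. A sketch of a TermsComponent request, assuming a trunk build with a /terms handler registered in solrconfig.xml and a field named text:

  http://localhost:8983/solr/terms?terms.fl=text&terms.limit=25

This returns the 25 terms of the text field with the highest document frequency, which is usually what you want for spotting candidate stop words.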
Best approach to multiple languages
Hi, We have a dataset that contains productname, category and descriptions. The descriptions can be in one or more different languages. What would be the recommended way of indexing these? My initial thought is to index each description as a separate field and append the language identifier to the field name; for example, three fields named description_en, description_de, description_fr. Is this the best approach or is there a better way? Regards, Andrew McCombe
Re: Best approach to multiple languages
Hi, We have such a case... we don't want to search all of those languages at once, just one of them. So we took the approach of a different index for each language. From what I know it also helps not to skew the relevance statistics - you know, how much an index is used etc. If you dig in the mailing list, this has been discussed quite a few times.
Re: importing lots of db data. specially formatted. what is the fastest approach?
As Noble has already said, transforming content before indexing is a very common requirement. DataImportHandler's Transformer lets you achieve this. Read up on it here - http://wiki.apache.org/solr/DataImportHandler#head-a6916b30b5d7605a990fb03c4ff461b3736496a9
Cheers, Avlesh
Re: Behaviour when we get more than 1 million hits
Hi, There is this particular scenario where I want to search for a product and I get a million records which will be given for further processing. Regards, Raakhi

On Mon, Jul 13, 2009 at 7:33 PM, Erick Erickson erickerick...@gmail.com wrote:
It depends (tm) on what you try to do with the results. You really need to give us some more details on what you want to *do* with 1,000,000 hits before any meaningful response is possible. Best, Erick

On Mon, Jul 13, 2009 at 8:47 AM, Rakhi Khatwani rkhatw...@gmail.com wrote:
Hi, While using Solr, what would the behaviour be like if we perform a search and get more than one million hits? Regards, Raakhi
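If the full result set really has to be handed off for processing, the usual pattern is to page through it in batches rather than request a million rows in one response, e.g. (query, host and batch size are placeholders):

  http://localhost:8983/solr/select?q=product&fl=id&start=0&rows=1000
  http://localhost:8983/solr/select?q=product&fl=id&start=1000&rows=1000
  ...

Keeping fl down to the fields you actually need reduces the cost of materializing each batch.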
[ApacheCon US] Travel Assistance
The Travel Assistance Committee is taking in applications from those wanting to attend ApacheCon US 2009 (Oakland), which takes place between the 2nd and 6th November 2009. The Travel Assistance Committee is looking for people who would like to attend ApacheCon US 2009 but may need some financial support in order to get there. There are limited places available, and all applications will be scored on their individual merit. Applications are open to all open source developers who feel that their attendance would benefit themselves, their project(s), the ASF and open source in general. Financial assistance is available for flights, accommodation, subsistence and conference fees, either in full or in part, depending on circumstances. It is intended that all our ApacheCon events are covered, so it may be prudent for those in Europe and/or Asia to wait until an event closer to them comes up - you are all welcome to apply for ApacheCon US of course, but there should be compelling reasons for you to attend an event further from your home location for your application to be considered above those closer to the event location. More information can be found on the main Apache website at http://www.apache.org/travel/index.html - where you will also find a link to the online application and details for submitting. Applications for travel assistance will open on 27th July 2009 and close on 17th August 2009. Good luck to all those who apply. Regards, The Travel Assistance Committee
Re: DIH example explanation
:) Thank you Paul - and it works! I have one more stupid question about the wiki: url (required): The URL used to invoke the REST API. (Can be templatized). How do you templatize the URL? My URLs are being updated all the time by an external program, i.e. the list of Atom sites is a text file. So should I use some form of transformer to process it? Any hint? Thanks. Anton
Re: importing lots of db data. specially formatted. what is the fastest approach?
Well yes, transformation is required. But the data comes from multiple tables etc. - it's not like getting one row from a table, possibly transforming it, and using it. I am thinking of perhaps creating some tables (views) that will have the data ready/flattened and then simply feeding that in, because I'm not sure how much flexibility a transformer will give me. Java is not my number 1 language either :) Thanks for the suggestions - will take a look there.
Synonyms from index
Hi, Is there a possible way to generate synonyms from the index? I have an index where lots of searchable terms turn out to have synonyms, and users use different synonyms too. If not, then the only way is to learn from the query logs and click logs, but in case something exists, please share. Regards, Pooja
Re: DIH example explanation
Any string that is templatized in DIH can have variables like this: ${a.b}. For instance, look at the following: url="http://xyz.com/atom/${dataimporter.request.foo}". If you pass a parameter foo=bar when you invoke the command, the URL invoked becomes http://xyz.com/atom/bar. The variable can come from many places; see this: http://wiki.apache.org/solr/DataImportHandler#head-86408ce7721ea6f9a3f05b12ace8742fd41737d4

--
Noble Paul | Principal Engineer | AOL | http://aol.com
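So, assuming the DataImportHandler is registered at /dataimport, the request-parameter variant of the example above would be invoked like this (foo=bar is a placeholder name/value):

  http://localhost:8983/solr/dataimport?command=full-import&foo=bar

Every request parameter then becomes available inside data-config.xml as ${dataimporter.request.<name>}.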
Re: importing lots of db data. specially formatted. what is the fastest approach?
A transformer can be written in any language - if you are using Java 6, JavaScript support comes out of the box: http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9

--
Noble Paul | Principal Engineer | AOL | http://aol.com
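A minimal sketch of such a script transformer (the function, entity and column names are invented; rows arrive as Java Maps, so the JavaScript calls get/put on them):

  <dataConfig>
    <script><![CDATA[
      // join two db columns into a single Solr field
      function joinName(row) {
        row.put('fullName', row.get('firstName') + ' ' + row.get('lastName'));
        return row;
      }
    ]]></script>
    <document>
      <entity name="person" query="select firstName, lastName from person"
              transformer="script:joinName">
        <field column="fullName" name="fullName"/>
      </entity>
    </document>
  </dataConfig>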
Re: importing lots of db data. specially formatted. what is the fastest approach?
You can also do a table join in a SQL select to pick out the fields you want from multiple tables. You may want to use temporary tables during processing. Once you get the data the way you want it, you can use the CSV request handler to load the output of the SQL select. Bill
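Loading such a CSV export is a single HTTP request; a sketch, with the file name as a placeholder:

  curl 'http://localhost:8983/solr/update/csv?commit=true' \
       --data-binary @products.csv \
       -H 'Content-type: text/plain; charset=utf-8'

The first line of the CSV file names the target schema fields, and commit=true makes the documents visible as soon as the load finishes.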
Re: Random Slowness
We can never reproduce the slowness with the same query. As soon as we try to run them again they are fine. I have even tried running the same query the next day and it is fine. All of our requests go through our dismax handler, which is part of why it is so weird. Most queries are fine, but just occasionally they aren't. Additionally, why would the command=details command also go slow? That seems like a server issue. It appears that for fieldValueCache and filterCache we have no evictions, but for queryResultCache and documentCache there are a good number of evictions. How would I lower the evictions to see if that is the problem? Dismax config below:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.5</float>
    <str name="qf">
      productId^10.0 personality^8.0 subCategory^8.0 category^6.0
      productType^5.0 brandName^1.0 realBrandName^1.0 productNameSearch^1.0
      size^1.2 width^1.0 heelHeight^1.0 productDescription^1.0 color^10.0
      price^1.0 attrs^5.0 expandedGender^0.5
    </str>
    <str name="pf">
      attrs^3 brandName^10.0 productNameSearch^8.0 productDescription^2.0
      personality^4.0 subCategory^12.0 category^10.0 productType^8.0
    </str>
    <str name="fl">
      productId, productName, price, originalPrice, brandNameFacet,
      productRating, imageUrl, productUrl, isNew, onSale, styleId
    </str>
    <str name="mm">100%</str>
    <int name="ps">1</int>
    <int name="qs">5</int>
    <str name="q.alt">*:*</str>
    <!-- More like this search parameters -->
    <str name="mlt.fl">
      brandNameFacet,productTypeFacet,productName,categoryFacet,subCategoryFacet,personalityFacet,colorFacet,heelHeight,expandedGender
    </str>
    <int name="mlt.mindf">1</int>
    <int name="mlt.mintf">1</int>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

--
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562

From: Erik Hatcher e...@ehatchersolutions.com
Date: Wed, 22 Jul 2009 00:36:30 -0400
Subject: Re: Random Slowness

On Jul 21, 2009, at 6:52 PM, Jeff Newburn wrote:
We are experiencing random slowness on certain queries. I have been unable to diagnose what the issue is. We are using Solr 1.4 and 99.99% of queries return in under 250 ms. The remaining queries are returning in 2-5 seconds for no apparent reason. There does not seem to be any commonality between the queries. This problem also includes admin system queries. Any help or direction would be much appreciated.

Do you experience the same slow speeds when you manually issue those queries? In other words, is it repeatable? If so, try debugQuery=true and see the component timings and where the time is going. What's the query parsing to? Anything unusually large due to synonym lists or something like that? What about your filter cache - how's it looking when these slow queries take place? Evictions > 0?

params={facet=true&facet.mincount=1&facet.limit=-1&wt=javabin&rows=0&facet.sort=true&start=0&q=shoes&facet.field=colorFacet&facet.field=brandNameFacet&facet.field=heelHeight&facet.field=attrFacet_Style&qt=dismax&fq=productTypeFacet:Shoes&fq=gender:Womens&fq=categoryFacet:Sandals&fq=width:EE&fq=size:10.5&fq=priceFacet:$100.00+and+Under&fq=personalityFacet:Sexy} hits=19 status=0 QTime=3689

What's the config of your dismax handler look like?
Erik
Re: Random Slowness
I haven't read this whole thread, so maybe it's already come up: have you turned on garbage collection logging to see if the JVM is busy cleaning up when you are seeing the slowness? Maybe the JVM is struggling to keep the heap size within a particular limit? //Ed
Re: Random Slowness
Ed, How do I go about enabling the GC logging for Solr?
--
Jeff Newburn
Software Engineer, Zappos.com
jnewb...@zappos.com - 702-943-7562
Re: Random Slowness
On Wed, Jul 22, 2009 at 10:44 AM, Jeff Newburn jnewb...@zappos.com wrote:
How do I go about enabling the GC logging for Solr?

It depends how you are running Solr. You basically want to make sure that when the JVM is started up with the java command, it gets some additional arguments [1]. So for example if you are running Solr using Jetty you would:

  java -verbose:gc -Xloggc:solr_gc.log -jar start.jar

And then poke around in the log looking for garbage collection events that take as long as the pauses you are seeing in your app. I think there are tools that will help you analyze the log files if you need them. If there is a correlation you'll probably want to tune your Solr memory usage with -Xmx and -Xms. Hope this helps. //Ed

[1] http://java.sun.com/javase/7/docs/technotes/tools/windows/java.html
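Putting the pieces together, a fuller invocation might look like this (the heap sizes are placeholders to be tuned for your index and machine):

  java -verbose:gc -XX:+PrintGCDetails -Xloggc:solr_gc.log \
       -Xms1024m -Xmx1024m -jar start.jar

Setting -Xms equal to -Xmx avoids heap-resizing pauses, and the detailed GC log makes it easier to line up long collections with slow queries in the Solr request log.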
Re: solr 1.3.0 and Oracle Fusion Middleware
Let's keep this communication on the list so others can benefit and chime in. What about the filter-dispatched-requests-enabled setting? Perhaps it doesn't use the weblogic.xml file anymore and you'll need to find the new way to configure that setting. From what I can see, that setting will default to true now if you are using a web.xml defined as 2.4 (according to the WebLogic 9 docs). Solr is using 2.3 at the moment - you might try changing the web.xml from 2.3 to 2.4, or figure out how to adjust that setting (filter-dispatched-requests-enabled) with your current container. It will default to true with a web.xml 2.4 for back compat.
- Mark

On Wed, Jul 22, 2009 at 10:35 AM, Hall, David dh...@vermeer.com wrote:
Mark - Thanks for the info. I took a look at the two URLs, and even though it is not true WebLogic - i.e. this is the Oracle OC4J, not the WebLogic Java containers from the pre-Oracle acquisition - I have tried to remove the encoding from the header and created the weblogic.xml, bounced the container and re-tried. However this did not fix the issue. I think this is the correct direction... maybe just a little different. Maybe it needs to be put in the web.xml. (I am not using WebLogic (the Oracle Portal replacement) directly - just the Oracle Java container. I don't know if that makes any difference.) Here are my observations on this issue: when I hit solr/admin I do get a page - it is just missing the pretty stuff. Statistics totally do not work; I get stackoverflow errors in the opmn/log for this container. Below is what I can see from Paros...

solr-admin.css
HTTP/1.1 500 Internal Server Error
Date: Wed, 22 Jul 2009 14:21:22 GMT
Server: Oracle-Application-Server-10g/10.1.3.4.0 Oracle-HTTP-Server
Content-Location: https://testportalapp.vermeer.com/solr/admin/solr-admin.css
Content-Type: text/html
Connection: close

500 Internal Server Error
null java.lang.StackOverflowError
 at java.security.AccessController.doPrivileged(Native Method)
 at java.io.PrintWriter.<init>(PrintWriter.java:77)
 at java.io.PrintWriter.<init>(PrintWriter.java:61)
 at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:316)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:281)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)
 at com.evermind.server.http.FileRequestDispatcher$2.oc4jRun(FileRequestDispatcher.java:254)
 at oracle.oc4j.security.OC4JSecurity.doPrivileged(OC4JSecurity.java:284)
 at com.evermind.server.http.FileRequestDispatcher.forwardInternal(FileRequestDispatcher.java:259)
 at com.evermind.server.http.FileRequestDispatcher.forward(FileRequestDispatcher.java:346)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:273)
 at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135)
 at com.evermind.server.http.FileRequestDispatcher.unprivileged_forwardInternal(FileRequestDispatcher.java:283)
 at com.evermind.server.http.FileRequestDispatcher.access$100(FileRequestDispatcher.java:29)

--
Mark
http://www.lucidimagination.com
Re: US/UK/CA/AU English support
On Jul 22, 2009, at 5:09 AM, prerna07 wrote:
Hi, 1) Out of US/UK/CA/AU, which English does Solr support?

Please clarify what you mean by support. The only thing in Solr that is potentially language-dependent is the Tokenizers and TokenFilters, and those are completely pluggable. For tokenization, I'd say all are supported, since all of those languages are whitespace-delimited. For things like stemming and synonyms I'm not sure, but I suspect many of the existing capabilities will work in most cases, which is all one can ever expect no matter the language.

2) PhoneticFilterFactory performs search for similar-sounding words. For example: a search on carat will give results for carat, caret and carrat. I also observed that PhoneticFilterFactory supports linguistic variation for US/UK/CA/AU. For example: a search on Optimize gives results for optimise and optimize. Question: Does PhoneticFilterFactory support all characters/words of the linguistic variations for US/UK/CA/AU, or will linguistic search for US/UK/CA/AU be a subset of phonetic search?

I would think so, but I might suggest either using the Admin analysis capabilities and doing some tests with the various FieldTypes, or automating some more tests by using the AnalysisRequestHandler (or whatever it is called these days).
-Grant

--
Grant Ingersoll http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
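One easy way to run such tests is to define a throwaway field type wired to the phonetic filter and paste sample text into the admin analysis page. A sketch (the type name is arbitrary; DoubleMetaphone is one of the supported encoders):

  <fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- inject=true keeps the original token alongside its phonetic code -->
      <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    </analyzer>
  </fieldType>

Whether optimise/optimize collide then depends entirely on the encoder producing the same code for both, so it is worth verifying each spelling pair you care about rather than assuming full US/UK/CA/AU coverage.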
Re: Best approach to multiple languages
How do you want to search those descriptions? Do you know the query language going in?

--
Grant Ingersoll http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Best approach to multiple languages
FWIW, this approach is essentially what we did at the Library of Congress to support multi-lingual fulltext search in the World Digital Library [1] webapp. It seems to have paid off pretty well, since we were able to configure analysis on a per-language basis. In case you are curious, I've attached a copy of our schema.xml to give you an idea of what we did. //Ed

[1] http://www.wdl.org/

<?xml version="1.0" encoding="ISO-8859-15"?>
<schema name="example" version="1.1">
  <!-- Note: there are lots more types available, see original schema.xml for the full picture. -->
  <types>
    <fieldType name="string" class="solr.StrField" omitNorms="true" sortMissingLast="true"/>
    <fieldType name="integer" class="solr.SortableIntField" omitNorms="true"/>
    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

    <!-- default -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_eng" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_por" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="brazilian-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_fra" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="french-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_spa" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="spanish-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="suggest_text_rus" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="russian-stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Arabic (based on aramorph) -->
    <fieldType name="text_arabic" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.ArabicTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ArabicTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- ArabicAnalyser => ArabicTokenizer => ArabicStemmer => ArabicGrammaticalFilter -->
    <fieldType name="text_arabic_analyzed" class="solr.TextField">
      <analyzer type="index" class="solr.ArabicAnalyzer"/>
      <analyzer type="query" class="solr.ArabicAnalyzer"/>
    </fieldType>

    <!-- Brazilian (Portuguese) -->
    <fieldType name="text_brazilian" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory
DataImportHandler / Import from DB: one data set comes in multiple rows
Hi all, this is my first post, as I am new to Solr (some Lucene experience). I am trying to load data from an existing datamart into Solr using the DataImportHandler, but in my opinion it is too slow due to the special structure of the datamart I have to use.

Root cause: This datamart uses a row-based approach (pivot) to present its data. It was done this way to allow adding more attributes to a data set without having to change the table structure.

Impact: To use the DataImportHandler, I have to pivot the data to create one row per data set again. Unfortunately, this results in more, and less performant, queries. Moreover, there are sometimes multiple rows for a single attribute, which require separate queries - or more tricky subselects that probably don't speed things up. Here is an example of the relation between DB requests, row fetches and the actual number of documents created:

<lst name="statusMessages">
  <str name="Total Requests made to DataSource">3737</str>
  <str name="Total Rows Fetched">5380</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-07-22 18:19:06</str>
  <str name="">Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.</str>
  <str name="Committed">2009-07-22 18:22:29</str>
  <str name="Optimized">2009-07-22 18:22:29</str>
  <str name="Time taken">0:3:22.484</str>
</lst>

(Full index creation.) There are about half a million data sets in total. That would require about 30h for indexing? My feeling is that there are far too many row fetches per data set. I am testing on a smaller machine (2GB, Windows :-( ), Tomcat 6 using around 680MB RAM, Java 6. I haven't changed the Lucene configuration (merge factor 10, RAM buffer size 32). Possible solutions?
A) Write my own DataImportHandler?
B) Write my own MultiRowTransformer that accepts several rows as input (not sure this is a valid option)?
C) Approach the DB developers to add a flat table with one data set per row?
D) ...?
If someone would like to share their experiences, that would be great! Thanks a lot! Chantal
--
Chantal Ackermann
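Regarding option C: if the DB developers can't add a flat table, a view that pivots the attribute rows back into one row per data set is often enough, and DIH can then consume it with a single entity query. A sketch in plain SQL - the table and attribute names are invented for illustration:

  SELECT doc_id,
         MAX(CASE WHEN attr_name = 'title'  THEN attr_value END) AS title,
         MAX(CASE WHEN attr_name = 'author' THEN attr_value END) AS author,
         MAX(CASE WHEN attr_name = 'year'   THEN attr_value END) AS year
  FROM   datamart_attributes
  GROUP  BY doc_id;

One query then fetches one row per document instead of one row per attribute, which directly attacks the requests-per-document ratio visible in the status output above.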
Re: Best approach to multiple languages
Hi, We will know the user's language choice before searching. Regards, Andrew
Re: Best approach to multiple languages
On 22.07.2009 at 18:31, Ed Summers wrote:
In case you are curious, I've attached a copy of our schema.xml to give you an idea of what we did.

Thanks for sharing!
--
Olivier Dobberkau
LocalSolr - order of fields on xml response
Hi folks, When I do a query with LocalSolr to get the geo_distance, the order of the XML fields is different from a standard query. It's a simple query, like this: http://myhost.com:8088/solr/core/select?qt=geo&x=-46.01&y=-23.01&radius=15&sort=geo_distance+asc&q=*:* Is this the expected behavior of LocalSolr? Thanks!
--
Daniel Cassiano
http://www.apontador.com.br/
http://www.maplink.com.br/
Re: Best approach to multiple languages
Typically there are three options that people use:
1. Put 'em all in one big field
2. Split fields (as you and others have described) - not sure why no one ever splits on documents, which is viable too, but comes with repeated data
3. Split indexes
For your case, #1 isn't going to work since you want to search language-specifically. I'd likely go with #2, but #3 has its merits too. #3 allows for managing the languages separately (you can update the Spanish document without affecting the English version, and can also take a whole collection offline without affecting the other indexes), which can sometimes be helpful, but the cost is more operational complexity, etc.
-Grant
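A minimal sketch of option #2 in schema.xml, assuming per-language analyzer types (text_en, text_de, text_fr) are defined along the lines of the schema Ed posted earlier in this thread:

  <field name="description_en" type="text_en" indexed="true" stored="true"/>
  <field name="description_de" type="text_de" indexed="true" stored="true"/>
  <field name="description_fr" type="text_fr" indexed="true" stored="true"/>

Since the user's language is known before searching, the application can simply direct the query at the matching field, e.g. q=description_de:(sandalen).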
Re: Synonyms from index
Hi, There is nothing built-in. It might be possible to infer whether two words are synonyms, but that's really not strictly a search thing, so it's not likely to be added to Solr in the near future. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Pooja Verlani pooja.verl...@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, July 22, 2009 7:41:56 AM Subject: Synonyms from index Hi, Is there a possible way to generate synonyms from the index? I have an index where many searchable terms turn out to have synonyms, and users use different synonyms too. If not, the only way is to learn from the query logs and click logs, but in case there exists one, please share. regards, Pooja
Re: Random Slowness
Or simply attach to the JVM with JConsole and watch the GC from there. You'd have to watch things (logs and JConsole) closely though, and correlate the slow query periods with a GC spike. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Ed Summers e...@pobox.com To: solr-user@lucene.apache.org Sent: Wednesday, July 22, 2009 11:03:08 AM Subject: Re: Random Slowness On Wed, Jul 22, 2009 at 10:44 AM, Jeff Newburn wrote: How do I go about enabling the gc logging for solr? It depends how you are running solr. You basically want to make sure that when the JVM is started up with the java command, it gets some additional arguments [1]. So for example if you are running solr using jetty you would: java -verbose:gc -Xloggc:solr_gc.log -jar start.jar And then poke around in the log looking for garbage collection events that take as long as the pauses you are seeing in your app. I think there are tools that will help you analyze the log files if you need them. If there is a correlation you'll probably want to tune your solr memory usage with -Xmx and -Xms. Hope this helps. //Ed [1] http://java.sun.com/javase/7/docs/technotes/tools/windows/java.html
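If the GC log does show long pauses, the heap flags go on the same command line as the logging flags; the sizes below are placeholders to be tuned against what JConsole shows, not recommendations:

java -verbose:gc -Xloggc:solr_gc.log -Xms512m -Xmx1024m -jar start.jar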
Re: excluding certain terms from facet counts when faceting based on indexed terms of a field
: I am faceting based on the indexed terms of a field by using facet.field. : Is there any way to exclude certain terms from the facet counts? if you're talking about a lot of terms, and they're going to be the same for *all* queries, the best approach is to strip them out when indexing (StopFilterFactory is your friend) -Hoss
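A sketch of what that index-time approach could look like in schema.xml, assuming each facet value should stay a single token so the stopword list matches whole values; the type name and stopword file name are made up for illustration:

<fieldType name="facetString" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- keep each field value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- drop the unwanted values at index time, so they never appear in facet counts -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="facet-stopwords.txt"/>
  </analyzer>
</fieldType>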
RE: solr 1.3.0 and Oracle Fusion Middleware
Thanks for the feedback... I tried to add what is below directly to the web.xml file right after the web-app tag and bounced the OC4J - still the same issue.
<container-descriptor>
  <filter-dispatched-requests-enabled>false</filter-dispatched-requests-enabled>
</container-descriptor>
I checked other applications running on 10.1.3 OC4J and they are also using 2.3 web.xml. Regardless - I tried to change it to 2.4, and now the container starts up without errors (with the above filter statement still in the xml file), but I get a different error - here is the first subset:
SEVERE: java.lang.StackOverflowError
at sun.util.calendar.ZoneInfo.getTransitionIndex(ZoneInfo.java:288)
at sun.util.calendar.ZoneInfo.getOffsets(ZoneInfo.java:238)
at sun.util.calendar.ZoneInfo.getOffsets(ZoneInfo.java:215)
at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:1998)
at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:1970)
at java.util.Calendar.setTimeInMillis(Calendar.java:1066)
at java.util.Calendar.setTime(Calendar.java:1032)
at java.text.SimpleDateFormat.format(SimpleDateFormat.java:785)
at java.text.SimpleDateFormat.format(SimpleDateFormat.java:778)
at java.text.DateFormat.format(DateFormat.java:274)
at java.text.Format.format(Format.java:133)
at java.text.MessageFormat.subformat(MessageFormat.java:1279)
at java.text.MessageFormat.format(MessageFormat.java:787)
at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:50)
at java.util.logging.StreamHandler.publish(StreamHandler.java:179)
at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:88)
at java.util.logging.Logger.log(Logger.java:428)
at java.util.logging.Logger.doLog(Logger.java:450)
at java.util.logging.Logger.log(Logger.java:473)
at java.util.logging.Logger.severe(Logger.java:960)
at org.apache.solr.common.SolrException.log(SolrException.java:132)
at org.apache.solr.common.SolrException.logOnce(SolrException.java:150)
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:319)
From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Wednesday, July 22, 2009 10:04 AM To: Hall, David; solr-user@lucene.apache.org Subject: Re: solr 1.3.0 and Oracle Fusion Middleware Let's keep this communication on the list so others can benefit and chime in. What about the filter-dispatched-requests-enabled setting? Perhaps it doesn't use the weblogic.xml file anymore and you'll need to find the new way to configure that setting? From what I can see, that setting will default to true now if you are using a web.xml defined as 2.4 (according to WebLogic 9 docs). Solr is using 2.3 at the moment - you might try changing the web.xml to 2.4 from 2.3, or figure out how to adjust that setting (filter-dispatched-requests-enabled) with your current container. It will default to true with a web.xml 2.4 for back compat. - Mark On Wed, Jul 22, 2009 at 10:35 AM, Hall, David dh...@vermeer.com wrote: Mark --- Thanks for the info - I took a look at the two URLs, and even though it is not true WebLogic - i.e. this is the Oracle OC4J, not the WebLogic Java containers from the pre-Oracle acquisition - I have tried to remove the encoding from the header and created the weblogic.xml, bounced the container and re-tried. However this did not fix the issue. I think this is the correct direction... maybe just a little different. Maybe it needs to be put in the web.xml. (I am not using WebLogic (Oracle Portal replacement) directly - just the Oracle Java container. I don't know if that makes any difference.)
Here are my observations on this issue: When I hit solr/admin I do get a page - it is just missing the pretty stuff. Statistics totally do not work. I get StackOverflowErrors in the opmn log for this container. Below is what I can see from Paros...
solr-admin.css HTTP/1.1 500 Internal Server Error
Date: Wed, 22 Jul 2009 14:21:22 GMT
Server: Oracle-Application-Server-10g/10.1.3.4.0 Oracle-HTTP-Server
Content-Location: https://testportalapp.vermeer.com/solr/admin/solr-admin.css
Content-Type: text/html
Connection: close
500 Internal Server Error null java.lang.StackOverflowError
at java.security.AccessController.doPrivileged(Native Method)
at java.io.PrintWriter.<init>(PrintWriter.java:77)
at java.io.PrintWriter.<init>(PrintWriter.java:61)
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:316)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:281)
at com.evermind.server.http.FileRequestDispatcher.handleWithFilter(FileRequestDispatcher.java:135) at
Re: Storing string field in solr.ExternalFileField type
Hoping the experts chime in if I'm wrong, but as far as I know, while storing a field increases the size of an index, it doesn't have much impact on search speed. You could pretty easily test this by creating the index both ways and firing off some timing queries and comparing, although it would be time consuming... I believe there's some info on the Lucene wiki about this, but my memory isn't what it used to be. Erick On Tue, Jul 21, 2009 at 2:42 PM, Jibo John jiboj...@mac.com wrote: We're in the process of building a log searcher application. In order to reduce the index size to improve the query performance, we're exploring the possibility of having: 1. One field for each log line with 'indexed=true stored=false' that will be used for searching 2. Another field for each log line of type solr.ExternalFileField that will be used only for display purposes. We realized that currently solr.ExternalFileField supports only float type. Is there a way we can override this to support string type? Any issues with this approach? Any ideas are welcome. Thanks, -Jibo
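One rough way to run that comparison with SolrJ: build the index each way, replay the same queries against both, and compare Solr's reported QTime. The query string is a placeholder and 'server' stands for an existing SolrServer; a sketch:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

// Fire a representative query and read back Solr's own timing in ms;
// repeat over many queries and compare averages between the two index layouts.
SolrQuery q = new SolrQuery("loglines:exception");
QueryResponse rsp = server.query(q);   // 'server' is an existing SolrServer
System.out.println("QTime (ms): " + rsp.getQTime());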
Re: Behaviour when we get more than 1 million hits
That's still not very useful. Additional processing? Where - some client that you return all the data to? In which case SOLR is the least of your concerns; your network speed counts more. At a blind guess I'd worry more about how you're doing your additional processing than solr. Erick On Wed, Jul 22, 2009 at 6:38 AM, Rakhi Khatwani rkhatw...@gmail.com wrote: Hi, There is this particular scenario where I want to search for a product and I get a million records which will be given for further processing. Regards, Raakhi On Mon, Jul 13, 2009 at 7:33 PM, Erick Erickson erickerick...@gmail.com wrote: It depends (tm) on what you try to do with the results. You really need to give us some more details on what you want to *do* with 1,000,000 hits before any meaningful response is possible. Best Erick On Mon, Jul 13, 2009 at 8:47 AM, Rakhi Khatwani rkhatw...@gmail.com wrote: Hi, While using Solr, what would the behaviour be if we perform a search and get more than one million hits? Regards, Raakhi
Re: SolrException - Lock obtain timed out, no leftover locks
My only guess here is that you are using SolrJ in an embedded sense, not via HTTP, and something about the code you have in your MyIndexers class causes two different threads to attempt to create two different cores (or perhaps the same core) using identical data directories at the same time. either that: or maybe there is a bug in the CoreAdmin functionality for creating/opening a new core resulting from improper synchronization. it would help to have the full stack trace of the Lock timed out exception, and to know more details about how exactly your code goes about creating new cores on the fly. : I'm running Solr 1.3.0 in multicore mode and feeding it data from which the : core name is inferred from a specific field. My service extracts the core : name and, if it has not seen it before, issues a create request for that : core before attempting to add the document (via SolrJ). I have a pool of : MyIndexers that run in parallel, taking documents from a queue and adding : them via the add method on the SolrServer instance corresponding to that : core (exactly one per core exists). Each core is in a separate data : directory. My timeouts are set as such: : : <writeLockTimeout>15000</writeLockTimeout> : <commitLockTimeout>25000</commitLockTimeout> : : I remove the index directories, start the server, check that no locks exist, : and generate ~500 documents spread across 5 cores for the MyIndexers to : handle. Each time, I see one or more exceptions with a message like : : Lock_obtain_timed_out_SimpleFSLockmulticoreNewUser3dataindexlucenebd4994617386d14e2c8c29e23bcca719writelock__orgapachelucenestoreLockObtainFailedException_Lock_obtain_timed_out_... : : When the indexers have completed, no lock is left over. There is no : discernible pattern as far as when the exception occurs (ie, it does not : tend to happen on the first or last or any particular document). : : Interestingly, this problem does not happen when I have only a single : MyIndexer, or if I have a pool of MyIndexers and am running in single core : mode. : : I've looked at the other posts from users getting this exception but it : always seemed to be a different case, such as the server having crashed : previously and a lock file being left over. : : -- : View this message in context: http://www.nabble.com/SolrException---Lock-obtain-timed-out%2C-no-leftover-locks-tp24393255p24393255.html : Sent from the Solr - User mailing list archive at Nabble.com. : -Hoss
Re: SolrException - Lock obtain timed out, no leftover locks
Sorry, I thought I had removed this posting. I am running Solr over HTTP, but (as you surmised) I had a concurrency bug. Thanks for the response. Dan hossman wrote: My only guess here is that you are using SolrJ in an embedded sense, not via HTTP, and something about the code you have in your MyIndexers class causes two different threads to attempt to create two different cores (or perhaps the same core) using identical data directories at the same time. either that: or maybe there is a bug in the CoreAdmin functionality for creating/opening a new core resulting from improper synchronization. it would help to have the full stack trace of the Lock timed out exception, and to know more details about how exactly your code goes about creating new cores on the fly. : I'm running Solr 1.3.0 in multicore mode and feeding it data from which the : core name is inferred from a specific field. My service extracts the core : name and, if it has not seen it before, issues a create request for that : core before attempting to add the document (via SolrJ). I have a pool of : MyIndexers that run in parallel, taking documents from a queue and adding : them via the add method on the SolrServer instance corresponding to that : core (exactly one per core exists). Each core is in a separate data : directory. My timeouts are set as such: : : <writeLockTimeout>15000</writeLockTimeout> : <commitLockTimeout>25000</commitLockTimeout> : : I remove the index directories, start the server, check that no locks exist, : and generate ~500 documents spread across 5 cores for the MyIndexers to : handle. Each time, I see one or more exceptions with a message like : : Lock_obtain_timed_out_SimpleFSLockmulticoreNewUser3dataindexlucenebd4994617386d14e2c8c29e23bcca719writelock__orgapachelucenestoreLockObtainFailedException_Lock_obtain_timed_out_... : : When the indexers have completed, no lock is left over. There is no : discernible pattern as far as when the exception occurs (ie, it does not : tend to happen on the first or last or any particular document). : : Interestingly, this problem does not happen when I have only a single : MyIndexer, or if I have a pool of MyIndexers and am running in single core : mode. : : I've looked at the other posts from users getting this exception but it : always seemed to be a different case, such as the server having crashed : previously and a lock file being left over. : : -- : View this message in context: http://www.nabble.com/SolrException---Lock-obtain-timed-out%2C-no-leftover-locks-tp24393255p24393255.html : Sent from the Solr - User mailing list archive at Nabble.com. : -Hoss -- View this message in context: http://www.nabble.com/SolrException---Lock-obtain-timed-out%2C-no-leftover-locks-tp24393255p24616034.html Sent from the Solr - User mailing list archive at Nabble.com.
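For the archives, since others hit this too: a minimal sketch of one way to avoid that kind of bug - exactly one SolrServer per core, created at most once even when indexer threads race on a new core name. The class name, URL and CREATE handling below are illustrative, not Dan's actual code:

import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CoreRegistry {
    private final ConcurrentHashMap<String, SolrServer> servers =
        new ConcurrentHashMap<String, SolrServer>();

    // Returns the single server instance for a core; putIfAbsent ensures
    // threads racing on a new core name agree on exactly one winner.
    public SolrServer serverFor(String core) throws Exception {
        SolrServer s = servers.get(core);
        if (s == null) {
            SolrServer fresh =
                new CommonsHttpSolrServer("http://localhost:8983/solr/" + core);
            SolrServer prev = servers.putIfAbsent(core, fresh);
            if (prev == null) {
                // this thread won the race: issue the CoreAdmin CREATE request
                // here, before any documents are added; a real version would
                // also make losing threads wait until CREATE has completed
                s = fresh;
            } else {
                s = prev;
            }
        }
        return s;
    }
}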
Re: LocalSolr - order of fields on xml response
ya... 'expected', but perhaps not ideal. As is, LocalSolr munges the document on its way out the door to add the distance. When LocalSolr makes it into the source, it will likely use a method like: https://issues.apache.org/jira/browse/SOLR-705 to augment each document with the calculated distance. This will at least have consistent behavior. On Jul 22, 2009, at 10:47 AM, Daniel Cassiano wrote: Hi folks, When I do some query with LocalSolr to get the geo_distance, the order of xml fields is different from a standard query. It's a simple query, like this: http://myhost.com:8088/solr/core/select?qt=geo&x=-46.01&y=-23.01&radius=15&sort=geo_distance%20asc&q=*:* Is this an expected behavior of LocalSolr? Thanks! -- Daniel Cassiano _ http://www.apontador.com.br/ http://www.maplink.com.br/
how to get all the docIds in the search result?
When I use: SolrQuery query = new SolrQuery(); query.set("q", "issn:0002-9505"); query.setRows(10); QueryResponse response = server.query(query); I can only get the 10 ids in the response. How can I get all the docIds in the search result? Thanks.
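Solr only returns as many ids as rows asks for; there is no single call that hands back every id. The usual pattern is to read numFound from the first response and page through with start/rows. A sketch, assuming the unique key field is named 'id' (adjust to your schema), an arbitrary page size, and an existing 'server':

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

SolrQuery query = new SolrQuery();
query.set("q", "issn:0002-9505");
int pageSize = 100;
query.setRows(pageSize);

long fetched = 0;
long total = Long.MAX_VALUE;             // corrected after the first response
while (fetched < total) {
    query.setStart((int) fetched);
    QueryResponse response = server.query(query);
    SolrDocumentList page = response.getResults();
    total = page.getNumFound();          // total matches, not just this page
    if (page.isEmpty()) break;           // defensive: stop if the index shrank mid-paging
    for (SolrDocument doc : page) {
        System.out.println(doc.getFieldValue("id"));  // assumes unique key field 'id'
    }
    fetched += page.size();
}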
Re: importing lots of db data. specially formated. what is fasted approach?
look at this http://wiki.apache.org/solr/DIHQuickStart#head-532678fa5d0d9b33880abeb4d4995562014f8ef9 to know how to fetch data from multiple tables On Wed, Jul 22, 2009 at 4:57 PM, Julian Davchevj...@drun.net wrote: Well yes, transformation is required. But it's like data coming from multiple tables.. etc. It's not like getting one row from table and possibly transforming it and using it. I am thinking perhaps to create some tables (views) that will have the data ready/flattened and then simply feeding it. Cause not sure how much flexibility transformer will give me. Java not number 1 language either :) Thanks for suggestions. Will get a look there. Avlesh Singh wrote: As Noble has already said, transforming content before indexing a very common requirement. DataImportHandler's Transformer lets you achieve this. Read up on the same here - http://wiki.apache.org/solr/DataImportHandler#head-a6916b30b5d7605a990fb03c4ff461b3736496a9 Cheers Avlesh 2009/7/22 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com if each field from the db goes to a separate field in solr as-is . Then it is very simple. if you need to split/join fields before feeding it into solr fields you may need to apply transformers an example on how your db field looks like and how you wish it to look like in solr would be helpful On Wed, Jul 22, 2009 at 11:57 AM, Julian Davchevj...@drun.net wrote: Hi folks, I have around 50k documents that are reindexed now and then. Question is what would be the fastest approach to all this. Data is just text ~20fields or so. It comes from database but is first specially formated to get to format suitable for passing in solr. Currently xml post is used but have the feeling this is not optimal for speed wise when it is up to bulk import/reindex. I see http://wiki.apache.org/solr/DataImportHandler but kinda fail to see howto do this specially formated data so solr makes use of it. Are there some real examples,articles on howto use this? -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- - Noble Paul | Principal Engineer| AOL | http://aol.com
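For the multiple-tables part, the shape in data-config.xml is one nested entity per child table - the child query runs once per parent row, and multiple child rows land in a multiValued field. A sketch in the style of the quick-start page; table and column names are placeholders:

<document>
  <!-- one Solr document per row of the parent query -->
  <entity name="item" query="select id, name from item">
    <!-- executed once per parent row; ${item.id} refers to the outer entity -->
    <entity name="feature"
            query="select description from feature where item_id='${item.id}'"/>
  </entity>
</document>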
Re: DataImportHandler / Import from DB : one data set comes in multiple rows
alternately, you can write your own EntityProcessor and just override nextRow(). I guess you can still use the JdbcDataSource On Wed, Jul 22, 2009 at 10:05 PM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi all, this is my first post, as I am new to SOLR (some Lucene exp). I am trying to load data from an existing datamart into SOLR using the DataImportHandler, but in my opinion it is too slow due to the special structure of the datamart I have to use. Root Cause: This datamart uses a row-based approach (pivot) to present its data. It was done this way to allow adding more attributes to the data set without having to change the table structure. Impact: To use the DataImportHandler, I have to pivot the data to create again one row per data set. Unfortunately, this results in more queries that are also less performant. Moreover, there are sometimes multiple rows for a single attribute, which require separate queries - or trickier subselects that probably don't speed things up. Here is an example of the relation between DB requests, row fetches and the actual number of documents created:
<lst name="statusMessages">
  <str name="Total Requests made to DataSource">3737</str>
  <str name="Total Rows Fetched">5380</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-07-22 18:19:06</str>
  <str name="">Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.</str>
  <str name="Committed">2009-07-22 18:22:29</str>
  <str name="Optimized">2009-07-22 18:22:29</str>
  <str name="Time taken ">0:3:22.484</str>
</lst>
(Full index creation.) There are about half a million data sets in total. That would require about 30h for indexing? My feeling is that there are far too many row fetches per data set. I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge factor 10, ram buffer size 32). Possible solutions? A) Write my own DataImportHandler? B) Write my own MultiRowTransformer that accepts several rows as input argument (not sure this is a valid option)? C) Approach the DB developers to add a flat table with one data set per row? D) ...? If someone would like to share their experiences, that would be great! Thanks a lot! Chantal -- Chantal Ackermann -- - Noble Paul | Principal Engineer| AOL | http://aol.com
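A rough sketch of that EntityProcessor idea against Chantal's pivoted datamart: keep pulling attribute rows until the data-set key changes, folding them into one returned row. The class name and the SET_ID/ATTR/VAL column names are invented, it assumes the entity query orders rows by SET_ID, and it assumes EntityProcessorBase's getNext() helper (the one SqlEntityProcessor reads the row iterator with) - internals vary by version, so treat this as an outline only:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class PivotEntityProcessor extends EntityProcessorBase {
    private Map<String, Object> pending;  // first row of the *next* data set

    @Override
    public Map<String, Object> nextRow() {
        // assumption: getNext() yields one raw attribute row per call
        Map<String, Object> row = (pending != null) ? pending : getNext();
        pending = null;
        if (row == null) return null;     // source exhausted: end of entity

        Map<String, Object> doc = new HashMap<String, Object>();
        Object key = row.get("SET_ID");   // rows must arrive sorted by SET_ID
        doc.put("SET_ID", key);
        doc.put((String) row.get("ATTR"), row.get("VAL"));

        // fold attribute rows into this document until the key changes
        Map<String, Object> next;
        while ((next = getNext()) != null) {
            if (!key.equals(next.get("SET_ID"))) {
                pending = next;           // belongs to the following document
                break;
            }
            doc.put((String) next.get("ATTR"), next.get("VAL"));
        }
        return doc;
    }
}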