date:20100602


my batchSize is -1 and the load ist to big for us. why i should increase it ? 

what is a normal serverload ? our server is a fast server. 4 cores 3 GB Ram
 but we dont want a serverload from over 2 when index a starts.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Full-Import-DB-and-Performance-tp861068p864297.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: logic for auto-index

2010-06-02 Thread findbestopensource

You need to do schedule your task. Check out schedulers available in all
programming languages.
http://www.findbestopensource.com/tagged/job-scheduler

Regards
Aditya
www.findbestopensource.com



On Wed, Jun 2, 2010 at 2:39 PM, Jonty Rhods jonty.rh...@gmail.com wrote:

 Hi Peter,

 actually I want the index process should start automatically. right now I
 am
 doing mannually.
 same thing I want to start indexing when less load on server i.e. late
 night. So setting auto will fix my
 problem..

  On Wed, Jun 2, 2010 at 2:00 PM, Peter Karich peat...@yahoo.de wrote:

  Hi Jonty,
 
  what is your specific problem?
  You could use a cronjob or the Java-lib called quartz to automate this
  task.
  Or did you mean replication?
 
  Regards,
  Peter.
 
   Hi All,
  
   I am very new to solr as well as java too.
   I require to use solrj for indexing also require to index automatically
  once
   in 24 hour.
   I wrote java code for indexing now I want to do further coding for
  automatic
   process.
   Could you suggest or give me sample code for automatic index process..
   please help..
  
   with regards
   Jonty.

Re: Array of arguments in URL?

2010-06-02 Thread Grant Ingersoll

Those aren't in the default parameters.  They are config for the SearchHandler 
itself.

On Jun 1, 2010, at 9:00 PM, Lance Norskog wrote:

 In the /spell declaration in the example solrconfig.xml, we find
 these lines among the default parameters:
 
arr name=last-components
  strspellcheck/str
/arr
 
 How does one supply such an array of strings in HTTP parameters? Does
 Solr have a parsing option for this?
 
 -- 
 Lance Norskog
 goks...@gmail.com

Re: Importing large datasets

2010-06-02 Thread Grant Ingersoll


On Jun 1, 2010, at 9:54 PM, Blargy wrote:

 
 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 
 
 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.

As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).  

When you say quite large, what do you mean?  Are we talking books here or 
maybe a couple pages of text or just a couple KB of data?

How long does it take you to get that data out (and, from the sounds of it, 
merge it with your item) w/o going to Solr?

 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?

DataImportHandler now supports multiple threads.  The absolute fastest way that 
I know of to index is via multiple threads sending batches of documents at a 
time (at least 100).  Often, from DBs one can split up the table via SQL 
statements that can then be fetched separately.  You may want to write your own 
multithreaded client to index.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search

Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki

On 2010-06-02 12:42, Grant Ingersoll wrote:
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 

 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 

 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  
 
 When you say quite large, what do you mean?  Are we talking books here or 
 maybe a couple pages of text or just a couple KB of data?
 
 How long does it take you to get that data out (and, from the sounds of it, 
 merge it with your item) w/o going to Solr?
 
 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The absolute fastest way 
 that I know of to index is via multiple threads sending batches of documents 
 at a time (at least 100).  Often, from DBs one can split up the table via SQL 
 statements that can then be fetched separately.  You may want to write your 
 own multithreaded client to index.

SOLR-1301 is also an option if you are familiar with Hadoop ...



-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Importing large datasets

2010-06-02 Thread Grant Ingersoll


On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:

 On 2010-06-02 12:42, Grant Ingersoll wrote:
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 
 
 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 
 
 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  
 
 When you say quite large, what do you mean?  Are we talking books here or 
 maybe a couple pages of text or just a couple KB of data?
 
 How long does it take you to get that data out (and, from the sounds of it, 
 merge it with your item) w/o going to Solr?
 
 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The absolute fastest way 
 that I know of to index is via multiple threads sending batches of documents 
 at a time (at least 100).  Often, from DBs one can split up the table via 
 SQL statements that can then be fetched separately.  You may want to write 
 your own multithreaded client to index.
 
 SOLR-1301 is also an option if you are familiar with Hadoop ...
 

If the bottleneck is the DB, will that do much?

Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki

On 2010-06-02 13:12, Grant Ingersoll wrote:
 
 On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
 
 On 2010-06-02 12:42, Grant Ingersoll wrote:

 On Jun 1, 2010, at 9:54 PM, Blargy wrote:


 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes 
 around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 

 - How would I profile the indexing process to determine if the bottleneck 
 is
 Solr or our Database.

 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  

 When you say quite large, what do you mean?  Are we talking books here or 
 maybe a couple pages of text or just a couple KB of data?

 How long does it take you to get that data out (and, from the sounds of it, 
 merge it with your item) w/o going to Solr?

 - In either case, how would one speed up this process? Is there a way to 
 run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?

 DataImportHandler now supports multiple threads.  The absolute fastest way 
 that I know of to index is via multiple threads sending batches of 
 documents at a time (at least 100).  Often, from DBs one can split up the 
 table via SQL statements that can then be fetched separately.  You may want 
 to write your own multithreaded client to index.

 SOLR-1301 is also an option if you are familiar with Hadoop ...

 
 If the bottleneck is the DB, will that do much?
 

Nope. But the workflow could be set up so that during night hours a DB
export takes place that results in a CSV or SolrXML file (there you
could measure the time it takes to do this export), and then indexing
can work from this file.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Regarding Facet Date query using SolrJ -- Not getting any examples to start with.

2010-06-02 Thread Ninad Raut

Hi,

I want to hit the query given below :

?q=*:*facet=truefacet.date=pubfacet.date.start=2000-01-01T00:00:00Zfacet.date.end=2010-01-01T00:00:00Zfacet.date.gap=%2B1YEAR

using SolrJ. I am browsing the net but not getting any clues about how
should I approach it.  How can SolJ API be used to create above mentioned
Query.

Regards,
Ninad R

Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.

2010-06-02 Thread Geert-Jan Brits

Hi Ninad,

SolrQuery q = new SolrQuery();
q.setQuery(*:*);
q.setFacet(true);
q.set(facet.data, pub);
q.set(facet.date.start, 2000-01-01T00:00:00Z)
... etc.

basically you can completely build your entire query with the 'raw' set (and
add) methods.
The specific methods are just helpers.

So this is the same as above:

SolrQuery q = new SolrQuery();
q.set(q,*:*);
q.set(facet,true);
q.set(facet.data, pub);
q.set(facet.date.start, 2000-01-01T00:00:00Z)
... etc.


Geert-Jan

2010/6/2 Ninad Raut hbase.user.ni...@gmail.com

 Hi,

 I want to hit the query given below :


 ?q=*:*facet=truefacet.date=pubfacet.date.start=2000-01-01T00:00:00Zfacet.date.end=2010-01-01T00:00:00Zfacet.date.gap=%2B1YEAR

 using SolrJ. I am browsing the net but not getting any clues about how
 should I approach it.  How can SolJ API be used to create above mentioned
 Query.

 Regards,
 Ninad R

Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.

2010-06-02 Thread Ninad Raut

Thanks Greet-Jan. Din't know about this trick. [?]

On Wed, Jun 2, 2010 at 5:39 PM, Geert-Jan Brits gbr...@gmail.com wrote:

 Hi Ninad,

 SolrQuery q = new SolrQuery();
 q.setQuery(*:*);
 q.setFacet(true);
 q.set(facet.data, pub);
 q.set(facet.date.start, 2000-01-01T00:00:00Z)
 ... etc.

 basically you can completely build your entire query with the 'raw' set
 (and
 add) methods.
 The specific methods are just helpers.

 So this is the same as above:

 SolrQuery q = new SolrQuery();
 q.set(q,*:*);
 q.set(facet,true);
 q.set(facet.data, pub);
 q.set(facet.date.start, 2000-01-01T00:00:00Z)
 ... etc.


 Geert-Jan

 2010/6/2 Ninad Raut hbase.user.ni...@gmail.com

  Hi,
 
  I want to hit the query given below :
 
 
 
 ?q=*:*facet=truefacet.date=pubfacet.date.start=2000-01-01T00:00:00Zfacet.date.end=2010-01-01T00:00:00Zfacet.date.gap=%2B1YEAR
 
  using SolrJ. I am browsing the net but not getting any clues about how
  should I approach it.  How can SolJ API be used to create above mentioned
  Query.
 
  Regards,
  Ninad R

PHP output at a multiValued AND dynamicField

2010-06-02 Thread Jörg Agatz

Hallo Users...

I have a Problem...
In my SolR, i have a lot of multiValued, dynamicFields and now i must print
ther Fields in php..

But i dont know how...


In schema.xml:

  field name=P_VIP_KUNDE_ID type=string indexed=true stored=true/
  dynamicField name=P_VIP_KUNDE_*  type=textindexed=true
stored=true multiValued=true/
  field name=P_VIP_ADR_ID type=string indexed=true stored=true/
  dynamicField name=P_VIP_ADR_*  type=textindexed=true
stored=true multiValued=true/
  field name=P_VIP_PERSON_ID type=string indexed=true stored=true/
  dynamicField name=P_VIP_PERSON_*  type=textindexed=true
stored=true multiValued=true/
  field name=P_VIP_INFO_ID type=string indexed=true stored=true/
  field name=P_VIP_INFO_TEXT type=text indexed=true stored=true/

output from Solr:

doc
str name=P_FILE_ITEMS_FILENAMEA201005311740560002.xml/str
str name=P_FILE_ITEMS_GESPERRTNO/str
str name=P_FILE_ITEMS_IDA201005311740560002/str
str name=P_FILE_ITEMS_LAST_CHANGE2010-05-31 17:40:56/str
−
str name=P_FILE_ITEMS_PFAD
Q:\DatenIBP\AADMS\telli_vip\xml\A201005311740560002.xml
/str
arr name=P_VIP_ADR_LAND
strD/str
/arr
arr name=P_VIP_ADR_LANDTEXT
str/
/arr
arr name=P_VIP_ADR_ORT
strLeichlingen/str
/arr
arr name=P_VIP_ADR_PLZ
str42799/str
/arr
arr name=P_VIP_ADR_POSTFACH
str/
/arr
arr name=P_VIP_ADR_POSTFACHOR
str/
/arr
arr name=P_VIP_ADR_POSTFACHPL
str/
/arr
arr name=P_VIP_ADR_STRASSE
strSchlo� Eicherhof/str
/arr
str name=P_VIP_ELEMENT_CATADRESS/str
str name=P_VIP_KUNDE_IDKYETG201005311740560002/str
/doc

I don now ha is the name of the Fields, so i dont know how i get the name to
printr it in PHP

Maby someone of you has a answer of the problem?

King

RE: Array of arguments in URL?

2010-06-02 Thread Jonathan Rochkind

You CAN easily turn spellchecking on or off, or set the spellcheck dictionary, 
in request parameters.  So there's really no need, that I can think of,  to try 
to actually add or remove the spellcheck component in request parameters; you 
could just leave it turned off in your default parameters, but turn it on in 
request parameters when you want it.  With 
spellcheck=truespellcheck.dictionary=whatever. 

But I suspect you weren't really asking about spellcheck component, but in 
general, or perhaps for some other specific purpose? I don't think there's any 
general way to pass an array to request parameters. Request parameters that 
take list-like data structures tend to use whitespace to seperate the elements 
instead, to allow you to pass them as request parameters. For instance dismax 
df, pf, etc fields, elements ordinarily seperated by newlines when seen in a 
solrconfig.xml as default params, can also be seperated simply by spaces in an 
actual URL too. (newlines in the URL might work too, never tried it, spaces 
more convenient for an actual URL). 

From: Grant Ingersoll [gsi...@gmail.com] On Behalf Of Grant Ingersoll 
[gsing...@apache.org]
Sent: Wednesday, June 02, 2010 6:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Array of arguments in URL?

Those aren't in the default parameters.  They are config for the SearchHandler 
itself.

On Jun 1, 2010, at 9:00 PM, Lance Norskog wrote:

 In the /spell declaration in the example solrconfig.xml, we find
 these lines among the default parameters:

arr name=last-components
  strspellcheck/str
/arr

 How does one supply such an array of strings in HTTP parameters? Does
 Solr have a parsing option for this?

 --
 Lance Norskog
 goks...@gmail.com

Many Tomcat Processes on Server ?!?!?


Hello.

Our Server is a 8-Core Server with 12 GB RAM.  
Solr is running with 4 Cores. 

55 Tomcat 5.5 processes are running. ist this normal ??? 

htop show me a list of these processes of the server. and tomcat have about
55. 
every process using:
/usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/bootstrap.jar.

is this normal ? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864732.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: PHP output at a multiValued AND dynamicField

2010-06-02 Thread Erik Hatcher

You probably should try the php or phps response writer - it'll likely  
make your PHP integration easier.


Erik

On Jun 2, 2010, at 9:50 AM, Jörg Agatz wrote:


Hallo Users...

I have a Problem...
In my SolR, i have a lot of multiValued, dynamicFields and now i  
must print

ther Fields in php..

But i dont know how...


In schema.xml:

 field name=P_VIP_KUNDE_ID type=string indexed=true  
stored=true/

 dynamicField name=P_VIP_KUNDE_*  type=textindexed=true
stored=true multiValued=true/
 field name=P_VIP_ADR_ID type=string indexed=true  
stored=true/

 dynamicField name=P_VIP_ADR_*  type=textindexed=true
stored=true multiValued=true/
 field name=P_VIP_PERSON_ID type=string indexed=true  
stored=true/

 dynamicField name=P_VIP_PERSON_*  type=textindexed=true
stored=true multiValued=true/
 field name=P_VIP_INFO_ID type=string indexed=true  
stored=true/
 field name=P_VIP_INFO_TEXT type=text indexed=true  
stored=true/


output from Solr:

doc
str name=P_FILE_ITEMS_FILENAMEA201005311740560002.xml/str
str name=P_FILE_ITEMS_GESPERRTNO/str
str name=P_FILE_ITEMS_IDA201005311740560002/str
str name=P_FILE_ITEMS_LAST_CHANGE2010-05-31 17:40:56/str
−
str name=P_FILE_ITEMS_PFAD
Q:\DatenIBP\AADMS\telli_vip\xml\A201005311740560002.xml
/str
arr name=P_VIP_ADR_LAND
strD/str
/arr
arr name=P_VIP_ADR_LANDTEXT
str/
/arr
arr name=P_VIP_ADR_ORT
strLeichlingen/str
/arr
arr name=P_VIP_ADR_PLZ
str42799/str
/arr
arr name=P_VIP_ADR_POSTFACH
str/
/arr
arr name=P_VIP_ADR_POSTFACHOR
str/
/arr
arr name=P_VIP_ADR_POSTFACHPL
str/
/arr
arr name=P_VIP_ADR_STRASSE
strSchlo� Eicherhof/str
/arr
str name=P_VIP_ELEMENT_CATADRESS/str
str name=P_VIP_KUNDE_IDKYETG201005311740560002/str
/doc

I don now ha is the name of the Fields, so i dont know how i get the  
name to

printr it in PHP

Maby someone of you has a answer of the problem?

King

Re: Many Tomcat Processes on Server ?!?!?

2010-06-02 Thread Eric Pugh

My guess would be that commons-daemon is somehow thinking that Tomcat has gone 
down and started up multiple copies...   You only need one Tomcat process for 
your 4 core Solr instance!   You may have many other WAR applications hosted in 
Tomcat, I know a lot of places would have 1 tomcat per deployed WAR pattern.


On Jun 2, 2010, at 9:59 AM, stockii wrote:

 
 Hello.
 
 Our Server is a 8-Core Server with 12 GB RAM.  
 Solr is running with 4 Cores. 
 
 55 Tomcat 5.5 processes are running. ist this normal ??? 
 
 htop show me a list of these processes of the server. and tomcat have about
 55. 
 every process using:
 /usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/bootstrap.jar.
 
 is this normal ? 
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864732.html
 Sent from the Solr - User mailing list archive at Nabble.com.

-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from 
http://www.packtpub.com/solr-1-4-enterprise-search-server
Free/Busy: http://tinyurl.com/eric-cal

Re: Many Tomcat Processes on Server ?!?!?


Is your server Linux?
In this case this is very normal.. any java application spawns many  
new processes on linux... it's not exactly bound to threads  
unfortunately.


And, of course, they all refer to the same invocation path.

paul


Le 02-juin-10 à 15:59, stockii a écrit :



Hello.

Our Server is a 8-Core Server with 12 GB RAM.
Solr is running with 4 Cores.

55 Tomcat 5.5 processes are running. ist this normal ???

htop show me a list of these processes of the server. and tomcat  
have about

55.
every process using:
/usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/ 
bootstrap.jar.


is this normal ?




smime.p7s
Description: S/MIME cryptographic signature

Re: PHP output at a multiValued AND dynamicField

2010-06-02 Thread Jörg Agatz

yes i done.. but i dont know how i get the information out of the big
Array...

Al fields like P_VIP_ADR_*

Re: RIA sample and minimal JARs required to embed Solr

2010-06-02 Thread Eric Pugh

Glad to hear someone looking at Solr not just as web enabled search engine, but 
as a simpler/more powerful interface to Lucene!   

When you download the source code, look at the Chapter 8 Crawler project, 
specifically Indexer.java, it demonstrates how to index into both a 
traditional separate Solr process and how to fire up an embedded Solr.   It is 
remarkably easy to interact with an embedded Solr!   In terms of minimal 
dependencies, what you need for a standalone Solr (outside of the servlet 
container like Tomcat/Jetty) is what you need for an embedded Solr.

Eric

On May 29, 2010, at 9:32 PM, Thomas J. Buhr wrote:

 Solr,
 
 The Solr 1.4 EES book arrived yesterday and I'm very much enjoying it. I was 
 glad to see that rich clients are one case for embedding Solr as this is 
 the case for my application. Multi Cores will also be important for my RIA.
 
 The book covers a lot and makes it clear that Solr has extensive abilities. 
 There is however no clean and simple sample of embedding Solr in a RIA in the 
 book, only a few alternate language usage samples. Is there a link to a Java 
 sample that simply embeds Solr for local indexing and searching using Multi 
 Cores?
 
 Also, what kind of memory footprint am I looking at for embedding Solr? What 
 are the minimal dependancies?
 
 Thom

-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from 
http://www.packtpub.com/solr-1-4-enterprise-search-server
Free/Busy: http://tinyurl.com/eric-cal

Re: Many Tomcat Processes on Server ?!?!?


yes, its a Linux... Debian System.

when i running a import. only 2-3 tomcat processes are running. the other
doing nothing ... thats what is strange for me .. ^^
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864804.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Many Tomcat Processes on Server ?!?!?


You'd need to search explanations for this at generic java forums.
It's the same with any java process on Linux.
In the Unix family Solaris and MacOSX do it better, fortunately and is  
probably due to the very old time where the Linux java was a  
translation of the Solaris java with the special features implemented  
when it was not found in Linux (e.g. green-threads).


paul




Le 02-juin-10 à 16:21, stockii a écrit :



yes, its a Linux... Debian System.

when i running a import. only 2-3 tomcat processes are running. the  
other

doing nothing ... thats what is strange for me .. ^^
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864804.html
Sent from the Solr - User mailing list archive at Nabble.com.




smime.p7s
Description: S/MIME cryptographic signature

Re: Many Tomcat Processes on Server ?!?!?

Am 02.06.2010 16:13, schrieb Paul Libbrecht:
 Is your server Linux?
 In this case this is very normal.. any java application spawns many new
 processes on linux... it's not exactly bound to threads unfortunately.

Uh, no. New threads in Java typically don't spawn new processes on OS level.

I never had more than one tomcat process on any Linux machine. In fact,
if there was more than one because a previous Tomcat hadn't shut down
correctly, the new process wouldn't respond to HTTP requests.

55 Tomcat processes shouldn't be normal, at least not if that's what ps
aux responds.

Re: PHP output at a multiValued AND dynamicField

Am 02.06.2010 16:15, schrieb Jörg Agatz:
 yes i done.. but i dont know how i get the information out of the big
 Array...

They're simply the keys of a single response array.

RE: Many Tomcat Processes on Server ?!?!?

2010-06-02 Thread Patrick Wilson

Maybe he was looking at the output from top or htop?

-Original Message-
From: Michael Kuhlmann [mailto:michael.kuhlm...@zalando.de]
Sent: Wednesday, June 02, 2010 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Many Tomcat Processes on Server ?!?!?

Am 02.06.2010 16:13, schrieb Paul Libbrecht:
 Is your server Linux?
 In this case this is very normal.. any java application spawns many new
 processes on linux... it's not exactly bound to threads unfortunately.

Uh, no. New threads in Java typically don't spawn new processes on OS level.

I never had more than one tomcat process on any Linux machine. In fact,
if there was more than one because a previous Tomcat hadn't shut down
correctly, the new process wouldn't respond to HTTP requests.

55 Tomcat processes shouldn't be normal, at least not if that's what ps
aux responds.

Re: Many Tomcat Processes on Server ?!?!?


oha... ps aux shows only 3 processes from tomcat55.

but why show htop 55 ? close the garbage collector these not ?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864849.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Many Tomcat Processes on Server ?!?!?

This is impressive, I had this in any Linux I've been using: SuSE,  
Ubuntu, Debian, Mandrake, ...
Maybe there's some modern JDK with a modern Linux where it doesn't  
happen?

It surely is not one process per thread though.

paul


Le 02-juin-10 à 16:29, Michael Kuhlmann a écrit :


Am 02.06.2010 16:13, schrieb Paul Libbrecht:

Is your server Linux?
In this case this is very normal.. any java application spawns many  
new
processes on linux... it's not exactly bound to threads  
unfortunately.


Uh, no. New threads in Java typically don't spawn new processes on  
OS level.


I never had more than one tomcat process on any Linux machine. In  
fact,

if there was more than one because a previous Tomcat hadn't shut down
correctly, the new process wouldn't respond to HTTP requests.

55 Tomcat processes shouldn't be normal, at least not if that's what  
ps

aux responds.




smime.p7s
Description: S/MIME cryptographic signature

Re: PHP output at a multiValued AND dynamicField

2010-06-02 Thread Jörg Agatz

i don't understand what you mean!

Re: Many Tomcat Processes on Server ?!?!?

Am 02.06.2010 16:39, schrieb Paul Libbrecht:
 This is impressive, I had this in any Linux I've been using: SuSE,
 Ubuntu, Debian, Mandrake, ...
 Maybe there's some modern JDK with a modern Linux where it doesn't happen?
 It surely is not one process per thread though.

I'm not a linux thread expert, but from what I know Linux doesn't know
lightweight threads as other systems do. Instead it uses processes for that.

But these processes aren't top level processes that show up in top/ps.
Instead, they're grouped hierarchically (AFAIK). Otherwise you would be
able to kill single user threads with their own process id, or kill the
main process and let the spawned threads continue. That would be totally
crazy.

In my configuration, Tomcat doesn't shut down correctly if I call
bin/shutdown.sh, so I have to kill the process manually. I don't know
why. This might be the reason why stockii has 3 Tomcat processes running.

Re: PHP output at a multiValued AND dynamicField

Am 02.06.2010 16:42, schrieb Jörg Agatz:
 i don't understand what you mean!
 
Then you should ask more precisely.

Re: Many Tomcat Processes on Server ?!?!?


all the process in in htop show, have a own PID. so thats are no threads ? 

i restart my tomcat via  /etc/init.d/tomcat restart  

do you think that after ervery resart the processes arent closed ? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864918.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Many Tomcat Processes on Server ?!?!?

2010-06-02 Thread Patrick Wilson

Try shutting tomcat down instead of restarting. If processes remain, then I'd 
say further investigation is warranted. If no processes remain, then I think 
it's safe to disregard unless you notice any problems.

-Original Message-
From: stockii [mailto:st...@shopgate.com]
Sent: Wednesday, June 02, 2010 10:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Many Tomcat Processes on Server ?!?!?

all the process in in htop show, have a own PID. so thats are no threads ?

i restart my tomcat via  /etc/init.d/tomcat restart 

do you think that after ervery resart the processes arent closed ?
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864918.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Many Tomcat Processes on Server ?!?!?



Le 02-juin-10 à 16:57, stockii a écrit :

all the process in in htop show, have a own PID. so thats are no  
threads ?


No, you can't say that.
In general it is sufficient for the mother process to be killed but  
it can take several attempts.




i restart my tomcat via  /etc/init.d/tomcat restart 
do you think that after ervery resart the processes arent closed ?


after bin/shutdown.sh it is very common to me that some hanging  
threads remain... and we crafted my little script snippet (which is  
kind of specific) to actually prevent this and kill... after a while  
only.


it's not optimal.

paul

smime.p7s
Description: S/MIME cryptographic signature

Re: nested querries, and LocalParams syntax

2010-06-02 Thread Jonathan Rochkind

Thanks Yonik.

I guess the confusing thing is if the lucene query parser (for nested
querries) does backslash escaping, and the LocalParams also does
backslash escaping when you have a nested query with local params,
with quotes at both places... the inner scope needs... double escaping?
That gets really confusing fast.

[ Yeah, I recognize that using parameter dereferencing can avoid this;
I'm trying to see if I can make my code flexible enough to work either
way].

Maybe using single vs double quotes is the answer. Let's try one out and
see:

[Query un-uri escaped for clarity:]

_query_:{!dismax q.alt=' \a phrase search \ '} \another phrase
search\

[ Heh, getting that into a ruby string to uri escape it is a pain, but
we end up with: ]

q=_query_%3A%7B%21dismax+q.alt%3D%27%5C%22a+phrase+search%5C%22%27%7D+%5C%22another+phrase+search%5C%22

Which, um, I _think_ is working, although the debugQuery=true isn't
telling me much, I don't entirely understand it. Have to play around
with it more.

But it looks like maybe a fine strategy is use double quote for the
nested query itself, use single quote for the LocalParam values, and
then simply singly escape any single or double quotes inside the
LocalParam values.

Jonathan

Yonik Seeley wrote:

Hmmm, well, the lucene query parser does basic backslash escaping, and
so does local params within quoted strings. You can also use
parameter derefererencing to avoid the need to escape values too.
Like you pointed out, using single quotes in some places can also
help.

But instead of me trying to give you tons of examples that you
probably already understand, start from the assumption that things
will work, and if you come across something that doesn't make sense
(or doesn't work), I can help with that. Or if you give a single
real example as a general pattern, perhaps we could help figure out
the simplest way to avoid most of the escaping.

-Yonik
http://www.lucidimagination.com

On Tue, Jun 1, 2010 at 6:21 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

I am just trying to figure it out mostly, the particular thing I am trying
to do is a very general purpose mapper to complex dismax nested querries. I
could try to explain it, and we could go back and forth for a while, and
maybe I could convince you it makes sense to do what I'm trying to do. But
mostly I'm just exploring at this point, so I can get a sense of what is
possible.

So it would be super helpful if someone can help me figure out escaping
stuff and skip the other part, heh.

But basically, it's a mapper from a CQL query (a structured language for
search-engine-style querries) to Solr, where some of the fields searched
aren't really Solr fields/indexes, but aggregated definitions of dismax
query params including multiple solr fields, where exactly what solr fields
and other dismax querries will not be hard-coded, but will be configurable.
Thus the use of nested querries. So since it ends up so general purpose and
abstract, and many of the individual parameters are configurable, thus my
interest in figuring out proper escaping.

Jonathan

Yonik Seeley wrote:

It's not clear if you're just trying to figure it all out, or get
something specific to work.
If you can give a specific example, we might be able to suggest easier
ways to achieve it rather than going escape crazy :-)

-Yonik
http://www.lucidimagination.com

On Tue, Jun 1, 2010 at 5:06 PM, Jonathan Rochkind rochk...@jhu.edu
wrote:

Thanks, the pointer to that documentation page (which somehow I had
missed),
as well as Chris's response is very helpful.

The one thing I'm still not sure about, which I might be able to figure
it
out through trial-and-error reverse engineering, is escaping issues when
you
combine nested querries WITH local params. We potentially have a lot of
levels of quotes:

q= URIescape(_local_={!dismax qf= value that itself contains a \
quote mark} phrase query )

Whole bunch of quotes going on there. How do I give this to Solr so all
my
quotes will end up parsed appropriately? Obviously that above example
isn't
right. We've got the quotes around the _local_ nested query, then we've
got quotes around a LocalParam value, then we've got quotes that might be
IN
the actual literal value of the LocalParam, or quotes that might be in
the
actual literal value of the nested query. Maybe using single quotes in
some
places but double quotes in others will help, for certain places that can
take singel or double quotes?
Thanks very much for any advice, I get confused thinking about this.

Jonathan

Chris Hostetter wrote:

In addition to yonik's point about the LocalParams wiki page (and please
let us know if you aren't sure of the answers to any of your questions
after
reading it) I wanted to clear up one thing...

: Let's start with that not-nested query example. Can you in fact use
it
as
: above, to force dismax handling of the 'q' even if the qt or request

Re: SolrException: No such core

2010-06-02 Thread jfmnews

Solr is used to manage lists of indexes.
We have a database containing documents of different types.
Each document type is defined by a list of properties and we want to associate 
some of these properties with lists of indexes to help users during query.

For example:
The property contains a text field desc may be associated with a field Solr 
desc_en_items.

Desc_en_items is a dynamic field solr:
  dynamicField name=*_en_items type=en_items indexed=true stored=false 
multiValued=true/

And so on for each property associated with a field Solr.

Each Solr document contains an identifier (stored and indexed) Solr and dynamic 
fields. (only indexed)

When adding a document in our database, if needed, we dynamically generate the 
document and add it to solr. When a document is deleted from our database we 
suppress systematically the solr document deleteById (the document can not 
exist in solr).

There is only one core (Core0) and the server is embedded.

We use a derived lucli/LuceneMethods.java to browse index.
 
It seems to me, without being sure, that the problem comes when no list is set 
(solr is started but contains no records) in a few days of operation. We have a 
database with lists parameterized works for several months without problem.

Here the wrappers to use ...solrj.SolrServer
[code]
public class SolrCoreServer
{
   private static Logger log = LoggerFactory.getLogger(SolrCoreServer.class);

   private SolrServer server=null;
   
   public SolrCoreServer(CoreContainer container, String coreName)
   {
  server = new EmbeddedSolrServer( container, coreName );
   }

   protected SolrServer getSolrServer(){
  return server;
   }
   
   public void cleanup() throws SolrServerException, IOException {
   log.debug(cleanup());
   UpdateResponse rsp = server.deleteByQuery( *:* );
   log.debug(cleanup(): + rsp.getStatus());
   if (rsp.getStatus() != 0)
   throw new SolrServerException(cleanup() failed status= + 
rsp.getStatus());
   }
 
   public void add(SolrInputDocument doc) throws SolrServerException, 
IOException{
   log.debug(add( + doc + ));
   UpdateResponse rsp = server.add(doc);
   log.debug(add(): + rsp.getStatus());
   if (rsp.getStatus() != 0)
   throw new SolrServerException(add() failed status= + 
rsp.getStatus());
}
   
   public void add(CollectionSolrInputDocument docs) throws 
SolrServerException, IOException{
   log.debug(add( + docs + ));
   UpdateResponse rsp = server.add(docs);
   log.debug(add(): + rsp.getStatus());
   if (rsp.getStatus() != 0)
   throw new SolrServerException(add() failed status= + 
rsp.getStatus());
   }
   
   public void deleteById(String docId) throws SolrServerException, IOException{
   log.debug(deleteById( + docId + ));
   UpdateResponse rsp = server.deleteById(docId);
   log.debug(deleteById(): + rsp.getStatus());
   if (rsp.getStatus() != 0)
   throw new SolrServerException(deleteById() failed status= 
+ rsp.getStatus());
}
  
   public void commit() throws SolrServerException, IOException {
   log.debug(commit());
   UpdateResponse rsp = server.commit();
   log.debug(commit(): + rsp.getStatus());
   if (rsp.getStatus() != 0)
   throw new SolrServerException(commit() failed status= + 
rsp.getStatus());
   }
  
   public void addAndCommit(CollectionSolrInputDocument docs) throws 
SolrServerException, IOException{
  log.debug(addAndCommit( + docs + ));
  UpdateRequest req = new UpdateRequest(); 
  req.setAction( UpdateRequest.ACTION.COMMIT, false, false );
  req.add( docs );
  UpdateResponse rsp = req.process( server );  
  log.debug(addAndCommit(): + rsp.getStatus());
  if (rsp.getStatus() != 0)
   throw new SolrServerException(addAndCommit() failed 
status= + rsp.getStatus());
   }
  
   public QueryResponse query( SolrQuery query ) throws SolrServerException{
  log.debug(query( + query + ));
  QueryResponse qr =  server.query( query );
  log.debug(query(): + qr.getStatus());
  return qr;
   }

   public QueryResponse query( String queryString, String sortField, 
SolrQuery.ORDER order, Integer maxRows ) throws SolrServerException{
  log.debug(query( + queryString + ));
  SolrQuery query = new SolrQuery();
  query.setQuery( queryString );
  query.addSortField( sortField, order );
  query.setRows(maxRows);
  QueryResponse qr = server.query( query );
  log.debug(query(): + qr.getStatus());
  return qr;
   }

}
[/code]

the schema
[code]
?xml version=1.0 ?
!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The

Re: Many Tomcat Processes on Server ?!?!?


okay you are right. thats all threads and no processes ...
but so many ? :D hehe

so when all the processes are threads i think its okay so ?! i can ignore
this ... XD
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p865008.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Importing large datasets



As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  

Our master solr machine is running 64-bit RHEL 5.4 on dedicated machine with
4 cores and 16G ram so I think we are good on the hardware. Our DB is MySQL
version 5.0.67 (exact stats i don't know of the top of my head)


When you say quite large, what do you mean?  Are we talking books here or
maybe a couple pages of text or just a couple KB of data?

Our item descriptions are very similar to an ebay listing and can include
HTML. We are talking about a couple of pages of text.


How long does it take you to get that data out (and, from the sounds of it,
merge it with your item) w/o going to Solr? 

I'll have to get back to you on that one.


DataImportHandler now supports multiple threads. 

When you say now, what do you mean? I am running version 1.4.


The absolute fastest way that I know of to index is via multiple threads
sending batches of documents at a time (at least 100)

 Is there a wiki explaining how this multiple thread process works? Which
batch size would work best? I am currently using a -1 batch size. 


You may want to write your own multithreaded client to index. 

This sounds like a viable option. Can you point me in the right direction on
where to begin (what classes to look at, prior examples, etc)?

Here is my field type I am using for the item description. Maybe its not the
best?

  fieldType name=text class=solr.TextField omitNorms=false
  analyzer
tokenizer class=solr.WhitespaceTokenizerFactory/
filter class=solr.WordDelimiterFilterFactory
generateWordParts=1
generateNumberParts=1
catenateWords=1
catenateNumber=1
catenateAll=1
splitOnCaseChange=1/
filter class=solr.LowerCaseFilterFactory/
filter
class=com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
  /analyzer
/fieldType

Here is an overview of my data-config.xml. Thoughts?

 entity name=item 
dataSource=datasource1
query=select * from items
 ...
entity name=item_description 
dataSource=datasource2 
query=select description from item_descriptions where
id=${item.id}/
 /entity

I appreciate the help.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865091.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Importing large datasets



Andrzej Bialecki wrote:
 
 On 2010-06-02 12:42, Grant Ingersoll wrote:
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 

 We have around 5 million items in our index and each item has a
 description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only
 indexing
 items and not their corresponding description and a full import takes
 around
 4 hours. Ideally we want to index both our items and their descriptions
 but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 

 - How would I profile the indexing process to determine if the
 bottleneck is
 Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  
 
 When you say quite large, what do you mean?  Are we talking books here
 or maybe a couple pages of text or just a couple KB of data?
 
 How long does it take you to get that data out (and, from the sounds of
 it, merge it with your item) w/o going to Solr?
 
 - In either case, how would one speed up this process? Is there a way to
 run
 parallel import processes and then merge them together at the end?
 Possibly
 use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The absolute fastest
 way that I know of to index is via multiple threads sending batches of
 documents at a time (at least 100).  Often, from DBs one can split up the
 table via SQL statements that can then be fetched separately.  You may
 want to write your own multithreaded client to index.
 
 SOLR-1301 is also an option if you are familiar with Hadoop ...
 
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 
 

I haven't worked with Hadoop before but I'm willing to try anything to cut
down this full import time. I see this currently uses the embedded solr
server for indexing... would I have to scrap my DIH importing then? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865103.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Importing large datasets



As a data point, I routinely see clients index 5M items on normal hardware
in approx. 1 hour (give or take 30 minutes). 

Also wanted to add that our main entity (item) consists of 5 sub-entities
(ie, joins). 2 of those 5 are fairly small so I am using
CachedSqlEntityProcessor for them but the other 3 (which includes
item_description) are normal.

All the entites minus the item_description connect to datasource1. They
currently point to one physical machine although we do have a pool of 3 DB's
that could be used if it helps. The other entity, item_description uses a
datasource2 which has a pool of 2 DB's that could potentially be used. Not
sure if that would help or not.

I might as well that the item description will have indexed, stored and term
vectors set to true.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Luke browser does not show non-String Solr fields?

2010-06-02 Thread jlist9

I see. It's still a little confusing to me but I'm fine as long as
this is the expected behavior. I also tried the example index
with data that come with the solr distribution and observe the
same behavior - only String fields are displayed. So Lucene is
sharing _some_ types with Solr but not all. It's still a bit puzzling
to me that Lucene is not able to understand the simple types
such as long. But I'm OK as long as there is a reason. Thanks
for the explanations!

On Tue, Jun 1, 2010 at 10:38 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : So it seems like Luke does not understand Solr's long type. This
 : is not a native Lucene type?

 No,  Lucene has concept of types ... there are utilities to help encode
 some data in special ways (particularly numbers) but the underlying lucene
 index doesn't keep track of when/how you do ths -- so Luke has no way of
 knowing what type the field is.

 Schema information is specific to Solr.


 -Hoss

Re: Importing large datasets

2010-06-02 Thread Erik Hatcher

One thing that might help indexing speed - create a *single* SQL query  
to grab all the data you need without using DIH's sub-entities, at  
least the non-cached ones.


Erik

On Jun 2, 2010, at 12:21 PM, Blargy wrote:




As a data point, I routinely see clients index 5M items on normal  
hardware

in approx. 1 hour (give or take 30 minutes).

Also wanted to add that our main entity (item) consists of 5 sub- 
entities

(ie, joins). 2 of those 5 are fairly small so I am using
CachedSqlEntityProcessor for them but the other 3 (which includes
item_description) are normal.

All the entites minus the item_description connect to datasource1.  
They
currently point to one physical machine although we do have a pool  
of 3 DB's
that could be used if it helps. The other entity, item_description  
uses a
datasource2 which has a pool of 2 DB's that could potentially be  
used. Not

sure if that would help or not.

I might as well that the item description will have indexed, stored  
and term

vectors set to true.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Luke browser does not show non-String Solr fields?


: I see. It's still a little confusing to me but I'm fine as long as
: this is the expected behavior. I also tried the example index
: with data that come with the solr distribution and observe the
: same behavior - only String fields are displayed. So Lucene is
: sharing _some_ types with Solr but not all. It's still a bit puzzling
: to me that Lucene is not able to understand the simple types
: such as long. But I'm OK as long as there is a reason. Thanks
: for the explanations!

The key is that there are *no* types in Lucene ... older 
versions of Lucene only supported Strin and clinets that wanted to index 
other types had to encode those types in some way as needed.  More 
recently lucene has started moving away from even dealing with Strings, 
and towards just indexing/searching raw byte[] ... all concepts of field 
types in Solr are specific to Solr 

(the caveat being that Lucene has, over the years, added utilities to help 
people make smart choices about how to encode some data types -- and in 
the case of the Trie numeric fields SOlr uses those utilites.  But that 
data isn't stored anywhere in the index files themselves, so Luke has no 
way of knowing that it should attempt to decode the binary data of a 
field using the Trie utilities.  That said: aparently Andrzej is working 
on making it possible to tell Luke oh BTW, i indexed this field using 
this solr fieldType ... i think he said it was on the Luke trunk)


-Hoss

Re: Array of arguments in URL?


: In the /spell declaration in the example solrconfig.xml, we find
: these lines among the default parameters:

as grant pointed out: these aren't in the default params

: How does one supply such an array of strings in HTTP parameters? Does
: Solr have a parsing option for this?

in general, ignoring for a moment hte question of wether you are asking 
about changing the component list in a param (you can't) and addressing 
just the question of specifing an array of strings in HTTP params: if the 
param supports multiple values, then you can specify multiple values just 
be  repeating hte key...

  q=foofq=firstValuefq=secondValuefq=thirdValue

...this results in a SolrParams instance where the value of fq is an 
array of [firstValue, secondValue] 




-Hoss

Re: Combining index and file spellcheck dictionaries


: Is it possible to combine index and file spellcheck dictionaries?

off the top of my head -- i don't think so.  however you could add special 
docs to your index, which only contain the spell field you use to build 
your spellcheck index, based on the contents of your dictionary file.


-Hoss

Re: Importing large datasets




 One thing that might help indexing speed - create a *single* SQL query  
 to grab all the data you need without using DIH's sub-entities, at  
 least the non-cached ones.
 

Not sure how much that would help. As I mentioned that without the item
description import the full process takes 4 hours which is bearable. However
once I started to import the item description which is located on a separate
machine/database the import process exploded to over 24 hours.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: minpercentage vs. mincount


: Obviously I could implement this in userland (like like mincount for 
: that matter), but I wonder if anyone else see's use in being able to 
: define that a facet must match a minimum percentage of all documents in 
: the result set, rather than a hardcoded value? The idea being that while 
: I might not be interested in a facet that only covers 3 documents in the 
: result set if there are lets say 1000 documents in the result set, the 
: situation would be a lot different if I only have 10 documents in the 
: result set.

typically people deal with this type of situation by using facet.limit to 
ensure they only get the top N constraints back -- and they set 
facet.mincount to something low just to save bandwidth if all the 
counts are too low to care about no matter how few results there are 
(ie: 0)

: I did not yet see such a feature, would it make sense to file it as a 
: feature request or should stuff like this rather be done in userland (I 
: have noticed for example that Solr prefers to have users normalize the 
: scores in userland too)?

feel free to file a feature request -- truthfully this is kind of a hard 
problem to solve in userland, you'd either have to do two queries (the 
first to get the numFound, the second with facet.mincount set as an 
integer relative numFound) or you'd have to do a single query but ask for 
a big value for facet.limit and hope that you get enough to prune your 
list.

Off the top of my head though: i can't relaly think of a sane way to do 
this on the server side that would work with distributed search either -- 
but go ahead and open an issue and let's see what the folks who are really 
smart about the distributed searching stuff have to say.


-Hoss

Re: minpercentage vs. mincount

2010-06-02 Thread Lukas Kahwe Smith

thx for your reply!

On 02.06.2010, at 20:27, Chris Hostetter wrote:

 feel free to file a feature request -- truthfully this is kind of a hard 
 problem to solve in userland, you'd either have to do two queries (the 
 first to get the numFound, the second with facet.mincount set as an 
 integer relative numFound) or you'd have to do a single query but ask for 
 a big value for facet.limit and hope that you get enough to prune your 
 list.

well i would probably implement it by just not setting a limit, and then just 
reducing the facets based on the numRows before sending the facets to the 
client (aka browser)

 Off the top of my head though: i can't relaly think of a sane way to do 
 this on the server side that would work with distributed search either -- 
 but go ahead and open an issue and let's see what the folks who are really 
 smart about the distributed searching stuff have to say.


ok i have created it:
https://issues.apache.org/jira/browse/SOLR-1937

regards,
Lukas Kahwe Smith
m...@pooteeweet.org

Re: Importing large datasets

2010-06-02 Thread David Stuart

How long does it take to do a grab of all the data via SQL? I found by  
denormalizing the data into a lookup table meant that I was able to  
index about 300k rows of similar data size with dih regex spilting on  
some fields in about 8mins I know it's not quite the scale bit with  
batching...


David Stuar

On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com wrote:





One thing that might help indexing speed - create a *single* SQL  
query

to grab all the data you need without using DIH's sub-entities, at
least the non-cached ones.



Not sure how much that would help. As I mentioned that without the  
item
description import the full process takes 4 hours which is bearable.  
However
once I started to import the item description which is located on a  
separate

machine/database the import process exploded to over 24 hours.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Query related question

: When I query for a word say Tiger woods, and sort results by score... i do
: notice that the results are mixed up i.e first 5 results match Tiger woods
: the next 2 match either tiger/tigers or wood/woods
: the next 2 after that i notice again match tiger woods.
: 
: How do i make sure that when searching for words like above i get all the
: results matching whole search term first, followed by individual tokens like
: tiger, woods later.

for starters, you have to make sense of why exactly those docs are scoring 
that way -- this is what the param debugQuery=true is for -- look at the
score explanations and see why those docs are scoring lower.

My guess is that it's because of fieldNorms (ie: longer documents score 
lower with the same number of matches) but it could also be a term 
frequency factor (some documents contain tiger so many times they score 
high even w/o woods) ... you have to understand why your docs score they 
way they do before you can come up with a general plan for how to change 
the scoring to better meet your goals.



-Hoss

Auto-suggest internal terms

2010-06-02 Thread Jay Hill

I've got a situation where I'm looking to build an auto-suggest where any
term entered will lead to suggestions. For example, if I type wine I want
to see suggestions like this:

french *wine* classes
*wine* book discounts
burgundy *wine*

etc.

I've tried some tricks with shingles, but the only solution that worked was
pre-processing my queries into a core in all variations.

Anyone know any tricks to accomplish this in Solr without doing any custom
work?

-Jay

RE: Auto-suggest internal terms

2010-06-02 Thread Patrick Wilson

I'm painfully new to Solr so please be gentle if my suggestion is terrible!

Could you use highlighting to do this? Take the first n results from a query 
and show their highlights, customizing the highlights to show the desired 
number of words.

Just a thought.

Patrick

-Original Message-
From: Jay Hill [mailto:jayallenh...@gmail.com]
Sent: Wednesday, June 02, 2010 4:02 PM
To: solr-user@lucene.apache.org
Subject: Auto-suggest internal terms

I've got a situation where I'm looking to build an auto-suggest where any
term entered will lead to suggestions. For example, if I type wine I want
to see suggestions like this:

french *wine* classes
*wine* book discounts
burgundy *wine*

etc.

I've tried some tricks with shingles, but the only solution that worked was
pre-processing my queries into a core in all variations.

Anyone know any tricks to accomplish this in Solr without doing any custom
work?

-Jay

RE: Auto-suggest internal terms

2010-06-02 Thread Tim Gilbert

I was interested in the same thing and stumbled upon this article:

http://www.mattweber.org/2009/05/02/solr-autosuggest-with-termscomponent
-and-jquery/

I haven't followed through, but it looked promising to me.

Tim

-Original Message-
From: Jay Hill [mailto:jayallenh...@gmail.com] 
Sent: Wednesday, June 02, 2010 4:02 PM
To: solr-user@lucene.apache.org
Subject: Auto-suggest internal terms

I've got a situation where I'm looking to build an auto-suggest where
any
term entered will lead to suggestions. For example, if I type wine I
want
to see suggestions like this:

french *wine* classes
*wine* book discounts
burgundy *wine*

etc.

I've tried some tricks with shingles, but the only solution that worked
was
pre-processing my queries into a core in all variations.

Anyone know any tricks to accomplish this in Solr without doing any
custom
work?

-Jay

Not able to access Solr Admin

2010-06-02 Thread Bondiga, Murali

Hi,

I installed Solr Server on my machine and able to access with localhost. I 
tried accessing from a different machine with IP Address but not able to access 
it. What do I need to do to be able to access the Solr instance from any 
machine within the network?

Thanks,
Murali

Re: Not able to access Solr Admin

2010-06-02 Thread Abdelhamid ABID

details... detailseverybody let's say details !

Which app server are you using ?
What is the error message that you get when trying to access solr admin from
another machine  ?



On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali 
murali.krishna.bond...@hmhpub.com wrote:

 Hi,

 I installed Solr Server on my machine and able to access with localhost. I
 tried accessing from a different machine with IP Address but not able to
 access it. What do I need to do to be able to access the Solr instance from
 any machine within the network?

 Thanks,
 Murali




-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB

RE: Not able to access Solr Admin

2010-06-02 Thread Bondiga, Murali

Thank you so much for the reply.

I am using Jetty which comes with Solr installation. 

http://localhost:8983/solr/

The above URL works fine. 

The below URL does not work:

http://177.44.9.119:8983/solr/


-Original Message-
From: Abdelhamid ABID [mailto:aeh.a...@gmail.com] 
Sent: Wednesday, June 02, 2010 5:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Not able to access Solr Admin

details... detailseverybody let's say details !

Which app server are you using ?
What is the error message that you get when trying to access solr admin from
another machine  ?



On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali 
murali.krishna.bond...@hmhpub.com wrote:

 Hi,

 I installed Solr Server on my machine and able to access with localhost. I
 tried accessing from a different machine with IP Address but not able to
 access it. What do I need to do to be able to access the Solr instance from
 any machine within the network?

 Thanks,
 Murali




-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB

Re: Not able to access Solr Admin

2010-06-02 Thread Abdelhamid ABID

When you access from another machine what message error do you get ?

Check your remote access with Telnet to see if the server respond

On Wed, Jun 2, 2010 at 10:26 PM, Bondiga, Murali 
murali.krishna.bond...@hmhpub.com wrote:

 Thank you so much for the reply.

 I am using Jetty which comes with Solr installation.

 http://localhost:8983/solr/

 The above URL works fine.

 The below URL does not work:

 http://177.44.9.119:8983/solr/


 -Original Message-
 From: Abdelhamid ABID [mailto:aeh.a...@gmail.com]
 Sent: Wednesday, June 02, 2010 5:07 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Not able to access Solr Admin

 details... detailseverybody let's say details !

 Which app server are you using ?
 What is the error message that you get when trying to access solr admin
 from
 another machine  ?



 On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali 
 murali.krishna.bond...@hmhpub.com wrote:

  Hi,
 
  I installed Solr Server on my machine and able to access with localhost.
 I
  tried accessing from a different machine with IP Address but not able to
  access it. What do I need to do to be able to access the Solr instance
 from
  any machine within the network?
 
  Thanks,
  Murali
 



 --
 Abdelhamid ABID
 Software Engineer- J2EE / WEB




-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB

Re: Luke browser does not show non-String Solr fields?

2010-06-02 Thread jlist9

Thank you Chris. I'm clear now. I'll give Luke's latest version a try
when it's out.

On Wed, Jun 2, 2010 at 9:47 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : I see. It's still a little confusing to me but I'm fine as long as
 : this is the expected behavior. I also tried the example index
 : with data that come with the solr distribution and observe the
 : same behavior - only String fields are displayed. So Lucene is
 : sharing _some_ types with Solr but not all. It's still a bit puzzling
 : to me that Lucene is not able to understand the simple types
 : such as long. But I'm OK as long as there is a reason. Thanks
 : for the explanations!

 The key is that there are *no* types in Lucene ... older
 versions of Lucene only supported Strin and clinets that wanted to index
 other types had to encode those types in some way as needed.  More
 recently lucene has started moving away from even dealing with Strings,
 and towards just indexing/searching raw byte[] ... all concepts of field
 types in Solr are specific to Solr

 (the caveat being that Lucene has, over the years, added utilities to help
 people make smart choices about how to encode some data types -- and in
 the case of the Trie numeric fields SOlr uses those utilites.  But that
 data isn't stored anywhere in the index files themselves, so Luke has no
 way of knowing that it should attempt to decode the binary data of a
 field using the Trie utilities.  That said: aparently Andrzej is working
 on making it possible to tell Luke oh BTW, i indexed this field using
 this solr fieldType ... i think he said it was on the Luke trunk)


 -Hoss

Help in facet query

2010-06-02 Thread Sushan Rungta

Hi,

Can I restrict the facet search within the result count? 

Example: A total of 100 documents were fetched for a given query x, and
facet worked in these 100 documents. I want that facet should work only on
first 10 documents fetched from query x.

Regards,

Sushan Rungta

Re: DataImportHandler and running out of disk space


: I ran through some more failure scenarios (scenarios and results below). The
: concerning ones in my deployment are when data does not get updated, but the
: DIH's .properties file does. I could only simulate that scenario when I ran
: out of disk space (all all disk space issues behaved consistently). Is this
: worthy of a JIRA issue?

I don't know that it's DIH's responsibility to be specificly aware of disk 
space issues -- but it definitely sounds like a bug if Exceptions/Errors 
like running out of space (or file permissions errors) are occuring but 
DIH is still reporting success (and still updating hte properties file 
with the lsat updated timestamp)

by all means: please open issues for these types of things.

: Successful import
: 
: all dates updated in .properties (title date updated, each [entity
: name].last_index_time updated to its own update time. last_index_time set to
: earliest entity update time)
: 
: 
: 
: 
: Running out of disk space during import (in data directory only, conf
: directory still has space)
: 
: no data updated, but dataimport.properties updated as in 1
: 
: 
: 
: 
: Running out of disk space during import (in both data directory and conf
: directory)
: 
: some data updated, but dataimport.properties updated as in 1
: 
: 
: 
: 
: Running out of disk space during commit/optimize (in data directory only,
: conf directory still has space)
: 
: no data updated, but dataimport.properties updated as in 1
: 
: 
: 
: 
: Running out of disk space during commit/optimize (in both data directory and
: conf directory)
: 
: no data updated, but dataimport.properties updated as in 1
: 
: 
: 
: 
: File permissions prevent writing (on index directory)
: 
: data not updated, failure reported, properties file not updated
: 
: 
: 
: 
: File permissions prevent writing (on segment files)
: 
: data updated, failure reported, properties file not updated
: 
: 
: 
: 
: File permissions prevent writing (on .properties file)
: 
: data updated, failure reported, properties file not updated
: 
: 
: 
: 
: Shutting down Solr during import (killing process)
: 
: data not updated, .properties not updated, no result reported
: 
: 
: 
: 
: Shutting down Solr during import (issuing shutdown message)
: 
: Some data updated, .properties not updated, no result reported
: 
: 
: 
: 
: DB connection lost (unplugging network cable)
: 
: data not updated, .properties not updated, failure reported
: 
: 
: 
: 
: Updating single entity fails (first one)
: 
: data not updated, .properties not updated, failure reported
: 
: 
: 
: 
: Updating single entity fails (after another one succeeds)
: 
: data not updated, .properties not updated, failure reported
: 
: 
: 
: 
: 
: -- 
: View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-and-running-out-of-disk-space-tp835125p835368.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 



-Hoss

Some basics

2010-06-02 Thread Frank A

Hi,

I'm new to SOLR and have some basic questions that hopefully steer me in the
right direction.

- I want my search to auto spell check - that is if someone types
restarant I'd like the system to automatically search for restaurant.
I've seen the SpellCheckComponent but that doesn't seem to have a simple way
to automatically do the near type comparison.  Is the SpellCheckComponent
the wrong one or do I just need to manually handle the situation in my
client code?

- Also, what is the proper analyzer if I want to search a search for thai
food or thai restaurant to actually match on Thai?  I can't totally
ignore words like food and restaurant but I want to ignore more general
terms and look for specific first (or I should say score them higher).

Any tips on what I should be reading up on will be greatly appreciated.

Thanks.

Re: Importing large datasets

2010-06-02 Thread Lance Norskog

Wait! You're fetching records from one database and then doing lookups
against another DB? That makes this a completely different problem.

The DIH does not to my knowledge have the ability to pool these
queries. That is, it will not build a batch of 1000 keys from
datasource1 and then do a query against datasource2 with:
select foo where key_field IN (key1, key2,... key1000);

This is the efficient way to do what you want. You'll have to write
your own client to do this.

On Wed, Jun 2, 2010 at 12:00 PM, David Stuart
david.stu...@progressivealliance.co.uk wrote:
 How long does it take to do a grab of all the data via SQL? I found by
 denormalizing the data into a lookup table meant that I was able to index
 about 300k rows of similar data size with dih regex spilting on some fields
 in about 8mins I know it's not quite the scale bit with batching...

 David Stuar

 On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com wrote:




 One thing that might help indexing speed - create a *single* SQL query
 to grab all the data you need without using DIH's sub-entities, at
 least the non-cached ones.


 Not sure how much that would help. As I mentioned that without the item
 description import the full process takes 4 hours which is bearable.
 However
 once I started to import the item description which is located on a
 separate
 machine/database the import process exploded to over 24 hours.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com

Re: Importing large datasets

2010-06-02 Thread Dennis Gearon

Well, I hope to have around 5 million datasets/documents within 1 year, so this 
is good info. BUT if I DO have that many, then the market I am aiming at will 
end giving me 100 times more than than within 2 years.

Are there good references/books on using Solr/Lucen/(linux/nginx) for 500 
million plus documents? The data is easily shardible geographially, as one 
given.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Importing large datasets
 To: solr-user@lucene.apache.org
 Date: Wednesday, June 2, 2010, 3:42 AM
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 
  
  We have around 5 million items in our index and each
 item has a description
  located on a separate physical database. These item
 descriptions vary in
  size and for the most part are quite large. Currently
 we are only indexing
  items and not their corresponding description and a
 full import takes around
  4 hours. Ideally we want to index both our items and
 their descriptions but
  after some quick profiling I determined that a full
 import would take in
  excess of 24 hours. 
  
  - How would I profile the indexing process to
 determine if the bottleneck is
  Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on
 normal
 hardware in approx. 1 hour (give or take 30 minutes). 
 
 
 When you say quite large, what do you mean?  Are we
 talking books here or maybe a couple pages of text or just a
 couple KB of data?
 
 How long does it take you to get that data out (and, from
 the sounds of it, merge it with your item) w/o going to
 Solr?
 
  - In either case, how would one speed up this process?
 Is there a way to run
  parallel import processes and then merge them together
 at the end? Possibly
  use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The
 absolute fastest way that I know of to index is via multiple
 threads sending batches of documents at a time (at least
 100).  Often, from DBs one can split up the table via
 SQL statements that can then be fetched separately. 
 You may want to write your own multithreaded client to
 index.
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem using Solr/Lucene: 
 http://www.lucidimagination.com/search

Re: Importing large datasets

2010-06-02 Thread Dennis Gearon

When adding data continuously, that data is available after committing and is 
indexed, right?

If so, how often is reindexing do some good?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Andrzej Bialecki a...@getopt.org wrote:

 From: Andrzej Bialecki a...@getopt.org
 Subject: Re: Importing large datasets
 To: solr-user@lucene.apache.org
 Date: Wednesday, June 2, 2010, 4:52 AM
 On 2010-06-02 13:12, Grant Ingersoll
 wrote:
  
  On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
  
  On 2010-06-02 12:42, Grant Ingersoll wrote:
 
  On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 
 
  We have around 5 million items in our
 index and each item has a description
  located on a separate physical database.
 These item descriptions vary in
  size and for the most part are quite
 large. Currently we are only indexing
  items and not their corresponding
 description and a full import takes around
  4 hours. Ideally we want to index both our
 items and their descriptions but
  after some quick profiling I determined
 that a full import would take in
  excess of 24 hours. 
 
  - How would I profile the indexing process
 to determine if the bottleneck is
  Solr or our Database.
 
  As a data point, I routinely see clients index
 5M items on normal
  hardware in approx. 1 hour (give or take 30
 minutes).  
 
  When you say quite large, what do you
 mean?  Are we talking books here or maybe a couple
 pages of text or just a couple KB of data?
 
  How long does it take you to get that data out
 (and, from the sounds of it, merge it with your item) w/o
 going to Solr?
 
  - In either case, how would one speed up
 this process? Is there a way to run
  parallel import processes and then merge
 them together at the end? Possibly
  use some sort of distributed computing?
 
  DataImportHandler now supports multiple
 threads.  The absolute fastest way that I know of to
 index is via multiple threads sending batches of documents
 at a time (at least 100).  Often, from DBs one can
 split up the table via SQL statements that can then be
 fetched separately.  You may want to write your own
 multithreaded client to index.
 
  SOLR-1301 is also an option if you are familiar
 with Hadoop ...
 
  
  If the bottleneck is the DB, will that do much?
  
 
 Nope. But the workflow could be set up so that during night
 hours a DB
 export takes place that results in a CSV or SolrXML file
 (there you
 could measure the time it takes to do this export), and
 then indexing
 can work from this file.
 
 
 -- 
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _
 _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic
 Web
 ___|||__||  \|  ||  |  Embedded Unix,
 System Integration
 http://www.sigram.com  Contact: info at sigram dot
 com

Re: Importing large datasets

2010-06-02 Thread Dennis Gearon

That's promising!!! That's how I have been desigining my project. It must be 
all the joins that are causing the problems for him?
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, David Stuart david.stu...@progressivealliance.co.uk wrote:

 From: David Stuart david.stu...@progressivealliance.co.uk
 Subject: Re: Importing large datasets
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Date: Wednesday, June 2, 2010, 12:00 PM
 How long does it take to do a grab of
 all the data via SQL? I found by denormalizing the data into
 a lookup table meant that I was able to index about 300k
 rows of similar data size with dih regex spilting on some
 fields in about 8mins I know it's not quite the scale bit
 with batching...
 
 David Stuar
 
 On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com
 wrote:
 
  
  
  
  One thing that might help indexing speed - create
 a *single* SQL query
  to grab all the data you need without using DIH's
 sub-entities, at
  least the non-cached ones.
  
  
  Not sure how much that would help. As I mentioned that
 without the item
  description import the full process takes 4 hours
 which is bearable. However
  once I started to import the item description which is
 located on a separate
  machine/database the import process exploded to over
 24 hours.
  
  --View this message in context: 
  http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
  Sent from the Solr - User mailing list archive at
 Nabble.com.

Re: Importing large datasets



Lance Norskog-2 wrote:
 
 Wait! You're fetching records from one database and then doing lookups
 against another DB? That makes this a completely different problem.
 
 The DIH does not to my knowledge have the ability to pool these
 queries. That is, it will not build a batch of 1000 keys from
 datasource1 and then do a query against datasource2 with:
 select foo where key_field IN (key1, key2,... key1000);
 
 This is the efficient way to do what you want. You'll have to write
 your own client to do this.
 
 On Wed, Jun 2, 2010 at 12:00 PM, David Stuart
 david.stu...@progressivealliance.co.uk wrote:
 How long does it take to do a grab of all the data via SQL? I found by
 denormalizing the data into a lookup table meant that I was able to index
 about 300k rows of similar data size with dih regex spilting on some
 fields
 in about 8mins I know it's not quite the scale bit with batching...

 David Stuar

 On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com wrote:




 One thing that might help indexing speed - create a *single* SQL query
 to grab all the data you need without using DIH's sub-entities, at
 least the non-cached ones.


 Not sure how much that would help. As I mentioned that without the item
 description import the full process takes 4 hours which is bearable.
 However
 once I started to import the item description which is located on a
 separate
 machine/database the import process exploded to over 24 hours.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 
 
 -- 
 Lance Norskog
 goks...@gmail.com
 

Whats more efficient a batch size of 1000 or -1 for MySQL? Is this why its
so slow because I am using 2 different datasources?

Say I am using just one datasource should I still be seing Creating a
connection for entity  for each sub entity in the document or should it
just be using one connection?




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866499.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Importing large datasets



Erik Hatcher-4 wrote:
 
 One thing that might help indexing speed - create a *single* SQL query  
 to grab all the data you need without using DIH's sub-entities, at  
 least the non-cached ones.
 
   Erik
 
 On Jun 2, 2010, at 12:21 PM, Blargy wrote:
 


 As a data point, I routinely see clients index 5M items on normal  
 hardware
 in approx. 1 hour (give or take 30 minutes).

 Also wanted to add that our main entity (item) consists of 5 sub- 
 entities
 (ie, joins). 2 of those 5 are fairly small so I am using
 CachedSqlEntityProcessor for them but the other 3 (which includes
 item_description) are normal.

 All the entites minus the item_description connect to datasource1.  
 They
 currently point to one physical machine although we do have a pool  
 of 3 DB's
 that could be used if it helps. The other entity, item_description  
 uses a
 datasource2 which has a pool of 2 DB's that could potentially be  
 used. Not
 sure if that would help or not.

 I might as well that the item description will have indexed, stored  
 and term
 vectors set to true.
 -- 
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

I can't find any example of creating a massive sql query. Any out there?
Will batching still work with this massive query?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Importing large datasets