Re: Importing large datasets

2010-06-07 Thread Alexey Serba
What's the relation between the items and item_descriptions tables? I.e., is
there only one item_descriptions record for every item id?

If it is 1-1, then you can merge all your data into a single database and use
the following query:

 <entity name="item"
         dataSource="single_datasource"
         query="select * from items inner join item_descriptions on
                item_descriptions.id = items.id">
 </entity>
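
For reference, a minimal sketch of the surrounding data-config.xml (the
driver class, URL and credentials here are placeholders, not values from
this thread):

 <dataConfig>
   <dataSource name="single_datasource"
               driver="com.mysql.jdbc.Driver"
               url="jdbc:mysql://dbhost/somedb"
               user="user" password="pass"
               batchSize="-1"/>
   <document>
     <!-- the <entity> above goes here -->
   </document>
 </dataConfig>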

HTH,
Alex

On Thu, Jun 3, 2010 at 6:34 AM, Blargy zman...@hotmail.com wrote:

 Erik Hatcher-4 wrote:

 One thing that might help indexing speed - create a *single* SQL query
 to grab all the data you need without using DIH's sub-entities, at
 least the non-cached ones.

       Erik

 On Jun 2, 2010, at 12:21 PM, Blargy wrote:

 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).

 Also wanted to add that our main entity (item) consists of 5 sub-entities
 (ie, joins). 2 of those 5 are fairly small so I am using
 CachedSqlEntityProcessor for them but the other 3 (which includes
 item_description) are normal.

 All the entities minus the item_description connect to datasource1. They
 currently point to one physical machine although we do have a pool of 3 DB's
 that could be used if it helps. The other entity, item_description, uses
 datasource2, which has a pool of 2 DB's that could potentially be used. Not
 sure if that would help or not.

 I might as well add that the item description will have indexed, stored
 and term vectors set to true.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 I can't find any example of creating this massive SQL query. Are there any
 out there? Will batching still work with this massive query?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Importing large datasets

2010-06-03 Thread David Stuart



On 3 Jun 2010, at 02:58, Dennis Gearon gear...@sbcglobal.net wrote:

When adding data continuously, that data is available after committing and
is indexed, right?

Yes


If so, how often does reindexing do any good?
You should only need to reindex if the data changes or you change your
schema. The DIH in Solr 1.4 supports delta imports, so you should only
really be adding or updating (which is actually deleting and adding)
items when necessary.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
 otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Andrzej Bialecki a...@getopt.org wrote:


From: Andrzej Bialecki a...@getopt.org
Subject: Re: Importing large datasets
To: solr-user@lucene.apache.org
Date: Wednesday, June 2, 2010, 4:52 AM
On 2010-06-02 13:12, Grant Ingersoll wrote:

 On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:

 On 2010-06-02 12:42, Grant Ingersoll wrote:

 On Jun 1, 2010, at 9:54 PM, Blargy wrote:

 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours.

 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.

 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).

 When you say quite large, what do you mean?  Are we talking books here or
 maybe a couple pages of text or just a couple KB of data?

 How long does it take you to get that data out (and, from the sounds of it,
 merge it with your item) w/o going to Solr?

 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?

 DataImportHandler now supports multiple threads.  The absolute fastest way
 that I know of to index is via multiple threads sending batches of documents
 at a time (at least 100).  Often, from DBs one can split up the table via
 SQL statements that can then be fetched separately.  You may want to write
 your own multithreaded client to index.

 SOLR-1301 is also an option if you are familiar with Hadoop ...

 If the bottleneck is the DB, will that do much?

Nope. But the workflow could be set up so that during night hours a DB
export takes place that results in a CSV or SolrXML file (there you
could measure the time it takes to do this export), and then indexing
can work from this file.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Importing large datasets

2010-06-03 Thread David Stuart



On 3 Jun 2010, at 02:51, Dennis Gearon gear...@sbcglobal.net wrote:

Well, I hope to have around 5 million datasets/documents within 1 year, so
this is good info. BUT if I DO have that many, then the market I am aiming
at will end up giving me 100 times more than that within 2 years.


Are there good references/books on using Solr/Lucene/(Linux/nginx) for
500 million plus documents?


As far as I'm aware there aren't any books yet that cover this for Solr.
The wiki, this mailing list, and Nabble are your best sources, and there
have been some quite in-depth conversations on the matter in this list in
the past.

The data is easily shardable geographically, as one given.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
 otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Grant Ingersoll gsing...@apache.org wrote:


From: Grant Ingersoll gsing...@apache.org
Subject: Re: Importing large datasets
To: solr-user@lucene.apache.org
Date: Wednesday, June 2, 2010, 3:42 AM

On Jun 1, 2010, at 9:54 PM, Blargy wrote:

 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours.

 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.

As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).

When you say quite large, what do you mean?  Are we talking books here or
maybe a couple pages of text or just a couple KB of data?

How long does it take you to get that data out (and, from the sounds of it,
merge it with your item) w/o going to Solr?

 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?

DataImportHandler now supports multiple threads.  The absolute fastest way
that I know of to index is via multiple threads sending batches of documents
at a time (at least 100).  Often, from DBs one can split up the table via
SQL statements that can then be fetched separately.  You may want to write
your own multithreaded client to index.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search




Re: Importing large datasets

2010-06-03 Thread David Stuart



On 3 Jun 2010, at 03:51, Blargy zman...@hotmail.com wrote:



Would dumping the databases to a local file help at all?


I would suspect not, especially with the size of your data. But it would
be good to know how long that takes, i.e. if you create a SQL script that
just pulls that data out, how long does that take?


Also, how many fields are you indexing per document: 10, 50, 100?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html

Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-03 Thread Erik Hatcher
Frankly, if you can create a script that'll turn your data into valid  
CSV, that might be the easiest, quickest way to ingest your data.   
Pragmatic, at least.  Avoids the complexity of DIH, allows you to  
script the export from your DB in the most efficient manner you can,  
and so on.


Solr's CSV update handler is FAST!
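
As a rough sketch (assuming a stock single-core Solr 1.4 on localhost and a
CSV file whose header row names your schema fields), posting it looks like:

 curl 'http://localhost:8983/solr/update/csv?commit=true' \
      --data-binary @items.csv \
      -H 'Content-type: text/plain; charset=utf-8'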

Erik

On Jun 3, 2010, at 2:56 AM, David Stuart wrote:




On 3 Jun 2010, at 03:51, Blargy zman...@hotmail.com wrote:

 Would dumping the databases to a local file help at all?


I would suspect not, especially with the size of your data. But it would
be good to know how long that takes, i.e. if you create a SQL script that
just pulls that data out, how long does that take?


Also, how many fields are you indexing per document: 10, 50, 100?
--
View this message in context:
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html

Sent from the Solr - User mailing list archive at Nabble.com.




Re: Importing large datasets

2010-06-03 Thread Grant Ingersoll

On Jun 2, 2010, at 10:30 PM, Blargy wrote:
 What's more efficient: a batch size of 1000 or -1 for MySQL? Is this why
 it's so slow, because I am using 2 different datasources?


By batch size, I meant the number of docs sent from the client to Solr.
MySQL batch size is broken.  The only thing that will work is -1 or not
specifying it at all.  If you don't specify it, it materializes all rows
into memory.
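
In data-config.xml terms that means something like this (a sketch; the
driver, URL and credentials are placeholders):

 <dataSource name="datasource1"
             driver="com.mysql.jdbc.Driver"
             url="jdbc:mysql://dbhost/somedb"
             user="user" password="pass"
             batchSize="-1"/>

With batchSize="-1", DIH hands the MySQL JDBC driver a fetch size of
Integer.MIN_VALUE, which is that driver's signal to stream rows one at a
time instead of buffering the entire result set in memory.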

Does your data really need to be in two different databases?  That is 
undoubtedly your bottleneck.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Importing large datasets

2010-06-02 Thread Grant Ingersoll

On Jun 1, 2010, at 9:54 PM, Blargy wrote:

 
 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 
 
 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.

As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).  

When you say quite large, what do you mean?  Are we talking books here or 
maybe a couple pages of text or just a couple KB of data?

How long does it take you to get that data out (and, from the sounds of it, 
merge it with your item) w/o going to Solr?

 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?

DataImportHandler now supports multiple threads.  The absolute fastest way that 
I know of to index is via multiple threads sending batches of documents at a 
time (at least 100).  Often, from DBs one can split up the table via SQL 
statements that can then be fetched separately.  You may want to write your own 
multithreaded client to index.
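
A minimal sketch of such a client in SolrJ (Solr 1.4 era), where loadSlice()
is a hypothetical stand-in for your own JDBC code that fetches one id range:

 import java.util.ArrayList;
 import java.util.List;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;
 import org.apache.solr.client.solrj.SolrServer;
 import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
 import org.apache.solr.common.SolrInputDocument;

 public class ParallelIndexer {
   // Hypothetical: stands in for "SELECT ... WHERE id BETWEEN lo AND hi".
   static List<SolrInputDocument> loadSlice(int lo, int hi) {
     return new ArrayList<SolrInputDocument>();
   }

   public static void main(String[] args) throws Exception {
     // CommonsHttpSolrServer is thread-safe, so the workers can share it.
     final SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
     ExecutorService pool = Executors.newFixedThreadPool(4);
     final int sliceSize = 1250000;               // ~5M ids over 4 workers
     for (int i = 0; i < 4; i++) {
       final int lo = i * sliceSize, hi = (i + 1) * sliceSize - 1;
       pool.submit(new Callable<Void>() {
         public Void call() throws Exception {
           List<SolrInputDocument> docs = loadSlice(lo, hi);
           // Send batches of at least 100 docs rather than one at a time.
           for (int from = 0; from < docs.size(); from += 100) {
             solr.add(docs.subList(from, Math.min(from + 100, docs.size())));
           }
           return null;
         }
       });
     }
     pool.shutdown();
     pool.awaitTermination(1, TimeUnit.DAYS);
     solr.commit();                               // one commit at the end
   }
 }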

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki
On 2010-06-02 12:42, Grant Ingersoll wrote:
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 

 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 

 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  
 
 When you say quite large, what do you mean?  Are we talking books here or 
 maybe a couple pages of text or just a couple KB of data?
 
 How long does it take you to get that data out (and, from the sounds of it, 
 merge it with your item) w/o going to Solr?
 
 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The absolute fastest way 
 that I know of to index is via multiple threads sending batches of documents 
 at a time (at least 100).  Often, from DBs one can split up the table via SQL 
 statements that can then be fetched separately.  You may want to write your 
 own multithreaded client to index.

SOLR-1301 is also an option if you are familiar with Hadoop ...



-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Importing large datasets

2010-06-02 Thread Grant Ingersoll

On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:

 On 2010-06-02 12:42, Grant Ingersoll wrote:
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 
 
 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 
 
 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  
 
 When you say quite large, what do you mean?  Are we talking books here or 
 maybe a couple pages of text or just a couple KB of data?
 
 How long does it take you to get that data out (and, from the sounds of it, 
 merge it with your item) w/o going to Solr?
 
 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The absolute fastest way 
 that I know of to index is via multiple threads sending batches of documents 
 at a time (at least 100).  Often, from DBs one can split up the table via 
 SQL statements that can then be fetched separately.  You may want to write 
 your own multithreaded client to index.
 
 SOLR-1301 is also an option if you are familiar with Hadoop ...
 

If the bottleneck is the DB, will that do much?

Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki
On 2010-06-02 13:12, Grant Ingersoll wrote:
 
 On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
 
 On 2010-06-02 12:42, Grant Ingersoll wrote:

 On Jun 1, 2010, at 9:54 PM, Blargy wrote:


 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 

 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.

 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  

 When you say quite large, what do you mean?  Are we talking books here or 
 maybe a couple pages of text or just a couple KB of data?

 How long does it take you to get that data out (and, from the sounds of it, 
 merge it with your item) w/o going to Solr?

 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?

 DataImportHandler now supports multiple threads.  The absolute fastest way 
 that I know of to index is via multiple threads sending batches of documents 
 at a time (at least 100).  Often, from DBs one can split up the table via 
 SQL statements that can then be fetched separately.  You may want to write 
 your own multithreaded client to index.

 SOLR-1301 is also an option if you are familiar with Hadoop ...

 
 If the bottleneck is the DB, will that do much?
 

Nope. But the workflow could be set up so that during night hours a DB
export takes place that results in a CSV or SolrXML file (there you
could measure the time it takes to do this export), and then indexing
can work from this file.
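
With MySQL (the DB in this thread) such an export could be as simple as the
following sketch; only the items/item_descriptions names come from the
thread, while the column list and output path are assumptions:

 SELECT i.id, d.description
 INTO OUTFILE '/tmp/items.csv'
 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 FROM items i
 JOIN item_descriptions d ON d.id = i.id;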


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Importing large datasets

2010-06-02 Thread Blargy


As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  

Our master Solr machine is running 64-bit RHEL 5.4 on a dedicated machine with
4 cores and 16GB RAM, so I think we are good on the hardware. Our DB is MySQL
version 5.0.67 (exact stats I don't know off the top of my head).


When you say quite large, what do you mean?  Are we talking books here or
maybe a couple pages of text or just a couple KB of data?

Our item descriptions are very similar to an ebay listing and can include
HTML. We are talking about a couple of pages of text.


How long does it take you to get that data out (and, from the sounds of it,
merge it with your item) w/o going to Solr? 

I'll have to get back to you on that one.


DataImportHandler now supports multiple threads. 

When you say "now", what do you mean? I am running version 1.4.


The absolute fastest way that I know of to index is via multiple threads
sending batches of documents at a time (at least 100)

 Is there a wiki explaining how this multiple thread process works? Which
batch size would work best? I am currently using a -1 batch size. 


You may want to write your own multithreaded client to index. 

This sounds like a viable option. Can you point me in the right direction on
where to begin (what classes to look at, prior examples, etc)?

Here is the field type I am using for the item description. Maybe it's not the
best?

  <fieldType name="text" class="solr.TextField" omitNorms="false">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              catenateNumbers="1"
              catenateAll="1"
              splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

Here is an overview of my data-config.xml. Thoughts?

 <entity name="item"
         dataSource="datasource1"
         query="select * from items">
   ...
   <entity name="item_description"
           dataSource="datasource2"
           query="select description from item_descriptions where
                  id=${item.id}"/>
 </entity>

I appreciate the help.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865091.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


Andrzej Bialecki wrote:
 
 On 2010-06-02 12:42, Grant Ingersoll wrote:
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 

 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 

 - How would I profile the indexing process to determine if the bottleneck is
 Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).  
 
 When you say quite large, what do you mean?  Are we talking books here
 or maybe a couple pages of text or just a couple KB of data?
 
 How long does it take you to get that data out (and, from the sounds of
 it, merge it with your item) w/o going to Solr?
 
 - In either case, how would one speed up this process? Is there a way to run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The absolute fastest
 way that I know of to index is via multiple threads sending batches of
 documents at a time (at least 100).  Often, from DBs one can split up the
 table via SQL statements that can then be fetched separately.  You may
 want to write your own multithreaded client to index.
 
 SOLR-1301 is also an option if you are familiar with Hadoop ...
 
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 
 

I haven't worked with Hadoop before but I'm willing to try anything to cut
down this full import time. I see this currently uses the embedded solr
server for indexing... would I have to scrap my DIH importing then? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865103.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


As a data point, I routinely see clients index 5M items on normal hardware
in approx. 1 hour (give or take 30 minutes). 

Also wanted to add that our main entity (item) consists of 5 sub-entities
(ie, joins). 2 of those 5 are fairly small so I am using
CachedSqlEntityProcessor for them but the other 3 (which includes
item_description) are normal.

All the entities minus the item_description connect to datasource1. They
currently point to one physical machine although we do have a pool of 3 DB's
that could be used if it helps. The other entity, item_description, uses
datasource2, which has a pool of 2 DB's that could potentially be used. Not
sure if that would help or not.

I might as well add that the item description will have indexed, stored and term
vectors set to true.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Erik Hatcher
One thing that might help indexing speed - create a *single* SQL query  
to grab all the data you need without using DIH's sub-entities, at  
least the non-cached ones.


Erik

On Jun 2, 2010, at 12:21 PM, Blargy wrote:


 As a data point, I routinely see clients index 5M items on normal hardware
 in approx. 1 hour (give or take 30 minutes).

 Also wanted to add that our main entity (item) consists of 5 sub-entities
 (ie, joins). 2 of those 5 are fairly small so I am using
 CachedSqlEntityProcessor for them but the other 3 (which includes
 item_description) are normal.

 All the entities minus the item_description connect to datasource1. They
 currently point to one physical machine although we do have a pool of 3 DB's
 that could be used if it helps. The other entity, item_description, uses
 datasource2, which has a pool of 2 DB's that could potentially be used. Not
 sure if that would help or not.

 I might as well add that the item description will have indexed, stored
 and term vectors set to true.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Importing large datasets

2010-06-02 Thread Blargy



 One thing that might help indexing speed - create a *single* SQL query  
 to grab all the data you need without using DIH's sub-entities, at  
 least the non-cached ones.
 

Not sure how much that would help. As I mentioned, without the item
description import the full process takes 4 hours, which is bearable. However,
once I started to import the item description, which is located on a separate
machine/database, the import process exploded to over 24 hours.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread David Stuart
How long does it take to do a grab of all the data via SQL? I found that
denormalizing the data into a lookup table meant that I was able to index
about 300k rows of similar data size, with DIH regex splitting on some
fields, in about 8 minutes. I know it's not quite the same scale, but with
batching...
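
Something along these lines, as a sketch (the table and column names follow
this thread; it assumes both tables have first been copied into one database):

 CREATE TABLE item_denorm AS
   SELECT i.*, d.description
   FROM items i
   JOIN item_descriptions d ON d.id = i.id;

 CREATE INDEX idx_item_denorm_id ON item_denorm (id);

DIH can then run a single flat "select * from item_denorm" with no
sub-entity lookup per row.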


David Stuart

On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com wrote:





One thing that might help indexing speed - create a *single* SQL query
to grab all the data you need without using DIH's sub-entities, at
least the non-cached ones.



Not sure how much that would help. As I mentioned, without the item
description import the full process takes 4 hours, which is bearable.
However, once I started to import the item description, which is located
on a separate machine/database, the import process exploded to over 24 hours.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Lance Norskog
Wait! You're fetching records from one database and then doing lookups
against another DB? That makes this a completely different problem.

The DIH does not to my knowledge have the ability to pool these
queries. That is, it will not build a batch of 1000 keys from
datasource1 and then do a query against datasource2 with:
select foo from some_table where key_field IN (key1, key2, ... key1000);

This is the efficient way to do what you want. You'll have to write
your own client to do this.
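
A bare-bones sketch of that pooling step with plain JDBC (the connection
URLs, credentials and exact column list are assumptions; the table names
come from this thread):

 import java.sql.*;
 import java.util.*;

 public class BatchedLookup {
   public static void main(String[] args) throws Exception {
     Class.forName("com.mysql.jdbc.Driver");
     Connection db1 = DriverManager.getConnection("jdbc:mysql://host1/db1", "user", "pass");
     Connection db2 = DriverManager.getConnection("jdbc:mysql://host2/db2", "user", "pass");

     Statement items = db1.createStatement();
     ResultSet rs = items.executeQuery("SELECT id FROM items");

     // Collect 1000 keys from datasource1, then look them up in one shot.
     List<Long> keys = new ArrayList<Long>(1000);
     while (rs.next()) {
       keys.add(rs.getLong("id"));
       if (keys.size() == 1000) { lookupDescriptions(db2, keys); keys.clear(); }
     }
     if (!keys.isEmpty()) lookupDescriptions(db2, keys);
   }

   // One IN query per 1000 keys instead of one query per row.
   static void lookupDescriptions(Connection db2, List<Long> keys) throws SQLException {
     StringBuilder in = new StringBuilder();
     for (Long k : keys) in.append(in.length() == 0 ? "" : ",").append(k);
     Statement st = db2.createStatement();
     ResultSet rs = st.executeQuery(
         "SELECT id, description FROM item_descriptions WHERE id IN (" + in + ")");
     while (rs.next()) {
       // Merge rs.getString("description") into the matching Solr document here.
     }
     rs.close();
     st.close();
   }
 }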

On Wed, Jun 2, 2010 at 12:00 PM, David Stuart
david.stu...@progressivealliance.co.uk wrote:
 How long does it take to do a grab of all the data via SQL? I found that
 denormalizing the data into a lookup table meant that I was able to index
 about 300k rows of similar data size, with DIH regex splitting on some fields,
 in about 8 minutes. I know it's not quite the same scale, but with batching...

 David Stuart

 On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com wrote:




 One thing that might help indexing speed - create a *single* SQL query
 to grab all the data you need without using DIH's sub-entities, at
 least the non-cached ones.


 Not sure how much that would help. As I mentioned, without the item
 description import the full process takes 4 hours, which is bearable.
 However, once I started to import the item description, which is located
 on a separate machine/database, the import process exploded to over 24 hours.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Importing large datasets

2010-06-02 Thread Dennis Gearon
Well, I hope to have around 5 million datasets/documents within 1 year, so this 
is good info. BUT if I DO have that many, then the market I am aiming at will 
end up giving me 100 times more than that within 2 years.

Are there good references/books on using Solr/Lucene/(Linux/nginx) for 500 
million plus documents? The data is easily shardable geographically, as one 
given.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Grant Ingersoll gsing...@apache.org wrote:

 From: Grant Ingersoll gsing...@apache.org
 Subject: Re: Importing large datasets
 To: solr-user@lucene.apache.org
 Date: Wednesday, June 2, 2010, 3:42 AM
 
 On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 
  We have around 5 million items in our index and each item has a description
  located on a separate physical database. These item descriptions vary in
  size and for the most part are quite large. Currently we are only indexing
  items and not their corresponding description and a full import takes around
  4 hours. Ideally we want to index both our items and their descriptions but
  after some quick profiling I determined that a full import would take in
  excess of 24 hours.
 
  - How would I profile the indexing process to determine if the bottleneck is
  Solr or our Database.
 
 As a data point, I routinely see clients index 5M items on normal
 hardware in approx. 1 hour (give or take 30 minutes).
 
 When you say quite large, what do you mean?  Are we talking books here or
 maybe a couple pages of text or just a couple KB of data?
 
 How long does it take you to get that data out (and, from the sounds of it,
 merge it with your item) w/o going to Solr?
 
  - In either case, how would one speed up this process? Is there a way to run
  parallel import processes and then merge them together at the end? Possibly
  use some sort of distributed computing?
 
 DataImportHandler now supports multiple threads.  The absolute fastest way
 that I know of to index is via multiple threads sending batches of documents
 at a time (at least 100).  Often, from DBs one can split up the table via
 SQL statements that can then be fetched separately.  You may want to write
 your own multithreaded client to index.
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem using Solr/Lucene:
 http://www.lucidimagination.com/search
 



Re: Importing large datasets

2010-06-02 Thread Dennis Gearon
When adding data continuously, that data is available after committing and is 
indexed, right?

If so, how often does reindexing do any good?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Andrzej Bialecki a...@getopt.org wrote:

 From: Andrzej Bialecki a...@getopt.org
 Subject: Re: Importing large datasets
 To: solr-user@lucene.apache.org
 Date: Wednesday, June 2, 2010, 4:52 AM
 On 2010-06-02 13:12, Grant Ingersoll wrote:
 
  On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
 
  On 2010-06-02 12:42, Grant Ingersoll wrote:
 
  On Jun 1, 2010, at 9:54 PM, Blargy wrote:
 
  We have around 5 million items in our index and each item has a description
  located on a separate physical database. These item descriptions vary in
  size and for the most part are quite large. Currently we are only indexing
  items and not their corresponding description and a full import takes around
  4 hours. Ideally we want to index both our items and their descriptions but
  after some quick profiling I determined that a full import would take in
  excess of 24 hours.
 
  - How would I profile the indexing process to determine if the bottleneck is
  Solr or our Database.
 
  As a data point, I routinely see clients index 5M items on normal
  hardware in approx. 1 hour (give or take 30 minutes).
 
  When you say quite large, what do you mean?  Are we talking books here or
  maybe a couple pages of text or just a couple KB of data?
 
  How long does it take you to get that data out (and, from the sounds of it,
  merge it with your item) w/o going to Solr?
 
  - In either case, how would one speed up this process? Is there a way to run
  parallel import processes and then merge them together at the end? Possibly
  use some sort of distributed computing?
 
  DataImportHandler now supports multiple threads.  The absolute fastest way
  that I know of to index is via multiple threads sending batches of documents
  at a time (at least 100).  Often, from DBs one can split up the table via
  SQL statements that can then be fetched separately.  You may want to write
  your own multithreaded client to index.
 
  SOLR-1301 is also an option if you are familiar with Hadoop ...
 
  If the bottleneck is the DB, will that do much?
 
 Nope. But the workflow could be set up so that during night hours a DB
 export takes place that results in a CSV or SolrXML file (there you
 could measure the time it takes to do this export), and then indexing
 can work from this file.
 
 --
 Best regards,
 Andrzej Bialecki
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 



Re: Importing large datasets

2010-06-02 Thread Dennis Gearon
That's promising!!! That's how I have been designing my project. It must be 
all the joins that are causing the problems for him?
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, David Stuart david.stu...@progressivealliance.co.uk wrote:

 From: David Stuart david.stu...@progressivealliance.co.uk
 Subject: Re: Importing large datasets
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Date: Wednesday, June 2, 2010, 12:00 PM
 How long does it take to do a grab of all the data via SQL? I found that
 denormalizing the data into a lookup table meant that I was able to index
 about 300k rows of similar data size, with DIH regex splitting on some
 fields, in about 8 minutes. I know it's not quite the same scale, but with
 batching...
 
 David Stuart
 
 On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com wrote:
 
  One thing that might help indexing speed - create a *single* SQL query
  to grab all the data you need without using DIH's sub-entities, at
  least the non-cached ones.
 
  Not sure how much that would help. As I mentioned, without the item
  description import the full process takes 4 hours, which is bearable.
  However, once I started to import the item description, which is located
  on a separate machine/database, the import process exploded to over 24
  hours.
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
  Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


Lance Norskog-2 wrote:
 
 Wait! You're fetching records from one database and then doing lookups
 against another DB? That makes this a completely different problem.
 
 The DIH does not to my knowledge have the ability to pool these
 queries. That is, it will not build a batch of 1000 keys from
 datasource1 and then do a query against datasource2 with:
 select foo from some_table where key_field IN (key1, key2, ... key1000);
 
 This is the efficient way to do what you want. You'll have to write
 your own client to do this.
 
 On Wed, Jun 2, 2010 at 12:00 PM, David Stuart
 david.stu...@progressivealliance.co.uk wrote:
 How long does it take to do a grab of all the data via SQL? I found that
 denormalizing the data into a lookup table meant that I was able to index
 about 300k rows of similar data size, with DIH regex splitting on some
 fields, in about 8 minutes. I know it's not quite the same scale, but with
 batching...

 David Stuart

 On 2 Jun 2010, at 17:58, Blargy zman...@hotmail.com wrote:




 One thing that might help indexing speed - create a *single* SQL query
 to grab all the data you need without using DIH's sub-entities, at
 least the non-cached ones.


 Not sure how much that would help. As I mentioned, without the item
 description import the full process takes 4 hours, which is bearable.
 However, once I started to import the item description, which is located
 on a separate machine/database, the import process exploded to over 24 hours.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
 Sent from the Solr - User mailing list archive at Nabble.com.

 
 
 
 -- 
 Lance Norskog
 goks...@gmail.com
 

What's more efficient: a batch size of 1000 or -1 for MySQL? Is this why it's
so slow, because I am using 2 different datasources?

Say I am using just one datasource, should I still be seeing "Creating a
connection for entity ..." for each sub-entity in the document, or should it
just be using one connection?




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866499.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


Erik Hatcher-4 wrote:
 
 One thing that might help indexing speed - create a *single* SQL query  
 to grab all the data you need without using DIH's sub-entities, at  
 least the non-cached ones.
 
   Erik
 
 On Jun 2, 2010, at 12:21 PM, Blargy wrote:
 


 As a data point, I routinely see clients index 5M items on normal
 hardware
 in approx. 1 hour (give or take 30 minutes).

 Also wanted to add that our main entity (item) consists of 5 sub-entities
 (ie, joins). 2 of those 5 are fairly small so I am using
 CachedSqlEntityProcessor for them but the other 3 (which includes
 item_description) are normal.

 All the entities minus the item_description connect to datasource1. They
 currently point to one physical machine although we do have a pool of 3 DB's
 that could be used if it helps. The other entity, item_description, uses
 datasource2, which has a pool of 2 DB's that could potentially be used. Not
 sure if that would help or not.

 I might as well add that the item description will have indexed, stored
 and term vectors set to true.
 -- 
 View this message in context:
 http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

I can't find any example of creating this massive SQL query. Are there any
out there? Will batching still work with this massive query?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy

Would dumping the databases to a local file help at all?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Importing large datasets

2010-06-01 Thread Blargy

We have around 5 million items in our index and each item has a description
located on a separate physical database. These item descriptions vary in
size and for the most part are quite large. Currently we are only indexing
items and not their corresponding description and a full import takes around
4 hours. Ideally we want to index both our items and their descriptions but
after some quick profiling I determined that a full import would take in
excess of 24 hours. 

- How would I profile the indexing process to determine if the bottleneck is
Solr or our Database.
- In either case, how would one speed up this process? Is there a way to run
parallel import processes and then merge them together at the end? Possibly
use some sort of distributed computing?

Any ideas? Thanks.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p863447.html
Sent from the Solr - User mailing list archive at Nabble.com.