Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-03 Thread Mike L.
Hey Shawn / Solr User Group,
 
This makes perfect sense to me. Thanks for the thorough answer.  
The CSV update handler works at a lower level than the DataImport handler, and 
doesn't have "clean" or "full-import" options (DIH's full-import defaults to 
clean=true). The DIH is like a full application embedded inside Solr, one that 
uses an update handler -- it is not itself an update handler. When clean=true, 
or when full-import is used without a clean option, DIH itself sends a 
delete-all-documents update request.
 
And similarly, my assumption is that in the event of a non-syntactical 
failure/interruption (such as a server crash) during the CSV update, a rollback 
(stream.body=<rollback/>) would also need to be requested manually (or 
automated outside of Solr), whereas the DIH automates this request on my behalf 
as well...? Is there any way to detect such a failure or interruption? A real 
example: I was in the process of indexing data via the CSV update and somebody 
bounced the server before it completed. No actual errors were produced, but it 
appeared that the CSV update process stopped at the point of the reboot. My 
assumption is that if I had sent a rollback, I'd get back the previously 
indexed data, given that I didn't request a delete beforehand (haven't tested 
this yet). But I'm wondering how I could detect this automatically. This, I 
guess, is where the DIH starts gaining some merit. Also, the response the DIH 
produces when the indexing process is complete appears to be a lot more mature, 
in that it explicitly states that the import completed and that the information 
can be re-queried. It would be nice if the CSV update provided a similar 
response... my assumption is it would first need to know how many lines exist 
in the file in order to know whether or not the job actually completed.
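
For what it's worth, the kind of check I have in mind is roughly the following 
(untested sketch against the SolrJ 3.5-era API; the URL and file path are the 
placeholders from my example above, and it assumes the core was empty before 
the load): count the lines that were fed in, compare against numDocs after the 
commit, and roll back or re-run if they don't match.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CsvLoadCheck {
    public static void main(String[] args) throws Exception {
        // Count the lines we expected to index (header=false, so every line is a record).
        long expected = 0;
        BufferedReader in =
            new BufferedReader(new FileReader("/location/of/file/on/server/file.csv"));
        while (in.readLine() != null) {
            expected++;
        }
        in.close();

        // Ask Solr how many documents are actually searchable after the commit.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://server:port/appname/solrcore");
        long indexed = solr.query(new SolrQuery("*:*").setRows(0)).getResults().getNumFound();

        if (indexed < expected) {
            System.out.println("Load looks incomplete: " + indexed + " of " + expected);
            solr.rollback();  // discard any uncommitted adds left over from the interrupted load
        }
    }
}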
 
Also - aside from Solr initiating a delete when it encounters the same 
UniqueKey, is there anything else that could cause a delete to be initiated by 
Solr?

Lastly, is there any concern with running multiple CSV update requests on 
different data files containing different data?

Thanks in advance. This was very helpful.

Mike
 


From: Shawn Heisey s...@elyograg.org
To: solr-user@lucene.apache.org 
Sent: Monday, July 1, 2013 2:30 PM
Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


On 7/1/2013 12:56 PM, Mike L. wrote:
  Hey Ahmet / Solr User Group,

    I tried using the built-in UpdateCSV and it runs A LOT faster than a 
 FileDataSource DIH, as illustrated below. However, I am a bit confused about 
 the numDocs/maxDoc values when doing an import this way. Here's my GET command 
 against a tab-delimited file (I removed server info and additional fields; 
 everything else is the same):

 http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


 My response from solr

 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
 </response>

 I am experimenting with 2 csv files (1 with 10 records, the other with 1000) 
 to see if I can get this to run correctly before running my entire collection 
 of data. I initially loaded the first 1000 records to an empty core and that 
 seemed to work; however, when running the above with a csv file that has 
 10 records, I would like to see only 10 active records in my core. What I get 
 instead, when looking at my stats page:

 numDocs 1000
 maxDoc 1010

 If I run the same url above while appending an 'optimize=true', I get:

 numDocs 1000,
 maxDoc 1000.

A discrepancy between numDocs and maxDoc indicates that there are 
deleted documents in your index.  You might already know this, so here's 
an answer to what I think might be your actual question:

If you want to delete the 1000 existing documents before adding the 10 
documents, then you have to actually do that deletion.  The CSV update 
handler works at a lower level than the DataImport handler, and doesn't 
have "clean" or "full-import" options (DIH's full-import defaults to 
clean=true).  The DIH is like a full application embedded inside Solr, 
one that uses an update handler -- it is not itself an update handler.  
When clean=true, or when full-import is used without a clean option, DIH 
itself sends a delete-all-documents update request.

If you didn't already know the bit about the deleted documents, then 
read this:

It can be normal for indexing new documents to cause deleted 
documents.  This happens when you have the same value in your UniqueKey 
field as documents that are already in your index.  Solr knows by the 
config you gave it that they are the same document, so it deletes the 
old one before adding the new one.  Solr has no way to know whether the 
document it already had or the document you are adding is more current, 
so it assumes you know what you are doing and takes care of the deletion 
for you.

When you optimize your index, deleted documents are purged, which is why 
the numbers match there.

Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-03 Thread Shalin Shekhar Mangar
The split/group implementation in RegexTransformer is not as efficient
as CSVLoader. Perhaps we need a specialized csv loader in DIH.
SOLR-2549 aims to add this support. I'll take a look.

On Tue, Jul 2, 2013 at 12:26 AM, Mike L. javaone...@yahoo.com wrote:
  Hey Ahmet / Solr User Group,

I tried using the built-in UpdateCSV and it runs A LOT faster than a 
 FileDataSource DIH, as illustrated below. However, I am a bit confused about 
 the numDocs/maxDoc values when doing an import this way. Here's my GET 
 command against a tab-delimited file (I removed server info and additional 
 fields; everything else is the same):

 http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


 My response from solr

 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
 </response>

 I am experimenting with 2 csv files (1 with 10 records, the other with 1000) 
 to see if I can get this to run correctly before running my entire collection 
 of data. I initially loaded the first 1000 records to an empty core and that 
 seemed to work; however, when running the above with a csv file that has 
 10 records, I would like to see only 10 active records in my core. What I get 
 instead, when looking at my stats page:

 numDocs 1000
 maxDoc 1010

 If I run the same url above while appending an 'optimize=true', I get:

 numDocs 1000,
 maxDoc 1000.

 Perhaps the commit=true is not doing what it's supposed to, or am I missing 
 something? I also tried passing a commit afterward like this:
 http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (didn't 
 seem to do anything either)


 From: Ahmet Arslan iori...@yahoo.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Mike L. 
 javaone...@yahoo.com
 Sent: Saturday, June 29, 2013 7:20 AM
 Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


 Hi Mike,


 You could try http://wiki.apache.org/solr/UpdateCSV

 And make sure you commit at the very end.




 
 From: Mike L. javaone...@yahoo.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Saturday, June 29, 2013 3:15 AM
 Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5



 I've been working on improving index time with a JdbcDataSource DIH-based 
 config and found it not to be as performant as I'd hoped, for various 
 reasons, not specifically due to solr. With that said, I decided to switch 
 gears a bit and test out a FileDataSource setup... I assumed that by 
 eliminating network latency, I should see drastic improvements in import 
 time... but I'm a bit surprised that this process seems to run much slower, 
 at least the way I've initially coded it (below).

 Below is a barebones file import that I wrote which consumes a tab-delimited 
 file. Nothing fancy here. The regex just separates out the fields... Is there 
 a faster approach to doing this? If so, what is it?

 Also, what is the recommended approach for indexing/importing data? I know 
 that may come across as a vague question, as there are various options 
 available, but which one would be considered the standard approach within a 
 production enterprise environment?


 (below has been cleansed)

 <dataConfig>
   <dataSource name="file" type="FileDataSource" />
   <document>
     <entity name="entity1"
             processor="LineEntityProcessor"
             url="[location_of_file]/file.csv"
             dataSource="file"
             transformer="RegexTransformer,TemplateTransformer">
       <field column="rawLine"
              regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
              groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field10,field11,field12" />
     </entity>
   </document>
 </dataConfig>

 Thanks in advance,
 Mike




-- 
Regards,
Shalin Shekhar Mangar.


Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-01 Thread Mike L.
 Hey Ahmet / Solr User Group,
 
   I tried using the built-in UpdateCSV and it runs A LOT faster than a 
FileDataSource DIH, as illustrated below. However, I am a bit confused about the 
numDocs/maxDoc values when doing an import this way. Here's my GET command 
against a tab-delimited file (I removed server info and additional fields; 
everything else is the same):

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


My response from solr 

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
</response>
 
I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to 
see if I can get this to run correctly before running my entire collection of 
data. I initially loaded the first 1000 records to an empty core and that 
seemed to work; however, when running the above with a csv file that has 10 
records, I would like to see only 10 active records in my core. What I get 
instead, when looking at my stats page: 

numDocs 1000 
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000, 
maxDoc 1000.

Perhaps the commit=true is not doing what it's supposed to, or am I missing 
something? I also tried passing a commit afterward like this:
http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (didn't 
seem to do anything either)
 

From: Ahmet Arslan iori...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Mike L. 
javaone...@yahoo.com 
Sent: Saturday, June 29, 2013 7:20 AM
Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


Hi Mike,


You could try http://wiki.apache.org/solr/UpdateCSV 

And make sure you commit at the very end.





From: Mike L. javaone...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org 
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


 
I've been working on improving index time with a JdbcDataSource DIH-based 
config and found it not to be as performant as I'd hoped, for various reasons, 
not specifically due to solr. With that said, I decided to switch gears a bit 
and test out a FileDataSource setup... I assumed that by eliminating network 
latency, I should see drastic improvements in import time... but I'm a bit 
surprised that this process seems to run much slower, at least the way I've 
initially coded it (below).
 
Below is a barebones file import that I wrote which consumes a tab-delimited 
file. Nothing fancy here. The regex just separates out the fields... Is there 
a faster approach to doing this? If so, what is it?
 
Also, what is the recommended approach for indexing/importing data? I know 
that may come across as a vague question, as there are various options 
available, but which one would be considered the standard approach within a 
production enterprise environment?
 
 
(below has been cleansed)
 
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field10,field11,field12" />
    </entity>
  </document>
</dataConfig>
 
Thanks in advance,
Mike


Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-01 Thread Shawn Heisey

On 7/1/2013 12:56 PM, Mike L. wrote:

  Hey Ahmet / Solr User Group,

I tried using the built-in UpdateCSV and it runs A LOT faster than a 
FileDataSource DIH, as illustrated below. However, I am a bit confused about the 
numDocs/maxDoc values when doing an import this way. Here's my GET command 
against a tab-delimited file (I removed server info and additional fields; 
everything else is the same):

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


My response from solr

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
</response>

I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to 
see if I can get this to run correctly before running my entire collection of 
data. I initially loaded the first 1000 records to an empty core and that 
seemed to work; however, when running the above with a csv file that has 10 
records, I would like to see only 10 active records in my core. What I get 
instead, when looking at my stats page:

numDocs 1000
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000,
maxDoc 1000.


A discrepancy between numDocs and maxDoc indicates that there are 
deleted documents in your index.  You might already know this, so here's 
an answer to what I think might be your actual question:


If you want to delete the 1000 existing documents before adding the 10 
documents, then you have to actually do that deletion.  The CSV update 
handler works at a lower level than the DataImport handler, and doesn't 
have "clean" or "full-import" options (DIH's full-import defaults to 
clean=true).  The DIH is like a full application embedded inside Solr, 
one that uses an update handler -- it is not itself an update handler.  
When clean=true, or when full-import is used without a clean option, DIH 
itself sends a delete-all-documents update request.
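
If you do want that delete to happen with the CSV handler, you have to send it 
yourself before the load -- for example an update request with 
stream.body=<delete><query>*:*</query></delete> followed by a commit, or the 
SolrJ equivalent.  A minimal sketch (untested, SolrJ 3.5-era API; the URL is a 
placeholder):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CleanBeforeCsvLoad {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://server:port/appname/solrcore");
        solr.deleteByQuery("*:*");  // the manual equivalent of DIH's clean=true
        solr.commit();              // then run the /update/csv request as before
    }
}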


If you didn't already know the bit about the deleted documents, then 
read this:


It can be normal for indexing new documents to cause deleted 
documents.  This happens when you have the same value in your UniqueKey 
field as documents that are already in your index.  Solr knows by the 
config you gave it that they are the same document, so it deletes the 
old one before adding the new one.  Solr has no way to know whether the 
document it already had or the document you are adding is more current, 
so it assumes you know what you are doing and takes care of the deletion 
for you.
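
Concretely, something like this (untested sketch, SolrJ 3.5-era API; the URL is 
a placeholder and the field names are taken from your fieldnames parameter) 
shows the effect -- add the same id twice and the stats page will report 
numDocs=1 but maxDoc=2 until the old copy is purged:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OverwriteDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://server:port/appname/solrcore");

        SolrInputDocument first = new SolrInputDocument();
        first.addField("id", "42");             // id is the UniqueKey field
        first.addField("otherfields", "old value");
        solr.add(first);
        solr.commit();

        SolrInputDocument second = new SolrInputDocument();
        second.addField("id", "42");            // same key: Solr deletes the old doc
        second.addField("otherfields", "new value");
        solr.add(second);
        solr.commit();
        // Stats now show numDocs=1, maxDoc=2; a merge or optimize brings maxDoc back to 1.
    }
}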


When you optimize your index, deleted documents are purged, which is why 
the numbers match there.


Thanks,
Shawn



Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-06-29 Thread Ahmet Arslan
Hi Mike,


You could try http://wiki.apache.org/solr/UpdateCSV 

And make sure you commit at the very end.





 From: Mike L. javaone...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org 
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5
 

 
I've been working on improving index time with a JdbcDataSource DIH-based 
config and found it not to be as performant as I'd hoped, for various reasons, 
not specifically due to solr. With that said, I decided to switch gears a bit 
and test out a FileDataSource setup... I assumed that by eliminating network 
latency, I should see drastic improvements in import time... but I'm a bit 
surprised that this process seems to run much slower, at least the way I've 
initially coded it (below).
 
Below is a barebones file import that I wrote which consumes a tab-delimited 
file. Nothing fancy here. The regex just separates out the fields... Is there 
a faster approach to doing this? If so, what is it?
 
Also, what is the recommended approach for indexing/importing data? I know 
that may come across as a vague question, as there are various options 
available, but which one would be considered the standard approach within a 
production enterprise environment?
 
 
(below has been cleansed)
 
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field10,field11,field12" />
    </entity>
  </document>
</dataConfig>
 
Thanks in advance,
Mike

Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-06-29 Thread Erick Erickson
Mike:

One issue is that you're forcing all the work onto the Solr
server, and single-threading to boot by using DIH. You can
consider moving to a SolrJ model where you can have
N clients sending data to Solr if you can partition the data
up amongst the N clients cleanly.
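
Very roughly, something like this (untested sketch against the SolrJ 3.5-era 
API; the URL, the chunk file names and the column-to-field mapping are all 
placeholders -- split the file and map the columns however fits your schema):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        final String solrUrl = "http://server:port/appname/solrcore";
        final int clients = 4;  // N clients; file pre-split into file.csv.0 .. file.csv.3
        ExecutorService pool = Executors.newFixedThreadPool(clients);

        for (int i = 0; i < clients; i++) {
            final String chunk = "/location/of/file/on/server/file.csv." + i;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        CommonsHttpSolrServer solr = new CommonsHttpSolrServer(solrUrl);
                        BufferedReader in = new BufferedReader(new FileReader(chunk));
                        String line;
                        while ((line = in.readLine()) != null) {
                            String[] cols = line.split("\t", -1);
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", cols[0]);  // map remaining columns as needed
                            solr.add(doc);                // batching the adds is faster still
                        }
                        in.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        new CommonsHttpSolrServer(solrUrl).commit();  // one commit at the very end
    }
}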

FWIW,
Erick


On Sat, Jun 29, 2013 at 8:20 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Mike,


 You could try http://wiki.apache.org/solr/UpdateCSV

 And make sure you commit at the very end.




 
  From: Mike L. javaone...@yahoo.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Saturday, June 29, 2013 3:15 AM
 Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5



 I've been working on improving index time with a JdbcDataSource DIH-based
 config and found it not to be as performant as I'd hoped, for various
 reasons, not specifically due to solr. With that said, I decided to switch
 gears a bit and test out a FileDataSource setup... I assumed that by
 eliminating network latency, I should see drastic improvements in import
 time... but I'm a bit surprised that this process seems to run much slower,
 at least the way I've initially coded it (below).

 Below is a barebones file import that I wrote which consumes a
 tab-delimited file. Nothing fancy here. The regex just separates out the
 fields... Is there a faster approach to doing this? If so, what is it?

 Also, what is the recommended approach for indexing/importing data?
 I know that may come across as a vague question, as there are various
 options available, but which one would be considered the standard
 approach within a production enterprise environment?


 (below has been cleansed)

 <dataConfig>
   <dataSource name="file" type="FileDataSource" />
   <document>
     <entity name="entity1"
             processor="LineEntityProcessor"
             url="[location_of_file]/file.csv"
             dataSource="file"
             transformer="RegexTransformer,TemplateTransformer">
       <field column="rawLine"
              regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
              groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field10,field11,field12" />
     </entity>
   </document>
 </dataConfig>

 Thanks in advance,
 Mike



FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-06-28 Thread Mike L.
 
I've been working on improving index time with a JdbcDataSource DIH-based 
config and found it not to be as performant as I'd hoped, for various reasons, 
not specifically due to solr. With that said, I decided to switch gears a bit 
and test out a FileDataSource setup... I assumed that by eliminating network 
latency, I should see drastic improvements in import time... but I'm a bit 
surprised that this process seems to run much slower, at least the way I've 
initially coded it (below).
 
Below is a barebones file import that I wrote which consumes a tab-delimited 
file. Nothing fancy here. The regex just separates out the fields... Is there 
a faster approach to doing this? If so, what is it?
 
Also, what is the recommended approach for indexing/importing data? I know 
that may come across as a vague question, as there are various options 
available, but which one would be considered the standard approach within a 
production enterprise environment?
 
 
(below has been cleansed)
 
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field10,field11,field12" />
    </entity>
  </document>
</dataConfig>
 
Thanks in advance,
Mike