automatic delta imports?

2011-12-19 Thread Mark Juszczec
Hello all

I'm seeing the following in my web server log file:

[2011-12-19 08:57:00.016] [customersIndex] webapp=/solr path=/dataimport
params={command=delta-import&commit=true&optimize=true} status=0 QTime=3
[2011-12-19 08:57:00.018] Starting Delta Import
[2011-12-19 08:57:00.018] Read dataimport.properties
[2011-12-19 08:57:00.019] Starting delta collection.
[2011-12-19 08:57:00.019] Running ModifiedRowKey() for Entity: CUSTOMERS
[2011-12-19 08:57:00.019] [ordersIndex] webapp=/solr path=/dataimport
params={command=delta-import&commit=true&optimize=true} status=0 QTime=1
[2011-12-19 08:57:00.023] Starting Delta Import
[2011-12-19 08:57:00.023] Creating a connection for entity CUSTOMERS with
URL: jdbc:oracle:thin:@(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST =
somehost)(PORT = 1521))(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME =
someservice)))
[2011-12-19 08:57:00.023] Read dataimport.properties
[2011-12-19 08:57:00.024] Starting delta collection.
[2011-12-19 08:57:00.024] Running ModifiedRowKey() for Entity: item2
[2011-12-19 08:57:00.025] Creating a connection for entity item2 with URL:
jdbc:oracle:thin:@(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST =
somehost)(PORT
= 1521))(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME = someservice)))
[2011-12-19 08:57:01.005] Time taken for getConnection(): 982
[2011-12-19 08:57:01.010] Time taken for getConnection(): 985
[2011-12-19 08:57:01.216] Completed ModifiedRowKey for Entity: CUSTOMERS
rows obtained : 0
[2011-12-19 08:57:01.217] Completed DeletedRowKey for Entity: CUSTOMERS
rows obtained : 0
[2011-12-19 08:57:01.217] Completed parentDeltaQuery for Entity: CUSTOMERS
[2011-12-19 08:57:01.217] Delta Import completed successfully
[2011-12-19 08:57:01.217] {} 0 3
[2011-12-19 08:57:01.217] Time taken = 0:0:1.199
[2011-12-19 08:57:02.226] Completed ModifiedRowKey for Entity: item2 rows
obtained : 0
[2011-12-19 08:57:02.227] Completed DeletedRowKey for Entity: item2 rows
obtained : 0
[2011-12-19 08:57:02.227] Completed parentDeltaQuery for Entity: item2
[2011-12-19 08:57:02.227] Delta Import completed successfully
[2011-12-19 08:57:02.227] {} 0 1
[2011-12-19 08:57:02.227] Time taken = 0:0:2.204

It sure looks like it's an automatic delta import running on my 2 indexes.
 But I thought the docs said Solr didn't provide for automatic delta
imports?  Is this a new feature in 3.2?  I've got multiple 3.2 instances
running and this is the only one whose logs show this message.  Have I
turned something on accidentally?  If so, which config files contain these
settings?  I want to turn this on for the other Solr instances.
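
If it isn't Solr itself doing this, my guess is that something outside Solr
is hitting the handler on a schedule.  A hypothetical cron entry like the
following (host, core name and interval invented for illustration) would
produce exactly the log lines above:

*/5 * * * * curl -s 'http://somehost:8983/solr/customersIndex/dataimport?command=delta-import&commit=true&optimize=true'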

Mark


case insensitive searches

2011-10-30 Thread Mark Juszczec
Hello all

According to the docs, I need to use solr.LowerCaseTokenizerFactory

Does anyone have any experience with it?  Can anyone comment on pitfalls or
things to beware of?

Does anyone know of any examples I can look at?
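
For reference, the shape I'm picturing in schema.xml is something like the
following (the type and field names here are invented, not from our config):

<fieldType name="text_ci" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="CUSTOMER_NM_CI" type="text_ci" indexed="true" stored="true"/>

My understanding is that with a single <analyzer> the same chain runs at
index and query time, so both sides get lowercased and matching becomes
case insensitive.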

Thanks

Mark


Re: solr equivalent of select distinct

2011-09-11 Thread Mark juszczec
Erick

Thanks very much for the reply.

I typed this late Friday after work and tried to simplify the problem
description.  I got something wrong.  Hopefully this restatement is better:

My PK is FLD1, FLD2 and FLD3 concatenated together.

In some cases FLD1 and FLD2 can be the same, the ONLY differing field being
FLD3.

Here's an example:

PK    FLD1  FLD2  FLD3  FLD4  FLD5
AB0   A     B     0     x     y
AB1   A     B     1     x     y
CD0   C     D     0     a     b
CD1   C     D     1     e     f

I want to write a query using only the terms FLD1 and FLD2 and ONLY get
back:

A B x y
C D a b
C D e f

Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one
occurrence of those records.

Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH
occurrences of those records.

I'm hoping I can use wildcards to get FLD4 and FLD5.  If not, I can use fl=

I'm using edismax.

We are also creating the query string on the fly.  I suspect using SolrJ and
plugging the values into a bean would be easier - or do I have that wrong?

I hope the tables of example data display properly.
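
One idea I'm toying with, purely a sketch (the DISTINCT_KEY field, the
transformer wiring and the entity name are all invented): build a single
field at import time out of the fields whose combination should be unique,
and group on it so each distinct combination comes back once.

In the DIH config, with TemplateTransformer:

<entity name="item1" transformer="TemplateTransformer" query="...">
  <field column="DISTINCT_KEY"
         template="${item1.FLD1}_${item1.FLD2}_${item1.FLD4}_${item1.FLD5}"/>
</entity>

(DISTINCT_KEY would also need to be declared as an indexed string field in
schema.xml.)  Then query with:

q=FLD1:A AND FLD2:B&group=true&group.field=DISTINCT_KEY&group.main=true&fl=FLD1,FLD2,FLD4,FLD5

That should give one document for the AB rows (same FLD4/FLD5) and two for
the CD rows (different FLD4/FLD5), which is the select distinct behaviour
described above.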

Mark

On Sun, Sep 11, 2011 at 12:06 PM, Erick Erickson erickerick...@gmail.com wrote:

 This smells like an XY problem, can you back up and give a higher-level
 reason *why* you want this behavior?

 Because given your problem description, this seems like you are getting
 correct behavior no matter how you define the problem. You're essentially
 saying that you have two records with identical beginnings of your PK,
 why is it incorrect to give you both records?

 But, anyway, if you're searching on FLD1 and FLD2, then by definition
 you're going to get both records back or the search would be failing!

 Best
 Erick

 On Fri, Sep 9, 2011 at 8:08 PM, Mark juszczec mark.juszc...@gmail.com
 wrote:
  Hello everyone
 
  Let's say each record in my index contains fields named PK, FLD1, FLD2,
 FLD3
  ... FLD100
 
  PK is my solr primary key and I'm creating it by concatenating
  FLD1+FLD2+FLD3 and I'm guaranteed that combination will be unique
 
  Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure
 about
  the remaining fields
 
  Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get
 both
  records.  I only want 1.
 
  Research says I should use faceting.  But this:
 
  q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
  facet=true & facet_field=FLD1 & facet_field=FLD2
 
  gives me 2 records.
 
  In fact, it gives me the same results as:
 
  q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2
 
  I'm wrong somewhere, but I'm unsure where.
 
  Is faceting the right way to go or should I be using grouping?
 
  Curiously, when I use grouping like this:
 
  q=FLD1:A and FLD2:B & rows=500 & defType=edismax & indent=true &
  fl=FLD1, FLD2 & group=true & group.field=FLD1 & group.field=FLD2
 
  I get 2 records as well.
 
  Has anyone dealt with mimicking select distinct in Solr?
 
  Any advice would be very appreciated.
 
  Mark
 



Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec
Erick

My field contains "a b" (without the quotes)

We are trying to assemble the query as a String by appending the various
values.  I think that is a large part of the problem and our lives would be
easier if we let the Solr api do this work.

We've experimented with our query assembler producing

field:a+b

We've also tried making it create

field:a\ b

The first case just does not work and I'm unsure why.

The second case ends up url encoding the \ and I'm unsure if that will cause
it to be used in the query or not.
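
In case a concrete sketch helps: this is roughly what we are doing, redone
with SolrJ instead of hand-built strings (the server URL and the field name
"field" are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;

public class EmbeddedSpaceQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Phrase form: the double quotes keep "a b" together as one phrase.
        SolrQuery phrase = new SolrQuery("field:\"a b\"");

        // Escaped form: backslash-escapes the space and other special characters.
        SolrQuery escaped =
            new SolrQuery("field:" + ClientUtils.escapeQueryChars("a b"));

        QueryResponse rsp = server.query(phrase);
        System.out.println(rsp.getResults().getNumFound());
    }
}

Either way SolrJ takes care of the URL encoding, which is the part we keep
getting wrong by hand.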

Mark



On Sun, Sep 11, 2011 at 12:10 PM, Erick Erickson erickerick...@gmail.com wrote:

 Try escaping it for a start.

 But why do you want to? If it's a phrase query, enclose it in double
 quotes.
 You really have to provide more details, because there are too many
 possibilities
 to answer. For instance:

 If you're entering field:a b then 'b' will be searched against your
 default text field
 and you should enter field:(a b) or field:a field:b

 If you've tokenized the field, you shouldn't care.

 If you're using keywordanalyzer, escaping should work.

 Etc.
 

 Best
 Erick

 On Fri, Sep 9, 2011 at 8:11 PM, Mark juszczec mark.juszc...@gmail.com
 wrote:
  Hi folks
 
  I've got a field that contains 2 words separated by a single blank.
 
  What's the trick to creating a search string that contains the single
 blank?
 
  Mark
 



Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec

 But as Erick says, it's not clear that's really what you want (to
 search on a single term with a space in it).  If it's a normal text
 field, each word will be indexed separately, so you really want a
 phrase query or a boolean query:

 field:"a b"
 or
 field:(a b)


I am looking for a text string with a single, embedded space.  For the
purposes of this example, it is "a b" and it's stored in the index in a field
called "field".

Am I incorrect in assuming the query field:"a b" will match the string "a"
followed by a single embedded space followed by "b"?

I'm also wondering if this is already handled by the Solr/SolrJ API and if
we are making our lives more difficult by assembling the query strings
ourselves.
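
(For the archive: the raw query we think we want is field:"a b", which on
the wire becomes q=field%3A%22a+b%22 once URL-encoded; "field" is just the
placeholder name from the example.)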

Mark


 -Yonik
 http://www.lucene-eurocon.com - The Lucene/Solr User Conference



Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec
That's what I thought.  The problem is, it's not, and I am unsure what is
wrong.



On Sun, Sep 11, 2011 at 1:35 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Sun, Sep 11, 2011 at 1:15 PM, Mark juszczec mark.juszc...@gmail.com
 wrote:
  I am looking for a text string with a single, embedded space.  For the
  purposes of this example, it is "a b" and it's stored in the index in a
  field called "field".
 
  Am I incorrect in assuming the query field:"a b" will match the string
  "a" followed by a single embedded space followed by "b"?

 Yes, that should work regardless of how the field is indexed (as a big
 single token, or as a normal text field that doesn't preserve spaces).

 -Yonik
 http://www.lucene-eurocon.com - The Lucene/Solr User Conference



Re: searching for terms containing embedded spaces

2011-09-11 Thread Mark juszczec
The field's properties are:

<field name="CUSTOMER_TYPE_NM" type="string" indexed="true" stored="true"
required="true" default="CUSTOMER_TYPE_NM_MISSING"/>

There have been no changes since I last completely rebuilt the index.

Is re-indexing done when an index is completely rebuilt with a
dataimport=full?  How about if we've done dataimport=delta?

If it helps, this is what I get when I print out the ModifiableSolrParams
object I'm sending to the query method:

q=+*%3A*++AND+CUSTOMER_TYPE_NM%3ANetwork+Advertiser+AND+ACTIVE_IND%3A1&defType=edismax&rows=500&sort=ACCOUNT_CUSTOMER_ID+asc&start=0
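
Looking at that, CUSTOMER_TYPE_NM:Network Advertiser isn't quoted, so the
next thing I'm going to try (just a guess at this point) is the phrase form
CUSTOMER_TYPE_NM:"Network Advertiser", i.e.

q=+*%3A*++AND+CUSTOMER_TYPE_NM%3A%22Network+Advertiser%22+AND+ACTIVE_IND%3A1&defType=edismax&rows=500&sort=ACCOUNT_CUSTOMER_ID+asc&start=0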

Mark

On Sun, Sep 11, 2011 at 2:05 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Sun, Sep 11, 2011 at 1:39 PM, Mark juszczec mark.juszc...@gmail.com
 wrote:
  That's what I thought.  The problem is, it's not, and I am unsure what is
  wrong.

 What is the fieldType definition for that field?  Did you change it
 without re-indexing?

 -Yonik
 http://www.lucene-eurocon.com - The Lucene/Solr User Conference



solr equivalent of select distinct

2011-09-09 Thread Mark juszczec
Hello everyone

Let's say each record in my index contains fields named PK, FLD1, FLD2, FLD3
 ... FLD100

PK is my solr primary key and I'm creating it by concatenating
FLD1+FLD2+FLD3 and I'm guaranteed that combination will be unique

Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure about
the remaining fields

Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get both
records.  I only want 1.

Research says I should use faceting.  But this:

q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
facet=true & facet_field=FLD1 & facet_field=FLD2

gives me 2 records.

In fact, it gives me the same results as:

q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2

I'm wrong somewhere, but I'm unsure where.

Is faceting the right way to go or should I be using grouping?

Curiously, when I use grouping like this:

q=FLD1:A and FLD2:B & rows=500 & defType=edismax & indent=true & fl=FLD1, FLD2
& group=true & group.field=FLD1 & group.field=FLD2

I get 2 records as well.

Has anyone dealt with mimicking select distinct in Solr?

Any advice would be very appreciated.

Mark


searching for terms containing embedded spaces

2011-09-09 Thread Mark juszczec
Hi folks

I've got a field that contains 2 words separated by a single blank.

What's the trick to creating a search string that contains the single blank?

Mark


edismax, inconsistencies with implicit/explicit AND when used with explicit OR

2011-08-09 Thread Mark juszczec
Hello all

We've just switched from the default parser to the edismax parser, and a user
has noticed some inconsistencies when using implicit/explicit ANDs, ORs and
grouping of search terms in parentheses.

First, the default query operator is AND.  I switched it from OR today.

The query:

http://cn-nyc1-ad-dev1.cnet.com:8983/solr/customersJoin/select?indent=on&version=3.3&q=CUSTOMER_NM:*IBM*%20CUSTOMER_NM:*Software*%20OR%20CUSTOMER_NM:*something*&fq=&start=0&rows=10&fl=*%2Cscore&defType=edismax&wt=&explainOther=&hl.fl=


returns 1053 results.  Some have only IBM in CUSTOMER_NM, some have only
Software in the name, some have both.


However, when I explicitly specify an AND between CUSTOMER_NM:*IBM* and
CUSTOMER_NM:*Software* :


customersJoin/select?indent=on&version=3.3&q=CUSTOMER_NM:*IBM*%20AND%20CUSTOMER_NM:*Software*%20OR%20CUSTOMER_NM:*something*&fq=&start=0&rows=10&fl=*%2Cscore&defType=edismax&wt=&explainOther=&hl.fl=

I only get 3 results and all of them contain both IBM and Software.

I found this reference to inconsistencies with edismax, but I'm not sure it
explains this situation 100%.

http://lucene.472066.n3.nabble.com/edismax-inconsistency-AND-OR-td2131795.html

Have I found a bug or am I doing something terribly wrong?
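
One workaround I plan to try in the meantime (untested) is forcing the
grouping explicitly rather than relying on operator precedence:

q=(CUSTOMER_NM:*IBM* AND CUSTOMER_NM:*Software*) OR CUSTOMER_NM:*something*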

Mark


edismax configuration

2011-08-08 Thread Mark juszczec
Hello all

Can someone direct me to a link with config info in order to allow use of
the edismax QueryHandler?

Mark


Re: edismax configuration

2011-08-08 Thread Mark juszczec
Got it.  Thank you.

I thought this was going to be much more difficult than it actually was.
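
For anyone finding this later: passing defType=edismax on the request was
all it took.  If we decide we want it on by default, my understanding
(untested) is that it can also go into the handler defaults in
solrconfig.xml, something along these lines:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
  </lst>
</requestHandler>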

Mark

On Mon, Aug 8, 2011 at 4:50 PM, Markus Jelsma markus.jel...@openindex.io wrote:

 http://wiki.apache.org/solr/CommonQueryParameters#defType

  Hello all
 
  Can someone direct me to a link with config info in order to allow use of
  the edismax QueryHandler?
 
  Mark



deleting index directory/files

2011-08-04 Thread Mark juszczec
Hello all

I'm using multiple cores.  There's a directory named after each core; it
contains a subdir named data, which contains a subdir named index, which
contains the files that hold the data for my index.

Let's say I want to completely rebuild the index from scratch.

Can I delete the dir named index?  I know the next thing I'd have to do is a
full data import, and that's ok.  I want to blow away any traces of the
core's previous existence.
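
An alternative I'm considering, in case deleting the directory out from
under a running core turns out to be a bad idea: issue a delete-all plus
commit against the core and then do the full import.  Something like this
(host and core name are placeholders):

curl 'http://somehost:8983/solr/mycore/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<delete><query>*:*</query></delete>'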

Mark


field with repeated data in index

2011-07-28 Thread Mark juszczec
Hello all

I created an index consisting of orders and the names of the salesmen who
are responsible for the order.

As you can imagine, the same name can be associated with many different
orders.

No problem.  Until I try to do a faceted search on the salesman name field.
 Right now, I have the data indexed as follows:

<field name="PRIMARY_AC" type="string" indexed="false" stored="true"
required="true" default="PRIMARY_AC unavailable"/>

My faceted search gives me the following response:

response={responseHeader={status=0,QTime=358,params={facet=on,indent=true,q=*:*,facet.field=PRIMARY_AC,wt=javabin,rows=0,version=2}},response={numFound=954178,start=0,docs=[]},facet_counts={facet_queries={},facet_fields={PRIMARY_AC={}},facet_dates={},facet_ranges={}}}

Which just isn't right.  I KNOW there's data in there, but am confused as to
how to properly identify it to Solr.

Any suggestions?

Mark


Re: field with repeated data in index

2011-07-28 Thread Mark juszczec
James

Wow.  That was fast.  Thanks!

But I thought you couldn't index a field that has duplicate values?
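
So presumably the definition would just become something like (a sketch,
same field with indexed flipped on):

<field name="PRIMARY_AC" type="string" indexed="true" stored="true"
required="true" default="PRIMARY_AC unavailable"/>

followed by a re-index, of course.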

Mark


On Thu, Jul 28, 2011 at 4:53 PM, Dyer, James james.d...@ingrambook.com wrote:

 You need to index the field you want to facet on.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Mark juszczec [mailto:mark.juszc...@gmail.com]
 Sent: Thursday, July 28, 2011 3:50 PM
 To: solr-user@lucene.apache.org
 Subject: field with repeated data in index

 Hello all

 I created an index consisting of orders and the names of the salesmen who
 are responsible for the order.

 As you can imagine, the same name can be associated with many different
 orders.

 No problem.  Until I try to do a faceted search on the salesman name field.
  Right now, I have the data indexed as follows:

 <field name="PRIMARY_AC" type="string" indexed="false" stored="true"
 required="true" default="PRIMARY_AC unavailable"/>

 My faceted search gives me the following response:


 response={responseHeader={status=0,QTime=358,params={facet=on,indent=true,q=*:*,facet.field=PRIMARY_AC,wt=javabin,rows=0,version=2}},response={numFound=954178,start=0,docs=[]},facet_counts={facet_queries={},facet_fields={PRIMARY_AC={}},facet_dates={},facet_ranges={}}}

 Which just isn't right.  I KNOW there's data in there, but am confused as
 to
 how to properly identify it to Solr.

 Any suggestions?

 Mark



updating existing data in index vs inserting new data in index

2011-07-07 Thread Mark juszczec
Hello all

I'm using Solr 3.2 and am confused about updating existing data in an index.

According to the DataImportHandler Wiki:

delta-import: For incremental imports and change detection run the
command http://host:port/solr/dataimport?command=delta-import . It
supports the same clean, commit, optimize and debug parameters as
full-import command.

I know delta-import will find new data in the database and insert it into
the index.  My problem is how it handles updates: I've got a record that
exists in both the index and the database, the database record is changed,
and I want to incorporate those changes into the existing record in the
index.  IOW I don't want to insert it again.

I've tried this and wound up with 2 records with the same key in the index.
 The first contains the original db values found when the index was created,
the 2nd contains the db values after the record was changed.

I've also found this (the subject is 'Delta-import with solrj client'):
http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.472066.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html

Greetings. I have a solrj client for fetching data from database. I am
using delta-import for fetching data. If a column is changed in database
using timestamp with delta-import i get the latest column indexed but
there are duplicate values in the index similar to the column but the data
is older. This works with cleaning the index but i want to update the index
without cleaning it. Is there a way to just update the index with the
updated column without having duplicate values. Appreciate for any
feedback.

Hando

There are 2 responses:

Short answer is no, there isn't a way. Solr doesn't have the concept of
'Update' to an indexed document. You need to add the full document (all
'columns') each time any one field changes. If doing that in your
DataImportHandler logic is difficult you may need to write a separate Update
Service that does:

1) Read UniqueID, UpdatedColumn(s) from database
2) Using UniqueID Retrieve document from Solr
3) Add/Update field(s) with updated column(s)
4) Add document back to Solr

Although, if you use DIH to do a full import, using the same query in
your Delta-Import to get the whole document shouldn't be that
difficult.

and

Hi,

Make sure you use a proper ID field, which does *not* change even if the
content in the database changes. In this way, when your delta-import fetches
changed rows to index, they will update the existing rows in your index.

I have an ID field that doesn't change.  It is the primary key field from
the database table I am trying to index and I have verified it is unique.

So, does Solr allow updates (not inserts) of existing records?  Is anyone
able to do this?
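
For reference, my understanding is that the replace-on-add behaviour hinges
on schema.xml declaring that field as the unique key, along the lines of
(the exact type in our schema may differ; this is just the shape):

<field name="ORDER_ID" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>ORDER_ID</uniqueKey>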

Mark


Re: updating existing data in index vs inserting new data in index

2011-07-07 Thread Mark juszczec
Bob

Thanks very much for the reply!

I am using a unique integer called order_id as the Solr index key.

My query, deltaQuery and deltaImportQuery are below:

<entity name="item1"
  pk="ORDER_ID"
  query="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind,
orders.order_dt, orders.cancel_dt, orders.account_manager_id,
orders.of_header_id, orders.order_status_lov_id, orders.order_type_id,
orders.approved_discount_pct, orders.campaign_nm,
orders.approved_by_cd, orders.advertiser_id, orders.agency_id from orders"

  deltaImportQuery="select 1 as TABLE_ID, orders.order_id,
orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
orders.account_manager_id, orders.of_header_id, orders.order_status_lov_id,
orders.order_type_id, orders.approved_discount_pct, orders.campaign_nm,
orders.approved_by_cd, orders.advertiser_id, orders.agency_id from orders
where orders.order_id = '${dataimporter.delta.ORDER_ID}'"

  deltaQuery="select orders.order_id from orders where orders.change_dt >
to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')">
</entity>

The test I am running is two part:

1.  After I do a full import of the index, I insert a brand new record (with
a never existed before order_id) in the database.  The delta import picks
this up just fine.

2.  After the full import, I modify a record with an order_id that already
shows up in the index.  I have verified there is only one record with this
order_id in both the index and the db before I do the delta update.

I guess the question is, am I screwing myself up by defining my own Solr
index key?  I want to, ultimately, be able to search on ORDER_ID in the Solr
index.  However, the docs say (I think) a field does not have to be the Solr
primary key in order to be searchable.  Would I be better off letting Solr
manage the keys?

Mark

On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford
bob.sandif...@sirsidynix.com wrote:

 What are you using as the unique id in your Solr index?  It sounds like you
 may have one value as your Solr index unique id, which bears no resemblance
 to a unique[1] id derived from your data...

 Or - another way to put it - what is it that makes these two records in
 your Solr index 'the same', and what are the unique id's for those two
 entries in the Solr index?  How are those id's related to your original
 data?

 [1] not only unique, but immutable.  I.E. if you update a row in your
 database, the unique id derived from that row has to be the same as it would
 have been before the update.  Otherwise, there's nothing for Solr to
 recognize as a duplicate entry, and do a 'delete' and 'insert' instead of
 just an 'insert'.

 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com


  -Original Message-
  From: Mark juszczec [mailto:mark.juszc...@gmail.com]
  Sent: Thursday, July 07, 2011 9:15 AM
  To: solr-user@lucene.apache.org
  Subject: updating existing data in index vs inserting new data in index
 
  Hello all
 
  I'm using Solr 3.2 and am confused about updating existing data in an
  index.
 
  According to the DataImportHandler Wiki:
 
  *delta-import* : For incremental imports and change detection run the
  command `http://host:port/solr/dataimport?command=delta-import . It
  supports the same clean, commit, optimize and debug parameters as
  full-import command.
 
  I know delta-import will find new data in the database and insert it
  into
  the index.  My problem is how it handles updates where I've got a record
  that exists in the index and the database, the database record is
  changed
  and I want to incorporate those changes in the existing record in the
  index.
   IOW I don't want to insert it again.
 
  I've tried this and wound up with 2 records with the same key in the
  index.
   The first contains the original db values found when the index was
  created,
  the 2nd contains the db values after the record was changed.
 
  I've also found this
  http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.4720
  66.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html
  the
  subject is 'Delta-import with solrj client'
 
  Greetings. I have a *solrj* client for fetching data from database. I
  am
  using *delta*-*import* for fetching data. If a column is changed in
  database
  using timestamp with *delta*-*import* i get the latest column indexed
  but
  there are *duplicate* values in the index similar to the column but the
  data
  is older. This works with cleaning the index but i want to update the
  index
  without cleaning it. Is there a way to just update the index with the
  updated column without having *duplicate* values. Appreciate for any
  feedback.
 
  Hando
 
  There are 2 responses:
 
  Short answer is no, there isn't a way. *Solr* doesn't have the concept
  of
  'Update' to an indexed document. You need to add the full document (all
  'columns') each time any one field changes. If doing

Re: updating existing data in index vs inserting new data in index

2011-07-07 Thread Mark juszczec
Bob

No, I don't.  Let me look into that and post my results.

Mark


On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford bob.sandif...@sirsidynix.com
 wrote:

 Hi, Mark.

 I haven't used DIH myself - so I'll need to leave comments on your set up
 to others who have done so.

 Another question - after your initial index create (and after each delta),
 do you run a 'commit'?  Do you run an 'optimize'?  (Without the optimize,
 'deleted' records still show up in query results...)

 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com


  -Original Message-
  From: Mark juszczec [mailto:mark.juszc...@gmail.com]
  Sent: Thursday, July 07, 2011 10:04 AM
  To: solr-user@lucene.apache.org
  Subject: Re: updating existing data in index vs inserting new data in
  index
 
  Bob
 
  Thanks very much for the reply!
 
  I am using a unique integer called order_id as the Solr index key.
 
  My query, deltaQuery and deltaImportQuery are below:
 
  entity name=item1
pk=ORDER_ID
query=select 1 as TABLE_ID , orders.order_id,
  orders.order_booked_ind,
  orders.order_dt, orders.cancel_dt, orders.account_manager_id,
  orders.of_header_id, orders.order_status_lov_id, orders.order_type_id,
  orders.approved_discount_pct, orders.campaign_nm,
  orders.approved_by_cd,orders.advertiser_id, orders.agency_id from
  orders
 
deltaImportQuery=select 1 as TABLE_ID, orders.order_id,
  orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
  orders.account_manager_id, orders.of_header_id,
  orders.order_status_lov_id,
  orders.order_type_id, orders.approved_discount_pct, orders.campaign_nm,
  orders.approved_by_cd,orders.advertiser_id, orders.agency_id from orders
  where orders.order_id = '${dataimporter.delta.ORDER_ID}'
 
deltaQuery=select orders.order_id from orders where orders.change_dt
  
  to_date('${dataimporter.last_index_time}','-MM-DD HH24:MI:SS') 
  /entity
 
  The test I am running is two part:
 
  1.  After I do a full import of the index, I insert a brand new record
  (with
  a never existed before order_id) in the database.  The delta import
  picks
  this up just fine.
 
  2.  After the full import, I modify a record with an order_id that
  already
  shows up in the index.  I have verified there is only one record with
  this
  order_id in both the index and the db before I do the delta update.
 
  I guess the question is, am I screwing myself up by defining my own Solr
  index key?  I want to, ultimately, be able to search on ORDER_ID in the
  Solr
  index.  However, the docs say (I think) a field does not have to be the
  Solr
  primary key in order to be searchable.  Would I be better off letting
  Solr
  manage the keys?
 
  Mark
 
  On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford
  bob.sandif...@sirsidynix.comwrote:
 
   What are you using as the unique id in your Solr index?  It sounds
  like you
   may have one value as your Solr index unique id, which bears no
  resemblance
   to a unique[1] id derived from your data...
  
   Or - another way to put it - what is it that makes these two records
  in
   your Solr index 'the same', and what are the unique id's for those two
   entries in the Solr index?  How are those id's related to your
  original
   data?
  
   [1] not only unique, but immutable.  I.E. if you update a row in your
   database, the unique id derived from that row has to be the same as it
  would
   have been before the update.  Otherwise, there's nothing for Solr to
   recognize as a duplicate entry, and do a 'delete' and 'insert' instead
  of
   just an 'insert'.
  
   Bob Sandiford | Lead Software Engineer | SirsiDynix
   P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
   www.sirsidynix.com
  
  
-Original Message-
From: Mark juszczec [mailto:mark.juszc...@gmail.com]
Sent: Thursday, July 07, 2011 9:15 AM
To: solr-user@lucene.apache.org
Subject: updating existing data in index vs inserting new data in
  index
   
Hello all
   
I'm using Solr 3.2 and am confused about updating existing data in
  an
index.
   
According to the DataImportHandler Wiki:
   
*delta-import* : For incremental imports and change detection run
  the
command `http://host:port/solr/dataimport?command=delta-import .
  It
supports the same clean, commit, optimize and debug parameters as
full-import command.
   
I know delta-import will find new data in the database and insert it
into
the index.  My problem is how it handles updates where I've got a
  record
that exists in the index and the database, the database record is
changed
and I want to incorporate those changes in the existing record in
  the
index.
 IOW I don't want to insert it again.
   
I've tried this and wound up with 2 records with the same key in the
index.
 The first contains the original db values found when the index was
created

Re: updating existing data in index vs inserting new data in index

2011-07-07 Thread Mark juszczec
Ok.  That's really good to know because optimization of that kind will be
important.

What of commit?  Does it somehow remove the previous version of an updated
record?

On Thu, Jul 7, 2011 at 10:49 AM, Michael Kuhlmann s...@kuli.org wrote:

 On 07.07.2011 16:14, Bob Sandiford wrote:
  [...] (Without the optimize, 'deleted' records still show up in query
 results...)

 No, that's not true. The terms remain in the index, but the document
 won't show up any more.

 Optimize is only for performance (and disk space) optimization, as the
 name suggests.

 -Kuli



Re: updating existing data in index vs inserting new data in index

2011-07-07 Thread Mark juszczec
Erick

I used to, but now I find I must have commented it out in a fit of rage ;-)

This could be the whole problem.

I have verified via the admin schema browser that the field is ORDER_ID and
will double-check that I refer to it in upper case in the appropriate places
in the Solr config and schema files.

Curiously, the admin schema browser display for ORDER_ID says hasDeletions:
false  - which seems the opposite of what I want.  I want to be able to
delete duplicates.  Or am I interpreting this field wrong?

In order to check for duplicates, I am going to use the admin browser to
enter the following in the Make A Query box:

TABLE_ID:1 AND ORDER_ID:674659

When I click search and view the results, 2 records are displayed.  One has
the original values, one has the changed values.  I haven't examined the xml
(via view source) too closely and the next time I run I will look for
something indicating one of the records is inactive.

When you say "change your schema", do you mean via a delta import, by
modifying the config files, or both?  FWIW, I am deleting the index on the
file system, doing a full import, modifying the data in the database and
then doing a delta import.

I am not restarting Solr at all in this process.

I understand Solr does not perform key management.  You described exactly
what I meant.  Sorry for any confusion.

Mark

On Thu, Jul 7, 2011 at 10:52 AM, Erick Erickson erickerick...@gmail.com wrote:

 Let me re-state a few things to see if I've got it right:

  your schema.xml file has an entry like <uniqueKey>order_id</uniqueKey>,
 right?

  given this definition, any document added with an order_id that already
 exists in the
   Solr index will be replaced. i.e. you should have one and only one
 document with a
   given order_id.

  case matters. Check via the admin page (schema browser) to see if you
 have
   two fields, order_id and ORDER_ID.

  How are you checking that your docs are duplicates? If you do a search on
   order_id, you should get back one and only one document (assuming the
   definition above). A document that's deleted will just be marked as
 deleted,
   the data won't be purged from the index. It won't show in search results,
 but
   it will show if you use lower-level ways to access the data.

  Whenever you change your schema, it's best to clean the index, restart
 the server and
re-index from scratch. Solr won't retroactively remove duplicate
 uniqueKey entries.

  On the stats admin/stats page you should see maxDocs and numDocs. The
 difference
   between these should be the number of deleted documents.

  Solr doesn't manage unique keys. All that happens is Solr will replace
 any
   pre-existing documents where *you've* defined the uniqueKey when a
   new doc is added...

 Hope this helps
 Erick

 On Thu, Jul 7, 2011 at 10:16 AM, Mark juszczec mark.juszc...@gmail.com
 wrote:
  Bob
 
  No, I don't.  Let me look into that and post my results.
 
  Mark
 
 
  On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford 
 bob.sandif...@sirsidynix.com
  wrote:
 
  Hi, Mark.
 
  I haven't used DIH myself - so I'll need to leave comments on your set
 up
  to others who have done so.
 
  Another question - after your initial index create (and after each
 delta),
  do you run a 'commit'?  Do you run an 'optimize'?  (Without the
 optimize,
  'deleted' records still show up in query results...)
 
  Bob Sandiford | Lead Software Engineer | SirsiDynix
  P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
  www.sirsidynix.com
 
 
   -Original Message-
   From: Mark juszczec [mailto:mark.juszc...@gmail.com]
   Sent: Thursday, July 07, 2011 10:04 AM
   To: solr-user@lucene.apache.org
   Subject: Re: updating existing data in index vs inserting new data in
   index
  
   Bob
  
   Thanks very much for the reply!
  
   I am using a unique integer called order_id as the Solr index key.
  
   My query, deltaQuery and deltaImportQuery are below:
  
   entity name=item1
 pk=ORDER_ID
 query=select 1 as TABLE_ID , orders.order_id,
   orders.order_booked_ind,
   orders.order_dt, orders.cancel_dt, orders.account_manager_id,
   orders.of_header_id, orders.order_status_lov_id, orders.order_type_id,
   orders.approved_discount_pct, orders.campaign_nm,
   orders.approved_by_cd,orders.advertiser_id, orders.agency_id from
   orders
  
 deltaImportQuery=select 1 as TABLE_ID, orders.order_id,
   orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
   orders.account_manager_id, orders.of_header_id,
   orders.order_status_lov_id,
   orders.order_type_id, orders.approved_discount_pct,
 orders.campaign_nm,
   orders.approved_by_cd,orders.advertiser_id, orders.agency_id from
 orders
   where orders.order_id = '${dataimporter.delta.ORDER_ID}'
  
 deltaQuery=select orders.order_id from orders where
 orders.change_dt
   
   to_date('${dataimporter.last_index_time}','-MM-DD HH24:MI:SS') 
   /entity
  
   The test I am running is two part:
  
   1.  After I do a full import of the index, I

Re: updating existing data in index vs inserting new data in index

2011-07-07 Thread Mark juszczec
First thanks for all the help.

I think the problem was a combination of not having a unique key defined AND
not including the commit=true parameter in the delta update.

Once I did those things, the delta import left me with a single (updated)
copy of the record including the changes in the source database.

Do I have write access to the Wiki so I can explicitly state commit=true
NEEDS to be specified?
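
For anyone hitting the same thing, the shape of the request with the commit
included is simply (host and core name are placeholders):

http://somehost:8983/solr/ordersIndex/dataimport?command=delta-import&commit=true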

Mark


On Thu, Jul 7, 2011 at 12:39 PM, Erick Erickson erickerick...@gmail.com wrote:

 I'd restart Solr after changing the schema.xml. The delta import does NOT
 require restart or anything else like that.

 The fact that two records are displayed is not what I'd expect. But Solr
 absolutely handles the replace via uniqueKey. So I suspect that you're
 not actually doing what you expect. A little-known aid for debugging DIH
 is solr/admin/dataimport.jsp, that might give you some joy.

 But, to summarize: this should work fine for DIH as far as Solr is
 concerned, assuming that uniqueKey is properly defined. In your query above
 that returns two documents, can you paste the entire response with fl=*
 attached?
 I'm guessing that the data in your index isn't what you're expecting...

 Also, you might want to get a copy of Luke and examine your index; there's
 a wealth of information there.


 Best
 Erick


 On Thu, Jul 7, 2011 at 11:12 AM, Mark juszczec mark.juszc...@gmail.com
 wrote:
  Erick
 
  I used to, but now I find I must have commented it out in a fit of rage
 ;-)
 
  This could be the whole problem.
 
  I have verified via admin schema browser that the field is ORDER_ID and
 will
  double check I refer to it in upper case in the appropriate places in the
  Solr config scheme.
 
  Curiously, the admin schema browser display for ORDER_ID says
 hasDeletions:
  false  - which seems the opposite of what I want.  I want to be able to
  delete duplicates.  Or am I interpreting this field wrong?
 
  In order to check for duplicates, I am going to using the admin browser
 to
  enter the following in the Make A Query box:
 
  TABLE_ID:1 AND ORDER_ID:674659
 
  When I click search and view the results, 2 records are displayed.  One
 has
  the original values, one has the changed values.  I haven't examined the
 xml
  (via view source) too closely and the next time I run I will look for
  something indicating one of the records is inactive.
 
  When you say change your schema do you mean via a delta import or by
  modifying the config files or both?  FWIW, I am deleting the index on the
  file system, doing a full import, modifying the data in the database and
  then doing a delta import.
 
  I am not restarting Solr at all in this process.
 
  I understand Solr does not perform key management.  You described exactly
  what I meant.  Sorry for any confusion.
 
  Mark
 
  On Thu, Jul 7, 2011 at 10:52 AM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Let me re-state a few things to see if I've got it right:
 
   your schema.xml file has an entry like
  <uniqueKey>order_id</uniqueKey>,
  right?
 
   given this definition, any document added with an order_id that
 already
  exists in the
Solr index will be replaced. i.e. you should have one and only one
  document with a
given order_id.
 
   case matters. Check via the admin page (schema browser) to see if
 you
  have
    two fields, order_id and ORDER_ID.
 
   How are you checking that your docs are duplicates? If you do a search
 on
order_id, you should get back one and only one document (assuming the
definition above). A document that's deleted will just be marked as
  deleted,
the data won't be purged from the index. It won't show in search
 results,
  but
it will show if you use lower-level ways to access the data.
 
   Whenever you change your schema, it's best to clean the index, restart
  the server and
 re-index from scratch. Solr won't retroactively remove duplicate
  uniqueKey entries.
 
   On the stats admin/stats page you should see maxDocs and numDocs. The
  difference
between these should be the number of deleted documents.
 
   Solr doesn't manage unique keys. All that happens is Solr will
 replace
  any
pre-existing documents where *you've* defined the uniqueKey when a
new doc is added...
 
  Hope this helps
  Erick
 
  On Thu, Jul 7, 2011 at 10:16 AM, Mark juszczec mark.juszc...@gmail.com
 
  wrote:
   Bob
  
   No, I don't.  Let me look into that and post my results.
  
   Mark
  
  
   On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford 
  bob.sandif...@sirsidynix.com
   wrote:
  
   Hi, Mark.
  
   I haven't used DIH myself - so I'll need to leave comments on your
 set
  up
   to others who have done so.
  
   Another question - after your initial index create (and after each
  delta),
   do you run a 'commit'?  Do you run an 'optimize'?  (Without the
  optimize,
   'deleted' records still show up in query results...)
  
   Bob Sandiford | Lead Software Engineer | SirsiDynix
   P: 800.288.8020 X6943 | bob.sandif

primary key made of multiple fields from multiple source tables

2011-07-05 Thread Mark juszczec
Hello all

I'm using Solr 3.2 and am trying to index a document whose primary key is
built from multiple columns selected from an Oracle DB.

I'm getting the following error:

java.lang.IllegalArgumentException: deltaQuery has no column to resolve to
declared primary key pk='ordersorderline_id'
at
org.apache.solr.handler.dataimport.DocBuilder.findMatchingPkColumn(DocBuilder.java:840)
~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30
23:09:08]
at
org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:891)
~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30
23:09:08]
at
org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:284)
~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30
23:09:08]
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:178)
~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30
23:09:08]
at
org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:374)
[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30
23:09:08]
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:413)
[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30
23:09:08]
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30
23:09:08]


The deltaQuery is:

select orders.order_id || orders.order_booked_ind ||
order_line.order_line_id as ordersorderline_id, orders.order_id,
orders.order_booked_ind, order_line.order_line_id, orders.order_dt,
orders.cancel_dt, orders.account_manager_id, orders.of_header_id,
orders.order_status_lov_id, orders.order_type_id,
orders.approved_discount_pct, orders.campaign_nm,
orders.approved_by_cd,orders.advertiser_id, orders.agency_id,
order_line.accounting_comments_desc
from orders, order_line
where order_line.order_id = orders.order_id and order_line.order_booked_ind
= orders.order_booked_ind

I've just seen that in the Solr Wiki Task List at
http://wiki.apache.org/solr/TaskList?highlight=%28composite%29 one of the
Big Ideas for The Future is:

support for *composite* keys ... either with some explicit change to the
uniqueKey declaration or perhaps just copyField with some hidden magic
that concats the resulting terms into a single key Term

Does this prohibit my creating the key with the select as above?
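
Based on how my other entity is set up, the pattern I'm going to try next
(untested sketch) is to have deltaQuery return only the changed composite
keys, aliased to the declared pk, and let deltaImportQuery fetch the full
rows via ${dataimporter.delta.ORDERSORDERLINE_ID}:

deltaQuery="select orders.order_id || orders.order_booked_ind || order_line.order_line_id as ORDERSORDERLINE_ID
  from orders, order_line
  where order_line.order_id = orders.order_id
    and order_line.order_booked_ind = orders.order_booked_ind
    and orders.change_dt > to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')"

(I also still need to check whether Oracle upper-casing that alias, versus
the lower-case pk='ordersorderline_id' in the config, is part of the
problem.)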

Mark