Moser:
You may not need to resort to workarounds. There are two solutions: one
using delta-import and one using full-import.

Solution 1: using delta-import

If you wish DIH to manage your deletes, there is also a deletedPkQuery.
The config may look like:
<entity name="posts"
        query="SELECT p.forumid, p.messageid, p.message
               FROM posts p, forums f
               WHERE f.forumid = p.forumid"
        deletedPkQuery="SELECT p.messageid FROM posts p, forums f
                        WHERE f.forumid = p.forumid
                        AND (p.deleted = true OR f.deleted = true)"/>
* I am assuming that p.messageid is the pk.

The deletedPkQuery is run at the beginning of the import, and the pks it
returns are used to delete the corresponding documents from the index.
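To trigger this, issue a delta-import request against the handler. A minimal
sketch, assuming DIH is registered at /dataimport in solrconfig.xml and Solr
is running on the default port:

http://localhost:8983/solr/dataimport?command=delta-import

The pks returned by deletedPkQuery are removed from the index during that
run; if you also configure a deltaQuery, the modified rows are re-indexed in
the same pass.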

Solution 2: using full-import. The config may look like the following; this
will do a clean full-import every time:
<entity name="posts"
        query="SELECT p.forumid, p.messageid,
               IF(p.deleted OR f.deleted, true, false) AS deleted,
               p.message
               FROM posts p, forums f
               WHERE f.forumid = p.forumid"/>

This adds the flag 'deleted' to each document.
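For the flag to be usable you also need it in the schema and a filter at
query time. A minimal sketch, assuming the stock boolean field type from the
example schema.xml and an illustrative query against the message field:

<!-- schema.xml: holds the value produced by IF(p.deleted OR f.deleted, true, false) -->
<field name="deleted" type="boolean" indexed="true" stored="false"/>

http://localhost:8983/solr/select?q=message:foo&fq=deleted:false

Because the SQL always emits an explicit true/false, filtering on
deleted:false is enough to hide deleted posts; no purely negative query is
needed.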

If you wish to do incremental indexing, run the command full-import with
clean=false. This ensures that the index is not cleaned prior to indexing.
The config may look like:

<entity name="posts"
        query="SELECT p.forumid, p.messageid, p.message
               FROM posts p, forums f
               WHERE f.forumid = p.forumid
               AND p.last_modified > '${dataimporter.last_index_time}'"
        deletedPkQuery="SELECT p.messageid FROM posts p, forums f
                        WHERE f.forumid = p.forumid
                        AND (p.deleted = true OR f.deleted = true)"/>

I am assuming that you are maintaining a last_modified timestamp in the
posts table.
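To run this incremental import, the request is just a full-import with
clean=false (again a sketch, assuming the handler lives at /dataimport):

http://localhost:8983/solr/dataimport?command=full-import&clean=false

Only rows with last_modified later than ${dataimporter.last_index_time} are
fetched, and the re-indexed documents overwrite the old ones by uniqueKey.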

Note: the full-import may not be as expensive as you think. We do a
full-import of 3 million docs in 20 minutes.

--Noble

On Tue, May 13, 2008 at 5:36 AM, Chris Moser (JIRA) <[EMAIL PROTECTED]> wrote:
>
>
>     [ 
> https://issues.apache.org/jira/browse/SOLR-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596237#action_12596237
>  ]
>
>  Chris Moser commented on SOLR-469:
>  ----------------------------------
>
>  Hi Shalin,
>
>  I'm indexing forums with Solr and have tables with a structure similar to 
> this:
>
>  {code}
>  posts
>  ------
>  forumid int
>  messageid int
>  deleted boolean
>  message text
>
>  forums
>  ------
>  forumid int
>  name text
>  deleted boolean
>
>  {code}
>
>  The simplified data query I'm running goes like this:
>
>  {code}
>  SELECT
>    p.forumid,
>    p.messageid,
>    IF (p.deleted OR f.deleted,true,false) as deleted,
>    p.message
>
>  FROM
>    posts p, forums f
>
>  WHERE
>    f.forumid = p.forumid
>  {code}
>
>  The query checks to see if the post or the forum is deleted, and marks it in 
> the index as deleted in either case (which is why I'm doing the join).  The 
> problem I'm running into is that the importer is running the WHERE clause 
> like this:
>
>  {code}
>  WHERE
>    f.forumid = p.forumid and forumid=123 and messageid=123456789
>  {code}
>
>  In this case, the _forumid=123_ part is ambiguous (forumid being in the 
> posts and the forums table) so this causes a SQL error.  So I added an 
> additional attribute to the entity definition (pkTable) which prepends the 
> _forumid=123_ with the pkTable value so it generates _pkTable.forumid=123_.
>
>  Not sure if this is the best way to do it but it fixed the problem :)
>
>  > Data Import RequestHandler
>  > --------------------------
>  >
>  >                 Key: SOLR-469
>  >                 URL: https://issues.apache.org/jira/browse/SOLR-469
>  >             Project: Solr
>  >          Issue Type: New Feature
>  >          Components: update
>  >    Affects Versions: 1.3
>  >            Reporter: Noble Paul
>  >            Assignee: Grant Ingersoll
>  >             Fix For: 1.3
>  >
>  >         Attachments: SOLR-469-contrib.patch, SOLR-469.patch, 
> SOLR-469.patch, SOLR-469.patch, SOLR-469.patch, SOLR-469.patch, 
> SOLR-469.patch, SOLR-469.patch, SOLR-469.patch
>  >
>  >
>  > We need a RequestHandler Which can import data from a DB or other 
> dataSources into the Solr index .Think of it as an advanced form of SqlUpload 
> Plugin (SOLR-103).
>  > The way it works is as follows.
>  >     * Provide a configuration file (xml) to the Handler which takes in the 
> necessary SQL queries and mappings to a solr schema
>  >           - It also takes in a properties file for the data source 
> configuraution
>  >     * Given the configuration it can also generate the solr schema.xml
>  >     * It is registered as a RequestHandler which can take two commands 
> do-full-import, do-delta-import
>  >           -  do-full-import - dumps all the data from the Database into 
> the index (based on the SQL query in configuration)
>  >           - do-delta-import - dumps all the data that has changed since 
> last import. (We assume a modified-timestamp column in tables)
>  >     * It provides a admin page
>  >           - where we can schedule it to be run automatically at regular 
> intervals
>  >           - It shows the status of the Handler (idle, full-import, 
> delta-import)
>
>  --
>  This message is automatically generated by JIRA.
>  -
>  You can reply to this email to add a comment to the issue online.
>
>



-- 
--Noble Paul
