Re: Is there a way to create multiple doc using DIH and access the data pertaining to a particular doc name ?

2010-12-18 Thread Koji Sekiguchi

(10/11/11 1:57), bbarani wrote:


Hi,

I have a peculiar situation where we are trying to use SOLR for indexing
multiple tables (there is no relation between these tables). We are trying
to use the SOLR index instead of the source tables, and hence we are
trying to create the SOLR index to mirror those source tables.

There are 3 tables which need to be indexed:

Table 1, table 2 and table 3.

I am trying to index each table in a separate doc tag with a different doc tag
name, and the tables share some common field names. For example:

<document name="DataStoreElement">
  <entity name="DataStoreElement" query="...">
    <field column="DATA_STOR" name="DATA_STO"/>
  </entity>
</document>
<document name="DataStore">
  <entity name="DataStore" query="...">
    <field column="DATA_STOR" name="DATA_STO"/>
  </entity>
</document>


Barani,

You cannot have multiple documents in a data-config, but you can
have multiple entities in a document. And if your tables 1, 2, and 3
come from different dataSources, you can have multiple data sources
in a data-config. If so, you should use the dataSource attribute of the entity
element to refer to the name of the dataSource:

<dataConfig>
  <dataSource name="ds1" .../>
  <dataSource name="ds2" .../>
  <dataSource name="ds3" .../>
  <document>
    <entity name="t1" dataSource="ds1" query="SELECT * from t1 ..." .../>
    <entity name="t2" dataSource="ds2" query="SELECT * from t2 ..." .../>
    <entity name="t3" dataSource="ds3" query="SELECT * from t3 ..." .../>
  </document>
</dataConfig>

Koji
--
http://www.rondhuit.com/en/


Re: Is there a way to create multiple doc using DIH and access the data pertaining to a particular doc name ?

2010-12-18 Thread Dennis Gearon
Just curious, do these tables have the same schema, like a set of shards would? 

If not, how do you map them to the index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.






RE: Memory use during merges (OOM)

2010-12-18 Thread Burton-West, Tom
Thanks Robert,

We will try the terms index interval as a workaround. I have also opened a JIRA
issue: https://issues.apache.org/jira/browse/SOLR-2290.
Hope I found the right sections of the Lucene code. I'm just now in the
process of looking at the Solr IndexReaderFactory, SolrIndexWriter, and
SolrIndexConfig, trying to better understand how solrconfig.xml gets
instantiated and how it affects the readers and writers.

Tom

From: Robert Muir [rcm...@gmail.com]

On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom tburt...@umich.edu wrote:
Your setting isn't being applied to the reader IW uses during
merging... it's only for readers Solr opens from directories
explicitly.
I think you should open a jira issue!

 Do I understand correctly that this setting in theory could be applied to the 
 reader IW uses during merging but is not currently being applied?

yes, i'm not really sure (especially given the name= attribute) if you can have, or
it was planned to have, multiple IR factories in solr, e.g. a separate
one for spellchecking.
so i'm not sure if we should (hackishly) steal this parameter from the
IR factory (it is common to all IRFactories, not just
StandardIRFactory) and apply it to the IW..

but we could at least expose the divisor param separately to the IW
config so you have some way of setting it.


 <indexReaderFactory name="IndexReaderFactory"
     class="org.apache.solr.core.StandardIndexReaderFactory">
   <int name="termInfosIndexDivisor">8</int>
 </indexReaderFactory>

 I understand the trade-offs for doing this during searching, but not the
 trade-offs for doing this during merging. Is the use during merging
 similar to the use during searching?

 i.e. some process has to look up data for a particular term as opposed to
 having to iterate through all the terms?
 (Haven't yet dug into the merging/indexing code.)

it needs it for applying deletes...

as a workaround (if you are reindexing), maybe instead of using the
Terms Index Divisor = 8 you could set the Terms Index Interval = 1024 (8
* 128)?

this will solve your merging problem, and have the same perf
characteristics as divisor=8, except you can't go back down like you
can with the divisor without reindexing with a smaller interval...

if you've already tested that performance with the divisor of 8 is
acceptable, or in your case maybe necessary, it sort of makes sense
to 'bake it in' by setting your divisor back to 1 and your interval =
1024 instead...
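
A minimal solrconfig.xml sketch of that interval-based workaround, assuming the
stock indexDefaults section; the element name is from memory and the 1024 value
is just 8 * 128 from the discussion above, so verify against your own config
before relying on it:

<indexDefaults>
  ...
  <termIndexInterval>1024</termIndexInterval>
</indexDefaults>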


Re: how to config DataImport Scheduling

2010-12-18 Thread Hamid Vahedi
I think it should work with any version of Solr, because it works URL-based (see the
config file).

Note this point: Successfully tested on Apache Tomcat v6 (should work on
any other servlet container).

 


From: Ahmet Arslan iori...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Fri, December 17, 2010 3:22:37 AM
Subject: Re: how to config DataImport Scheduling

 I also have the same problem. I configured the
 dataimport.properties file as shown in
 http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example
 but no change occurred. Can anyone help me?

What version of Solr are you using? This seems to be a new feature, so it won't work
on Solr 1.4.1.


  

Re: Is there a way to create multiple doc using DIH and access the data pertaining to a particular doc name ?

2010-12-18 Thread Lance Norskog
You can have multiple documents generated by the same data-config:

<dataConfig>
  <dataSource name="ds1" .../>
  <dataSource name="ds2" .../>
  <dataSource name="ds3" .../>
  <document>
    <entity blah blah rootEntity="false">
      <entity blah blah this is a document>
        <entity sets unique id/>
      </entity>
      <entity blah blah this is another document>
        <entity sets unique id/>
      </entity>
    </entity>
  </document>
</dataConfig>

It's the rootEntity="false" that makes the child entities into documents.

On Sat, Dec 18, 2010 at 7:43 AM, Dennis Gearon gear...@sbcglobal.net wrote:
 Just curious, do these tables have the same schema, like a set of shards 
 would?

 If not, how do you map them to the index?






-- 
Lance Norskog
goks...@gmail.com


old index files not deleted on slave

2010-12-18 Thread feedly team
I have set up index replication (triggered on optimize). The problem I
am having is the old index files are not being deleted on the slave.
After each replication, I can see the old files still hanging around
as well as the files that have just been pulled. This causes the data
directory size to increase by the index size every replication until
the disk fills up.

Checking the logs, I see the following error:

SEVERE: SnapPull failed
org.apache.solr.common.SolrException: Index fetch failed :
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock
obtain timed out:
NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1065)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:954)
at 
org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
at 
org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
at 
org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
at 
org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
... 11 more

lsof reveals that the file is still opened from the java process.

I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup
is pretty vanilla. The OS is linux, the indexes are on local
directories, write permissions look ok, nothing unusual in the config
(default deletion policy, etc.). Contents of the index data dir:

master:
-rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
-rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
-rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
-rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
-rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
-rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
-rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01:12 _24e.fdt
-rw-rw-r-- 1 feeddo feeddo 483M Dec 18 01:23 _24e.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 18 01:23 _24e.tii
-rw-rw-r-- 1 feeddo feeddo 146M Dec 18 01:23 _24e.prx
-rw-rw-r-- 1 feeddo feeddo 283M Dec 18 01:23 _24e.frq
-rw-rw-r-- 1 feeddo feeddo  311 Dec 18 01:24 segments_1xz
-rw-rw-r-- 1 feeddo feeddo  23M Dec 18 01:24 _24e.nrm
-rw-rw-r-- 1 feeddo feeddo  191 Dec 18 13:15 _25z.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 18 13:16 _25z.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 13:16 _25z.fdt
-rw-rw-r-- 1 feeddo feeddo 484M Dec 18 13:35 _25z.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 18 13:35 _25z.tii
-rw-rw-r-- 1 feeddo feeddo 146M Dec 18 13:35 _25z.prx
-rw-rw-r-- 1 feeddo feeddo 284M Dec 18 13:35 _25z.frq
-rw-rw-r-- 1 feeddo feeddo   20 Dec 18 13:35 segments.gen
-rw-rw-r-- 1 feeddo feeddo  311 Dec 18 13:35 segments_1y1
-rw-rw-r-- 1 feeddo feeddo  23M Dec 18 13:35 _25z.nrm

slave:
-rw-rw-r-- 1 feeddo feeddo   20 Dec 13 17:54 segments.gen
-rw-rw-r-- 1 feeddo feeddo  191 Dec 15 01:07 _1mk.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 15 01:08 _1mk.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 15 01:08 _1mk.fdt
-rw-rw-r-- 1 feeddo feeddo 476M Dec 15 01:18 _1mk.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 15 01:18 _1mk.tii
-rw-rw-r-- 1 feeddo feeddo 144M Dec 15 01:18 _1mk.prx
-rw-rw-r-- 1 feeddo feeddo 278M Dec 15 01:18 _1mk.frq
-rw-rw-r-- 1 feeddo feeddo  312 Dec 15 01:18 segments_1kj
-rw-rw-r-- 1 feeddo feeddo  23M Dec 15 01:18 _1mk.nrm
-rw-rw-r-- 1 

Re: Is there a way to create multiple doc using DIH and access the data pertaining to a particular doc name ?

2010-12-18 Thread Lance Norskog
And, a use case: Tika blows up on some files. But we still want other
data like file name etc. and an empty text field. So:

<entity rootEntity="false">
  <field set unique id and file name etc./>
  <entity blah blah is a document
          use Tika Empty Parser>
    <field failed="true"/>
  </entity>
  <entity blah blah is a document
          use Tika Auto Parser
          onError="skip">
    <field failed="false"/>
  </entity>
</entity>

Both documents have the same unique id. If the Tika auto parser handles
PDF and the PDF works, the second document overwrites the first. If
the PDF blows up, the second document is skipped and the first document
goes in.
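
In case it helps, here is a rough, untested sketch of how that layout might
look with FileListEntityProcessor and TikaEntityProcessor. The attribute and
field names are partly from memory and partly assumptions (the baseDir, the
"failed" field, and the use of TemplateTransformer to set it are mine), so
check them against the DIH wiki before relying on this:

<dataSource type="BinFileDataSource" name="bin"/>
...
<entity name="files" processor="FileListEntityProcessor"
        baseDir="/path/to/docs" fileName=".*" rootEntity="false">
  <!-- first pass: Tika's EmptyParser, so every file yields a doc with metadata and an empty text field -->
  <entity name="empty" processor="TikaEntityProcessor" dataSource="bin"
          url="${files.fileAbsolutePath}" format="text"
          parser="org.apache.tika.parser.EmptyParser"
          transformer="TemplateTransformer">
    <field column="failed" template="true"/>
  </entity>
  <!-- second pass: the auto-detecting parser; if it blows up, onError=skip drops this doc and the first one stays -->
  <entity name="full" processor="TikaEntityProcessor" dataSource="bin"
          url="${files.fileAbsolutePath}" format="text" onError="skip"
          transformer="TemplateTransformer">
    <field column="failed" template="false"/>
  </entity>
</entity>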

Ugly, yes, but a testament to the maturity of DIH that it had enough
tools to work around a Tika weakness. Oh, and the AutoParser does not
work: SOLR-2116:
https://issues.apache.org/jira/browse/SOLR-2116

In my previous example, the innermost entities below should be field
not entity. Sorry for any confusion.

On Sat, Dec 18, 2010 at 4:22 PM, Lance Norskog goks...@gmail.com wrote:
 You can have multiple documents generated by the same data-config:

 <dataConfig>
   <dataSource name="ds1" .../>
   <dataSource name="ds2" .../>
   <dataSource name="ds3" .../>
   <document>
     <entity blah blah rootEntity="false">
       <entity blah blah this is a document>
         <entity sets unique id/>
       </entity>
       <entity blah blah this is another document>
         <entity sets unique id/>
       </entity>
     </entity>
   </document>
 </dataConfig>

 It's the rootEntity="false" that makes the child entities into documents.





-- 
Lance Norskog
goks...@gmail.com


Re: old index files not deleted on slave

2010-12-18 Thread Lance Norskog
This could be a quirk of the native locking feature. What's the file
system? Can you fsck it?

If this error keeps happening, please file an issue; it should not happen.
Include the text above and also your solrconfigs if you can.

One thing you could try is to change from the native locking policy to
the simple locking policy - but only on the slave.
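
For reference, that lock policy is set in solrconfig.xml on the slave; a
minimal sketch, assuming the stock mainIndex section (verify the element's
placement against your own config):

<mainIndex>
  ...
  <lockType>simple</lockType>
</mainIndex>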

On Sat, Dec 18, 2010 at 4:44 PM, feedly team feedly...@gmail.com wrote:
 I have set up index replication (triggered on optimize). The problem I
 am having is the old index files are not being deleted on the slave.
 After each replication, I can see the old files still hanging around
 as well as the files that have just been pulled. This causes the data
 directory size to increase by the index size every replication until
 the disk fills up.

 Checking the logs, I see the following error:

 SEVERE: SnapPull failed
 org.apache.solr.common.SolrException: Index fetch failed :
        at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
        at 
 org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
 Caused by: org.apache.lucene.store.LockObtainFailedException: Lock
 obtain timed out:
 NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1065)
         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:954)
         at 
 org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
        at 
 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
        at 
 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        at 
 org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
        at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
        ... 11 more

 lsof reveals that the file is still opened from the java process.

 I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup
 is pretty vanilla. The OS is linux, the indexes are on local
 directories, write permissions look ok, nothing unusual in the config
 (default deletion policy, etc.). Contents of the index data dir:

 master:
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
 -rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
 -rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
 -rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
 -rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
 -rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01:12 _24e.fdt
 -rw-rw-r-- 1 feeddo feeddo 483M Dec 18 01:23 _24e.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 18 01:23 _24e.tii
 -rw-rw-r-- 1 feeddo feeddo 146M Dec 18 01:23 _24e.prx
 -rw-rw-r-- 1 feeddo feeddo 283M Dec 18 01:23 _24e.frq
 -rw-rw-r-- 1 feeddo feeddo  311 Dec 18 01:24 segments_1xz
 -rw-rw-r-- 1 feeddo feeddo  23M Dec 18 01:24 _24e.nrm
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 18 13:15 _25z.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 18 13:16 _25z.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 13:16 _25z.fdt
 -rw-rw-r-- 1 feeddo feeddo 484M Dec 18 13:35 _25z.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 18 13:35 _25z.tii
 -rw-rw-r-- 1 feeddo feeddo 146M Dec 18 13:35 _25z.prx
 -rw-rw-r-- 1 feeddo feeddo 284M Dec 18 13:35 _25z.frq
 -rw-rw-r-- 1 feeddo feeddo   20 Dec 18 13:35 segments.gen
 -rw-rw-r-- 1 feeddo feeddo  311 Dec 18 13:35 segments_1y1
 -rw-rw-r-- 1 feeddo feeddo  23M Dec 18 13:35 _25z.nrm

 slave:
 -rw-rw-r-- 1 feeddo feeddo   20 Dec 13 17:54 segments.gen
 -rw-rw-r-- 1 feeddo 

DIH for sharded database?

2010-12-18 Thread Andy
I have a table that is broken up into many virtual shards. So basically I have 
N identical tables:

Document1
Document2
.
.
Document36

Currently these tables all live in the same database, but in the future they 
may be moved to different servers to scale out if the need arises.

Is there any way to configure a DIH for these tables so that it will 
automatically loop through the 36 identical tables and pull data out for 
indexing?

Something like (pseudo code):

for (i = 1; i <= 36; i++) {
    ## retrieve data from the table Document{$i} and index the data
}

What's the best way to handle a situation like this?

Thanks


  


Re: DIH for sharded database?

2010-12-18 Thread Lance Norskog
You can have a file with 1, 2, 3, ... on separate lines. There is a
line-by-line file reader that can pull these in as separate drivers.
Inside that entity the JDBC url has to be altered with the incoming
numbers. I don't know if this will work.

It may also work for single-threaded DIH but not with multiple
threads. (Ignore this for Solr 1.4; it has no threads feature.)

On Sat, Dec 18, 2010 at 6:20 PM, Andy angelf...@yahoo.com wrote:
 I have a table that is broken up into many virtual shards. So basically I 
 have N identical tables:

 Document1
 Document2
 .
 .
 Document36

 Currently these tables all live in the same database, but in the future they 
 may be moved to different servers to scale out if the need arises.

 Is there any way to configure a DIH for these tables so that it will 
 automatically loop through the 36 identical tables and pull data out for 
 indexing?

 Something like (pseudo code):

 for (i = 1; i <= 36; i++) {
    ## retrieve data from the table Document{$i} and index the data
 }

 What's the best way to handle a situation like this?

 Thanks







-- 
Lance Norskog
goks...@gmail.com


Re: DIH for sharded database?

2010-12-18 Thread Andy

--- On Sat, 12/18/10, Lance Norskog goks...@gmail.com wrote:

 You can have a file with 1, 2, 3, ... on separate lines. There is a
 line-by-line file reader that can pull these in as separate drivers.
 Inside that entity the JDBC url has to be altered with the incoming
 numbers. I don't know if this will work.

I'm not sure I understand.

How will altering the JDBC url change the name of the table it is importing 
data from?

Wouldn't I need to change the  actual SQL query itself?

select * from Document1
select * from Document2
...
select * from Document36
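
For what it's worth, a rough, untested sketch of one way the query itself could
be varied per table inside DIH, using a non-root parent entity that emits the
shard numbers and a child entity that builds the table name from them. All of
the names, the dataSource settings, and the reliance on DIH's ${...} variable
substitution inside the child query are my assumptions, not something confirmed
in this thread:

<dataConfig>
  <dataSource name="db" driver="..." url="..." user="..." password="..."/>
  <document>
    <!-- parent entity: one row per shard number, does not produce documents itself -->
    <entity name="shard" dataSource="db" rootEntity="false"
            query="SELECT 1 AS n UNION SELECT 2 UNION SELECT 3">
      <!-- child entity: one pass per shard, table name built from ${shard.n}; extend the UNION to 36 -->
      <entity name="doc" dataSource="db"
              query="SELECT * FROM Document${shard.n}">
        <field column="ID" name="id"/>
      </entity>
    </entity>
  </document>
</dataConfig>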