Re: Index empty after restart.

2012-02-27 Thread zarni aung
Check the data directory to make sure the index files are present.  If so, you
just need to load the cores again.
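For example, a core whose index files are still on disk can be brought back through the CoreAdmin API's RELOAD action. A minimal sketch of building that request (the host, port, and core name here are assumptions, not taken from this thread):

```python
from urllib.parse import urlencode

def core_admin_url(base, action, core):
    """Build a CoreAdmin request URL, e.g. to RELOAD a core after
    restoring its data directory. Host and core name are assumptions."""
    return f"{base}/admin/cores?{urlencode({'action': action, 'core': core})}"

url = core_admin_url("http://localhost:8983/solr", "RELOAD", "core0")
```

Fetching that URL (e.g. with curl) would ask Solr to reload the core in place.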

On Mon, Feb 27, 2012 at 11:30 AM, Wouter de Boer <
wouter.de.b...@springest.nl> wrote:

> Hi,
>
> I run Solr on Jetty. After a restart of Jetty, the indices are empty. Does
> anyone have an idea what the reason could be?
>
> Regards,
> Wouter.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Index-empty-after-restart-tp3781237p3781237.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Query for exact part of sentence

2012-01-31 Thread zarni aung
Did you rebuild the index?  Reindexing is needed here, since the index-time
analyzer has been changed.
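For reference, an exact-phrase search needs the phrase quoted so the analyzed tokens are matched as a sequence rather than as independent terms. A minimal sketch of building the clause used in this thread (the helper function is hypothetical; the field names are taken from the messages below):

```python
def phrase_clause(field, phrase):
    # Quoting the phrase makes Solr match the exact token sequence
    # instead of matching each term independently.
    return f'{field}:"{phrase}"'

q = " || ".join(phrase_clause(f, "123 456")
                for f in ("smsc_content", "smsc_description"))
```

Without the quotes, `123 456` is two separate term queries, which is why both documents match.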

On Tue, Jan 31, 2012 at 9:53 AM, Arkadi Colson  wrote:

> The text field in the schema configuration looks like this. I changed
> catenateNumbers to 0 but it still doesn't work as expected.
>
> 
> 
> 
> 
> 
>
> ignoreCase="true"
>words="stopwords_en.txt"
>enablePositionIncrements="true"
>/>
> ignoreCase="true"
>words="stopwords_du.txt"
>enablePositionIncrements="true"
>/>
>  generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>  maxGramSize="15"/>
> 
> 
> 
>  ignoreCase="true" expand="true"/>
> ignoreCase="true"
>words="stopwords_en.txt"
>enablePositionIncrements="true"
>/>
> ignoreCase="true"
>words="stopwords_du.txt"
>enablePositionIncrements="true"
>/>
>  generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
> 
> 
>
>
>
> On 01/31/2012 03:03 PM, Erick Erickson wrote:
>
>> Unless you provide your schema configuration, there's
>> not much to go on here. Two things though:
>>
>> 1>  look at the admin/analysis page to see how your
>>  data is broken up into tokens.
>> 2>  at a guess you have WordDelimiterFilterFactory
>>  in your chain and perhaps catenateNumbers="1"
>>
>> Best
>> Erick
>>
>> On Mon, Jan 30, 2012 at 3:21 AM, Arkadi Colson
>>  wrote:
>>
>>> Hi
>>>
>>> I'm using the PECL PHP client to query Solr and was wondering how to query
>>> for an exact part of a sentence.
>>>
>>> There are 2 data items indexed in Solr:
>>> 1327497476: 123 456 789
>>> 1327497521: 1234 5678 9011
>>>
>>> However when running the query, both data items are returned as you can
>>> see
>>> below. Any idea why?
>>>
>>> Thanks!
>>>
>>> SolrObject Object
>>> (
>>>[responseHeader] =>SolrObject Object
>>>(
>>>[status] =>0
>>>[QTime] =>5016
>>>[params] =>SolrObject Object
>>>(
>>>[debugQuery] =>true
>>>[shards] =>
>>>  solr01:8983/solr,solr02:8983/solr,solr03:8983/solr
>>>[fl] =>
>>>  id,smsc_module,smsc_ssid,smsc_description,smsc_content,smsc_courseid,smsc_date_created,smsc_date_edited,score,metadata_stream_size,metadata_stream_source_info,metadata_stream_name,metadata_stream_content_type,last_modified,author,title,subject
>>>[sort] =>smsc_date_created asc
>>>[indent] =>on
>>>[start] =>0
>>>[q] =>(smsc_content:\"123 456\" ||
>>> smsc_description:\"123 456\")&&(smsc_module:Intradesk)&&
>>>  (smsc_date_created:[2011-12-25T10:29:51Z TO NOW])&&(smsc_ssid:38)
>>>[distrib] =>true
>>>[wt] =>xml
>>>[version] =>2.2
>>>[rows] =>55
>>>)
>>>
>>>)
>>>
>>>[response] =>SolrObject Object
>>>(
>>>[numFound] =>2
>>>[start] =>0
>>>[docs] =>Array
>>>(
>>>[0] =>SolrObject Object
>>>(
>>>[smsc_module] =>Intradesk
>>>[smsc_ssid] =>38
>>>[id] =>1327497476
>>>[smsc_courseid] =>0
>>>[smsc_date_created] =>2011-12-25T10:29:51Z
>>>[smsc_date_edited] =>2011-12-25T10:29:51Z
>>>[score] =>10.028017
>>>)
>>>
>>>[1] =>SolrObject Object
>>>(
>>>[smsc_module] =>Intradesk
>>>[smsc_ssid] =>38
>>>[id] =>1327497521
>>>[smsc_courseid] =>0
>>>[smsc_date_created] =>2011-12-25T10:29:51Z
>>>[smsc_date_edited] =>2011-12-25T10:29:51Z
>>>[score] =>5.541335
>>>)
>>>
>>>)
>>>
>>>)
>>>[debug] =>SolrObject Object
>>>(
>>>[rawquerystring] =>(smsc_content:\"123 456\" ||
>>> smsc_description:\"123 456\")&&(smsc_module:Intradesk)&&
>>>  (smsc_date_created:[2011-12-25T10:29:51Z TO NOW])&&(smsc_ssid:38)
>>>[querystring] =>(smsc_content:\"123 456\" ||
>>> smsc_description:\"123 456\")&&(smsc_module:Intradesk)&&
>>>  (smsc_

removing dynamic fields

2011-09-29 Thread zarni aung
Hi,

I've been experimenting with Solr dynamic fields.  Here is what I've
gathered based on my research.

For instance, I have a setup where I am catching undefined custom fields
this way.  I am using (trie) types by the way.








I am dealing with documents that may have a varying number of custom fields.
Instead of having to deal with field type changes in Solr, I decided to go
with dynamic fields.  But what I realized is that over a period of time I
could have int1, int2 and int3 fields that are later deleted in the
database, after which the Solr document is deleted and re-added without
values for int1, int2 and int3.  I used the schema browser to inspect the
fields int1, int2 and int3: there are no docs associated with them, but the
field definitions remain.  I've tried unloading and reloading the cores and
also restarting the server, but that doesn't remove the fields.  They are
only removed when I clear everything out of the index with a "*:*" delete
query.

What kind of penalty do I pay for having numerous unused fields, especially
trie fields?  I'm using trie types so that range queries perform well.
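For context, the range queries in question look like the sketch below; the helper is hypothetical, with the field name `int1` taken from the message above (trie types keep such queries cheap by indexing each value at several precisions):

```python
def range_clause(field, lo, hi, inclusive=True):
    # Square brackets give an inclusive range, curly braces exclusive;
    # trie-encoded fields make these efficient over many distinct values.
    left, right = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{left}{lo} TO {hi}{right}"
```

E.g. `range_clause("int1", 10, 20)` produces the q fragment for an inclusive range.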

Thanks,

Zarni


Re: Solr Implementations

2011-08-26 Thread zarni aung
Thank you so much for your response, Erick.

On Fri, Aug 26, 2011 at 8:30 AM, Erick Erickson wrote:

> See below
>
> On Thu, Aug 25, 2011 at 4:22 PM, zarni aung  wrote:
> > First, I would like to apologize if this is a repeat question, but I
> > can't seem to get the right answer anywhere.
> >
> >   - What happens to pending documents when the server dies abruptly?  I
> >   understand that when the server shuts down gracefully, it will commit
> the
> >   pending documents and close the IndexWriter.  For the case where the
> server
> >   just crashes,  I am assuming that the pending documents are lost but
> would
> >   it also corrupt the index files?  If so, when the server comes back
> online
> >   what is the state?  I would think that a full re-indexing is in order.
> >
> >
>
> This is generally not a problem; your pending updates are simply lost. A
> lot
> of work has gone into making sure that the indexes aren't corrupted in this
> situation. You can use the CheckIndex utility if you're worried.
>
> A brief outline here. Solr only writes new segments, it does NOT modify
> existing
> segments. There is a file that lets Solr know what the current valid
> segments are.
> During indexing (including merging, optimization, etc), only NEW segments
> are
> written and the file that tells Solr what's current is left alone
> during the new segment
> writes.
>
> The very last thing that's done is that the segments file (i.e. the file
> that tells Solr what's
> current) is updated, and it's very small. I suppose there's a
> vanishingly small chance
> that that file could be corrupted while being written, and it may even
> be that a temp
> file is written first and the files then renamed (but I don't know that for
> sure)...
>
> So, the point of this long digression is that if your server gets
> killed, upon restart it
> should see a consistent picture of the index as of the last completed
> commit, any
> interim docs will be lost.
>
> >   - What are the dangers of having n-number of ReadOnly Solr instances
> >   pointing to the same data directory?  (Shared by a SAN)?  Will there be
> >   issues with locking?  This is a scenario with replication.  The
> Read-Only
> >   instances are pointing to the same data directory on a SAN.
> >
>
> This is not a problem. You should have only one *writer*
> pointing to the index, but readers are OK. Applying the discussion above to
> readers, note that the segments available to any reader are never changed.
> So
> having N Solr instances reading from these unchanging files is no problem.
>
> That said, this will be slower than using Solr's replication (which is
> preferred) for
> two reasons.
> 1> any networked filesystem will have some inherent speed issues.
> 2> all these read requests will have to be queued somehow.
>
> But if your performance is acceptable with this setup it'll work.
>
>
> Best
> Erick
>
> > Thank you very much.
> >
> > Z
> >
>
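The write-new-segments-then-flip-the-pointer pattern Erick describes can be sketched as a toy model (a simplification for illustration, not Lucene's actual implementation; the file names are assumptions):

```python
import os
import tempfile

def publish_segments(index_dir, segment_names):
    """Toy model of the commit pattern described above: segment data is
    assumed already written, and the small 'segments' pointer file is
    swapped in last via an atomic rename, so a crash mid-commit leaves
    the previous commit point intact."""
    fd, tmp = tempfile.mkstemp(dir=index_dir)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(segment_names))
    # Atomic on POSIX: readers see either the old or the new pointer,
    # never a partially written one.
    os.replace(tmp, os.path.join(index_dir, "segments"))
```

The key property is that the only mutation of shared state is one atomic rename of a tiny file.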


Solr Implementations

2011-08-25 Thread zarni aung
First, I would like to apologize if this is a repeat question, but I can't
seem to get the right answer anywhere.

   - What happens to pending documents when the server dies abruptly?  I
   understand that when the server shuts down gracefully, it will commit the
   pending documents and close the IndexWriter.  For the case where the server
   just crashes,  I am assuming that the pending documents are lost but would
   it also corrupt the index files?  If so, when the server comes back online
   what is the state?  I would think that a full re-indexing is in order.


   - What are the dangers of having n-number of ReadOnly Solr instances
   pointing to the same data directory?  (Shared by a SAN)?  Will there be
   issues with locking?  This is a scenario with replication.  The Read-Only
   instances are pointing to the same data directory on a SAN.

Thank you very much.

Z


Re: Core Administration

2011-06-30 Thread zarni aung
Thank you very much Stefan.  This helps.

Zarni

On Thu, Jun 30, 2011 at 4:10 PM, Stefan Matheis <
matheis.ste...@googlemail.com> wrote:

> Zarni,
>
> Am 30.06.2011 20:32, schrieb zarni aung:
>
>  But I need to know if Solr already handles that case.  I wouldn't want to
>> have to write the tool if Solr already supports creating cores with new
>> configs on the fly.
>>
>
> There isn't. You have to create the directory structure and the related
> files yourself; Solr (the CoreAdminHandler) only "activates" the core for
> usage.
>
> A few weeks ago, there was a question about modifying configuration files
> from the browser:
> http://search.lucidimagination.com/search/document/ec79172e7613d1a/modifying_configuration_from_a_browser
>
> Regards
> Stefan
>


Re: Core Administration

2011-06-30 Thread zarni aung
I have an idea.  I believe I can discover the properties of an object (C#
reflection) and then code-gen a schema.xml file based on the field types and
other metadata of that type (possibly from a database).  After that, I
should be able to FTP the files over to the Solr machine and then invoke the
core admin to create the new index on the fly.  My original question stands:
is there a tool that already does what I'm describing?

Z

On Thu, Jun 30, 2011 at 2:32 PM, zarni aung  wrote:

> Hi,
>
> I am researching about core administration using Solr.  My requirement is
> to be able to provision/create/delete indexes dynamically.  I have tried it
> and it works.  Apparently core admin handler will create a new core by
> specifying the instance Directory (required), along with data directory, and
> so on.  The issue I'm having is that a separate app that lives on a
> different machine needs to create these new cores on demand, along with
> creating new schema.xml files and data directories.  The required instance
> directory, data directory and others need to be separate for each core.
>
> My first approach is to write a tool that would take additional params and
> code-gen the schema config files and so on, based on different types of
> documents, e.g. Homes, People, etc.
>
> But I need to know if Solr already handles that case.  I wouldn't want to
> have to write the tool if Solr already supports creating cores with new
> configs on the fly.
>
> Thanks,
>
> Z
>


Core Administration

2011-06-30 Thread zarni aung
Hi,

I am researching about core administration using Solr.  My requirement is to
be able to provision/create/delete indexes dynamically.  I have tried it and
it works.  Apparently core admin handler will create a new core by
specifying the instance Directory (required), along with data directory, and
so on.  The issue I'm having is that a separate app that lives on a
different machine needs to create these new cores on demand, along with
creating new schema.xml files and data directories.  The required instance
directory, data directory and others need to be separate for each core.

My first approach is to write a tool that would take additional params and
code-gen the schema config files and so on, based on different types of
documents, e.g. Homes, People, etc.

But I need to know if Solr already handles that case.  I wouldn't want to
have to write the tool if Solr already supports creating cores with new
configs on the fly.
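The code-gen step described above could be sketched roughly like this (a hypothetical generator over (name, type) metadata pairs; a real schema.xml also needs fieldType definitions, a uniqueKey, and more):

```python
import xml.etree.ElementTree as ET

def gen_schema(name, fields):
    """Hypothetical generator for a minimal schema.xml from (name, type)
    pairs, along the lines of the tool described above. Field names and
    types would come from reflection or database metadata."""
    schema = ET.Element("schema", name=name, version="1.4")
    fs = ET.SubElement(schema, "fields")
    for fname, ftype in fields:
        ET.SubElement(fs, "field", name=fname, type=ftype,
                      indexed="true", stored="true")
    return ET.tostring(schema, encoding="unicode")

xml_out = gen_schema("homes", [("id", "string"), ("price", "tint")])
```

The generated file would then be copied into the new core's conf directory before calling the CoreAdmin CREATE action.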

Thanks,

Z


Field Value Highlighting

2011-06-29 Thread zarni aung
Hi,

I need help in figuring out the right configuration to perform highlighting
in Solr.  I can retrieve the matching documents plus the highlighted
matches.

I've used another tool, dtSearch, which returns the offset positions of the
field value to highlight.  I've tried a few different configurations, but it
appears that Solr returns the actual matched documents plus a section called
"highlighting" with snippets (which can be configured to have a length of
X).  I was wondering if there is a way to retrieve just the actual documents
with highlighted field values, or a way to retrieve the offset positions of
the matches so that I can perform the highlighting myself.

I am using the SolrNet client to integrate with Solr.  I've also tweaked the
configs and used the web admin interface to test highlighting, but have not
yet been successful.

Thank you in advance.

Z


Re: Document Scoring

2011-06-17 Thread zarni aung
Thank you, I will give that a shot.

Zarni


Re: Document Scoring

2011-06-17 Thread zarni aung
Thank you, this is something that I wanted to hear.  I knew the design was
most likely flawed, because I have never done Solr or any other kind of
full-text searching, but I needed an unbiased opinion.  I think that if I
tune the configs and pay close attention to the logs, with lots of
performance testing, I might be able to get close to near real time (1-5
minutes).  I've been reading this mailing list, HathiTrust, Lucid
Imagination and other sites for insights.

Again Thank you.

Zarni

On Thu, Jun 16, 2011 at 9:49 PM, Erick Erickson wrote:

> I really wouldn't go there, it sounds like there are endless
> opportunities for errors!
>
> How "real-time" is "real-time"? Could you fix this entirely
> by
> 1> adjusting expectations for, say, 5 minutes.
> 2> adjusting your commit (on the master) and poll (on the slave)
> appropriately?
>
> Best
> Erick
>
> On Thu, Jun 16, 2011 at 11:41 AM, zarni aung  wrote:
> > Hi,
> >
> > I am designing my indexes to have 1 write-only master core, 2 read-only
> > slave cores.  That means the read-only cores will only have snapshots
> pulled
> > from the master and will not have near real time changes.  I was thinking
> > about adding a hybrid read and write master core that will have the most
> > recent changes from my primary data source.  I am thinking to query the
> > hybrid master and the read-only slaves and somehow try to intersect the
> > results in order to support near real time full text search.  Is this
> > feasible?
> >
> > Thank you,
> >
> > Zarni
> >
>


Document Scoring

2011-06-16 Thread zarni aung
Hi,

I am designing my indexes to have 1 write-only master core, 2 read-only
slave cores.  That means the read-only cores will only have snapshots pulled
from the master and will not have near real time changes.  I was thinking
about adding a hybrid read and write master core that will have the most
recent changes from my primary data source.  I am thinking to query the
hybrid master and the read-only slaves and somehow try to intersect the
results in order to support near real time full text search.  Is this
feasible?
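The combine-and-intersect step this plan would need could look roughly like the naive sketch below (the function and field names are hypothetical; note that scores from separately built cores are generally not comparable, which is a real weakness of this approach):

```python
def merge_results(delta_docs, slave_docs, rows=25):
    """Naive sketch of combining hits from a small R/W "delta" core with
    the read-only slaves: the delta copy wins on duplicate ids, and the
    union is re-sorted by score. Cross-core scores are not truly
    comparable, so this ranking is only approximate."""
    by_id = {d["id"]: d for d in slave_docs}
    by_id.update({d["id"]: d for d in delta_docs})  # delta overrides stale copies
    ranked = sorted(by_id.values(), key=lambda d: d["score"], reverse=True)
    return ranked[:rows]
```

Each query would hit both tiers, merge like this, and return the top rows; the score-comparability problem is one reason distributed search or true NRT support is usually preferred.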

Thank you,

Zarni


Search with Dynamic indexing

2011-06-14 Thread zarni aung
Hi,

I have a requirement to make a large amount of data (> 5 million documents)
searchable.  The problem is that more than half of them have highly volatile
field values.  I will also have a data store specifically for metadata.
Committing frequently isn't a solution.  What I'm basically trying to
achieve is NRT.  I've read many postings and articles and even considered
sharing a single index between one write-only Solr instance and 1-n
read-only Solr instances.  Apparently this will not work, since committing
and reopening the searcher is the only way new documents become searchable.
I've also considered one write-only master instance with 1-n read-only Solr
slaves, but that would mean there is lag between snapshots of the master.
Another solution I was considering is a smaller R/W "dynamic" master Solr
instance that stores only deltas, while I still have a write-only master
with a set of read-only slaves.  That would mean adding some logic to
combine and intersect the results from the dynamic instance and the R/O
slaves.  In this scenario, I wonder what would happen if I searched for the
top 25 documents containing "x"?  What would happen to scoring and other
factors?  Would sharding be better in this situation?

One more question: I have not seen many people discuss Solr-RA's NRT
support.  Is anyone familiar with it?  There's not much mention of it except
at http://solr-ra.tgels.com.

Thanks,

Zarni


Available Solr Indexing strategies

2011-06-07 Thread zarni aung
Hi,

I am very new to Solr, and my client is trying to add full-text search
capabilities to their product by using Solr.  They will also have a master
storage system as the authoritative data store, which will also provide
metadata searches.  Can you please point me in the right direction to some
indexing strategies that people are using, for further research?

Thank you,

Zarni