Re: documentCache not used in 4.3.1?

2013-07-01 Thread Daniel Collins
We see similar results, again we softCommit every 1s (trying to get as NRT
as we can), and we very rarely get any hits in our caches.  As an
unscheduled test last week, we did shut down indexing and noticed about 80%
hit rate in caches (and average query time dropped from ~1s to 100ms!) so I
think we are in the same position as you.

I appreciate that with such a frequent soft commit the caches get
invalidated, but I was expecting cache warming to help, though it doesn't
appear to.  We *don't* currently run a warming query; my impression of
NRT was that it was better not to do that, as otherwise you spend more time
warming the searcher and caches, and by the time you've done all that, the
searcher is invalidated anyway!
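
For reference, the kind of warming query we're choosing not to run would be
registered as a newSearcher listener in solrconfig.xml, roughly like this (the
query itself is just an illustration, not something we actually use):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- illustrative warming query; with a 1s soft commit the warmed searcher
         is usually replaced before the work pays off, as noted above -->
    <lst><str name="q">*:*</str><str name="rows">10</str></lst>
  </arr>
</listener>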


On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com wrote:

 That's a good idea, I'll try that next week.

 Thanks!

 Tim


 On 29/06/13 12:39 PM, Erick Erickson wrote:

 Tim:

 Yeah, this doesn't make much sense to me either since,
 as you say, you should be seeing some metrics upon
 occasion. But do note that the underlying cache only gets
 filled when getting documents to return in query results,
 since there's no autowarming going on it may come and
 go.

 But you can test this pretty quickly by lengthening your
 autocommit interval or just not indexing anything
 for a while, then run a bunch of queries and look at your
 cache stats. That'll at least tell you whether it works at all.
 You'll have to have hard commits turned off (or openSearcher
 set to 'false') for that check too.

 Best
 Erick


 On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote:

  Yes, we are softCommit'ing every 1000ms, but that should be enough time
 to
 see metrics though, right? For example, I still get non-cumulative
 metrics
 from the other caches (which are also throwaway). I've also curl/sampled
 enough that I probably should have seen a value by now.

 If anyone else can reproduce this on 4.3.1 I will feel less crazy :).

 Cheers,

 Tim

 -Original Message-
 From: Erick Erickson [mailto:erickerickson@gmail.com]
 Sent: Saturday, June 29, 2013 10:13 AM
 To: solr-user@lucene.apache.org
 Subject: Re: documentCache not used in 4.3.1?

 It's especially weird that the hit ratio is so high and you're not seeing
 anything in the cache. Are you perhaps soft committing frequently? Soft
 commits throw away all the top-level caches including documentCache I
 think

 Erick


 On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt tim@elementspace.com wrote:
 Thanks Otis,

 Yeah I realized after sending my e-mail that doc cache does not warm,
 however I'm still lost on why there are no other metrics.

 Thanks!

 Tim


 On 28 June 2013 16:22, Otis Gospodnetic otis.gospodnetic@gmail.com wrote:

  Hi Tim,

 Not sure about the zeros in 4.3.1, but in SPM we see all these
 numbers are non-0, though I haven't had the chance to confirm with

 Solr 4.3.1.

 Note that you can't really autowarm document cache...

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



 On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt
 t...@elementspace.com
 wrote:

 Hey guys,

 This has to be a stupid question/I must be doing something wrong,
 but

 after

 frequent load testing with documentCache enabled under Solr 4.3.1
 with autoWarmCount=150, I'm noticing that my documentCache metrics
 are

 always

 zero for non-cumulative.

 At first I thought my commit rate is fast enough that I just never see
 the non-cumulative result, but after 100s of samples I still always
 get zero values.

 Here is the current output of my documentCache from Solr's admin
 for 1

 core:

 

 - documentCache
   (http://localhost:8983/solr/#/channels_shard1_replica2/plugins/cache?entry=documentCache)

    - class: org.apache.solr.search.LRUCache
    - version: 1.0
    - description: LRU Cache(maxSize=512, initialSize=512, autowarmCount=150, regenerator=null)
    - src: $URL: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/solr/core/src/java/org/apache/solr/search/LRUCache.java $
    - stats:
       - lookups: 0
       - hits: 0
       - hitratio: 0.00
       - inserts: 0
       - evictions: 0
       - size: 0
       - warmupTime: 0
       - cumulative_lookups: 65198986
       - cumulative_hits: 63075669
       - cumulative_hitratio: 0.96
       - cumulative_inserts: 2123317
       - cumulative_evictions: 1010262


 The 

Re: dataconfig to index ZIP Files

2013-07-01 Thread Bernd Fehling
Try setting dataSource=null for your toplevel entity and
use filename=\.zip$ as filename selector.
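
Roughly like this (a sketch only -- the paths, entity names and the inner Tika
entity are illustrative; keep whatever you already use to parse the ZIP contents):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- the top-level entity only lists the files, hence dataSource="null" -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="/path/to/zips" fileName=".*\.zip$" recursive="true">
      <entity name="zipdoc" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>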



On 28.06.2013 23:14, ericrs22 wrote:
 unfortunately not. I had tried that before with the logs saying:
 
 Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
 java.util.regex.PatternSyntaxException: Dangling meta character '*' near
 index 0 
 
 
 With .*zip I get this:

 WARN  SimplePropertiesWriter  Unable to read: dataimport.properties
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074009.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 


Index pdf files.

2013-07-01 Thread archit2112
Hi, I'm new to Solr. I want to index PDF files using the Data Import Handler.
I'm using Solr 4.3.0. I followed the steps given in this post

http://lucene.472066.n3.nabble.com/indexing-with-DIH-and-with-problems-td3731129.html

However, I get the following error -

Full Import failed:java.lang.NoClassDefFoundError:
org/apache/tika/parser/Parser

Please help!

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index pdf files.

2013-07-01 Thread Shalin Shekhar Mangar
The tika jars are not in your classpath. You need to add all the jars
inside contrib/extraction/lib directory to your classpath.
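
One way to do that is with <lib> directives in solrconfig.xml instead of copying
jars around (the relative paths are illustrative and depend on where your core lives):

<!-- DataImportHandler jars -->
<lib dir="../../dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- Solr Cell / Tika and its dependencies; org.apache.tika.parser.Parser lives here -->
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />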

On Mon, Jul 1, 2013 at 2:00 PM, archit2112 archit2...@gmail.com wrote:
 Hi I'm new to Solr. I want to index pdf files usng the Data Import Handler.
 Im using Solr-4.3.0. I followed the steps given in this post

 http://lucene.472066.n3.nabble.com/indexing-with-DIH-and-with-problems-td3731129.html

 However, I get the following error -

 Full Import failed:java.lang.NoClassDefFoundError:
 org/apache/tika/parser/Parser

 Please help!

 Thanks



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Regards,
Shalin Shekhar Mangar.


Re: Stemming query in Solr

2013-07-01 Thread snkar
Hi Erick,

Thanks for the reply.

Here is what the situation is:

Relevant portion of Solr Schema:
<field name="Content" type="text_general" indexed="false" stored="true"
       required="true"/>
<field name="ContentSearch" type="text_general" indexed="true"
       stored="false" multiValued="true"/>
<field name="ContentSearchStemming" type="text_stem" indexed="true"
       stored="false" multiValued="true"/>
<copyField source="Content" dest="ContentSearch"/>
<copyField source="Content" dest="ContentSearchStemming"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
  </analyzer>
</fieldType>
When I am indexing a document, the content gets stored as-is in the Content
field and gets copied over to ContentSearch and ContentSearchStemming for text
based search and stemming search respectively. So, the ContentSearchStemming
field does store the stem/reduced form of the terms. I have checked this with
Luke as well as the Admin Schema Browser -> Term Info. In the Admin Analysis
screen, I have tested and found that if I index the text burning, it gets
reduced to and stored as burn. So far so good.

Now in the UI,
let's say the user puts in the term burn and checks the stemming option. The
expectation is that since the user has specified stemming, results should be
returned for the term burn as well as for all terms whose stem is burn, i.e.
burning, burned, burns, etc.
let's say the user puts in the term burning and checks the stemming option.
The expectation is that since the user has specified stemming, results should
be returned for the term burning as well as for all terms whose stem is burn,
i.e. burn, burned, burns, etc.
The query that gets submitted to Solr: q=ContentSearchStemming:burning
From Debug Info: 
<str name="rawquerystring">ContentSearchStemming:burning</str>
<str name="querystring">ContentSearchStemming:burning</str>
<str name="parsedquery">ContentSearchStemming:burn</str>
<str name="parsedquery_toString">ContentSearchStemming:burn</str>
So, when the results are returned, I am only getting the hits highlighted with 
the term burn, though the same document contains terms like burning and 
burns.

I thought that the stemming should work like this:
The stemming filter in the query analyzer chain would reduce the input word to
its stem: burning -> burn.
The query component should then match those terms whose stem matches the stem
of the input term: burns -> burn (matches), burning -> burn.
The first point is happening. But it looks like it is executing the search for
an exact text-based match with the stem burn. Hence, burns or burned are not
getting returned.
Hope I was able to make myself clear.

 On Fri, 28 Jun 2013 05:59:37 -0700, Erick Erickson [via Lucene]
<ml-node+s472066n4073901...@n3.nabble.com> wrote:


 First, this is for the Java version, I hope it extends to C#. 

But in your configuration, when you're indexing the stemmer 
should be storing the reduced form in the index. Then, when 
searching, the search should be against the reduced term. 
To check this, try 
1> Using the Admin/Analysis page to see what gets stored
   in your index and what your query is transformed to, to
   ensure that you're getting what you expect.

If you want to get deeper into the details, try
1> use, say, the TermsComponent or Admin/Schema Browser
   or Luke to look in your index and see what's actually
   there.
2> use &debug=query or Admin/Analysis to see what the query
   actually looks like.

Both your use-cases should work fine just with reduction 
_unless_ the particular word you look for doesn't happen to 
trip the stemmer. By that I mean that since it's algorithmically 
based, there may be some edge cases that seem like they 
should be reduced that aren't. I don't know whether fisherman 
would reduce to fish for instance. 

So are you seeing things that really don't work as expected or 
are you just working from the docs? Because I really don't 
see why you wouldn't get what you want given your description. 

Best 
Erick 


On Fri, Jun 28, 2013 at 2:33 AM, snkar <[hidden email]> wrote:

> We have a search system based on

Set spellcheck field on query time?

2013-07-01 Thread Timo Schmidt
Hello together,

we are currently working on a multi-language single-core setup.

During that I stumbled upon the question if it is possible to define different 
sources for the spellcheck.

For now I only see the possibility to define different request handlers. Is it 
somehow possible to set
the source field for the DirectSolrSpellChecker on querytime?

Cheers

timo


Timo Schmidt
Entwickler (Dipl. Inf. FH)


AOE GmbH
Borsigstr. 3
65205 Wiesbaden
Germany

Tel. +49 (0) 6122 70 70 7 - 234
Fax. +49 (0) 6122 70 70 7 -199



e-Mail: timo.schm...@aoemedia.demailto:timo.schm...@aoemedia.de
Web: http://www.aoemedia.de/







Sum as a Projection for Facet Queries

2013-07-01 Thread samarth s
Hi,

We have a need of finding the sum of a field for each facet.query. We have
looked at StatsComponent http://wiki.apache.org/solr/StatsComponent but
that supports only facet.field. Has anyone written a patch over
StatsComponent that supports the same along with some performance measures?

Is there any way we can do this using the Function Query sum
(http://wiki.apache.org/solr/FunctionQuery#sum)?
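
(For reference, the StatsComponent usage we looked at is roughly the following --
field names are illustrative -- and it only lets us break the stats down by a
facet field, not by an arbitrary facet.query:)

http://localhost:8983/solr/select?q=*:*&rows=0&stats=true&stats.field=price&stats.facet=category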

-- 
Regards,
Samarth


Multiple groups of boolean queries in a single query.

2013-07-01 Thread samabhiK
Hello friends,

I have a schema which contains various types of records of three different
categories for ease of management and for making a single query to fetch all
the data. The fields are grouped into three different types of records. For
example:

fields type 1:

<field name="x_date" type="tdate" indexed="true" stored="true"/>
<field name="x_name" type="tdate" indexed="true" stored="true"/>
<field name="x_type" type="tdate" indexed="true" stored="true"/>

fields type 2:

<field name="y_date" type="tdate" indexed="true" stored="true"/>
<field name="y_name" type="string" indexed="true" stored="true"/>
<field name="y_phone" type="string" indexed="true" stored="true"/>

fields type 3:

<field name="z_date" type="tdate" indexed="true" stored="true"/>
<field name="z_type" type="string" indexed="true" stored="true"/>

common partition field which identifies the category of the data record:

<field name="xyz_category" type="string" indexed="true" stored="true"/>

What should I do to fetch all these records in the form: 

(+x_date:[2011-01-01T00:00:00Z TO *] +x_type:(1 OR 2 OR 3 OR 4)
+xyz_category:X) OR
(+y_date:[2012-06-01T00:00:00Z TO *] +y_name:sam~ +xyz_category:Y) OR
(+z_date:[2013-03-01T00:00:00Z TO *] +xyz_category:Z)

Can we construct a query like this? Or is it even possible?

Sam



 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple groups of boolean queries in a single query.

2013-07-01 Thread samabhiK
My main concern is to be able to make a single query that fetches all the types
of records. If I had to create three different cores for these different types
of data, I would have to make 3 calls to Solr to fetch the entire result set,
and in reality I will have approximately 15 such types.

Also, in any given record either the section 1 fields are populated, or section
2's, or section 3's; at no point will all of these fields be populated in a
single record. The only field that has data for all records is xyz_category,
which allows us to partition the data set.

Any suggestions in writing a single query to fetch all the data we need will
be highly appreciated.

Thanks.
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294p4074296.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index pdf files.

2013-07-01 Thread archit2112
Hi 

Thanks a lot. I did what you said. Now I'm getting the following error.

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
java.util.regex.PatternSyntaxException: Dangling meta character '*' near
index 0



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074297.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Set spellcheck field on query time?

2013-07-01 Thread Jan Høydahl
Check out http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.dictionary 
- you can define multiple dictionaries in the same handler, each with its own 
source field.
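
A sketch of what that looks like in solrconfig.xml (dictionary and field names
are illustrative); the client then picks one per request with
spellcheck.dictionary=en or spellcheck.dictionary=de:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">en</str>
    <str name="field">spell_en</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">de</str>
    <str name="field">spell_de</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>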

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 1 July 2013 at 11:34, Timo Schmidt timo.schm...@aoemedia.de wrote:

 Hello together,
  
 we are currently working on a mutilanguage single core setup.
  
 During that I stumbled upon the question if it is possible to define 
 different sources for the spellcheck.
  
 For now I only see the possibility to define different request handlers. Is 
 it somehow possible to set
 the source field for the DirectSolrSpellChecker on querytime?
  
 Cheers
  
 timo
  
 
 Timo Schmidt
 Entwickler (Dipl. Inf. FH)
 
 
 AOE GmbH
 Borsigstr. 3
 65205 Wiesbaden
 Germany
 Tel. +49 (0) 6122 70 70 7 - 234
 Fax. +49 (0) 6122 70 70 7 -199
 
 
 
 e-Mail: timo.schm...@aoemedia.de
 Web: http://www.aoemedia.de/
 
  
  



Re: documentCache not used in 4.3.1?

2013-07-01 Thread Erick Erickson
Daniel:

Soft commits invalidate the top level caches, which include
things like filterCache, queryResultCache etc. Various
segment-level caches are NOT invalidated, but you really
don't have a lot of control from the Solr level over those
anyway.

But yeah, the tension between caching a bunch of stuff
for query speedups and NRT is still with us. Soft commits
are much less expensive than hard commits, but not being
able to use the caches as much is the price. You're right
that with such frequent autocommits, autowarming
probably is not worth the effort.

The question I always ask is whether 1 second is really
necessary. Or, more accurately, worth the price. Often
it's not and lengthening it out significantly may be an option,
but that's a discussion for you to have with your product
manager <G>

I have seen configurations that have a more frequent hard
commit (openSearcher=false) than soft commit. The
mantra is soft commits are about visibility, hard commits
are about durability.
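
That pattern looks roughly like this in solrconfig.xml (intervals are purely
illustrative, not a recommendation):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- durability: frequent hard commit that does NOT open a new searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- visibility: the soft commit interval controls how quickly docs become searchable -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>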

FWIW,
Erick


On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins danwcoll...@gmail.com wrote:

 We see similar results, again we softCommit every 1s (trying to get as NRT
 as we can), and we very rarely get any hits in our caches.  As an
 unscheduled test last week, we did shutdown indexing and noticed about 80%
 hit rate in caches (and average query time dropped from ~1s to 100ms!) so I
 think we are in the same position as you.

 I appreciate with such a frequent soft commit that the caches get
 invalidated, but I was expecting cache warming to help though it doesn't
 appear to be.  We *don't* currently run a warming query, my impression of
 NRT was that it was better to not do that as otherwise you spend more time
 warming the searcher and caches, and by the time you've done all that, the
 searcher is invalidated anyway!


 On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com wrote:

  That's a good idea, I'll try that next week.
 
  Thanks!
 
  Tim
 
 
  On 29/06/13 12:39 PM, Erick Erickson wrote:
 
  Tim:
 
  Yeah, this doesn't make much sense to me either since,
  as you say, you should be seeing some metrics upon
  occasion. But do note that the underlying cache only gets
  filled when getting documents to return in query results,
  since there's no autowarming going on it may come and
  go.
 
  But you can test this pretty quickly by lengthening your
  autocommit interval or just not indexing anything
  for a while, then run a bunch of queries and look at your
  cache stats. That'll at least tell you whether it works at all.
  You'll have to have hard commits turned off (or openSearcher
  set to 'false') for that check too.
 
  Best
  Erick
 
 
  On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim tvaillanco...@ea.com wrote:
 
   Yes, we are softCommit'ing every 1000ms, but that should be enough time
  to
  see metrics though, right? For example, I still get non-cumulative
  metrics
  from the other caches (which are also throw away). I've also
 curl/sampled
  enough that I probably should have seen a value by now.
 
  If anyone else can reproduce this on 4.3.1 I will feel less crazy :).
 
  Cheers,
 
  Tim
 
  -Original Message-
  From: Erick Erickson [mailto:erickerickson@gmail.com]
  Sent: Saturday, June 29, 2013 10:13 AM
  To: solr-user@lucene.apache.org
  Subject: Re: documentCache not used in 4.3.1?
 
  It's especially weird that the hit ratio is so high and you're not
 seeing
  anything in the cache. Are you perhaps soft committing frequently? Soft
  commits throw away all the top-level caches including documentCache I
  think
 
  Erick
 
 
  On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourt tim@elementspace.com wrote:
  Thanks Otis,
 
  Yeah I realized after sending my e-mail that doc cache does not warm,
  however I'm still lost on why there are no other metrics.
 
  Thanks!
 
  Tim
 
 
  On 28 June 2013 16:22, Otis Gospodnetic otis.gospodnetic@gmail.com wrote:
 
   Hi Tim,
 
  Not sure about the zeros in 4.3.1, but in SPM we see all these
  numbers are non-0, though I haven't had the chance to confirm with
 
  Solr 4.3.1.
 
  Note that you can't really autowarm document cache...
 
  Otis
  --
  Solr  ElasticSearch Support -- http://sematext.com/ Performance
 
  Monitoring -- http://sematext.com/spm
 
 
 
  On Fri, Jun 28, 2013 at 7:14 PM, Tim Vaillancourt
  t...@elementspace.com
  wrote:
 
  Hey guys,
 
  This has to be a stupid question/I must be doing something wrong,
  but
 
  after
 
  frequent load testing with documentCache enabled under Solr 4.3.1
  with autoWarmCount=150, I'm noticing that my documentCache metrics
  are
 
  always
 
  zero for non-cumlative.
 
  At first I thought my commit rate is fast enough I just never see
  the non-cumlative result, but after 100s of samples I still always
  get zero values.
 
  Here is the current output of my documentCache from Solr's admin
  for 1
 
 

Re: Index pdf files.

2013-07-01 Thread Erick Erickson
OK, have you done anything custom? You get
this where? solr logs? Echoed back in the browser?
In response to what command?

You haven't provided enough info to help us help you.
You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick


On Mon, Jul 1, 2013 at 6:08 AM, archit2112 archit2...@gmail.com wrote:

 Hi

 Thanks a lot. I did what you said. Now I'm getting the following error.

 Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
 java.util.regex.PatternSyntaxException: Dangling meta character '*' near
 index 0



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074297.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Index pdf files.

2013-07-01 Thread archit2112
I figured it out. It was a problem with the regular expression i used in
data-config.xml .




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-pdf-files-tp4074278p4074304.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Stemming query in Solr

2013-07-01 Thread Erick Erickson
bq:  But looks like it is executing the search for an exact text based
match with the stem burn.

Right. You need to appreciate index time as opposed to query time stemming.
Your field
definition has both turned on. The admin/analysis page will help here <G>..

At index time, the terms are stemmed, and _only_ the reduced term is put in
the index.
At query time, the same thing happens and _only_ the reduced term is
searched for.

By stemming at index time, you lose the original form of the word, it's
just gone and
nothing about checking/unchecking the stem bits will recover it. So the
general
solution is to index the field twice, once with stemming and once without
in order
to have the ability to do both stemmed and exact matches. I think I saw a
clever
approach to doing this involving a custom filter but can't find it now. As
I recall it
indexed the un-stemmed version like a synonym with some kind of marker
to indicate exact match when necessary
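
If memory serves, the building block for that is KeywordRepeatFilterFactory --
a sketch only, so check availability and behavior on your version:

<fieldType name="text_stem_keep_original" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emits each token twice: once flagged as a keyword (which the stemmer
         leaves alone) and once for the stemmer to reduce -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"/>
    <!-- drop the duplicate when stemming didn't change the token -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>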

Best
Erick


On Mon, Jul 1, 2013 at 5:15 AM, snkar soumya@zoho.com wrote:

 Hi Erick,

 Thanks for the reply.

 Here is what the situation is:

 Relevant portion of Solr Schema:
 <field name="Content" type="text_general" indexed="false" stored="true"
        required="true"/>
 <field name="ContentSearch" type="text_general" indexed="true"
        stored="false" multiValued="true"/>
 <field name="ContentSearchStemming" type="text_stem" indexed="true"
        stored="false" multiValued="true"/>
 <copyField source="Content" dest="ContentSearch"/>
 <copyField source="Content" dest="ContentSearchStemming"/>

 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 <fieldType name="text_stem" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"/>
   </analyzer>
 </fieldType>
 When I am indexing a document, the content gets stored as is in the
 Content field and gets copied over to ContentSearch and
 ContentSearchStemming for text based search and stemming search
 respectively. So, the ContentSearchStemming field does store the
 stem/reduced form of the terms. I have checked this with Luke as well
 as the Admin Schema Browser -> Term Info. In the Admin
 Analysis screen, I have tested and found that if I index the text
 burning, it gets reduced to and stored as burn. So far so good.

 Now in the UI,
 lets say the user puts in the term burn and checks the stemming option.
 The expectation is that since the user has specified stemming, the results
 should be returned for the term burn as well as for all terms which has
 their stem as burn i.e. burning, burned, burns, etc.
 lets say the user puts in the term burning and checks the stemming
 option. The expectation is that since the user has specified stemming, the
 results should be returned for the term burning as well as for all terms
 which has their stem as burn i.e. burn, burned, burns, etc.
 The query that gets submitted to Solr: q=ContentSearchStemming:burning
 From Debug Info:
 <str name="rawquerystring">ContentSearchStemming:burning</str>
 <str name="querystring">ContentSearchStemming:burning</str>
 <str name="parsedquery">ContentSearchStemming:burn</str>
 <str name="parsedquery_toString">ContentSearchStemming:burn</str>
 So, when the results are returned, I am only getting the hits highlighted
 with the term burn, though the same document contains terms like burning
 and
 burns.

 I thought that the stemming should work like this:
 The stemming filter in the query analyzer chain would reduce the input word
 to its stem. burning -> burn
 The query component should scan through the terms and match those terms
 for which it finds a match between the stem of the term with the stem of
 the input term. burns -> burn (matches) burning -> burn
 The first point is happening. But looks like it is executing the search
 for an exact text based match with the stem burn. Hence, burns or burned
 are not getting returned.
 Hope I was able to make myself clear.

  On Fri, 28 Jun 2013 05:59:37 -0700, Erick Erickson [via Lucene]
 <ml-node+s472066n4073901...@n3.nabble.com> wrote:


  First, this is for the Java version, I hope it extends to C#.

 But in your configuration, when you're indexing the stemmer
 should be storing the reduced form in the index. Then, when
 searching, the search should be against the reduced term.
 To check this, try
 1> Using the

Re: Multiple groups of boolean queries in a single query.

2013-07-01 Thread Erick Erickson
Have you tried the query you indicated? Because it should
just work barring syntax errors. The only other thing you
might want is to turn on grouping by field type. That'll
return separate sections by type, say the top 3 (default 1)
documents in each type. If you don't group, you have the
possibility that your entire results (i.e. the number of docs
in the rows parameter) will be all one type.

see:
http://wiki.apache.org/solr/FieldCollapsing
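
For your example that would be something like this (a sketch; values are illustrative):

q=(+x_date:[2011-01-01T00:00:00Z TO *] +x_type:(1 OR 2 OR 3 OR 4) +xyz_category:X)
  OR (+y_date:[2012-06-01T00:00:00Z TO *] +y_name:sam~ +xyz_category:Y)
  OR (+z_date:[2013-03-01T00:00:00Z TO *] +xyz_category:Z)
&group=true&group.field=xyz_category&group.limit=3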

Best
Erick


On Mon, Jul 1, 2013 at 6:06 AM, samabhiK qed...@gmail.com wrote:

 My entire concern is to be able to make a single query to fetch all the
 types
 of records. If I had to create three different cores for this different
 types of data, I would have to make 3 calls to solr to fetch the entire set
 of data. And I will be having approx 15 such types in real.

 Also, at any given record, either the section 1 fields are filled up or
 section 2's or section 3's. At no point, will we have all these fields
 populated in a single record. Only field that will have data for all
 records
 is xyz_category to allow us to partition the data set.

 Any suggestions in writing a single query to fetch all the data we need
 will
 be highly appreciated.

 Thanks.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multiple-groups-of-boolean-queries-in-a-single-query-tp4074294p4074296.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Shard tolerant partial results

2013-07-01 Thread Phil Hoy
Hi,

When doing distributed searches with shards.tolerant set while the hosts for a
slice are down, and the response is therefore partial, how is that best detected?
We would like not to cache such results upstream, and perhaps inform the end user
in some way.

I am aware that shards.info could be used, however I am concerned this may have
performance implications due to the cost of parsing the response from Solr and
perhaps some extra cost incurred by Solr to generate the response.

Perhaps an http header could be added or another attribute added to the solr 
result node.

Phil


Unique key error while indexing pdf files

2013-07-01 Thread archit2112
Hi

I'm trying to index PDF files in Solr 4.3.0 using the Data Import Handler.

*My request handler - *

<requestHandler name="/dataimport1"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config1.xml</str>
  </lst>
</requestHandler>

*My data-config1.xml *

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="C:\Users\aroraarc\Desktop\Impdo" fileName=".*pdf"
            recursive="true">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title1" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>


Now when I try to index the files I get the following error -

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id
at
org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at 
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
at
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)


This problem is easy to solve in the case of database indexing, but I don't
know how to go about the unique key of a document. How do I define the id
field (unique key) of a PDF file? How do I solve this problem?

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky

It all depends on your data model - tell us more about your data model.

For example, how will users or applications query these documents and what 
will they expect to be able to do with the ID/key for the documents?


How are you expecting to identify documents in your data model?

-- Jack Krupansky

-Original Message- 
From: archit2112

Sent: Monday, July 01, 2013 7:17 AM
To: solr-user@lucene.apache.org
Subject: Unique key error while indexing pdf files

Hi

Im trying to index pdf files in solr 4.3.0 using the data import handler.

*My request handler - *

requestHandler name=/dataimport1
class=org.apache.solr.handler.dataimport.DataImportHandler
   lst name=defaults
 str name=configdata-config1.xml/str
   /lst
 /requestHandler

*My data-config1.xml *

dataConfig
dataSource type=BinFileDataSource /
document
entity name=f dataSource=null rootEntity=false
processor=FileListEntityProcessor
baseDir=C:\Users\aroraarc\Desktop\Impdo fileName=.*pdf
recursive=true
entity name=tika-test processor=TikaEntityProcessor
url=${f.fileAbsolutePath} format=text
field column=Author name=author meta=true/
field column=title name=title1 meta=true/
field column=text name=text/
/entity
/entity
/document
/dataConfig


Now When i try and index the files i get the following error -

org.apache.solr.common.SolrException: Document is missing mandatory
uniqueKey field: id
at
org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:517)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:396)
at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
at
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:235)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:500)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)


This problem can be solved easily in case of database indexing but i dont
know how to go about the unique key of a document. how do i define the id
field (unique key) of a pdf file. how do i solve this problem?

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Unique key error while indexing pdf files

2013-07-01 Thread archit2112
I'm new to Solr. I'm just trying to understand and explore various features
offered by Solr and their implementations. I would be very grateful if you
could solve my problem with any example of your choice. I just want to learn
how I can index PDF documents using the Data Import Handler.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field

2013-07-01 Thread tuedel
Hey, i have tried to make use of the UniqFieldsUpdateProcessorFactory in
order to achieve distinct values in multivalued fields. Example below:

<updateRequestProcessorChain name="uniq_fields">
  <processor
      class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>title</str>
      <str>tag_type</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uniq_fields</str>
  </lst>
</requestHandler>

However, the documents are indexed one at a time, and a document may get an
additional tag in a future update. To make sure there are no duplicate tags, I
was hoping the update processor would do what I want to achieve. To actually
add a tag I am sending an atomic update like

tag_type: {"add": "foo"}

which still adds the tag without checking whether it is already part of the
field. How can I achieve distinct values on the Solr side?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/RemoveDuplicatesTokenFilterFactory-to-avoid-import-duplicate-values-in-multivalued-field-tp4029004p4074324.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Unique key error while indexing pdf files

2013-07-01 Thread Jack Krupansky
It's really 100% up to you how you want to come up with the unique key 
values for your documents. What would you like them to be? Just use that. 
Anything (within reason) - anything goes.


But it also comes back to your data model. You absolutely must come up with 
a data model for how you expect to index and query data in Solr before you 
just start throwing random data into Solr.


1. Design your data model.
2. Produce a Solr schema from that data model.
3. Map the raw data from your data sources (e.g., PDF files) to the fields 
in your Solr schema.


That last step includes the ID/key field, but your data model will imply any 
requirements for what the ID/key should be.


To be absolutely clear, it is 100% up to you to design the ID/key for every 
document; Solr does NOT do that for you.


Even if you are just exploring, at least come up with an exploratory 
data model - which includes what expectations you have about the unique 
ID/key for each document.


So, for that first PDF file, what expectation (according to your data model) 
do you have for what its ID/key should be?
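
As one illustration only, not a recommendation: if, in your data model, the
absolute file path happens to be an acceptable key, the DIH entity you already
have could fill the id field from it via a TemplateTransformer:

<entity name="tika-test" processor="TikaEntityProcessor"
        transformer="TemplateTransformer"
        url="${f.fileAbsolutePath}" format="text">
  <!-- illustrative: the PDF's absolute path becomes the uniqueKey -->
  <field column="id" template="${f.fileAbsolutePath}"/>
  <field column="Author" name="author" meta="true"/>
  <field column="title" name="title1" meta="true"/>
  <field column="text" name="text"/>
</entity>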


-- Jack Krupansky

-Original Message- 
From: archit2112

Sent: Monday, July 01, 2013 8:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Unique key error while indexing pdf files

Im new to solr. Im just trying to understand and explore various features
offered by solr and their implementations. I would be very grateful if you
could solve my problem with any example of your choice. I just want to learn
how i can index pdf documents using data import handler.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unique-key-error-while-indexing-pdf-files-tp4074314p4074327.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Stemming query in Solr

2013-07-01 Thread snkar

So the general solution is to index the field twice, once with stemming and 
once without in order to have the ability to do both stemmed and exact matches 

I am already indexing the text twice using the ContentSearch and 
ContentSearchStemming fields. But what this allows me is to return burning as 
well as burn if the user specifies burning as the input search term, 
burning being the exact match:

ContentSearch:burning + ContentSearchStemming:burn(reduced from 
ContentSearchStemming:burning)

What I cannot figure out is how is this going to help me in instructing Solr to 
execute the query for the different grammatical variations of the input search 
term stem i.e. stemming query for burning expands to text based query for 
burn, burns, burned, burning, etc.

You mentioned something about synonym. This was also mentioned in the Solr Wiki:
A related technology to stemming is lemmatization, which allows for stemming 
by expansion, taking a root word and 'expanding' it to all of its various 
forms. Lemmatization can be used either at insertion time or at query time. 
Lucene/Solr does not have built-in support for lemmatization but it can be 
simulated by using your own dictionaries and the SynonymFilterFactory  

I think what I need is exactly this point:

Lucene/Solr does not have built-in support for lemmatization but it can be 
simulated by using your own dictionaries and the SynonymFilterFactory

But I am not sure how to go about it, or exactly how synonyms can help me here,
as I am not looking for synonyms but rather for different expansions of the
stemmed word.
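
From what I can tell, the simulation would look something like this -- a sketch
only, with illustrative synonyms.txt entries rather than a real lemma dictionary:

In synonyms.txt:
burn,burns,burned,burnt,burning

In the field type:
<fieldType name="text_lemma" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand="true" writes every listed form into the index -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>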

 On Mon, 01 Jul 2013 03:42:34 -0700, Erick Erickson [via Lucene]
<ml-node+s472066n4074311...@n3.nabble.com> wrote:


 bq:  But looks like it is executing the search for an exact text based 
match with the stem burn. 

Right. You need to appreciate index time as opposed to query time stemming. 
Your field 
definition has both turned on. The admin/analysis page will help here
<G>..

At index time, the terms are stemmed, and _only_ the reduced term is put in 
the index. 
At query time, the same thing happens and _only_ the reduced term is 
searched for. 

By stemming at index time, you lose the original form of the word, it's 
just gone and 
nothing about checking/unchecking the stem bits will recover it. So the 
general 
solution is to index the field twice, once with stemming and once without 
in order 
to have the ability to do both stemmed and exact matches. I think I saw a 
clever 
approach to doing this involving a custom filter but can't find it now. As 
I recall it 
indexed the un-stemmed version like a synonym with some kind of marker 
to indicate exact match when necessary 

Best 
Erick 


On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote:

> Hi Erick,
>
> Thanks for the reply.
>
> Here is what the situation is:
>
> Relevant portion of Solr Schema:
> <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
> <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
> <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
> <copyField source="Content" dest="ContentSearch"/>
> <copyField source="Content" dest="ContentSearchStemming"/>
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_stem" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"/>
>   </analyzer>
> </fieldType>
> When I am indexing a document, the content gets stored as is in the
> Content field and gets copied over to ContentSearch and
> ContentSearchStemming for text based search and stemming search
> respectively. So, the ContentSearchStemming field does store the
> stem/reduced form of the terms. I have checked this with Luke as well
> as the Admin Schema Browser -> Term Info. In the Admin
> Analysis screen, I have tested and found that if I index the text
> burning, it gets reduced to and stored as burn. So far so good.

Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate values in multivalued field

2013-07-01 Thread Jack Krupansky
Your stated problem seems to have nothing to do with the message subject 
line relating to RemoveDuplicatesTokenFilterFactory. Please start a new 
message thread unless you really are concerned with an issue related to 
RemoveDuplicatesTokenFilterFactory.


This kind of thread hijacking is inappropriate for this email list (or any 
email list.)


-- Jack Krupansky

-Original Message- 
From: tuedel

Sent: Monday, July 01, 2013 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: RemoveDuplicatesTokenFilterFactory to avoid import duplicate 
values in multivalued field


Hey, i have tried to make use of the UniqFieldsUpdateProcessorFactory in
order to achieve distinct values in multivalued fields. Example below:

updateRequestProcessorChain name=uniq_fields
  processor
class=org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory
lst name=fields
  strtitle/str
  strtag_type/str
/lst
  /processor
  processor class=solr.RunUpdateProcessorFactory /
/updateRequestProcessorChain

requestHandler name=/update class=solr.UpdateRequestHandler
  lst name=defaults
 str name=update.chainuniq_fields/str
   /lst
 /requestHandler

However the data being is indexed one by one. This may happen, since a
document may will get an additional tag in a future update. Unfortunately in
order to ensure not having any duplicate tags, i was hoping, the
UpdateProcessorFactory is doing what i want to achieve. In order to actually
add a tag, i am sending an

tag_type :{add:foo}, which still adds the tag, without questioning if
its already part of the field. How may i be able to achieve distinct values
on solr side?!




--
View this message in context: 
http://lucene.472066.n3.nabble.com/RemoveDuplicatesTokenFilterFactory-to-avoid-import-duplicate-values-in-multivalued-field-tp4029004p4074324.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Stemming query in Solr

2013-07-01 Thread snkar
I was just wondering if another solution might work. If we are able to extract
the stem of the input search term (maybe using a C#-based stemmer, some open
source implementation of the Porter algorithm) for cases where the stemming
option is selected, and submit the query to Solr as a multiple-character
wildcard query on the stem, it should return all the different variations of
the stemmed word.

Example:

Search Term: burning
Stem: burn
Modified Query: burn*
Results: burn, burning, burns, burnt, etc.

I am sure this is not the proper way of executing a stemming by expansion, but 
this might just get the job done. What do you think? Trying to think of test 
case where this will fail.

 On Mon, 01 Jul 2013 03:42:34 -0700, Erick Erickson [via Lucene]
<ml-node+s472066n4074311...@n3.nabble.com> wrote:


 bq:  But looks like it is executing the search for an exact text based 
match with the stem burn. 

Right. You need to appreciate index time as opposed to query time stemming. 
Your field 
definition has both turned on. The admin/analysis page will help here
<G>..

At index time, the terms are stemmed, and _only_ the reduced term is put in 
the index. 
At query time, the same thing happens and _only_ the reduced term is 
searched for. 

By stemming at index time, you lose the original form of the word, it's 
just gone and 
nothing about checking/unchecking the stem bits will recover it. So the 
general 
solution is to index the field twice, once with stemming and once without 
in order 
to have the ability to do both stemmed and exact matches. I think I saw a 
clever 
approach to doing this involving a custom filter but can't find it now. As 
I recall it 
indexed the un-stemmed version like a synonym with some kind of marker 
to indicate exact match when necessary 

Best 
Erick 


On Mon, Jul 1, 2013 at 5:15 AM, snkar <[hidden email]> wrote:

> Hi Erick,
>
> Thanks for the reply.
>
> Here is what the situation is:
>
> Relevant portion of Solr Schema:
> <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
> <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
> <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
> <copyField source="Content" dest="ContentSearch"/>
> <copyField source="Content" dest="ContentSearchStemming"/>
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_stem" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory"/>
>   </analyzer>
> </fieldType>
> When I am indexing a document, the content gets stored as is in the
> Content field and gets copied over to ContentSearch and
> ContentSearchStemming for text based search and stemming search
> respectively. So, the ContentSearchStemming field does store the
> stem/reduced form of the terms. I have checked this with Luke as well
> as the Admin Schema Browser -> Term Info. In the Admin
> Analysis screen, I have tested and found that if I index the text
> burning, it gets reduced to and stored as burn. So far so good.
>
> Now in the UI,
> lets say the user puts in the term burn and checks the stemming option.
> The expectation is that since the user has specified stemming, the results
> should be returned for the term burn as well as for all terms which has
> their stem as burn i.e. burning, burned, burns, etc.
> lets say the user puts in the term burning and checks the stemming
> option. The expectation is that since the user has specified stemming, the
> results should be returned for the term burning as well as for all terms
> which has their stem as burn i.e. burn, burned, burns, etc.
> The query that gets submitted to Solr: q=ContentSearchStemming:burning
> From Debug Info:
> <str name="rawquerystring">ContentSearchStemming:burning</str>
> <str

Re: Shard tolerant partial results

2013-07-01 Thread Mark Miller

On Jul 1, 2013, at 6:56 AM, Phil Hoy p...@brightsolid.com wrote:

 Perhaps an http header could be added or another attribute added to the solr 
 result node.

I thought that was already done - I'm surprised that it's not. If that's really 
the case, please make a JIRA issue.

- Mark

Distinct values in multivalued fields

2013-07-01 Thread tuedel
Hello everybody,

i have tried to make use of the UniqFieldsUpdateProcessorFactory in 
order to achieve distinct values in multivalued fields. Example below: 

<updateRequestProcessorChain name="uniq_fields">
  <processor
      class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>title</str>
      <str>tag_type</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uniq_fields</str>
  </lst>
</requestHandler>

However, the documents are indexed one at a time, and a document may get an
additional tag in a future update. To make sure there are no duplicate tags, I
was hoping the update processor would do what I want to achieve. To actually
add a tag I am sending an atomic update like

tag_type: {"add": "foo"}

which still adds the tag without checking whether it is already part of the
field. How can I achieve distinct values on the Solr side?

To achieve this behavior, I suspect writing my own processor might be a
solution, but I am uncertain how to do it and whether it is the proper way.
Imagine an incoming update, e.g. an update of an existing document with several
multivalued fields, that does not specify add or set: it would cause the
corresponding document to be dropped and re-indexed without keeping any
previously added values in the multivalued fields. So, if a field is being
updated and the value is not yet part of the index, the value should be added;
otherwise it should be ignored. The processor needs to decide, based on the
existing index, whether a field value gets added or not. Is that achievable on
the Solr side?
Below is my current, pretty empty processor class:
Below my current pretty empty processor class:

import java.io.IOException;
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class ConditionalSolrUniqFieldValuesProcessorFactory extends
        UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest sqr,
            SolrQueryResponse sqr1, UpdateRequestProcessor urp) {
        return new ConditionalUniqFieldValuesProcessor(urp);
    }

    class ConditionalUniqFieldValuesProcessor extends UpdateRequestProcessor {

        public ConditionalUniqFieldValuesProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();

            Collection<String> incomingFieldNames = doc.getFieldNames();
            for (String fieldName : incomingFieldNames) {
                // TODO: if the field is multivalued and the incoming value is
                // already part of the indexed document, drop it here;
                // otherwise keep it so it gets added to the multivalued field.
            }

            // pass the (possibly modified) document on to the next processor
            super.processAdd(cmd);
        }
    }
}







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Distinct-values-in-multivalued-fields-tp4074337.html
Sent from the Solr - User mailing list archive at Nabble.com.


Converting nested data model to solr schema

2013-07-01 Thread adfel70
Hi,
I have the following data model:
1. Document (fields: doc_id, author, content)
2. Each Document has multiple  attachment types. Each attachment type has
multiple instances. And each attachment type may have different fields.
for example:
doc
   doc_id1/doc_id
   authorjohn/author
   contentsome long long text.../content
   file_attachments
  file_attachment
 attach_id458/attach_id
 attach_textSomeText/attach_text
 attach_date12/12/2012/attach_date
  /file_attachment
  file_attachment
 attach_id568/attach_id
 attach_textSomeText2/attach_text
 attach_date12/11/2012/attach_date
  /file_attachment
   /file_attachments
   reply_attachments
  reply_attachment
 reply_id345/reply_id
 reply_textSomeText/reply_text
 reply_authorJack/reply_author
 reply_date22-12-2012/reply_date
  /reply_attachment
  reply_attachment
 reply_id897/attach_id
 reply_textSomeText2/reply_text
 reply_authorBob/reply_author
 reply_date23-12-2012/reply_date
  /reply_attachment
   /reply_attachments


I want to index all this data in solr cloud.
My current solution is to index the original document by its self and index
each attachment as a single solr document with its parent_doc_id, and then
use solr join capability.
The problem with this solution is  that I must index all the attachments of
each document, and the document itself in the same shard (current solr
limitation).
This requires me to override the solr document distribution mechanism.
I fear that with this solution I may loose some of solr cloud's
capabilities.
My questions are:
1. Are my concerns regarding downside of overriding solr cloud's
out-of-the-box mechanism justified? Or should I proceed with this solution?
2. If I'm looking for another solution, can I  somehow keep all attachments
on the same document and be able to query on a single attachment?
A query example:
Retrieve  all documents where:
content: contains abc
AND
reply_attachment.author = 'Bob'
AND
reply_attachment.date = '12-12-2012'


Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Distinct values in multivalued fields

2013-07-01 Thread Upayavira
Have a look at the DedupUpdateProcessorFactory, which may help you.
Although, I'm not sure if it works with multivalued fields.

Upayavira

On Mon, Jul 1, 2013, at 02:34 PM, tuedel wrote:
 Hello everybody,
 
 i have tried to make use of the UniqFieldsUpdateProcessorFactory in 
 order to achieve distinct values in multivalued fields. Example below: 
 
 <updateRequestProcessorChain name="uniq_fields">
   <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
     <lst name="fields">
       <str>title</str>
       <str>tag_type</str>
     </lst>
   </processor>
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
 
 <requestHandler name="/update" class="solr.UpdateRequestHandler">
   <lst name="defaults">
     <str name="update.chain">uniq_fields</str>
   </lst>
 </requestHandler>
 
 However the data being is indexed one by one. This may happen, since a 
 document may will get an additional tag in a future update. Unfortunately
 in 
 order to ensure not having any duplicate tags, i was hoping, the 
 UpdateProcessorFactory is doing what i want to achieve. In order to
 actually 
 add a tag, i am sending an 
 
 tag_type :{add:foo}, which still adds the tag, without questioning
 if 
 its already part of the field. How may i be able to achieve distinct
 values 
 on solr side?! 
 
 In order to achieve this behavior i suggest writing an own processor
 might
 be a solution. However i am uncertain how to do and if it's the proper
 way. 
 Imagine an incoming update - e.g. an update of an existing document
 having
 several multivalued fields without specifying add or set. This task
 would cause the corresponding document to get dropped and re-indexed
 without
 keeping any previously added values within the multivalued field. 
 Therefore if a field is getting updated and not having the distinct value
 being part of the index yet, shall add the value, otherwise ignore it.
 The
 processor needs to define whether a field is getting added to the index
 or
 not in condition of the existing index. Is that achievable on Solr side?! 
 Below my current pretty empty processor class:
 
 public class ConditionalSolrUniqFieldValuesProcessorFactory extends
 UpdateRequestProcessorFactory {
 
 @Override
 public UpdateRequestProcessor getInstance(SolrQueryRequest sqr,
 SolrQueryResponse sqr1, UpdateRequestProcessor urp) {
 return new ConditionalUniqFieldValuesProcessor(urp);
 }
 
 class ConditionalUniqFieldValuesProcessor extends
 UpdateRequestProcessor
 {
 
 public ConditionalUniqFieldValuesProcessor(UpdateRequestProcessor
 next) {
 super(next);
 }
 
 @Override
 public void processAdd(AddUpdateCommand cmd) throws IOException {
 SolrInputDocument doc = cmd.getSolrInputDocument();
 
 CollectionString incomingFieldNames = doc.getFieldNames();
 for (String t : incomingFieldNames) {
 /*
 is multivalued
 if (doc.getField(t).) { 
 If multivalued and already part of index, drop from
 index. Otherwise add to multivalued field.
 }
 */
 }
  
 }
 }
 }
 
 
 
 
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Distinct-values-in-multivalued-fields-tp4074337.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Converting nested data model to solr schema

2013-07-01 Thread Jack Krupansky
Simply duplicate a subset of the fields that you want to query of the parent 
document on each child document and then you can directly query the child 
documents without any join.


Yes, given the complexity of your data, a two-step query process may be 
necessary for some queries - do one query to get parent or child IDs and 
then do a second query filtered by those IDs.


And, yes, this only approximates the full power of an SQL join - but at a 
tiny fraction of the cost.


-- Jack Krupansky

-Original Message- 
From: adfel70

Sent: Monday, July 01, 2013 9:56 AM
To: solr-user@lucene.apache.org
Subject: Converting nested data model to solr schema

Hi,
I have the following data model:
1. Document (fields: doc_id, author, content)
2. Each Document has multiple  attachment types. Each attachment type has
multiple instances. And each attachment type may have different fields.
for example:
doc
  doc_id1/doc_id
  authorjohn/author
  contentsome long long text.../content
  file_attachments
 file_attachment
attach_id458/attach_id
attach_textSomeText/attach_text
attach_date12/12/2012/attach_date
 /file_attachment
 file_attachment
attach_id568/attach_id
attach_textSomeText2/attach_text
attach_date12/11/2012/attach_date
 /file_attachment
  /file_attachments
  reply_attachments
 reply_attachment
reply_id345/reply_id
reply_textSomeText/reply_text
reply_authorJack/reply_author
reply_date22-12-2012/reply_date
 /reply_attachment
 reply_attachment
reply_id897/attach_id
reply_textSomeText2/reply_text
reply_authorBob/reply_author
reply_date23-12-2012/reply_date
 /reply_attachment
  /reply_attachments


I want to index all this data in solr cloud.
My current solution is to index the original document by its self and index
each attachment as a single solr document with its parent_doc_id, and then
use solr join capability.
The problem with this solution is  that I must index all the attachments of
each document, and the document itself in the same shard (current solr
limitation).
This requires me to override the solr document distribution mechanism.
I fear that with this solution I may loose some of solr cloud's
capabilities.
My questions are:
1. Are my concerns regarding downside of overriding solr cloud's
out-of-the-box mechanism justified? Or should I proceed with this solution?
2. If I'm looking for another solution, can I  somehow keep all attachments
on the same document and be able to query on a single attachment?
A query example:
Retrieve  all documents where:
content: contains abc
AND
reply_attachment.author = 'Bob'
AND
reply_attachment.date = '12-12-2012'


Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Converting-nested-data-model-to-solr-schema-tp4074351.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Distinct values in multivalued fields

2013-07-01 Thread Jack Krupansky
Unfortunately, update processors only see the new, fresh, incoming data, 
not any existing document data.


This is a case where your best bet may be to read the document first and 
then merge your new value into the existing list of values.
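
A rough SolrJ sketch of that read-then-merge approach (the core URL, the id
and the field name are placeholders, error handling is omitted, and it assumes
the document already exists):

import java.io.IOException;
import java.util.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class MergeTagExample {
    public static void main(String[] args) throws SolrServerException, IOException {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

        // read the current values of the multivalued field
        SolrDocument existing =
            server.query(new SolrQuery("id:123")).getResults().get(0);
        Collection<Object> current = existing.getFieldValues("tag_type");

        // merge the new value, keeping the values distinct
        Set<Object> merged = new LinkedHashSet<Object>();
        if (current != null) merged.addAll(current);
        merged.add("foo");

        // write the merged list back with an atomic "set" operation
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "123");
        Map<String, Object> setOp = new HashMap<String, Object>();
        setOp.put("set", new ArrayList<Object>(merged));
        doc.addField("tag_type", setOp);

        server.add(doc);
        server.commit();
    }
}

If concurrent updates are a concern, Solr's _version_-based optimistic
concurrency can be layered on top of this.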



-- Jack Krupansky
-Original Message- 
From: tuedel

Sent: Monday, July 01, 2013 9:34 AM
To: solr-user@lucene.apache.org
Subject: Distinct values in multivalued fields

Hello everybody,

i have tried to make use of the UniqFieldsUpdateProcessorFactory in
order to achieve distinct values in multivalued fields. Example below:

<updateRequestProcessorChain name="uniq_fields">
  <processor class="org.apache.solr.update.processor.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>title</str>
      <str>tag_type</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uniq_fields</str>
  </lst>
</requestHandler>

However the data being is indexed one by one. This may happen, since a
document may will get an additional tag in a future update. Unfortunately in
order to ensure not having any duplicate tags, i was hoping, the
UpdateProcessorFactory is doing what i want to achieve. In order to actually
add a tag, i am sending an

tag_type :{add:foo}, which still adds the tag, without questioning if
its already part of the field. How may i be able to achieve distinct values
on solr side?!

In order to achieve this behavior i suggest writing an own processor might
be a solution. However i am uncertain how to do and if it's the proper way.
Imagine an incoming update - e.g. an update of an existing document having
several multivalued fields without specifying add or set. This task
would cause the corresponding document to get dropped and re-indexed without
keeping any previously added values within the multivalued field.
Therefore if a field is getting updated and not having the distinct value
being part of the index yet, shall add the value, otherwise ignore it. The
processor needs to define whether a field is getting added to the index or
not in condition of the existing index. Is that achievable on Solr side?!
Below my current pretty empty processor class:

public class ConditionalSolrUniqFieldValuesProcessorFactory extends
UpdateRequestProcessorFactory {

   @Override
   public UpdateRequestProcessor getInstance(SolrQueryRequest sqr,
SolrQueryResponse sqr1, UpdateRequestProcessor urp) {
   return new ConditionalUniqFieldValuesProcessor(urp);
   }

   class ConditionalUniqFieldValuesProcessor extends UpdateRequestProcessor
{

   public ConditionalUniqFieldValuesProcessor(UpdateRequestProcessor
next) {
   super(next);
   }

   @Override
   public void processAdd(AddUpdateCommand cmd) throws IOException {
   SolrInputDocument doc = cmd.getSolrInputDocument();

   CollectionString incomingFieldNames = doc.getFieldNames();
   for (String t : incomingFieldNames) {
   /*
   is multivalued
   if (doc.getField(t).) {
   If multivalued and already part of index, drop from
index. Otherwise add to multivalued field.
   }
   */
   }

   }
   }
}







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Distinct-values-in-multivalued-fields-tp4074337.html
Sent from the Solr - User mailing list archive at Nabble.com. 



How to re-index Solr & get term frequency within documents

2013-07-01 Thread Tony Mullins
Hi,

I am using Solr 4.3.0.
If I change my solr's schema.xml then do I need to re-index my solr ? And
if yes , how to ?

My 2nd question is I need to find the frequency of term per document in all
documents of search result.

My field is

<field name="CommentX" type="text_general" stored="true" indexed="true"
       multiValued="true" termVectors="true" termPositions="true"
       termOffsets="true"/>

And I am trying this query

http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

Its just returning me the result set, no info on my searched term's
(iphone) frequency in each document.

How can I make Solr to return the frequency of searched term per document
in result set ?

Thanks,
Tony.


Re: ConcurrentUpdateSolrServer hanging

2013-07-01 Thread qungg
Hi,

blockUntilFinished() blocks indefinitely sometimes. But if I send a commit from
another thread to the instance, the ConcurrentUpdateSolrServer unblocks, sends
the rest of the documents, and commits. So the sequence looks like this:

1. adding documents as usual...
2. finish adding documents...
3. block until finished... blocks forever (I try to block before the commit;
call this commit 1)
4. from another thread, send a commit (let's call this commit 2)
5. magically unblocked... and flushes out the rest of the documents...
6. commit 1...
7. commit 2...

The order of commits in 6 and 7 is observed in the Solr log.

Thanks,
Qun




--
View this message in context: 
http://lucene.472066.n3.nabble.com/ConcurrentUpdateSolrServer-hanging-tp4073620p4074366.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Jack Krupansky
You can write any function query in the field list of the fl parameter. 
Sounds like you want termfreq:


termfreq(field_arg,term)

fl=id,a,b,c,termfreq(a,xyz)
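
For example, with the CommentX field from your query it would be something
like this (illustrative, untested):

fl=AuthorX,TitleX,CommentX,termfreq(CommentX,'iphone')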


-- Jack Krupansky

-Original Message- 
From: Tony Mullins

Sent: Monday, July 01, 2013 10:47 AM
To: solr-user@lucene.apache.org
Subject: How to re-index Solr & get term frequency within documents

Hi,

I am using Solr 4.3.0.
If I change my solr's schema.xml then do I need to re-index my solr ? And
if yes , how to ?

My 2nd question is I need to find the frequency of term per document in all
documents of search result.

My field is

field name=CommentX type=text_general stored=true indexed=true
multiValued=true termVectors=true termPositions=true
termOffsets=true/

And I am trying this query

http://localhost:8080/solr/select/?q=iphonefl=AuthorX%2CTitleX%2CCommentXdf=CommentXwt=xmlindent=trueqt=tvrhtv=truetv.tf=truetv.df=truetv.positionstv.offsets=true

Its just returning me the result set, no info on my searched term's
(iphone) frequency in each document.

How can I make Solr to return the frequency of searched term per document
in result set ?

Thanks,
Tony. 



Concurrent Modification Exception

2013-07-01 Thread adityab
Hi, 
I have recently upgraded from Solr 3.5 to 4.2.1.
Also, we have added the spellcheck feature to our search query. During our
performance testing we have observed that for every 2000 requests, 1 request
fails.
The exception we observe in the Solr log is ConcurrentModificationException.
Below is the complete stack trace for the exception.
Any idea what could potentially be the reason? I did check the JIRA list in
Solr/Lucene to see if there is any issue filed and fixed, but couldn't find
one that's directly associated with LRUCache.

thanks
Aditya 

2013-06-28 20:32:57,265 SEVERE [org.apache.solr.core.SolrCore] (http-80-20)
java.util.ConcurrentModificationException
at 
java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
at java.util.AbstractList$Itr.next(AbstractList.java:343)
at java.util.AbstractList.equals(AbstractList.java:506)
at org.apache.solr.search.QueryResultKey.isEqual(QueryResultKey.java:96)
at org.apache.solr.search.QueryResultKey.equals(QueryResultKey.java:81)
at java.util.HashMap.getEntry(HashMap.java:349)
at java.util.LinkedHashMap.get(LinkedHashMap.java:280)
at org.apache.solr.search.LRUCache.get(LRUCache.java:130)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1276)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
at
org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
at
org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
at
org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:567)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:829)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:598)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:662)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Concurrent-Modification-Exception-tp4074371.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: documentCache not used in 4.3.1?

2013-07-01 Thread Daniel Collins
Regrettably, visibility is key for us :(  Documents must be searchable as
soon as they have been indexed (or as near as we can make it).  Our old
search system didn't do relevance sort, it was time-ordered (so it had a
much simpler job) but it did have sub-second latency, and that is what is
expected for its replacement (I know Solr doesn't like <1s currently, but
we live in hope!).  Tried explaining that by doing a relevance sort we are
searching 100% of the collection, instead of the ~10%-20% that a time-ordered
sort did (it effectively sharded by date and only searched as far back as
it needed to fill a page of results), but that tends to get blank looks
from business. :)

One of life's little challenges.
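
(For the record, the kind of split Erick describes below would look roughly
like this inside the <updateHandler> section of solrconfig.xml -- the times
are purely illustrative, not a recommendation:)

<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit: durability -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>             <!-- soft commit: visibility -->
</autoSoftCommit>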


On 1 July 2013 11:10, Erick Erickson erickerick...@gmail.com wrote:

 Daniel:

 Soft commits invalidate the top level caches, which include
 things like filterCache, queryResultCache etc. Various
 segment-level caches are NOT invalidated, but you really
 don't have a lot of control from the Solr level over those
 anyway.

 But yeah, the tension between caching a bunch of stuff
 for query speedups and NRT is still with us. Soft commits
 are much less expensive than hard commits, but not being
 able to use the caches as much is the price. You're right
 that with such frequent autocommits, autowarming
 probably is not worth the effort.

 The question I always ask is whether 1 second is really
 necessary. Or, more accurately, worth the price. Often
 it's not and lengthening it out significantly may be an option,
 but that's a discussion for you to have with your product
 manager G

 I have seen configurations that have a more frequent hard
 commit (openSearcher=false) than soft commit. The
 mantra is soft commits are about visibility, hard commits
 are about durability.

 FWIW,
 Erick


 On Mon, Jul 1, 2013 at 3:40 AM, Daniel Collins danwcoll...@gmail.com
 wrote:

  We see similar results, again we softCommit every 1s (trying to get as
 NRT
  as we can), and we very rarely get any hits in our caches.  As an
  unscheduled test last week, we did shutdown indexing and noticed about
 80%
  hit rate in caches (and average query time dropped from ~1s to 100ms!)
 so I
  think we are in the same position as you.
 
  I appreciate with such a frequent soft commit that the caches get
  invalidated, but I was expecting cache warming to help though it doesn't
  appear to be.  We *don't* currently run a warming query, my impression of
  NRT was that it was better to not do that as otherwise you spend more
 time
  warming the searcher and caches, and by the time you've done all that,
 the
  searcher is invalidated anyway!
 
 
  On 30 June 2013 01:58, Tim Vaillancourt t...@elementspace.com wrote:
 
   That's a good idea, I'll try that next week.
  
   Thanks!
  
   Tim
  
  
   On 29/06/13 12:39 PM, Erick Erickson wrote:
  
   Tim:
  
   Yeah, this doesn't make much sense to me either since,
   as you say, you should be seeing some metrics upon
   occasion. But do note that the underlying cache only gets
   filled when getting documents to return in query results,
   since there's no autowarming going on it may come and
   go.
  
   But you can test this pretty quickly by lengthening your
   autocommit interval or just not indexing anything
   for a while, then run a bunch of queries and look at your
   cache stats. That'll at least tell you whether it works at all.
   You'll have to have hard commits turned off (or openSearcher
   set to 'false') for that check too.
  
   Best
   Erick
  
  
   On Sat, Jun 29, 2013 at 2:48 PM, Vaillancourt, Tim
 tvaillanco...@ea.com
  *
   *wrote:
  
Yes, we are softCommit'ing every 1000ms, but that should be enough
 time
   to
   see metrics though, right? For example, I still get non-cumulative
   metrics
   from the other caches (which are also throw away). I've also
  curl/sampled
   enough that I probably should have seen a value by now.
  
   If anyone else can reproduce this on 4.3.1 I will feel less crazy :).
  
   Cheers,
  
   Tim
  
   -Original Message-
   From: Erick Erickson [mailto:erickerickson@gmail.**com
  erickerick...@gmail.com
   ]
   Sent: Saturday, June 29, 2013 10:13 AM
   To: solr-user@lucene.apache.org
   Subject: Re: documentCache not used in 4.3.1?
  
   It's especially weird that the hit ratio is so high and you're not
  seeing
   anything in the cache. Are you perhaps soft committing frequently?
 Soft
   commits throw away all the top-level caches including documentCache I
   think
  
   Erick
  
  
   On Fri, Jun 28, 2013 at 7:23 PM, Tim Vaillancourttim@elementspace.
  **comt...@elementspace.com
  
   wrote:
   Thanks Otis,
  
   Yeah I realized after sending my e-mail that doc cache does not
 warm,
   however I'm still lost on why there are no other metrics.
  
   Thanks!
  
   Tim
  
  
   On 28 June 2013 16:22, Otis Gospodneticotis.gospodnetic@**
 gmail.com
  otis.gospodne...@gmail.com
   
   wrote:
  
Hi Tim,
  
   Not sure about the 

Does solr cloud required passwordless ssh?

2013-07-01 Thread adfel70
Hi
Does solr cloud on a cluster of servers require passwordless ssh to be
configured between the servers?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-solr-cloud-required-passwordless-ssh-tp4074398.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: dataconfig to index ZIP Files

2013-07-01 Thread ericrs22
To answer the previous post:

I was not sure about the datasource="binaryFile" setting; I took it from a PDF
sample thinking that would help.

After setting dataSource="null" I'm still getting the same errors...

<dataConfig>
  <dataSource type="BinFileDataSource" user="svcSolr"
              password="SomePassword" />
  <document>
    <entity name="Archive"
            processor="FileListEntityProcessor" baseDir="E:\ArchiveRoot"
            fileName=".zip$" recursive="true" rootEntity="false" dataSource="null"
            onError="skip">

      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>

    </entity>
  </document>
</dataConfig>

the logs report this:

 
INFO  - 2013-07-01 16:45:57.317;
org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
WARN  - 2013-07-01 16:45:57.333;
org.apache.solr.handler.dataimport.SimplePropertiesWriter; Unable to read:
dataimport.properties




--
View this message in context: 
http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074399.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: cores sharing an instance

2013-07-01 Thread Roman Chyla
as for the second option:

If you look inside SolrResourceLoader, you will notice that before a
CoreContainer is created, a new class loader is also created

line:111

this.classLoader = createClassLoader(null, parent);

however, this parent object is always null, because it is called from:

public SolrResourceLoader( String instanceDir )
  {
this( instanceDir, null, null );
  }

but if you were able to replace the second null (parent class loader) with
a classloader of your own choice - ie. one that loads your singleton (but
only that singleton, you don't want to share other objects), your cores
should be able to see/share that object

So, as you can see, if you test it and it works, you may file a JIRA ticket
and help other folks out there (I was too lazy and worked around it in the
past - but that wasn't a good solution). If there is a well-justified reason
to share objects, it seems weird that the core is using 'null' as a parent
class loader.

HTH,

  roman






On Sun, Jun 30, 2013 at 2:18 PM, Peyman Faratin pey...@robustlinks.com wrote:

 I see. If I wanted to try the second option (find a place inside solr
 before the core is created) then where would that place be in the flow of
 app waking up? Currently what I am doing is each core loads its app caches
 via a requesthandler (in solrconfig.xml) that initializes the java class
 that does the loading. For instance:

  <requestHandler name="/cachedResources" class="solr.SearchHandler"
                  startup="lazy">
    <arr name="last-components">
      <str>AppCaches</str>
    </arr>
  </requestHandler>
  <searchComponent name="AppCaches"
                   class="com.name.Project.AppCaches"/>


 So each core has its own so specific cachedResources handler. Where in
 SOLR would I need to place the AppCaches code to make it visible to all
 other cores then?

 thank you Roman

 On Jun 29, 2013, at 10:58 AM, Roman Chyla roman.ch...@gmail.com wrote:

  Cores can be reloaded, they are inside solrcore loader /I forgot the
 exact
  name/, and they will have different classloaders /that's servlet thing/,
 so
  if you want singletons you must load them outside of the core, using a
  parent classloader - in case of jetty, this means writing your own jetty
  initialization or config to force shared class loaders. or find a place
  inside the solr, before the core is created. Google for montysolr to see
  the example of the first approach.
 
  But, unless you really have no other choice, using singletons is IMHO a
 bad
  idea in this case
 
  Roman
 
  On 29 Jun 2013 10:18, Peyman Faratin pey...@robustlinks.com wrote:
 
  its the singleton pattern, where in my case i want an object (which is
  RAM expensive) to be a centralized coordinator of application logic.
 
  thank you
 
  On Jun 29, 2013, at 1:16 AM, Shalin Shekhar Mangar 
 shalinman...@gmail.com
  wrote:
 
  There is very little shared between multiple cores (instanceDir paths,
  logging config maybe?). Why are you trying to do this?
 
  On Sat, Jun 29, 2013 at 1:14 AM, Peyman Faratin 
 pey...@robustlinks.com
  wrote:
  Hi
 
  I have a multicore setup (in 4.3.0). Is it possible for one core to
  share an instance of its class with other cores at run time? i.e.
 
  At run time core 1 makes an instance of object O_i
 
  core 1 -- object O_i
  core 2
  ---
  core n
 
  then can core K access O_i? I know they can share properties but is it
  possible to share objects?
 
  thank you
 
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 




Re: Does solr cloud required passwordless ssh?

2013-07-01 Thread Mark Miller
No, SolrCloud does not currently use ssh.

- Mark

On Jul 1, 2013, at 12:58 PM, adfel70 adfe...@gmail.com wrote:

 Hi
 Does solr cloud on a cluster of servers require passwordless ssh to be
 configured between the servers?
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Does-solr-cloud-required-passwordless-ssh-tp4074398.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Tony Mullins
Thanks Jack , it worked.

Could you please provide some info on how to re-index existing data in
Solr, after changing the schema.xml ?

Thanks,
Tony


On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky j...@basetechnology.com wrote:

 You can write any function query in the field list of the fl parameter.
 Sounds like you want termfreq:

 termfreq(field_arg,term)

 fl=id,a,b,c,termfreq(a,xyz)


 -- Jack Krupansky

 -Original Message- From: Tony Mullins
 Sent: Monday, July 01, 2013 10:47 AM
 To: solr-user@lucene.apache.org
 Subject: How to re-index Solr & get term frequency within documents


 Hi,

 I am using Solr 4.3.0.
 If I change my solr's schema.xml then do I need to re-index my solr ? And
 if yes , how to ?

 My 2nd question is I need to find the frequency of term per document in all
 documents of search result.

 My field is

 field name=CommentX type=text_general stored=true indexed=true
 multiValued=true termVectors=true termPositions=true
 termOffsets=true/

 And I am trying this query

 http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

 Its just returning me the result set, no info on my searched term's
 (iphone) frequency in each document.

 How can I make Solr to return the frequency of searched term per document
 in result set ?

 Thanks,
 Tony.



Re: dataconfig to index ZIP Files

2013-07-01 Thread Noble Paul നോബിള്‍ नोब्ळ्
IIRC Zip files are not supported


On Mon, Jul 1, 2013 at 10:30 PM, ericrs22 ericr...@yahoo.com wrote:

 To answer the previous Post:

 I was not sure what datasource=binaryFile I took it from a PDF sample
 thinking that would help.

 after setting datasource=null I'm still gett the same errors...

  <dataConfig>
    <dataSource type="BinFileDataSource" user="svcSolr"
                password="SomePassword" />
    <document>
      <entity name="Archive"
              processor="FileListEntityProcessor" baseDir="E:\ArchiveRoot"
              fileName=".zip$" recursive="true" rootEntity="false" dataSource="null"
              onError="skip">

        <field column="fileSize" name="size"/>
        <field column="file" name="filename"/>

      </entity>
    </document>
  </dataConfig>

 the logs report this:


 INFO  - 2013-07-01 16:45:57.317;
 org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
 WARN  - 2013-07-01 16:45:57.333;
 org.apache.solr.handler.dataimport.SimplePropertiesWriter; Unable to read:
 dataimport.properties




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074399.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
-
Noble Paul


Re: dataconfig to index ZIP Files

2013-07-01 Thread ericrs22
I'm using the Tika plugin to do so and according to
http://tika.apache.org/0.5/formats.html it does


*ZIP archive (application/zip) Tika uses Java's built-in Zip classes to
parse ZIP files.
Support for ZIP was added in Tika 0.2.*



--
View this message in context: 
http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074421.html
Sent from the Solr - User mailing list archive at Nabble.com.


are fields stored or unstored by default xml

2013-07-01 Thread Katie McCorkell
In schema.xml I know you can label a field as stored=false or
stored=true, but if you say neither, which is it by default?

Thank you
Katie


Re: are fields stored or unstored by default xml

2013-07-01 Thread Otis Gospodnetic
Haven't tried it recently, but is that even legal?  Just be explicit :)

Otis
--
Solr  ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 2:16 PM, Katie McCorkell
katiemccork...@gmail.com wrote:
 In schema.xml I know you can label a field as stored=false or
 stored=true, but if you say neither, which is it by default?

 Thank you
 Katie


Re: How to re-index Solr get term frequency within documents

2013-07-01 Thread Otis Gospodnetic
If all your fields are stored, you can do it with
http://search-lucene.com/?q=solrentityprocessor

Otherwise, just reindex the same way you indexed in the first place.
*Always* be ready to reindex from scratch.
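
A minimal DIH config for that (SolrEntityProcessor pulling everything from
the old core -- the URL, query and rows values here are just illustrative)
would look something like:

<dataConfig>
  <document>
    <entity name="reindex"
            processor="SolrEntityProcessor"
            url="http://oldhost:8983/solr/oldcore"
            query="*:*"
            rows="500"/>
  </document>
</dataConfig>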

Otis
--
Solr  ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 1:29 PM, Tony Mullins tonymullins...@gmail.com wrote:
 Thanks Jack , it worked.

 Could you please provide some info on how to re-index existing data in
 Solr, after changing the schema.xml ?

 Thanks,
 Tony


 On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky j...@basetechnology.com wrote:

 You can write any function query in the field list of the fl parameter.
 Sounds like you want termfreq:

 termfreq(field_arg,term)

 fl=id,a,b,c,termfreq(a,xyz)


 -- Jack Krupansky

 -Original Message- From: Tony Mullins
 Sent: Monday, July 01, 2013 10:47 AM
 To: solr-user@lucene.apache.org
 Subject: How to re-index Solr & get term frequency within documents


 Hi,

 I am using Solr 4.3.0.
 If I change my solr's schema.xml then do I need to re-index my solr ? And
 if yes , how to ?

 My 2nd question is I need to find the frequency of term per document in all
 documents of search result.

 My field is

 field name=CommentX type=text_general stored=true indexed=true
 multiValued=true termVectors=true termPositions=true
 termOffsets=true/

 And I am trying this query

 http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

 Its just returning me the result set, no info on my searched term's
 (iphone) frequency in each document.

 How can I make Solr to return the frequency of searched term per document
 in result set ?

 Thanks,
 Tony.



Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-01 Thread Mike L.
 Hey Ahmet / Solr User Group,
 
   I tried using the built-in UpdateCSV and it runs A LOT faster than a
FileDataSource DIH, as illustrated below. However, I am a bit confused about the
numDocs/maxDoc values when doing an import this way. Here's my GET command
against a tab-delimited file (I removed server info and additional fields;
everything else is the same):

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


My response from solr 

?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint 
name=QTime591/int/lst
/response
 
I am experimenting with 2 CSV files (one with 10 records, the other with 1000) to
see if I can get this to run correctly before loading my entire collection of
data. I initially loaded the 1000 records into an empty core and that
seemed to work; however, when running the above with the CSV file that has 10
records, I would like to see only 10 active records in my core. What I get
instead, when looking at my stats page:

numDocs 1000 
maxDoc 1010

If I run the same URL above while appending '&optimize=true', I get:

numDocs 1000,
maxDoc 1000.

Perhaps the commit=true is not doing what it's supposed to, or am I missing
something? I also tried passing a commit afterward like this:
http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E (didn't
seem to do anything either)
 

From: Ahmet Arslan iori...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Mike L. 
javaone...@yahoo.com 
Sent: Saturday, June 29, 2013 7:20 AM
Subject: Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


Hi Mike,


You could try http://wiki.apache.org/solr/UpdateCSV 

And make sure you commit at the very end.





From: Mike L. javaone...@yahoo.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org 
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSouce (speed) Solr 3.5


 
I've been working on improving index time with a JdbcDataSource DIH based
config and found it not to be as performant as I'd hoped, for various
reasons, not specifically due to Solr. With that said, I decided to switch
gears a bit and test out a FileDataSource setup... I assumed that by eliminating
network latency, I should see drastic improvements in import time... but
I'm a bit surprised that this process seems to run much slower, at least the
way I've initially coded it (below).

The below is a barebone file import that I wrote which consumes a tab-delimited
file. Nothing fancy here. The regex just separates out the fields... Is there a
faster approach to doing this? If so, what is it?

Also, what is the recommended approach to indexing/importing data? I
know that may come across as a vague question as there are various options
available, but which one would be considered the standard approach within a
production enterprise environment?
 
 
(below has been cleansed)
 
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field10,field11,field12"
      />
    </entity>
  </document>
</dataConfig>
 
Thanks in advance,
Mike

Thanks in advance,
Mike

Re: Classic 4.2 master-slave replication not completing

2013-07-01 Thread Neal Ensor
Is it conceivable that there's too much traffic, causing Solr to stall
re-opening the searcher (and thus switching over to the new index)?  I'm
grasping at straws, and this is beginning to bug me a lot.  The traffic logs
wouldn't seem to support this (apart from periodic health-check pings, the
load is distributed fairly evenly across 3 slaves by a load-balancer tool).
After 35+ minutes this morning, none of the three had successfully come
unstuck, and all had to be manually core-reloaded.

Is there perhaps a configuration element I'm overlooking that might make
solr a bit less friendly about it, and just dump the searchers/reopen
when replication completes?

As a side note, I'm getting really frustrated with trying to get log4j
logging on 4.3.1 set up; my tomcat container persists in complaining that
it cannot find log4j.properties, when I've put it in the WEB-INF/classes of
the war file, have SLF4j libraries AND log4j at the shared container lib
level, and log4j.debug turned on.  I can't find any excuses why it cannot
seem to locate the configuration.

Any suggestions or pointers would be greatly appreciated.  Thanks!


On Thu, Jun 27, 2013 at 10:35 AM, Mark Miller markrmil...@gmail.com wrote:

 Odd - looks like it's stuck waiting to be notified that a new searcher is
 ready.

 - Mark

 On Jun 27, 2013, at 8:58 AM, Neal Ensor nen...@gmail.com wrote:

  Okay, I have done this (updated to 4.3.1 across master and four slaves;
 one
  of these is my own PC for experiments, it is not being accessed by
 clients).
 
  Just had a minor replication this morning, and all three slaves are
 stuck
  again.  Replication supposedly started at 8:40, ended 30 seconds later or
  so (on my local PC, set up identically to the other three slaves).  The
  three slaves will NOT complete the roll-over to the new index.  All three
  index folders have a write.lock and latest files are dated 8:40am (now it
  is 8:54am, with no further activity in the index folders).  There exists
 an
  index.2013062708461 (or some variation thereof) in all three
 slaves'
  data folder.
 
  The seemingly-relevant thread dump of a snappuller thread on each of
  these slaves:
 
- sun.misc.Unsafe.park(Native Method)
- java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
-
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
-
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
-
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
- java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
- java.util.concurrent.FutureTask.get(FutureTask.java:83)
-
 
 org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
-
 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
-
 
 org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
- org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
-
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
-
 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
- java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
-
 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
-
 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
-
 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
-
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
-
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
- java.lang.Thread.run(Thread.java:662)
 
 
  Here they sit.  My local PC slave replicated very quickly, switched
 over
  to the new generation (206) immediately.  I am not sure why the three
  slaves are dragging on this.  If there's any configuration elements or
  other details you need, please let me know.  I can manually kick them
 by
  reloading the core from the admin pages, but obviously I would like this
 to
  be a hands-off process.  Any help is greatly appreciated; this has been
  bugging me for some time now.
 
 
 
  On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
  A bunch of replication related issues were fixed in 4.2.1 so you're
  better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
 
  On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor nen...@gmail.com wrote:
  As a bit of background, we run a setup (coming from 3.6.1 to 4.2
  relatively
  recently) with a single master receiving updates with three slaves
  pulling
  changes in.  Our index is around 5 million documents, around 26GB in
 size
  total.
 
  

Perf. difference when the solr core is 'current' or not 'current'

2013-07-01 Thread jchen2000
in Solr's admin statistics page, there is a 'current' flag indicating whether
the core index reader is 'current' or not. According to some discussions in
this mailing list a few months back, it wouldn't affect anything. But my
observation is completely different. When the current flag was not checked
for some of the cores ( I have defined 15 cores in total), my median search
latency over 48M records was over 190ms, but if every current flag was
checked, the median dropped to only 87 ms. 

Another observation is that restarting the Solr instance may not necessarily
make the 'current' flags checked; I have to reload cores even after starting Solr.

Could anybody explain the difference? I am using Datastax Enterprise 3.0.2

Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Perf-difference-when-the-solr-core-is-current-or-not-current-tp4074438.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

2013-07-01 Thread Shawn Heisey

On 7/1/2013 12:56 PM, Mike L. wrote:

  Hey Ahmet / Solr User Group,

I tried using the built in UpdateCSV and it runs A LOT faster than a 
FileDataSource DIH as illustrated below. However, I am a bit confused about the 
numDocs/maxDoc values when doing an import this way. Here's my Get command 
against a Tab delimted file: (I removed server info and additional fields.. 
everything else is the same)

http://server:port/appname/solrcore/update/csv?commit=trueheader=falseseparator=%09escape=\stream.file=/location/of/file/on/server/file.csvfieldnames=id,otherfields


My response from solr

?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint 
name=QTime591/int/lst
/response

I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to 
see If I can get this to run correctly before running my entire collection of 
data. I initially loaded the first 1000 records to an empty core and that 
seemed to work, however, but when running the above with a csv file that has 10 
records, I would like to see only 10 active records in my core. What I get 
instead, when looking at my stats page:

numDocs 1000
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000,
maxDoc 1000.


A discrepancy between numDocs and maxDoc indicates that there are 
deleted documents in your index.  You might already know this, so here's 
an answer to what I think might be your actual question:


If you want to delete the 1000 existing documents before adding the 10 
documents, then you have to actually do that deletion.  The CSV update 
handler works at a lower level than the DataImport handler, and doesn't 
have the clean or full-import options (full-import defaults to clean=true). 
The DIH is like a full application embedded inside Solr, one that uses 
an update handler -- it is not itself an update handler.  When 
clean=true, or when using full-import without a clean option, DIH itself sends 
a delete-all-documents update request.
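
If you do want the CSV load to start from an empty core, you can send that 
delete yourself before the import, for example (illustrative URL):

curl "http://server:port/appname/solrcore/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary "<delete><query>*:*</query></delete>"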


If you didn't already know the bit about the deleted documents, then 
read this:


It can be normal for indexing new documents to cause deleted 
documents.  This happens when you have the same value in your UniqueKey 
field as documents that are already in your index.  Solr knows by the 
config you gave it that they are the same document, so it deletes the 
old one before adding the new one.  Solr has no way to know whether the 
document it already had or the document you are adding is more current, 
so it assumes you know what you are doing and takes care of the deletion 
for you.


When you optimize your index, deleted documents are purged, which is why 
the numbers match there.


Thanks,
Shawn



Re: are fields stored or unstored by default xml

2013-07-01 Thread Jack Krupansky

stored and indexed both default to true.

This is legal:

   <field name="alpha" type="string" />

This detail will be in Early Access Release #2 of my book on Friday.

-- Jack Krupansky

-Original Message- 
From: Otis Gospodnetic 
Sent: Monday, July 01, 2013 2:21 PM 
To: solr-user@lucene.apache.org 
Subject: Re: are fields stored or unstored by default xml 


Haven't tried it recently, but is that even legal?  Just be explicit :)

Otis
--
Solr  ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 2:16 PM, Katie McCorkell
katiemccork...@gmail.com wrote:

In schema.xml I know you can label a field as stored=false or
stored=true, but if you say neither, which is it by default?

Thank you
Katie


Re: are fields stored or unstored by default xml

2013-07-01 Thread Yonik Seeley
On Mon, Jul 1, 2013 at 3:50 PM, Jack Krupansky j...@basetechnology.com wrote:
 stored and indexed both default to true.

 This is legal:

    <field name="alpha" type="string" />

Actually, for fields I believe the defaults come from the fieldType.
The fieldType defaults to true for both indexed and stored if they are
not specified there.

-Yonik
http://lucidworks.com


Re: Classic 4.2 master-slave replication not completing

2013-07-01 Thread Shawn Heisey

On 7/1/2013 1:07 PM, Neal Ensor wrote:

is it conceivable that there's too much traffic, causing Solr to stall
re-opening the searcher (thus releasing to the new index)?  I'm grasping at
straws, and this is beginning to bug me a lot.  The traffic logs wouldn't
seem to support this (apart from periodic health-check pings, the load is
distributed fairly evenly across 3 slaves by a load-balancer tool).  After
35+ minutes this morning, none of the three successfully unstuck, and had
to be manually core-reloaded.

Is there perhaps a configuration element I'm overlooking that might make
solr a bit less friendly about it, and just dump the searchers/reopen
when replication completes?


Can you share your solrconfig.xml file, someplace like 
http://apaste.info?  Please be sure to choose the correct file type ... 
on that website it is (X)HTML for an XML file.



As a side note, I'm getting really frustrated with trying to get log4j
logging on 4.3.1 set up; my tomcat container persists in complaining that
it cannot find log4j.properties, when I've put it in the WEB-INF/classes of
the war file, have SLF4j libraries AND log4j at the shared container lib
level, and log4j.debug turned on.  I can't find any excuses why it cannot
seem to locate the configuration.


The wiki is still down for maintenance, so below is a relevant section 
of the SolrLogging wiki page extracted from Google Cache.  When it comes 
back up, you can find it at this URL:


http://wiki.apache.org/solr/SolrLogging#Switching_from_Log4J_back_to_JUL_.28java.util.logging.29

=
The example logging setup takes over the configuration of Solr logging, 
which prevents the container from controlling where logs go. Users of 
containers other than the included Jetty (Tomcat in particular) may be 
accustomed to doing the logging configuration in the container. If you 
want to switch back to java.util.logging so this is once again possible, 
here's what to do. These steps apply to the example/lib/ext directory in 
the Solr example, or to your container's lib directory as mentioned in 
the previous section. These steps also assume that the slf4j version is 
1.6.6, which comes with Solr4.3. Newer versions may use a different 
slf4j version. As of May 2013, you can use a newer SLF4J version with no 
trouble, but be aware that all slf4j components in your classpath must 
be the same version.


Download slf4j version 1.6.6 (the version used in Solr4.3.x). 
http://www.slf4j.org/dist/slf4j-1.6.6.zip

Unpack the slf4j archive.
Delete these JARs from your lib folder: slf4j-log4j12-1.6.6.jar, 
jul-to-slf4j-1.6.6.jar, log4j-1.2.16.jar
Add these JARs to your lib folder (from slf4j zip): 
slf4j-jdk14-1.6.6.jar, log4j-over-slf4j-1.6.6.jar

Use your old logging.properties
=

Thanks,
Shawn



Re: are fields stored or unstored by default xml

2013-07-01 Thread Jack Krupansky
Correct - the field definitions inherit the attributes of the field type, 
and it is the field type that has the actual default values for indexed and 
stored (and other attributes.)
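
For example (purely illustrative names), a type that switches stored off by 
default, and a field that inherits that:

   <fieldType name="string_unstored" class="solr.StrField" indexed="true" stored="false"/>
   <field name="beta" type="string_unstored"/>  <!-- inherits stored="false" -->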


-- Jack Krupansky

-Original Message- 
From: Yonik Seeley

Sent: Monday, July 01, 2013 3:56 PM
To: solr-user@lucene.apache.org
Subject: Re: are fields stored or unstored by default xml

On Mon, Jul 1, 2013 at 3:50 PM, Jack Krupansky j...@basetechnology.com 
wrote:

stored and indexed both default to true.

This is legal:

   <field name="alpha" type="string" />


Actually, for fields I believe the defaults come from the fieldType.
The fieldType defaults to true for both indexed and stored if they are
not specified there.

-Yonik
http://lucidworks.com 



Re: How to re-index Solr & get term frequency within documents

2013-07-01 Thread Jack Krupansky
Or, go with a commercial product that has a single-click Solr re-index 
capability, such as:


1. DataStax Enterprise - data is stored in Cassandra and reindexed into Solr 
from there.


2. LucidWorks Search - data sources are declared so that the package can 
automatically re-crawl the data sources.


But, yeah, as Otis says, re-index is really just a euphemism for deleting 
your Solr data directory and indexing from scratch from the original data 
sources.


-- Jack Krupansky

-Original Message- 
From: Otis Gospodnetic

Sent: Monday, July 01, 2013 2:26 PM
To: solr-user@lucene.apache.org
Subject: Re: How to re-index Solr & get term frequency within documents

If all your fields are stored, you can do it with
http://search-lucene.com/?q=solrentityprocessor

Otherwise, just reindex the same way you indexed in the first place.
*Always* be ready to reindex from scratch.

Otis
--
Solr  ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Mon, Jul 1, 2013 at 1:29 PM, Tony Mullins tonymullins...@gmail.com 
wrote:

Thanks Jack , it worked.

Could you please provide some info on how to re-index existing data in
Solr, after changing the schema.xml ?

Thanks,
Tony


On Mon, Jul 1, 2013 at 8:21 PM, Jack Krupansky 
j...@basetechnology.com wrote:



You can write any function query in the field list of the fl parameter.
Sounds like you want termfreq:

termfreq(field_arg,term)

fl=id,a,b,c,termfreq(a,xyz)


-- Jack Krupansky

-Original Message- From: Tony Mullins
Sent: Monday, July 01, 2013 10:47 AM
To: solr-user@lucene.apache.org
Subject: How to re-index Solr & get term frequency within documents


Hi,

I am using Solr 4.3.0.
If I change my solr's schema.xml then do I need to re-index my solr ? And
if yes , how to ?

My 2nd question is I need to find the frequency of term per document in 
all

documents of search result.

My field is

field name=CommentX type=text_general stored=true indexed=true
multiValued=true termVectors=true termPositions=true
termOffsets=true/

And I am trying this query

http://localhost:8080/solr/select/?q=iphone&fl=AuthorX%2CTitleX%2CCommentX&df=CommentX&wt=xml&indent=true&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true

It's just returning me the result set, with no info on my searched term's
('iphone') frequency in each document.

How can I make Solr return the frequency of the searched term per document
in the result set?

Thanks,
Tony.





Using per-segment FieldCache or DocValues in custom component?

2013-07-01 Thread Michael Ryan
I have some custom code that uses the top-level FieldCache (e.g., 
FieldCache.DEFAULT.getLongs(reader, "foobar", false)). I'd like to redesign 
this to use the per-segment FieldCaches so that re-opening a Searcher is 
fast(er). In most cases, I've got a docId and I want to get the value for a 
particular single-valued field for that doc.

Is there a good place to look to see example code of per-segment FieldCache 
use? I've been looking at PerSegmentSingleValuedFaceting, but hoping there 
might be something less confusing :)

Also thinking DocValues might be a better way to go for me... is there any 
documentation or example code for that?

-Michael


Re: Improving performance to return 2000+ documents

2013-07-01 Thread Utkarsh Sengar
Thanks Erick/Jagdish.

Just to give some background on my queries.

1. All my queries are unique. A query can be "ipod" or "ipod 8gb" (but
these are unique). These are about 1.2M in total.
So, I assume setting a high queryResultCache, queryResultWindowSize and
queryResultMaxDocsCached won't help.

2. I have these cache settings:
<documentCache class="solr.LRUCache"
   size="1"
   initialSize="1"
   autowarmCount="0"
   cleanupThread="true"/>
//My understanding is that documentCache will help me the most, because Solr
will cache the documents retrieved.
//Stats for documentCache: http://apaste.info/hknh

<queryResultCache class="solr.LRUCache"
 size="512"
 initialSize="512"
 autowarmCount="0"
 cleanupThread="true"/>
//Default, since my queries are unique.

<filterCache class="solr.FastLRUCache"
 size="512"
 initialSize="512"
 autowarmCount="0"/>
//Not sure how I can use filterCache, so I am keeping it as the default.

<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>100</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>


I think the question can also be framed as: how can I optimize Solr
response time for a 50M product catalog, for unique queries which each
retrieve 2000 documents in one go?
I looked at writing a Solr search component, but a proxy around Solr seemed
easier, so I went ahead with that approach.


Thanks,
-Utkarsh




On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula jagd...@simplyhired.com wrote:

  solrconfig.xml has entries which you can tweak for your use case. One of
  them is queryResultWindowSize. You can try using a value of 2000 and see if
  it helps improve performance. Please make sure you have enough memory
  allocated for the queryResultCache.
  A combination of sharding and distribution of workload (requesting
  2000/number-of-shards from each shard) with an aggregator would be a good
  way to maximize performance.
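  A sketch of the relevant solrconfig.xml entries (the 2000 comes from this
  thread; the cache size is only a placeholder to tune against available heap):

  <queryResultWindowSize>2000</queryResultWindowSize>
  <queryResultMaxDocsCached>2000</queryResultMaxDocsCached>
  <queryResultCache class="solr.LRUCache"
                    size="512"
                    initialSize="512"
                    autowarmCount="0"/>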

 Thanks,

 Jagdish


 On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  50M documents, depending on a bunch of things,
  may not be unreasonable for a single node, only
  testing will tell.
 
  But the question I have is whether you should be
  using standard Solr queries for this or building a custom
  component that goes at the base Lucene index
  and does the right thing. Or even re-indexing your
  entire corpus periodically to add this kind of data.
 
  FWIW,
  Erick
 
 
  On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Thanks Erick/Peter.
  
   This is an offline process, used by a relevancy engine implemented around
   Solr. The engine computes boost scores for related keywords based on
   clickstream data.
   i.e.: say clickstream has: ipad=upc1,upc2,upc3
   I query Solr with keyword "ipad" (to get 2000 documents) and then make 3
   individual queries for upc1, upc2, upc3 (which are fast).
   The data is then used to compute related keywords to "ipad" with their
   boost values.

   So, I cannot really replace that, since I need full text search over my
   dataset to retrieve the top 2000 documents.

   I tried paging: I retrieve 500 Solr documents 4 times (0-500, 500-1000...),
   but don't see any improvements.
  
  
   Some questions:
   1. Maybe the JVM size might help?
   This is what I see in the dashboard:
   Physical Memory 76.2%
   Swap Space NaN% (don't have any swap space, running on AWS EBS)
   File Descriptor Count 4.7%
   JVM-Memory 73.8%
  
   Screenshot: http://i.imgur.com/aegKzP6.png
  
    2. Will reducing the shards from 3 to 1 improve performance? (maybe
    increase the RAM from 30 to 60GB) The problem I will face in that case
    will be fitting 50M documents on 1 machine.
  
   Thanks,
   -Utkarsh
  
  
   On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com
   wrote:
  
 Hello Utkarsh,
 This may or may not be relevant for your use-case, but the way we deal with
 this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time
 (user selectable). We can then page the results, changing the start
 parameter to return the next set. This allows us to 'retrieve' millions of
 documents - we just do it at the user's leisure, rather than make them wait
 for the whole lot in one go.
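 As a sketch, the paged requests would differ only in the start parameter
 (the query and page size here are illustrative):

 q=ipad&rows=500&start=0
 q=ipad&rows=500&start=500
 q=ipad&rows=500&start=1000
 q=ipad&rows=500&start=1500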
 This works well because users very rarely want to see ALL 2000 (or whatever
 number) documents at one time - it's simply too much to take in at one time.
 If your use-case involves an automated or offline procedure (e.g. running a
 report or some data-mining op), then presumably it doesn't matter so much if
 it takes a bit longer (as long as it returns in some reasonable time).
 Have you looked at doing paging on the client-side - this will hugely
 speed-up your search time.
 HTH
 Peter
   
   
   
On Sat, Jun 29, 2013 at 6:17 PM, Erick 

Disable Document Id from being printed in the logs...

2013-07-01 Thread Niran Fajemisin
Hi all,

I noticed that for Solr 4.2, when an internal call is made between two nodes 
Solr uses the list of matching document ids to fetch the document details. At 
this time, it prints out all matching document ids as a part of the query. Is 
there a way to suppress these log statements from being created?

Thanks.
Niran

Re: full-import failed after 5 hours with Exception: ORA-01555: snapshot too old: rollback segment number with name too small ORA-22924: snapshot too old

2013-07-01 Thread Michael Della Bitta
I would say definitely investigate the performance of the query, but also
since you're using CachedSqlEntityProcessor, you might want to back off on
the transaction isolation to READ_COMMITTED, which I think is the lowest
one that Oracle supports:

http://wiki.apache.org/solr/DataImportHandler#Configuring_JdbcDataSource
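A sketch of what that dataSource entry might look like with the isolation
level backed off (driver, URL and credentials are placeholders; batchSize is
from the thread):

<dataSource name="par8_prod"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@//dbhost:1521/SERVICE"
            user="USER" password="PASSWORD"
            batchSize="500"
            transactionIsolation="TRANSACTION_READ_COMMITTED"/>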

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
w: appinions.com http://www.appinions.com/


On Fri, Jun 28, 2013 at 2:52 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 I'd go talk to the DBA.  How long does this query take if you run it
 directly against Oracle?  How long if you run it locally vs. from a
 remote server (like Solr is in relation to your Oracle server(s)).
 What happens if you increase batchSize?

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



 On Thu, Jun 27, 2013 at 6:41 PM, srinalluri nallurisr...@yahoo.com
 wrote:
  Hello,
 
  I am using Solr 4.3.2 and Oracle DB. The sub-entity is using
  CachedSqlEntityProcessor. The dataSource has batchSize=500. The
  full-import failed with 'ORA-01555: snapshot too old: rollback segment
  number  with name  too small ORA-22924: snapshot too old' Exception
  after 5 hours.
 
  We already increased the undo space 4 times at the database end. The number
  of records in the jan_story table is only 800,000. Tomcat is running with
  4GB of JVM memory.
 
  Following is the entity (there are other sub-entities that I didn't mention
  here, since the import failed in the article_details entity; article_details
  is the first sub-entity).
 
  <entity name="par8-article-testingprod" dataSource="par8_prod" pk="VCMID"
  preImportDeleteQuery="content_type:article AND
  repository:par8qatestingprod"
  query="select ID as VCMID from jan_story">
  <entity name="article_details" dataSource="par8_prod"
  transformer="TemplateTransformer,ClobTransformer,RegexTransformer"
    query="select bb.recordid, aa.ID as DID, aa.STORY_TITLE,
  aa.STORY_HEADLINE, aa.SOURCE, aa.DECK, regexp_replace(aa.body,
  '\<p\>\[(pullquote|summary)\]\</p\>|\[video [0-9]+?\]|\[youtube
  .+?\]', '') as BODY, aa.PUBLISHED_DATE, aa.MODIFIED_DATE, aa.DATELINE,
  aa.REPORTER_NAME, aa.TICKER_CODES, aa.ADVERTORIAL_CONTENT from jan_story
  aa, mapp bb where aa.id=bb.keystring1" cacheKey="DID"
  cacheLookup="par8-article-testingprod.VCMID"
  processor="CachedSqlEntityProcessor">
  <field column="content_type" template="article" />
  <field column="RECORDID" name="native_id" />
  <field column="repository" template="par8qatestingprod" />
  <field column="STORY_TITLE" name="title" />
  <field column="DECK" name="description" clob="true" />
  <field column="PUBLISHED_DATE" name="date" />
  <field column="MODIFIED_DATE" name="last_modified_date" />
  <field column="BODY" name="body" clob="true" />
  <field column="SOURCE" name="source" />
  <field column="DATELINE" name="dateline" />
  <field column="STORY_HEADLINE" name="export_headline" />
    </entity>
    </entity>
 
 
   The full-import without CachedSqlEntityProcessor is taking 7 days. That
   is why I am doing all this.
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/full-import-failed-after-5-hours-with-Exception-ORA-01555-snapshot-too-old-rollback-segment-number-wd-tp4073822.html
  Sent from the Solr - User mailing list archive at Nabble.com.



Re: Disable Document Id from being printed in the logs...

2013-07-01 Thread Shawn Heisey

On 7/1/2013 3:24 PM, Niran Fajemisin wrote:

I noticed that for Solr 4.2, when an internal call is made between two nodes 
Solr uses the list of matching document ids to fetch the document details. At 
this time, it prints out all matching document ids as a part of the query. Is 
there a way to suppress these log statements from being created?


There's no way for Solr to distinguish between requests made by another 
Solr core and requests made by real clients.  Paying attention to the 
IP address where the request originated won't work either - a lot of 
Solr installations run on the same hardware as the web server or other 
application that *uses* Solr.


Debugging a problem becomes very difficult if you come up with *ANY* way 
to stop logging these requests.  That said, on newer versions the 
parameter 'distrib=false' should be included on those requests that you 
don't want to log, so an option to turn off logging of non-distributed 
requests might be a reasonable idea.  I think you'll run into some 
resistance, but as long as it doesn't default to enabled, it might be 
something that could be added.


If you are worried about performance, update the logging configuration 
so that Solr only logs at WARN, that way no requests will be logged.  If 
you then need to debug, you can change the logging to INFO using the 
admin UI, get your debugging done, and then turn it back down to WARN. 
This is the best logging approach from a performance perspective.
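For example, assuming the stock log4j.properties that ships with the Solr
4.3 example, the change is just the root logger level:

# log only WARN and above; switch back to INFO when debugging
log4j.rootLogger=WARN, file, CONSOLE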


Thanks,
Shawn



Re: dataconfig to index ZIP Files

2013-07-01 Thread ericrs22
not sure if this will help any.

Here's the verbose log 

INFO  - 2013-07-01 23:17:08.632;
org.apache.solr.handler.dataimport.DataImporter; Loading DIH Configuration:
tika-data-config.xml
INFO  - 2013-07-01 23:17:08.648;
org.apache.solr.handler.dataimport.DataImporter; Data Configuration loaded
successfully
INFO  - 2013-07-01 23:17:08.663; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=Archive&command=full-import&debug=false&wt=json}
status=0 QTime=31 
INFO  - 2013-07-01 23:17:08.663;
org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
INFO  - 2013-07-01 23:17:08.679; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720628679&wt=json} status=0 QTime=0
INFO  - 2013-07-01 23:17:08.679;
org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read
dataimport.properties
INFO  - 2013-07-01 23:17:09.552; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720629552&wt=json} status=0 QTime=0
INFO  - 2013-07-01 23:17:11.580; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720631577&wt=json} status=0 QTime=0
INFO  - 2013-07-01 23:17:13.593; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720633593&wt=json} status=0 QTime=0
INFO  - 2013-07-01 23:17:15.247;
org.apache.solr.handler.dataimport.DocBuilder; Time taken = 0:0:6.553
INFO  - 2013-07-01 23:17:15.247;
org.apache.solr.update.processor.LogUpdateProcessor; [tika] webapp=/solr
path=/dataimport
params={optimize=false&clean=false&indent=true&commit=false&verbose=true&entity=Archive&command=full-import&debug=false&wt=json}
status=0 QTime=31 {} 0 31
INFO  - 2013-07-01 23:17:15.621; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720635621&wt=json} status=0 QTime=0
INFO  - 2013-07-01 23:17:17.259; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720637256&wt=json} status=0 QTime=0
INFO  - 2013-07-01 23:17:17.649; org.apache.solr.core.SolrCore; [tika]
webapp=/solr path=/dataimport
params={indent=true&command=status&_=1372720637645&wt=json} status=0 QTime=0




--
View this message in context: 
http://lucene.472066.n3.nabble.com/dataconfig-to-index-ZIP-Files-tp4073965p4074498.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Converting nested data model to solr schema

2013-07-01 Thread Mikhail Khludnev
On Mon, Jul 1, 2013 at 5:56 PM, adfel70 adfe...@gmail.com wrote:

 This requires me to override the solr document distribution mechanism.
  I fear that with this solution I may lose some of Solr Cloud's
 capabilities.


It's not clear whether you are aware of
http://searchhub.org/2013/06/13/solr-cloud-document-routing/, but what you
did doesn't sound scary to me. If it works, it should be fine. I'm not
aware of any capabilities that you are going to lose.
Obviously SOLR-3076 provides astonishing query-time performance, by
offloading the actual join work to index time. Check it if your current
approach turns slow.
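A minimal sketch of the compositeId convention described in that post (the
id values here are made up): documents whose ids share the prefix before the
'!' are routed to the same shard, so a parent and its children can be
co-located without custom distribution code:

<doc><field name="id">parent42!child7</field> ... </doc>
<doc><field name="id">parent42!child8</field> ... </doc>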


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Schema design for parent child field

2013-07-01 Thread Mikhail Khludnev
From my experience, deeply nested scopes are almost only for SOLR-3076.


On Sat, Jun 29, 2013 at 1:08 PM, Sperrink
kevin.sperr...@lexisnexis.co.za wrote:

 Good day,
 I'm seeking some guidance on how best to represent the following data
 within
 a solr schema.
 I have a list of subjects which are detailed to n levels.
 Each document can contain many of these subject entities.
  As I see it, if this had been just one subject per document, dynamic fields
  would have been a good solution.
  Any suggestions on how best to create this structure in a denormalised
  fashion while maintaining the data integrity?
 For example a document could have:
 Subject level 1: contract
 Subject level 2: claims
 Subject level 1: patent
 Subject level 2: counter claims

  If I were to search for level 1 'contract', I would only want the facet count
  for level 2 to contain 'claims' and not 'counter claims'.
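  One common way to keep the levels tied together (a sketch, not from this
  thread) is to index the full path of each subject into a single multiValued
  string field and facet with a prefix, e.g. values like:

  subject_path: 1/contract
  subject_path: 2/contract/claims
  subject_path: 1/patent
  subject_path: 2/patent/counter claims

  and then facet.field=subject_path&facet.prefix=2/contract/ would only count
  'claims' under 'contract'. The field name and encoding are illustrative.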

 Any assistance in this would be much appreciated.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Schema-design-for-parent-child-field-tp4074084.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com