Fwd: facet.pivot and facet.sort does not work with fq
Hello again! The missing pivot facet when sorting by index can also be reproduced in Solr 4.3.1. Does anyone have an idea how to debug this? Best regards, Johannes

-- Forwarded message --
From: jotpe jotpe@gmail.com
Date: 2013/6/25
Subject: facet.pivot and facet.sort does not work with fq
To: solr-user@lucene.apache.org

Hello, I'm trying to display a hierarchical structure with a facet.pivot, which should be sorted by index. I followed the idea from http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets and created path_levelX fields from 0 to 7. My tokens are not unique per level and I need to sort them like in the original structure, so I added a prefix consisting of a sort-order number of static length and a unique id (always 8 digits). Later this prefix will be hidden by using substring.

Format: SORTORDER/UNIQUE_ID/NAME_TO_DISPLAY

Example:
path_level0:000/123/Chief
path_level0:000/123/Chief path_level1:000/124/Staff
path_level0:000/123/Chief path_level1:000/124/Staff path_level2:00/125/Chief
path_level0:001/126/Legal Adviser

Displaying the pivot works fine.

Sorted by count, OK:
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=count

Sorted by index, OK:
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=index

Now I must reduce my global structure to one office by using the fq parameter.

Reduced to one office, sorted by count, OK:
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=count&fq=office:xyz

Reduced to one office, sorted by index: failure.
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=index&fq=office:xyz

The facet.pivot element stays empty, so what is wrong?

<lst name="facet_pivot"><arr name="path_level1,path_level2,path_level3"/></lst>

Maybe this is a bug... On the other hand, maybe this is a bad way to obtain a hierarchical structure with a custom sort. Better ideas? Best regards, Johannes
Re: Shard identification
When you say you moved to different machines, did you copy the zoo_data from your old setup, or did you just start up ZooKeeper and your shards one by one? Also, did you use the Collection API to create the collection, or just start up your cores and let them attach to ZK?

I believe the ZK rules for assigning shards have changed somewhere around 4.2. We had a setup with 4.0 and it simply assigned them in order: shard 1, shard 2, shard 3, etc., and then when all shards were filled, it started with replicas. In 4.3 (we skipped the intermediates) the ordering wasn't obvious; I had to do a bit of trial and error to determine the right order to start things in to get shard assignments correct, but that isn't really the recommended way of doing it.

If you want specific assignments (cores to shards), then I think the Core API/Collection API are the recommended way to go. Create a collection using the Collection API (http://wiki.apache.org/solr/SolrCloud) and then copy the data to the right servers once it has assigned the shards (it should make sure that replicas don't exist on the same machine, and things like that).

I believe the general direction (for the next major Solr release) is to start a system with a blank solr.xml and create cores/collections that way, rather than have a file and then have to connect to ZK and merge the data with what's there. We have a slightly odd requirement in that we need to determine the dataDir for each core, and I haven't yet worked out the right sequence of commands (the Collection API doesn't support dataDir but the Core API does). It should be possible though, I just haven't found the time to get to it!

On 25 June 2013 18:40, Erick Erickson erickerick...@gmail.com wrote:

Try sending requests to your shards with distrib=false. See if the results agree with the SolrCloud graph, or whether the docs you get back are inconsistent with the shard labels in the admin page. The distrib=false bit keeps the query from going to other shards and will tell you whether the current state is consistent or not. Best, Erick

On Tue, Jun 25, 2013 at 1:02 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

Firstly, using 1 ZooKeeper machine is not at all ideal. See http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7 I've never personally seen such an issue. Can you give screenshots of the cloud graph on each node? Use an image hosting service because the mailing list won't allow attachments.

On Tue, Jun 18, 2013 at 2:07 PM, Ophir Michaeli micha...@wesee.com wrote:

Hi, I built a 2-shard, 2-replica system that works OK on a local machine, with 1 ZooKeeper on shard 1. It appears OK on the Solr monitoring page, cloud tab (http://localhost:8983/solr/#/~cloud). When I move to using different machines, each shard/replica on a different machine, I get a wrong cloud graph on the Solr monitoring page. The machine that has shard 2 appears on the graph on shard 1, and the replicas are also mixed: shard 2 appears as 1 and shard 1 appears as 2. Any ideas why this happens? Thanks, Ophir

-- Regards, Shalin Shekhar Mangar.
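For reference, the non-distributed check Erick describes and the Collection API call Daniel mentions look roughly like the following; the host name, collection name and shard/replica counts are placeholders, not values from this thread:

http://host1:8983/solr/collection1/select?q=*:*&fl=id&rows=10&distrib=false
http://host1:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=2

The first form queries only the core it is sent to, so the returned ids can be compared against the cloud graph; the second lets the Collection API pick the shard/replica layout instead of relying on core start-up order.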
OOM fieldCache problem
Hi all, I have some memory problems (OOM) with Solr 3.5.0, and I suspect it has something to do with the fieldCache. The entry count of the fieldCache grows and grows; why is it not rebuilt after a commit? I commit every 60 seconds, but the memory consumption of Solr increased within one day from 2GB to 10GB (index size: ~200MB). I tried to solve the problem by reducing the other cache sizes (filterCache, documentCache, queryResultCache). That delayed the OOM exception, but it did not solve the problem that the memory consumption increases continuously. Is it possible to reset the fieldCache explicitly? Markus
Re: URL search and indexing
Ok, thank you all for the great help! Now I'm ready to start playing with my index! Best, Flavio

On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky j...@basetechnology.com wrote:

Yeah, URL Classify only does so much. That's why you need to combine multiple methods. As a fourth method, you could code up a short JavaScript StatelessScriptUpdateProcessor that did something like take a full domain name (such as output by URL Classify) and turn it into multiple values, each with more of the prefix removed, so that lucene.apache.org would index as:

lucene.apache.org
apache.org
apache
.org
org

And then the user could query by any of those partial domain names. But if you simply tokenize the URL (copy the URL string to a text field), you automatically get most of that. The user can query by a URL fragment, such as apache.org, .org, lucene.apache.org, etc., and the tokenization will strip out the punctuation. I'll add this script to my list of examples to add in the next rev of my book. -- Jack Krupansky

-Original Message- From: Flavio Pompermaier Sent: Tuesday, June 25, 2013 10:06 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing

I bought the book, and looking at the example I still don't understand if it is possible to query all sub-URLs of my URL. For example, if the URLClassifyProcessorFactory takes as input url_s: http://lucene.apache.org/solr/4_0_0/changes/Changes.html and produces outputs like
- url_domain_s: lucene.apache.org
- url_canonical_s: http://lucene.apache.org/solr/4_0_0/changes/Changes.html

How should I configure url_domain_s in order to be able to make queries like '*.apache.org'? How should I configure url_canonical_s in order to be able to make queries like 'http://lucene.apache.org/solr/*'? Is it better to have two different fields for the two queries, or could I create just one field for both kinds of queries (obviously for the former case I should then query something like *://.apache.org/*)?

On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky j...@basetechnology.com wrote:

There are examples in my book: http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html But... I still think you should use a tokenized text field as well - use all three: raw string, tokenized text, and URL classification fields. -- Jack Krupansky

-Original Message- From: Flavio Pompermaier Sent: Tuesday, June 25, 2013 9:02 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing

That sounds exactly like what I'm looking for! However, I cannot find an example of how to use it.. could you help me please? Moreover, about the id field, isn't it true that the id field shouldn't be analyzed, as suggested in http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document ?

On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl jan@cominvent.com wrote: Sure you can query the url directly.
Or if you choose, you can split it up into multiple components, e.g. using http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html

-- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

On 25 June 2013 at 14:10, Flavio Pompermaier pomperma...@okkam.it wrote:

Sorry, but maybe I'm missing something here.. could I declare url as the key field and query it too..? At the moment, my schema.xml looks like:

<fields>
  <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="category" type="string" indexed="true" stored="true"/>
  <field name="language" type="string" indexed="true" stored="true"/>
  ...
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>url</uniqueKey>

Is that ok? Or should I add a baseurl field of some kind to be able to query all urls coming from a certain domain (1st or 2nd level as well)? Best, Flavio

On Tue, Jun 25, 2013 at 12:28 PM,
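A stripped-down sketch of the kind of StatelessScriptUpdateProcessor script Jack describes above, producing just the dotted suffixes of the domain; the source field (url_domain_s) and target field (url_domain_suffixes_ss, which would need to be declared as a multiValued string field) are placeholder assumptions, not names from this thread. The chain is wired into solrconfig.xml and selected with the update.chain request parameter:

<updateRequestProcessorChain name="domain-suffixes">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">domain-suffixes.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

domain-suffixes.js:

function processAdd(cmd) {
  var doc = cmd.solrDoc;                          // the SolrInputDocument being indexed
  var domain = doc.getFieldValue("url_domain_s");
  if (domain != null) {
    var d = "" + domain;                          // e.g. "lucene.apache.org"
    doc.addField("url_domain_suffixes_ss", d);
    var dot = d.indexOf(".");
    while (dot >= 0) {
      d = d.substring(dot + 1);                   // "apache.org", then "org"
      doc.addField("url_domain_suffixes_ss", d);
      dot = d.indexOf(".");
    }
  }
}
function processDelete(cmd) { }
function processCommit(cmd) { }
function processRollback(cmd) { }
function processMergeIndexes(cmd) { }
function finish() { }

A query for url_domain_suffixes_ss:apache.org would then match any document whose domain ends in apache.org.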
How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?
We have a requirement to grab the first N words in a particular field and weight them differently for scoring purposes. So I thought to use a copyField and have some extra filter on the destination to truncate it down (post tokenization). A quick search found both a LimitTokenCountAnalyzer and a LimitTokenCountFilter mentioned. If I read the wiki right, the Filter is the correct approach for Solr, since we have the schema-able analyzer chain, so we don't need to code anything, right? The Analyzer version would be more useful if we were explicitly coding up a set of operations in Java, so that's what direct Lucene users would tend to use. Just in search of confirmation really.
Re: Result Grouping
What type of field are you grouping on? What happens when you distribute it? I.e., what specifically goes wrong? Upayavira

On Tue, Jun 25, 2013, at 09:12 PM, Bryan Bende wrote:

I was reading this documentation on Result Grouping... http://docs.lucidworks.com/display/solr/Result+Grouping which says...

sort - sortspec - Specifies how Solr sorts the groups relative to each other. For example, sort=popularity desc will cause the groups to be sorted according to the highest popularity document in each group. The default value is score desc.

group.sort - sortspec - Specifies how Solr sorts documents within a single group. The default value is score desc.

Is it possible to use these parameters such that group.sort would first sort within each group, and then the overall sort would be applied according to the first element of each sorted group? For example, using the scenario above where it has sort=popularity desc, could you also have group.sort=date asc, resulting in the most recent document of each group being sorted by decreasing popularity? It seems to work the way I described when running a single-node Solr 4.3 instance, but in a 2-shard configuration it appears to work differently. -Bryan
multiValued field score and count
Hi everybody, I have some multiValued (single-token) fields, for example authorid and itemid, and what I'd like to know is whether there's a way to find out how many times a match was found in a document for some field, and whether the score is higher when multiple matches are found. For example, my docs are:

<doc>
  <id>1</id>
  <authorid>11</authorid>
  <authorid>9</authorid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>5000</itemid>
</doc>
<doc>
  <id>2</id>
  <authorid>3</authorid>
  <itemid>1000</itemid>
</doc>

Would the first document have a higher score than the second if I search for itemid=1000? Is it possible to know how many times the match was found (3 for doc1 and 1 for doc2)? Otherwise, how could I achieve that result? Best, Flavio

-- Flavio Pompermaier, Development Department, OKKAM Srl - www.okkam.it
Re: multiValued field score and count
Add fl=[explain],* to your query, and review the output in the new field. It will tell you how the score was calculated. Look at the TF or termfreq values, as this is the number of times the term appears.

Also, you could add this to your fl= param: count:termfreq(authorid, '1000') which would give you a new field telling you how many times the term 1000 appears in the authorid field for each document.

Upayavira

On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:

Hi everybody, I have some multiValued (single-token) fields, for example authorid and itemid, and what I'd like to know is whether there's a way to find out how many times a match was found in a document for some field, and whether the score is higher when multiple matches are found. For example, my docs are:

<doc>
  <id>1</id>
  <authorid>11</authorid>
  <authorid>9</authorid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>1000</itemid>
  <itemid>5000</itemid>
</doc>
<doc>
  <id>2</id>
  <authorid>3</authorid>
  <itemid>1000</itemid>
</doc>

Would the first document have a higher score than the second if I search for itemid=1000? Is it possible to know how many times the match was found (3 for doc1 and 1 for doc2)? Otherwise, how could I achieve that result? Best, Flavio
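As a concrete illustration of both suggestions in one request (the core name and the field/term values are just the ones from this thread's example):

http://localhost:8983/solr/collection1/select?q=itemid:1000&fl=id,score,[explain],cnt:termfreq(itemid,'1000')

Each returned document then carries an [explain] field describing how its score was computed and a cnt field with the raw term frequency of 1000 in itemid.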
Re: multiValued field score and count
So, in order to achieve that feature, do I have to declare my fields (authorid and itemid) with termVectors=true termPositions=true termOffsets=false? Should that be enough?

On Wed, Jun 26, 2013 at 10:42 AM, Upayavira u...@odoko.co.uk wrote:

Add fl=[explain],* to your query, and review the output in the new field. It will tell you how the score was calculated. Look at the TF or termfreq values, as this is the number of times the term appears. Also, you could add this to your fl= param: count:termfreq(authorid, '1000') which would give you a new field telling you how many times the term 1000 appears in the authorid field for each document. Upayavira

On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:

-- Flavio Pompermaier, Development Department, OKKAM Srl - www.okkam.it
Re: multiValued field score and count
I mentioned two features, [explain] and termfreq(field, 'value'). Neither of these requires anything special, as they use stuff central to Lucene's scoring mechanisms. I think you can turn off the storage of term frequencies, and obviously that would spoil things, but that's certainly not on by default. I typed the syntax below from memory, so I might not have got it exactly right.

Upayavira

On Wed, Jun 26, 2013, at 10:22 AM, Flavio Pompermaier wrote:

So, in order to achieve that feature, do I have to declare my fields (authorid and itemid) with termVectors=true termPositions=true termOffsets=false? Should that be enough?

On Wed, Jun 26, 2013 at 10:42 AM, Upayavira u...@odoko.co.uk wrote:
Re: Is there a way to capture div tag by id?
Hi. I ran into this issue a while ago. In my case, the div I was trying to extract was the main content of the page. If that is your case, boilerpipe may help. There is a patch at https://issues.apache.org/jira/browse/SOLR-3808 that worked for me. Arcadius.

On 25 June 2013 18:17, eShard zim...@yahoo.com wrote:

Let's say I have a div with id=myDiv. Is there a way to set up the Solr update/extract handler to capture just that particular div?

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-capture-div-tag-by-id-tp4073120.html
Sent from the Solr - User mailing list archive at Nabble.com.
StatsComponent doesn't work if field's type is TextField - can I change field's type to String
Hi all,

StatsComponent doesn't work if field's type is TextField. I get the following message:

Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported.

My field configuration is:

<fieldType name="mvstring" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\n"/>
  </analyzer>
</fieldType>
<field name="myField" type="mvstring" indexed="true" stored="false" multiValued="true"/>

So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field.

Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much.
Re: multiValued field score and count
I tried to play a little with the tools you suggested. However, I'm probably missing something, because the term frequency is not what I expected. My itemid field is defined (in schema.xml) as:

<field name="itemid" type="string" indexed="true" stored="true" multiValued="true"/>

I was expecting that, indexing the xml mentioned in the first mail via post.sh, the term frequency of itemid 1000 would be 3 in the first doc and 1 in the second! Instead, I got that result only if I change my settings to:

<field name="itemid" type="text_ws" indexed="true" stored="true" multiValued="true"/>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

and I modify my populating xml as:

<doc>
  <id>1</id>
  <authorid>11</authorid>
  <authorid>9</authorid>
  <itemid>1000 1000 1000</itemid>
  <itemid>5000</itemid>
</doc>
<doc>
  <id>2</id>
  <authorid>3</authorid>
  <itemid>1000</itemid>
</doc>

Is there a way to achieve termFrequency=3 for doc1 while also using my initial settings (itemid as string and just one value per itemid tag)? Best, Flavio

On Wed, Jun 26, 2013 at 12:38 PM, Upayavira u...@odoko.co.uk wrote:

I mentioned two features, [explain] and termfreq(field, 'value'). Neither of these requires anything special, as they use stuff central to Lucene's scoring mechanisms. I think you can turn off the storage of term frequencies, and obviously that would spoil things, but that's certainly not on by default. I typed the syntax below from memory, so I might not have got it exactly right. Upayavira

On Wed, Jun 26, 2013, at 10:22 AM, Flavio Pompermaier wrote:

-- Flavio Pompermaier, Development Department, OKKAM Srl - www.okkam.it
Re: Is there a way to capture div tag by id?
On 06/25/2013 01:17 PM, eShard wrote:

Let's say I have a div with id=myDiv. Is there a way to set up the Solr update/extract handler to capture just that particular div?

--
View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-capture-div-tag-by-id-tp4073120.html
Sent from the Solr - User mailing list archive at Nabble.com.

You might be interested in Lux (see http://luxdb.org), which provides XML-aware indexing for Solr. It indexes text in the context of every element, and also allows you to explicitly define indexes using any XPath 2.0 expression, including //div[@id='myDiv'], for example.

-- Michael Sokolov, Senior Architect, Safari Books Online
Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String
You could use an update processor to turn the text string into multiple string values. A short snippet of JavaScript in a StatelessScriptUpdateProcessor could do the trick. The field could then be a multivalued string field.

-- Jack Krupansky

-Original Message- From: Elran Dvir Sent: Wednesday, June 26, 2013 7:14 AM To: solr-user@lucene.apache.org Subject: StatsComponent doesn't work if field's type is TextField - can I change field's type to String

Hi all,

StatsComponent doesn't work if field's type is TextField. I get the following message:

Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported.

My field configuration is:

<fieldType name="mvstring" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\n"/>
  </analyzer>
</fieldType>
<field name="myField" type="mvstring" indexed="true" stored="false" multiValued="true"/>

So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field. Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much.
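A bare-bones sketch of the kind of script Jack suggests, splitting a newline-separated value into multiple values; the incoming field name (myField_txt) and the final multiValued string field (myField) are placeholder assumptions, and the script would be referenced from a solr.StatelessScriptUpdateProcessorFactory in an updateRequestProcessorChain:

function processAdd(cmd) {
  var doc = cmd.solrDoc;                           // the SolrInputDocument being indexed
  var raw = doc.getFieldValue("myField_txt");      // incoming newline-separated text (placeholder field)
  if (raw != null) {
    doc.removeField("myField_txt");
    var vals = ("" + raw).split("\n");             // split on new lines
    for (var i = 0; i < vals.length; i++) {
      if (vals[i].length > 0) {
        doc.addField("myField", vals[i]);          // myField declared as a multiValued string field
      }
    }
  }
}
function processDelete(cmd) { }
function processCommit(cmd) { }
function processRollback(cmd) { }
function processMergeIndexes(cmd) { }
function finish() { }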
index analyzer vs query analyzer
Hello, what are the criteria for putting an analyzer under query or index? E.g., I want to use NGramFilterFactory; is there a difference whether I put it under analyzer type="index" or analyzer type="query"? Thanks. Mugoma
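For reference, an analyzer defined under type="index" runs when documents are indexed, while the one under type="query" runs on query terms; a common n-gram pattern generates grams only at index time so that a whole query term can match any one of its grams. A sketch of such a field type (the type name and gram sizes are arbitrary choices, not from this message):

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>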
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Right, unfortunately this is a gremlin lurking in the weeds, see: http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock

There are a couple of ways to deal with this:
1. Go ahead and up the limit and re-compile; if you look at SolrCmdDistributor, the semaphore is defined there.
2. https://issues.apache.org/jira/browse/SOLR-4816 should address this as well as improve indexing throughput. I'm totally sure Joel (the guy working on this) would be thrilled if you were able to verify these two points; I'd ask him (on the JIRA) whether he thinks it's ready to test.
3. Reduce the number of threads you're indexing with.
4. Index docs in small packets, perhaps even one, and just rack together a zillion threads to get throughput.

FWIW, Erick

On Tue, Jun 25, 2013 at 8:55 AM, Vinay Pothnis poth...@gmail.com wrote:

Jason and Scott, thanks for the replies and pointers! Yes, I will consider the 'maxDocs' value as well. How do I monitor the transaction logs during the interval between commits? Thanks, Vinay

On Mon, Jun 24, 2013 at 8:48 PM, Jason Hellman jhell...@innoventsolutions.com wrote:

Scott, my comment was meant to be a bit tongue-in-cheek, but my intent in the statement was to represent hard failure along the lines Vinay is seeing. We're talking about OutOfMemoryException conditions, total cluster paralysis requiring restart, or other similar and disastrous conditions. Where that line is is impossible to define generically, but trivial to accomplish. What any of us running Solr has to achieve is a realistic simulation of our desired production load (probably well above peak) and to see what limits are reached. Armed with that information we tweak.

In this case, we look at finding the point where data ingestion reaches a natural limit. For some that may be JVM GC, for others memory buffer size on the client load, and yet for others it may be I/O limits on multithreaded reads from a database or file system.

In old Solr days we had a little less to worry about. We might play with a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial commits and rollback recoveries. But with 4.x we now have more durable write options and NRT to consider, and SolrCloud begs to use this. So we have to consider transaction logs, the file handles they leave open until commit operations occur, and how we want to manage writing to all cores simultaneously instead of a more narrow master/slave relationship.

It's all manageable, all predictable (with some load testing) and all filled with many possibilities to meet our specific needs. Considering that each person's data model, ingestion pipeline, request processors, and field analysis steps will be different, 5 threads of input at face value doesn't really contemplate the whole problem. We have to measure our actual data against our expectations and find where the weak chain links are to strengthen them. The symptoms aren't necessarily predictable in advance of this testing, but they're likely addressable and not difficult to decipher.

For what it's worth, SolrCloud is new enough that we're still experiencing some uncharted territory with unknown ramifications, but with continued dialog through channels like these there are fewer territories without good cartography :)

Hope that's of use! Jason

On Jun 24, 2013, at 7:12 PM, Scott Lundgren scott.lundg...@carbonblack.com wrote:

Jason, regarding your statement "push you over the edge" - what does that mean? Does it mean uncharted territory with unknown ramifications, or something more like specific, known symptoms? I ask because our use is similar to Vinay's in some respects, and we want to be able to push the capabilities of write perf - but not over the edge! In particular, I am interested in knowing the symptoms of failure, to help us troubleshoot the underlying problems if and when they arise. Thanks, Scott

On Monday, June 24, 2013, Jason Hellman wrote:

Vinay, you may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason

On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote:

I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds.

On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com wrote:

Vinay, what autoCommit settings do you have for your indexing process? Jason

On Jun 24, 2013, at 1:28 PM, Vinay Pothnis poth...@gmail.com wrote:

Here is the ulimit -a output: core file size (blocks, -c) 0 data seg size (kbytes, -d)
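For reference, the autoCommit settings discussed in this thread live in the updateHandler section of solrconfig.xml; the values below mirror the 30-second hard / 1-second soft setup Vinay describes, with a maxDocs cap added along the lines Jason suggests (the 10000 figure is only an illustrative placeholder):

<autoCommit>
  <maxTime>30000</maxTime>
  <maxDocs>10000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>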
Get the query result from one collection and send it to another collection for merging the result sets
Hi, we will have two categories of data, where one category will be the list of primary data (for example products) and the other collection (which could be spread across shards) holds the transaction data (for example product sales data). We have a search scenario where we need to show the products along with the number of sales for each product. For this we need to do a facet-based search on the second collection, and then this has to be shown together with the primary data. Is there any way to handle this kind of scenario? Please suggest any other approaches to get the desired result. Thank you, Jilani
Re: Solr indexer and Hadoop
Well, it's been merged into trunk according to the comments, so: try it on trunk, help with any bugs, buy Mark beer. And, most especially, document up what it takes to make it work. Mark is juggling a zillion things and I'm sure he'd appreciate any help there. Erick

On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

zomghowcanihelp? :)

Michael Della Bitta, Applications Developer, appinions inc. - http://www.appinions.com/

On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote:

You might be interested in following: https://issues.apache.org/jira/browse/SOLR-4916 Best, Erick

On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Jack, sorry, but I don't agree that it's that cut and dried. I've very successfully worked with terabytes of data in Hadoop that was stored on an Isilon mounted via NFS, for example. In cases like this, you're using MapReduce purely for its execution model (which existed far before Hadoop and HDFS ever did).

Michael Della Bitta, Applications Developer, appinions inc. - http://www.appinions.com/

On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky j...@basetechnology.com wrote:

??? Hadoop=HDFS. If the data is not in Hadoop/HDFS, just use the normal Solr indexing tools, including SolrCell and Data Import Handler, and possibly ManifoldCF. -- Jack Krupansky

-Original Message- From: engy.morsy Sent: Tuesday, June 25, 2013 8:10 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop

Thank you Jack. So, I need to convert those nodes holding data to HDFS.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Result Grouping
The field I am grouping on is a single-valued string.

It looks like in non-distributed mode, if I use group=true, sort, group.sort, and group.limit=1, it will:
- group the results
- sort within each group
- limit down to 1 result per group
- apply the sort between groups using the single result of each group

When I run with numShards > 1, it will:
- group the results
- apply the sort between groups using the document from each group based on the sort; for example, if sort=popularity desc then it uses the highest popularity from each group
- sort within the group
- limit down to 1 result per group

I was trying to confirm whether this is the expected behavior, or whether there is something I could do to get the first behavior in a distributed configuration. I posted this a few days ago describing the scenario in more detail if you are interested...
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3CCALo_M18WVoLKvepJMu0wXk_x2H8cv3UaX9RQYtEh4-mksQHLBA%40mail.gmail.com%3E

What type of field are you grouping on? What happens when you distribute it? I.e., what specifically goes wrong? Upayavira

On Tue, Jun 25, 2013, at 09:12 PM, Bryan Bende wrote:

I was reading this documentation on Result Grouping... http://docs.lucidworks.com/display/solr/Result+Grouping which says...

sort - sortspec - Specifies how Solr sorts the groups relative to each other. For example, sort=popularity desc will cause the groups to be sorted according to the highest popularity document in each group. The default value is score desc.

group.sort - sortspec - Specifies how Solr sorts documents within a single group. The default value is score desc.

Is it possible to use these parameters such that group.sort would first sort within each group, and then the overall sort would be applied according to the first element of each sorted group? For example, using the scenario above where it has sort=popularity desc, could you also have group.sort=date asc, resulting in the most recent document of each group being sorted by decreasing popularity? It seems to work the way I described when running a single-node Solr 4.3 instance, but in a 2-shard configuration it appears to work differently. -Bryan
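For the record, the kind of request being discussed looks roughly like this (the sort fields match the popularity/date example in the thread; the grouping field name is a placeholder):

http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.field=category_s&group.limit=1&group.sort=date+asc&sort=popularity+desc

i.e. documents are sorted by date within each group, one document is kept per group, and the groups themselves are ordered by popularity.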
Re: how to replicate Solr Cloud
On the lengthy TODO list is making SolrCloud nodes rack-aware; that should help with this, but it's not real high in the priority queue as I recall. The current architecture sends updates and requests all over the cluster, so there are lots of messages that go across the presumably expensive pipe between data centers. Not to mention the ZooKeeper quorum problem.

Hmmm, ZooKeeper quorum problem: say 1 ZK node is in DC1 and 2 are in DC2. If DC2 goes down, DC1 will not accept updates because there is no available ZK quorum. I've seen one proposal where you use 3 DCs, each with a ZK node, to ameliorate this.

But all this is an issue only if the communications link between the datacenters is expensive, where that term can mean that it literally costs more, that it is slow, whatever.

Best, Erick

On Tue, Jun 25, 2013 at 12:14 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Uh, I remember that email, but can't recall where we did it... will try to recall it some more and reply if I can manage to dig it out of my brain... Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Tue, Jun 25, 2013 at 2:24 PM, Kevin Osborn kevin.osb...@cbsi.com wrote:

Otis, I did actually stumble upon this link: http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/74870 This was from you. You were attempting to replicate data from SolrCloud to some other slaves for heavy-duty queries. You said that you accomplished this. Can you provide a few pointers on how you did this? Thanks.

On Tue, Jun 25, 2013 at 10:25 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

I think what is needed is a Leader that, while being a Leader for its own Slice in its local Cluster and Collection (I think I'm using all the latest terminology correctly here), is at the same time a Replica of its own Leader counterpart in the Primary Cluster. Not currently possible, AFAIK. Or maybe there is a better way? Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Tue, Jun 25, 2013 at 1:07 PM, Kevin Osborn kevin.osb...@cbsi.com wrote:

We are going to have two datacenters, each with their own SolrCloud and ZooKeeper quorums. The end result will be that they should be replicas of each other. One method that has been mentioned is that we should add documents to each cluster separately. For various reasons, this may not be ideal for us. Instead, we are playing around with the idea of always indexing to one datacenter and then having that replicate to the other datacenter. And this is where I am having some trouble on how to proceed.

The nice thing about SolrCloud is that there are no masters and slaves. Each node is equal, has the same configs, etc. But in this case, I want to have a node in one datacenter poll for changes in another datacenter. Before SolrCloud, I would have used slave/master replication. But in the SolrCloud world, I am not sure how to configure this setup. Or are there any better ideas on how to use replication to push or pull data from one datacenter to another? In my case, NRT is not a requirement. And I will also be dealing with about 3 collections and 5 or 6 shards. Thanks.
-- KEVIN OSBORN, Lead Software Engineer, CNET Content Solutions
Re: How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?
Yes, the LimitTokenCountFilterFactory will do the trick. I have some examples in the book, showing for a given input string what the output tokens will be. Otherwise, the Solr Javadoc does give one generic example, but without showing how it actually works:
http://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilterFactory.html

The new Apache Solr Reference? No mention of the filter.

-- Jack Krupansky

-Original Message- From: Daniel Collins Sent: Wednesday, June 26, 2013 3:38 AM To: solr-user@lucene.apache.org Subject: How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?

We have a requirement to grab the first N words in a particular field and weight them differently for scoring purposes. So I thought to use a copyField and have some extra filter on the destination to truncate it down (post tokenization). A quick search found both a LimitTokenCountAnalyzer and a LimitTokenCountFilter mentioned. If I read the wiki right, the Filter is the correct approach for Solr, since we have the schema-able analyzer chain, so we don't need to code anything, right? The Analyzer version would be more useful if we were explicitly coding up a set of operations in Java, so that's what direct Lucene users would tend to use. Just in search of confirmation really.
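For illustration, a copyField plus a field type using the factory might look like the following; the field names, the type name and maxTokenCount=10 are placeholder choices, not values from this thread:

<fieldType name="text_first10" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body_first10" type="text_first10" indexed="true" stored="false"/>
<copyField source="body" dest="body_first10"/>

The truncated copy can then be boosted independently of the full field at query time, e.g. via qf=body body_first10^2 with edismax.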
Re: Querying multiple collections in SolrCloud
bq: Would the above setup qualify as multiple compatible collections?

No. While there may be enough fields in common to form a single query, the TF/IDF calculations will not be compatible and the scores from the various collections will NOT be comparable. So simply getting the list of top N docs will probably be dominated by the docs from a single type.

bq: How does SolrCloud combine the query results from multiple collections?

It doesn't. SolrCloud sorts the results from multiple nodes in the _same_ collection according to whatever sort criteria are specified, defaulting to score. Say you ask for the top 20 docs. A node from each shard returns the top 20 docs for that shard. The node processing them just merges all the returned lists and only keeps the top 20.

I don't think your last two questions are really relevant; SolrCloud isn't built to query multiple collections and return the results coherently.

The root problem here is that you're trying to compare docs from different collections for "goodness" to return the top N. This isn't actually hard _except_ when goodness is the score; then it just doesn't work. You can't even compare scores from different queries on the _same_ collection, much less different ones. Consider two collections, books and songs. One consists of lots and lots of text, and the term frequency and inverse doc freq (TF/IDF) will be hugely different than songs. Not to mention field length normalization.

Now, all that aside, there's an option. Index all the docs in a single collection and use grouping (aka field collapsing) to get a single response that has the top N docs from each type (they'll be in different sections of the original response) and present them to the user however makes sense. You'll get hands-on experience in why this isn't something that's easy to do automatically if you try to sort these into a single list by relevance G...

Best, Erick

On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey ctoo...@gmail.com wrote:

Thanks Jack for the alternatives. The first is interesting but has the downside of requiring multiple queries to get the full matching docs. The second is interesting and very simple, but has the downside of not being modular and being difficult to configure field boosting when the collections have overlapping field names with different boosts being needed for the same field in different document types.

I'd still like to know about the viability of my original approach though too. Chris

On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky j...@basetechnology.com wrote:

One simple scenario to consider: N+1 collections - one collection per document type with detailed fields for that document type, and one common collection that indexes a subset of the fields. The main user query would be an edismax over the common fields in that main collection. You can then display summary results from the common collection. You can also then support drill-down into the type-specific collection based on a type field for each document in the main collection. Or, sure, you actually CAN index multiple document types in the same collection - add all the fields to one schema - there is no time or space penalty if most of the fields are empty for most documents. -- Jack Krupansky

-Original Message- From: Chris Toomey Sent: Tuesday, June 25, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: Querying multiple collections in SolrCloud

Hi, I'm investigating using SolrCloud for querying documents of different but similar/related types, and have read through docs on the wiki and done many searches in these archives, but still have some questions. Thanks in advance for your help.

Setup:
* Say that I have N distinct types of documents and I want to do queries that return the best matches regardless of document type. I.e., something akin to a Google search where I'd like to get the best matches from the web, news, images, and maps.
* Our main use case is supporting simple user-entered searches, which would just contain terms / phrases and wouldn't specify fields.
* The document types will not all have the same fields, though there may be some overlap in the fields.
* We plan to use a separate collection for each document type and to use the eDisMax query parser. Each collection would have a document-specific schema configuration with appropriate defaults for query fields and boosts, etc.

Questions:
* Would the above setup qualify as multiple compatible collections, such that we could search all N collections with a single SolrCloud query, as in the example query http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN? Again, we're not querying against specific fields.
* How does SolrCloud combine the query results from multiple collections? Does it re-sort the combined result set, or does it just return
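A rough sketch of the single-collection grouping request Erick describes, assuming the documents carry a type field such as doctype_s (a placeholder name, not from this thread):

http://localhost:8983/solr/collection1/select?q=apple+pie&defType=edismax&group=true&group.field=doctype_s&group.limit=5

This returns up to 5 top-scoring docs per document type in one response, leaving it to the application to decide how to interleave them.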
Re: URL search and indexing
Flavio:

You mention that you're new to Solr, so I thought I'd make sure you know that the admin/analysis page is your friend! I flat guarantee that as you try to index/search following the suggestions, you'll scratch your head at your results and you'll discover that the analysis process isn't doing quite what you expect. The admin/analysis page shows you the transformation of the input at each stage, i.e. how the input is tokenized, what transformations are applied to each token, etc. It's invaluable!

Best, Erick

P.S. Feel free to un-check the verbose box; it provides lots of information but can be overwhelming, especially at first!

On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier pomperma...@okkam.it wrote:

Ok, thank you all for the great help! Now I'm ready to start playing with my index! Best, Flavio

On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky j...@basetechnology.com wrote:
Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String
From the stats component page: "The stats component returns simple statistics for indexed numeric fields within the DocSet." So string, text, anything non-numeric won't work.

You can declare it multiValued, but then you have to add multiple values for the field when you send the doc to Solr, or implement a custom update component to break them up. At least there's no filter that I know of that takes a delimited set of numbers and transforms them.

FWIW, Erick

On Wed, Jun 26, 2013 at 4:14 AM, Elran Dvir elr...@checkpoint.com wrote:

Hi all,

StatsComponent doesn't work if field's type is TextField. I get the following message:

Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported.

My field configuration is:

<fieldType name="mvstring" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\n"/>
  </analyzer>
</fieldType>
<field name="myField" type="mvstring" indexed="true" stored="false" multiValued="true"/>

So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field. Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much.
Dynamic Type For Solr Schema
I use Solr 4.3.1 as SolrCloud. I know that I can define an analyzer at schema.xml. Let's assume that I have specialized my analyzer for Turkish. However I want to have another analyzer too, i.e. for English. I have these fields at my schema: ... field name=content type=text_tr stored=true indexed=true/ field name=title type=text_tr stored=true indexed=true/ ... I have a field type as text_tr that is tailored for Turkish. I have another field type as text_en that is tailored for English. I have another field at my schema as lang. lang holds the language of the document as en or tr. If I get a document whose lang field holds *tr* I want that: ... field name=content type=*text_tr* stored=true indexed=true/ field name=title type=*text_tr* stored=true indexed=true/ ... If I get a document whose lang field holds *en* I want that: ... field name=content type=*text_en* stored=true indexed=true/ field name=title type=*text_en* stored=true indexed=true/ ... I want dynamic types just for those fields; the others will stay the same. How can I do that properly at Solr? (UpdateRequestProcessor, ...?)
Re: URL search and indexing
I was doing exactly that and, thanks to the administration page and explanation/debugging, I checked whether the results were those expected. Unfortunately, the results were not correct when submitting updates through the post.sh script (which uses curl in the end). Probably, if it finds the same tag (the same value for the same field name), it collapses them. Rewriting the same document in Java and submitting the updates made things work correctly. In my opinion this is a bug (of the entire process; I don't know whether it is a problem of curl or of the script itself). Best, Flavio On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson erickerick...@gmail.com wrote: Flavio: You mention that you're new to Solr, so I thought I'd make sure you know that the admin/analysis page is your friend! I flat guarantee that as you try to index/search following the suggestions you'll scratch your head at your results and you'll discover that the analysis process isn't doing quite what you expect. The admin/analysis page shows you the transformation of the input at each stage, i.e. how the input is tokenized, what transformations are applied to each token etc. It's invaluable! Best Erick P.S. Feel free to un-check the verbose box, it provides lots of information but can be overwhelming, especially at first!
Re: URL search and indexing
If there is a bug... we should identify it. What's a sample post command that you issued? -- Jack Krupansky -Original Message- From: Flavio Pompermaier Sent: Wednesday, June 26, 2013 10:53 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing I was doing exactly that and, thanks to the administration page and explanation/debugging, I checked whether the results were those expected. Unfortunately, the results were not correct when submitting updates through the post.sh script (which uses curl in the end). Probably, if it finds the same tag (the same value for the same field name), it collapses them. Rewriting the same document in Java and submitting the updates made things work correctly. In my opinion this is a bug (of the entire process; I don't know whether it is a problem of curl or of the script itself). Best, Flavio
Re: index analyzer vs query analyzer
Yes! A rather extreme difference and you probably want it in both. The admin/analysis page is your friend. Basically, putting stuff in the type=index section dictates what goes into the index, and that is _all_ that is searchable. The result of the full analysis chain is what's in the index and searchable. Putting stuff in the type=query section dictates what terms the index is searched for. So if the two don't match, you will get surprising results. I'd advise that you keep them both identical until you're more familiar with how all this works or use one of the pre-defined examples and add or remove filters _in the same order_. Best Erick On Wed, Jun 26, 2013 at 6:23 AM, Mugoma Joseph O. mug...@yengas.com wrote: Hello, What's the criteria used in putting an analyzer at query or index? e.g. I want to use NGramFilterFactory, is there a difference whether I put it under analyzer type=index or analyzer type=query ? Thanks. Mugoma
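For example, a common pattern with NGramFilterFactory is to generate grams only at index time, so that a whole query term can match against the indexed grams; the gram sizes below are purely illustrative and should be tuned:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

As Erick says, the admin/analysis page will show exactly which tokens each side produces for a given input.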
Re: Solr indexer and Hadoop
Pardon, my unfamiliarity with the Solr development process. Now that it's in the trunk, will it appear in the next 4.X release? -- David On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson erickerick...@gmail.comwrote: Well, it's been merged into trunk according to the comments, so Try it on trunk, help with any bugs, buy Mark beer. And, most especially, document up what it takes to make it work. Mark is juggling a zillion things and I'm sure he'd appreciate any help there. Erick On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: zomghowcanihelp? :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote: You might be interested in following: https://issues.apache.org/jira/browse/SOLR-4916 Best Erick On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Jack, Sorry, but I don't agree that it's that cut and dried. I've very successfully worked with terabytes of data in Hadoop that was stored on an Isilon mounted via NFS, for example. In cases like this, you're using MapReduce purely for it's execution model (which existed far before Hadoop and HDFS ever did). Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky j...@basetechnology.com wrote: ??? Hadoop=HDFS If the data is not in Hadoop/HDFS, just use the normal Solr indexing tools, including SolrCell and Data Import Handler, and possibly ManifoldCF. -- Jack Krupansky -Original Message- From: engy.morsy Sent: Tuesday, June 25, 2013 8:10 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop Thank you Jack. So, I need to convert those nodes holding data to HDFS. -- View this message in context: http://lucene.472066.n3.** nabble.com/Solr-indexer-and-**Hadoop-tp4072951p4073013.html http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: URL search and indexing
Your other best friend is debug=query on the URL, you might be seeing different parsed queries than you expect, although that doesn't really hold water given you say SolrJ fixes things. I'd be surprised if posting the xml was the culprit, but you never know. Did you re-index after schema changes etc? Best Erick On Wed, Jun 26, 2013 at 8:18 AM, Jack Krupansky j...@basetechnology.com wrote: If there is a bug... we should identify it. What's a sample post command that you issued? -- Jack Krupansky -Original Message- From: Flavio Pompermaier Sent: Wednesday, June 26, 2013 10:53 AM To: solr-user@lucene.apache.org Subject: Re: URL search and indexing I was doing exactly that and, thanks to the administration page and explanation/debugging, I checked whether the results were those expected. Unfortunately, the results were not correct when submitting updates through the post.sh script (which uses curl in the end). Probably, if it finds the same tag (the same value for the same field name), it collapses them. Rewriting the same document in Java and submitting the updates made things work correctly. In my opinion this is a bug (of the entire process; I don't know whether it is a problem of curl or of the script itself). Best, Flavio
Re: Solr indexer and Hadoop
See Mark's comments on the Jira when I asked that question. My take: If 4.4 happens real soon (which some people have proposed), then it may not make it into 4.4. But if a 4.4 RC doesn't happen for another couple of weeks (my inclination), then the HDFS support could well make it into 4.4. If not in 4.4, 4.5 is probably a slam-dunk. -- Jack Krupansky -Original Message- From: David Larochelle Sent: Wednesday, June 26, 2013 11:24 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop Pardon, my unfamiliarity with the Solr development process. Now that it's in the trunk, will it appear in the next 4.X release? -- David On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson erickerick...@gmail.comwrote: Well, it's been merged into trunk according to the comments, so Try it on trunk, help with any bugs, buy Mark beer. And, most especially, document up what it takes to make it work. Mark is juggling a zillion things and I'm sure he'd appreciate any help there. Erick On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: zomghowcanihelp? :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote: You might be interested in following: https://issues.apache.org/jira/browse/SOLR-4916 Best Erick On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Jack, Sorry, but I don't agree that it's that cut and dried. I've very successfully worked with terabytes of data in Hadoop that was stored on an Isilon mounted via NFS, for example. In cases like this, you're using MapReduce purely for it's execution model (which existed far before Hadoop and HDFS ever did). Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky j...@basetechnology.com wrote: ??? Hadoop=HDFS If the data is not in Hadoop/HDFS, just use the normal Solr indexing tools, including SolrCell and Data Import Handler, and possibly ManifoldCF. -- Jack Krupansky -Original Message- From: engy.morsy Sent: Tuesday, June 25, 2013 8:10 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop Thank you Jack. So, I need to convert those nodes holding data to HDFS. -- View this message in context: http://lucene.472066.n3.** nabble.com/Solr-indexer-and-**Hadoop-tp4072951p4073013.html http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: URL search and indexing
Obviously I messed up the email thread... however I found a problem indexing my document via post.sh. This is basically my schema.xml: schema name=dopa-schema version=1.5 fields field name=url type=string indexed=true stored=true required=true multiValued=false / field name=itemid type=string indexed=true stored=true multiValued=true/ field name=_version_ type=long indexed=true stored=true/ /fields uniqueKeyurl/uniqueKey types fieldType name=string class=solr.StrField sortMissingLast=true / fieldType name=long class=solr.TrieLongField precisionStep=0 positionIncrementGap=0/ /types /schema and this is the document I tried to upload via post.sh: add doc field name=urlhttp://test.example.org/first.html/field field name=itemid1000/field field name=itemid1000/field field name=itemid1000/field field name=itemid5000/field /doc doc field name=urlhttp://test.example.org/second.html/field field name=itemid1000/field field name=itemid5000/field /doc /add When playing with the administration and debugging tools I discovered that searching for q=itemid:5000 gave me the same score for those docs, while I was expecting different term frequencies between the first and the second. In fact, using Java to upload the documents led to the correct results (3 occurrences of item 1000 in the first doc and 1 in the second), e.g.: document1.addField(itemid, 1000); document1.addField(itemid, 1000); document1.addField(itemid, 1000); Am I right or am I missing something else? On Wed, Jun 26, 2013 at 5:18 PM, Jack Krupansky j...@basetechnology.com wrote: If there is a bug... we should identify it. What's a sample post command that you issued?
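One way to see the per-document term frequency directly, rather than inferring it from the score, is the termfreq() function in the fl list; a sketch, with host and core names as placeholders:

http://localhost:8983/solr/collection1/select?q=itemid:1000&fl=url,freq:termfreq(itemid,'1000')&wt=xml

If the post.sh upload really is collapsing the repeated values, the first document should report freq=1 instead of the expected 3.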
RE: StatsComponent doesn't work if field's type is TextField - can I change field's type to String
Erick, thanks for the response. I think the stats component works with strings. In StatsValuesFactory, I see the following code: public static StatsValues createStatsValues(SchemaField sf) { ... else if (StrField.class.isInstance(fieldType)) { return new StringStatsValues(sf); } } -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, June 26, 2013 5:30 PM To: solr-user@lucene.apache.org Subject: Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String From the stats component page: The stats component returns simple statistics for indexed numeric fields within the DocSet So string, text, anything non-numeric won't work. You can declare it multiValued but then you have to add multiple values for the field when you send the doc to Solr or implement a custom update component to break them up. At least there's no filter that I know of that takes a delimited set of numbers and transforms them. FWIW, Erick On Wed, Jun 26, 2013 at 4:14 AM, Elran Dvir elr...@checkpoint.com wrote: Hi all, StatsComponent doesn't work if field's type is TextField. I get the following message: Field type textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache. solr.analysis.TokenizerChain,args={positionIncrementGap=100, sortMissingLast=true}} is not currently supported. My field configuration is: fieldType name=mvstring class=solr.TextField positionIncrementGap= 100 sortMissingLast=true analyzer type=index tokenizer class=solr.PatternTokenizerFactory pattern=\n / /analyzer /fieldType field name=myField type=mvstring indexed=true stored=false multiValued=true/ So, the reason my field is of type TextField is that in the document indexed there may be multiple values in the field separated by new lines. The tokenizer is splitting it to multiple values and the field is indexed as multi-valued field. Is there a way I can define the field as regular String field? Or a way to make StatsComponent work with TextField? Thank you very much. Email secured by Check Point
Re: Need Help in migrating Solr version 1.4 to 4.3
On 6/25/2013 11:52 PM, Sandeep Gupta wrote: Also in application development side, as I said that I am going to use HTTPSolrServer API and I found that we shouldn't create this object multiple times (as per the wiki document http://wiki.apache.org/solr/Solrj#HttpSolrServer) So I am planning to have my Server class as singleton. Please advice little bit in this front also. This is always the way that SolrServer objects are intended to be used, including CommonsHttpSolrServer in version 1.4. The only major difference between the two objects is that the new one uses HttpComponents 4.x and the old one uses HttpClient 3.x. There are other differences, but they are just the result of incremental improvements from version to version. Thanks, Shawn
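A minimal sketch of the singleton Shawn and the wiki recommend, assuming Solr 4.x SolrJ; the URL is just a placeholder:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public final class SolrClientHolder {
    // HttpSolrServer is thread-safe, so one instance can be shared by the whole application
    private static final HttpSolrServer SERVER =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    private SolrClientHolder() {}

    public static HttpSolrServer get() {
        return SERVER;
    }
}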
Re: Dynamic Type For Solr Schema
On Wed, Jun 26, 2013 at 11:46 AM, Jack Krupansky j...@basetechnology.com wrote: But there are also built-in language identifier update processors that can simultaneously identify what language is used in the input value for a field AND do the redirection to a language-specific field AND store the language code. I have an example of using this as well (for English/Russian): https://github.com/arafalov/solr-indexing-book/tree/master/published/languages . This includes the collection data files, so you can see the end result and play with it. The instructions on how to recreate this and explanation behind routing and field aliases setup are in my book : http://blog.outerthoughts.com/2013/06/my-book-on-solr-is-now-published/ :-) Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Is it possible to search Solr with a longer query string?
On 6/25/2013 6:15 PM, Jack Krupansky wrote: Are you using Tomcat? See: http://wiki.apache.org/solr/SolrTomcat#Enabling_Longer_Query_Requests Enabling Longer Query Requests If you try to submit too long a GET query to Solr, then Tomcat will reject your HTTP request on the grounds that the HTTP header is too large; symptoms may include an HTTP 400 Bad Request error or (if you execute the query in a web browser) a blank browser window. If you need to enable longer queries, you can set the maxHttpHeaderSize attribute on the HTTP Connector element in your server.xml file. The default value is 4K. (See http://tomcat.apache.org/tomcat-5.5-doc/config/http.html) Even better would be to force SolrJ to use a POST request. In newer versions (4.1 and later) Solr sets the servlet container's POST buffer size and defaults it to 2MB. In older versions, you'd have to adjust this in your servlet container config, but the default should be considerably larger than the header buffer used for GET requests. I thought that SolrJ used POST by default, but after looking at the code, it seems that I was wrong. Here's how to send a POST query: response = server.query(query, METHOD.POST); The import required for this is: import org.apache.solr.client.solrj.SolrRequest.METHOD; Gary, if you can avoid it, you should not be creating a new HttpSolrServer object every time you make a query. It is completely thread-safe, so create a singleton and use it for all queries against the medline core. Thanks, Shawn
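For reference, if the container is Tomcat, the header limit mentioned above is raised on the HTTP Connector in server.xml; everything here other than maxHttpHeaderSize is just the stock connector definition, and the value is illustrative:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxHttpHeaderSize="65536"
           redirectPort="8443"/>

Switching SolrJ to METHOD.POST, as Shawn shows, sidesteps the header limit entirely and is usually the simpler fix.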
Re: Dynamic Type For Solr Schema
You can certainly do redirection of input values in an update processing, even in a JavaScript script. But there are also built-in language identifier update processors that can simultaneously identify what language is used in the input value for a field AND do the redirection to a language-specific field AND store the language code. See: LangDetectLanguageIdentifierUpdateProcessorFactory TikaLanguageIdentifierUpdateProcessorFactory http://lucene.apache.org/solr/4_3_0/solr-langid/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactory.html http://lucene.apache.org/solr/4_3_0/solr-langid/org/apache/solr/update/processor/TikaLanguageIdentifierUpdateProcessorFactory.html http://wiki.apache.org/solr/LanguageDetection The non-Tika version may be better, depending on the nature of your input. Neither processor is in the new Apache Solr Reference Guide nor current release from Lucid, but see the detailed examples in my book. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, June 26, 2013 10:51 AM To: solr-user@lucene.apache.org Subject: Dynamic Type For Solr Schema I use Solr 4.3.1 as SolrCloud. I know that I can define analyzer at schema.xml. Let's assume that I have specialized my analyzer for Turkish. However I want to have another analzyer too, i.e. for English. I have that fields at my schema: ... field name=content type=text_tr stored=true indexed=true/ field name=title type=text_tr stored=true indexed=true/ ... I have a field type as text_tr that is combined for Turkish. I have another field type as text_en that is combined for Englished. I have another field at my schema as lang. lang holds the language of document as en or tr. If I get a document that has a lang field holds *tr* I want that: ... field name=content type=*text_tr* stored=true indexed=true/ field name=title type=*text_tr* stored=true indexed=true/ ... If I get a document that has a lang field holds *en* I want that: ... field name=content type=*text_en* stored=true indexed=true/ field name=title type=*text_en* stored=true indexed=true/ ... I want dynamic types just for that fields other will be same. How can I do that properly at Solr? (UpdateRequestProcessor, ...?)
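A sketch of what such a chain might look like in solrconfig.xml, assuming the solr-langid contrib jar (and its dependencies) is on the classpath, that the schema defines content_en/title_en as text_en and content_tr/title_tr as text_tr, and that the chain is selected with update.chain=langid on the update handler; the parameter values are illustrative:

<updateRequestProcessorChain name="langid">
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,content</str>
    <str name="langid.langField">lang</str>
    <str name="langid.whitelist">en,tr</str>
    <str name="langid.fallback">en</str>
    <bool name="langid.map">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

With langid.map enabled, an incoming content value detected as Turkish is indexed into content_tr, which is about as close as Solr gets to the dynamic typing asked about in the original question.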
Re: Dynamic Type For Solr Schema
On 6/26/2013 8:51 AM, Furkan KAMACI wrote: If I get a document that has a lang field holds *tr* I want that: ... field name=content type=*text_tr* stored=true indexed=true/ field name=title type=*text_tr* stored=true indexed=true/ Changing the TYPE of a field based on the contents of another field isn't possible. The language detection that has been mentioned in your other replies makes it possible to direct different languages to different fields, but won't change the type. Solr is highly dependent on its schema. The schema is necessarily fairly static. This is changing to some degree with the schema REST API in newer versions, but even with that, types aren't dynamic. If you change them, you have to reindex. Making them dynamic would require a major rewrite of Solr internals, and it's very likely that nobody would be able to agree on the criteria used to choose a type. What you are trying to do could be done by writing a custom Lucene application, because Lucene has no schema. Field types are determined by whatever code you write yourself. The problem with this approach is that you have to write ALL the server code, something that you get for free with Solr. It would not be a trivial task. Thanks, Shawn
MoreLikeThis handler and pivot facets
Hi, I have the current workflow, which works fine: - User enters search text - Text is sent to SOLR as query. Quite some faceting is also included in the request. - Result comes back and extensive facet information is displayed. Now I want to allow my user to enter a whole reference text as search text. So I do the same as above, but send the text via POST to a MoreLikeThis handler. Therefore I add those additional parameters: mlt.fl = 'text_field' mlt.minwl = 1 mlt.maxqt = 20 mlt.minf = 0 and remove of course the q parameter. The rest of the request - i.e. the faceting parameters - is identical. But I do not get facets back. For my sample request, I can see that 499 documents were found, but all facets are just empty. And the facet_pivot key does not exist at all. Is there any known issue with MLT + facets? I know that MLT + facets worked for me, but not yet when using pivot facets. kind regards, Achim
Parallel Import Process on same core. Solr 3.5
Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min , 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions. http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting1]clean=true http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Here's my current config: useCompoundFilefalse/useCompoundFile mergeFactor15/mergeFactor !-- I've experimented with 10, 15,25 and haven't seen much differences -- ramBufferSizeMB100/ramBufferSizeMB maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout lockTypesingle/lockType /indexDefaults mainIndex useCompoundFilefalse/useCompoundFile ramBufferSizeMB100/ramBufferSizeMB !-- I've bumped this up from 32 -- mergeFactor15/mergeFactor maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength unlockOnStartupfalse/unlockOnStartup /mainIndex updateHandler class=solr.DirectUpdateHandler2 autoCommit maxTime6/maxTime !-- I've experimented with various times here as well -- maxDocs25000/maxDocs !-- I've experimented with 25k, 500k, 100k -- /autoCommit maxPendingDeletes10/maxPendingDeletes /updateHandler What gets tricky is finding the sweet spot with these parameters, but wondering if anybody has any recommendations for an optimal config. Also, regarding autoCommit, I've even turned that feature off, but my heap size reaches its max sooner. I am wondering though, what would be the difference with autoCommit and passing in the commit=true param on each import query. Thanks in advance! Mike
Re: Parallel Import Process on same core. Solr 3.5
Hi Mike, Have you considered trying something like jhat or visualvm to see what's taking up room on the heap? http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html http://visualvm.java.net/ Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions w: appinions.com http://www.appinions.com/ On Wed, Jun 26, 2013 at 12:58 PM, Mike L. javaone...@yahoo.com wrote: Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min , 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions. http:// [server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting1]clean=true http:// [server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Here's my current config: useCompoundFilefalse/useCompoundFile mergeFactor15/mergeFactor!-- I've experimented with 10, 15,25 and haven't seen much differences -- ramBufferSizeMB100/ramBufferSizeMB maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout lockTypesingle/lockType /indexDefaults mainIndex useCompoundFilefalse/useCompoundFile ramBufferSizeMB100/ramBufferSizeMB !-- I've bumped this up from 32 -- mergeFactor15/mergeFactor maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength unlockOnStartupfalse/unlockOnStartup /mainIndex updateHandler class=solr.DirectUpdateHandler2 autoCommit maxTime6/maxTime !-- I've experimented with various times here as well -- maxDocs25000/maxDocs !-- I've experimented with 25k, 500k, 100k -- /autoCommit maxPendingDeletes10/maxPendingDeletes /updateHandler What gets tricky is finding the sweet spot with these parameters, but wondering if anybody has any recommendations for an optimal config. Also, regarding autoCommit, I've even turned that feature off, but my heap size reaches its max sooner. I am wondering though, what would be the difference with autoCommit and passing in the commit=true param on each import query. Thanks in advance! Mike
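For example (the pid and file names are placeholders, and the dump is best taken while the import is running and the heap is growing):

# dump the live heap of the Solr JVM
jmap -dump:live,format=b,file=solr-heap.bin 12345

# browse the dump at http://localhost:7000 (give jhat plenty of heap of its own)
jhat -J-Xmx4g solr-heap.bin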
Re: Parallel Import Process on same core. Solr 3.5
On 6/26/2013 10:58 AM, Mike L. wrote: Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min , 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions. http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting1]clean=true http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-importentity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Thanks for including some solrconfig snippets, but I think what we really need is your DIH configuration(s). Use a pastebin site and choose the proper document type. http://apaste.info is available and the proper type there would be (X)HTML. If you need to sanitize these to remove host/user/pass, please replace the values with something else rather than deleting them entirely. With full-import, clean defaults to true, so including it doesn't change anything. What I would actually do is have clean=true on the first import you run, then after waiting a few seconds to be sure it is running, start the others with clean=false so that they don't do ANOTHER clean. I suspect that you might be running into JDBC driver behavior where the entire result set is being buffered into RAM. Thanks, Shawn
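If the source database happens to be MySQL, the usual suggestion for that buffering problem is batchSize="-1" on the JdbcDataSource, which makes the driver stream rows instead of materializing the whole result set; the connection details below are placeholders:

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://dbhost/dbname"
            user="solr"
            password="*****"
            batchSize="-1"/>

Other drivers have their own fetch-size knobs, so this is only a sketch of the general idea.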
Solr 4.2.1 - master taking long time to respond after tomcat restart
Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields as stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. Noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only master would take this long?
Re: Solr 4.2.1 - master taking long time to respond after tomcat restart
On 6/26/2013 11:18 AM, Arun Rangarajan wrote: Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields as stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. Noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only master would take this long? Classic problem after enabling the updateLog: http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup Thanks, Shawn
Need help with indexing names in a pdf
We receive about 100 documents a day of various sizes. The documents could pertain to any of the 40,000 contacts stored in our database, and could involve more than one. For each file we have, we maintain a list of contacts that are related to or involved in that file. I know it will never be exact, but I'd like to index possible names in the text, and then attempt to identify which files the document might pertain to, by looking at files that are tied to contacts mentioned in the document. I've found some regex code to parse names from the text, but does anyone have any ideas on how to set up the index? There are currently approximately 900,000 documents in our library. --Warren
Re: Solr 4.2.1 - master taking long time to respond after tomcat restart
You need to do occasional hard commits, otherwise the update log just grows and grows and gets replayed on each server start. -- Jack Krupansky -Original Message- From: Arun Rangarajan Sent: Wednesday, June 26, 2013 1:18 PM To: solr-user@lucene.apache.org Subject: Solr 4.2.1 - master taking long time to respond after tomcat restart Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields as stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. Noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only master would take this long?
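A sketch of the sort of setting Jack and Shawn are pointing at, for solrconfig.xml on the master; the interval is only illustrative, and openSearcher=false keeps the periodic hard commit from affecting search visibility or caches while still truncating the transaction log:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>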
Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads
Thank you Erick! Will look at all these suggestions. -Vinay On Wed, Jun 26, 2013 at 6:37 AM, Erick Erickson erickerick...@gmail.comwrote: Right, unfortunately this is a gremlin lurking in the weeds, see: http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock There are a couple of ways to deal with this: 1 go ahead and up the limit and re-compile, if you look at SolrCmdDistributor the semaphore is defined there. 2 https://issues.apache.org/jira/browse/SOLR-4816 should address this as well as improve indexing throughput. I'm totally sure Joel (the guy working on this) would be thrilled if you were able to verify that these two points, I'd ask him (on the JIRA) whether he thinks it's ready to test. 3 Reduce the number of threads you're indexing with 4 index docs in small packets, perhaps even one and just rack together a zillion threads to get throughput. FWIW, Erick On Tue, Jun 25, 2013 at 8:55 AM, Vinay Pothnis poth...@gmail.com wrote: Jason and Scott, Thanks for the replies and pointers! Yes, I will consider the 'maxDocs' value as well. How do i monitor the transaction logs during the interval between commits? Thanks Vinay On Mon, Jun 24, 2013 at 8:48 PM, Jason Hellman jhell...@innoventsolutions.com wrote: Scott, My comment was meant to be a bit tongue-in-cheek, but my intent in the statement was to represent hard failure along the lines Vinay is seeing. We're talking about OutOfMemoryException conditions, total cluster paralysis requiring restart, or other similar and disastrous conditions. Where that line is is impossible to generically define, but trivial to accomplish. What any of us running Solr has to achieve is a realistic simulation of our desired production load (probably well above peak) and to see what limits are reached. Armed with that information we tweak. In this case, we look at finding the point where data ingestion reaches a natural limit. For some that may be JVM GC, for others memory buffer size on the client load, and yet others it may be I/O limits on multithreaded reads from a database or file system. In old Solr days we had a little less to worry about. We might play with a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial commits and rollback recoveries. But with 4.x we now have more durable write options and NRT to consider, and SolrCloud begs to use this. So we have to consider transaction logs, the file handles they leave open until commit operations occur, and how we want to manage writing to all cores simultaneously instead of a more narrow master/slave relationship. It's all manageable, all predictable (with some load testing) and all filled with many possibilities to meet our specific needs. Considering hat each person's data model, ingestion pipeline, request processors, and field analysis steps will be different, 5 threads of input at face value doesn't really contemplate the whole problem. We have to measure our actual data against our expectations and find where the weak chain links are to strengthen them. The symptoms aren't necessarily predictable in advance of this testing, but they're likely addressable and not difficult to decipher. For what it's worth, SolrCloud is new enough that we're still experiencing some uncharted territory with unknown ramifications but with continued dialog through channels like these there are fewer territories without good cartography :) Hope that's of use! 
Jason On Jun 24, 2013, at 7:12 PM, Scott Lundgren scott.lundg...@carbonblack.com wrote: Jason, Regarding your statement push you over the edge- what does that mean? Does it mean uncharted territory with unknown ramifications or something more like specific, known symptoms? I ask because our use is similar to Vinay's in some respects, and we want to be able to push the capabilities of write perf - but not over the edge! In particular, I am interested in knowing the symptoms of failure, to help us troubleshoot the underlying problems if and when they arise. Thanks, Scott On Monday, June 24, 2013, Jason Hellman wrote: Vinay, You may wish to pay attention to how many transaction logs are being created along the way to your hard autoCommit, which should truncate the open handles for those files. I might suggest setting a maxDocs value in parallel with your maxTime value (you can use both) to ensure the commit occurs at either breakpoint. 30 seconds is plenty of time for 5 parallel processes of 20 document submissions to push you over the edge. Jason On Jun 24, 2013, at 2:21 PM, Vinay Pothnis poth...@gmail.com wrote: I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds. On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman jhell...@innoventsolutions.com
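A rough sketch of option 4 above (indexing in small packets) with SolrJ; the batch size is only a starting point to tune, and commits are left to autoCommit/commitWithin:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    private static final int BATCH_SIZE = 100;  // deliberately small packets

    public static void index(SolrServer server, Iterable<SolrInputDocument> docs)
            throws SolrServerException, IOException {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                server.add(batch);  // send the packet, then reuse the list
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
    }
}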
Re: Querying multiple collections in SolrCloud
Thanks Erick, that's a very helpful answer. Regarding the grouping option, does that require all the docs to be put into a single collection, or could it be done with across N collections (assuming each collection had a common type field for grouping on)? Chris On Wed, Jun 26, 2013 at 7:01 AM, Erick Erickson erickerick...@gmail.comwrote: bq: Would the above setup qualify as multiple compatible collections No. While there may be enough fields in common to form a single query, the TF/IDF calculations will not be compatible and the scores from the various collections will NOT be comparable. So simply getting the list of top N docs will probably be dominated by the docs from a single type. bq: How does SolrCloud combine the query results from multiple collections? It doesn't. SolrCloud sorts the results from multiple nodes in the _same_ collection according to whatever sort criteria are specified, defaulting to score. Say you ask for the top 20 docs. A node from each shard returns the top 20 docs for that shard. The node processing them just merges all the returned lists and only keeps the top 20. I don't think your last two questions are really relevant, SolrCloud isn't built to query multiple collections and return the results coherently. The root problem here is that you're trying to compare docs from different collections for goodness to return the top N. This isn't actually hard _except_ when goodness is the score, then it just doesn't work. You can't even compare scores from different queries on the _same_ collection, much less different ones. Consider two collections, books and songs. One consists of lots and lots of text and the ter frequency and inverse doc freq (TF/IDF) will be hugely different than songs. Not to mention field length normalization. Now, all that aside there's an option. Index all the docs in a single collection and use grouping (aka field collapsing) to get a single response that has the top N docs from each type (they'll be in different sections of the original response) and present them to the user however makes sense. You'll get hands on experience in why this isn't something that's easy to do automatically if you try to sort these into a single list by relevance G... Best Erick On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey ctoo...@gmail.com wrote: Thanks Jack for the alternatives. The first is interesting but has the downside of requiring multiple queries to get the full matching docs. The second is interesting and very simple, but has the downside of not being modular and being difficult to configure field boosting when the collections have overlapping field names with different boosts being needed for the same field in different document types. I'd still like to know about the viability of my original approach though too. Chris On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky j...@basetechnology.com wrote: One simple scenario to consider: N+1 collections - one collection per document type with detailed fields for that document type, and one common collection that indexes a subset of the fields. The main user query would be an edismax over the common fields in that main collection. You can then display summary results from the common collection. You can also then support drill down into the type-specific collection based on a type field for each document in the main collection. 
Or, sure, you actually CAN index multiple document types in the same collection - add all the fields to one schema - there is no time or space penalty if most of the field are empty for most documents. -- Jack Krupansky -Original Message- From: Chris Toomey Sent: Tuesday, June 25, 2013 6:08 PM To: solr-user@lucene.apache.org Subject: Querying multiple collections in SolrCloud Hi, I'm investigating using SolrCloud for querying documents of different but similar/related types, and have read through docs. on the wiki and done many searches in these archives, but still have some questions. Thanks in advance for your help. Setup: * Say that I have N distinct types of documents and I want to do queries that return the best matches regardless document type. I.e., something akin to a Google search where I'd like to get the best matches from the web, news, images, and maps. * Our main use case is supporting simple user-entered searches, which would just contain terms / phrases and wouldn't specify fields. * The document types will not all have the same fields, though there may be some overlap in the fields. * We plan to use a separate collection for each document type, and to use the eDisMax query parser. Each collection would have a document-specific schema configuration with appropriate defaults for query fields and boosts, etc. Questions: * Would the above setup qualify as multiple compatible collections, such
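For reference, a grouped request along the lines Erick describes might look like this, assuming everything is indexed into one collection with a single-valued doc_type field (collection and field names are placeholders):

http://localhost:8983/solr/combined/select?q=user+query&group=true&group.field=doc_type&group.limit=5

Each doc_type comes back as its own group with its own top 5 documents, so the per-type result lists never have to be merged by score.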
Solr document auto-upload?
Is it possible to configure Solr to automatically grab documents in a specified directory, without having to use the post command? I've not found any way to do this, though admittedly, I'm not terribly experienced with config files of this type. Thanks! - | A.Spielman | In theory there is no difference between theory and practice. In practice there is. - Chuck Reid -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-document-auto-upload-tp4073373.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need help with indexing names in a pdf
This kind of text processing is called entity extraction. I'm not up to date on what is available in Solr, but search on that. wunder On Jun 26, 2013, at 10:26 AM, Warren H. Prince wrote: We receive about 100 documents a day of various sizes. The documents could pertain to any of 40,000 contacts stored in our database, and could include more than one. For each file we have, we maintain a list of contacts that are related to or involved in that file. I know it will never be exact, but I'd like to index possible names in the text, and then attempt to identify which files the document might pertain to, looking at the files that are tied to contacts contained in the document. I've found some regex code to parse names from the text, but does anyone have any ideas on how to set up the index? There are currently approximately 900,000 documents in our library. --Warren
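For what it's worth, a rough sketch of the pipeline Warren describes; the name pattern is deliberately naive and the possible_names field is a hypothetical multiValued string field, not something defined in this thread:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class NameIndexer {
        // naive "Firstname Lastname" matcher; a real entity extractor is far more robust
        private static final Pattern NAME = Pattern.compile("\\b[A-Z][a-z]+\\s+[A-Z][a-z]+\\b");

        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/documents");

            String docId = "file-123";   // hypothetical document id
            String text = "Met with John Smith and Mary Jones about the estate.";

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId);
            doc.addField("text", text);
            Matcher m = NAME.matcher(text);
            while (m.find()) {
                // possible_names: assumed multiValued string field used to join back to contacts
                doc.addField("possible_names", m.group());
            }
            solr.add(doc);
            solr.commit();
        }
    }

Matching possible_names against the 40,000 known contacts can then be a per-document filter query, though a proper entity extractor would cut down the false positives considerably.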
OOM killer script woes
Recently upgraded to 4.3.1 but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty: -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the script gets called and does what I expect! However, once the Solr webapp initializes and Solr is happily responding to updates and queries, the script doesn't actually get invoked when an OOM occurs! All I see is the following in the stdout/stderr log of my process: # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p # Executing /bin/sh -c /home/solr/oom_killer.sh 83 21358... The oom_killer.sh script doesn't actually get called! So to recap, it works if an OOM occurs during initialization, but once Solr is running, the OOM killer doesn't fire correctly. This leads me to believe my script is fine and there's something else going wrong. Here's the oom_killer.sh script (pretty basic): #!/bin/bash SOLR_PORT=$1 SOLR_PID=$2 NOW=$(date +%Y%m%d_%H%M) ( echo Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT kill -9 $SOLR_PID echo Killed process $SOLR_PID exec /home/solr/solr-dg/dg-solr.sh recover $SOLR_PORT echo Restarted Solr on 89$SOLR_PORT after OOM ) | tee oom_killer-89$SOLR_PORT-$NOW.log Anyone see anything like this before? Suggestions on where to begin tracking down this issue? Cheers, Tim
Is there a way to build indexes using SOLRJ without SOLR instance?
I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build an index without depending on a running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program, which in turn creates index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr document auto-upload?
Take a look at LucidWorks Search for automated crawler scheduling: http://docs.lucidworks.com/display/help/Create+or+Edit+a+Schedule http://docs.lucidworks.com/display/lweug/Data+Source+Schedules ManifoldCF also has crawler job scheduling: http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html I think the general idea on Unix is that cron is the obvious way to schedule periodic operations. You could certainly do a custom request handler that initializes with a thread on a timer and initiates custom directory crawling of your own. But there is no such feature directly implemented in Solr. -- Jack Krupansky -Original Message- From: aspielman Sent: Wednesday, June 26, 2013 2:16 PM To: solr-user@lucene.apache.org Subject: Solr document auto-upload? Is it possible to to configure Solr to automatically grab documents in a specidfied directory, with having to use the post command? I've not found any way to do this, though admittedly, I'm not terribly experienced with config files of this type. Thanks! - | A.Spielman | In theory there is no difference between theory and practice. In practice there is. - Chuck Reid -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-document-auto-upload-tp4073373.html Sent from the Solr - User mailing list archive at Nabble.com.
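In the absence of a built-in watcher, a small external SolrJ job run from cron stands in for the post command. This is only a sketch: the drop directory, core name and use of the extracting handler (Solr Cell) are assumptions, and addFile's exact signature varies a little between SolrJ releases:

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class DirectoryUploader {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            File[] files = new File("/data/incoming").listFiles();   // assumed drop directory
            if (files == null) return;

            for (File f : files) {
                // push each file through the extracting handler
                ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                req.addFile(f, "application/octet-stream");
                req.setParam("literal.id", f.getName());
                solr.request(req);
                // move or delete f here so it is not re-indexed on the next run
            }
            solr.commit();
        }
    }

Scheduled from cron every few minutes, this behaves like an auto-upload without any Solr-side configuration.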
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
Yes, it is possible by running an embedded Solr inside the SolrJ process. The nice thing is that the index is portable, so you can then access it from the standalone Solr server later. I have an example here: https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj , which shows SolrJ running both as a client and with an embedded container. Notice that you will probably need more jars than you expect for the standalone Solr to work, including a number of servlet jars. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Jun 26, 2013 at 2:59 PM, Learner bbar...@gmail.com wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
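A minimal sketch of that embedded approach, assuming a standard solr home layout with a core named collection1; the CoreContainer bootstrap differs slightly across 4.x releases, so treat this as illustrative rather than exact:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    public class OfflineIndexer {
        public static void main(String[] args) throws Exception {
            // solr home contains solr.xml plus collection1/conf/{solrconfig.xml,schema.xml}
            CoreContainer container = new CoreContainer("/path/to/solr-home");
            container.load();
            SolrServer solr = new EmbeddedSolrServer(container, "collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "indexed without a running Solr server");
            solr.add(doc);
            solr.commit();

            container.shutdown();   // the index under collection1/data is now portable
        }
    }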
Re: Parallel Import Process on same core. Solr 3.5
Thanks for the response. Here's the scrubbed version of my DIH: http://apaste.info/6uGH It contains everything I'm more or less doing...pretty straight forward.. One thing to note and I don't know if this is a bug or not, but the batchSize=-1 streaming feature doesn't seem to work, at least with informix jdbc drivers. I set the batchsize to 500, but have tested it with various numbers including 5000, 1. I'm aware that behind the scenes this should be just setting the fetchsize, but it's a bit puzzling why I don't see a difference regardless of what value I actually use. I was told by one of our DBA's that our value is set as a global DB param and can't be modified (which I haven't looked into afterward.) As far as HEAP patterns, I watch the process via WILY and notice GC occurs every 15 minutes or so, but becomes infrequent and not as significant as the previous one. It's almost as if some memory is never released until it eventually catches up to the max heap size. I did assume that perhaps there could have been some locking issues, which is why I made the following modifications: readOnly=true transactionIsolation=TRANSACTION_READ_UNCOMMITTED What do you recommend for the mergeFactor, ramBufferSize and autoCommit options? My general understanding is the higher the mergeFactor, the less frequent the merges, which should improve index time but slow down query response time. I also read somewhere that an increase on the ramBufferSize should help prevent frequent merges...but confused why I didn't really see an improvement...perhaps my combination of these values wasn't right in relation to my total fetch size. Also - my impression is the lower the autoCommit maxDocs/maxTime numbers (i.e. the defaults) the better on memory management, but at a cost on index time as you pay for the overhead of committing. That is a number I've been experimenting with as well and have seen some variations in heap trends but unfortunately, have not completed the job quite yet with any config... I did get very close.. I'd hate to throw additional memory at the problem if there is something else I can tweak.. Thanks! Mike From: Shawn Heisey s...@elyograg.org To: solr-user@lucene.apache.org Sent: Wednesday, June 26, 2013 12:13 PM Subject: Re: Parallal Import Process on same core. Solr 3.5 On 6/26/2013 10:58 AM, Mike L. wrote: Hello, I'm trying to execute a parallel DIH process and running into heap related issues, hoping somebody has experienced this and can recommend some options.. Using Solr 3.5 on CentOS. Currently have JVM heap 4GB min, 8GB max When executing the entities in a sequential process (entities executing in sequence by default), my heap never exceeds 3GB. When executing the parallel process, everything runs fine for roughly an hour, then I reach the 8GB max heap size and the process stalls/fails. More specifically, here's how I'm executing the parallel import process: I target a logical range (i.e. WHERE some field BETWEEN 'SOME VALUE' AND 'SOME VALUE') within my entity queries. And within Solrconfig.xml, I've created corresponding data import handlers, one for each of these entities. My total rows fetch/count is 9M records. And when I initiate the import, I call each one, similar to the below (obviously I've stripped out my server naming conventions).
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2] I assume that when doing this, only the first import request needs to contain the clean=true param. I've divided each import query to target roughly the same amount of data, and in solrconfig, I've tried various things in hopes to reduce heap size. Thanks for including some solrconfig snippets, but I think what we really need is your DIH configuration(s). Use a pastebin site and choose the proper document type. http://apaste.info/ is available and the proper type there would be (X)HTML. If you need to sanitize these to remove host/user/pass, please replace the values with something else rather than deleting them entirely. With full-import, clean defaults to true, so including it doesn't change anything. What I would actually do is have clean=true on the first import you run, then after waiting a few seconds to be sure it is running, start the others with clean=false so that they don't do ANOTHER clean. I suspect that you might be running into JDBC driver behavior where the entire result set is being buffered into RAM. Thanks, Shawn
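A hedged sketch of kicking the imports off the way Shawn suggests, with clean=true only on the first handler; the handler names are placeholders rather than anything defined in this thread:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ParallelImportKickoff {
        public static void main(String[] args) throws Exception {
            String base = "http://localhost:8080/solr/collection1";
            // placeholder DIH handler names, one per entity/range
            String[] handlers = {"dataimport1", "dataimport2", "dataimport3"};

            for (int i = 0; i < handlers.length; i++) {
                // only the first import cleans the index; the rest must say clean=false explicitly
                String clean = (i == 0) ? "true" : "false";
                URL url = new URL(base + "/" + handlers[i] + "?command=full-import&clean=" + clean);
                HttpURLConnection con = (HttpURLConnection) url.openConnection();
                System.out.println(handlers[i] + " -> HTTP " + con.getResponseCode());
                con.disconnect();
                Thread.sleep(5000);   // give the first import a head start before the others begin
            }
        }
    }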
How to get values of external file field(s) in Solr query?
http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELD&fq=id:(doc1 doc2 doc3)&fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Is there a syntax trick that will work where the value of the ext file field does not affect the score of the main query, but I can still retrieve its value? Also is it possible to retrieve the values of more than one external file field in a single query?
Re: Solr 4.2.1 - master taking long time to respond after tomcat restart
Thanks, Shawn & Jack. I will go with the wiki and use autoCommit with openSearcher set to false. On Wed, Jun 26, 2013 at 10:23 AM, Jack Krupansky j...@basetechnology.com wrote: You need to do occasional hard commits, otherwise the update log just grows and grows and gets replayed on each server start. -- Jack Krupansky -Original Message- From: Arun Rangarajan Sent: Wednesday, June 26, 2013 1:18 PM To: solr-user@lucene.apache.org Subject: Solr 4.2.1 - master taking long time to respond after tomcat restart Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates, we enabled updateLog and made the few unstored int and boolean fields stored. We have a single master and a single slave and all the queries go only to the slave. We make only max. 50 atomic update requests/hour to the master. We are noticing that on restarting tomcat, the master Solr server takes several minutes to respond. This was not happening in 3.6.1. The slave is responding as quickly as before after restarting tomcat. Any ideas why only the master would take this long?
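Configuring autoCommit with openSearcher=false in solrconfig.xml is the usual fix; as a hedged alternative sketch, the client that sends the atomic updates can simply issue an occasional explicit hard commit itself (the URL and interval here are assumptions):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class PeriodicHardCommit {
        public static void main(String[] args) throws Exception {
            SolrServer master = new HttpSolrServer("http://master-host:8080/solr/core1");
            while (true) {
                // a plain commit() is a hard commit: it flushes segments and keeps the
                // update log from growing without bound between restarts
                master.commit();
                Thread.sleep(15 * 60 * 1000L);   // every 15 minutes
            }
        }
    }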
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
AFAIK solrj is just the network client that connects to a Solr server using Java, now, if you just need to index your data on your local HDD you might want to step back to Lucene. I'm assuming you are using Java so you could also annotate your POJO's with Lucene annotations, google hibernate-search, maybe that's what you are looking for. HTH, Guido. On 26/06/13 19:59, Learner wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: OOM killer script woes
Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and throwing it/packaging it as a java.lang.RuntimeException. The -XX option assumes that the application doesn't handle the Errors and so they would reach the JVM and thus invoke the handler. Since Jetty has an exception handler that is dealing with anything (including Errors), they never reach the JVM, hence no handler. Not much we can do short of not using Jetty? That's a pain, I'd just written a nice OOM handler too! On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote: A little more to this ... Just on chance this was a weird Jetty issue or something, I tried with the latest 9 and the problem still occurs :-( This is on Java 7 on debian: java version 1.7.0_21 Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here is an example stack trace from the log 2013-06-26 19:31:33,801 [qtp632640515-62] ERROR solr.servlet.SolrDispatchFilter Q:22 - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:445) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.OutOfMemoryError: Java heap space On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com wrote: Recently upgraded to 4.3.1 but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty: -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the script gets called and does what I expect! However, once the Solr webapp initializes and Solr is happily responding to updates and queries.
When an OOM occurs in this situation, then the script doesn't actually get invoked! All I see is the following in the stdout/stderr log of my process: # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p # Executing /bin/sh -c /home/solr/oom_killer.sh 83 21358... The oom_killer.sh script doesn't actually get called! So to recap, it works if an OOM occurs during initialization but once Solr is running, the OOM killer doesn't fire correctly. This leads me to believe my script is fine and there's something else going wrong. Here's the oom_killer.sh script (pretty basic): #!/bin/bash SOLR_PORT=$1 SOLR_PID=$2 NOW=$(date +%Y%m%d_%H%M) ( echo Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT kill -9 $SOLR_PID echo Killed process $SOLR_PID exec /home/solr/solr-dg/dg-solr.sh recover $SOLR_PORT echo Restarted Solr on 89$SOLR_PORT after OOM ) | tee oom_killer-89$SOLR_PORT-$NOW.log Anyone see anything like this before? Suggestions on where to begin tracking down this issue? Cheers, Tim
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
Never heard of embedded Solr server, isn't it better to just use Lucene alone for that purpose? Using a helper like Hibernate? Since most applications that require indexes will have a relational DB behind the scene, it would not be a bad idea to use an ORM combined with Lucene annotations (aka hibernate-search) Guido. On 26/06/13 20:30, Alexandre Rafalovitch wrote: Yes, it is possible by running an embedded Solr inside SolrJ process. The nice thing is that the index is portable, so you can then access it from the standalone Solr server later. I have an example here: https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj , which shows SolrJ running both as a client and with an embedded container. Notice that you will probably need more jars than you expect for the standalone Solr to work, including a number of servlet jars. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Jun 26, 2013 at 2:59 PM, Learner bbar...@gmail.com wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Parallel Import Process on same core. Solr 3.5
On 6/26/2013 1:36 PM, Mike L. wrote: Here's the scrubbed version of my DIH: http://apaste.info/6uGH It contains everything I'm more or less doing...pretty straight forward.. One thing to note and I don't know if this is a bug or not, but the batchSize=-1 streaming feature doesn't seem to work, at least with informix jdbc drivers. I set the batchsize to 500, but have tested it with various numbers including 5000, 1. I'm aware that behind the scenes this should be just setting the fetchsize, but its a bit puzzling why I don't see a difference regardless of what value I actually use. I was told by one of our DBA's that our value is set as a global DB param and can't be modified (which I haven't looked into afterward.) Setting the batchSize to -1 causes DIH to set fetchSize to Integer.MIN_VALUE (around negative two billion), which seems to be a MySQL-specific hack to enable result streaming. I've never heard of it working on any other JDBC driver. Assuming that the Informix JDBC driver is actually honoring the fetchSize, setting batchSize in the DIH config should be enough. If it's not, then it's a bug in the JDBC driver or possibly a server misconfiguration. As far as HEAP patterns, I watch the process via WILY and notice GC occurs every 15min's or so, but becomes infrequent and not as significant as the previous one. It's almost as if some memory is never released until it eventually catches up to the max heap size. I did assume that perhaps there could have been some locking issues, which is why I made the following modifications: readOnly=true transactionIsolation=TRANSACTION_READ_UNCOMMITTED I can't really comment here. It does appear that the Informix JDBC driver is not something you can download from IBM's website without paying them money. I would suggest going to IBM (or an informix-related support avenue) for some help, ESPECIALLY if you've paid money for it. What do you recommend for the mergeFactor,ramBufferSize and autoCommit options? My general understanding is the higher the mergeFactor, the less frequent merges which should improve index time, but slow down query response time. I also read somewhere that an increase on the ramBufferSize should help prevent frequent merges...but confused why I didn't really see an improvement...perhaps my combination of these values wasn't right in relation to my total fetch size. Of these, ramBufferSizeMB is the only one that should have a *significant* effect on RAM usage, and at a value of 100, I would not expect there to be a major issue unless you are doing a lot of imports at the same time. Because you are using Solr 3.5, if you do not need your import results to be visible until the end, I wouldn't worry about using autoCommit. If you were using Solr 4.x, I would recommend that you turn autoCommit on, but with openSearcher set to false. Also- my impression is the lower the autoCommit maxDocs/maxTime numbers (i.e the defaults) the better on memory management, but cost on index time as you pay for the overhead of committing. That is a number I've been experimenting with as well and have scene some variations in heap trends but unfortunately, have not completed the job quite yet with any config... I did get very close.. I'd hate to throw additional memory at the problem if there is something else I can tweak.. General impressions: Unless the amount of data involved in each Solr document is absolutely enormous, this is very likely bugs (memory leaks or fetchSize problems) in the Informix JDBC driver. 
I did find the following page, but it's REALLY REALLY old, which hopefully means that it doesn't apply. http://www-01.ibm.com/support/docview.wss?uid=swg21260832 If your documents ARE huge, then you probably need to give more memory to the java heap ... but you might still have memory leak bugs in the JDBC driver. When it comes to Java and Lucene/Solr, IBM has a *terrible* track record, especially for people using the IBM Java VM. I would not be surprised if their JDBC driver is plagued by similar problems. If you do find a support resource and they tell you that you should change your JDBC code to work differently, then you need to tell them that you can't change the JDBC code and that they need to give you a configuration URL workaround. Here's another possibility of a bug that causes memory leaks: http://www-01.ibm.com/support/docview.wss?uid=swg1IC58469 You might ask whether the problem could be a memory leak in Solr. It's always possible, but I've had a lot of experience with DIH from MySQL on Solr 1.4.0, 1.4.1, 3.2.0, 3.5.0, and 4.2.1. I've never seen any signs of a leak. Thanks, Shawn
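For reference, DIH's batchSize is essentially handed to the JDBC driver as the fetch size, so the behaviour under discussion boils down to whether the driver honors a plain fetch-size hint. A sketch in plain JDBC, with the Informix-style URL, credentials and table purely illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FetchSizeDemo {
        public static void main(String[] args) throws Exception {
            // placeholder JDBC URL and credentials
            Connection con = DriverManager.getConnection("jdbc:informix-sqli://host:1526/db", "user", "pass");
            con.setReadOnly(true);

            Statement stmt = con.createStatement();
            // this is roughly what DIH does with batchSize="500": a hint the driver may ignore
            stmt.setFetchSize(500);
            // MySQL's streaming trick is setFetchSize(Integer.MIN_VALUE); most other drivers ignore or reject it
            ResultSet rs = stmt.executeQuery("SELECT id, name FROM contacts");
            while (rs.next()) {
                System.out.println(rs.getString("id"));
            }
            rs.close();
            stmt.close();
            con.close();
        }
    }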
Re: Is it possible to search Solr with a longer query string?
Oh this is good! On Wed, Jun 26, 2013 at 12:05 PM, Shawn Heisey s...@elyograg.org wrote: On 6/25/2013 6:15 PM, Jack Krupansky wrote: Are you using Tomcat? See: http://wiki.apache.org/solr/SolrTomcat#Enabling_Longer_Query_Requests Enabling Longer Query Requests If you try to submit too long a GET query to Solr, then Tomcat will reject your HTTP request on the grounds that the HTTP header is too large; symptoms may include an HTTP 400 Bad Request error or (if you execute the query in a web browser) a blank browser window. If you need to enable longer queries, you can set the maxHttpHeaderSize attribute on the HTTP Connector element in your server.xml file. The default value is 4K. (See http://tomcat.apache.org/tomcat-5.5-doc/config/http.html) Even better would be to force SolrJ to use a POST request. In newer versions (4.1 and later) Solr sets the servlet container's POST buffer size and defaults it to 2MB. In older versions, you'd have to adjust this in your servlet container config, but the default should be considerably larger than the header buffer used for GET requests. I thought that SolrJ used POST by default, but after looking at the code, it seems that I was wrong. Here's how to send a POST query: response = server.query(query, METHOD.POST); The import required for this is: import org.apache.solr.client.solrj.SolrRequest.METHOD; Gary, if you can avoid it, you should not be creating a new HttpSolrServer object every time you make a query. It is completely thread-safe, so create a singleton and use it for all queries against the medline core. Thanks, Shawn
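Putting Shawn's two suggestions together, a small sketch; the core name and query text are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrRequest.METHOD;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class MedlineSearcher {
        // one thread-safe instance reused for every query against the core
        private static final HttpSolrServer SERVER =
                new HttpSolrServer("http://localhost:8983/solr/medline");

        public static QueryResponse search(String veryLongQueryString) throws Exception {
            SolrQuery query = new SolrQuery(veryLongQueryString);
            // POST keeps long query strings out of the HTTP request line/headers entirely
            return SERVER.query(query, METHOD.POST);
        }
    }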
Re: How to get values of external file field(s) in Solr query?
The only way is using a frange (function range) query: q={!frange l=0 u=10}my_external_field Will pull out documents that have your external field with a value between zero and 10. Upayavira On Wed, Jun 26, 2013, at 09:02 PM, Arun Rangarajan wrote: http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELDfq=id:(doc1 doc2 doc3)fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Is there a syntax trick that will work where the value of the ext file field does not affect the score of the main query, but I can still retrieve its value? Also is it possible to retrieve the values of more than one external file field in a single query?
Re: How to get values of external file field(s) in Solr query?
On Wed, Jun 26, 2013 at 4:02 PM, Arun Rangarajan arunrangara...@gmail.com wrote: http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELDfq=id:(doc1 doc2 doc3)fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Pseudo-fields allow you to retrieve the value for any arbitrary function per returned document. Should work here, but I haven't tried it. fl=id, score, field(EXT_FILE_FIELD) or you can alias it: fl=id, score, myfield:field(EXT_FILE_FIELD) -Yonik http://lucidworks.com
Re: How to get values of external file field(s) in Solr query?
Yonik, Thanks, your answer works! On Wed, Jun 26, 2013 at 2:07 PM, Yonik Seeley yo...@lucidworks.com wrote: On Wed, Jun 26, 2013 at 4:02 PM, Arun Rangarajan arunrangara...@gmail.com wrote: http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes says this about external file fields: They can be used only for function queries or display. I understand how to use them in function queries, but how do I retrieve the values for display? If I want to fetch only the values of a single external file field for a set of primary keys, I can do: q=_val_:EXT_FILE_FIELDfq=id:(doc1 doc2 doc3)fl=id,score For this query, the score is the value of the external file field. But how to get the values for docs that match some arbitrary query? Pseudo-fields allow you to retrieve the value for any arbitrary function per returned document. Should work here, but I haven't tried it. fl=id, score, field(EXT_FILE_FIELD) or you can alias it: fl=id, score, myfield:field(EXT_FILE_FIELD) -Yonik http://lucidworks.com
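Pulling the two answers together, a sketch of fetching several external file field values alongside an arbitrary query via SolrJ; the core name and the OTHER_EXT_FILE_FIELD name are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class ExternalFieldValues {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("some arbitrary query");
            // pseudo-fields return each external file field's value without touching the score
            q.setFields("id", "score",
                    "eff1:field(EXT_FILE_FIELD)",
                    "eff2:field(OTHER_EXT_FILE_FIELD)");

            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("eff1"));
            }
        }
    }

Upayavira's {!frange} form is the one to reach for when you want to filter or select by the external value rather than just display it.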
Configuring Solr to retrieve documents?
Is it possible to configure Solr to automatically grab documents in a specified directory, without having to use the post command? I've not found any way to do this, though admittedly, I'm not terribly experienced with config files of this type. Thanks! - | A.Spielman | In theory there is no difference between theory and practice. In practice there is. - Chuck Reid -- View this message in context: http://lucene.472066.n3.nabble.com/Configuring-Solr-to-retrieve-documents-tp4073372.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
On Wed, Jun 26, 2013 at 4:43 PM, Guido Medina guido.med...@temetra.comwrote: Never heard of embedded Solr server, I guess that's the exciting part about Solr. Always more nuances to learn: https://wiki.apache.org/solr/EmbeddedSolr :-) Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: OOM killer script woes
Thanks for the feedback Daniel ... For now, I've opted to just kill the JVM with System.exit(1) in the SolrDispatchFilter code and will restart it with a Linux supervisor. Not elegant but the alternative of having a zombie Solr instance walking around my cluster is much worse ;-) Will try to dig into the code that is trapping this error but for now I've lost too many hours on this problem. Cheers, Tim On Wed, Jun 26, 2013 at 2:43 PM, Daniel Collins danwcoll...@gmail.com wrote: Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and throwing it/packaging it as a java.lang.RuntimeException. The -XX option assumes that the application doesn't handle the Errors and so they would reach the JVM and thus invoke the handler. Since Jetty has an exception handler that is dealing with anything (included Errors), they never reach the JVM, hence no handler. Not much we can do short of not using Jetty? That's a pain, I'd just written a nice OOM handler too! On 26 June 2013 20:37, Timothy Potter thelabd...@gmail.com wrote: A little more to this ... Just on chance this was a weird Jetty issue or something, I tried with the latest 9 and the problem still occurs :-( This is on Java 7 on debian: java version 1.7.0_21 Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here is an example stack trace from the log 2013-06-26 19:31:33,801 [qtp632640515-62] ERROR solr.servlet.SolrDispatchFilter Q:22 - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:445) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.OutOfMemoryError: Java heap space On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter thelabd...@gmail.com wrote: Recently 
upgraded to 4.3.1 but this problem has persisted for a while now ... I'm using the following configuration when starting Jetty: -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p If an OOM is triggered during Solr web app initialization (such as by me lowering -Xmx to a value that is too low to initialize Solr with), then the script gets called and does what I expect! However, once the Solr webapp initializes and Solr is happily responding to updates and queries. When an OOM occurs in this situation, then the script doesn't actually get invoked! All I see is the following in the stdout/stderr log of my process: # # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError=/home/solr/oom_killer.sh 83 %p # Executing /bin/sh -c /home/solr/oom_killer.sh 83 21358... The oom_killer.sh script doesn't actually get called! So to recap, it works if an OOM occurs during initialization but once Solr is running, the OOM killer doesn't fire correctly. This leads me to believe my script is fine and there's something else going wrong. Here's the oom_killer.sh script (pretty basic): #!/bin/bash SOLR_PORT=$1 SOLR_PID=$2 NOW=$(date +%Y%m%d_%H%M) ( echo Running OOM killer script for process $SOLR_PID for Solr on port 89$SOLR_PORT kill -9 $SOLR_PID echo Killed process
Replicating files containing external file fields
From https://wiki.apache.org/solr/SolrReplication I understand that index dir and any files under the conf dir can be replicated to slaves. I want to know if there is any way the files under the data dir containing external file fields can be replicated. These are not replicated by default. Currently we are running the ext file field reload script on both the master and the slave and then running reloadCache on each server once they are loaded.
Re: Is there a way to build indexes using SOLRJ without SOLR instance?
If hibernate search is like regular hibernate ORM I'm not sure I'd trust it to pick the most optimal solutions... Otis Solr & ElasticSearch Support http://sematext.com/ On Jun 26, 2013 4:44 PM, Guido Medina guido.med...@temetra.com wrote: Never heard of embedded Solr server, isn't better to just use lucene alone for that purpose? Using a helper like Hibernate? Since most applications that require indexes will have a relational DB behind the scene, it would not be a bad idea to use a ORM combined with Lucene annotations (aka hibernate-search) Guido. On 26/06/13 20:30, Alexandre Rafalovitch wrote: Yes, it is possible by running an embedded Solr inside SolrJ process. The nice thing is that the index is portable, so you can then access it from the standalone Solr server later. I have an example here: https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj , which shows SolrJ running both as a client and with an embedded container. Notice that you will probably need more jars than you expect for the standalone Solr to work, including a number of servlet jars. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Jun 26, 2013 at 2:59 PM, Learner bbar...@gmail.com wrote: I currently have a SOLRJ program which I am using for indexing the data in SOLR. I am trying to figure out a way to build index without depending on running instance of SOLR. I should be able to supply the solrconfig and schema.xml to the indexing program which in turn create index files that I can use with any SOLR instance. Is it possible to implement this? -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Need Help in migrating Solr version 1.4 to 4.3
Thanks Shawn. To implement the singleton design pattern for SolrServer object creation, I found that there are many ways described at http://en.wikipedia.org/wiki/Singleton_pattern So which is the best one, out of the 5 examples mentioned in the above URL, for a web application in general practice? I am sure lots of people (on this mailing list) will have practical experience as to which type of singleton pattern should be implemented for creating the SolrServer object. Waiting for some comments on this front. Regards Sandeep On Wed, Jun 26, 2013 at 9:20 PM, Shawn Heisey s...@elyograg.org wrote: On 6/25/2013 11:52 PM, Sandeep Gupta wrote: Also in application development side, as I said that I am going to use HTTPSolrServer API and I found that we shouldn't create this object multiple times (as per the wiki document http://wiki.apache.org/solr/Solrj#HttpSolrServer) So I am planning to have my Server class as singleton. Please advice little bit in this front also. This is always the way that SolrServer objects are intended to be used, including CommonsHttpSolrServer in version 1.4. The only major difference between the two objects is that the new one uses HttpComponents 4.x and the old one uses HttpClient 3.x. There are other differences, but they are just the result of incremental improvements from version to version. Thanks, Shawn
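Since HttpSolrServer is thread-safe, any of the standard variants is fine; a sketch using the initialization-on-demand holder idiom, with the URL as a placeholder:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public final class SolrServerHolder {
        private SolrServerHolder() {}

        // lazy, thread-safe initialization without explicit synchronization
        private static class Holder {
            static final HttpSolrServer INSTANCE =
                    new HttpSolrServer("http://localhost:8983/solr/core1");
        }

        public static HttpSolrServer get() {
            return Holder.INSTANCE;
        }
    }

An enum singleton or a plain static final field wired up by the web framework works just as well; the important part is reusing one instance per Solr endpoint for the life of the webapp.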