Re: Solr edismax clarification

2012-02-17 Thread Jan Høydahl
Please provide your full query, including your qf parameter and all other 
request parameters, and also the relevant fields/field-types from schema. Do 
you use stopwords? Can you also add debugQuery=true and paste in the 
parsedQuery?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 16. feb. 2012, at 18:07, Indika Tantrigoda wrote:

 Hi All,
 
 I am using the edismax SearchHandler in my search and I have some issues with
 the search results. As I understand it, if the defaultOperator is set to OR, the
 search query will implicitly be passed as The OR quick OR brown OR fox.
 However, if I search for The quick brown fox, I get fewer results than when
 explicitly adding the OR. Another issue is that if I search for The quick
 brown fox, other documents that contain the word fox are not in the search
 results.
 
 Thanks.



Re: custom scoring

2012-02-17 Thread Carlos Gonzalez-Cadenas
Thanks Em, Robert, Chris for your time and valuable advice. We'll make some
tests and will let you know soon.



On Thu, Feb 16, 2012 at 11:43 PM, Em mailformailingli...@yahoo.de wrote:

 Hello Carlos,

 I think we misunderstood each other.

 As an example:
 BooleanQuery (
  clauses: (
 MustMatch(
   DisjunctionMaxQuery(
   TermQuery(stopword_field, barcelona),
   TermQuery(stopword_field, hoteles)
   )
 ),
 ShouldMatch(
  FunctionQuery(
*please insert your function here*
 )
 )
  )
 )

 Explanation:
 You construct an artificial BooleanQuery which wraps your user's query
 as well as your function query.
 Your user's query - in that case - is just a DisjunctionMaxQuery
 consisting of two TermQueries.
 In the real world you might construct another BooleanQuery around your
 DisjunctionMaxQuery in order to have more flexibility.
 However, the interesting part of the given example is that we specify
 the user's query as a MustMatch condition of the BooleanQuery and the
 FunctionQuery just as a ShouldMatch.
 Constructed that way, I expect the FunctionQuery to score only those
 documents which fit the MustMatch condition.
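
For concreteness, a minimal Lucene/Solr 3.x sketch of that query shape (the
class names assume the 3.x org.apache.solr.search.function package, and the
float field driving the FunctionQuery is an illustrative assumption, not the
poster's actual function):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.search.function.FloatFieldSource;
import org.apache.solr.search.function.FunctionQuery;

public class WrappedFunctionQuery {
  public static Query build() {
    // The user's query: a DisjunctionMaxQuery over two TermQueries (tie-break 0.0f).
    DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
    userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
    userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));

    // Stand-in FunctionQuery over a hypothetical float field; plug in your own ValueSource.
    FunctionQuery scoreQuery = new FunctionQuery(new FloatFieldSource("myFloatField"));

    // Wrap both: the user's query MUST match; the FunctionQuery is only a SHOULD
    // clause, so it contributes score only for documents selected by the MUST clause.
    BooleanQuery wrapped = new BooleanQuery();
    wrapped.add(userQuery, Occur.MUST);
    wrapped.add(scoreQuery, Occur.SHOULD);
    return wrapped;
  }
}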

 I conclude that from the fact that the FunctionQuery-class also has a
 skipTo-method and I would expect that the scorer will use it to score
 only matching documents (however I did not search where and how it might
 get called).

 If my conclusion is wrong, then hopefully Robert Muir (as far as I can
 see, the author of that class) can tell us what the intention was behind
 constructing an every-time-match-all FunctionQuery.

 Can you validate whether your QueryParser constructs a query in the form
 I drew above?

 Regards,
 Em

 On 16.02.2012 20:29, Carlos Gonzalez-Cadenas wrote:
  Hello Em:
 
  1) Here's a printout of an example DisMax query (as you can see, mostly
  MUST terms except for some SHOULD terms used for boosting scores for
  stopwords):

  ((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
  stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
  +stopword_phrase:barcelona stopword_phrase:en) |
  (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
  stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
  +stopword_phrase:barcelona stopword_phrase:en) |
  (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
  stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
  +wildcard_stopword_phrase:barcelona stopword_phrase:en) |
  (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
  stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
  +wildcard_stopword_phrase:barcelona stopword_phrase:en))
  2)* *The collector is inserted in the SolrIndexSearcher (replacing the
  TimeLimitingCollector). We trigger it through the SOLR interface by
 passing
  the timeAllowed parameter. We know this is a hack but AFAIK there's no
  out-of-the-box way to specify custom collectors by now (
  https://issues.apache.org/jira/browse/SOLR-1680). In any case the
 collector
  part works perfectly as of now, so clearly this is not the problem.
 
  3) Re: your sentence:

  "I would expect that with a shrinking set of matching documents to
  the overall-query, the function query only checks those documents that are
  guaranteed to be within the result set."

  Yes, I agree with this, but this snippet of code in FunctionQuery.java
  seems to say otherwise:
 
  // instead of matching all docs, we could also embed a query.
  // the score could either ignore the subscore, or boost it.
  // Containment:  floatline(foo:myTerm, myFloatField, 1.0, 0.0f)
  // Boost:        foo:myTerm^floatline(myFloatField,1.0,0.0f)
  @Override
  public int nextDoc() throws IOException {
    for (;;) {
      ++doc;
      if (doc >= maxDoc) {
        return doc = NO_MORE_DOCS;
      }
      if (acceptDocs != null && !acceptDocs.get(doc)) continue;
      return doc;
    }
  }
 
  It seems that the author also thought of maybe embedding a query in order
  to restrict matches, but this doesn't seem to be in place as of now (or
  maybe I'm not understanding how the whole thing works :) ).
 
  Thanks
  Carlos
 
  Carlos Gonzalez-Cadenas
  CEO, ExperienceOn - New generation search
  http://www.experienceon.com
 
  Mobile: +34 652 911 201
  Skype: carlosgonzalezcadenas
  LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
 
 
  On Thu, Feb 16, 2012 at 8:09 PM, Em mailformailingli...@yahoo.de
 wrote:
 
  Hello Carlos,
 
  We have some more tests on that matter: now we're moving from issuing
  this
  large query through the SOLR interface to creating our own
  QueryParser. The
  initial tests we've done in our QParser (that internally creates
 multiple
  queries and inserts them inside a 

Re: Solr edismax clarification

2012-02-17 Thread O. Klein

Indika Tantrigoda wrote
 
 Hi All,
 
 I am using edismax SearchHandler in my search and I have some issues in
 the
 search results. As I understand if the defaultOperator is set to OR the
 search query will be passed as  - The OR quick OR brown OR fox
 implicitly.
 
 

Did you also remove mm? If not, defaultOperator is ignored and it
follows the mm settings:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
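
To illustrate, a hedged SolrJ sketch of the point above (the Solr URL and qf
fields are assumptions): with edismax, the effective OR/AND behavior follows
mm, not the schema defaultOperator.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EdismaxMmExample {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("The quick brown fox");
    q.set("defType", "edismax");
    q.set("qf", "title text");  // assumed query fields
    q.set("mm", "1");           // at least one clause must match: OR-like behavior
                                // ("100%" would make it effectively AND)
    QueryResponse rsp = solr.query(q);
    System.out.println("numFound=" + rsp.getResults().getNumFound());
  }
}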

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-edismax-clarification-tp3751013p3753260.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to connect embedded solr with each other by sharding

2012-02-17 Thread mustafozbek
I have been using sharding with multiple basic Solr servers for clustering. I
have also used one embedded Solr server (SolrJ Java API) with many basic Solr
servers, connecting them via sharding, with the embedded Solr server acting as
the caller. I used the code below for this purpose:
SolrQuery query = new SolrQuery();
query.set("shards", "solr1URL,solr2URL,...");
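
(For reference, a hedged SolrJ sketch of the setup described above, with an
embedded server acting as the caller; the solr home path and shard URLs are
illustrative assumptions.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class ShardedEmbeddedQuery {
  public static void main(String[] args) throws Exception {
    System.setProperty("solr.solr.home", "/path/to/solr/home");  // assumption
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");

    SolrQuery query = new SolrQuery("*:*");
    // Comma-separated shard URLs, given as host:port/solr[/core] without http://
    query.set("shards", "solr1:8983/solr,solr2:8983/solr");
    QueryResponse rsp = server.query(query);
    System.out.println("numFound=" + rsp.getResults().getNumFound());
    coreContainer.shutdown();
  }
}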

Now, I have many embedded Solr servers running on different computers and
they are unaware of each other. I want them to communicate with each other
via sharding. Is this possible? If yes, how? If not, what other options can
you advise when using embedded Solr servers?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-connect-embedded-solr-with-each-other-by-sharding-tp3753337p3753337.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error Indexing in solr 3.5

2012-02-17 Thread mechravi25
Hi Chantal,

I checked my client. It was pointing to the old solrj. After changing that,
it got indexed properly.

Thanks a lot.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-Indexing-in-solr-3-5-tp3746735p3753359.html
Sent from the Solr - User mailing list archive at Nabble.com.


Removing empty dynamic fields from a Solr 1.4 index

2012-02-17 Thread Andrew Ingram
Hi all

(Note: this question is cross-posted on stackoverflow: 
http://stackoverflow.com/questions/9327542/removing-empty-dynamic-fields-from-a-solr-1-4-index)

I have a Solr index that uses quite a few dynamic fields. I've recently changed 
my code to reduce the amount of data we index with Solr, significantly reducing 
the number of dynamic fields that are in use.

I've reindexed my data, and the doc count (as displayed in the admin schema 
browser) for the old fields has dropped to zero. But I'm confused as to why the 
fields still exist. I've done an optimize, and restarted the server, but I 
can't find any information on whether there's a way to get these fields to 
disappear.

Am I now stuck with these fields unless I recreate the index from scratch?
We're talking about a significant reduction in fields (from about 200 down to
30), and I'm worried about the performance impact of keeping them floating around.

Thanks, 
Andrew Ingram


How to handle to run testcases in ruby code for solr

2012-02-17 Thread solr
Hi all,
I am writing a Rails application using the solr_ruby gem to access Solr.
Can anybody suggest how to handle test cases for the Solr code and connections
in functional testing?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-handle-to-run-testcases-in-ruby-code-for-solr-tp3753479p3753479.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Realtime search with multi clients updating index simultaneously.

2012-02-17 Thread Erick Erickson
See below

On Thu, Feb 16, 2012 at 6:18 AM, v_shan varun.c...@gmail.com wrote:
 I have a helpdesk application developed in PHP/MySQL. I want to implement
 real-time full-text search, and I have shortlisted Solr. The MySQL database will
 store all the tickets and their updates, and that data will be imported for
 building the Solr index. All search requests will be handled by Solr.

 What I want is a real time search. The moment someone updates a ticket, it
 should be available for search.

 As per my understanding of Solr, this is how I think the system will work:
 a user updates a ticket -> the database record is modified -> a request is sent
 to the Solr server to modify the corresponding document in the index.

The first thing to understand: Solr does not update a document, it deletes
the old one and adds a new one based on uniqueKey.
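
For example, a hedged SolrJ sketch of what an "update" amounts to: re-sending
the complete document with the same uniqueKey (the URL and field names are
assumptions):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TicketUpdateExample {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "ticket-42");           // same uniqueKey as the existing doc
    doc.addField("subject", "Printer broken");
    doc.addField("status", "resolved");        // every field must be sent again

    solr.add(doc);    // effectively: delete the old doc with id=ticket-42, add this one
    solr.commit();    // changes become searchable only after a commit
  }
}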


 I have read a book on Solr and below questions are troubling me.
 1. The book mentions that commits are slow in Solr. Depending on the index
 size, Solr's auto-warming
 configuration, and Solr's cache state prior to committing, a commit can take
 a non-trivial amount of time. Typically, it takes a few seconds, but it can
 take
 some number of minutes in extreme cases. If this is true, then how will I
 know when the data will be available for search, and how can I implement
 realtime search? Also, I don't want the ticket update operation to be slowed
 down (by adding the extra step of updating the Solr index).

Well, Solr trunk is in the midst of getting NRT searching (Near Real Time),
so that may be of interest. Otherwise, there is some latency defined by
time until commit + replication time + autowarming time. You haven't
indicated how big your data set is, so what those numbers really are
is hard to even guess. Even if you do know how many records will
be there, the answer is still try it and see. Replication time may not
be necessary if you have a small enough system, it is possible to
index and search on the same machine. On larger installations,
a latency of a few minutes is common.



 2. It is also mentioned that there is no transaction isolation. This means
 that if more than one Solr client
 were to submit modifications and commit them at overlapping times, it is
 possible for part of one client's set of changes to be committed before that
 client told Solr to commit. This applies to rollback as well. If this is a
 problem
 for your architecture then consider using one client process responsible for
 updating Solr.

 Does it mean that due to lack of transactional commits, Solr can mess up the
 updates when multiple people update the ticket simultaneously?

As above, Solr deletes and replaces complete documents. So in this case
your update process would simply honor the last-received.

But I think you're missing a bit here. Users won't update your Solr index.
Somewhere, you'll have a process that queries your MySql database
and updates any changed records. The MySql database is your
system-of-record and where your transactional integrity is maintained.
The process that queries the database and sends the results to Solr will
just see the results of the aggregate changes to the underlying database
as single records, so I don't think this is an issue.
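
A hedged sketch of that kind of indexing process (table, column and field
names, the JDBC URL and the Solr URL are all illustrative assumptions; the
MySQL JDBC driver is expected on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TicketIndexer {
  public static void indexChangedTickets(Timestamp since) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    Connection db = DriverManager.getConnection(
        "jdbc:mysql://localhost/helpdesk", "user", "password");
    PreparedStatement stmt = db.prepareStatement(
        "SELECT id, subject, body, status FROM tickets WHERE updated_at > ?");
    stmt.setTimestamp(1, since);
    ResultSet rs = stmt.executeQuery();
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("subject", rs.getString("subject"));
      doc.addField("body", rs.getString("body"));
      doc.addField("status", rs.getString("status"));
      solr.add(doc);  // replaces any older document with the same id
    }
    solr.commit();    // make the whole batch visible to searchers
    rs.close();
    stmt.close();
    db.close();
  }
}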

Best
Erick

 Now the question before me is: Is Solr fit in my case? If yes, How?

Can't answer this for you.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Realtime-search-with-multi-clients-updating-index-simultaneously-tp3749881p3749881.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Payload and exact search - 2

2012-02-17 Thread Erick Erickson
OK, payloads are a bit of a mystery to me, so this may be way off
base.

But...

The ordering of your analysis chain is suspicious, the admin/analysis
page is a life-saver.

WordDelimiterFilterFactory is breaking up your input before it gets to
the payload filter, I think, so your payload information is completely
disassociated from your terms and treated as individual terms
all by themselves. At that point, what you get
in your index *probably* has no payloads attached at all!

Use the admin/schema browser link to actually look at the data (or
just go straight to Luke) and I believe you'll see that your position
information is being treated just like any other token in the input stream.

There should be nothing about payloads that prevents a normal
text query on the text part, though.
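
To check this outside Solr, here is a hedged Lucene 3.x sketch that runs a
whitespace tokenizer plus the delimited-payload filter directly and prints
whether each term still carries a payload (the sample text mirrors Leonardo's
example below; the Version constant is an assumption):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IdentityEncoder;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.Version;

public class PayloadInspector {
  public static void main(String[] args) throws Exception {
    String text = "Comune|326.29,948.16,349.07,954.25 di|350.74,948.16,355.62,954.25";
    // Payload filter applied directly after the tokenizer: the text after '|'
    // stays attached to each word token as its payload.
    TokenStream ts = new DelimitedPayloadTokenFilter(
        new WhitespaceTokenizer(Version.LUCENE_35, new StringReader(text)),
        '|', new IdentityEncoder());
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PayloadAttribute payload = ts.addAttribute(PayloadAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.toString() + " -> payload bytes: "
          + (payload.getPayload() == null ? 0 : payload.getPayload().length()));
    }
    ts.close();
  }
}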

Best
Erick

On Thu, Feb 16, 2012 at 9:18 AM, leonardo2 leonardo.rigut...@gmail.com wrote:
 Hello,
 I already posted this question but for some reason it was attached to a
 thread with different topic.


 Is it possible to perform an 'exact search' on a payload field?

 I have to index text with auxiliary info for each word. In particular, each
 word is associated with the bounding box containing it in the original PDF
 page (it is used for highlighting the search terms in the PDF). I used the
 payload to store that information.

 In the schema.xml, the fieldType definition is:

 ---
 <fieldtype name="wppayloads" stored="false" indexed="true"
            class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
   </analyzer>
 </fieldtype>
 ---

 while the field definition is:

 ---
 <field name="words" type="wppayloads" indexed="true" stored="true"
        required="true" multiValued="true"/>
 ---

 When indexing, the field 'words' contains a list of word|box as in the
 following example:

 ---
 doc_id=example
 words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25
 di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25}
 ---

 Such a solution works well except in the case of an exact search. For example,
 assuming the only indexed doc is the 'example' doc shown above, the query
 words:"Comune di Bologna" returns no results.

 Does anyone know if it is possible to perform an 'exact search' on a
 payload field?

 Thanks in advance,
 Leonardo

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Payload-and-exact-search-2-tp3750355p3750355.html
 Sent from the Solr - User mailing list archive at Nabble.com.


customizing standard tokenizer

2012-02-17 Thread Torsten Krah
Hi,

is it possible to extend the standard tokenizer, or use a custom one
(possibly by extending the standard one), to add some custom tokens,
e.g. so that "Lucene-Core" is kept as one token?

regards




Re: problem to indexing pdf directory

2012-02-17 Thread alessio crisantemi
Thanks Gora for your help.
I installed Maven and downloaded Tika following the guide, but I got an
error during the build of Tika about the 'tika compiler', and the Maven
installation of Tika stopped.

Is there another way?
Thank you,
a.

2012/2/16 Gora Mohanty g...@mimirtech.com

 On 16 February 2012 21:37, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  here the log:
 
 
  org.apache.solr.handler.dataimport.DataImporter doFullImport
  Grave: Full Import failed
  org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
 is
  a required attribute Processing Document # 1
 [...]

 The exception message above is pretty clear. You need to define a
 baseDir attribute for the second entity.

 However, even if you fix this, the setup will *not* work for indexing
 PDFs. Did you read the URLs that I sent earlier?

 Regards,
 Gora



Re: problem to indexing pdf directory

2012-02-17 Thread Erick Erickson
You should not have to do anything with Maven; the instructions
you followed were from the 1.4.1 days.

Assuming you're working with a 3.x build, here's a data-config
that worked for me, just a straight distro. But note a couple of things:
1> For simplicity, I changed the schema.xml to NOT require
   the id field. You'll probably have to change this back and
   select a good uniqueKey.
2> I had to add this line to solrconfig.xml to find the path:
   <lib dir="../../dist/" regex="apache-solr-dataimporthandler-extras-\d.*\.jar"/>
3> If this all works without errors in the Solr log and you still
   can't find anything, be sure you issue a commit.

Best
Erick

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity baseDir="/Users/Erick/testdocs" fileName=".*pdf" name="sd"
            processor="FileListEntityProcessor" recursive="true"
            rootEntity="false">
      <entity dataSource="bin" format="text" name="tika-test"
              processor="TikaEntityProcessor" url="${sd.fileAbsolutePath}">
        <field column="Author" meta="true" name="author"/>
        <field column="Content-Type" meta="true" name="title"/>
        <!-- <field column="title" name="title" meta="true"/> -->
        <field column="text" name="text"/>
      </entity>
      <!-- <field column="fileLastModified" name="date"
           dateTimeFormat="-MM-dd'T'hh:mm:ss"/> -->
      <field column="fileSize" meta="true" name="size"/>
    </entity>
  </document>
</dataConfig>
On Fri, Feb 17, 2012 at 9:35 AM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
 thanks gora for your help.
 I installed Maven and downloaded Tika following the guide: But I have an
 errore during the built of Tika about 'tika compiler', and the maven
 installation of Tika is stopped.

 there is another way?
 thank you
 a.

 2012/2/16 Gora Mohanty g...@mimirtech.com

 On 16 February 2012 21:37, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  here the log:
 
 
  org.apache.solr.handler.dataimport.DataImporter doFullImport
  Grave: Full Import failed
  org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
 is
  a required attribute Processing Document # 1
 [...]

 The exception message above is pretty clear. You need to define a
 baseDir attribute for the second entity.

 However, even if you fix this, the setup will *not* work for indexing
 PDFs. Did you read the URLs that I sent earlier?

 Regards,
 Gora



Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
Thanks Mark.  I'm still seeing some issues while indexing though.  I
have the same setup described in my previous email.  I do some indexing
to the cluster with everything up and everything looks good.  I then
take down one instance which is running 2 cores (shard2 slice 1 and
shard 1 slice 2) and do some more inserts.  I then bring this second
instance back up expecting that the system will recover the missing
documents from the other instance but this isn't happening.  I see the
following log message

Feb 17, 2012 9:53:11 AM org.apache.solr.cloud.RecoveryStrategy run
INFO: Sync Recovery was succesful - registering as Active

which leads me to believe things should be in sync, but they are not.
I've made no changes to the default solrconfig.xml, not sure if I need
to or not but it looks like everything should work now.  Am I missing
a configuration somewhere?

Initial state

{"collection1":{
  "slice1":{
    "JamiesMac.local:8501_solr_slice1_shard1":{
      "shard_id":"slice1",
      "leader":"true",
      "state":"active",
      "core":"slice1_shard1",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8501_solr",
      "base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice1_shard2":{
      "shard_id":"slice1",
      "state":"active",
      "core":"slice1_shard2",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8502_solr",
      "base_url":"http://JamiesMac.local:8502/solr"}},
  "slice2":{
    "JamiesMac.local:8501_solr_slice2_shard2":{
      "shard_id":"slice2",
      "leader":"true",
      "state":"active",
      "core":"slice2_shard2",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8501_solr",
      "base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice2_shard1":{
      "shard_id":"slice2",
      "state":"active",
      "core":"slice2_shard1",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8502_solr",
      "base_url":"http://JamiesMac.local:8502/solr"}}}}


state with 1 solr instance down

{"collection1":{
  "slice1":{
    "JamiesMac.local:8501_solr_slice1_shard1":{
      "shard_id":"slice1",
      "leader":"true",
      "state":"active",
      "core":"slice1_shard1",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8501_solr",
      "base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice1_shard2":{
      "shard_id":"slice1",
      "state":"active",
      "core":"slice1_shard2",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8502_solr",
      "base_url":"http://JamiesMac.local:8502/solr"}},
  "slice2":{
    "JamiesMac.local:8501_solr_slice2_shard2":{
      "shard_id":"slice2",
      "leader":"true",
      "state":"active",
      "core":"slice2_shard2",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8501_solr",
      "base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice2_shard1":{
      "shard_id":"slice2",
      "state":"active",
      "core":"slice2_shard1",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8502_solr",
      "base_url":"http://JamiesMac.local:8502/solr"}}}}

state when everything comes back up after adding documents

{"collection1":{
  "slice1":{
    "JamiesMac.local:8501_solr_slice1_shard1":{
      "shard_id":"slice1",
      "leader":"true",
      "state":"active",
      "core":"slice1_shard1",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8501_solr",
      "base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice1_shard2":{
      "shard_id":"slice1",
      "state":"active",
      "core":"slice1_shard2",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8502_solr",
      "base_url":"http://JamiesMac.local:8502/solr"}},
  "slice2":{
    "JamiesMac.local:8501_solr_slice2_shard2":{
      "shard_id":"slice2",
      "leader":"true",
      "state":"active",
      "core":"slice2_shard2",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8501_solr",
      "base_url":"http://JamiesMac.local:8501/solr"},
    "JamiesMac.local:8502_solr_slice2_shard1":{
      "shard_id":"slice2",
      "state":"active",
      "core":"slice2_shard1",
      "collection":"collection1",
      "node_name":"JamiesMac.local:8502_solr",
      "base_url":"http://JamiesMac.local:8502/solr"}}}}


On Thu, Feb 16, 2012 at 10:24 PM, Mark Miller markrmil...@gmail.com wrote:
 Yup - deletes are fine.


 On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson jej2...@gmail.com wrote:

 With solr-2358 being committed to trunk do deletes and updates get
 distributed/routed like adds do? Also when a down shard comes back up are
 the deletes/updates forwarded as well? Reading the jira I believe the
 answer is yes, I just want to verify before bringing the latest into my
 environment.




 --
 - Mark

 http://www.lucidimagination.com


Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
and having looked at this more closely, shouldn't the down node no longer be
marked as active when I stop that Solr instance?

On Fri, Feb 17, 2012 at 10:04 AM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks Mark.  I'm still seeing some issues while indexing though.  I
 have the same setup describe in my previous email.  I do some indexing
 to the cluster with everything up and everything looks good.  I then
 take down one instance which is running 2 cores (shard2 slice 1 and
 shard 1 slice 2) and do some more inserts.  I then bring this second
 instance back up expecting that the system will recover the missing
 documents from the other instance but this isn't happening.  I see the
 following log message

 Feb 17, 2012 9:53:11 AM org.apache.solr.cloud.RecoveryStrategy run
 INFO: Sync Recovery was succesful - registering as Active

 which leads me to believe things should be in sync, but they are not.
 I've made no changes to the default solrconfig.xml, not sure if I need
 to or not but it looks like everything should work now.  Am I missing
 a configuration somewhere?

 Initial state

 {collection1:{
    slice1:{
      JamiesMac.local:8501_solr_slice1_shard1:{
        shard_id:slice1,
        leader:true,
        state:active,
        core:slice1_shard1,
        collection:collection1,
        node_name:JamiesMac.local:8501_solr,
        base_url:http://JamiesMac.local:8501/solr},
      JamiesMac.local:8502_solr_slice1_shard2:{
        shard_id:slice1,
        state:active,
        core:slice1_shard2,
        collection:collection1,
        node_name:JamiesMac.local:8502_solr,
        base_url:http://JamiesMac.local:8502/solr}},
    slice2:{
      JamiesMac.local:8501_solr_slice2_shard2:{
        shard_id:slice2,
        leader:true,
        state:active,
        core:slice2_shard2,
        collection:collection1,
        node_name:JamiesMac.local:8501_solr,
        base_url:http://JamiesMac.local:8501/solr},
      JamiesMac.local:8502_solr_slice2_shard1:{
        shard_id:slice2,
        state:active,
        core:slice2_shard1,
        collection:collection1,
        node_name:JamiesMac.local:8502_solr,
        base_url:http://JamiesMac.local:8502/solr


 state with 1 solr instance down

 {collection1:{
    slice1:{
      JamiesMac.local:8501_solr_slice1_shard1:{
        shard_id:slice1,
        leader:true,
        state:active,
        core:slice1_shard1,
        collection:collection1,
        node_name:JamiesMac.local:8501_solr,
        base_url:http://JamiesMac.local:8501/solr},
      JamiesMac.local:8502_solr_slice1_shard2:{
        shard_id:slice1,
        state:active,
        core:slice1_shard2,
        collection:collection1,
        node_name:JamiesMac.local:8502_solr,
        base_url:http://JamiesMac.local:8502/solr}},
    slice2:{
      JamiesMac.local:8501_solr_slice2_shard2:{
        shard_id:slice2,
        leader:true,
        state:active,
        core:slice2_shard2,
        collection:collection1,
        node_name:JamiesMac.local:8501_solr,
        base_url:http://JamiesMac.local:8501/solr},
      JamiesMac.local:8502_solr_slice2_shard1:{
        shard_id:slice2,
        state:active,
        core:slice2_shard1,
        collection:collection1,
        node_name:JamiesMac.local:8502_solr,
        base_url:http://JamiesMac.local:8502/solr

 state when everything comes back up after adding documents

 {collection1:{
    slice1:{
      JamiesMac.local:8501_solr_slice1_shard1:{
        shard_id:slice1,
        leader:true,
        state:active,
        core:slice1_shard1,
        collection:collection1,
        node_name:JamiesMac.local:8501_solr,
        base_url:http://JamiesMac.local:8501/solr},
      JamiesMac.local:8502_solr_slice1_shard2:{
        shard_id:slice1,
        state:active,
        core:slice1_shard2,
        collection:collection1,
        node_name:JamiesMac.local:8502_solr,
        base_url:http://JamiesMac.local:8502/solr}},
    slice2:{
      JamiesMac.local:8501_solr_slice2_shard2:{
        shard_id:slice2,
        leader:true,
        state:active,
        core:slice2_shard2,
        collection:collection1,
        node_name:JamiesMac.local:8501_solr,
        base_url:http://JamiesMac.local:8501/solr},
      JamiesMac.local:8502_solr_slice2_shard1:{
        shard_id:slice2,
        state:active,
        core:slice2_shard1,
        collection:collection1,
        node_name:JamiesMac.local:8502_solr,
        base_url:http://JamiesMac.local:8502/solr


 On Thu, Feb 16, 2012 at 10:24 PM, Mark Miller markrmil...@gmail.com wrote:
 Yup - deletes are fine.


 On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson jej2...@gmail.com wrote:

 With solr-2358 being committed to trunk do deletes and updates get
 distributed/routed like adds do? Also when a down shard comes back up are
 the deletes/updates forwarded as well? Reading the jira I believe the
 answer is yes, I just want to verify before bringing the latest into my
 environment.




 --
 - Mark

 

Cloud tab hanging?

2012-02-17 Thread Ranjan Bagchi
Hi,

I'm pretty new to solr and especially solr cloud, so hopefully this isn't
too dumb:  I followed the wiki instructions for setting up a small cloud.
 Things seem to work, *except* on the UI [using chrome and safari], the
cloud tab hangs.  It says Zookeeper Data, and then there's a loading
symbol. The old UI allows me to see what's in zookeeper, so I'm pretty
sure it's mostly working.

There's nothing in the logs at all about a connection timing out -- any
help?

Thanks,

Ranjan


Re: distributed deletes working?

2012-02-17 Thread Sami Siren
On Fri, Feb 17, 2012 at 5:10 PM, Jamie Johnson jej2...@gmail.com wrote:
 and having looked at this closer, shouldn't the down node not be
 marked as active when I stop that solr instance?

Currently the shard state is not updated in the cloudstate when a node
goes down. This behavior should probably be changed at some point.

--
 Sami Siren


Re: How to handle to run testcases in ruby code for solr

2012-02-17 Thread Erik Hatcher
Just FYI the solr-ruby (hyphen, not underscore to be precise) is 
deprecated in that the source no longer lives under Apache's svn.  The gem is 
still out there, and it's still a useful library, but the Ruby/Solr world seems 
to use RSolr the most.  Both have their pros/cons, but solr-ruby works just 
fine as you'll see.   The source code for it was relocated to my personal 
github account for posterity: https://github.com/erikhatcher/solr-ruby-flare

All that being said, the solr-ruby library itself has extensive coverage with 
unit and functional tests.  For the functional side, you can see here 
https://github.com/erikhatcher/solr-ruby-flare/blob/master/solr-ruby/test/functional/server_test.rb
   which ends up getting wrapped with a test Solr instance and leveraged in the 
:test Rake task here: 
https://github.com/erikhatcher/solr-ruby-flare/blob/master/solr-ruby/Rakefile

Hope that helps.

Erik




On Feb 17, 2012, at 07:12 , solr wrote:

 Hi  all,
 Am writing rails application by using solr_ruby gem to access solr . 
 Can anybody suggest how to handle testcaeses for solr code and connections
 in functionaltetsing.
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-handle-to-run-testcases-in-ruby-code-for-solr-tp3753479p3753479.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
Thanks Sami, as long as it's expected ;)

In regards to the replication not working the way I think it should,
am I missing something or is it simply not working the way I think?

On Fri, Feb 17, 2012 at 11:01 AM, Sami Siren ssi...@gmail.com wrote:
 On Fri, Feb 17, 2012 at 5:10 PM, Jamie Johnson jej2...@gmail.com wrote:
 and having looked at this closer, shouldn't the down node not be
 marked as active when I stop that solr instance?

 Currently the shard state is not updated in the cloudstate when a node
 goes down. This behavior should probably be changed at some point.

 --
  Sami Siren


Re: Frequent garbage collections after a day of operation

2012-02-17 Thread Erick Erickson
A wonderful writeup on various memory collection concerns
http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/



On Fri, Feb 17, 2012 at 12:27 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 One thing that could fit the pattern you describe would be Solr caches
 filling up and getting you too close to your JVM or memory limit

 This [uncommitted] issue would solve that problem by allowing the GC
 to collect caches that become too large, though in practice, the cache
 setting would need to be fairly large for an OOM to occur from them:
 https://issues.apache.org/jira/browse/SOLR-1513

 On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow
 bloofbour...@knowledgemosaic.com wrote:
 A couple of thoughts:

 We wound up doing a bunch of tuning on the Java garbage collection.
 However, the pattern we were seeing was periodic very extreme slowdowns,
 because we were then using the default garbage collector, which blocks
 when it has to do a major collection. This doesn't sound like your
 problem, but it's something to be aware of.

 One thing that could fit the pattern you describe would be Solr caches
 filling up and getting you too close to your JVM or memory limit. For
 example, if you have large documents, and have defined a large document
 cache, that might do it.

 I found it useful to point jconsole (free with the JDK) at my JVM, and
 watch the pattern of memory usage. If the troughs at the bottom of the GC
 cycles keep rising, you know you've got something that is continuing to
 grab more memory and not let go of it. Now that our JVM is running
 smoothly, we just see a sawtooth pattern, with the troughs approximately
 level. When the system is under load, the frequency of the wave rises. Try
 it and see what sort of pattern you're getting.

 -- Bryan

 -Original Message-
 From: Matthias Käppler [mailto:matth...@qype.com]
 Sent: Thursday, February 16, 2012 7:23 AM
 To: solr-user@lucene.apache.org
 Subject: Frequent garbage collections after a day of operation

 Hey everyone,

 we're running into some operational problems with our SOLR production
 setup here and were wondering if anyone else is affected or has even
 solved these problems before. We're running a vanilla SOLR 3.4.0 in
 several Tomcat 6 instances, so nothing out of the ordinary, but after
 a day or so of operation we see increased response times from SOLR, up
 to 3 times increases on average. During this time we see increased CPU
 load due to heavy garbage collection in the JVM, which bogs down the
 whole system, so throughput decreases, naturally. When restarting
 the slaves, everything goes back to normal, but that's more like a
 brute force solution.

 The thing is, we don't know what's causing this and we don't have that
 much experience with Java stacks since we're for most parts a Rails
 company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
 seeing this, or can you think of a reason for this? Most of our
 queries to SOLR involve the DismaxHandler and the spatial search query
 components. We don't use any custom request handlers so far.

 Thanks in advance,
 -Matthias

 --
 Matthias Käppler
 Lead Developer API  Mobile

 Qype GmbH
 Großer Burstah 50-52
 20457 Hamburg
 Telephone: +49 (0)40 - 219 019 2 - 160
 Skype: m_kaeppler
 Email: matth...@qype.com

 Managing Director: Ian Brotherston
 Amtsgericht Hamburg
 HRB 95913

 This e-mail and its attachments may contain confidential and/or
 privileged information. If you are not the intended recipient (or have
 received this e-mail in error) please notify the sender immediately
 and destroy this e-mail and its attachments. Any unauthorized copying,
 disclosure or distribution of this e-mail and  its attachments is
 strictly forbidden. This notice also applies to future messages.


Re: customizing standard tokenizer

2012-02-17 Thread Em
Hi Torsten,

did you have a look at WordDelimiterTokenFilter?

Sounds like it fits your needs.

Regards,
Em

On 17.02.2012 15:14, Torsten Krah wrote:
 Hi,
 
 is it possible to extend the standard tokenizer or use a custom one
 (possible via extending the standard one) to add some custom tokens
 like Lucene-Core to be one token.
 
 regards


Re: distributed deletes working?

2012-02-17 Thread Sami Siren
On Fri, Feb 17, 2012 at 6:03 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks Sami, so long at it's expected ;)

 In regards to the replication not working the way I think it should,
 am I missing something or is it simply not working the way I think?

It should work. I also tried to reproduce your issue but was not able
to. Could you try to reproduce your problem with the provided scripts
that are in solr/cloud-dev/? I think example2.sh might be a good start.
It's not identical to your situation (it has 1 core per instance) but
would be great if you could verify that you see the issue with that
setup or not.

--
 Sami Siren


Re: distributed deletes working?

2012-02-17 Thread Mark Miller

On Feb 17, 2012, at 11:03 AM, Jamie Johnson wrote:

 Thanks Sami, so long at it's expected ;)

Yeah, it's expected - we always use both the live nodes info and state to 
determine the full state for a shard.

 
 In regards to the replication not working the way I think it should,
 am I missing something or is it simply not working the way I think?

This should work - in fact I just did the same testing this morning.

Are you indexing while you bring the shard down and then up (it should still 
work fine)?
Or do you stop indexing, bring down the shard, index, bring up the shard?

How far out of sync is it?

When exactly is this build from?

 
 On Fri, Feb 17, 2012 at 11:01 AM, Sami Siren ssi...@gmail.com wrote:
 On Fri, Feb 17, 2012 at 5:10 PM, Jamie Johnson jej2...@gmail.com wrote:
 and having looked at this closer, shouldn't the down node not be
 marked as active when I stop that solr instance?
 
 Currently the shard state is not updated in the cloudstate when a node
 goes down. This behavior should probably be changed at some point.
 
 --
  Sami Siren

- Mark Miller
lucidimagination.com













Re: distributed deletes working?

2012-02-17 Thread Yonik Seeley
On Fri, Feb 17, 2012 at 11:13 AM, Mark Miller markrmil...@gmail.com wrote:
 When exactly is this build from?

Yeah... I just checked in a fix yesterday dealing with sync while
indexing is going on.

-Yonik
lucidimagination.com


Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
I stop the indexing, stop the shard, then start indexing again.  So I
shouldn't need Yonik's latest fix?  In regards to how far out of sync it
is: it's completely out of sync, meaning I index 100 documents to the
cluster (40 on shard1, 60 on shard2), then stop the instance, index 100
more, and when I bring the instance back up, if I issue queries to just the
Solr instance I brought up, the counts are the old counts.

I'll startup the same test with out using multiple cores.  Give me a
few and I'll provide the details.



On Fri, Feb 17, 2012 at 11:19 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 11:13 AM, Mark Miller markrmil...@gmail.com wrote:
 When exactly is this build from?

 Yeah... I just checked in a fix yesterday dealing with sync while
 indexing is going on.

 -Yonik
 lucidimagination.com


Re: Cloud tab hanging?

2012-02-17 Thread Mark Miller

On Feb 17, 2012, at 11:00 AM, Ranjan Bagchi wrote:

 Hi,
 
 I'm pretty new to solr and especially solr cloud, so hopefully this isn't
 too dumb:  I followed the wiki instructions for setting up a small cloud.
 Things seem to work, *except* on the UI [using chrome and safari], the
 cloud tab hangs.  It says Zookeeper Data, and then there's a loading
 symbol.The old ui allows me to see what's in zookeeper, so I'm pretty
 sure it's mostly working.
 
 There's nothing in the logs at all about a connection timing out -- any
 help?
 
 Thanks,
 
 Ranjan


I've intermittently seen this myself I think - its hard to debug without 
something like firebug to see what is actually failing (in the past, with the 
new UI, i've seen it choke on some json response (that was valid according to 
other tools)).

You might want to file a JIRA issue on it and in the mean time, try using the 
old UI for this? localhost:8983/solr/collection1/admin/zookeeper.jsp

It's still more full featured anyhow, in that you can actually inspect what 
data is on each node (important for being able to see the clusterstate.json!)

- Mark Miller
lucidimagination.com













Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
I'm seeing the following.  Do I need a _version_ long field in my schema?

Feb 17, 2012 1:15:50 PM
org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {delete=[f2c29abe-2e48-4965-adfb-8bd611293ff0]} 0 0
Feb 17, 2012 1:15:50 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing _version_ on
update from leader
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionDelete(DistributedUpdateProcessor.java:707)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:478)
at 
org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:137)
at org.apache.solr.handler.XMLLoader.processDelete(XMLLoader.java:235)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:166)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1523)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:405)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:255)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)



On Fri, Feb 17, 2012 at 11:25 AM, Jamie Johnson jej2...@gmail.com wrote:
 I stop the indexing, stop the shard, then start indexing again.  So
 shouldn't need Yonik's latest fix?  In regards to how far out of sync,
 it's completely out of sync, meaning index 100 documents to the
 cluster (40 on shard1 60 on shard2) then stop the instance, index 100
 more, when I bring the instance back up if I issue queries to just the
 solr instance I brought up the counts are the old counts.

 I'll startup the same test with out using multiple cores.  Give me a
 few and I'll provide the details.



 On Fri, Feb 17, 2012 at 11:19 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 11:13 AM, Mark Miller markrmil...@gmail.com wrote:
 When exactly is this build from?

 Yeah... I just checked in a fix yesterday dealing with sync while
 indexing is going on.

 -Yonik
 lucidimagination.com


Re: distributed deletes working?

2012-02-17 Thread Yonik Seeley
On Fri, Feb 17, 2012 at 1:27 PM, Jamie Johnson jej2...@gmail.com wrote:
 I'm seeing the following.  Do I need a _version_ long field in my schema?

Yep... versions are the way we keep things sane (shuffled updates to a
replica can be correctly reordered, etc).

-Yonik
lucidimagination.com


Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
Ok, so I'm making some progress now.  With _version_ in the schema
(I forgot about this because I remember asking about it before), deletes
across the cluster work when I delete by id.  Updates work as well; if
a node is down, it recovers fine.  Something that didn't work, though,
was when a node was down while a delete happened and then comes back up:
that node still lists the id I deleted.  Is this currently supported?


On Fri, Feb 17, 2012 at 1:33 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 1:27 PM, Jamie Johnson jej2...@gmail.com wrote:
 I'm seeing the following.  Do I need a _version_ long field in my schema?

 Yep... versions are the way we keep things sane (shuffled updates to a
 replica can be correctly reordered, etc).

 -Yonik
 lucidimagination.com


Re: distributed deletes working?

2012-02-17 Thread Yonik Seeley
On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote:
 Something that didn't work though
 was if a node was down when a delete happened and then comes back up,
 that node still listed the id I deleted.  Is this currently supported?

Yes, that should work fine.  Are you still seeing that behavior?

-Yonik
lucidimagination.com


Custom Query Component: parameters are not appended to query

2012-02-17 Thread Vadim Kisselmann
Hello folks,

I built a simple custom component for the "hl.q" query.
My use case was to inject hl.q params on the fly, because of filter params like
fields which were in my standard query. These were being highlighted, because
Solr/Lucene has no way of interpreting an extended q clause and saying "this
part is a query and should be highlighted, and this part isn't".
If it works, the community can have it :)

Facts:  q=roomba AND irobot AND language:de

My component extends SearchComponent. I use the ResponseBuilder to get
all needed params like field names from the schema, the q params, etc.



My component is called first (this works, verified via debugging and
debugQuery) from my SearchHandler:

<arr name="first-components">
  <str>highlightQuery</str>
</arr>



Important Clippings from Sourcecode:

public class HighlightQueryComponent extends SearchComponent {

  ...

  public void process(ResponseBuilder rb) throws IOException {

    if (rb.doHighlights) {

      List<String> terms = new ArrayList<String>(0);
      SolrQueryRequest req = rb.req;
      IndexSchema schema = req.getSchema();
      Map<String, SchemaField> fields = schema.getFields();
      SolrParams params = req.getParams();
      ...
      ... magic ...
      ...
      Query hlq = new TermQuery(new Term("text", hlQuery.toString()));
      rb.setHighlightQuery(hlq);   // hlq = text:(roomba AND irobot)



Problem:
In the last step my query is adjusted (the hlq params from debugging are
"text:(roomba AND irobot)"). It looks fine; the magic in the process()
method works.
But nothing happens. If I continue to debug, the next components are called,
but my query is the same, without changes.
Either setHighlightQuery doesn't work, or my params are overridden in the
following components.
What can it be?

Best Regards
Vadim


Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
Yes, still seeing that.  Master has 8 items, replica has 9.  So the
delete didn't seem to work when the node was down.

On Fri, Feb 17, 2012 at 1:41 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote:
 Something that didn't work though
 was if a node was down when a delete happened and then comes back up,
 that node still listed the id I deleted.  Is this currently supported?

 Yes, that should work fine.  Are you still seing that behavior?

 -Yonik
 lucidimagination.com


Re: distributed deletes working?

2012-02-17 Thread Mark Miller
Hmm...just tried this with only deletes, and the replica sync'd fine for me.

Is this with your multi core setup or were you trying with instances?

On Feb 17, 2012, at 1:52 PM, Jamie Johnson wrote:

 Yes, still seeing that.  Master has 8 items, replica has 9.  So the
 delete didn't seem to work when the node was down.
 
 On Fri, Feb 17, 2012 at 1:41 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote:
 Something that didn't work though
 was if a node was down when a delete happened and then comes back up,
 that node still listed the id I deleted.  Is this currently supported?
 
 Yes, that should work fine.  Are you still seing that behavior?
 
 -Yonik
 lucidimagination.com

- Mark Miller
lucidimagination.com













Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
This was with the cloud-dev solrcloud-start.sh script (after that I've
used solrcloud-start-existing.sh).

Essentially I run ./solrcloud-start-existing.sh
index docs
kill 1 of the solr instances (using kill -9 on the pid)
delete a doc from running instances
restart killed solr instance

on doing this the deleted document is still lingering in the instance
that was down.

On Fri, Feb 17, 2012 at 2:04 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...just tried this with only deletes, and the replica sync'd fine for me.

 Is this with your multi core setup or were you trying with instances?

 On Feb 17, 2012, at 1:52 PM, Jamie Johnson wrote:

 Yes, still seeing that.  Master has 8 items, replica has 9.  So the
 delete didn't seem to work when the node was down.

 On Fri, Feb 17, 2012 at 1:41 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote:
 Something that didn't work though
 was if a node was down when a delete happened and then comes back up,
 that node still listed the id I deleted.  Is this currently supported?

 Yes, that should work fine.  Are you still seing that behavior?

 -Yonik
 lucidimagination.com

 - Mark Miller
 lucidimagination.com













RE: customizing standard tokenizer

2012-02-17 Thread Steven A Rowe
Hi Torsten,

The Lucene StandardTokenizer is written in JFlex (http://jflex.de) - you can 
see the version 3.X specification at: 

http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup

You can make changes to this file, then run ant jflex-StandardAnalyzer from 
the checked-out branch_3x sources or a source release (in the lucene/core/ 
directory in branch_3x, and in the lucene/ directory in a pre-3.6 source 
release), to generate the corresponding java source code at:

  
lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java

However, I recommend a simpler strategy: use a MappingCharFilter[1] in front of 
your tokenizer to map the tokens you want left intact to strings that will not 
be broken up by the tokenizer.  For example, Lucene-Core could be mapped to 
Lucene_Core, because UAX#29[2], upon which StandardTokenizer is based, 
considers the underscore to be a word character, and so will leave 
Lucene_Core as a single token.  You would need to use this strategy at both 
index-time and query-time.
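
A hedged Lucene 3.x sketch of that MappingCharFilter strategy (the Version
constant and the sample text are assumptions):

import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class MappingCharFilterExample {
  public static void main(String[] args) throws Exception {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("Lucene-Core", "Lucene_Core");   // pre-tokenizer character mapping

    MappingCharFilter mapped = new MappingCharFilter(
        map, CharReader.get(new StringReader("I index Lucene-Core documents")));
    StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_35, mapped);

    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());   // ... "Lucene_Core", "documents"
    }
    tokenizer.close();
  }
}

In schema.xml the same idea would be a charFilter (e.g. solr.MappingCharFilterFactory 
with a mapping file) placed before the tokenizer in both the index and query analyzers.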

(I was going to add that if you wanted your indexed tokens to be the same as 
their original form, you could add a MappingTokenFilter after your tokenizer to 
do the reverse mapping, but such a thing does not yet exist :( - however, there 
is a JIRA issue for this idea: 
https://issues.apache.org/jira/browse/SOLR-1978.)

Steve

[1] 
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/MappingCharFilter.html

[2] http://unicode.org/reports/tr29/

 -Original Message-
 From: Torsten Krah [mailto:tk...@fachschaft.imn.htwk-leipzig.de]
 Sent: Friday, February 17, 2012 9:15 AM
 To: solr-user@lucene.apache.org
 Subject: customizing standard tokenizer
 
 Hi,
 
 is it possible to extend the standard tokenizer or use a custom one
 (possible via extending the standard one) to add some custom tokens
 like Lucene-Core to be one token.
 
 regards


Re: distributed deletes working?

2012-02-17 Thread Yonik Seeley
On Fri, Feb 17, 2012 at 2:07 PM, Jamie Johnson jej2...@gmail.com wrote:
 This was with the cloud-dev solrcloud-start.sh script (after that I've
 used solrcloud-start-existing.sh).

 Essentially I run ./solrcloud-start-existing.sh
 index docs
 kill 1 of the solr instances (using kill -9 on the pid)
 delete a doc from running instances
 restart killed solr instance

 on doing this the deleted document is still lingering in the instance
 that was down.

Hmmm.  Shot in the dark : is your id field type something other than string?

-Yonik
lucidimagination.com


Re: distributed deletes working?

2012-02-17 Thread Mark Miller
You are committing in that mix right?

On Feb 17, 2012, at 2:07 PM, Jamie Johnson wrote:

 This was with the cloud-dev solrcloud-start.sh script (after that I've
 used solrcloud-start-existing.sh).
 
 Essentially I run ./solrcloud-start-existing.sh
 index docs
 kill 1 of the solr instances (using kill -9 on the pid)
 delete a doc from running instances
 restart killed solr instance
 
 on doing this the deleted document is still lingering in the instance
 that was down.
 
 On Fri, Feb 17, 2012 at 2:04 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...just tried this with only deletes, and the replica sync'd fine for me.
 
 Is this with your multi core setup or were you trying with instances?
 
 On Feb 17, 2012, at 1:52 PM, Jamie Johnson wrote:
 
 Yes, still seeing that.  Master has 8 items, replica has 9.  So the
 delete didn't seem to work when the node was down.
 
 On Fri, Feb 17, 2012 at 1:41 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote:
 Something that didn't work though
 was if a node was down when a delete happened and then comes back up,
 that node still listed the id I deleted.  Is this currently supported?
 
 Yes, that should work fine.  Are you still seing that behavior?
 
 -Yonik
 lucidimagination.com
 
 - Mark Miller
 lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 

- Mark Miller
lucidimagination.com













Re: problem to indexing pdf directory

2012-02-17 Thread alessio crisantemi
I'll try... but I work with Solr 1.4.1.

On 17 February 2012 15:59, Erick Erickson erickerick...@gmail.com wrote:

 You should not have to do anything with Maven, the instructions
 you followed were from 1.4.1 days..

 Assuming you're working with a 3.x build, here's a data-config
 that worked for me, just a straight distro. But note a couple of things:
 1 for simplicity, I changed the schema.xml to NOT require
 the id field. You'll have to change this back probably and
 select a good uniqueKey
 2 I had to add this line to solrconfig.xml to find the path:
 lib dir=../../dist/
 regex=apache-solr-dataimporthandler-extras-\d.*\.jar/
 3 If this all works without errors in the Solr log and you still
 can't find anything, be sure you issue a commit.

 Best
 Erick

 dataConfig
  dataSource name=bin type=BinFileDataSource/
  document
entity baseDir=/Users/Erick/testdocs fileName=.*pdf name=sd
 processor=FileListEntityProcessor recursive=true
 rootEntity=false
  entity dataSource=bin format=text name=tika-test
 processor=TikaEntityProcessor url=${sd.fileAbsolutePath}
field column=Author meta=true name=author/
field column=Content-Type meta=true name=title/
!-- field column=title name=title meta=true/ --
field column=text name=text/
  /entity
  !-- field column=fileLastModified name=date
 dateTimeFormat=-MM-dd'T'hh:mm:ss / --
  field column=fileSize meta=true name=size/
/entity
  /document
 /dataConfig
 On Fri, Feb 17, 2012 at 9:35 AM, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  thanks gora for your help.
  I installed Maven and downloaded Tika following the guide: But I have an
  errore during the built of Tika about 'tika compiler', and the maven
  installation of Tika is stopped.
 
  there is another way?
  thank you
  a.
 
  2012/2/16 Gora Mohanty g...@mimirtech.com
 
  On 16 February 2012 21:37, alessio crisantemi
  alessio.crisant...@gmail.com wrote:
   here the log:
  
  
   org.apache.solr.handler.dataimport.DataImporter doFullImport
   Grave: Full Import failed
   org.apache.solr.handler.dataimport.DataImportHandlerException:
 'baseDir'
  is
   a required attribute Processing Document # 1
  [...]
 
  The exception message above is pretty clear. You need to define a
  baseDir attribute for the second entity.
 
  However, even if you fix this, the setup will *not* work for indexing
  PDFs. Did you read the URLs that I sent earlier?
 
  Regards,
  Gora
 



Re: problem to indexing pdf directory

2012-02-17 Thread Erick Erickson
Sorry, my error! In that case you *do* have to do some fiddling to get
it all to work.

Good Luck!
Erick

On Fri, Feb 17, 2012 at 3:27 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
 I tried it... but I work with Solr 1.4.1

 Il giorno 17 febbraio 2012 15:59, Erick Erickson
 erickerick...@gmail.comha scritto:

 You should not have to do anything with Maven, the instructions
 you followed were from 1.4.1 days..

 Assuming you're working with a 3.x build, here's a data-config
 that worked for me, just a straight distro. But note a couple of things:
 1 for simplicity, I changed the schema.xml to NOT require
 the id field. You'll have to change this back probably and
 select a good uniqueKey
 2 I had to add this line to solrconfig.xml to find the path:
 lib dir=../../dist/
 regex=apache-solr-dataimporthandler-extras-\d.*\.jar/
 3 If this all works without errors in the Solr log and you still
     can't find anything, be sure you issue a commit.

 Best
 Erick

 dataConfig
  dataSource name=bin type=BinFileDataSource/
  document
    entity baseDir=/Users/Erick/testdocs fileName=.*pdf name=sd
 processor=FileListEntityProcessor recursive=true
 rootEntity=false
      entity dataSource=bin format=text name=tika-test
 processor=TikaEntityProcessor url=${sd.fileAbsolutePath}
        field column=Author meta=true name=author/
        field column=Content-Type meta=true name=title/
        !-- field column=title name=title meta=true/ --
        field column=text name=text/
      /entity
      !-- field column=fileLastModified name=date
 dateTimeFormat=-MM-dd'T'hh:mm:ss / --
      field column=fileSize meta=true name=size/
    /entity
  /document
 /dataConfig
 On Fri, Feb 17, 2012 at 9:35 AM, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  thanks gora for your help.
  I installed Maven and downloaded Tika following the guide: But I have an
  errore during the built of Tika about 'tika compiler', and the maven
  installation of Tika is stopped.
 
  there is another way?
  thank you
  a.
 
  2012/2/16 Gora Mohanty g...@mimirtech.com
 
  On 16 February 2012 21:37, alessio crisantemi
  alessio.crisant...@gmail.com wrote:
   here the log:
  
  
   org.apache.solr.handler.dataimport.DataImporter doFullImport
   Grave: Full Import failed
   org.apache.solr.handler.dataimport.DataImportHandlerException:
 'baseDir'
  is
   a required attribute Processing Document # 1
  [...]
 
  The exception message above is pretty clear. You need to define a
  baseDir attribute for the second entity.
 
  However, even if you fix this, the setup will *not* work for indexing
  PDFs. Did you read the URLs that I sent earlier?
 
  Regards,
  Gora
 



Re: problem to indexing pdf directory

2012-02-17 Thread alessio crisantemi
I'm confused now...
So, my last question:
I added this to my solrconfig.xml:


<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">c:\solr\conf\db-config.xml</str>
  </lst>
</requestHandler>


And I wrote my db-config.xml like this:
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="sd"
            processor="FileListEntityProcessor"
            newerThan="'NOW-30DAYS'"
            fileName=".*pdf$"
            baseDir="D:\myfiles"
            recursive="true"
            rootEntity="false"
            transformer="DateFormatTransformer">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="description" name="description" />
        <field column="comments" name="comments" />
        <field column="content_type" name="content_type" />
        <field column="last_modified" name="last_modified" />
      </entity>

      <!-- <field column="fileLastModified" name="date"
                  dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" /> -->
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>
That should work, in your opinion, or do you see an error in this code?
thanks,
alessio



Il giorno 17 febbraio 2012 21:29, Erick Erickson
erickerick...@gmail.comha scritto:

 Sorry, my error! In that case you *do* have to do some fiddling to get
 it all to work.

 Good Luck!
 Erick

 On Fri, Feb 17, 2012 at 3:27 PM, alessio crisantemi
 alessio.crisant...@gmail.com wrote:
  I tried it... but I work with Solr 1.4.1
 
  Il giorno 17 febbraio 2012 15:59, Erick Erickson
  erickerick...@gmail.comha scritto:
 
  You should not have to do anything with Maven, the instructions
  you followed were from 1.4.1 days..
 
  Assuming you're working with a 3.x build, here's a data-config
  that worked for me, just a straight distro. But note a couple of things:
  1 for simplicity, I changed the schema.xml to NOT require
  the id field. You'll have to change this back probably and
  select a good uniqueKey
  2 I had to add this line to solrconfig.xml to find the path:
  lib dir=../../dist/
  regex=apache-solr-dataimporthandler-extras-\d.*\.jar/
  3 If this all works without errors in the Solr log and you still
  can't find anything, be sure you issue a commit.
 
  Best
  Erick
 
  dataConfig
   dataSource name=bin type=BinFileDataSource/
   document
 entity baseDir=/Users/Erick/testdocs fileName=.*pdf name=sd
  processor=FileListEntityProcessor recursive=true
  rootEntity=false
   entity dataSource=bin format=text name=tika-test
  processor=TikaEntityProcessor url=${sd.fileAbsolutePath}
 field column=Author meta=true name=author/
 field column=Content-Type meta=true name=title/
 !-- field column=title name=title meta=true/ --
 field column=text name=text/
   /entity
   !-- field column=fileLastModified name=date
  dateTimeFormat=-MM-dd'T'hh:mm:ss / --
   field column=fileSize meta=true name=size/
 /entity
   /document
  /dataConfig
  On Fri, Feb 17, 2012 at 9:35 AM, alessio crisantemi
  alessio.crisant...@gmail.com wrote:
   thanks gora for your help.
   I installed Maven and downloaded Tika following the guide: But I have
 an
   errore during the built of Tika about 'tika compiler', and the maven
   installation of Tika is stopped.
  
   there is another way?
   thank you
   a.
  
   2012/2/16 Gora Mohanty g...@mimirtech.com
  
   On 16 February 2012 21:37, alessio crisantemi
   alessio.crisant...@gmail.com wrote:
here the log:
   
   
org.apache.solr.handler.dataimport.DataImporter doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException:
  'baseDir'
   is
a required attribute Processing Document # 1
   [...]
  
   The exception message above is pretty clear. You need to define a
   baseDir attribute for the second entity.
  
   However, even if you fix this, the setup will *not* work for indexing
   PDFs. Did you read the URLs that I sent earlier?
  
   Regards,
   Gora
  
 



Re: Solritas: Modify $content in layout.vm

2012-02-17 Thread Erick Erickson
Why do you want to? That is, what are you trying to accomplish by
modifying that variable? You may not really need to...

This seems like an XY problem...

Best
Erick

On Thu, Feb 16, 2012 at 11:06 PM, remi tassing tassingr...@gmail.com wrote:
 Hi all,

 How do we modify the $content variable in the layout.vm file? I
 managed to change other stuff in doc.vm or header.vm but not this one.

 Is there any tutorial on this?

 Remi


Re: distributed deletes working?

2012-02-17 Thread Jamie Johnson
yes committing in the mix.

id field is a UUID.

On Fri, Feb 17, 2012 at 3:22 PM, Mark Miller markrmil...@gmail.com wrote:
 You are committing in that mix right?

 On Feb 17, 2012, at 2:07 PM, Jamie Johnson wrote:

 This was with the cloud-dev solrcloud-start.sh script (after that I've
 used solrcloud-start-existing.sh).

 Essentially I run ./solrcloud-start-existing.sh
 index docs
 kill 1 of the solr instances (using kill -9 on the pid)
 delete a doc from running instances
 restart killed solr instance

 on doing this the deleted document is still lingering in the instance
 that was down.

 On Fri, Feb 17, 2012 at 2:04 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...just tried this with only deletes, and the replica sync'd fine for me.

 Is this with your multi core setup or were you trying with instances?

 On Feb 17, 2012, at 1:52 PM, Jamie Johnson wrote:

 Yes, still seeing that.  Master has 8 items, replica has 9.  So the
 delete didn't seem to work when the node was down.

 On Fri, Feb 17, 2012 at 1:41 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Fri, Feb 17, 2012 at 1:38 PM, Jamie Johnson jej2...@gmail.com wrote:
 Something that didn't work though
 was if a node was down when a delete happened and then comes back up,
 that node still listed the id I deleted.  Is this currently supported?

 Yes, that should work fine.  Are you still seeing that behavior?

 -Yonik
 lucidimagination.com

 - Mark Miller
 lucidimagination.com












 - Mark Miller
 lucidimagination.com













Re: distributed deletes working?

2012-02-17 Thread Mark Miller

On Feb 17, 2012, at 3:56 PM, Jamie Johnson wrote:

 id field is a UUID.

Strange - I was using UUIDs myself in the same test this morning...

I'll try again soon.

- Mark Miller
lucidimagination.com













proper syntax for using sort query parameter in responseHandler

2012-02-17 Thread geeky2
What is the proper syntax for including a sort directive in my responseHandler?

I tried this but got an error:


  <requestHandler name="partItemNoSearch" class="solr.SearchHandler"
                  default="false">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="echoParams">all</str>
      <int name="rows">10</int>
      <str name="qf">itemNo^1.0</str>
      <str name="q.alt">*:*</str>
     *<str name="sort">rankNo desc</str>*
    </lst>
    <lst name="appends">
      <str name="fq">itemType:1</str>
    </lst>
    <lst name="invariants">
      <str name="facet">false</str>
    </lst>
  </requestHandler>


thank you
mark

--
View this message in context: 
http://lucene.472066.n3.nabble.com/proper-syntax-for-using-sort-query-parameter-in-responseHandler-tp3755077p3755077.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solritas: Modify $content in layout.vm

2012-02-17 Thread Erik Hatcher
$content is output of the main template rendered.

To modify what is generated into $content, modify the main template or the 
sub-#parsed templates (which is what you've discovered, looks like) that is 
rendered (browse.vm, perhaps, if you're using the default example setup).  The 
main template that is rendered is specified as v.template (in the /browse 
handler definition in solrconfig.xml, again if you're using the example 
configuration).
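
Roughly, the relevant wiring in the example solrconfig.xml looks like the sketch below (exact defaults vary by Solr version, so treat this as an approximation rather than your actual config):

 <requestHandler name="/browse" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="wt">velocity</str>
     <str name="v.template">browse</str>   <!-- main template: conf/velocity/browse.vm -->
     <str name="v.layout">layout</str>     <!-- layout.vm wraps the rendered template as $content -->
     <!-- query defaults (qf, facets, etc.) omitted -->
   </lst>
 </requestHandler>

So to change what ends up in $content, edit browse.vm (or whatever v.template points to) rather than layout.vm itself.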

Does that help?  If not, let us know what you're trying to do exactly.

Erik




On Feb 16, 2012, at 23:06 , remi tassing wrote:

 Hi all,
 
 How do we modify the $content variable in the layout.vm file? I
 managed to change other stuff in doc.vm or header.vm but not this one.
 
 Is there any tutorial on this?
 
 Remi



Indexing 100Gb of readonly numeric data

2012-02-17 Thread Pedro Ferreira
Hi guys, I'm cross-posting this from the Lucene list as I guess I can get
better help here for this scenario.
Suppose I want to index 100Gb+ of numeric data. I'm not yet sure of the
specifics, but I can expect the following:
- Data is expected to be in one gigantic table. Conceptually, it is like a
  spreadsheet table: rows are objects and columns are properties.
- Values are mostly floating point numbers, and I expect them to be, let's
  say, unique or discrete, or almost randomly distributed
  (1.89868776E+50, 1.434E-12).
- The data is read-only. It will never change.
Now I need to query this data based mostly on range queries on the
columns. Something like:
SELECT * FROM Table WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 == 0)
which basically means: give me all the rows that satisfy these criteria.
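For illustration, a minimal sketch of how such columns might be modelled in a
Solr 3.x schema, assuming hypothetical field names col1/col3 and Trie-encoded
doubles; the query above would then become something like
col1:{1.2E2 TO 1.8E2} OR col3:0 (curly braces give exclusive bounds):

  <!-- sketch only: Trie fields trade a little index size for fast range queries -->
  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>
  <field name="col1" type="tdouble" indexed="true" stored="true"/>
  <field name="col3" type="tdouble" indexed="true" stored="true"/>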
I believe this could be easily done with a standard RDBMS, but I
would like to avoid that route.
While thinking about this, and assuming this could work well with Solr,
there were some things I couldn't answer:
- In this case, it makes total sense to store the data in the index.
  If I index all columns, I might as well have the data right there.
- Does it make any sense to index this whole thing once, while
  offline, and then upload only the index to the servers?
- I'm almost sure I will have to shard the index in some way, and that
  isn't difficult. But what are the possible hardware requirements to
  host this thing? I know this depends on lots of information I didn't
  provide (searches/sec for example), but can someone throw out a number?
  I have completely no idea...

Thanks
--
Pedro Ferreira

mobile: 00 44 7712 557303
skype: pedrosilvaferreira
email: psilvaferre...@gmail.com
linkedin: http://uk.linkedin.com/in/pedrosilvaferreira


Re: Indexing 100Gb of readonly numeric data

2012-02-17 Thread Pedro Ferreira
Ouch... sorry about the format... I have no idea why gmail turned my
text into that...

On Fri, Feb 17, 2012 at 10:07 PM, Pedro Ferreira
psilvaferre...@gmail.com wrote:
 Hi guys, I'm cross posting this from lucene list as I guess I can have
 better help here for this scenario.
 Suppose I want to index 100Gb+ of numeric data. I'm not yet sure the
 specifics, but I can expect the following:
 - data is expected to be in one gigantic table. conceptually, is likea
 spreadsheet table: rows are objects and columns are properties.-
 values are mostly floating point numbers, and I expect them to
 be,let's say, unique or discreet, or almost randomly distributed
 (1.89868776E+50,1.434E-12)- The data is readonly. it will never
 change.
 Now I need to query this data based mostly in range queries on
 thecolumns. Something like:
 SELECT * FROM Table WHERE (Col1  1.2E2 AND Col1  1.8E2) OR (Col3 == 0)
 which is basically give me all the rows that satisfy this criteria.
 I believe this could be easily done with a standard RDBMS, but I
 wouldlike to avoid that route.
 While thinking about this, and assuming this could work well withSolr,
 I had some things I couldn't answer:-
 - In this case, it makes total sense to store the data in the index.
 If I will index all columns, I might as well have the data right
 there.
 - Does it make any sense to index this whole thing once, while
 offline, and then upload only the index to the servers?
 - I'm almost sure I will have to shard the index in some way, and this
 isn't difficult. But what are the possible hardware requirements to
 host this thing? I know this depends on lots of information I didn't
 provide (searches/sec for example), but can someone throw a number? I
 have completely no ideia...

 Thanks
 --
 Pedro Ferreira

 mobile: 00 44 7712 557303
 skype: pedrosilvaferreira
 email: psilvaferre...@gmail.com
 linkedin: http://uk.linkedin.com/in/pedrosilvaferreira



-- 
Pedro Ferreira

mobile: 00 44 7712 557303
skype: pedrosilvaferreira
email: psilvaferre...@gmail.com
linkedin: http://uk.linkedin.com/in/pedrosilvaferreira


Re: proper syntax for using sort query parameter in responseHandler

2012-02-17 Thread Tommaso Teofili
Hi Mark,
Having a look at that requestHandler, it looks OK [1]; are you experiencing
any errors?
If so, did you check the wiki page FieldOptionsByUseCase [2]? Maybe that
field's (rankNo) options contain indexed=false or multiValued=true.
HTH,
Tommaso

[1] : http://wiki.apache.org/solr/CommonQueryParameters#sort
[2] : http://wiki.apache.org/solr/FieldOptionsByUseCase
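
As a sketch (assuming rankNo is an integer rank; the actual field type in your
schema may differ), a sortable declaration in schema.xml would look roughly like:

  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>
  <!-- sorting requires an indexed, single-valued (non-multiValued) field -->
  <field name="rankNo" type="tint" indexed="true" stored="true" multiValued="false"/>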


2012/2/17 geeky2 gee...@hotmail.com

 what is the proper syntax for including sort directive in my
 responseHandler?

 i tried this but got an error:


  requestHandler name=partItemNoSearch class=solr.SearchHandler
 default=false
lst name=defaults
  str name=defTypeedismax/str
  str name=echoParamsall/str
  int name=rows10/int
  str name=qfitemNo^1.0/str
  str name=q.alt*:*/str
 * str name=sortrankNo desc/str*
/lst
lst name=appends
  str name=fqitemType:1/str
/lst
lst name=invariants
  str name=facetfalse/str
/lst
  /requestHandler


 thank you
 mark

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/proper-syntax-for-using-sort-query-parameter-in-responseHandler-tp3755077p3755077.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Solr Wiki and mailing lists

2012-02-17 Thread Lance Norskog
The Apache Solr main page does not mention the mailing lists. The wiki
main page has a broken link. I have had to search my incoming mail to
find out how to unsubscribe to solr-user.

Someone with full access- please fix these problems.

Thanks,

-- 
Lance Norskog
goks...@gmail.com


Re: Solr Wiki and mailing lists

2012-02-17 Thread Artem Lokotosh
To unsubscribe, e-mail: solr-user-unsubscr...@lucene.apache.org

Also you can request a FAQ, e-mail: solr-user-...@lucene.apache.org


On Sat, Feb 18, 2012 at 12:38 AM, Lance Norskog goks...@gmail.com wrote:
 The Apache Solr main page does not mention the mailing lists. The wiki
 main page has a broken link. I have had to search my incoming mail to
 find out how to unsubscribe to solr-user.

 Someone with full access- please fix these problems.

 Thanks,

 --
 Lance Norskog
 goks...@gmail.com



-- 
Best regards,
Artem Lokotosh        mailto:arco...@gmail.com


RE: Improving proximity search performance

2012-02-17 Thread Bryan Loofbourrow
Apologies. I meant to type “1.4 TB” and somehow typed “1.4 GB.” Little
wonder that no one thought the question was interesting, or figured I must
be using Sneakernet to run my searches.



-- Bryan Loofbourrow


  --

*From:* Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
*Sent:* Thursday, February 16, 2012 7:07 PM
*To:* 'solr-user@lucene.apache.org'
*Subject:* Improving proximity search performance



Here’s my use case. I expect to set up a Solr index that is approximately
1.4GB (this is a real number from the proof-of-concept using the real data,
which consists of about 10 million documents, many of significant size, and
making use of the FastVectorHighlighter to do highlighting on the body text
field, which is of course stored, and with termVectors, termPositions, and
termOffsets on).



I no longer have the proof-of-concept Solr core available (our live site
uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical
answer to this question: Will storing that extra information about the
location of terms help the performance of proximity searches?



A significant and important subset of my users make extensive use of
proximity searches. These sophisticated users have found that they are best
able to locate what they want by doing searches about THISWORD within 5
words of THATWORD, or much more sophisticated variants on that theme,
including plenty of booleans and wildcards. The problem I’m facing is
performance. Some of these searches, when common words are used, can take
many minutes, even with the index on an SSD.



The question is, how to improve the performance. It occurred to me as
possible that all of that term vector information, stored for the benefit
of the FastVectorHighlighter, might be a significant aid to the performance
of these searches.



First question: is that already the case? Will storing this extra
information automatically improve my proximity search performance?



Second question: If not, I’m very willing to dive into the code and come up
with a patch that would do this. Can someone with knowledge of the
internals comment on whether this is a plausible strategy for improving
performance, and, if so, give tips about the outlines of what a successful
approach to the problem might look like?



Third question: Any tips in general for improving the performance of these
proximity searches? I have explored the question of whether the customers
might be weaned off of them, and that does not appear to be an option.



Thanks,



-- Bryan Loofbourrow


Using nested entities in FileDataSource import of xml file contents

2012-02-17 Thread Mike O'Leary
Can anybody help me understand the right way to define a data-config.xml file 
with nested entities for indexing the contents of an XML file?

I used this data-config.xml file to index a database containing sample patient 
records:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/bioscope" user="db_user" password=""/>
  <document name="bioscope">
    <entity name="docs" pk="doc_id" query="SELECT doc_id, type FROM bioscope.docs">
      <field column="doc_id" name="doc_id"/>
      <field column="type" name="doc_type"/>
      <entity name="codes" query="SELECT id, origin, type, code FROM bioscope.codes
                                  WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="code_origin"/>
        <field column="type" name="code_type"/>
        <field column="code" name="code_value"/>
      </entity>
      <entity name="notes" query="SELECT id, origin, type, text FROM bioscope.texts
                                  WHERE doc_id='${docs.doc_id}'">
        <field column="origin" name="note_origin"/>
        <field column="type" name="note_type"/>
        <field column="text" name="note_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I would like to do the same thing with an XML file containing the same data as 
is in the database. That XML file looks like this:

<docs>
  <doc id="97634811" type="RADIOLOGY_REPORT">
    <codes>
      <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
      <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
    </codes>
    <texts>
      <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Seventeen year old with cough.</text>
      <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">Normal.</text>
    </texts>
  </doc>

</docs>

I tried using this data-config.xml file, in order to preserve the nested entity 
structure used with the database case:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true"
            forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <entity name="code" processor="XPathEntityProcessor" stream="true"
              forEach="/docs/doc[@id='${doc.doc_id}']/codes/code" url="C:/data/bioscope.xml">
        <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
        <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
        <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      </entity>
      <entity name="note" processor="XPathEntityProcessor" stream="true"
              forEach="/docs/doc[@id='${doc.doc_id}']/texts/text" url="C:/data/bioscope.xml">
        <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
        <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
        <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
      </entity>
    </entity>
  </document>
</dataConfig>

This is wrong, and it fails to index any of the codes and texts blocks in 
the XML file. I'm sure that part of the problem  must be that the xpath 
expressions such as /docs/doc[@id='${doc.doc_id}']/texts/text/@origin fail to 
match anything in the XML file, because when I try the same import without 
nested entities, using this data-config.xml file, the codes and texts 
blocks are also not indexed:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true"
            forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc[@id='${doc.doc_id}']/codes/code"/>
      <field column="note_origin" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text/@type"/>
      <field column="note_text" xpath="/docs/doc[@id='${doc.doc_id}']/texts/text"/>
    </entity>
  </document>
</dataConfig>

However, when I use this data-config.xml file, which doesn't use nested 
entities, all of the fields are included in the index:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document name="bioscope">
    <entity name="doc" processor="XPathEntityProcessor" stream="true"
            forEach="/docs/doc" url="C:/data/bioscope.xml">
      <field column="doc_id" xpath="/docs/doc/@id"/>
      <field column="doc_type" xpath="/docs/doc/@type"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_type" xpath="/docs/doc/codes/code/@type"/>
      <field column="code_value" xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_type" xpath="/docs/doc/texts/text/@type"/>
      <field column="note_text"

Re: how to delta index linked entities in 3.5.0

2012-02-17 Thread AdamLane
Thanks for your thoughts Shawn.  I did notice 3.x tightened things up a lot, and I did
account for it by making sure I had pk defined and columns explicitly
aliased with the same name (and I will make sure the bug text reflects
that).

To help others who are having the same problem, I just found a thread
describing a workaround using group_concat() in MySQL plus a transformer
on the Solr side.  So far this appears to work and also seems to run delta
imports around 10x faster.  The only disadvantage is that the delta import
process doesn't tell you how many rows have changed.  It just says 1 row,
because you are hacking deltaQuery to return a single dummy row and making
deltaImportQuery take in last_index_time and return all rows that have changed.
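
Concretely, the delta side of an entity like the one quoted below ends up
looking roughly like this (a sketch; get_solr_delta is a hypothetical stored
procedure, and the dummy pk column name is arbitrary):

<entity name="my_entity" pk="id"
        query="call get_solr_full();"
        deltaQuery="SELECT 1 AS id"
        deltaImportQuery="call get_solr_delta('${dataimporter.last_index_time}');"
        transformer="RegexTransformer">
    <!-- field mappings as in the quoted example -->
</entity>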

Quote:

The following (MySql) query concatenates 3 lang_code fields from the main
table into one field and multiple emails from a secondary table into
another field:
SELECT u.id,
   u.name,
   IF((u.lang_code1 IS NULL AND u.lang_code2 IS NULL AND
u.lang_code3 IS NULL), NULL,
   CONVERT(CONCAT_WS('|', u.lang_code1, u.lang_code2,
u.lang_code3) USING ascii)) AS multi_lang_codes,
   GROUP_CONCAT(e.email SEPARATOR '|') AS multiple_emails
FROM users_tb u
LEFT JOIN emails_tb e ON u.id = e.id
GROUP BY u.id

The entity in data-config.xml looks something like:
<entity name="my_entity"
        query="call get_solr_full();"
        transformer="RegexTransformer">
    <field name="email" column="multiple_emails" splitBy="\|" />
    <field name="lang_code" column="multiple_lang_codes" splitBy="\|" />
</entity>

Full Thread:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3E

So until the bug is fixed or docs are changed I hope this helps someone else
searching for this same error message.

Adam


--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-delta-index-linked-entities-in-3-5-0-tp3752455p3755453.html
Sent from the Solr - User mailing list archive at Nabble.com.


PointType hard-coded to Doubles?

2012-02-17 Thread Lance Norskog
The PointType seems to be hard-coded to use doubles. Where in the code
does this happen?

-- 
Lance Norskog
goks...@gmail.com