CommonGrams phrase query
Hi, I have made an index using CommonGrams. Now when I query "a b" and explain it, Solr turns it into +MultiPhraseQuery(Contents:"(a a_b) b"). Shouldn't it just be searching a_b? I am asking because even though I am using CommonGrams it's much slower than a normal index which just searches on "a b". Note: Both words are in the words list of CommonGrams. -- Regards, Salman Akram
Re: spell suggest response
Hi Grijesh, Though I use autosuggest I may not get the exact results; the order is not accurate. For example, if I type http://localhost:8080/solr/terms/?terms.fl=spell&terms.prefix=solr&terms.sort=index&terms.lower=solr&terms.upper.incl=true I get results like: solr solr.amp solr.datefield solr.p solr.pdf. But this may not lead to results as accurate as we get in spellchecking. I require suggestions for any word irrespective of whether it is correct or not. Is there anything to be changed in Solr to get suggestions like the ones we get when we type a wrong word in spellchecking? If so please let me know... Regards, satya
Re: spell suggest response
Hi Satya, In this example you are not using spellchecking. I am saying: use the spellcheck component together with the Terms component, so it will give you the spellcheck suggestions also. Then combine both the lists. - Thanx: Grijesh
Re: CommonGrams phrase query
Ok, sorry, it was my fault. I wasn't using CommonGramsQueryFilter for the query, I just had the Filter for indexing. The query seems fine now. On Mon, Jan 17, 2011 at 1:44 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: [...] -- Regards, Salman Akram
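For reference, the fix above pairs a CommonGramsFilterFactory on the index side with a CommonGramsQueryFilterFactory on the query side. A minimal sketch (the field type name and words file name are illustrative, not from the thread):

<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emits unigrams plus common-word bigrams such as a_b -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emits only the bigram where both words are in the words list, so a phrase query on "a b" becomes a single a_b term -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>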
sort problem
Hi guys, I use Solr with the UTF-8 charset and I have a sort problem. For example, I sort on a name field and the results look like: Article Banana Foo aviation brunch ... So my question is, how do I force Solr to ignore case in results? I would like to have results like: Article aviation Banana brunch Foo ... Thanks Philippe
Re: sort problem
Use a lowercase filter to lowercase your data at both index time and search time; that will make it case insensitive. - Thanx: Grijesh
Re: spell suggest response
Hi Grijesh, I added both the TermsComponent and the spellcheck component to the terms requesthandler. When I send a query like

http://localhost:8080/solr/terms?terms.fl=text&terms.prefix=java&rows=7&omitHeader=true&spellcheck=true&spellcheck.q=java&spellcheck.count=20

the result I get is

<response>
  <lst name="terms">
    <lst name="text">
      <int name="java">6</int>
      <int name="javabas">6</int>
      <int name="javas">6</int>
      <int name="javascript">6</int>
      <int name="javac">6</int>
      <int name="javax">6</int>
    </lst>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions"/>
  </lst>
</response>

When I send this

http://localhost:8080/solr/terms?terms.fl=text&terms.prefix=jawa&rows=5&omitHeader=true&spellcheck=true&spellcheck.q=jawa&spellcheck.count=20

I get the result

<response>
  <lst name="terms">
    <lst name="text"/>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="jawa">
        <int name="numFound">20</int>
        <int name="startOffset">0</int>
        <int name="endOffset">4</int>
        <arr name="suggestion">
          <str>java</str>
          <str>away</str>
          <str>jav</str>
          <str>jar</str>
          <str>ara</str>
          <str>apa</str>
          <str>ana</str>
          <str>ajax</str>
          ...
        </arr>
      </lst>
    </lst>
  </lst>
</response>

Now I need to know how to control the ordering of the terms. In the first query the result is in order, and I want only javax, javac, javascript but not javas, javabas. How can it be done?? Regards, satya
Re: sort problem
On 17/01/11 10:32, Grijesh wrote: [...] Thanks, so tell me if I'm wrong... I need to modify my schema.xml to add a lowercase filter and reindex my content?
Re: sort problem
Yes. On Mon, Jan 17, 2011 at 2:44 PM, Philippe VINCENT-ROYOL vincent.ro...@gmail.com wrote: [...] -- Regards, Salman Akram
latest patches and big picture of search grouping
I need to dive into search grouping / field collapsing again. I've seen there are lots of issues about it now. Can someone point me to the minimum patches I need to run this feature in trunk? I want to see the code of the most optimised version and what's being done in distributed search. I think I need these: https://issues.apache.org/jira/browse/SOLR-2068 https://issues.apache.org/jira/browse/SOLR-2205 https://issues.apache.org/jira/browse/SOLR-2066 But I'm not sure if I am missing anything else. By the way, I think the current implementation of group searching is totally different from what it was before, when you could choose normal or adjacent collapse. Can someone give me a quick big picture of the current implementation (I will trace the code anyway, but it's just to get an idea)? Is there still a double trip? Thanks in advance.
Re: exception obtaining write lock on startup
In that case why is there a separate lock factory, SingleInstanceLockFactory? On Fri, Dec 31, 2010 at 6:25 AM, Lance Norskog goks...@gmail.com wrote: This will not work. At all. You can only have one Solr core instance changing an index. On Thu, Dec 30, 2010 at 4:38 PM, Tri Nguyen tringuye...@yahoo.com wrote: Hi, I'm getting this exception when I have 2 cores as masters. It seems like one of the cores obtains a lock (file) and then the other tries to obtain the same one. However, the first one is not deleted. How do I fix this?

Dec 30, 2010 4:34:48 PM org.apache.solr.handler.ReplicationHandler inform
WARNING: Unable to get IndexCommit on startup
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@..\webapps\solr\tnsolr\data\index\lucene-fe3fc928a4bbfeb55082e49b32a70c10-write.lock
  at org.apache.lucene.store.Lock.obtain(Lock.java:85)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1565)
  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1421)
  at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:191)
  at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
  at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
  at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
  at org.apache.solr.handler.ReplicationHandler.inform(ReplicationHandler. [...]

Tri -- Lance Norskog goks...@gmail.com
Re: Single value vs multi value setting in tokenized field
No, I have both: a single field (for free-form text search) and individual fields (for directed search). I already duplicate the data and that's not a problem; disk space is cheap. What I wanted to know was whether it is best to make the single field multiValued=true or not. That is, should my 'content' field hold multiple values like:

  some description maybe a paragraph or two
  a product or service title
  tag1 tag2
  feature1 feature2

or would it be better to make it a concatenated, single-value field like:

  some description maybe a paragraph or two a product or service title tag1 tag2 feature1 feature2

My indexing seems to take longer than most; it takes about 2 1/2 hours to index 3.5 million records. I have a colleague who, in a separate project, is indexing 70 million records in about 4 hours, albeit with a much simpler schema. So I'm trying to see if this could be a factor in my indexing performance. I also wanted to know what impact, in general, not just in this situation, using a multiValued field versus a single-valued field has on search results. I would have thought that having to support a free-form text search and a field (directed) search would be a common problem, and was just looking for advice.
solrconfig.xml settings question
In the Wiki, in the book by Smiley and Pugh, and in the comments inside the solrconfig.xml file itself, the various settings are always discussed in the context of a blended-use Solr index. By that I mean it is assumed you are indexing and querying from the same Solr instance. However, if I have a master-slave setup I should be able to optimize the master for indexing data, and optimize the slave for querying the data. Does anyone have links to information that talks about this? I want to index as furiously as possible into one Solr instance without regard to the impact it will have on queries, and to query on another Solr instance that only has to worry about replication, but not constant add/update/delete/commit activity. I want my solrconfig settings to be as optimal as possible. Links, comments, references to previous forum threads, any and all feedback is appreciated. Thanks, Ken
Re: boilerpipe solr tika howto please
Thanks Ken, this is what I wanted to know. I'm not very familiar with this kind of modification; however, I will try to do it and ask you for information in case of need. regards, Arno On 14.01.2011 18:04, Ken Krugler wrote: Hi Arno, On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote: Hello, I would like to use BoilerPipe (a very good program which cleans HTML content of surplus clutter). I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from Solr, am I right? How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (with org.apache.solr.handler.extraction.ExtractingRequestHandler)? Or do I need to modify some code inside Solr? I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration) - is it the right way? You need to add the BoilerpipeContentHandler into Tika's content handler chain. Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk) the TikaEntityProcessor.getHtmlHandler() method. I'd try something like: return new BoilerpipeContentHandler(new ContentHandlerDecorator( [...] Though from a quick look at that code, I'm curious why it doesn't use BodyContentHandler, versus the current ContentHandlerDecorator. -- Ken Krugler http://bixolabs.com
Re: solrconfig.xml settings question
[...] Besides the caches described here http://search-lucene.com/m/DBdghoZPh01 , ramBufferSizeMB can be different on slave and master.
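To make that concrete, a sketch of how the two solrconfig.xml files might diverge (these are standard Solr settings; the values are illustrative, not recommendations).

On the master (indexing-heavy), spend memory on the indexing buffer:

<indexDefaults>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergeFactor>20</mergeFactor>
</indexDefaults>

On the slave (query-heavy), keep ramBufferSizeMB small and spend the memory on caches instead:

<query>
  <filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>
  <documentCache class="solr.LRUCache" size="16384" initialSize="4096"/>
  <useColdSearcher>false</useColdSearcher>
</query>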
Clustering using Carrot2 clustering component
Dear All, Can anyone tell me how to use the Carrot2 clustering component to cluster search results? What are its dependencies? What kind of changes are required in solrconfig.xml or anywhere else? Thanks! Isha
FilterQuery reaching maxBooleanClauses, alternatives?
Hi List, we are sometimes reaching the maxBooleanClauses limit (which is 1024 per default). The query we use looks like: ?q=name:Stefan&fq=5 10 12 15 16 [...] where the values are IDs of users which the current user is allowed to see - so far, nothing special. Sometimes the filter query includes user IDs from a different type of user (let's say we have TypeA and TypeB) where TypeB contains more than 2k users. Then we hit the given limit. Now the question is: is it possible to enable a filter/function/feature in Solr which makes it possible that we don't need to send over all the user IDs of TypeB users? Just to tell Solr to include all TypeB users in the (given) filter query (or something in that direction)? If so, what's the name of this filter/function/feature? :) Don't hesitate to ask if my question/description is weird! Thanks Stefan
RE: sort problem
Haha, yes, you're not wrong. The field you are sorting on should be a fieldtype that has the lowercase filter applied. You'll probably have to re-index your data, unless you happen to already have such a field (via copyField, perhaps). Brad -----Original Message----- From: Salman Akram Sent: January-17-11 5:47 AM To: solr-user@lucene.apache.org Subject: Re: sort problem [...]
Re: Single value vs multi value setting in tokenized field
Functionally, the two options are equivalent, and I've never really heard of any speed difference. Assuming it's not that big a programming change, though, you probably want to just test... Do be aware of one subtle difference in the approaches, though. If the increment gap is != 1 then multiValued fields will NOT be functionally equivalent, because phrases won't match across boundaries quite the same way. Which is often desirable behavior but may not be in your situation. Best Erick On Mon, Jan 17, 2011 at 5:50 AM, kenf_nc ken.fos...@realestate.com wrote: [...]
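To make the subtle difference concrete, a sketch (the field and type names are illustrative): with positionIncrementGap=100, the last token of one value and the first token of the next value of a multiValued field are 100 positions apart, so a phrase query spanning two values won't match, while it would match the concatenated single-value variant.

<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<!-- with values ["tag1 tag2", "feature1 feature2"], the phrase query "tag2 feature1" will not match -->
<field name="content" type="text_gap" indexed="true" stored="true" multiValued="true"/>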
Re: FilterQuery reaching maxBooleanClauses, alternatives?
You can index a field which holds the user type, e.g. UserType (possible values can be TypeA, TypeB and so on...), and then you can just do ?q=name:Stefan&fq=UserType:TypeB BTW you can even increase the size of maxBooleanClauses, but in this case that is definitely not a good idea. You would also hit the max size of an HTTP GET, so you would have to change it to POST. Better to handle it with a new field. On Mon, Jan 17, 2011 at 5:57 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: [...] -- Regards, Salman Akram
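For completeness, a sketch of the extra field (the name is the one from the example query above; the string type keeps the value unanalyzed):

<field name="UserType" type="string" indexed="true" stored="false"/>

A nice side effect of this design: the filter stays constant per user type, so Solr can reuse the cached filter from its filterCache across requests instead of re-evaluating thousands of ID clauses.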
Re: sort problem
Note two things: 1) the lowercase filter is NOT applied to the STORED data, so the display will still have the original case although the sorting should be what you want. 2) you should NOT be sorting on a tokenized field. Use something like KeywordTokenizer followed by the lowercase filter. String types don't go through filters, as I remember. Best Erick On Mon, Jan 17, 2011 at 7:57 AM, Brad Dewar bde...@stfx.ca wrote: [...]
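A sketch of such a dedicated sort field (names are illustrative; the original name field stays as-is for display and copyField feeds the sort-only variant):

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the whole value as a single token, then lowercase it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="name" dest="name_sort"/>

Sorting then uses &sort=name_sort asc while displaying the stored, original-case name field.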
Re: FilterQuery reaching maxBooleanClauses, alternatives?
Thanks Salman, talking with others about problems really helps. Adding another filter query is a bit too much - but combining both is working fine! Couldn't see the wood for the trees =) Thanks, Stefan On Mon, Jan 17, 2011 at 2:07 PM, Salman Akram salman.ak...@northbaysolutions.net wrote: [...]
Re: FilterQuery reaching maxBooleanClauses, alternatives?
You are welcome. By new field I meant: if you don't have a field for UserType already. On Mon, Jan 17, 2011 at 6:22 PM, Stefan Matheis matheis.ste...@googlemail.com wrote: [...] -- Regards, Salman Akram
Re: Tika Update, no Data
Hey! Thanks a lot, nice tip.. works fine.. But I have one problem too: indexing ZIPs. I tried: curl "http://192.168.105.66:8983/solr/update/extract?literal.id=zip&uprefix=attr_&commit=true" -F myfile@constellio_standalone-1.0.zip and I get: Warning: Illegally formatted input field! curl: option -F: is badly used here curl: try 'curl --help' or 'curl --manual' for more information service@joa-Desktop:~/Downloads$ Maybe you have an idea?
Re: Tika Update, no Data
Missing the = char between myfile and @filename.ext? On Mon, Jan 17, 2011 at 2:47 PM, Jörg Agatz joerg.ag...@googlemail.com wrote: [...]
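For the record, the corrected command (same values as in the thread, only the = added and the URL quoted so the shell keeps the & characters):

curl "http://192.168.105.66:8983/solr/update/extract?literal.id=zip&uprefix=attr_&commit=true" -F "myfile=@constellio_standalone-1.0.zip"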
Re: Tika Update, no Data
Ohh, you're right.. embarrassing! I have tried it and it works, but it seems it doesn't work perfectly: the txt documents inside the ZIP are not indexed, only the names of the documents inside the zip.. King
CommonGrams and SOLR-1604
Hi, I am trying to use CommonGrams with the SOLR-1604 patch but it doesn't seem to work. If I don't add {!complexphrase} it uses CommonGramsQueryFilterFactory and proper bi-grams are made, but of course it doesn't use this patch. If I add {!complexphrase} it simply does it the old way, i.e. ignores CommonGrams. Does anyone know how to combine both these features? Also, once they are combined (hopefully they will be), would phrase proximity search work fine? Thanks -- Regards, Salman Akram
resetting the statistics
Hi everybody, Is it possible to reset Solr statistics without restarting Solr or reloading cores? According to the thread here http://osdir.com/ml/solr-user.lucene.apache.org/2010-03/msg01078.html this was not possible in March 2010. I am wondering if something like this has been implemented in the meanwhile. Thanks, roxana
spellchecking even when the keyword is correct....
Hi All, can we get spellchecking results even when the keyword is spelled correctly? Spellchecking gives suggestions only for wrong keywords; can't we get similar and near words of the keyword even though spellcheck.q is correct.. As an example,

http://localhost:8080/solr/spellcheck?q=java&spellcheck=true&spellcheck.count=5

the result will be

1)
<response>
  <lst name="spellcheck">
    <lst name="suggestions"/>
  </lst>
</response>

Can we get the result as

2)
<response>
  <lst name="spellcheck">
    <lst name="suggestions">
      <str>javax</str>
      <str>javac</str>
      <str>javabean</str>
      <str>javascript</str>
    </lst>
  </lst>
</response>

NOTE: all the keywords in the 2nd result are in the index... Regards, satya
partitioning documents with fields
Hi, I'm crawling different intranets, so I developed a Nutch plugin to add a static field for each of these crawls. I now have my documents in Solr with their specific crawl field. If I search within Solr I can see my documents being returned with that field. The field definition in the schema is:

<field name="crawl" type="string" stored="true" indexed="true"/>

I'd like to put a checkbox in my websearch app to choose which partition to search in. So I thought I'd implement it by simply using:

/select?indent=on&version=2.2&q=crawl%3Avalue+AND+query

but nothing is returned. I also just tried crawl:value, which I'd expect to return all the documents from that crawl, but no results are sent back. As the field is indexed and stored, and I can see the documents owning that field in normal query results, what could I be missing? -- Claudio Martella, TIS innovation park, claudio.marte...@tis.bz.it
Re: partitioning documents with fields
String fields are unanalyzed, so case matters. Are you sure you're not using a different case? (Try KeywordTokenizer + LowercaseFilter if you want these normalized to, say, lower case.) If that isn't the problem, could we see the results if you add &debugQuery=on to your URL? That often helps diagnose the problem. Take a look at your solr/admin page, schema browser, to examine the actual contents of the crawl field and see if they're really what you expect. Best Erick On Mon, Jan 17, 2011 at 11:59 AM, Claudio Martella claudio.marte...@tis.bz.it wrote: [...]
Re: partitioning documents with fields
Thanks for your answer. Yes, the schema browser shows that the field contains the right values, as I expect. From debugQuery=on I see there must be some problem though:

<str name="rawquerystring">crawl:DIGITALDATA</str>
<str name="querystring">crawl:DIGITALDATA</str>
<str name="parsedquery">+DisjunctionMaxQuery((contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1) DisjunctionMaxQuery((contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1)</str>
<str name="parsedquery_toString">+(contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1 (contentEN:crawl (digitaldata crawldigitaldata)^0.8 | title:crawl (digitaldata crawldigitaldata)^1.2 | url:crawl digitaldata^1.5 | contentDE:crawl (digitaldata crawldigitaldata)^0.8 | contentIT:crawl (digitald crawldigitald)^0.8 | anchor:crawl:DIGITALDATA^1.5)~0.1</str>

It looks like there's some problem with my dismax query handler. It doesn't recognize the query in the colon (fielded) format. Here's the handler definition:

<requestHandler name="/content" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="pf">title^1.2 anchor^1.5 url^1.5 contentEN^0.8 contentIT^0.8 contentDE^0.8</str>
    <str name="qf">title^1.2 anchor^1.5 url^1.5 contentEN^0.8 contentIT^0.8 contentDE^0.8</str>
    <float name="tie">0.1</float>
    <bool name="hl">true</bool>
    <str name="hl.fl">title url content anchor</str>
    <int name="hl.fragsize">150</int>
    <int name="hl.snippets">3</int>
    <bool name="hl.mergeContiguous">true</bool>
  </lst>
</requestHandler>

On 1/17/11 6:06 PM, Erick Erickson wrote: [...] -- Claudio Martella, TIS innovation park
Re: partitioning documents with fields
It looks like there's some problem with my dismax query handler. It doesn't recognize the query with the colon format. Here's the handler definition: It is expected behavior of dismax. You can append/use defType=lucene for colon format.
Re: what would cause large numbers of executeWithRetry INFO messages?
I am facing the exact same issue. Did you find out the root cause for this? Please let me know any information you have.
Re: partitioning documents with fields
As Ahmet says, this is what dismax does. You could also append a filter query (fq=crawl:DIGITALDATA) to your query. eDismax supports fielded queries, see: https://issues.apache.org/jira/browse/SOLR-1553 This is already in the trunk and 3.x code lines, I'm pretty sure. Best Erick On Mon, Jan 17, 2011 at 12:15 PM, Claudio Martella claudio.marte...@tis.bz.it wrote: [...]
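Both suggestions in request form (a sketch; the crawl value is the one from the thread):

/select?q=some+search+terms&fq=crawl:DIGITALDATA        (filter query; works with the dismax handler)
/select?q=crawl:DIGITALDATA&defType=lucene              (fielded query via the lucene query parser)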
RE: Spell Checking a multi word phrase
Camden, You may also want to be aware that there is a new feature added to Spell Check's collate functionality that guarantees the collations will return hits. It is also able to return more than one collation and tell you how many hits each one would result in if re-queried. This might do the same thing you're trying to do using shingles, but with more accuracy and less work. For info, look at spellcheck.collate, spellcheck.maxCollations, spellcheck.maxCollationTries and spellcheck.collateExtendedResults on the component's wiki page: http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.collate This feature is committed to 3.x and 4.x and is available as a patch for 1.4.1 (here: https://issues.apache.org/jira/browse/SOLR-2010). James Dyer, Ingram Content Group

-----Original Message----- From: Camden Daily Sent: Monday, January 17, 2011 1:01 PM To: solr-user@lucene.apache.org Subject: Spell Checking a multi word phrase

Hello all, I'm pretty new to Solr and trying to set up a spell checker that can handle entire phrases. My goal would be to have something that could offer a suggestion of "united states" for a query of "untied stats". I have a very large index, and I've worked a bit with creating shingles for the spelling index. The problem I'm running into now is that the SpellCheckComponent always tokenizes the query that I pass to it. For example, a query like this:

http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&debugQuery=on

The debug information shows me that the parsed query is PhraseQuery(text:"untied stats"), but I receive the spelling suggestions for "untied" and "stats" separately. From what I understand, this is not a case where I would want to collate; I simply want the entire phrase treated as one token. I found the following post after much searching that suggests setting up a custom QueryConverter: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200810.mbox/%3c1224516331.3820.119.ca...@localhost.localdomain.tld%3E Does anyone know if that would be required? I had hoped to avoid Java code entirely with Solr (I haven't used Java in a very long time), but if I do need to set up the 'MultiWordSpellingQueryConvert' class, would anyone be able to give me some tips on exactly how I would add that functionality to Solr? Relevant configs below:

solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spellShingle</str>
    <str name="spellcheckIndexDir">./spellShingle</str>
    <str name="queryAnalyzerFieldType">textSpellShingle</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

schema.xml:

<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

(I had thought setting the KeywordTokenizer for the query analyzer would keep it from being tokenized, but it doesn't seem to make any difference.) -Camden Daily
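For illustration, a request exercising the collate parameters described above might look like this (a sketch; the counts are arbitrary):

http://localhost:8080/solr/spell?q=untied\stats&spellcheck=true&spellcheck.collate=true&spellcheck.maxCollations=3&spellcheck.maxCollationTries=10&spellcheck.collateExtendedResults=true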
RE: spellchecking even the key is true....
Add spellcheck.onlyMorePopular=true to your query and I think it'll do what you want. See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular for more info. One caveat: if you use spellcheck.collate, this will likely result in useless, nonsensical collations most of the time. James Dyer, Ingram Content Group -----Original Message----- From: satya swaroop Sent: Monday, January 17, 2011 10:32 AM To: solr-user@lucene.apache.org [...]
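Applied to the example from the question, that would be (sketch):

http://localhost:8080/solr/spellcheck?q=java&spellcheck=true&spellcheck.count=5&spellcheck.onlyMorePopular=true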
Re: Spell Checking a multi word phrase
James, Thank you, but I'm not sure that will work for my needs. I'm very interested in contextual spell checking. Take for example the author "stephenie meyer". "stephenie" is a far less popular spelling than "stephanie", but in this context it's the correct option. I feel like shingles with an untokenized query string would be able to catch this, but I can't find many examples of people attempting this. On Mon, Jan 17, 2011 at 2:19 PM, Dyer, James james.d...@ingrambook.com wrote: [...] -Camden Daily
RE: solrj http client 4
Hi Stevo, Thanks for reviewing the Maven POMs in LUCENE-2657 - I appreciate it. "In those poms, not all modules have explicit version and groupId which is a bad practice." Really? According to the POM best practices section in Sonatype's Maven book http://www.sonatype.com/books/mvnref-book/reference/pom-relationships-sect-pom-best-practice.html, inheriting version and groupId is standard and acceptable. However, since the Lucene/Solr source tree contains two groupIds (org.apache.lucene and org.apache.solr), I agree that all modules should have an explicit groupId, and you're right: several of the aggregator POMs don't have an explicit groupId. I'll fix this. But I don't think it's a bad practice to inherit the version from the parent POM. All Lucene and Solr modules have synchronized versions - it doesn't make sense for them to be specified independently of the whole project. "Also some parent references contain invalid default (../pom.xml) relativePath - path to their parent pom.xml." AFAICT, the default relativePath concept no longer exists (as of Maven 2.2+). That is, the parent POM resolution method uses the explicit relativePath if specified, then the local repository -- ../pom.xml is never used unless explicitly specified. (I don't know this for a fact; I just found that I had to mvn install before parent POM changes became visible to child POMs, even when the parent POM location was in the parent directory.) That said, I agree it would be useful to have explicit relativePaths - I'll add them. "Paths to build directories look suspicious to me. The lucene-bdb module references the missing library com.sleepycat:berkeleydb:jar:4.7.25 - I see lib/db-4.7.25.jar; if it's supposed to be installed in a local repository then a pom would be handy." Run mvn -N -P bootstrap install from the top level to install non-mavenized dependencies into your local repository. "Wiki page http://wiki.apache.org/solr/HowToContribute references this http://markmail.org/message/yb5qgeamosvdscao mail but the files (.classpath) in the archives attached to that email are very outdated. The eclipse target in the base ant build script generates .classpath and .settings, so it seems the mentioned wiki page is outdated too." I agree, this should be changed. Go for it! Steve
RE: Spell Checking a multi word phrase
Camden, Have you seen Smiley & Pugh's Solr book? They describe something very similar to what you're trying to do on p180ff. The difference seems to be that they use a field that only has a couple of terms, so they don't bother with shingles. The book makes a big point about using spellcheck.q in this case in order to get the analysis right. I'm not sure if this is the solution but I thought I'd mention it. I never tried spell checking this way because it seemed very limited and possibly quite expensive. James Dyer, Ingram Content Group -----Original Message----- From: Camden Daily Sent: Monday, January 17, 2011 1:41 PM To: solr-user@lucene.apache.org Subject: Re: Spell Checking a multi word phrase [...]
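In request form, the spellcheck.q approach from the book would look roughly like this (a sketch): q carries the real query, while spellcheck.q is run through the spellcheck field's own query analyzer - the KeywordTokenizer from the config above - so the phrase reaches the spell checker as a single token:

http://localhost:8080/solr/spell?q=untied+stats&spellcheck=true&spellcheck.q=untied+stats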
what is the diff between katta and solrcloud?
Are their goals fundamentally different at all, or are they just different approaches to solving the same problem (sharding)? Can someone give a technical review? Thanks, --Sean
Does field collapsing (with facet) reduce performance?
Just wanted to know how efficient field collapsing is. And if there is a performance penalty, how big is it likely to be? I'm interested in using field collapsing with faceting. Thanks.
Any way to query by offset?
Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
Re: Any way to query by offset?
Have you seen the start and rows parameters? If they don't work, perhaps you could explain what you need that they don't provide. Best Erick On Mon, Jan 17, 2011 at 4:58 PM, 5 Diamond IT i...@smallbusinessconsultingexperts.com wrote: Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
Re: Any way to query by offset?
I think Steve wants the 1000th, 2000th and 3000th documents from the query, and since there's no way to get them in a single request, he's constrained to executing three queries with rows=1 and start set to 1000, 2000 and 3000 respectively.

Have you seen the start and rows parameters? If they don't work, perhaps you could explain what you need that they don't provide. Best Erick On Mon, Jan 17, 2011 at 4:58 PM, 5 Diamond IT i...@smallbusinessconsultingexperts.com wrote: Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
Re: Any way to query by offset?
I want to start at rows 1000, 2000, and 3000 and retrieve those 3 rows ONLY from the result set of whatever search was used. Yes, I can do 3 queries with start=1000 and rows=1, etc., but I want ONE query to get those 3 rows from the result set. It's the poor man's way of doing price buckets the way I want them to be. So, what I need that they do not provide is the ability to pull those 3 rows out of the result set in one query. I was hoping for a function, a parser that supported this perhaps, some hidden field I am not aware of that I could simply match on, any trick that would work.

On Jan 17, 2011, at 6:13 PM, Erick Erickson wrote: Have you seen the start and rows parameters? If they don't work, perhaps you could explain what you need that they don't provide. Best Erick On Mon, Jan 17, 2011 at 4:58 PM, 5 Diamond IT i...@smallbusinessconsultingexperts.com wrote: Say I do a query that matches 4000 documents. Is there a query syntax or parser that would allow me to say retrieve offsets 1000, 2000, 3000? I would prefer to not do multiple starts and limit 1's. Thanks in advance. Steve
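For illustration, the three-request fallback looks like this (host, port, query and sort here are placeholders; the sort must be identical across the three calls so the offsets refer to the same ordering):

    http://localhost:8983/solr/select?q=*:*&sort=price+asc&start=1000&rows=1
    http://localhost:8983/solr/select?q=*:*&sort=price+asc&start=2000&rows=1
    http://localhost:8983/solr/select?q=*:*&sort=price+asc&start=3000&rows=1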
Re: Does field collapsing (with facet) reduce performance?
There is always CPU and RAM involved for every nice component you use. Just how big the penalty is depends completely on your hardware, index and type of query, and under heavy load the numbers will change. Since we don't know your situation and it's hard to predict without benchmarks, you should really do the tests yourself.

Just wanted to know how efficient field collapsing is. And if there is a performance penalty, how big is it likely to be? I'm interested in using field collapsing with faceting. Thanks.
Re: Is deduplication possible during Tika extract?
In my opinion it should work for every update handler. If you're really sure your configuration is fine and it still doesn't work, you might have to file an issue. Your configuration looks alright, but don't forget you've configured overwriteDupes=false!

Hello, here is an excerpt of my solrconfig.xml:

    <requestHandler name="/update/extract"
                    class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                    startup="lazy">
      <lst name="defaults">
        <str name="update.processor">dedupe</str>
        <!-- All the main content goes into "text"... if you need to return
             the extracted text or do highlighting, use a stored field. -->
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>

and

    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">text</str>
        <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Deduplication works when I use only /update but not when Solr does an extract with Tika! Is deduplication possible during Tika extract? Thanks in advance, Arno
NRT
How is NRT doing, being used in production? Which Solr is it in? And is there built in Spatial in that version? How is Solr 4.x doing? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Does field collapsing (with facet) reduce performance?
I understand that the specific figures differ for everybody. I just wanted to see if anyone who has used this feature could share their experience. A ballpark figure -- e.g. 50% slowdown or 10 times slower -- would be helpful.

--- On Mon, 1/17/11, Markus Jelsma markus.jel...@openindex.io wrote: From: Markus Jelsma markus.jel...@openindex.io Subject: Re: Does field collapsing (with facet) reduce performance? To: solr-user@lucene.apache.org Cc: Andy angelf...@yahoo.com Date: Monday, January 17, 2011, 7:27 PM There is always CPU and RAM involved for every nice component you use. Just how big the penalty is depends completely on your hardware, index and type of query, and under heavy load the numbers will change. Since we don't know your situation and it's hard to predict without benchmarks, you should really do the tests yourself. Just wanted to know how efficient field collapsing is. And if there is a performance penalty, how big is it likely to be? I'm interested in using field collapsing with faceting. Thanks.
Re: Spell Checking a multi word phrase
James,

Thanks, the spellcheck.q was exactly what I needed to be using!

-Camden

On Mon, Jan 17, 2011 at 3:54 PM, Dyer, James james.d...@ingrambook.com wrote: Camden, Have you seen Smiley & Pugh's Solr book? They describe something very similar to what you're trying to do on p180ff. The difference seems to be that they use a field that only has a couple of terms, so they don't bother with shingles. The book makes a big point about using spellcheck.q in this case in order to get the analysis right. I'm not sure if this is the solution, but I thought I'd mention it. I never tried spell checking this way because it seemed very limited and possibly quite expensive. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311
Re: Multi-word exact keyword case-insensitive search suggestions
No other way around to fit this requirement?

On Sat, Jan 15, 2011 at 10:01 AM, Chamnap Chhorn chamnapchh...@gmail.com wrote:

Ahh, thanks guys for helping me! For Adam's solution, it doesn't work for me. Here are my field, fieldType, and Solr query:

    <fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
                outputUnigrams="true" outputUnigramIfNoNgram="false"/>
      </analyzer>
    </fieldType>

    <field name="keyphrase" type="text_keyword" indexed="true" stored="false" multiValued="true"/>

http://localhost:8081/solr/select?q=printing%20house&qf=keyphrase&debugQuery=on&defType=dismax

    <str name="parsedquery">+((DisjunctionMaxQuery((keyphrase:smart)) DisjunctionMaxQuery((keyphrase:mobile)))~2) ()</str>
    <str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2) ()</str>
    <lst name="explain"/>

The result is not found.

For Erick's solution, it works for me. However, I can't put a filter query, since it's part of full-text search. If I put fq, it would just return documents that match the query exactly. I want to show documents that match exactly at the top, and below them the documents that match partially. The problem is that when the user searches one word (e.g. "printing" out of the keyword "printing house"), that document is also included in the search results. The other problem is that if the user searches in the reverse order (e.g. "house printing"), it's also found.

Cheers

On Sat, Jan 15, 2011 at 3:31 AM, Erick Erickson erickerick...@gmail.com wrote:

This might work: define your field to use WhitespaceTokenizer and LowerCaseFilterFactory. Use a filter query referencing this field. If you wanted the words to appear in their exact order, you could just define the pf field in your dismax.

Best,
Erick

On Thu, Jan 13, 2011 at 8:01 PM, Estrada Groups estrada.adam.gro...@gmail.com wrote:

Ahhh... the fun of open source software ;-). Requires a ton of trial and error! I found what worked for me and figured it was worth passing it along. If you don't mind... when you sort everything out on your end, please post results for the rest of us to take a gander at. Cheers, Adam

On Jan 13, 2011, at 9:08 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote:

Thanks for your reply. However, it doesn't work for my case at all. I think it's a problem with the query parser or something else. It forces me to put double quotes around the search query in order to get results found.

    <str name="rawquerystring">sim 010</str>
    <str name="querystring">sim 010</str>
    <str name="parsedquery">+DisjunctionMaxQuery((keyphrase:sim 010)) ()</str>
    <str name="parsedquery_toString">+(keyphrase:sim 010) ()</str>

    <str name="rawquerystring">smart mobile</str>
    <str name="querystring">smart mobile</str>
    <str name="parsedquery">+((DisjunctionMaxQuery((keyphrase:smart)) DisjunctionMaxQuery((keyphrase:mobile)))~2) ()</str>
    <str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2) ()</str>

The intent here is to do a full-text search, part of which is to search the keyword field, so I can't put quotes on it.

On Thu, Jan 13, 2011 at 10:30 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Hi, the following seems to work pretty well.
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
                outputUnigrams="true" outputUnigramIfNoNgram="false"/>
      </analyzer>
    </fieldType>

    <!-- A text field that uses WordDelimiterFilter to enable splitting and
         matching of words on case-change, alpha numeric boundaries, and
         non-alphanumeric chars, so that a query of "wifi" or "wi fi" could
         match a document containing "Wi-Fi". Synonyms and stopwords are
         customized by external files, and stemming is enabled. The attribute
         autoGeneratePhraseQueries="true" (the default) causes words that get
         split to form phrase queries. For example, WordDelimiterFilter
         splitting text:pdp-11 will cause the parser to generate
         text:"pdp 11" rather than (text:PDP OR text:11). NOTE:
         autoGeneratePhraseQueries="true" tends to not work well for non
         whitespace delimited languages. -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
                ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a
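To make Erick's earlier suggestion in this thread concrete, here is a minimal sketch of a whitespace-tokenized, lowercased field you could target with a filter query or as a dismax pf field. The field and type names are placeholders, not anything from the thread:

    <fieldType name="text_exact_ci" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- split on whitespace only, then lowercase: case-insensitive but
             otherwise exact tokens, preserving word order for phrase matching -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="keyphrase_exact" type="text_exact_ci" indexed="true" stored="false" multiValued="true"/>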
Re: NRT
How is NRT doing, being used in production?

It works and there are not any lingering bugs, as it's been available for quite a while.

Which Solr is it in?

Per-segment field cache is used transparently by Solr; IndexWriter.getReader is what's not used yet. I'm not sure where per-segment faceting is at.

And is there built in Spatial in that version?

Spatial is independent of NRT?

On Mon, Jan 17, 2011 at 4:56 PM, Dennis Gearon gear...@sbcglobal.net wrote: How is NRT doing, being used in production? Which Solr is it in? And is there built in Spatial in that version? How is Solr 4.x doing? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web
Solr itself does all three things. There is no need for Nutch - that is needed for crawling web sites, not file systems (as the original question specifies). Solr operates as a web service, running in any Java servlet container. Detecting changes to files is more tricky: there is no Windows implementation of a real-time file update monitor available for Solr, so you would have to implement that yourself. Otherwise you can poll the file system and re-index altered files.

On Fri, Jan 14, 2011 at 4:54 AM, Markus Jelsma markus.jel...@openindex.io wrote: Nutch can crawl the file system as well. Nutch 1.x can also provide search, but this is delegated to Solr in Nutch 2.x. Solr can provide the search and Nutch can provide Solr with content from your intranet. On Friday 14 January 2011 13:17:52 Cathy Hemsley wrote: Hi, Thanks for suggesting this. However, I'm not sure a 'crawler' will work, as the various pages are not necessarily linked (it's complicated: basically our intranet is a dynamic and managed collection of independently published web sites, and users find information using categorisation and/or text searching), so we need something that will index all the files in a given folder, rather than follow links like a crawler. Can Nutch do this? As well as the other requirements below? Regards Cathy On 14 January 2011 12:09, Markus Jelsma markus.jel...@openindex.io wrote: Please visit the Nutch project. It is a powerful crawler and can integrate with Solr. http://nutch.apache.org/

Hi Solr users, I hope you can help. We are migrating our intranet web site management system to Windows 2008 and need a replacement for Index Server to do the text searching. I am trying to establish if Lucene and Solr is a feasible replacement, but I cannot find the answers to these questions: 1. Can Solr be set up to recursively index a folder containing an indeterminate and variable large number of subfolders, containing files of all types: XML, HTML, PDF, DOC, spreadsheets, powerpoint presentations, text files etc. If so, how? 2. Can Solr be queried over the web and return a list of files that match a search query entered by a user, and also return the abstracts for these files, as well as 'hit highlighting'. If so, how? 3. Can Solr be run as a service (like Index Server) that automatically detects changes to the files within the indexed folder and updates the index? If so, how? Thanks for your help Cathy Hemsley

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350

-- Lance Norskog goks...@gmail.com
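On question 1, recursive indexing is usually done by a small client that walks the folder tree and posts each file to the ExtractingRequestHandler (Solr Cell). A minimal SolrJ sketch, assuming the Solr 1.4-era client API; the URL and the use of the file path as a unique id are assumptions for illustration:

    // Sketch only: walks a folder tree and posts every file to /update/extract.
    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class FolderIndexer {
      public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        walk(new File(args[0]), solr);
        solr.commit();
      }

      static void walk(File f, SolrServer solr) throws Exception {
        if (f.isDirectory()) {
          File[] children = f.listFiles();
          if (children == null) return;          // unreadable directory
          for (File child : children) {
            walk(child, solr);
          }
        } else {
          ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
          req.addFile(f);                                   // Tika sniffs the content type
          req.setParam("literal.id", f.getAbsolutePath());  // hypothetical id scheme
          solr.request(req);
        }
      }
    }

For question 3, the same walker could run on a schedule, skipping files whose last-modified timestamp predates the previous run.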
just got 'the book' already have a question
First of all, seems like a good book, Solr-14-Enterprise-Search-Server.pdf

Question: is it possible to choose locale at search time? So if my customer is querying across cultural/national/linguistic boundaries and I have the data for him in different languages in the same index, can I sort based on his language?

Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Carrot2 clustering Component
Hi, please tell me how I can get the libraries and plugins for the Carrot2 clustering component in Solr 1.4. Tell me the site from where I can get them. Thanks! Isha
Carrot2 clustering component
Hi, I am not able to understand the Carrot2 clustering component from http://wiki.apache.org/solr/ClusteringComponent. Please provide me more detailed information if someone has already worked on this: how to run it and use it during a search query. Thanks! Isha
Re: Carrot2 clustering component
Isha, You'll get more and better help if you provide more details about what you have done, what you have tried, what isn't working, what errors or behaviour you are seeing, etc. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Isha Garg isha.g...@orkash.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 12:38:03 AM Subject: Carrot2 clustering component Hi, I am not able to understand the caarot2 clustering component from http://wiki.apache.org/solr/ClusteringComponent please provide me more detailed information if someone had already worked on this. How to run this and use this during search query. Thanks! Isha
explicit field type descriptions
Is there any tabular data anywhere on ALL field types and ALL options? For example, I've looked everywhere in the last hour, and I don't see anywhere on the Solr site, Google, or in the 1.4 manual where it says whether a copyField 'directive' can be made required=true. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Getting started with writing parser
How do I write a parser program that will convert log files into XML? -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2278092.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Not storing, but highlighting from document sentences
On 01/12/2011 12:02 PM, Otis Gospodnetic wrote:

Hello, I'm indexing some content (articles) whose text I cannot store in its original form for copyright reasons. So I can index the content, but cannot store it. However, I need snippets and search term highlighting. Any way to accomplish this elegantly? Or even not so elegantly? Here is one idea:

* Create 2 indices: a main index for indexing (but not storing) the original content, and a secondary index for storing individual sentences from the original article.

How about storing the sentences in the same index, in a separate field but with random ordering -- would that be ok?

Tarjei

* That is, before indexing an article, split it into sentences. Then index the article in the main index, and index+store each sentence in the secondary index. So for each doc in the main index there will be multiple docs in the secondary index with individual sentences. Each sentence doc includes an ID of the parent document.

* Then run queries against the main index, and pull individual sentences from the secondary index for snippet+highlight purposes.

The problem I see with this approach (and there may be other ones that I am not seeing yet) is with queries like "foo AND bar". In this case "foo" may be a match from sentence #1, and "bar" may be a match from sentence #7. Or maybe "foo" is a match in sentence #1, and "bar" is a match in multiple sentences: #7 and #10 and #23. Regardless, when a query is run against the main index, you don't know where the match was, so you don't know which sentences to go get from the secondary index. Does anyone have any suggestions for how to handle this?

Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

-- Regards / Med vennlig hilsen Tarjei Huse Mobil: 920 63 413
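Either variant (a secondary sentence index, or Tarjei's shuffled sentence field) needs the article split into sentences at index time. A minimal sketch using the JDK's BreakIterator; the locale choice and the trimming are assumptions, and real-world text may need a smarter splitter:

    // Splits text into sentences with the JDK's built-in sentence iterator.
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class SentenceSplitter {
      public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
          String sentence = text.substring(start, end).trim();
          if (sentence.length() > 0) {
            sentences.add(sentence);
          }
        }
        return sentences;
      }
    }

Each returned sentence would then be indexed either as its own document in the secondary index (carrying the parent document's ID) or appended, shuffled, to the separate stored field.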
Re: Carrot2 clustering component
On Tuesday 18 January 2011 11:12 AM, Otis Gospodnetic wrote: Isha, You'll get more and better help if you provide more details about what you have done, what you have tried, what isn't working, what errors or behaviour you are seeing, etc. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Isha Garg isha.g...@orkash.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 12:38:03 AM Subject: Carrot2 clustering component Hi, I am not able to understand the Carrot2 clustering component from http://wiki.apache.org/solr/ClusteringComponent. Please provide me more detailed information if someone has already worked on this: how to run it and use it during a search query. Thanks! Isha

I had downloaded some jar files compatible with Solr 1.4, including:

    carrot2-core-3.4.2.jar
    guava-r05.jar
    hppc-0.3.1.jar
    jackson-core-asl-1.5.2.jar
    jackson-mapper-asl-1.5.2.jar
    log4j-1.2.14.jar
    mahout-collections-0.3.jar
    mahout-math-0.3.jar
    simple-xml-2.3.5.jar

and placed them at contrib/clustering/lib. Then I changed solrconfig.xml as follows:

    <requestHandler name="standard" default="true">
      <!-- default values for query parameters -->
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <!--
        <int name="rows">10</int>
        <str name="fl">*</str>
        <str name="version">2.1</str>
        -->
        <!-- <bool name="clustering">true</bool> -->
        <str name="clustering.engine">default</str>
        <bool name="clustering.results">true</bool>
        <!-- The title field -->
        <str name="carrot.title">headin</str>
        <str name="carrot.url">id</str>
        <!-- The field to cluster on -->
        <str name="carrot.snippet">text</str>
        <!-- produce summaries -->
        <bool name="carrot.produceSummary">true</bool>
        <!-- the maximum number of labels per cluster -->
        <!-- <int name="carrot.numDescriptions">5</int> -->
        <!-- produce sub clusters -->
        <bool name="carrot.outputSubClusters">false</bool>
      </lst>
      <arr name="last-components">
        <str>clustering</str>
      </arr>
    </requestHandler>

    <searchComponent name="clustering">
      <!-- Declare an engine -->
      <lst name="engine">
        <!-- The name, only one can be named "default" -->
        <str name="name">default</str>
        <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
        <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
      </lst>
      <lst name="engine">
        <str name="name">stc</str>
        <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
      </lst>
    </searchComponent>

And then I ran Solr using the command:

    java -Dsolr.clustering.enabled=true -jar start.jar

Now can you tell me where I am wrong? What else should I do?
Re: Carrot2 clustering component
Isha,

Next, you need to run the actual search so Carrot2 has some search results to cluster.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Isha Garg isha.g...@orkash.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 1:54:39 AM Subject: Re: Carrot2 clustering component Now can you tell me where I am wrong? What else should I do?
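With the configuration quoted above, such a search would look something like the following; the port assumes the stock Jetty example, and the index must already contain documents with the configured title/snippet fields:

    http://localhost:8983/solr/select?q=solr&rows=20&clustering=true

The clustering component then appends a clusters section to the response, after the normal result list.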
Re: NRT
Hi, How is NRT doing, being used in production? Which Solr is it in? Unless I missed it, I don't think there is true NRT in Solr just yet. And is there built in Spatial in that version? How is Solr 4.x doing? Well :) 3 ways to know this sort of stuff: * follow the dev list - high volume * subscribe to Sematext Blog - we publish monthly Solr Digests * check JIRA to see how many issues remain to be fixed Otis -- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: just got 'the book' already have a question
Hi,

Don't think so. If you search across multiple languages and sort, I think the sort is based on UTF8 order.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 11:10:21 PM Subject: just got 'the book' already have a question First of all, seems like a good book, Solr-14-Enterprise-Search-Server.pdf Question: is it possible to choose locale at search time? So if my customer is querying across cultural/national/linguistic boundaries and I have the data for him in different languages in the same index, can I sort based on his language? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: just got 'the book' already have a question
I could be wrong, have a look at http://search-lucene.com/?q=locale+sort&fc_project=Solr plus: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CollationKeyFilterFactory

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 2:17:02 AM Subject: Re: just got 'the book' already have a question Hi, Don't think so. If you search across multiple languages and sort, I think the sort is based on UTF8 order. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 11:10:21 PM Subject: just got 'the book' already have a question First of all, seems like a good book, Solr-14-Enterprise-Search-Server.pdf Question: is it possible to choose locale at search time? So if my customer is querying across cultural/national/linguistic boundaries and I have the data for him in different languages in the same index, can I sort based on his language? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
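Following the second link, a locale-aware sort is typically done with one collated field per language, roughly like this. The field and type names are placeholders, and availability of CollationKeyFilterFactory depends on the Solr version, so check the wiki page:

    <fieldType name="text_sort_fr" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- emits locale-sensitive collation keys used only for sorting -->
        <filter class="solr.CollationKeyFilterFactory" language="fr" strength="primary"/>
      </analyzer>
    </fieldType>
    <field name="title_sort_fr" type="text_sort_fr" indexed="true" stored="false"/>

At query time you would then pick the sort field matching the user's language, e.g. sort=title_sort_fr+asc.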
Re: Not storing, but highlighting from document sentences
Hi Tarjei, :) Yeah, that is the solution we are going with, actually. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Tarjei Huse tar...@scanmine.com To: solr-user@lucene.apache.org Sent: Tue, January 18, 2011 1:33:44 AM Subject: Re: Not storing, but highlighting from document sentences On 01/12/2011 12:02 PM, Otis Gospodnetic wrote: Hello, I'm indexing some content (articles) whose text I cannot store in its original form for copyright reason. So I can index the content, but cannot store it. However, I need snippets and search term highlighting. Any way to accomplish this elegantly? Or even not so elegantly? Here is one idea: * Create 2 indices: main index for indexing (but not storing) the original content, the secondary index for storing individual sentences from the original article. How about storing the sentences in the same index in a separate field but with random ordering, would that be ok? Tarjei * That is, before indexing an article, split it into sentences. Then index the article in the main index, and index+store each sentence in the secondary index. So for each doc in the main index there will be multiple docs in the secondary index with individual sentences. Each sentence doc includes an ID of the parent document. * Then run queries against the main index, and pull individual sentences from the secondary index for snippet+highlight purposes. The problem I see with this approach (and there may be other ones that I am not seeing yet) is with queries like foo AND bar. In this case foo may be a match from sentence #1, and bar may be a match from sentence #7. Or maybe foo is a match in sentence #1, and bar is a match in multiple sentences: #7 and #10 and #23. Regardless, when a query is run against the main index, you don't know where the match was, so you don't know which sentences to go get from the secondary index. Does anyone have any suggestions for how to handle this? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -- Regards / Med vennlig hilsen Tarjei Huse Mobil: 920 63 413
Re: what is the diff between katta and solrcloud?
Sean, First 2 things that come to mind: * Katta keeps shards on HDFS and they then get deployed to regular servers/FS * SolrCloud doesn't involve HDFS at all. * Katta is a Lucene-level system * SolrCloud is a Solr-level system Both make heavy use of ZooKeeper. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Sean Bigdatafun sean.bigdata...@gmail.com To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 4:06:59 PM Subject: what is the diff between katta and solrcloud? Are their goal fudanmentally different at all or just different approaches to solve the same problem (sharding)? Can someone give a technical review? Thanks, --Sean
Re: explicit field type descriptions
On Tue, Jan 18, 2011 at 11:55 AM, Dennis Gearon gear...@sbcglobal.net wrote: Is there any tabular data anywhere on ALL field types and ALL options? There is this: http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.2 Not sure if it meets your needs. For example, I've looked everywhere in the last hour, and I don't see anywhere on the Solr site, Google, or in the 1.4 manual where it says whether a copyField 'directive' can be made required=true. [...] Sorry, I am having trouble understanding your goal here. Surely it suffices to have required="true" on the source field of the copyField. Regards, Gora
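In schema.xml terms, that would be along the following lines (field names invented for illustration); copyField itself only routes content from one field to another, so the presence check sits on the source field:

    <field name="title" type="text" indexed="true" stored="true" required="true"/>
    <field name="title_s" type="string" indexed="true" stored="false"/>
    <!-- every document must supply title, so title_s is always populated -->
    <copyField source="title" dest="title_s"/>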
Re: Getting started with writing parser
On Tue, Jan 18, 2011 at 11:59 AM, Dinesh mdineshkuma...@karunya.edu.in wrote: How do I write a parser program that will convert log files into XML? [...] There is no point in starting multiple threads on this issue, hoping that someone will somehow solve your problem. You have been given the following: * Links that should help you get started, including an example of someone indexing Solr's own logs. * Some ideas on how to proceed. * Requests to try the above suggestions out, and to ask specific questions when you run into issues. * A suggestion to contact a local expert in Solr. * Multiple requests for a sample of your log files. Please show some signs that you have tried the above suggestions. Otherwise, I am afraid that it will be difficult, if not impossible, for people on this list to help you out. Regards, Gora
Re: what is the diff between katta and solrcloud?
Otis, Any pointer to an architecture view of either system? Thanks, Sean On Mon, Jan 17, 2011 at 11:27 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Sean, First 2 things that come to mind: * Katta keeps shards on HDFS and they then get deployed to regular servers/FS * SolrCloud doesn't involve HDFS at all. * Katta is a Lucene-level system * SolrCloud is a Solr-level system Both make heavy use of ZooKeeper. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Sean Bigdatafun sean.bigdata...@gmail.com To: solr-user@lucene.apache.org Sent: Mon, January 17, 2011 4:06:59 PM Subject: what is the diff between katta and solrcloud? Are their goal fudanmentally different at all or just different approaches to solve the same problem (sharding)? Can someone give a technical review? Thanks, --Sean -- --Sean