Re: Stopword filter - refreshing stop word list periodically

2011-11-03 Thread Jithin
Thanks Sami. I ended up setting up a proper core as per documentation,
named core0.

On Thu, Nov 3, 2011 at 11:07 PM, Sami Siren-2 [via Lucene] <
ml-node+s472066n3477844...@n3.nabble.com> wrote:

> On Fri, Oct 14, 2011 at 10:06 PM, Jithin <[hidden email]> wrote:
> > What will be the name of this hard-coded core? I was rearranging my
> > directory structure, adding a separate directory for code. And it does
> > work with a single core.
>
> In trunk the "single core setup" core is called "collection1". So to
> reload that you'd call url:
> http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1
>
> --
>  Sami Siren
>
>



-- 
Thanks
Jithin Emmanuel



how to achieve google.com like results for phrase queries

2011-11-03 Thread alxsss
Hello,

I use nutch-1.3 crawled results in solr-3.4. I noticed that for two-word
phrases like "newspaper latimes", latimes.com is not in the results at all.
This may be due to the dismax defType that I use in the request handler:

   dismax
   url^1.5 id^1.5 content^ title^1.2
   url^1.5 id^1.5 content^0.5 title^1.2


with mm as

   2<-1 5<-2 6<90%

However, changing it to

   1<-1 2<-1 5<-2 6<90%

and q.op to OR or AND does not solve the problem. In this case latimes.com
is ranked higher, but still not first. Also, in this case results containing
both words are ranked very low, almost at the end.

We need latimes.com to be placed first, followed by the results containing
both words, and so on.

Any ideas on how to modify the config to this end?

Thanks in advance.
Alex.



Re: Using Solr components for dictionary matching?

2011-11-03 Thread Vijay Ramachandran
On Thu, Nov 3, 2011 at 4:06 PM, Nagendra Mishr  wrote:

> The scenarios that could use dictionary matching:
>
> 1. Document being processed to see if it contains one of 10,000 terms.
>
> 2. Query completion as you type
>
> 3. Basically the inverse of finding a document.. Instead the document
> is the query term and the dictionary of terms is being matched in
> parallel
>
>
Try the Aho-Corasick algorithm -
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

"It is a kind of dictionary-matching algorithm that locates elements of a
finite set of strings (the "dictionary") within an input text. It matches
all patterns simultaneously. The
complexityof
the algorithm is linear in the length of the patterns plus the length
of
the searched text plus the number of output matches."
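
For a concrete picture, here is a minimal Java sketch of the idea (a trie
plus failure links, matching the whole dictionary in one pass over the
text); toy dictionary, illustrative only:

    import java.util.*;

    public class AhoCorasick {
        static final class Node {
            Map<Character, Node> next = new HashMap<>();
            Node fail;
            List<String> out = new ArrayList<>();
        }

        private final Node root = new Node();

        public AhoCorasick(Collection<String> dictionary) {
            for (String word : dictionary) {          // 1) build the trie
                Node n = root;
                for (char c : word.toCharArray())
                    n = n.next.computeIfAbsent(c, k -> new Node());
                n.out.add(word);
            }
            Deque<Node> queue = new ArrayDeque<>();   // 2) failure links via BFS
            for (Node child : root.next.values()) {
                child.fail = root;
                queue.add(child);
            }
            while (!queue.isEmpty()) {
                Node cur = queue.remove();
                for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                    Node child = e.getValue();
                    Node f = cur.fail;
                    while (f != null && !f.next.containsKey(e.getKey())) f = f.fail;
                    child.fail = (f == null) ? root : f.next.get(e.getKey());
                    child.out.addAll(child.fail.out); // inherit shorter matches
                    queue.add(child);
                }
            }
        }

        /** One pass over the text; returns "term@endIndex" for each match. */
        public List<String> match(String text) {
            List<String> hits = new ArrayList<>();
            Node n = root;
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                while (n != root && !n.next.containsKey(c)) n = n.fail;
                n = n.next.getOrDefault(c, root);
                for (String term : n.out) hits.add(term + "@" + i);
            }
            return hits;
        }

        public static void main(String[] args) {
            AhoCorasick ac = new AhoCorasick(Arrays.asList("he", "she", "his", "hers"));
            System.out.println(ac.match("ushers")); // [she@3, he@3, hers@5]
        }
    }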

HTH,
Vijay


RE: Using Solr components for dictionary matching?

2011-11-03 Thread Nagendra Mishr
The scenarios that could use dictionary matching:

1. Document being processed to see if it contains one of 10,000 terms.

2. Query completion as you type

3. Basically the inverse of finding a document.. Instead the document
is the query term and the dictionary of terms is being matched in
parallel

Nagendra

Sent from my Windows Phone
From: Erick Erickson
Sent: 11/3/2011 8:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Using Solr components for dictionary matching?
I really don't understand what you're asking. Could you give some
examples of what you're trying to do?

Best
Erick

On Tue, Nov 1, 2011 at 10:38 AM, Nagendra Mishr  wrote:
> Hi all,
>
> Is there a good guide on using Solr components as a dictionary
> matcher?  I'm need to do some pre-processing that involves lots of
> dictionary lookups and it doesn't seem right to query solr for each
> instance.
>
> Thanks in advance,
>
> Nagendra
>


Highlighter showing matched query words only

2011-11-03 Thread Nikeman
Hello Folks,

I am a newbie of Solr. I wonder if Solr Highlighter can show the matched
query words only. Suppose my query is "godfather AND pacino." I just want to
display "godfather" and "pacino" in any of the highlighted fields. For the
sake of performance, I do not want to use regular expressions to parse the
text and locate the query words which are already enclosed between  and
. Solr obviously has already done the searching and highlighting, but
the Solr output mixes what I want with what I do not want. 

I just want to get out the intermediate results, the matching query words,
and nothing else. 

Is there a way to get the intermediate results, the matching query words,
before they are mixed with other text? Thank you all very much for your help
in advance! 

N. J. 



Re: UnInvertedField vs FieldCache for facets for single-token text fields

2011-11-03 Thread Martijn v Groningen
Hi Micheal,

The FieldCache is a simpler data structure and easier to create, so I
also expect it to be faster. Unfortunately, for TextField an
UnInvertedField is always used, even if you have one token per document.
I think overriding the multiValuedFieldCache() method to return false
would work.
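
A bare sketch of that subclass (package and class names are placeholders;
untested):

    package com.example.solr;

    import org.apache.solr.schema.TextField;

    /** TextField variant for fields known to hold one token per document. */
    public class SingleTokenTextField extends TextField {
        @Override
        public boolean multiValuedFieldCache() {
            // lets SimpleFacets use the FieldCache instead of UnInvertedField
            return false;
        }
    }

You would then reference it in schema.xml via
class="com.example.solr.SingleTokenTextField" on the fieldType.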

If you're using 4.0-dev (trunk) I'd use facet.method=fcs (this
parameter is only usable if the multiValuedFieldCache() method returns
false). This is per-segment faceting, and the cache will only be
extended for new segments. This faceting approach is better for indexes
with frequent changes.
I think this is even faster in your case than just using the FieldCache
method (which operates on a top-level reader; after each commit the
complete cache is invalid and has to be recreated).

Otherwise I'd try facet.method=enum which is fast if you have fewer
distinct facet values (num of docs doesn't influence the performance
that much).
The facet.method=enum option is also valid for normal TextFields, so
no need to have custom code.
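
For example (field name hypothetical):

    facet=true&facet.field=yourfield&facet.method=enum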

Martijn

On 3 November 2011 21:16, Michael Ryan  wrote:
> I have some fields I facet on that are TextFields but have just a single 
> token.
> The fieldType looks like this:
>
> <fieldType name="..." class="solr.TextField"
>    stored="false" omitNorms="true" sortMissingLast="true"
>    positionIncrementGap="100">
>   <analyzer>
>     ...
>   </analyzer>
> </fieldType>
>
> SimpleFacets uses an UnInvertedField for these fields because
> multiValuedFieldCache() returns true for TextField. I tried changing the type 
> for
> these fields to the plain "string" type (StrField). The facets *seem* to be
> generated much faster. Is it expected that FieldCache would be faster than
> UnInvertedField for single-token strings like this?
>
> My goal is to make the facet re-generation after a commit as fast as 
> possible. I
> would like to continue using TextField for these fields since I have a need 
> for
> filters like LowerCaseFilterFactory, which still produces a single token. Is 
> it
> safe to extend TextField and have multiValuedFieldCache() return false for 
> these
> fields, so that UnInvertedField is not used? Or is there a better way to
> accomplish what I'm trying to do?
>
> -Michael
>



-- 
Met vriendelijke groet,

Martijn van Groningen


Re: Access Document Score in Custom Function Query (ValueSource)

2011-11-03 Thread sangrish

I understand that. Thanks. 

I just posted a related question , titled : "Access Score in Custom Function
Query " 

where (among other things) I am asking about the performance aspects of this
method. As you said, I need to execute "some" query first to create a
constrained recall set & then apply my custom function query (which in turn
executes another query) to it.

In my case I am using the same query again. First to create the recall set
(and also score the docs which I don't use though) and then execute that
query in my custom function to get the score. I am worried it may slow
things down.

Comments?

Thanks
Sid



Access Score in Custom Function Query

2011-11-03 Thread sangrish


Hi,


  I have a custom function query (value source) where I want to use the
score for some computation. For example, for every document I want to add
some number (obtained from an external file) to its score. I am achieving
this like the following:

http://localhost:PORT/myCore/select?q=queryString&qt=my_request_handler&fl=field1,field2,score&debugQuery=on&sort=myfunc(query($qq)) desc

Where the definitions of "my_request_handler" & "qq" are as follows:

   <requestHandler name="my_request_handler" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="qq">{!dismax v=$q}</str>
       <str name="qf">field1^2 field^3</str>
     </lst>
   </requestHandler>

Questions:

1. To obtain the score in my function query I am executing the dismax query
again (myfunc(query($qq))). Could it slow things down? Is there any way I
can access the score without querying again?
2. I also want to normalize the (query) score I get to a range between 0 -
1. Is there any way to access the MAX_SCORE in the same function query/Value
source (so that I can divide every score by that)?


Thanks a lot guys

Sid






Re: facet with group by (or field collapsing)

2011-11-03 Thread Martijn v Groningen
collapse.facet=after doesn't exist in Solr 3.3. That parameter exists
in the SOLR-236 patches and is implemented differently in the released
versions of Solr.
From Solr 3.4 you can use group.truncate. The facet counts are then
computed based on the most relevant documents per group.
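
With your example query that would mean appending group.truncate=true, e.g.:

    .../select/?q=cesy&group=true&group.field=SIP&group.limit=1&group.truncate=true&facet=true&facet.field=REPOSITORYNAME

(keeping in mind group.truncate requires Solr 3.4 or later).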

Martijn

On 3 November 2011 22:47, erj35  wrote:
> I'm attempting the following query:
>
> http://{host}/apache-solr-3.3.0/select/?q=cesy&version=2.2&start=0&rows=10&indent=on&group=true&group.field=SIP&group.limit=1&facet=true&facet.field=REPOSITORYNAME
>
> The result is 4 matches all in 1 group (with group.limit=1).  Rather than
> show facet.field=REPOSITORYNAME's 4 facets, I want to see the
> REPOSITORYNAMES facet with a count of 1 (for the 1 group returned) with the
> value of the REPOSITORYNAMES field in the 1 doc returned in the group. Is
> this possible? I tried adding the parameter collapse.facet=after, but that
> seemed to have no effect.
>
>



-- 
Met vriendelijke groet,

Martijn van Groningen


Re: facet with group by (or field collapsing)

2011-11-03 Thread erj35
I'm attempting the following query:

http://{host}/apache-solr-3.3.0/select/?q=cesy&version=2.2&start=0&rows=10&indent=on&group=true&group.field=SIP&group.limit=1&facet=true&facet.field=REPOSITORYNAME

The result is 4 matches all in 1 group (with group.limit=1).  Rather than
show facet.field=REPOSITORYNAME's 4 facets, I want to see the
REPOSITORYNAMES facet with a count of 1 (for the 1 group returned) with the
value of the REPOSITORYNAMES field in the 1 doc returned in the group. Is
this possible? I tried adding the parameter collapse.facet=after, but that
seemed to have no effect.



Re: Default value for dynamic fields

2011-11-03 Thread Yonik Seeley
On Thu, Nov 3, 2011 at 12:59 PM, Milan Dobrota  wrote:
> Is there any way to define the default value for the dynamic fields in
> SOLR? I use some dynamic fields of type float with _val_ and if they
> haven't been created at index time, the value defaults to 0. I would want
> this to be 1. Can that be changed?


On trunk, there are some new (currently undocumented) function queries
that can do this:
def(myfield,1)

If there are not normally 0 values anyway, you can also map any 0
values encountered via map(),
or min() if existing values are all positive.
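
For example (hypothetical field name), map(myfield,0,0,1) maps any 0 value
to 1 and leaves other values untouched.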

-Yonik
http://www.lucidimagination.com


Re: BaseTokenFilterFactory not found in plugin

2011-11-03 Thread Chris Hostetter

: myorg/solr/analysis/*.java`. I then made a `.jar` file from the .class files
: and put the .jar file in the solr/lib/ directory. I modified schema.xml to
: include the new filter:

what exactly do you mean by "the solr/lib/ directory" ? ... if you mean 
that "solr" is the solr home dir where you are running solr, so you have a 
structure like this...

  solr/conf/solrconfig.xml
  solr/conf/schema.xml
  solr/lib/your-jar-name.jar

...then that should be correct.  If however you put it in some other lib 
directory (like, perhaps jetty's lib directory) then it might get loaded 
by a lower level class loader so it has no runtime visibility of the 
classes loaded by Solr.

when Solr starts up, the SolrResourceLoader explicitly logs every jar file 
it finds in its "lib" dir, or any jars explicitly specified, or loaded 
because of @sharedLib or <lib> configurations, so check your logs to make 
sure your jar is listed there -- if it's not, but it's still getting 
loaded, then it's getting loaded by a different classloader.




-Hoss


RE: Questions about Solr's security

2011-11-03 Thread Robert Petersen
Me too!

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, November 01, 2011 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about Solr's security

I once had to deal with a severe performance problem caused by a bot
that was requesting results starting at 5000. We disallowed requests
over a certain number of pages in the front end to fix it.

wunder

On Nov 1, 2011, at 12:57 PM, Erik Hatcher wrote:

> Be aware that even /select could have some harmful effects, see
https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk).
> 
> Even disregarding that issue, /select is a potential gateway to any
request handler defined via /select?qt=/req_handler
> 
> Again, in general it's not a good idea to expose Solr to anything but
a controlled app server.  
> 
>   Erik
> 
> On Nov 1, 2011, at 15:51 , Alireza Salimi wrote:
> 
>> What if we just expose '/select' paths - by firewalls and load
balancers -
>> and
>> also use SSL and HTTP basic or digest access control?
>> 
>>> On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter wrote:
>> 
>>> 
>>> : I was wondering if it's a good idea to expose Solr to the outside
world,
>>> : so that our clients running on smart phones will be able to use
Solr.
>>> 
>>> As a general rule of thumb, i would say that it is not a good idea
to
>>> expose solr directly to the public internet.
>>> 
>>> there are exceptions to this rule -- AOL hosted some live solr
instances
>>> of the Sarah Palin emails for HufPo -- but it is definitely an
expert
>>> level type thing for people who are so familiar with solr they know
>>> exactly what to lock down to make it "safe"
>>> 
>>> for typical users: put an application between your untrusted users
and
>>> solr and only let that application generate "safe" welformed
requests to
>>> Solr...
>>> 
>>> https://wiki.apache.org/solr/SolrSecurity
>>> 
>>> 
>>> -Hoss
>>> 
>> 
>> 
>> 
>> -- 
>> Alireza Salimi
>> Java EE Developer
> 

--
Walter Underwood
Venture Asst. Scoutmaster
Troop 14, Palo Alto, CA





Re: Dismax and phrases

2011-11-03 Thread Chris Hostetter

: ...is this perhaps a side effect of the new autoGeneratePhraseQueries 
: option? ... you are explicitly specifying a quoted phrase, but 
: maybe somewhere in the code path of the dismax parser that information is 
: getting lost?

FWIW:

a) I just realized you said in your first message you were using Solr 
1.4.1, which *definitely* predates the autoGeneratePhraseQueries option - 
so i'm really at a loss to understand how you are getting that query 
structure (definitely want to see your configs)

b) I did some quick testing with Solr 3.4 using the example configs, and 
verified that regardless of how autoGeneratePhraseQueries is set on the 
fieldType for the "name" field, this request...

http://localhost:8983/solr/select/?fl=name&debugQuery=true&q=%22samsung%20hard%20drive%22&defType=dismax&qf=name&qs=100

..always produces a dismax query wrapped arround a phrase query.


-Hoss


UnInvertedField vs FieldCache for facets for single-token text fields

2011-11-03 Thread Michael Ryan
I have some fields I facet on that are TextFields but have just a single token.
The fieldType looks like this:

<fieldType name="..." class="solr.TextField"
   stored="false" omitNorms="true" sortMissingLast="true"
   positionIncrementGap="100">
  <analyzer>
    ...
  </analyzer>
</fieldType>

SimpleFacets uses an UnInvertedField for these fields because
multiValuedFieldCache() returns true for TextField. I tried changing the type 
for
these fields to the plain "string" type (StrField). The facets *seem* to be
generated much faster. Is it expected that FieldCache would be faster than
UnInvertedField for single-token strings like this?

My goal is to make the facet re-generation after a commit as fast as possible. I
would like to continue using TextField for these fields since I have a need for
filters like LowerCaseFilterFactory, which still produces a single token. Is it
safe to extend TextField and have multiValuedFieldCache() return false for these
fields, so that UnInvertedField is not used? Or is there a better way to
accomplish what I'm trying to do?

-Michael


Re: Dismax and phrases

2011-11-03 Thread Chris Hostetter

Interesting, in the case where you use quotes...

: +
...
: "asuntojen hinnat"
: "asuntojen hinnat"

...there is one DisjunctionMaxQuery (expected) for the entire phrase, 
but in the sub-clauses for each individual field the clauses coming from 
your "_fi" fields are just building boolean "OR" queries of the terms from 
your phrase (instead of building an actual phrase query...

: +DisjunctionMaxQuery((table.title_t:"asuntojen
: hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" |
: (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto
: table.description_fi:hinta) | table.description_t:"asuntojen hinnat" |
: graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto
: graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto
: table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" |
: text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) |
: (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto
: title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0
: 
FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)

...is this perhaps a side effect of the new autoGeneratePhraseQueries 
option? ... you are explicitly specifying a quoted phrase, but 
maybe somewhere in the code path of the dismax parser that information is 
getting lost?

can you post the details of your schema.xml?  (ie: the "version" property 
on the schema file, and the dynamicField/field + fieldType definitions for 
all these fields)

In contrast, your unquoted example is working exactly as i'd expect.  a 
DisjunctionMaxQuery is built for each clause of the input, and the two 
DisjunctionMaxQuery objects are then combined in a BooleanQuery where the 
minNrShouldMatch property is set to "2"

: +
...
: asuntojen hinnat
: asuntojen hinnat
: 
: +((DisjunctionMaxQuery((table.title_t:asuntojen^2.0 |
: title_t:asuntojen^2.0 | ingress_t:asuntojen | text_fi:asunto |
: table.description_fi:asunto | table.description_t:asuntojen |
: graphic.title_t:asuntojen^2.0 | graphic.title_fi:asunto^2.0 |
: table.title_fi:asunto^2.0 | table.contents_t:asuntojen | text_t:asuntojen |
: ingress_fi:asunto | table.contents_fi:asunto | title_fi:asunto^2.0)~0.01)
: DisjunctionMaxQuery((table.title_t:hinnat^2.0 | title_t:hinnat^2.0 |
: ingress_t:hinnat | text_fi:hinta | table.description_fi:hinta |
: table.description_t:hinnat | graphic.title_t:hinnat^2.0 |
: graphic.title_fi:hinta^2.0 | table.title_fi:hinta^2.0 |
: table.contents_t:hinnat | text_t:hinnat | ingress_fi:hinta |
: table.contents_fi:hinta | title_fi:hinta^2.0)~0.01))~2) () type:tie^6.0
: type:kuv^2.0 type:tau^2.0
: 
FunctionQuery((1.0/(3.16E-11*float(ms(const(1319438484878),date(date.modified_dt)))+1.0))^100.0)


-Hoss


Re: DIH doesn't handle bound namespaces?

2011-11-03 Thread Chris Hostetter

: It does not support namespaces, but it can handle xmls with namespaces.

The real crux of the issue is that XPathEntityProcessor is terribly named.  
it should have been called "LimitedXPathishSyntaxEntityProcessor" or 
something like that because it doesn't support full xpath syntax...

"The XPathEntityProcessor implements a streaming parser which supports a 
subset of xpath syntax. Complete xpath syntax is not supported but most of 
the common use cases are covered..."

...i thought there was a DIH FAQ about this, but if not there really 
should be.


-Hoss


admin index version not updating

2011-11-03 Thread Nathan Moon
I have a setup with a master and single slave, using the collection 
distribution scripts.  I'm not sure if it's relevant, but I'm running multicore 
also.  I am on version 3.4.0 (we are upgrading from 1.3).

My understanding is that the indexVersion (a number) reported by the stats page 
(admin/stats.jsp) is a timestamp that should correspond to the time of the 
latest snapshot.  At least that's how it has behaved on version 1.3.

When I install a new snapshot on the slave (snapinstaller), it does not report 
any errors, and the logs/snapshot.current is updated with the latest snapshot, 
but the admin/stats page still reports the old version.  Actually, the version 
number increases by 4 each time I install a new index, but doesn't update to 
anywhere near the time of the latest snapshot (it's a few days off at this 
point).

I have verified that the slave is actually running on the latest index by 
searching for something that only exists in the latest index.

Am I misunderstanding how to interpret the indexVersion, or is the latest 
snapshot not getting fully installed?

Thanks

Nathan

Re: Access Document Score in Custom Function Query (ValueSource)

2011-11-03 Thread Chris Hostetter

: In this value source I compute another score for every document
: using some features. I want to  access the score of the query myField^2 
: (for a given document) in this same value source.
: 
: Ideas?

your ValueSource can wrap the score from the other query using a 
QueryValueSource.

just keep in mind that by definition function queries "match" every 
document in the index, so you'll still need to use the other query in some 
way (or use something like the "frange" parser to constrain the set of 
docs returned based on a range of values produced by your function)
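
for example, something along these lines (reusing the "qq" param and
function name from your other thread, so treat it as a sketch):

    fq={!frange l=5.0}myfunc(query($qq))

...which would keep only documents whose function value is at least 5.0.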


-Hoss


Re: score based on unique words matching

2011-11-03 Thread Chris Hostetter

: > q=david bowie changes
: > 
: > Problem : If a record mentions david bowie a lot, it beats out something
: > more relevant (more unique matches) ...
: > 
: > A. (now appearing david bowie at the cineplex 7pm david bowie goes on stage,
: > then mr. bowie will sign autographs)
: > B. song :david bowie - changes
: > 
: > (A) ends up more relevant because of the frequency or number of words in
: > it.. not cool...
: > I want it so the number of words matching will trump density/weight

debugQuery=true is your friend ... it will show you exactly how the scores 
are being computed.

the key factors in something like this are fieldNorm, tf, and the coord 
factor.

The fieldNorm includes as a factor the length of the field, so as long as 
you have omitNorm=false configured for this field, doc#A should be 
penalized relative to doc#B for being longer -- but if you omit norms then 
that won't help you -- so start by checking that.

The coord factor will penalize documents that don't match all of the 
clauses of a boolean query (ie: doc #A only matches 2/3 clauses because it 
doesn't match the word "changes") so you could customize your Similarity 
implementation to make that coord penalty higher, but that requires some 
custom java code.

As an extreme option, you could use omitTf to completely eliminate the 
term frequency from being a factor in scoring (so the number of times 
"bowie" appears won't affect the score, just that it appears at least 
once) but that probably isn't what you want: "david bowie changes 
some stuff" would get the same score as "david bowie changes david bowie"

in general the simplest way to deal with a lot of this type of thing is to 
think about how you are structuring your query.  something as simple as 
using the dismax parser with your field in both the "qf" and "pf" fields 
(and a little bit of slop in the "ps" param) may give you exactly what you 
want (since it will reward docs where the whole query string appears in 
the field...

https://wiki.apache.org/solr/DisMaxQParserPlugin
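
for example, something like this (field name hypothetical):

    q=david+bowie+changes&defType=dismax&qf=text&pf=text&ps=2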


-Hoss


performance - dynamic fields versus static fields

2011-11-03 Thread Memory Makers
Hi,

Is there a handy resource on the:
  a. performance of: dynamic fields versus static fields
  b. other pros-cons?

Thanks.


Re: Can you please guide me through step-by-step installation of Solr Cell ?

2011-11-03 Thread Chris Hostetter

: Caused by: org.apache.solr.common.SolrException: Error loading class 
'solr.extraction.ExtractingRequestHandler'
: 
: With the jetty and the provided example, I have no problem. It all happens 
when I use tomcat and solr.
: 
: My setup is as follows: 
: 
: I downloaded the apache-solr-3.3.0 and unpacked it. I am using the 
: "apache-solr-3.3.0" folder as my solr-home folder. Inside the "dist" 
: folder I have the apache-solr-3.3.0.war and copied everything from 
: contrib/extraction/lib into dist.

just copying jars into "dist" isn't going to make things magically work 
for you -- what matters is that your solr instance knows how to find those 
plugin jars.  when you use the example jetty instance, the solrconfig.xml 
file has  directives with relative paths that indicate where to find 
them.  If you use a differnet "solr home" dir and/or move files arround 
then those  directives are no longer going to work...

https://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
https://wiki.apache.org/solr/SolrConfigXml#lib
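
For example, the stock example solrconfig.xml pulls in the extraction jars
with directives along these lines (paths are relative to the core's
instanceDir, so adjust them for your layout):

    <lib dir="../../contrib/extraction/lib" />
    <lib path="../../dist/apache-solr-cell-3.3.0.jar" />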


-Hoss


Re: Default value for dynamic fields

2011-11-03 Thread Milan Dobrota
It doesn't work for me.

2011/11/3 Yury Kats 

> On 11/3/2011 12:59 PM, Milan Dobrota wrote:
> > Is there any way to define the default value for the dynamic fields in
> > SOLR? I use some dynamic fields of type float with _val_ and if they
> > haven't been created at index time, the value defaults to 0. I would want
> > this to be 1. Can that be changed?
>
> Does specifying default="1" not work?
>
>


-- 
Milan Dobrota
Ruby on Rails developer
milandobrota.com
rubylove.info


Ordered proximity search

2011-11-03 Thread LT.thomas
Hi,

By ordered I mean term1 will always come before term2 in the document.

I have two documents:
1. "By ordered I mean term1 will always come before term2 in the document"
2. "By ordered I mean term2 will always come before term1 in the document"

if I make the query:

"term1 term2"~Integer.MAX_VALUE

my results is: 2 documents

How can I query to have one result (only if term1 come before term2): 
"By ordered I mean term1 will always come before term2 in the document"

Thanks



Three questions about: Commit, single index vs multiple indexes and implementation advice

2011-11-03 Thread Gustavo Falco
Hi guys!

I have a couple of questions that I hope someone could help me with:

1) Recently I've implemented Solr in my app. My use case is not
complicated. Suppose that there will be 50 concurrent users tops. This is
an app like, let's say, a CRM. I tell you this so you have an idea in terms
of how many read and write operations will be needed. What I do need is
for added/updated data to be available right after it's added/updated
(a second later is OK). I know that the commit operation is expensive,
so doing a commit right after each write operation is probably not a
good idea. I'm trying to use the autoCommit feature with a maxTime of
1000ms (see the snippet after this list), but then the question arose:
is this the best way to handle this type of situation? And if not,
what should I do?

2) I'm using a single index per entity type because I've read that if the
app is not handling lots of data (let's say, 1 million of records) then
it's "safe" to use a single index. Is this true? if not, why?

3) Is it a problem if I use a simple setup of Solr using a single core for
this use case? if not, what do you recommend?
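
For reference, the autoCommit setting mentioned in (1) is configured in
solrconfig.xml roughly like this (a sketch, not my exact config):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>1000</maxTime> <!-- commit at most 1s after an update arrives -->
      </autoCommit>
    </updateHandler>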



Any help in any of these topics would be greatly appreciated.

Thanks in advance!


Re: how to apply sort and search both on multivalued field in solr

2011-11-03 Thread Erick Erickson
Right, the behavior when sorting on a multivalued field is not defined, so
results are unreliable.


There's nothing that I know of that'll allow your sort to occur on the matched
terms in a multiValued field. But, again, defining correct behavior here isn't
easy. What if you searched for two terms and both terms matched a value in
a single document's multiValued field? Which term should it sort by?

Sorry, but sorting just doesn't work that way and I don't have any bright
ideas how to get this to work as you'd like

Best
Erick

On Thu, Nov 3, 2011 at 1:06 PM, vrpar...@gmail.com  wrote:
> Thanks Erick,
>
> what i gave ('abc', etc.) are the values of one multivalued field in one
> document, but maybe that was confusing.
>
> lets say, i have one field named Array1 which has multivalued=true
>
> now i want to search on Array1, but i want only the affected values (which i
> can get in "highlighting"); now i also want to sort on field Array1,
>
> now whatever the response is, it should be sorted on only the affected
> values (which contain the search term).
>
> also, without a search, sorting on Array1 sometimes works fine, sometimes not.
>
>
> Thanks
> Vishal Parekh
>
>


Re: Stopword filter - refreshing stop word list periodically

2011-11-03 Thread Sami Siren
On Fri, Oct 14, 2011 at 10:06 PM, Jithin  wrote:
> What will be the name of this hard-coded core? I was rearranging my
> directory structure, adding a separate directory for code. And it does work
> with a single core.

In trunk the "single core setup" core is called "collection1". So to
reload that you'd call url:
http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1

--
 Sami Siren


Re: Default value for dynamic fields

2011-11-03 Thread Yury Kats
On 11/3/2011 12:59 PM, Milan Dobrota wrote:
> Is there any way to define the default value for the dynamic fields in
> SOLR? I use some dynamic fields of type float with _val_ and if they
> haven't been created at index time, the value defaults to 0. I would want
> this to be 1. Can that be changed?

Does specifying default="1" not work?



Re: Selective Result Grouping

2011-11-03 Thread Martijn v Groningen
Ok I think I get this. I think this can be achieved if one could
specify a filter inside a group and only documents that pass the
filter get grouped. For example only group documents with the value
image for the mimetype field. This filter should be specified per
group command. Maybe we should open an issue for this?

Martijn

On 1 November 2011 19:58, entdeveloper  wrote:
>
> Martijn v Groningen-2 wrote:
>>
>> When using the group.field option values must be the same otherwise
>> they don't get grouped together. Maybe fuzzy grouping would be nice.
>> Grouping videos and images based on mimetype should be easy, right?
>> Videos have a mimetype that start with video/ and images have a
>> mimetype that start with image/. Storing the mime type's subtype and
>> type in separate fields and group on the type field would do the job.
>> Off course you need to know the mimetype during indexing, but
>> solutions like Apache Tika can do that for you.
>
> Not necessarily interested in grouping by mimetype (that's an analysis
> issue). I simply used videos and images as an example.
>
> I'm not sure what you mean by fuzzy grouping. But my goal is to have
> collapse be more selective somehow on what gets grouped. As a more specific
> example, I have a field called 'type', with the following possible field
> values:
>
> Type
> --
> image
> video
> webpage
>
>
> Basically I want to be able to collapse all the images into a single result
> so that they don't fill up the first page of the results. This is not
> possible with the current grouping implementation because if you call
> group.field=type, it'll group everything. I do not want to collapse videos
> or webpages, only images.
>
> I've attached a screenshot of google's srp to help explain what I mean.
>
> http://lucene.472066.n3.nabble.com/file/n3471548/Screen_Shot_2011-11-01_at_11.52.04_AM.png
>
> Hopefully that makes more sense. If it's still not clear I can email you
> privately.
>
>



-- 
Met vriendelijke groet,

Martijn van Groningen


Re: exact matches possible?

2011-11-03 Thread Roland Tollenaar

Hi Erik,

you are spot on with your guess. I had reinserted my data but apparently 
that does not reindex. Deleting everything and re-entering was required.


Behaviour now seems to be as desired.

Thank you very much.

PS, thanks for pointing out that the !term is literal. Where can I find 
that kind of information on the internet? I use the lucene syntax page 
as my reference but it appears to be somewhat limited:


http://lucene.apache.org/java/2_9_1/queryparsersyntax.html

Kind regards,

Roland

Erik Hatcher wrote:

Roland -

Is it possible that you indexed with a different field type and then changed to "string" 
without reindexing?   A query on a string will only match literally the exact value (barring any 
wildcard/regex syntax), so something is fishy with your example.  Your query example was odd, not 
sure if you meant it literally, but given the Word field name the query would be q={!term 
f=Word}apple - maybe you thought "term" was meta, but it is meant literally here.

Erik

On Nov 3, 2011, at 04:45 , Roland Tollenaar wrote:


Hi Erik,

thanks for the response. I have ensured the type is string and that the field 
is indexed. No luck though:

(Schema setting under solr/conf):

<field name="Word" type="string" indexed="true" stored="true"/>
Query:

Word:apple

Desired result:

apple

Achieved Results:

apple, the red apple, pine-apple, etc, etc


I have also tried your other suggestion:
q={!" " f=Word}apple
(attmpting to eliminate any results with spaces)

But that just gives errors (from calling from the solr/admin query interface.

Am I doing something obviously wrong?

Thanks again,

Roland


It's certainly quite possible with Lucene/Solr.  But you have to index
the field to accommodate it.  If you literally want an exact match
query, use the "string" field type and then issue a term query.
q=field:value will work in simple cases (where the value has no spaces
or colons, or other query parser syntax), but q={!term f=field}value is
the fail-safe way to do that.
    Erik








Erik Hatcher wrote:

It's certainly quite possible with Lucene/Solr.  But you have to index the field to 
accommodate it.  If you literally want an exact match query, use the "string" 
field type and then issue a term query.  q=field:value will work in simple cases (where 
the value has no spaces or colons, or other query parser syntax), but q={!term 
f=field}value is the fail-safe way to do that.
Erik
On Nov 2, 2011, at 07:08 , Roland Tollenaar wrote:

Hi,

I am trying to do a search that will only match exact words on a field.

I have read somewhere that this is not what SOLR is meant for but I am still 
hoping that its possible.

This is an example of what I have tried (to exclude spaces) but the workaround 
does not seem to work.

Word:apple NOT " "

What I am really looking for is the "=" operator in SQL (eg Word='apple') but I 
cannot find its equivalent for lucene.

Thanks for the help.

Regards,

Roland







Re: how to apply sort and search both on multivalued field in solr

2011-11-03 Thread vrpar...@gmail.com
Thanks Erick,

what i gave ('abc', etc.) are the values of one multivalued field in one
document, but maybe that was confusing.

lets say, i have one field named Array1 which has multivalued=true

now i want to search on Array1, but i want only the affected values (which i
can get in "highlighting"); now i also want to sort on field Array1,

now whatever the response is, it should be sorted on only the affected
values (which contain the search term).

also, without a search, sorting on Array1 sometimes works fine, sometimes not.


Thanks 
Vishal Parekh



Default value for dynamic fields

2011-11-03 Thread Milan Dobrota
Is there any way to define the default value for the dynamic fields in
SOLR? I use some dynamic fields of type float with _val_ and if they
haven't been created at index time, the value defaults to 0. I would want
this to be 1. Can that be changed?


Re: Stream still in memory after tika exception? Possible memoryleak?

2011-11-03 Thread P Williams
Hi All,

I'm experiencing a similar problem to the others in the thread.

I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
apache-solr-4.0-2011-10-14_08-56-59.war and then
apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 pdfs, of various
sizes, using the TikaEntityProcessor.  My indexing would run to completion
and was completely successful under the June build.  The only error was
readability of the fulltext in highlighting.  This was fixed in Tika 0.10
(TIKA-611).  I chose to use the October 14 build of Solr because Tika 0.10
had recently been included (SOLR-2372).

On the same machine without changing any memory settings my initial problem
is a Perm Gen error.  Fine, I increase the PermGen space.

I've set the "onError" parameter to "skip" for the TikaEntityProcessor.
 Now I get several (6)

SEVERE: Exception thrown while getting data
java.net.SocketTimeoutException: Read timed out
SEVERE: Exception in entity :
tika:org.apache.solr.handler.dataimport.DataImportHandlerException:
Exception in invoking url  # 2975

pairs.  And after ~3881 documents, with auto commit set unreasonably
frequently I consistently get an Out of Memory Error

SEVERE: Exception while processing: f document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemoryError: Java heap space

The stack trace points
to 
org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
and org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
r.java:718).

The October 30 build performs identically.

Funny thing is that monitoring via JConsole doesn't reveal any memory
issues.

Because the out of Memory error did not occur in June, this leads me to
believe that a bug has been introduced to the code since then.  Should I
open an issue in JIRA?

Thanks,
Tricia

On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs  wrote:

> Hi Erick,
>
> I am using Solr 3.3.0, but with 1.4.1 the same problems.
> The connector is a homemade program in the C# programming language and is
> posting via http remote streaming (i.e.
>
> http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1
> )
> I'm using Tika to extract the content (comes with the Solr Cell).
>
> A possible problem is that the filestream needs to be closed, after
> extracting, by the client application, but it seems that there is going
> something wrong while getting a Tika-exception: the stream never leaves the
> memory. At least that is my assumption.
>
> What is the common way to extract content from officefiles (pdf, doc, rtf,
> xls etc) and index them? To write a content extractor / validator yourself?
> Or is it possible to do this with the Solr Cell without getting a huge
> memory consumption? Please let me know. Thanks in advance.
>
> Marc
>
> 2011/8/30 Erick Erickson 
>
> > What version of Solr are you using, and how are you indexing?
> > DIH? SolrJ?
> >
> > I'm guessing you're using Tika, but how?
> >
> > Best
> > Erick
> >
> > On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs  wrote:
> > > Hi all,
> > >
> > > Currently I'm testing Solr's indexing performance, but unfortunately
> I'm
> > > running into memory problems.
> > > It looks like Solr is not closing the filestream after an exception,
> but
> > I'm
> > > not really sure.
> > >
> > > The current system I'm using has 150GB of memory and while I'm indexing
> > the
> > > memoryconsumption is growing and growing (eventually more then 50GB).
> > > In the attached graph I indexed about 70k of office-documents
> > (pdf,doc,xls
> > > etc) and between 1 and 2 percent throws an exception.
> > > The commits are after 64MB, 60 seconds or after a job (there are 6
> evenly
> > > divided jobs).
> > >
> > > After indexing the memoryconsumption isn't dropping. Even after an
> > optimize
> > > command it's still there.
> > > What am I doing wrong? I can't imagine I'm the only one with this
> > problem.
> > > Thanks in advance!
> > >
> > > Kind regards,
> > >
> > > Marc
> > >
> >
>


Re: DIH doesn't handle bound namespaces?

2011-11-03 Thread P Williams
Hi Gary,

From
http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource

It does not support namespaces, but it can handle xmls with namespaces.
When you provide the xpath, just drop the namespace and give the rest (eg
if the tag is '<dc:subject>' the mapping should just contain 'subject').
Easy, isn't it? And you didn't need to write one line of code! Enjoy
You should be able to use xpath="//titleInfo/title" without making any
modifications (removing the namespace) to your xml.

I hope that answers your question.

Regards,
Tricia

On Mon, Oct 31, 2011 at 9:24 AM, Moore, Gary wrote:

> I'm trying to import some MODS XML using DIH.  The XML uses bound
> namespacing:
>
> <mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>  xmlns:mods="http://www.loc.gov/mods/v3"
>  xmlns:xlink="http://www.w3.org/1999/xlink"
>  xmlns="http://www.loc.gov/mods/v3"
>  xsi:schemaLocation="http://www.loc.gov/mods/v3
> http://www.loc.gov/mods/v3/mods-3-4.xsd"
>  version="3.4">
>   <mods:titleInfo>
>     <mods:title>Malus domestica: Arnold</mods:title>
>   </mods:titleInfo>
> </mods>
>
> However, XPathEntityProcessor doesn't seem to handle xpaths of the type
> xpath="//mods:titleInfo/mods:title".
>
> If I remove the namespaces from the source XML:
>
> <mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>  xmlns:mods="http://www.loc.gov/mods/v3"
>  xmlns:xlink="http://www.w3.org/1999/xlink"
>  xmlns="http://www.loc.gov/mods/v3"
>  xsi:schemaLocation="http://www.loc.gov/mods/v3
> http://www.loc.gov/mods/v3/mods-3-4.xsd"
>  version="3.4">
>   <titleInfo>
>     <title>Malus domestica: Arnold</title>
>   </titleInfo>
> </mods>
>
> then xpath="//titleInfo/title" works just fine.  Can anyone confirm that
> this is the case and, if so, recommend a solution?
> Thanks
> Gary
>
>
> Gary Moore
> Technical Lead
> LCA Digital Commons Project
> NAL/ARS/USDA
>
>


Re: how to apply sort and search both on multivalued field in solr

2011-11-03 Thread Erick Erickson
What does "sorting on a multivalued field" mean? Should the document
appear, in your example, in the a's? c's? e's? p's? There's no logical
place to sort a document into a list when there's more than one token that
makes sense in the general case that I can think of

Why wouldn't searching oh your multivalued field and sorting on your
min and max fields give you what you want? Can you give an example?

Best
Erick

On Wed, Nov 2, 2011 at 8:32 AM, vrpar...@gmail.com  wrote:
> Hello all,
>
> i did googling and also as per wiki, we can not apply sorting on multivalued
> field.
>
> workaround for that is we need to add two more fields for particular
> multivalued field, min and max.
>     e.g.     multivalued field have 4 values
>                         "abc",
>                         "cde",
>                         "efg",
>                         "pqr"
> than min="abc" and max="pqr"    and we can make sort on it.
>
> this is fine if there is only required to sort on multivalued field.
>
> but i want to do searching and sorting on same multivalued field, then
> result would not fine.
>
> how to solve this problem ?
>
> Thanks
> vishal parekh
>
>


RE: Jetty logging

2011-11-03 Thread darul
Well, Jetty is running as a Unix service.

Here is the run command:



jetty-logging.xml:




With this configuration I have the Jetty logs but no log4j logs, for example
"/logs/_mm_dd.stderrout.log"

2011-11-03 14:36:59.306:INFO::jetty-6.1-SNAPSHOT
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: JNDI not configured for solr (NoInitialContextEx)
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: using system property solr.solr.home: /opt/solr-slave/multicore
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader 
INFO: Solr home set to '/opt/solr-slave/multicore/'
Nov 3, 2011 2:36:59 PM org.apache.solr.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: JNDI not configured for solr (NoInitialContextEx)
Nov 3, 2011 2:36:59 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: using system property solr.solr.home: /opt/solr-slave/multicore
Nov 3, 2011 2:36:59 PM org.apache.solr.core.CoreContainer$Initializer
initialize

I would like Jetty to use my resource/log4j.properties file:






Re: Multivalued fields question

2011-11-03 Thread Travis Low
Thanks much, Erick.  Between your explanation, and what I read at
http://lucene.472066.n3.nabble.com/positionIncrementGap-in-schema-xml-td488338.html,
the utility of multiValued fields is clear.
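
For the archives, in SolrJ terms the pattern Erick describes looks roughly
like this (record numbers made up; untested):

    import org.apache.solr.common.SolrInputDocument;

    // assumes an already-initialized SolrServer instance named "solr"
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "100");
    doc.addField("redRecords", "1111");  // one addField() call per value...
    doc.addField("redRecords", "2222");  // ...is what makes redRecords multiValued
    solr.add(doc);
    solr.commit();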

On Thu, Nov 3, 2011 at 8:26 AM, Erick Erickson wrote:

> multiValued has nothing to do with how many tokens are in the field,
> it's just whether you can call document.add("field1", val1) more than
> once on the same field. Or, equivalently, in input document in XML
> has two  entries with the same name="field" entries. So it
> strictly depends upon whether you want to take it upon yourself
> to make these long strings or call document.add once for each
> value in the field.
>
> The field is returned as an array if it's multiValued
>
> Just to make your life interesting: if you define your increment gap as 0,
> there is no difference between how multiValued fields are searched
> as opposed to single-valued fields.
>
> FWIW
> Erick
>
> On Tue, Nov 1, 2011 at 1:26 PM, Travis Low  wrote:
> > Greetings.  We're finally kicking off our little Solr project.  We're
> > indexing a paltry 25,000 records but each has MANY documents attached, so
> > we're using Tika to parse those documents into a big long string, which
> we
> > use in a call to solrj.addField("relateddoccontents",
> > bigLongStringOfDocumentContents).  We don't care about search results
> > pointing back to a particular document, just one of the 25K records, so
> > this should work.
> >
> > Now my question.  Many of these records have related records in other
> > tables, and there are several types of these related records.  For
> example,
> > we have record #100 that my have blue records with numbers , ,
> > , and , and red records with numbers , , , .
> > Currently we're just handling these the same way as related document
> > contents -- we concatenate them, separated by spaces, into one long
> string,
> > then we do solrj.addField("redRecords", stringOfRedRecordNumbers).  That
> > is, stringOfRedRecordNumbers is "   ".
> >
> > We have no need to show these records to the user in Solr search results,
> > because we're going to use the database for displaying of detailed
> > information for any records found.  Is there any reason to specify
> > redRecords and blueRecords as multivalued fields in schema.xml?  And if
> we
> > did that, we'd call solrj.addField() once for each value, would we not?
> >
> > cheers,
> >
> > Travis
> >
>



-- 

Travis Low, Director of Development

Centurion Research Solutions, LLC
14048 ParkEast Circle • Suite 100 • Chantilly, VA 20151
703-956-6276 • 703-378-4474 (fax)
http://www.centurionresearch.com

The information contained in this email message is confidential and
protected from disclosure.  If you are not the intended recipient, any use
or dissemination of this communication, including attachments, is strictly
prohibited.  If you received this email message in error, please delete it
and immediately notify the sender.

This email message and any attachments have been scanned and are believed
to be free of malicious software and defects that might affect any computer
system in which they are received and opened. No responsibility is accepted
by Centurion Research Solutions, LLC for any loss or damage arising from
the content of this email.


AW: large scale indexing issues / single threaded bottleneck

2011-11-03 Thread sebastian.reese
Hi,

we are currently thinking about the performance facts too.
I wonder if there are any sites on the net describing what a large index is? 

People always talk about huge indexes and heavy commits etc., but I can't find
any stats about it in numbers, nor any information about the hardware used.

Maybe an article in the wiki would help.

I expect our index to be about 4 to 5 gig with 500.000 docs and 80.000 commits 
a day. Is that considered to be large, medium or small?

Greets
Sebastian

-Ursprüngliche Nachricht-
Von: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov] 
Gesendet: Donnerstag, 3. November 2011 14:00
An: 'solr-user@lucene.apache.org'
Betreff: RE: large scale indexing issues / single threaded bottleneck

Shishir, we have 35 million "documents", and should be doing about 5000-1 
new "documents" a day, but with very small "documents":  40 fields which have 
at most a few terms, with many being single terms.   

You may occasionally see some impact from top level index merges but those 
should be very infrequent given your stated volumes.

For more concrete advice, you should also provide information on the size of 
your documents, and your search volume.

JRJ

-Original Message-
From: Awasthi, Shishir [mailto:shishir.awas...@baml.com]
Sent: Tuesday, November 01, 2011 10:58 PM
To: solr-user@lucene.apache.org
Subject: RE: large scale indexing issues / single threaded bottleneck

Roman,
How frequently do you update your index? I have a need to do real time 
add/delete to SOLR documents at a rate of approximately 20/min.
The total number of documents are in the range of 4 million. Will there be any 
performance issues?

Thanks,
Shishir

-Original Message-
From: Roman Alekseenkov [mailto:ralekseen...@gmail.com]
Sent: Sunday, October 30, 2011 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies.

I think I have figured out a partial solution for the problem on Friday night. 
Adding a whole bunch of debug statements to the info stream showed that every 
document is following "update document" path instead of "add document" path. 
Meaning that all document IDs are getting into the "pending deletes" queue, and 
Solr has to rescan its index on every commit for potential deletions. This is 
single threaded and seems to get progressively slower with the index size.

Adding overwrite=false to the URL in /update handler did NOT help, as my debug 
statements showed that messages still go to updateDocument() function with 
deleteTerm not being null. So, I hacked Lucene a little bit and set 
deleteTerm=null as a temporary solution in the beginning of updateDocument(), 
and it does not call applyDeletes() anymore. 

This gave a 6-8x performance boost, and now we can index about 9 million 
documents/hour (producing 20Gb of index every hour). Right now it's at 1TB 
index size and going, without noticeable degradation of the indexing speed.
This is decent, but still the 24-core machine is barely utilized :)

Now I think it's hitting a merge bottleneck, where all indexing threads are 
being paused. And ConcurrentMergeScheduler with 4 threads is not helping much. 
I guess the changes on the trunk would definitely help, but we will likely stay 
on 3.4

Will dig more into the issue on Monday. Really curious to see why 
"overwrite=false" didn't help, but the hack did.

Once again, thank you for the answers and recommendations

Roman




--
This message w/attachments (message) is intended solely for the use of the 
intended recipient(s) and may contain information that is privileged, 
confidential or proprietary. If you are not an intended recipient, please 
notify the sender, and then please delete and destroy all copies and 
attachments, and be advised that any review or dissemination of, or the taking 
of any action in reliance on, the information contained in or attached to this 
message is prohibited. 
Unless specifically indicated, this message is not an offer to sell or a 
solicitation of any investment products or other financial product or service, 
an official confirmation of any transaction, or an official statement of 
Sender. Subject to applicable law, Sender may intercept, monitor, review and 
retain e-communications (EC) traveling through its networks/systems and may 
produce any such EC to regulators, law enforcement, in litigation and as 
required by law. 
The laws of the country of each sender/recipient may impact the handling of EC, 
and EC may be archived, supervised and produced in countries other than the 
country in which you are located. This message cannot be guaranteed to be 
secure or free of errors or viruses.

RE: Questions about Solr's security

2011-11-03 Thread Jaeger, Jay - DOT
It seems to me that this issue needs to be addressed in the FAQ and in the 
tutorial, and that somewhere there should be a /select lock-down "how to".   
This is not obvious to many (most?) users of Solr.  It certainly wasn't obvious 
to me before I read this.

JRJ

-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Tuesday, November 01, 2011 3:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about Solr's security

SSL and auth doesn't address that /select can hit any request handler defined 
(/select?qt=/update&stream.body=<delete><query>*:*</query></delete>&commit=true).
  Be careful!

But certainly knowing all the issues mentioned on this thread, it is possible 
to lock Solr down and make it safe to hit directly.  But not out of the box or 
trivially.

Erik



On Nov 1, 2011, at 16:09 , Alireza Salimi wrote:

> I'm not sure if anybody has asked these questions before or not.
> Sorry if they are duplicates.
> 
> The problem is that the clients (smart phones) of our Solr machines
> are outside the network in which solr machines are located. So, we
> need to somehow expose their service to the outside word.
> 
> What's the safest way to do that?
> If we implement just a controlled app sitting between those clients
> we gonna waste lots of processing power because of proxying between
> Solr and clients.
> 
> We might also ignore some HTTP headers that Solr would generate
> such as HTTP Cache headers. Anyways, creating such an application
> seems to be a lot of work which is not that needed.
> 
> Erik, do you think even if we use SSL and HTTP Authentication, still
> it's not a good idea to expose Solr services?
> 
> 
> 
> On Tue, Nov 1, 2011 at 3:57 PM, Erik Hatcher  wrote:
> 
>> Be aware that even /select could have some harmful effects, see
>> https://issues.apache.org/jira/browse/SOLR-2854 (addressed on trunk).
>> 
>> Even disregarding that issue, /select is a potential gateway to any
>> request handler defined via /select?qt=/req_handler
>> 
>> Again, in general it's not a good idea to expose Solr to anything but a
>> controlled app server.
>> 
>>   Erik
>> 
>> On Nov 1, 2011, at 15:51 , Alireza Salimi wrote:
>> 
>>> What if we just expose '/select' paths - by firewalls and load
>>> balancers - and also use SSL and HTTP basic or digest access control?
>>> 
>>> On Tue, Nov 1, 2011 at 2:20 PM, Chris Hostetter <
>> hossman_luc...@fucit.org>wrote:
>>> 
 
 : I was wondering if it's a good idea to expose Solr to the outside
>> world,
 : so that our clients running on smart phones will be able to use Solr.
 
 As a general rule of thumb, i would say that it is not a good idea to
 expose solr directly to the public internet.
 
 there are exceptions to this rule -- AOL hosted some live solr instances
 of the Sarah Palin emails for HuffPo -- but it is definitely an expert
 level type thing for people who are so familiar with solr they know
 exactly what to lock down to make it "safe"
 
 for typical users: put an application between your untrusted users and
 Solr and only let that application generate "safe", well-formed requests to
 Solr...
 
 https://wiki.apache.org/solr/SolrSecurity
 
 
 -Hoss
 
>>> 
>>> 
>>> 
>>> --
>>> Alireza Salimi
>>> Java EE Developer
>> 
>> 
> 
> 
> -- 
> Alireza Salimi
> Java EE Developer



RE: change solr url

2011-11-03 Thread Jaeger, Jay - DOT
The file that he refers to, web.xml, is inside the solr WAR file in folder 
WEB-INF.  That WAR file is in ...\example\webapps.   You would have to 
uncomment the path-prefix <init-param> section under the SolrRequestFilter 
<filter> element and change the <param-value> to something else.  But, as the 
comments in that section explain, you would also have to make other changes.

If you are unfamiliar with how JEE Java applications are packaged, it might be 
best to leave it alone.

Note that both alternatives that he has suggested would change the path for all 
of solr, not just admin.
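
For illustration, the relevant fragment would look roughly like this
(Solr 3.x web.xml; the /private value is the prefix Ankita asked for):

   <filter>
     <filter-name>SolrRequestFilter</filter-name>
     <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>
     <init-param>
       <param-name>path-prefix</param-name>
       <param-value>/private</param-value>
     </init-param>
   </filter>

You would then also have to adjust the corresponding <filter-mapping>
<url-pattern> so requests under the new prefix actually reach the filter.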

JRJ


-Original Message-
From: Ankita Patil [mailto:ankita.pa...@germinait.com] 
Sent: Tuesday, November 01, 2011 11:44 PM
To: solr-user@lucene.apache.org
Subject: Re: change solr url

I am not very clear on this. Could you explain in a bit more detail or give an example?

Ankita.

On 2 November 2011 06:26, Chris Hostetter  wrote:

>
> : Is it possible to change the url for solr admin??
> : What i want is :
> : http://192.168.0.89:8983/solr/private/coreName/admin
> :
> : i want to add /private/ before the coreName. Is that possible? If yes
> how?
>
> You can either do this via settings in your servlet container (to specify
> that the mapping of the Solr application should be "solr/private" instead
> of "solr/") or you can modify the "path-prefix" value in Solr's web.xml
> (but that is not very well tested/supported)
>
>
>
>
> -Hoss
>


RE: large scale indexing issues / single threaded bottleneck

2011-11-03 Thread Jaeger, Jay - DOT
Shishir, we have 35 million "documents", and should be doing about 
5,000-10,000 new "documents" a day, but with very small "documents":  40 
fields which have at most a few terms, with many being single terms.

You may occasionally see some impact from top level index merges but those 
should be very infrequent given your stated volumes.

For more concrete advice, you should also provide information on the size of 
your documents, and your search volume.

JRJ

-Original Message-
From: Awasthi, Shishir [mailto:shishir.awas...@baml.com] 
Sent: Tuesday, November 01, 2011 10:58 PM
To: solr-user@lucene.apache.org
Subject: RE: large scale indexing issues / single threaded bottleneck

Roman,
How frequently do you update your index? I have a need to do real time
add/delete to SOLR documents at a rate of approximately 20/min.
The total number of documents is in the range of 4 million. Will there
be any performance issues?

Thanks,
Shishir

-Original Message-
From: Roman Alekseenkov [mailto:ralekseen...@gmail.com] 
Sent: Sunday, October 30, 2011 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: large scale indexing issues / single threaded bottleneck

Guys, thank you for all the replies.

I think I have figured out a partial solution to the problem on Friday
night. Adding a whole bunch of debug statements to the info stream
showed that every document follows the "update document" path instead
of the "add document" path, meaning that all document IDs get into
the "pending deletes" queue, and Solr has to rescan its index on every
commit for potential deletions. This is single threaded and seems to get
progressively slower with the index size.

Adding overwrite=false to the /update URL did NOT help, as my debug
statements showed that calls still go to the updateDocument() function
with a non-null deleteTerm. So, I hacked Lucene a little bit and set
deleteTerm=null as a temporary solution at the beginning of
updateDocument(), and it does not call applyDeletes() anymore.
This gave a 6-8x performance boost, and now we can index about 9 million
documents/hour (producing 20Gb of index every hour). Right now it's at
1TB index size and going, without noticeable degradation of the indexing
speed.
This is decent, but still the 24-core machine is barely utilized :)

Now I think it's hitting a merge bottleneck, where all indexing threads
are being paused. And ConcurrentMergeScheduler with 4 threads is not
helping much. I guess the changes on the trunk would definitely help,
but we will likely stay on 3.4

Will dig more into the issue on Monday. Really curious to see why
"overwrite=false" didn't help, but the hack did.

Once again, thank you for the answers and recommendations

Roman



--
View this message in context:
http://lucene.472066.n3.nabble.com/large-scale-indexing-issues-single-threaded-bottleneck-tp3461815p3466523.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using Solr components for dictionary matching?

2011-11-03 Thread Andrea Gazzarini
Assuming that by "dictionary" you (also) mean a thesaurus, you could
consider using SIREn, which is a Solr / Lucene add-on able to index
(and search) RDF data.

That way, you could index an already available thesaurus like LCSH or
Agrovoc, or build and index your own vocabulary.

Subsequently, querying its services for lookups will benefit from Solr /
Lucene features.

Best,
Andrea

On 11/1/11, Nagendra Mishr  wrote:
> Hi all,
>
> Is there a good guide on using Solr components as a dictionary
> matcher?  I'm need to do some pre-processing that involves lots of
> dictionary lookups and it doesn't seem right to query solr for each
> instance.
>
> Thanks in advance,
>
> Nagendra
>


RE: Jetty logging

2011-11-03 Thread Kai Gülzau
Hi,

remove slf4j-jdk14-1.6.1.jar from the war and repack it with slf4j-log4j12.jar 
and log4j-1.2.14.jar instead.

->http://wiki.apache.org/solr/SolrLogging
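
For example (a sketch; jar names as in a stock Solr 3.x war, paths are
assumptions):

   mkdir tmp && cd tmp
   jar xf ../solr.war
   rm WEB-INF/lib/slf4j-jdk14-1.6.1.jar
   cp /path/to/slf4j-log4j12-1.6.1.jar /path/to/log4j-1.2.14.jar WEB-INF/lib/
   jar cf ../solr.war .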

Regards,

Kai Gülzau

-Original Message-
From: darul [mailto:daru...@gmail.com] 
Sent: Thursday, November 03, 2011 11:26 AM
To: solr-user@lucene.apache.org
Subject: Jetty logging

Hello everybody,

I cannot find a solution for configuring Jetty with slf4j and a
log4j.properties file.

In jetty/lib/ext I have put:

- log4j-1.2.14.jar
- slf4j-api-1.3.1.jar

in the resources directory:
- log4j.properties



In the end, nothing happens when running Jetty.

Do you have any ideas?

Thanks,

Julien





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Jetty-logging-tp3476715p3476715.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: pingQuery problem ?

2011-11-03 Thread darul
One of my cores was missing its ping request handler.
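
For reference, a sketch of the handler that was missing (solrconfig.xml,
Solr 3.x; the ping query itself is arbitrary):

   <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
     <lst name="invariants">
       <str name="q">solrpingquery</str>
     </lst>
   </requestHandler>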

--
View this message in context: 
http://lucene.472066.n3.nabble.com/pingQuery-problem-tp3476850p3476980.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multivalued fields question

2011-11-03 Thread Erick Erickson
multiValued has nothing to do with how many tokens are in the field;
it's just whether you can call document.add("field1", val1) more than
once on the same field. Or, equivalently, whether an input XML document
has two <field> entries with the same name attribute. So it strictly
depends upon whether you want to take it upon yourself to make these
long strings or call document.add once for each value in the field.

The field is returned as an array if it's multiValued.

Just to make your life interesting: if you define your increment gap
as 0, there is no difference between how multiValued fields are searched
as opposed to single-valued fields.
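
To illustrate with SolrJ (the real call is SolrInputDocument.addField;
field names and values here are made up):

   SolrInputDocument doc = new SolrInputDocument();

   // multiValued="true" in schema.xml: one addField call per value,
   // and the field comes back as an array in search results
   doc.addField("redRecords", "1111");
   doc.addField("redRecords", "2222");

   // single-valued alternative: concatenate the values yourself
   doc.addField("redRecordsFlat", "1111 2222");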

FWIW
Erick

On Tue, Nov 1, 2011 at 1:26 PM, Travis Low  wrote:
> Greetings.  We're finally kicking off our little Solr project.  We're
> indexing a paltry 25,000 records but each has MANY documents attached, so
> we're using Tika to parse those documents into a big long string, which we
> use in a call to solrj.addField("relateddoccontents",
> bigLongStringOfDocumentContents).  We don't care about search results
> pointing back to a particular document, just one of the 25K records, so
> this should work.
>
> Now my question.  Many of these records have related records in other
> tables, and there are several types of these related records.  For example,
> we have record #100 that may have blue records with numbers , ,
> , and , and red records with numbers , , , .
> Currently we're just handling these the same way as related document
> contents -- we concatenate them, separated by spaces, into one long string,
> then we do solrj.addField("redRecords", stringOfRedRecordNumbers).  That
> is, stringOfRedRecordNumbers is "   ".
>
> We have no need to show these records to the user in Solr search results,
> because we're going to use the database for displaying of detailed
> information for any records found.  Is there any reason to specify
> redRecords and blueRecords as multivalued fields in schema.xml?  And if we
> did that, we'd call solrj.addField() once for each value, would we not?
>
> cheers,
>
> Travis
>


Re: exact matches possible?

2011-11-03 Thread Erik Hatcher
Roland -

Is it possible that you indexed with a different field type and then changed to 
"string" without reindexing?   A query on a string will only match literally 
the exact value (barring any wildcard/regex syntax), so something is fishy with 
your example.  Your query example was odd, not sure if you meant it literally, 
but given the Word field name the query would be q={!term f=Word}apple - maybe 
you thought "term" was meta, but it is meant literally here.
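
A sketch of the combination being suggested (the field definition is an
assumption based on your description; URL-encode the query as needed):

   <field name="Word" type="string" indexed="true" stored="true"/>

   http://localhost:8983/solr/select?q={!term f=Word}apple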

Erik

On Nov 3, 2011, at 04:45 , Roland Tollenaar wrote:

> Hi Erik,
> 
> thanks for the response. I have ensured the type is string and that the field 
> is indexed. No luck though:
> 
> (Schema setting under solr/conf):
> 
> <field name="Word" type="string" indexed="true" stored="true"/>
> Query:
> 
> Word:apple
> 
> Desired result:
> 
> apple
> 
> Achieved Results:
> 
> apple, the red apple, pine-apple, etc, etc
> 
> 
> I have also tried your other suggestion:
> q={!" " f=Word}apple
> (attempting to eliminate any results with spaces)
> 
> But that just gives errors (calling from the solr/admin query interface).
> 
> Am I doing something obviously wrong?
> 
> Thanks again,
> 
> Roland
> 
> > It's certainly quite possible with Lucene/Solr.  But you have to index
> > the field to accommodate it.  If you literally want an exact match
> > query, use the "string" field type and then issue a term query.
> > q=field:value will work in simple cases (where the value has no spaces
> > or colons, or other query parser syntax), but q={!term f=field}value
> > is the fail-safe way to do that.
> 
> > Erik
> 
> 
> 
> 
> 
> 
> 
> 
> Erik Hatcher wrote:
>> It's certainly quite possible with Lucene/Solr.  But you have to index the 
>> field to accommodate it.  If you literally want an exact match query, use 
>> the "string" field type and then issue a term query.  q=field:value will 
>> work in simple cases (where the value has no spaces or colons, or other 
>> query parser syntax), but q={!term f=field}value is the fail-safe way to do 
>> that.
>>  Erik
>> On Nov 2, 2011, at 07:08 , Roland Tollenaar wrote:
>>> Hi,
>>> 
>>> I am trying to do a search that will only match exact words on a field.
>>> 
>>> I have read somewhere that this is not what SOLR is meant for but I am 
>>> still hoping that it's possible.
>>> 
>>> This is an example of what I have tried (to exclude spaces) but the 
>>> workaround does not seem to work.
>>> 
>>> Word:apple NOT " "
>>> 
>>> What I am really looking for is the "=" operator in SQL (eg Word='apple') 
>>> but I cannot find its equivalent for lucene.
>>> 
>>> Thanks for the help.
>>> 
>>> Regards,
>>> 
>>> Roland
>>> 
>>> 



Re: SOLRJ commitWithin inconsistent

2011-11-03 Thread Nagendra Nagarajayya

Vijay:

You may want to try Solr 3.3/3.4 with RankingAlgorithm as it supports 
NRT (Real Time Updates). You can set the commit interval to about 15 
mins or as desired.


You can get more information about NRT with 3.3/3.4.0 from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.3/3.4.0 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org


Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 11/2/2011 8:40 PM, Vijay Sampath wrote:

Hi,

  I'm using CommitWithin for immediate commit.  The response times are
inconsistent. Sometimes it's less than a second. Sometimes more than 25
seconds. I'm not sending concurrent requests. Any idea?

  http://wiki.apache.org/solr/CommitWithin

   Snippet:

   UpdateRequest req = new UpdateRequest();
   req.add(solrDoc);          // queue the document
   req.setCommitWithin(5000); // ask Solr to commit within 5000 ms
   req.process(server);       // send to the SolrServer



Thanks,
Vijay

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLRJ-commitWithin-inconsistent-tp3476104p3476104.html
Sent from the Solr - User mailing list archive at Nabble.com.






Re: Using Solr components for dictionary matching?

2011-11-03 Thread Erick Erickson
I really don't understand what you're asking. Could you give some
examples of what you're trying to do?

Best
Erick

On Tue, Nov 1, 2011 at 10:38 AM, Nagendra Mishr  wrote:
> Hi all,
>
> Is there a good guide on using Solr components as a dictionary
> matcher?  I'm need to do some pre-processing that involves lots of
> dictionary lookups and it doesn't seem right to query solr for each
> instance.
>
> Thanks in advance,
>
> Nagendra
>


pingQuery problem ?

2011-11-03 Thread darul
My Solr instance works well; when calling the ping page I get no
problem:


But in the logs I see these error lines repeated; do you know how to
solve this?



solrconfig.xml



Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/pingQuery-problem-tp3476850p3476850.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: exact matches possible?

2011-11-03 Thread Roland Tollenaar

Hi Erik,

thanks for the response. I have ensured the type is string and that the 
field is indexed. No luck though:


(Schema setting under solr/conf):

<field name="Word" type="string" indexed="true" stored="true"/>
Query:

Word:apple

Desired result:

apple

Achieved Results:

apple, the red apple, pine-apple, etc, etc


I have also tried your other suggestion:
q={!" " f=Word}apple
(attempting to eliminate any results with spaces)

But that just gives errors (calling from the solr/admin query
interface).


Am I doing something obviously wrong?

Thanks again,

Roland

> It's certainly quite possible with Lucene/Solr.  But you have to index
> the field to accommodate it.  If you literally want an exact match
> query, use the "string" field type and then issue a term query.
> q=field:value will work in simple cases (where the value has no spaces
> or colons, or other query parser syntax), but q={!term f=field}value
> is the fail-safe way to do that.


> Erik








Erik Hatcher wrote:

It's certainly quite possible with Lucene/Solr.  But you have to index the field to 
accommodate it.  If you literally want an exact match query, use the "string" 
field type and then issue a term query.  q=field:value will work in simple cases (where 
the value has no spaces or colons, or other query parser syntax), but q={!term 
f=field}value is the fail-safe way to do that.

Erik

On Nov 2, 2011, at 07:08 , Roland Tollenaar wrote:


Hi,

I am trying to do a search that will only match exact words on a field.

I have read somewhere that this is not what SOLR is meant for but I am still 
hoping that its possible.

This is an example of what I have tried (to exclude spaces) but the workaround 
does not seem to work.

Word:apple NOT " "

What I am really looking for is the "=" operator in SQL (eg Word='apple') but I 
cannot find its equivalent for lucene.

Thanks for the help.

Regards,

Roland