Re: QueryElevationComponent not working in Distributed Search

2012-10-08 Thread vasokan
Hi Erick,

I cannot migrate to 4.0-ALPHA or 4.0-BETA because of dependencies in the
indexing configuration in solrconfig.xml and schema.xml.

When I try to use the 4.0 version, a series of errors pops up.  Also
I cannot change the entire set of configuration files available to me.

So I tried applying the diffs that were available as attachments on the
issue I have mentioned below:
https://issues.apache.org/jira/browse/SOLR-2949 .  But I was still facing
some issues, so I tried replacing QueryElevationComponent.java with the one
from the newer versions.  But I still do not find the elevation
functionality to be working for distributed search.

Can you please let me know if there is any means by which I can include this
fix without migrating to newer versions.

Thank you,
Vinoth





Re: Problem with relating values in two multi value fields

2012-10-08 Thread Torben Honigbaum
Hi Mikhail,

sorry, my fault. This was one of my first ideas. My problem is that I've
1.000.000 documents, each with about 20 attributes. Additionally each document
has between 200 and 500 option-value pairs. So if I denormalize the data, it
means that I've 1.000.000 x 350 ((200 + 500) / 2) = 350.000.000 documents, each
with 20 attributes.

Is denormalization the only way to handle this problem?

Thank you
Torben 

On 06.10.2012 at 12:30, Mikhail Khludnev wrote:

 Torben,
 
 Denormalization implies copying attrs which are common for a group into the
 smaller docs:
 
 <doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">A</str>
  <str name="value">200</str>
 </doc>
 <doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">B</str>
  <str name="value">400</str>
 </doc>
 <doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">B</str>
  <str name="value">400</str>
 </doc>
 <doc>
  <str name="setid">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">C</str>
  <str name="value">240</str>
 </doc>
 
 and use group.facet=true
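 
 A hypothetical request against the denormalized docs above (URL and core
 name assumed):
 
 http://localhost:8983/solr/select?q=options:A&group=true&group.field=setid&facet=true&facet.field=value&group.facet=true
 
 With group.facet=true, the facet counts on the value field are computed per
 setid group rather than per (denormalized) document.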
 
 On Sat, Oct 6, 2012 at 2:24 AM, Torben Honigbaum 
 torben.honigb...@neuland-bfi.de wrote:
 
 Hi Mikhail,
 
 thank you for your answer. Maybe my sample data was not so good. The
 documents always have additional data which I need to use as facets, like this:
 
 <doc>
  <str name="id">3</str>
  <str name="attribute_A">value</str>
  <str name="attribute_B">value</str>
  <str name="options">
    <str>A</str>
    <str>B</str>
    ...
  </str>
  <str name="value">
    <str>200</str>
    <str>400</str>
    ...
  </str>
 </doc>
 
 Torben
 
 On 05.10.2012 at 17:20, Mikhail Khludnev wrote:
 
 denormalize your docs into option x value tuples, identifying them by a
 duplicated id.
 
 <doc>
 <str name="setid">3</str>
 <str name="options">A</str>
 <str name="value">200</str>
 </doc>
 <doc>
 <str name="setid">3</str>
 <str name="options">B</str>
 <str name="value">400</str>
 </doc>
 <doc>
 <str name="setid">3</str>
 <str name="options">B</str>
 <str name="value">400</str>
 </doc>
 <doc>
 <str name="setid">3</str>
 <str name="options">C</str>
 <str name="value">240</str>
 </doc>
 
 then collapse them by the setid field (it cannot be the uniqueKey).
 
 On Fri, Oct 5, 2012 at 6:26 PM, Torben Honigbaum 
 torben.honigb...@neuland-bfi.de wrote:
 
 Hi Mikhail,
 
 I read the article and can't see how to solve my problem with
 FieldCollapsing.
 
 Any other suggestions?
 
 Torben
 
 On 04.10.2012 at 17:31, Mikhail Khludnev wrote:
 
 it's a typical nested document problem. there are several approaches. As
 far as you need facets, the out-of-the-box solution is
 http://wiki.apache.org/solr/FieldCollapsing .
 
 On Thu, Oct 4, 2012 at 7:19 PM, Torben Honigbaum 
 torben.honigb...@neuland-bfi.de wrote:
 
 Hi Jack,
 
 thank you for your answer. The problem is that I don't know the value for
 option A, the values are numbers, and I have to use the values as a
 facet. So I need something like this:
 
 Docs:
 
 <doc>
 <str name="id">3</str>
 <str name="options">
  <str>A</str>
  <str>B</str>
  ...
 </str>
 <str name="value">
  <str>200</str>
  <str>400</str>
  ...
 </str>
 </doc>
 <doc>
 <str name="id">4</str>
 <str name="options">
  <str>A</str>
  <str>E</str>
  ...
 </str>
 <str name="value">
  <str>300</str>
  <str>400</str>
  ...
 </str>
 </doc>
 <doc>
 <str name="id">6</str>
 <str name="options">
  <str>A</str>
  <str>C</str>
  ...
 </str>
 <str name="value">
  <str>200</str>
  <str>400</str>
  ...
 </str>
 </doc>
 
 Query: …?q=options:A
 
 Facet: 200 (2), 300 (1)
 
 Thank you
 Torben
 
 On 04.10.2012 at 17:10, Jack Krupansky wrote:
 
 Use a field called option_value_pairs with values like "A 200" and
 then query with a quoted phrase "A 200".
 
 You could use a special character like an equals sign instead of a space:
 "A=200", and then you don't have to quote it in the query.
 
 -- Jack Krupansky
 
 -Original Message- From: Torben Honigbaum
 Sent: Thursday, October 04, 2012 11:03 AM
 To: solr-user@lucene.apache.org
 Subject: Problem with relating values in two multi value fields
 
 Hello,
 
 I've a problem with relating values in two multi value fields. My
 documents look like this:
 
 <doc>
 <str name="id">3</str>
 <str name="options">
 <str>A</str>
 <str>B</str>
 <str>C</str>
 <str>D</str>
 </str>
 <str name="value">
 <str>200</str>
 <str>400</str>
 <str>240</str>
 <str>310</str>
 </str>
 </doc>
 
 My problem is that I have to search for a set of documents and display
 only the value for option A, for example, and use the value field as a facet
 field. I need a result like this:
 
 <doc>
 <str name="id">3</str>
 <str name="options">A</str>
 <str name="value">200</str>
 </doc>
 facet …
 
 I think that this is a use case which isn't possible, right? So can
 someone show me an alternative way to solve this problem? The
 documents
 each have 500 options with 500 related values.
 
 Thank you
 Torben
 
 
 
 
 
 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com
 
 
 
 
 

Re: Adding a new pseudo field

2012-10-08 Thread Upayavira
If I've understood you correctly, you could achieve this also with the
XSLTResponseWriter; it would be pretty trivial to write an XSLT stylesheet
that exposes the node position in the results, containing:

<position><xsl:value-of select="position()"/></position>

Stick that in solr/conf/xslt, and reference it with wt=xslt&tr=<your-stylesheet>.xsl

That way you wouldn't need to modify Solr at all.
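
A rough sketch of a complete stylesheet along those lines (untested, and
assuming the stock XML response layout; the file name is up to you):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <result>
      <!-- walk each doc in the standard response and prepend its position -->
      <xsl:for-each select="response/result/doc">
        <doc>
          <position><xsl:value-of select="position()"/></position>
          <xsl:copy-of select="*"/>
        </doc>
      </xsl:for-each>
    </result>
  </xsl:template>
</xsl:stylesheet>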

Also, look in Solr 4.0, which has calculated fields. Not sure if there's
the scope to find the document position as a function query though.

Upayavira

On Mon, Oct 8, 2012, at 05:02 AM, deniz wrote:
 well basically i was about to explain and ask once more for your opinions,
 but this morning i just wanted to try something in the source code and it
 succeeded... so here is what i wanted and what i did to get it:
 
 
 What I wanted:
 
 The exact thing I want is similar to the score field. Normally it always
 exists, but we can't see it in a normal query response unless we set
 fl=*,score.
 For my case, I would like to see each document's position in a pseudo
 field like score, so when i run a query with fl=*,position I want to see
 <position>5</position> for the 5th document in the result set.
 So, to make it more clear, when you search for
 q=name:deniz&fl=*,position,score the result set will be something like:
 
 <doc><position>1</position><id>986</id><score>5</score></doc>
 <doc><position>2</position><id>1002</id><score>4</score></doc>
 <doc><position>3</position><id>140</id><score>3</score></doc>
 
 and when the user runs another query, let's say
 q=name:stephan&fl=*,position,score, the result set will be like:
 
 <doc><position>1</position><id>140</id><score>8</score></doc>
 <doc><position>2</position><id>986</id><score>5</score></doc>
 <doc><position>3</position><id>1002</id><score>1</score></doc>
 
 as you can see, each query will produce different scores, therefore a
 document's position - or ranking, whichever you prefer - will change
 according to the query.
 
 
 What I did:
 
 well, after digging through the source code, I am now able to see dynamic
 positions for each different search.. I have simply added a position
 function to DocIterator and implemented it in the subclasses. Then I added
 a control block in ReturnFields to check if fl has position in it. It works
 in a similar way to score, and the last thing to do was adding a custom
 augmenter class, PositionAugmenter - similar to ScoreAugmenter. Then I
 am done :)
 
 I hope it helps if anyone faces a similar issue...
 
 
 
 


Re: Storing queries in Solr

2012-10-08 Thread Upayavira
Solr has a small query cache, but this does not hold queries for any
length of time, so won't suit your purpose.

The LucidWorks Search product has (I believe) a click tracking feature,
but that is about boosting documents that are clicked on, not specific
search terms. Parsing the Solr log, or pushing query terms to a
different core/index would really be the only way to achieve what you're
suggesting, as far as I am aware.

Processing logs would be preferable anyhow, as you don't really want to
be triggering an index write during each query (assuming you have more
queries than updates to your main index), and also if this is for
building a suggester index, then it is unlikely to need updating that
regularly - every hour or every day should be more than sufficient. You
could write a SearchComponent that logs queries in another format,
should the existing log format not be sufficient for you.
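
A rough, untested sketch of such a component (class and logger names
hypothetical; written against the 4.0 API - 3.x additionally requires
getSourceId()/getVersion() overrides). It could be registered with
<searchComponent name="querylog" class="com.example.QueryLogComponent"/>
and appended to a handler's last-components list:

import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QueryLogComponent extends SearchComponent {

  // a dedicated logger, so query strings can be routed to their own file
  private static final Logger queryLog = LoggerFactory.getLogger("querylog");

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing to prepare
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // log the raw user query, one per line, for a later offline suggester build
    String q = rb.req.getParams().get(CommonParams.Q);
    if (q != null) {
      queryLog.info(q);
    }
  }

  @Override
  public String getDescription() {
    return "Logs raw query strings to a dedicated logger";
  }

  @Override
  public String getSource() {
    return "";
  }
}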

Upayavira

On Mon, Oct 8, 2012, at 01:24 AM, Jorge Luis Betancourt Gonzalez wrote:
 Hi!
 
 I was wondering if there is any built-in mechanism that allows me to
 store the queries made to a solr server inside the index itself. I know
 that the suggester module exists, but as far as I know it only works for
 terms existing in the index, and not with queries. I remember reading
 about using some external program to parse the solr log and push the
 queries or any other interesting data into the index; is this the only
 way of accomplishing this?
 
 Greetings!


Re: Adding a new pseudo field

2012-10-08 Thread Upayavira
Good question. I know xslt could output json, but you'd have to write a
stylesheet that transforms the xml into json. I'm not sure whether you
can influence the content-type for the output with the xslt response
writer though.

There's also the velocity response writer, which sits behind the /browse
interface, that might help you also.

Upayavira

On Mon, Oct 8, 2012, at 08:54 AM, deniz wrote:
 Could the xslt processor be useful for a json response too? Because I will
 be using the response not in a browser but from some other jars..
 
 
 


Re: add shard to index

2012-10-08 Thread Upayavira
Given that Solr does not support distributed IDF, adding a shard without
balancing the number of documents could seriously skew your scoring. If
you are okay with that, then the next question is what happens if you
download the clusterstate.json from ZooKeeper, and add another entry,
along the lines of shard3:{}, then upload it again, what would happen
then?
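
Roughly, a hypothetical clusterstate.json after that edit (heavily elided;
the real file carries replica entries under each shard):

{"collection1":{
    "shard1":{ ...existing replicas... },
    "shard2":{ ...existing replicas... },
    "shard3":{}}}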

My theory is that the next host you start up would become the first node
of shard3. Worth a try (unless someone more knowledgeable tells us
otherwise!)

Upayavira

On Mon, Oct 8, 2012, at 01:35 AM, Radim Kolar wrote:
 i am reading this: http://wiki.apache.org/solr/SolrCloud section 
 Re-sizing a Cluster
 
 Is it possible to add a shard to an existing index? I do not need the
 data redistributed; it can stay where it is. It's enough for me if
 new entries are distributed across the new number of shards. Restarting
 solr is fine.


Re: Problem with relating values in two multi value fields

2012-10-08 Thread Toke Eskildsen
On Mon, 2012-10-08 at 08:42 +0200, Torben Honigbaum wrote:
 sorry, my fault. This was one of my first ideas. My problem is that
 I've 1.000.000 documents, each with about 20 attributes. Additionally
 each document has between 200 and 500 option-value pairs. So if I
 denormalize the data, it means that I've 1.000.000 x 350 ((200 + 500) /
 2) = 350.000.000 documents, each with 20 attributes.

If you have a few hundred or fewer distinct primary attributes (the A, B,
C's in your example), you could create a new field for each of them:

<doc>
  <str name="id">3</str>
  <str name="options">A B C D</str>
  <str name="option_A">200</str>
  <str name="option_B">400</str>
  <str name="option_C">240</str>
  <str name="option_D">310</str>
  ...
</doc>

Query for options:A and facet on field option_A to get facets for
the specific field.
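
For instance, a hypothetical request:

q=options:A&facet=true&facet.field=option_A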

This normalization does increase the index size due to duplicated
secondary values between the option-fields, but since our assumption is
a relatively small amount of primary values, it should not be too much.


Alternatively, if you have many distinct primary attributes, index the
pairs as Jack suggests:
<doc>
  <str name="id">3</str>
  <str name="options">A B C D</str>
  <str name="option">A=200</str>
  <str name="option">B=400</str>
  <str name="option">C=240</str>
  <str name="option">D=310</str>
  ...
</doc>

Query for options:A and facet on field option with
facet.prefix=A=. Your result will be A=200 (2), A=450 (1)... so you'll
have to strip the "whatever=" part before display.
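
For instance, a hypothetical request (the prefix URL-encoded):

q=options:A&facet=true&facet.field=option&facet.prefix=A%3D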

This normalization is potentially a lot heavier than the previous one,
as we have distinct_primaries * distinct_secondaries distinct values. 

Worst case, where every document only contains distinct combinations of
primary/secondary, we have 350M distinct option-values, which is quite
heavy for a single box to facet on. Whether that is better or worse than
350M documents, I don't know.

 Is denormalization the only way to handle this problem?

What you are trying to do does look quite a lot like hierarchical
faceting, which Solr does not support directly. But even if you apply
one of the experimental patches, it does not mitigate the potential
combinatorial explosion of your primary & secondary values.

So that leaves the question: How many distinct combinations of primary
and secondary values do you have?

Regards,
Toke Eskildsen



Re: add shard to index

2012-10-08 Thread Rafał Kuć
Hello!

Radim there is a JIRA issue -
https://issues.apache.org/jira/browse/SOLR-3755. It is work in
progress, but once finished Solr will enable you to add additional
shards on a live collection and split the ones that were already
created.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 Given that Solr does not support distributed IDF, adding a shard without
 balancing the number of documents could seriously skew your scoring. If
 you are okay with that, then the next question is what happens if you
 download the clusterstate.json from ZooKeeper, and add another entry,
 along the lines of shard3:{}, then upload it again, what would happen
 then?

 My theory is that the next host you start up would become the first node
 of shard3. Worth a try (unless someone more knowledgeable tells us
 otherwise!)

 Upayavira

 On Mon, Oct 8, 2012, at 01:35 AM, Radim Kolar wrote:
 i am reading this: http://wiki.apache.org/solr/SolrCloud section 
 Re-sizing a Cluster
 
 Is it possible to add a shard to an existing index? I do not need the
 data redistributed; it can stay where it is. It's enough for me if
 new entries are distributed across the new number of shards. Restarting
 solr is fine.



Reloading ExternalFileField blocks Solr

2012-10-08 Thread Martin Koch
Hi List

We're using Solr-4.0.0-Beta with a 7M document index running on a single
host with 16 shards. We'd like to use an ExternalFileField to hold a value
that changes often. However, we've discovered that the file is apparently
re-read by every shard/core on *every commit*; the index is unresponsive in
this period (around 20s on the host we're running on). This is unacceptable
for our needs. In the future, we'd like to add other values as
ExternalFileFields, and this will make the problem worse.

It would be better if the external file were instead read in the
background, updating the previously read values for each shard as they
are read in.

I guess a change in the ExternalFileField code would be required to achieve
this, but I have no experience here, so suggestions are very welcome.
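
For reference, this is the kind of definition in play - a sketch with assumed
field and type names; the values live in external_<fieldname> files under the
index data directory:

<fieldType name="externalPopularity" class="solr.ExternalFileField"
    keyField="id" defVal="0" valType="pfloat"/>
<field name="popularity" type="externalPopularity"/>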

Thanks,
/Martin Koch - Issuu - Senior Systems Architect.


Solr 4 spatial search - point intersects polygon

2012-10-08 Thread Jorge Suja
Hi everyone, 

I've been playing around with the new spatial search functionalities
included in the newer versions of solr (solr 4.1 and solr trunk 5.0), and
I've found something strange when I try to find a point inside a polygon
(particularly inside a square).

You can reproduce this problem using the spatial-solr-sandbox project that
has the following config for the fields:

[...]
<fieldType name="geohash" class="solr.SpatialRecursivePrefixTreeFieldType"
    units="degrees" />
[...]
<field name="geohash" type="geohash" indexed="true" stored="true"
    multiValued="false" />
[...]

I'm trying to find the following document:

<doc>
<str name="id">G292223</str>
<str name="name">Dubai</str>
<str name="geohash">55.28 25.252220</str>
</doc>
I want to test if this point is located inside a polygon, so I'm using the
following query:

q=geohash:Intersects(POLYGON((55.18 25.352220,55.38
25.352220,55.38 25.152220,55.18 25.152220,55.18 25.352220)))

As you can see, it's a small square that contains the point described
before. I get some results, but that document is not there, and the ones
returned are wrong since they are not even inside the square.

<result name="response" numFound="8" start="0">
<doc>
<str name="id">G1809498</str>
<str name="name">Guilin</str>
<str name="geohash">110.286390 25.281940</str>
</doc>

[...]

However, if I change the shape of the square a little bit (just moved one
corner slightly), it returns the result as expected:

q=geohash:Intersects(POLYGON((55.18 25.352220,55.48
25.352220,55.38 25.152220,55.18 25.152220,55.18 25.352220)))

Now it returns a single result and it's OK

<result name="response" numFound="1" start="0">
<doc>
<str name="id">G292223</str>
<str name="name">Dubai</str>
<str name="geohash">55.28 25.252220</str>
</doc>
</result>


If I use a bbox with the same size and position as the first square, it
returns the document correctly.

q=geohash:Intersects(55.18 25.152220 55.38 25.352220)

<result name="response" numFound="1" start="0">
<doc>
<str name="id">G292223</str>
<str name="name">Dubai</str>
<str name="geohash">55.28 25.252220</str>
</doc>
</result>

If you draw another polygon, such as a triangle, it works well too.

I've tested this against different points and it's always the same: it seems
that if you draw a straight square (or rectangle),
it can't find the point inside it, and it returns wrong results.

Am I doing anything wrong?

Thanks in advance

Jorge





I don't understand

2012-10-08 Thread Tolga

Hi,

There are two servers with the same configuration. I crawl the same URL. 
One of them is giving the following error:


Caused by: org.apache.solr.common.SolrException: ERROR: 
[doc=http://bilgisayarciniz.org/] multiple values encountered for non 
multiValued copy field text: bilgisayarciniz web hizmetleri


I really fail to understand. Why does this happen?

Regards,

PS: Neither server has multiValued="true" for the title field.


Re: I don't understand

2012-10-08 Thread Jan Høydahl
Hi,

Please describe your environment better:

* How do you crawl, using which crawler?
* To which RequestHandler do you send the docs?
* Which version of Solr
* Can you share your schema and other relevant config with us?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 8 Oct 2012 at 12:11, Tolga to...@ozses.net wrote:

 Hi,
 
 There are two servers with the same configuration. I crawl the same URL. One 
 of them is giving the following error:
 
 Caused by: org.apache.solr.common.SolrException: ERROR: 
 [doc=http://bilgisayarciniz.org/] multiple values encountered for non 
 multiValued copy field text: bilgisayarciniz web hizmetleri
 
 I really fail to understand. Why does this happen?
 
 Regards,
 
 PS: Neither server has multiValued=true for title field.



solr1.4 code Example

2012-10-08 Thread Sujatha Arun
hi,

I am unable to unzip the 5883_Code.zip file for solr 1.4 from the packtpub
site. I get the error message:

  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.


any pointers?

Regards
Sujatha


Re: I don't understand

2012-10-08 Thread Tolga

Hi Jan, thanks for your fast reply. Below is the information you requested:

* I use nutch, using the command nutch crawl urls -dir crawl-$(date 
+%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 10 -topN 5

* What do you mean which RequestHandler? How can I find that out?
* 3.6.1
* Both schemas are below:

<schema name="nutch" version="1.4">
    <types>
        <fieldType name="string" class="solr.StrField"
            sortMissingLast="true" omitNorms="true"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
        <fieldType name="url" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"/>
            </analyzer>
        </fieldType>
    </types>
    <fields>
        <field name="id" type="string" stored="true" indexed="true"/>

        <!-- core fields -->
        <field name="segment" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->
        <field name="host" type="string" stored="false" indexed="true"/>
        <field name="url" type="url" stored="true" indexed="true"
            required="true"/>
        <field name="content" type="text" stored="false" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="true"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="tstamp" type="date" stored="true" indexed="false"/>

        <!-- fields for index-anchor plugin -->
        <field name="anchor" type="string" stored="true" indexed="true"
            multiValued="true"/>

        <!-- fields for index-more plugin -->
        <field name="type" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="contentLength" type="long" stored="true"
            indexed="false"/>
        <field name="lastModified" type="date" stored="true"
            indexed="false"/>
        <field name="date" type="date" stored="true" indexed="true"/>

        <!-- fields for languageidentifier plugin -->
        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for subcollection plugin -->
        <field name="subcollection" type="string" stored="true"
            indexed="true" multiValued="true"/>

        <!-- fields for feed plugin (tag is also used by
            microformats-reltag) -->
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="feed" type="string" stored="true" indexed="true"/>
        <field name="publishedDate" type="date" stored="true"
            indexed="true"/>
        <field name="updatedDate" type="date" stored="true"
            indexed="true"/>

        <!-- fields for creativecommons plugin -->
        <field name="cc" type="string" stored="true" indexed="true"
            multiValued="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
</schema>

<schema name="nutch" version="1.4">
    <types>
        <fieldType name="string" class="solr.StrField"
            sortMissingLast="true" omitNorms="true"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"


Re: QueryElevationComponent not working in Distributed Search

2012-10-08 Thread Erick Erickson
You shouldn't try copying files around; your comment that you
"tried replacing QueryElevationComponent.java" leads me to
think you tried that. Instead, I notice that there's a SOLR-2949.3x
patch. If you want to try that, you can apply the patch to the 3.x code
line. See "working with patches" at
http://wiki.apache.org/solr/HowToContribute

WARNING: I have no clue whether that patch will apply cleanly, nor
whether it will actually fix distrib QEV. It doesn't look like it was
applied to 3.x. Also, looking at the comments it's not clear that
it _would_ work; see Mark's last comment.

What kinds of errors do you get with 4.0? It's true that a bunch
has changed, but I really don't see any other reliable way to
get distributed QEV working other than either using 4.0 or
patching 3.6... and if you do the latter you're kind of on your own.

Best
Erick

On Mon, Oct 8, 2012 at 2:21 AM, vasokan vaso...@andrew.cmu.edu wrote:
 Hi Erick,

 I cannot migrate to 4.0-ALPHA or 4.0-BETA because of dependencies in the
 indexing configuration in solrconfig.xml and schema.xml.

 When I try to use the 4.0 version, a series of errors pops up.  Also
 I cannot change the entire set of configuration files available to me.

 So I tried applying the diffs that were available as attachments on the
 issue I have mentioned below:
 https://issues.apache.org/jira/browse/SOLR-2949 .  But I was still facing
 some issues, so I tried replacing QueryElevationComponent.java with the one
 from the newer versions.  But I still do not find the elevation
 functionality to be working for distributed search.

 Can you please let me know if there is any means by which I can include this
 fix without migrating to newer versions.

 Thank you,
 Vinoth





Re: add shard to index

2012-10-08 Thread Erick Erickson
Right, but even if that worked, you'd then get docs being assigned
to the wrong shard. The shard assignment would be something
like (hash(id) mod 3). So a document currently on shard 0 would be
indexed next time, perhaps, on shard 2, leaving two live docs
in your system with the same ID. Bad Things would happen
then...

I believe that currently your only real option is to re-index from
scratch when you add more shards.

I was thinking about this at one point. Unless the guys work
some magic, it will be an expensive process. Not as
expensive as re-indexing for sure, but consider 12
documents in 3 shards.

shard1 - 1, 4, 7, 10
shard2 - 2, 5, 8, 11
shard3 - 3, 6, 9, 12

Now you add a shard and the docs are re-distributed
shard1 - 1, 5, 9
shard2 - 2, 6, 10
shard3 - 3, 7, 11
shard4 - 4, 8, 12

In this simple case, only 3 out of your 12 documents stayed on the
same shard! All the rest had to be moved.

Then the indexes have to be distributed across all replicas, then

Now, there won't have to be any analysis done. You won't have to
reconstruct all of the documents from your system-of-record. You
won't have to do a _ton_ of work that you originally had to do. This should
be enormously faster than re-indexing. But it still won't be
something to casually do on a live system under load <g>.

Disclaimer: I really may be talking through my hat here, but this _sounds_
right.

FWIW
Erick

On Mon, Oct 8, 2012 at 4:33 AM, Upayavira u...@odoko.co.uk wrote:
 Given that Solr does not support distributed IDF, adding a shard without
 balancing the number of documents could seriously skew your scoring. If
 you are okay with that, then the next question is what happens if you
 download the clusterstate.json from ZooKeeper, and add another entry,
 along the lines of shard3:{}, then upload it again, what would happen
 then?

 My theory is that the next host you start up would become the first node
 of shard3. Worth a try (unless someone more knowledgeable tells us
 otherwise!)

 Upayavira

 On Mon, Oct 8, 2012, at 01:35 AM, Radim Kolar wrote:
 i am reading this: http://wiki.apache.org/solr/SolrCloud section
 Re-sizing a Cluster

 Is it possible to add a shard to an existing index? I do not need the
 data redistributed; it can stay where it is. It's enough for me if
 new entries are distributed across the new number of shards. Restarting
 solr is fine.


Re: I don't understand

2012-10-08 Thread Erick Erickson
Well, the schemas are different. The first schema doesn't have a
copyField directive anywhere in it and the second one does.

And the copyField is in a non-standard place anyway; it's
usually outside the </fields> tag. Kind of surprising it works
at all there, now I've got to go figure out why <g>.

Anyway, apparently you've edited the schemas inconsistently,
and this copyField will never work unless the text field is multiValued...
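
For reference, the usual pattern looks something like this (a sketch; the
source fields are assumed from the nutch schema above):

<field name="text" type="text" stored="false" indexed="true"
    multiValued="true"/>
...
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>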

Best
Erick

On Mon, Oct 8, 2012 at 7:11 AM, Tolga to...@ozses.net wrote:
 Hi Jan, thanks for your fast reply. Below is the information you requested:

 * I use nutch, using the command nutch crawl urls -dir crawl-$(date
 +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 10 -topN 5
 * What do you mean which RequestHandler? How can I find that out?
 * 3.6.1
 * Both schemas are below:

 <schema name="nutch" version="1.4">
 <types>
 <fieldType name="string" class="solr.StrField"
     sortMissingLast="true" omitNorms="true"/>
 <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
     omitNorms="true" positionIncrementGap="0"/>
 <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
     omitNorms="true" positionIncrementGap="0"/>
 <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
     omitNorms="true" positionIncrementGap="0"/>

 <fieldType name="text" class="solr.TextField"
     positionIncrementGap="100">
 <analyzer>
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.StopFilterFactory"
     ignoreCase="true" words="stopwords.txt"/>
 <filter class="solr.WordDelimiterFilterFactory"
     generateWordParts="1" generateNumberParts="1"
     catenateWords="1" catenateNumbers="1" catenateAll="0"
     splitOnCaseChange="1"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.EnglishPorterFilterFactory"
     protected="protwords.txt"/>
 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 </fieldType>
 <fieldType name="url" class="solr.TextField"
     positionIncrementGap="100">
 <analyzer>
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.WordDelimiterFilterFactory"
     generateWordParts="1" generateNumberParts="1"/>
 </analyzer>
 </fieldType>
 </types>
 <fields>
 <field name="id" type="string" stored="true" indexed="true"/>

 <!-- core fields -->
 <field name="segment" type="string" stored="true" indexed="false"/>
 <field name="digest" type="string" stored="true" indexed="false"/>
 <field name="boost" type="float" stored="true" indexed="false"/>

 <!-- fields for index-basic plugin -->
 <field name="host" type="string" stored="false" indexed="true"/>
 <field name="url" type="url" stored="true" indexed="true"
     required="true"/>
 <field name="content" type="text" stored="false" indexed="true"/>
 <field name="title" type="text" stored="true" indexed="true"/>
 <field name="cache" type="string" stored="true" indexed="false"/>
 <field name="tstamp" type="date" stored="true" indexed="false"/>

 <!-- fields for index-anchor plugin -->
 <field name="anchor" type="string" stored="true" indexed="true"
     multiValued="true"/>

 <!-- fields for index-more plugin -->
 <field name="type" type="string" stored="true" indexed="true"
     multiValued="true"/>
 <field name="contentLength" type="long" stored="true"
     indexed="false"/>
 <field name="lastModified" type="date" stored="true"
     indexed="false"/>
 <field name="date" type="date" stored="true" indexed="true"/>

 <!-- fields for languageidentifier plugin -->
 <field name="lang" type="string" stored="true" indexed="true"/>

 <!-- fields for subcollection plugin -->
 <field name="subcollection" type="string" stored="true"
     indexed="true" multiValued="true"/>

 <!-- fields for feed plugin (tag is also used by
     microformats-reltag) -->
 <field name="author" type="string" stored="true" indexed="true"/>
 <field name="tag" type="string" stored="true" indexed="true"
     multiValued="true"/>
 <field name="feed" type="string" stored="true" indexed="true"/>
 <field name="publishedDate" type="date" stored="true"
     indexed="true"/>
 <field name="updatedDate" type="date" stored="true"
     indexed="true"/>

 <!-- fields for creativecommons plugin -->
 <field name="cc" type="string" stored="true" indexed="true"
     multiValued="true"/>
 </fields>
 <uniqueKey>id</uniqueKey>
 <defaultSearchField>content</defaultSearchField>
 <solrQueryParser defaultOperator="OR"/>
 </schema>

 <schema name="nutch" version="1.4">
 <types>
 <fieldType name="string" class="solr.StrField"
     sortMissingLast="true" omitNorms="true"/>
 <fieldType

Re: solr 1.4.1 - 3.6.1; SOLR-758

2012-10-08 Thread Jack Krupansky
The Extended Dismax query parser (edismax) mostly obsoletes Dismax except 
in the sense that some apps prefer the restricted syntax of Dismax:


http://wiki.apache.org/solr/ExtendedDisMax

-- Jack Krupansky

-Original Message- 
From: Patrick Kirsch

Sent: Monday, October 08, 2012 2:32 AM
To: solr-user@lucene.apache.org
Subject: solr 1.4.1 - 3.6.1; SOLR-758

Regarding https://issues.apache.org/jira/browse/SOLR-758 (Enhance
DisMaxQParserPlugin to support full-Solr syntax and to support alternate
escaping strategies.)

I'm updating from solr 1.4.1 to 3.6.1 (I'm aware that it is not beautiful).
After applying the attached patches to 3.6.1 I'm experiencing this problem:
 - SEVERE: org.apache.solr.common.SolrException: Error Instantiating
QParserPlugin, org.apache.solr.search.AdvancedQParserPlugin is not a
org.apache.solr.search.QParserPlugin
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:421)
at
org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:441)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1612)
[...]
   These patches seem to be no longer valid.

Which leads me to the more experienced users here:

- Although not directly mentioned in
https://issues.apache.org/jira/browse/SOLR-758, is there any other (new)
QParser which obsoletes the DisMax?

- Furthermore I tried to make the patches apply (forward porting), but
always get the error Error Instantiating QParserPlugin,
org.apache.solr.search.AdvancedQParserPlugin is not a
org.apache.solr.search.QParserPlugin, although the class dependency is
linear:

./core/src/java/org/apache/solr/search/AdvancedQParserPlugin.java:
[...]
public class AdvancedQParserPlugin extends DisMaxQParserPlugin {
[...]

./core/src/java/org/apache/solr/search/DisMaxQParserPlugin.java:
[...]
public class DisMaxQParserPlugin extends QParserPlugin {
[...]


Thanks,
 Patrick 



Re: solr1.4 code Example

2012-10-08 Thread Toke Eskildsen
On Mon, 2012-10-08 at 13:08 +0200, Sujatha Arun wrote:
 I am unable to unzip the 5883_Code.zip file for solr 1.4 from the packtpub
 site. I get the error message:
 
   End-of-central-directory signature not found. [...]

It is a corrupt ZIP-file. I'm guessing you got it from
http://www.packtpub.com/files/code/5883_Code.zip
I tried downloading the archive and it was indeed corrupt. You can read
some of the files by using jar for unpacking: 'jar xvf 5883_Code.zip'.

You'll need to contact packtpub to get them to fix it properly. A quick
search indicates that they've had problems before:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201005.mbox/%
3c4bf66e8f.4070...@shoptimax.de%3E




Re: long query response time in shards search

2012-10-08 Thread Jack Krupansky
What release of Solr are you on? Solr 4.0 has improved wildcard support (FST 
automatons.) But even then, such heavy use of wildcards may be 
problematic.


If you intend to use wildcards in that manner, you might want to create a
custom stemming filter that does that stemming at index time (and query
time) so you don't need to do such heavy wildcarding.


Do these complex queries always run slow (the first time each is tried) or 
just sometimes or some of the queries? (Solr will cache the results of a 
given query so that the next time the same results can be returned without 
re-querying the index.)


-- Jack Krupansky

-Original Message- 
From: Jason

Sent: Monday, October 08, 2012 12:26 AM
To: solr-user@lucene.apache.org
Subject: Re: long query response time in shards search

Hi, Otis
Thanks your reply.

yes, all cores are in same server.

* what do you consider too long?
even a plain id (key) query response can take too long, while
almost all id (key) query responses take under 10ms.
example
-
2012-10-05 16:38:32,078 [http-8080-exec-3979] INFO
org.apache.solr.core.SolrCore - [usp00] webapp=/solr_us path=/select
params={rows=1&shards=usp00,usp01,usp02,usp03,usp04,usp05&fl=cin,score&start=0&q=id:(US200840881A1)}
status=0 QTime=164085

* how many queries are running concurrently?
approximately 5 to 10 queries.
but the queries are very complex - complex meaning many terms, including wildcards.

* can you show some example queries?
example
-
q=(angiogenesis*+OR+neovascula*+OR+(vessel*+OR+vascula*)+N+(proliferat*+OR+growth*))+5N+(inhibit*+OR+prevent*+OR+treat*+OR+thera*+OR+medic*)+AND+(ibd+OR+crohn*+OR+behcet*+OR+inflammat*+2N+(bowel*+OR+intestin*+OR+colitis*+OR+enteritis*+OR+gastroenteritis*)+OR+ulcerative*+W+colitis*+OR+intestin*+W+behcet*+OR+macula*+W+degenerat*+OR+amd+OR+armd)

* how many CPU cores does your server have?
32 cores (server has 4 CPU and 8 cores in each CPU.)
128G RAM

Also, the total index for all cores includes 15 million docs and its size is 400G.

Are the complex queries the problem?






search by multiple 'LIKE' operator connected with 'AND' operator

2012-10-08 Thread gremlin
Hi.

I have trouble with my SOLR configuration. I just want to implement a
configuration that would operate on the index like the MySQL query: field_name
LIKE '%foo%' AND field_name LIKE '%bar%'.

So, for example, I have 4 indexed titles:
'Kathy Lee',
'Kathy Norris',
'Kathy Davies',
'Kathy Bird'

and with my query "Kathy Norris" I receive all these indexed titles. A quoted
query gives no results at all.

the latest field definition that I've tried (very simple, just for tests):
<fieldType name="text_ngram" class="solr.TextField" indexed="true"
    stored="true" multiValued="true" positionIncrementGap="100"
    autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2"
        maxGramSize="100"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>

I've also tried a field with ShingleFilterFactory, and ShingleFilterFactory
combined with NGrams. But no results.

Btw, I have the default solr configuration for the drupal search_api_solr
module, just modified with a new request handler.

Trying different configurations did not give the expected results.

Thanks for help.





Re: Storing queries in Solr

2012-10-08 Thread Jorge Luis Betancourt Gonzalez
Thanks for the quick response. I'm trying to build a query suggester. I find it
odd that, this being a very common need, solr doesn't provide any built-in
mechanism for query suggestions, but implementing the other components isn't so
hard either.

Greetings!

On Oct 8, 2012, at 3:38 AM, Upayavira wrote:

 Solr has a small query cache, but this does not hold queries for any
 length of time, so won't suit your purpose.
 
 The LucidWorks Search product has (I believe) a click tracking feature,
 but that is about boosting documents that are clicked on, not specific
 search terms. Parsing the Solr log, or pushing query terms to a
 different core/index would really be the only way to achieve what you're
 suggesting, as far as I am aware.
 
 Processing logs would be preferable anyhow, as you don't really want to
 be triggering an index write during each query (assuming you have more
 queries than updates to your main index), and also if this is for
 building a suggester index, then it is unlikely to need updating that
 regularly - every hour or every day should be more than sufficient. You
 could write a SearchComponent that logs queries in another format,
 should the existing log format not be sufficient for you.
 
 Upayavira
 
 On Mon, Oct 8, 2012, at 01:24 AM, Jorge Luis Betancourt Gonzalez wrote:
 Hi!
 
 I was wondering if there are any built-in mechanism that allow me to
 store the queries made to a solr server inside the index itself. I know
 that the suggester module exist, but as far as I know it only works for
 terms existing in the index, and not with queries. I remember reading
 about using some external program to parse the solr log and pushing the
 queries or any other interesting data into the index, is this the only
 way of accomplish this?
 
 Greetings!




Wildcards and fuzzy/phonetic query

2012-10-08 Thread Hågen Pihlstrøm Hasle
Hi!

I'm quite new to Solr, I was recently asked to help out on a project where the 
previous Solr-person quit quite suddenly.  I've noticed that some of our 
searches don't return the expected result, and I'm hoping you guys can help me 
out.

We've indexed a lot of names, and would like to search for a person in our 
system using these names.  We previously used Oracle Text for this, and we 
experience that Solr is much faster.  So far so good! :)  But when we try to 
use wildcards things start to go wrong.

We're using Solr 3.4, and I see that some of our problems are solved in 3.6.  
Ref SOLR-2438:
https://issues.apache.org/jira/browse/SOLR-2438

But we would also like to be able to combine wildcards with fuzzy searches, and 
wildcards with a phonetic filter.  I don't see anything about phonetic filters 
in SOLR-2438 or SOLR-2921.  (https://issues.apache.org/jira/browse/SOLR-2921)  
Is it possible to make the phonetic filters MultiTermAware?

Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in 
Solr..) and find both christian and kristian.  As far as I understand, this is 
not possible in Solr, WildcardQuery and FuzzyQuery cannot be combined.  Is this 
correct, or have I misunderstood anything?  Are there any workarounds or 
filter-combinations I can use to achieve the same result?  I've seen people 
suggest using a boolean query to combine the two, but I don't really see how 
that would solve my chr*-problem.

As I mentioned earlier I'm quite new to this, so I apologize if what I'm asking 
about only shows my ignorance..


Regards, Hågen

Re: search by multiple 'LIKE' operator connected with 'AND' operator

2012-10-08 Thread Jack Krupansky
The PositionFilterFactory is probably preventing phrase queries from
working. What are you expecting it to do? It basically means "query as if all
the quoted terms occur at the same position."


SQL "LIKE" is comparable to a Lucene wildcard, but change the "%" to "*" and
"_" to "?".
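
So the SQL example above would become something like (field name taken from
the original post):

field_name:(*foo* AND *bar*)

Bear in mind that leading wildcards can be slow on a large index unless you
index with something like ReversedWildcardFilterFactory.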


-- Jack Krupansky

-Original Message- 
From: gremlin

Sent: Monday, October 08, 2012 10:47 AM
To: solr-user@lucene.apache.org
Subject: search by multiple 'LIKE' operator connected with 'AND' operator

Hi.

I have trouble with my SOLR configuration. I just want to implement a
configuration that would operate on the index like the MySQL query: field_name
LIKE '%foo%' AND field_name LIKE '%bar%'.

So, for example, I have 4 indexed titles:
'Kathy Lee',
'Kathy Norris',
'Kathy Davies',
'Kathy Bird'

and with my query "Kathy Norris" I receive all these indexed titles. A quoted
query gives no results at all.

the latest field definition that I've tried (very simple, just for tests):
<fieldType name="text_ngram" class="solr.TextField" indexed="true"
    stored="true" multiValued="true" positionIncrementGap="100"
    autoGeneratePhraseQueries="false">
 <analyzer type="index">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.NGramFilterFactory" minGramSize="2"
       maxGramSize="100"/>
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.PositionFilterFactory"/>
 </analyzer>
</fieldType>

I've also tried a field with ShingleFilterFactory, and ShingleFilterFactory
combined with NGrams. But no results.

Btw, I have the default solr configuration for the drupal search_api_solr
module, just modified with a new request handler.

Trying different configurations did not give the expected results.

Thanks for help.






Re: Storing queries in Solr

2012-10-08 Thread Gérard Dupont
Hi Jorge,

As far as I know, there isn't a built-in component to achieve such a function
in Solr (maybe in the latest 4.1, which I haven't explored in depth yet).
However, I've done it myself in the past using different approaches.

The first one is similar to Upayavira's suggestion and uses an independent
index where queries and clicks were stored in order to make popular-query
suggestions and/or document suggestions. My second implementation
used a dedicated field on the original documents' index to add the
terms of queries that led to a click on each particular document (i.e.
re-indexing the document with a new field), and used this field for boosted
terms and/or document suggestion. However, this latter solution is likely not
to scale very well, especially if your document index is very dynamic (my
particular case relied on an almost static document repository).

Finally, remember that exploiting queries and clicks may lead to private
data management issues. Since you're storing their queries, warn your users
appropriately.

br,

gdupont

On 8 October 2012 02:24, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
 wrote:

 Hi!

 I was wondering if there are any built-in mechanism that allow me to store
 the queries made to a solr server inside the index itself. I know that the
 suggester module exist, but as far as I know it only works for terms
 existing in the index, and not with queries. I remember reading about using
 some external program to parse the solr log and pushing the queries or any
 other interesting data into the index, is this the only way of accomplish
 this?

 Greetings!
 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS
 INFORMATICAS...
 CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION



-- 
Gérard Dupont
Information Processing Control and Cognition (IPCC)
CASSIDIAN - an EADS company

Document  Learning team - LITIS Laboratory


Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Jack Krupansky
A regular expression term may provide what you want, but not exactly. Maybe 
something like:


/(ch|k)r.*/

(No guarantee that will actually work.)

See:
http://lucene.apache.org/core/4_0_0-BETA/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches

And probably slower than desirable.

-- Jack Krupansky

-Original Message- 
From: Hågen Pihlstrøm Hasle

Sent: Monday, October 08, 2012 11:21 AM
To: solr-user@lucene.apache.org
Subject: Wildcards and fuzzy/phonetic query

Hi!

I'm quite new to Solr, I was recently asked to help out on a project where 
the previous Solr-person quit quite suddenly.  I've noticed that some of 
our searches don't return the expected result, and I'm hoping you guys can 
help me out.


We've indexed a lot of names, and would like to search for a person in our 
system using these names.  We previously used Oracle Text for this, and we 
experience that Solr is much faster.  So far so good! :)  But when we try to 
use wildcards things start to go wrong.


We're using Solr 3.4, and I see that some of our problems are solved in 3.6. 
Ref SOLR-2438:

https://issues.apache.org/jira/browse/SOLR-2438

But we would also like to be able to combine wildcards with fuzzy searches, 
and wildcards with a phonetic filter.  I don't see anything about phonetic 
filters in SOLR-2438 or SOLR-2921. 
(https://issues.apache.org/jira/browse/SOLR-2921)

Is it possible to make the phonetic filters MultiTermAware?

Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in 
Solr..) and find both christian and kristian.  As far as I understand, this 
is not possible in Solr, WildcardQuery and FuzzyQuery cannot be combined. 
Is this correct, or have I misunderstood anything?  Are there any 
workarounds or filter-combinations I can use to achieve the same result? 
I've seen people suggest using a boolean query to combine the two, but I 
don't really see how that would solve my chr*-problem.


As I mentioned earlier I'm quite new to this, so I apologize if what I'm 
asking about only shows my ignorance..



Regards, Hågen



Re: SolrJ - IOException

2012-10-08 Thread Briggs Thompson
I have also just ran into this a few times over the weekend in a newly
deployed system. We are running Solr 4.0 Beta (not using SolrCloud) and it
is hosted via AWS.

I have a RabbitMQ consumer that reads updates from a queue and posts
updates to Solr via SolrJ. There is quite a bit of error handling around
the indexing request, and even if Solr is not live the consumer application
successfully logs the exception and attempts to move along in the queue.
There are two consumer applications running at once, and at times they process
400 requests per minute. The high-volume times are not necessarily when this
problem occurs, though.

This exception is causing the entire application to hang - which is
surprising considering all SolrJ logic is wrapped with try/catches. Has
anyone found out more information regarding the possible keep alive bug?
Any insight is much appreciated.
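
For context, a minimal sketch of the kind of client setup involved (untested;
URL, timeouts and names assumed - SolrJ 4.0's HttpSolrServer). Setting
maxRetries to 0 keeps HttpClient from re-sending a streamed update after a
broken pipe, which is what produces the NonRepeatableRequestException below:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SolrClientFactory {
    public static HttpSolrServer create() {
        // URL is illustrative
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/coupon");
        server.setConnectionTimeout(5000); // ms to establish a connection
        server.setSoTimeout(30000);        // socket read timeout, ms
        server.setMaxRetries(0);           // don't retry non-repeatable POSTs
        return server;
    }
}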

Thanks,
Briggs Thompson


Oct 8, 2012 7:25:48 AM org.apache.http.impl.client.DefaultRequestDirector
tryExecute
INFO: I/O exception (java.net.SocketException) caught when processing
request: Broken pipe
Oct 8, 2012 7:25:48 AM org.apache.http.impl.client.DefaultRequestDirector
tryExecute
INFO: Retrying request
Oct 8, 2012 7:25:48 AM com..rabbitmq.worker.SolrWriter work
SEVERE: {id:4049703,datetime:2012-10-08 07:22:05}
IOException occured when talking to server at:
http://ec2-50-18-73-42.us-west-1.compute.amazonaws.com:8983/solr/coupon
server
org.apache.solr.client.solrj.SolrServerException: IOException occured when
talking to server at:
http://ec2-50-18-73-42.us-west-1.compute.amazonaws.com:8983/solr/coupon
server
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:362)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:69)
at org.apache.solr.client.solrj.SolrServer.addBeans(SolrServer.java:96)
at org.apache.solr.client.solrj.SolrServer.addBeans(SolrServer.java:79)
at com..solr.SolrIndexService.IndexCoupon(SolrIndexService.java:57)
at com..solr.SolrIndexService.Index(SolrIndexService.java:36)
at com..rabbitmq.worker.SolrWriter.work(SolrWriter.java:47)
at com..rabbitmq.job.Runner.run(Runner.java:84)
at com..rabbitmq.job.SolrConsumer.main(SolrConsumer.java:10)
Caused by: org.apache.http.client.ClientProtocolException
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:306)
... 10 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot
retry request with a non-repeatable request entity. The cause lists the
reason the original request failed.
at
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686)
at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:517)
at
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
... 13 more
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:147)
at
org.apache.http.impl.io.AbstractSessionOutputBuffer.flush(AbstractSessionOutputBuffer.java:154)
at
org.apache.http.impl.conn.LoggingSessionOutputBuffer.flush(LoggingSessionOutputBuffer.java:95)
at
org.apache.http.impl.io.ChunkedOutputStream.flush(ChunkedOutputStream.java:178)
at
org.apache.http.entity.mime.content.InputStreamBody.writeTo(InputStreamBody.java:72)
at
org.apache.http.entity.mime.HttpMultipart.doWriteTo(HttpMultipart.java:206)
at org.apache.http.entity.mime.HttpMultipart.writeTo(HttpMultipart.java:224)
at
org.apache.http.entity.mime.MultipartEntity.writeTo(MultipartEntity.java:183)
at
org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
at
org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
at
org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
at
org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
at
org.apache.http.impl.conn.AbstractClientConnAdapter.sendRequestEntity(AbstractClientConnAdapter.java:227)
at
org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
at
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at

Re: multivalued filed question (FieldCache error)

2012-10-08 Thread giovanni.bricc...@banzai.it

Thank you very much!

I've single-lined and removed the spaces from every fl field in my solrconfig, and now 
the app works fine


Giovanni

On 05/10/12 20:49, Chris Hostetter wrote:

: So extracting the attachment you will be able to track down what appens
:
: this is the query that shows the error, and below you can see the latest stack
: trace and the qt definition

Awesome -- exactly what we needed.

I've reproduced your problem, and verified that it has something to do
with the extra newlines which are confusing the parsing into not
recognizing store_slug as a simple field name.

The workaround is to modify the fl in your config to look like this...

  <str name="fl">sku,store_slug</str>

...or even like this...

  <str name="fl">   sku,  store_slug   </str>

...and then it should work fine.

having a newline immediately following the store_slug field name is
somehow confusing things, and making it not recognize store_slug as a
simple field name -- so then it tries to parse it as a function, and
since bare field names can also be used as functions that parsing works,
but then you get the error that the field can't be used as a function
since it's multivalued.

I'll try to get a fix for this into 4.0-FINAL...

https://issues.apache.org/jira/browse/SOLR-3916

-Hoss






Re: search by multiple 'LIKE' operator connected with 'AND' operator

2012-10-08 Thread gremlin
Disabling PositionFilterFactory totally breaks multiword search, and I
could find titles only by a single word.

A default solr.TextField field with WhitespaceTokenizerFactory returns only
complete-word matches, and enabling NGramFilterFactory for that field doesn't do
anything for me. If I use the field described, I can find by both words, but not
'both at a time', just 'one of any'.
A TextField field copied by copyField into an NGram field also doesn't help.

Maybe I'm missing something in the schema configuration?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/search-by-multiple-LIKE-operator-connected-with-AND-operator-tp4012536p4012554.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Erick Erickson
whether phonetic filters can be multiterm aware:

I'd be leery of this, as I basically don't quite know how that would
behave. You'd have to ensure that the algorithms changed the
first parts of the words uniformly, regardless of what followed. I'm
pretty sure that _some_ phonetic algorithms do not follow this
pattern, i.e. "eric" wouldn't necessarily have the same beginning
as "erickson". That said, some of the algorithms _may_ follow this
rule and might be OK candidates for being MultiTermAware

But, you don't need this in order to try it out. See the Expert Level
Schema Possibilities
at:
http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

You can define your own analysis chain for wildcards as part of your fieldType
definition and include whatever you want, whether or not it's
MultiTermAware and it
will be applied at query time. Use the <analyzer type="query"> entry
as a basis. _But_ you shouldn't include anything in this section that
produces more than one output per input token. Note, token, not
field. I.e. a really bad candidate for this section is
WordDelimiterFilterFactory
If you use the admin/analysis page (which you'll get to know intimately),
look at a type that has WordDelimiterFilterFactory in its chain, and
put in something
like erickErickson1234, you'll see what I mean. Make sure to check the
"verbose" box.

If you can determine that some of the phonetic algorithms _should_ be
MultiTermAware, please feel free to raise a JIRA and we can discuss... I suspect
it'll be on a case-by-case basis.
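To make that concrete, here is a hedged, untested schema.xml sketch of such a
fieldType (the field name and analyzer choices are illustrative;
DoubleMetaphoneFilterFactory is not MultiTermAware, so listing it in the
multiterm section is exactly the try-it-and-see experiment described above):

  <fieldType name="name_phonetic" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
    </analyzer>
    <analyzer type="multiterm">
      <!-- applied to wildcard/prefix terms at query time, MultiTermAware or not -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
    </analyzer>
  </fieldType>

Whether a prefix like chr* still matches after being phonetically encoded is
exactly the uniform-beginnings question above, so verify it on the analysis page
before trusting it.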

Best
Erick

On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle
haagenha...@gmail.com wrote:
 Hi!

 I'm quite new to Solr, I was recently asked to help out on a project where 
 the previous Solr-person quit quite suddenly.  I've noticed that some of 
 our searches don't return the expected result, and I'm hoping you guys can 
 help me out.

 We've indexed a lot of names, and would like to search for a person in our 
 system using these names.  We previously used Oracle Text for this, and we 
 experience that Solr is much faster.  So far so good! :)  But when we try to 
 use wildcards, things start to go wrong.

 We're using Solr 3.4, and I see that some of our problems are solved in 3.6.  
 Ref SOLR-2438:
 https://issues.apache.org/jira/browse/SOLR-2438

 But we would also like to be able to combine wildcards with fuzzy searches, 
 and wildcards with a phonetic filter.  I don't see anything about phonetic 
 filters in SOLR-2438 or SOLR-2921.  
 (https://issues.apache.org/jira/browse/SOLR-2921)
 Is it possible to make the phonetic filters MultiTermAware?

 Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in 
 Solr..) and find both christian and kristian.  As far as I understand, this 
 is not possible in Solr, WildcardQuery and FuzzyQuery cannot be combined.  Is 
 this correct, or have I misunderstood anything?  Are there any workarounds or 
 filter-combinations I can use to achieve the same result?  I've seen people 
 suggest using a boolean query to combine the two, but I don't really see how 
 that would solve my chr*-problem.

 As I mentioned earlier I'm quite new to this, so I apologize if what I'm 
 asking about only shows my ignorance..


 Regards, Hågen


Re: SolrJ - IOException

2012-10-08 Thread Briggs Thompson
Also note there were no exceptions in the actual Solr log, only on the
SolrJ side.

Thanks,
Briggs
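For what it's worth, a minimal SolrJ sketch (4.0-era API; the URL and timeout
values are illustrative) that at least turns a dead keep-alive socket into an
exception the consumer can catch, instead of a read that blocks forever:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SolrClientFactory {
    public static HttpSolrServer create(String url) {
        HttpSolrServer server = new HttpSolrServer(url);
        server.setConnectionTimeout(5000); // ms to establish the TCP connection
        server.setSoTimeout(30000);        // ms to wait on a read; guards against hangs
        return server;
    }
}

This doesn't address the NonRepeatableRequestException itself, but it bounds how
long a single request can stall the worker thread.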

On Mon, Oct 8, 2012 at 10:45 AM, Briggs Thompson 
w.briggs.thomp...@gmail.com wrote:

 I have also just run into this a few times over the weekend in a newly
 deployed system. We are running Solr 4.0 Beta (not using SolrCloud) and it
 is hosted via AWS.

 I have a RabbitMQ consumer that reads updates from a queue and posts
 updates to Solr via SolrJ. There is quite a bit of error handling around
 the indexing request, and even if Solr is not live the consumer application
 successfully logs the exception and attempts to move along in the queue.
 There are two consumer applications running at once, and at times they process
 400 requests per minute. The high-volume times are not necessarily when this
 problem occurs, though.

 This exception is causing the entire application to hang - which is
 surprising considering all SolrJ logic is wrapped with try/catches. Has
 anyone found out more information regarding the possible keep alive bug?
 Any insight is much appreciated.

 Thanks,
 Briggs Thompson


 Oct 8, 2012 7:25:48 AM org.apache.http.impl.client.DefaultRequestDirector
 tryExecute
 INFO: I/O exception (java.net.SocketException) caught when processing
 request: Broken pipe
 Oct 8, 2012 7:25:48 AM org.apache.http.impl.client.DefaultRequestDirector
 tryExecute
 INFO: Retrying request
 Oct 8, 2012 7:25:48 AM com..rabbitmq.worker.SolrWriter work
 SEVERE: {id:4049703,datetime:2012-10-08 07:22:05}
 IOException occured when talking to server at: 
 http://ec2-50-18-73-42.us-west-1.compute.amazonaws.com:8983/solr/coupon
 server
 org.apache.solr.client.solrj.SolrServerException: IOException occured when
 talking to server at: 
 http://ec2-50-18-73-42.us-west-1.compute.amazonaws.com:8983/solr/coupon
 server
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:362)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
 at
 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
 at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:69)
 at org.apache.solr.client.solrj.SolrServer.addBeans(SolrServer.java:96)
 at org.apache.solr.client.solrj.SolrServer.addBeans(SolrServer.java:79)
 at com..solr.SolrIndexService.IndexCoupon(SolrIndexService.java:57)
 at com..solr.SolrIndexService.Index(SolrIndexService.java:36)
 at com..rabbitmq.worker.SolrWriter.work(SolrWriter.java:47)
 at com..rabbitmq.job.Runner.run(Runner.java:84)
 at com..rabbitmq.job.SolrConsumer.main(SolrConsumer.java:10)
 Caused by: org.apache.http.client.ClientProtocolException
 at
 org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
 at
 org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
 at
 org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:306)
 ... 10 more
 Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot
 retry request with a non-repeatable request entity. The cause lists the
 reason the original request failed.
 at
 org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686)
 at
 org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:517)
 at
 org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
 ... 13 more
 Caused by: java.net.SocketException: Broken pipe
 at java.net.SocketOutputStream.socketWrite0(Native Method)
 at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
 at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
 at
 org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:147)
 at
 org.apache.http.impl.io.AbstractSessionOutputBuffer.flush(AbstractSessionOutputBuffer.java:154)
 at
 org.apache.http.impl.conn.LoggingSessionOutputBuffer.flush(LoggingSessionOutputBuffer.java:95)
 at
 org.apache.http.impl.io.ChunkedOutputStream.flush(ChunkedOutputStream.java:178)
 at
 org.apache.http.entity.mime.content.InputStreamBody.writeTo(InputStreamBody.java:72)
 at
 org.apache.http.entity.mime.HttpMultipart.doWriteTo(HttpMultipart.java:206)
 at
 org.apache.http.entity.mime.HttpMultipart.writeTo(HttpMultipart.java:224)
 at
 org.apache.http.entity.mime.MultipartEntity.writeTo(MultipartEntity.java:183)
 at
 org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
 at
 org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
 at
 org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
 at
 org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
 at
 

Re: Problem with relating values in two multi value fields

2012-10-08 Thread Mikhail Khludnev
Toke,
You are absolutely right, a concatenated term is a possible solution. I
found faceting quite complicated in this case, but it was a hot fix
which we delivered to production.

Torben,
This problem arises quite often. Besides the two approaches discussed
there, it is also possible to use SpanQueries and TermPositions - you can
check our experience here:
http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html
http://vimeo.com/album/2012142/video/33817062
Our current way is BlockJoin which is really performant in case of batched
updates: http://blog.griddynamics.com/2012/08/block-join-query-performs.html.
The bad news is that there is no open facet component for block join. We
have code, but we are not ready to share it yet.

On Mon, Oct 8, 2012 at 12:44 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote:

 On Mon, 2012-10-08 at 08:42 +0200, Torben Honigbaum wrote:
  sorry, my fault. This was one of my first ideas. My problem is, that
  I've 1.000.000 documents, each with about 20 attributes. Additionally
  each document has between 200 and 500 option-value pairs. So if I
  denormalize the data, it means that I've 1.000.000 x 350 (200 + 500 /
  2) = 350.000.000 documents, each with 20 attributes.

 If you have a few hundred or less distinct primary attributes (the A, B,
 C's in your example), you could create a new field for each of them:

 <doc>
   <str name="id">3</str>
   <str name="options">A B C D</str>
   <str name="option_A">200</str>
   <str name="option_B">400</str>
   <str name="option_C">240</str>
   <str name="option_D">310</str>
   ...
   ...
 </doc>

 Query for options:A and facet on field option_A to get facets for
 the specific field.

 This normalization does increase the index size due to duplicated
 secondary values between the option-fields, but since our assumption is
 a relatively small amount of primary values, it should not be too much.


 Alternatively, if you have many distinct primary attributes, index the
 pairs as Jack suggests:
 <doc>
   <str name="id">3</str>
   <str name="options">A B C D</str>
   <str name="option">A=200</str>
   <str name="option">B=400</str>
   <str name="option">C=240</str>
   <str name="option">D=310</str>
   ...
   ...
 </doc>

 Query for options:A and facet on the field option with
 facet.prefix=A=. Your result will be A=200 (2), A=450 (1)... so you'll
 have to strip the leading "A=" before display.
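
 A hedged example of that query (host, core and parameter values illustrative):

   http://localhost:8983/solr/select?q=options:A&facet=true&facet.field=option&facet.prefix=A%3D&facet.mincount=1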

 This normalization is potentially a lot heavier than the previous one,
 as we have distinct_primaries * distinct_secondaries distinct values.

 Worst case, where every document only contains distinct combinations of
 primary/secondary, we have 350M distinct option-values, which is quite
 heavy for a single box to facet on. Whether that is better or worse that
 350M documents, I don't know.

  Is denormalization the only way to handle this problem? I

 What you are trying to do does look quite a lot like hierarchical
 faceting, which Solr does not support directly. But even if you apply
 one of the experimental patches, it does not mitigate the potential
 combinatorial explosion of your primary & secondary values.

 So that leaves the question: How many distinct combinations of primary
 and secondary values do you have?

 Regards,
 Toke Eskildsen




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Reloading ExternalFileField blocks Solr

2012-10-08 Thread Mikhail Khludnev
Martin,

Can you tell me what's the content of that field, and how it should affect
search result?

On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote:

 Hi List

 We're using Solr-4.0.0-Beta with a 7M document index running on a single
 host with 16 shards. We'd like to use an ExternalFileField to hold a value
 that changes often. However, we've discovered that the file is apparently
 re-read by every shard/core on *every commit*; the index is unresponsive in
 this period (around 20s on the host we're running on). This is unacceptable
 for our needs. In the future, we'd like to add other values as
 ExternalFileFields, and this will make the problem worse.

 It would be better if the external file were instead read in in the
 background, updating previously read relevant values for each shard as they
 are read in.

 I guess a change in the ExternalFileField code would be required to achieve
 this, but I have no experience here, so suggestions are very welcome.

 Thanks,
 /Martin Koch - Issuu - Senior Systems Architect.




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Otis Gospodnetic
Hi,

Consider looking into synonyms and ngrams.

Otis
--
Performance Monitoring - http://sematext.com/spm
On Oct 8, 2012 11:21 AM, Hågen Pihlstrøm Hasle haagenha...@gmail.com
wrote:

 Hi!

 I'm quite new to Solr, I was recently asked to help out on a project where
 the previous Solr-person quit quite suddenly.  I've noticed that some of
 our searches don't return the expected result, and I'm hoping you guys can
 help me out.

 We've indexed a lot of names, and would like to search for a person in our
 system using these names.  We previously used Oracle Text for this, and we
 experience that Solr is much faster.  So far so good! :)  But when we try
  to use wildcards, things start to go wrong.

 We're using Solr 3.4, and I see that some of our problems are solved in
 3.6.  Ref SOLR-2438:
 https://issues.apache.org/jira/browse/SOLR-2438

 But we would also like to be able to combine wildcards with fuzzy
 searches, and wildcards with a phonetic filter.  I don't see anything about
 phonetic filters in SOLR-2438 or SOLR-2921.  (
 https://issues.apache.org/jira/browse/SOLR-2921)
 Is it possible to make the phonetic filters MultiTermAware?

 Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in
 Solr..) and find both christian and kristian.  As far as I understand, this
 is not possible in Solr, WildcardQuery and FuzzyQuery cannot be combined.
  Is this correct, or have I misunderstood anything?  Are there any
 workarounds or filter-combinations I can use to achieve the same result?
  I've seen people suggest using a boolean query to combine the two, but I
 don't really see how that would solve my chr*-problem.

 As I mentioned earlier I'm quite new to this, so I apologize if what I'm
 asking about only shows my ignorance..


 Regards, Hågen


Re: solr 1.4.1 - 3.6.1; SOLR-758

2012-10-08 Thread Chris Hostetter

: Regarding https://issues.apache.org/jira/browse/SOLR-758 (Enhance
: DisMaxQParserPlugin to support full-Solr syntax and to support alternate
: escaping strategies.)

FWIW: i'm not really sure what/how that issue relates to the problem you 
are seeing (or how you *think* it relates to the problem you are seeing) 
... so i'm just going to focus on the specifics of your error... 

: After applying the attached patches to 3.6.1 I'm experiencing this problem:

The mailing list typically rejects patches - none came with your message.

:  - SEVERE: org.apache.solr.common.SolrException: Error Instantiating
: QParserPlugin, org.apache.solr.search.AdvancedQParserPlugin is not a
: org.apache.solr.search.QParserPlugin

Besides the obvious problem of not extending the expected class, the other 
possibility is that when compiling your AdvancedQParserPlugin you may be 
compiling against the wrong version of Solr -- ie: you could get this 
error if the AdvancedQParserPlugin.class file you have was generated when 
your AdvancedQParserPlugin.java file was compiled against a different 
QParserPlugin.class than the one in use at runtime.



-Hoss


Re: solr1.4 code Example

2012-10-08 Thread Sujatha Arun
I did get some files by unpacking with jar, but could not get the ones I wanted
... thanks anyway!

On Mon, Oct 8, 2012 at 5:56 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote:

 On Mon, 2012-10-08 at 13:08 +0200, Sujatha Arun wrote:
  I am unable to unzip the  5883_Code.zip file for solr 1.4 from paktpub
 site
  .I get the error message
 
End-of-central-directory signature not found. [...]

 It is a corrupt ZIP-file. I'm guessing you got it from
 http://www.packtpub.com/files/code/5883_Code.zip
 I tried downloading the archive and it was indeed corrupt. You can read
 some of the files by using jar for unpacking: 'jar xvf 5883_Code.zip'.

 You'll need to contact packtpub to get them to fix it properly. A quick
 search indicates that they've had problems before:
 https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201005.mbox/%
 3c4bf66e8f.4070...@shoptimax.de%3E





Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Hågen Pihlstrøm Hasle

I guess synonyms would give me a similar result as using regexes, like Jack 
wrote about.  

I've thought about that, but I don't think it would be good enough.  
Substituting k for ch is easy enough, but the problem is that I have to 
think of every possible substitution in advance.  I'd like Fil* to find 
Phillip, I'd like Hen* to find Hansen, and so on.  The possibilities are 
quite endless, and I can't think of them all.  I can't limit myself to 
Norwegian names either, a lot of people living in Norway have names from other 
countries.  I'd like Moha* to find Mouhammed, etc..  Or am I too 
pessimistic?

I haven't read enough about Ngrams yet, so I'm not sure if I've understood it 
properly.  It divides the word into several pieces and tries to find one or 
more matches?  Would that really help in my Chr* example?  I guess you mean 
the combination of synonyms and ngrams?  

Is it possible to combine ngrams with a fuzzy query?  So that every piece of a 
word is matched in a fuzzy way?  Could that help me?

I'll certainly look into ngrams more, thanks for the suggestion.


Regards, Hågen  

On Oct 8, 2012, at 7:23 PM, Otis Gospodnetic wrote:

 Hi,
 
 Consider looking into synonyms and ngrams.
 
 Otis
 --
 Performance Monitoring - http://sematext.com/spm
 On Oct 8, 2012 11:21 AM, Hågen Pihlstrøm Hasle haagenha...@gmail.com
 wrote:
 
 Hi!
 
 I'm quite new to Solr, I was recently asked to help out on a project where
 the previous Solr-person quit quite suddenly.  I've noticed that some of
 our searches don't return the expected result, and I'm hoping you guys can
 help me out.
 
 We've indexed a lot of names, and would like to search for a person in our
 system using these names.  We previously used Oracle Text for this, and we
 experience that Solr is much faster.  So far so good! :)  But when we try
  to use wildcards, things start to go wrong.
 
 We're using Solr 3.4, and I see that some of our problems are solved in
 3.6.  Ref SOLR-2438:
 https://issues.apache.org/jira/browse/SOLR-2438
 
 But we would also like to be able to combine wildcards with fuzzy
 searches, and wildcards with a phonetic filter.  I don't see anything about
 phonetic filters in SOLR-2438 or SOLR-2921.  (
 https://issues.apache.org/jira/browse/SOLR-2921)
 Is it possible to make the phonetic filters MultiTermAware?
 
 Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in
 Solr..) and find both christian and kristian.  As far as I understand, this
 is not possible in Solr, WildcardQuery and FuzzyQuery cannot be combined.
 Is this correct, or have I misunderstood anything?  Are there any
 workarounds or filter-combinations I can use to achieve the same result?
 I've seen people suggest using a boolean query to combine the two, but I
 don't really see how that would solve my chr*-problem.
 
 As I mentioned earlier I'm quite new to this, so I apologize if what I'm
 asking about only shows my ignorance..
 
 
 Regards, Hågen



Re: add shard to index

2012-10-08 Thread Radim Kolar
Do it as it is done in the Cassandra database. Adding a new node and 
redistributing data can be done in a live system without problems; it looks 
like this:


Every Cassandra node has a key range assigned. Instead of assigning keys 
to nodes like hash(key) mod nodes, every node has its portion of the hash 
keyspace. The portions do not need to be the same; some nodes can have a larger 
portion of the keyspace than others.


Say the hash function's max possible value is 12.

shard1 - 1-4
shard2 - 5-8
shard3 - 9-12

Now let's add a new shard. In Cassandra, adding a new shard by default cuts 
an existing one in half, so you will have:

shard1 - 1-2
shard2 - 3-4
shard3 - 5-8
shard4 - 9-12

See? You needed to move only documents from the old shard1. Usually you are 
adding more than one shard during a reorganization, so you do not need to 
rebalance the cluster by moving every node to a different position in the hash 
keyspace that much.
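
A hedged Java sketch of the range-based routing described above (illustrative
only - neither Solr nor Cassandra code):

import java.util.TreeMap;

public class RangeRouter {
    // each shard owns the hash range that ends at its key in the map
    private final TreeMap<Integer, String> upperBound = new TreeMap<Integer, String>();

    public RangeRouter() {
        upperBound.put(4, "shard1");   // hashes 1-4
        upperBound.put(8, "shard2");   // hashes 5-8
        upperBound.put(12, "shard3");  // hashes 9-12
    }

    public String route(int hash) {
        // smallest upper bound >= hash (hash is assumed to be in 1-12)
        return upperBound.ceilingEntry(hash).getValue();
    }

    // splitting shard1 in half: only documents hashing to 3-4 have to move
    public void splitFirstShard(String newShard) {
        upperBound.put(2, "shard1");   // shard1 now keeps 1-2
        upperBound.put(4, newShard);   // the new shard takes over 3-4
    }
}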


Re: add shard to index

2012-10-08 Thread Michael Della Bitta
AKA Consistent Hashing: http://en.wikipedia.org/wiki/Consistent_hashing

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Mon, Oct 8, 2012 at 11:33 AM, Radim Kolar h...@filez.com wrote:
 Do it as it is done in cassandra database. Adding new node and
 redistributing data can be done in live system without problem it looks like
 this:

 every cassandra node has key range assigned. instead of assigning keys to
 nodes like hash(key) mod nodes, then every node has its portion of hash
 keyspace. They do not need to be same, some node can have larger portion of
 keyspace then another.

 hash function max possible value is 12.

 shard1 - 1-4
 shard2 - 5-8
 shard3 - 9-12

 now lets add new shard. In cassandra adding new shard by default cuts
 existing one by half, so you will have
 shard1 - 1-2
 shard2 - 3-4
 shard3 - 5-8
 shard4 - 9-12

 see? You needed to move only documents from old shard1. Usually you are
 adding more then 1 shard during reorganization, you do not need to rebalance
 cluster by moving every node into different position in hash keyspace that
 much.


Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Hågen Pihlstrøm Hasle

I understand that I'm quickly reaching the boundaries of my Solr-competence 
when I'm supposed to read about Expert Level concepts.. :)  I had already 
read it once, but now I read it again. Twice.  And I'm not sure if I understand 
it correctly..  So let me ask a follow-up question:
If I define an analyzer of type multiterm, will every filter I include for that 
analyzer be applied, even if it's not MultiTermAware?

To complicate this further, I'm not really sure if phonetic filters are a good 
match for our needs.  We search for names, and these names can come from all 
over the world.  We use DoubleMetaphone, and Wikipedia says it tries to 
"account for myriad irregularities in English of Slavic, Germanic, Celtic, 
Greek, French, Italian, Spanish, Chinese, and other origin".  So I guess it's 
quite good.  But how about names from the middle east, Pakistan or India?  Is 
DoubleMetaphone a good match also for names from these countries?  Are there 
any better algorithms?  

How about fuzzy-searches and wildcards, are they impossible to combine?

We actually do three queries for every search, one fuzzy, one phonetic and one 
using ngram.  Because I don't have too much confidence in the phonetic 
algorithm, I would really like to be able to combine fuzzy queries with 
wildcards.. :)


Regards, Hågen


On Oct 8, 2012, at 6:09 PM, Erick Erickson wrote:

 whether phonetic filters can be multiterm aware:
 
 I'd be leery of this, as I basically don't quite know how that would
 behave. You'd have to insure that the  algorithms changed the
 first parts of the words uniformly, regardless of what followed. I'm
 pretty sure that _some_ phonetic algorithms do not follow this
 pattern, i.e. eric wouldn't necessarily have the same beginning
 as erickson. That said, some of the algorithms _may_ follow this
 rule and might be OK candidates for being MultiTermAware
 
 But, you don't need this in order to try it out. See the Expert Level
 Schema Possibilities
 at:
 http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
 
 You can define your own analysis chain for wildcards as part of your 
 fieldType
 definition and include whatever you want, whether or not it's
 MultiTermAware and it
 will be applied at query time. Use the analyzer type=query entry
 as a basis. _But_ you shouldn't include anything in this section that
 produces more than one output per input token. Note, token, not
 field. I.e. a really bad candidate for this section is
 WordDelimiterFilterFactory
 if you use the admin/analysis page (which you'll get to know intimately) and
 look at a type that has WordDelimiterFilterFactory in its chain and
 put something
 like erickErickson1234, you'll see what I mean.. Make sure and check the
 verbose box
 
 If you can determine that some of the phonetic algorithms _should_ be
 MultiTermAware, please feel free to raise a JIRA and we can discuss... I 
 suspect
 it'll be on a case-by-case basis.
 
 Best
 Erick
 
 On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle
 haagenha...@gmail.com wrote:
 Hi!
 
 I'm quite new to Solr, I was recently asked to help out on a project where 
 the previous Solr-person quit quite suddenly.  I've noticed that some of 
 our searches don't return the expected result, and I'm hoping you guys can 
 help me out.
 
 We've indexed a lot of names, and would like to search for a person in our 
 system using these names.  We previously used Oracle Text for this, and we 
 experience that Solr is much faster.  So far so good! :)  But when we try to 
  use wildcards, things start to go wrong.
 
 We're using Solr 3.4, and I see that some of our problems are solved in 3.6. 
  Ref SOLR-2438:
 https://issues.apache.org/jira/browse/SOLR-2438
 
 But we would also like to be able to combine wildcards with fuzzy searches, 
 and wildcards with a phonetic filter.  I don't see anything about phonetic 
 filters in SOLR-2438 or SOLR-2921.  
 (https://issues.apache.org/jira/browse/SOLR-2921)
 Is it possible to make the phonetic filters MultiTermAware?
 
 Regarding fuzzy queries, in Oracle Text I can search for chr% (chr* in 
 Solr..) and find both christian and kristian.  As far as I understand, this 
 is not possible in Solr, WildcardQuery and FuzzyQuery cannot be combined.  
 Is this correct, or have I misunderstood anything?  Are there any 
 workarounds or filter-combinations I can use to achieve the same result?  
 I've seen people suggest using a boolean query to combine the two, but I 
 don't really see how that would solve my chr*-problem.
 
 As I mentioned earlier I'm quite new to this, so I apologize if what I'm 
 asking about only shows my ignorance..
 
 
 Regards, Hågen



Re: Fallout from the deprecation of setQueryType

2012-10-08 Thread Shawn Heisey

On 9/28/2012 9:09 AM, Shawn Heisey wrote:
I am planning and building up a test system with Solr 4.0, for my 
eventual upgrade.  I have not made a lot of progress so far, but I 
have come across a potential problem.


It's been over a week with no response to this.  Please see the original 
email for full details.


I have all but decided that I will allow the default /select handler to 
receive queries currently assigned to my lbcheck handler, and use a new 
handler called /search for everything on which I want to track statistics.


There is still a possible problem.  I have a broker core that has the 
shards parameter included in the standard request handler, so this would 
migrate to the new /search request handler.  In the past, you could 
change the handler used on those shards with a shards.qt parameter, but 
if the qt parameter is no longer allowed to have a slash, this isn't 
going to work in the future.  I will instead need an alternate config 
option that makes it use a new handler instead of /select.  Does that 
option already exist?


Thanks,
Shawn



Re: Wildcards and fuzzy/phonetic query

2012-10-08 Thread Erick Erickson
To answer your first question, yes, you've got it right. If you define
a multiterm section in your fieldType, whatever you put in that section
gets applied whether the underlying class is MultiTermAware or not.
Which means you can shoot yourself in the foot really badly <G>...

Well, you have 6 or so possibilities out of the box...and all of them will
fail at times. Fuzzy searches will also fail at times. And so will most
anything else you try. The problem is these are algorithmic in nature
and there are just too many cases that don't fit, human language is
so endlessly variable

Whether Middle Eastern names will work well with phonetic filters, well,
what's the input language? Are you indexing English (or Norwegian or...)
translations? In that case things should work OK since the phonetic variations
should be accounted for in the translations.

If you're indexing in different languages, you can apply different
phonetic filters
on different fields, so you might be able to work it that way. But if you're
indexing multiple languages in to a _single_ field, you'll have a lot of other
problems to solve before you start worrying about phonetics...

All I can really say is give it a try and see how well it works since good
search results are so domain dependent

Fuzzy searches + wildcards. I don't think you can do that reasonably, but
I'm not entirely sure.

Best
Erick

On Mon, Oct 8, 2012 at 2:28 PM, Hågen Pihlstrøm Hasle
haagenha...@gmail.com wrote:

 I understand that I'm quickly reaching the boundaries of my Solr-competence 
 when I'm supposed to read about Expert Level concepts.. :)  I had already 
 read it once, but now I read it again. Twice.  And I'm not sure if I 
 understand it correctly..  So let me ask a follow-up question:
 If I define an analyzer of type multiterm, will every filter I include for 
 that analyzer be applied, even if it's not MultiTermAware?

 To complicate this further, I'm not really sure if phonetic filters is a good 
 match for our needs.  We search for names, and these names can come from all 
 over the world.  We use DoubleMetaphone, and Wikipedia says it tries to 
 account for myriad irregularities in English of Slavic, Germanic, Celtic, 
 Greek, French, Italian, Spanish, Chinese, and other origin.  So I guess it's 
 quite good.  But how about names from the middle east, Pakistan or India?  Is 
 DoubleMetaphone a good match also for names from these countries?  Are there 
 any better algorithms?

 How about fuzzy-searches and wildcards, are they impossible to combine?

 We actually do three queries for every search, one fuzzy, one phonetic and 
 one using ngram.  Because I don't have too much confidence in the phonetic 
 algorithm, I would really like to be able to combine fuzzy queries with 
 wildcards.. :)


 Regards, Hågen


 On Oct 8, 2012, at 6:09 PM, Erick Erickson wrote:

 whether phonetic filters can be multiterm aware:

 I'd be leery of this, as I basically don't quite know how that would
 behave. You'd have to insure that the  algorithms changed the
 first parts of the words uniformly, regardless of what followed. I'm
 pretty sure that _some_ phonetic algorithms do not follow this
 pattern, i.e. eric wouldn't necessarily have the same beginning
 as erickson. That said, some of the algorithms _may_ follow this
 rule and might be OK candidates for being MultiTermAware

 But, you don't need this in order to try it out. See the Expert Level
 Schema Possibilities
 at:
 http://searchhub.org/dev/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

 You can define your own analysis chain for wildcards as part of your 
 fieldType
 definition and include whatever you want, whether or not it's
 MultiTermAware and it
 will be applied at query time. Use the analyzer type=query entry
 as a basis. _But_ you shouldn't include anything in this section that
 produces more than one output per input token. Note, token, not
 field. I.e. a really bad candidate for this section is
 WordDelimiterFilterFactory
 if you use the admin/analysis page (which you'll get to know intimately) and
 look at a type that has WordDelimiterFilterFactory in its chain and
 put something
 like erickErickson1234, you'll see what I mean.. Make sure and check the
 verbose box

 If you can determine that some of the phonetic algorithms _should_ be
 MultiTermAware, please feel free to raise a JIRA and we can discuss... I 
 suspect
 it'll be on a case-by-case basis.

 Best
 Erick

 On Mon, Oct 8, 2012 at 11:21 AM, Hågen Pihlstrøm Hasle
 haagenha...@gmail.com wrote:
 Hi!

 I'm quite new to Solr, I was recently asked to help out on a project where 
 the previous Solr-person quit quite suddenly.  I've noticed that some of 
 our searches don't return the expected result, and I'm hoping you guys can 
 help me out.

 We've indexed a lot of names, and would like to search for a person in our 
 system using these names.  We previously used Oracle Text for this, and we 
 experience that Solr is 

Re: Reloading ExternalFileField blocks Solr

2012-10-08 Thread Martin Koch
Sure: We're boosting search results based on user actions which could be
e.g. the number of times a particular document has been read. In future,
we'd also like to boost by e.g. impressions (the number of times a document
has been displayed) and other values.

/Martin

On Mon, Oct 8, 2012 at 7:02 PM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Martin,

 Can you tell me what's the content of that field, and how it should affect
 search result?

 On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote:

  Hi List
 
  We're using Solr-4.0.0-Beta with a 7M document index running on a single
  host with 16 shards. We'd like to use an ExternalFileField to hold a
 value
  that changes often. However, we've discovered that the file is apparently
  re-read by every shard/core on *every commit*; the index is unresponsive
 in
  this period (around 20s on the host we're running on). This is
 unacceptable
  for our needs. In the future, we'd like to add other values as
  ExternalFileFields, and this will make the problem worse.
 
  It would be better if the external file were instead read in in the
  background, updating previously read relevant values for each shard as
 they
  are read in.
 
  I guess a change in the ExternalFileField code would be required to
 achieve
  this, but I have no experience here, so suggestions are very welcome.
 
  Thanks,
  /Martin Koch - Issuu - Senior Systems Architect.
 



 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



How to efficiently find documents that have a specific value for a field OR the field does not exist at all

2012-10-08 Thread Artem Shnayder
I'm trying to find documents using this query:

field:value OR (*:* AND NOT field:[* TO *])

Which means, either field is set to value or the field does not exist in
the document.

I'm running this for ~20 fields in a single query strung together with
ANDs. The query time is high, averaging around 3.5s. Does anyone have
suggestions on how to optimize this query? As a last resort, using
technologies outside of Solr is a possibility.

All suggestions are greatly appreciated!


Thanks for your time and efforts,
Artem



PS. For the record, a colleague and I have brainstormed some ideas of our
own:

* Adding a meta field to each document that consists of 1s and 0s, where
each character represents a field's existence (1 yes, 0 no). In this case
the query would look like: field:value OR signature:???0???   
So we are looking for a certain field (the 0) that definitely does not
exist and all the others we do not care about (wildcard). Note that this
would have to be a leading wildcard query, or we could prepend a dummy
character to the beginning. A bit of a hack.

* Using bitwise operations to find all documents whose set of fields is a
subset of they query's set of fields. This would be more work and would
require writing a custom query parser or search handler.




Funny behavior in facet query on large dataset

2012-10-08 Thread kevinlieb
I am doing a facet query in Solr (3.4) and getting very bad performance. 
This is in a solr shard with 22 million records, but I am specifically doing
a small time slice.  However, even if I take the time slice query out, it
takes the same amount of time, so it seems to be searching the entire data
set.

I am trying to find all documents that contain the word dude or thedude
or anotherdude and count how many of these were written by eldudearino
(of course names are changed here to protect the innocent...).

My query is like this: 

http://myserver:8080/solr/select/?fq=created_at:NOW-5MINUTES&q=(+(text:(%22dude%22+%22thedude%22+%22%23anotherdude%22))+)&facet=true&indent=on&facet.mincount=1&wt=xml&version=2.2&rows=0&fl=author_username,author_id&facet.field=author_username&fq=author_username:(%22@eldudearino%22)

Any ideas what I could be doing wrong?

Thanks in advance!





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Funny-behavior-in-facet-query-on-large-dataset-tp4012584.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Erik Hatcher
Faceting at that scale takes time to warm up.  If you've got your caches and 
such configured appropriately, then successive searches will be very fast; 
however, you'll still need to do the cache warming (this depends on the faceting 
implementation you're using - in this case you're probably using the FieldCache).

Faceting performance doesn't depend on the filters or query - the caches that 
need to be built are indeed across the entire index.

Erik
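
To make the warming concrete: a hedged solrconfig.xml sketch of a newSearcher
warming query that pre-builds the FieldCache for the facet field from the
question (values are illustrative, untested):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">0</str>
        <str name="facet">true</str>
        <str name="facet.field">author_username</str>
      </lst>
    </arr>
  </listener>

A firstSearcher entry of the same shape covers the initial searcher as well.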

On Oct 8, 2012, at 16:26 , kevinlieb wrote:

 I am doing a facet query in Solr (3.4) and getting very bad performance. 
 This is in a solr shard with 22 million records, but I am specifically doing
 a small time slice.  However even if I take the time slice query out it
 takes the same amount of time, so it seems to be searching the entire data
 set.
 
 I am trying to find all documents that contain the word dude or thedude
 or anotherdude and count how many of these were written by eldudearino
 (of course names are changed here to protect the innocent...).
 
 My query is like this: 
 
 http://myserver:8080/solr/select/?fq=created_at:NOW-5MINUTES&q=(+(text:(%22dude%22+%22thedude%22+%22%23anotherdude%22))+)&facet=true&indent=on&facet.mincount=1&wt=xml&version=2.2&rows=0&fl=author_username,author_id&facet.field=author_username&fq=author_username:(%22@eldudearino%22)
 
 Any ideas what I could be doing wrong?
 
 Thanks in advance!
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Funny-behavior-in-facet-query-on-large-dataset-tp4012584.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Reloading ExternalFileField blocks Solr

2012-10-08 Thread Mikhail Khludnev
Martin,

I have a kind of hack approach in mind for hiding documents from search.
So, it's a little bit easier than your task. I'm going to deliver a talk
about it http://www.apachecon.eu/schedule/presentation/89/ .
Frankly speaking, there is no reliable out-of-the-box solution for it. I
saw that DocValues has been integrated with FunctionQueries already, but
DocValues updates, which sound like a doable thing, have not been delivered
yet.

Regards

On Mon, Oct 8, 2012 at 11:54 PM, Martin Koch m...@issuu.com wrote:

 Sure: We're boosting search results based on user actions which could be
 e.g. the number of times a particular document has been read. In future,
 we'd also like to boost by e.g. impressions (the number of times a document
 has been displayed) and other values.

 /Martin

 On Mon, Oct 8, 2012 at 7:02 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com
  wrote:

  Martin,
 
  Can you tell me what's the content of that field, and how it should
 affect
  search result?
 
  On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote:
 
   Hi List
  
   We're using Solr-4.0.0-Beta with a 7M document index running on a
 single
   host with 16 shards. We'd like to use an ExternalFileField to hold a
  value
   that changes often. However, we've discovered that the file is
 apparently
   re-read by every shard/core on *every commit*; the index is
 unresponsive
  in
   this period (around 20s on the host we're running on). This is
  unacceptable
   for our needs. In the future, we'd like to add other values as
   ExternalFileFields, and this will make the problem worse.
  
   It would be better if the external file were instead read in in the
   background, updating previously read relevant values for each shard as
  they
   are read in.
  
   I guess a change in the ExternalFileField code would be required to
  achieve
   this, but I have no experience here, so suggestions are very welcome.
  
   Thanks,
   /Martin Koch - Issuu - Senior Systems Architect.
  
 
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Tech Lead
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Chris Hostetter

: a small time slice.  However even if I take the time slice query out it
: takes the same amount of time, so it seems to be searching the entire data
: set.

a) you might try using facet.method=enum - in some special cases it may be 
faster than the default (facet.method=fc).

: I am trying to find all documents that contain the word dude or thedude
: or anotherdude and count how many of these were written by eldudearino
: (of course names are changed here to protect the innocent...).

b) field faceting isn't really designed for this type of problem.  Field 
faceting is very suitable for questions like "find all docs matching 
QUERY, and for all of those docs, give me a list of the top N authors and 
how many docs were written by those authors."

c) If you just want to query for the docs written by a single author, 
you can use an fq like you do in your example, and then look at the 
numFound to know the total -- but in that case the faceting is just making 
extra work to generate counts of 0 for all of the other authors.

d) if you want to query for an arbitrary set of documents, and then know 
how many of those documents were written by a particular author (or each 
of a particular set of authors) try facet.query instead.

...&facet=true&facet.query=author_username:(%22@eldudearino%22)


-Hoss


Re: How to efficiently find documents that have a specific value for a field OR the field does not exist at all

2012-10-08 Thread Ahmet Arslan
 field:value OR (*:* AND NOT field:[* TO *])
 
 Which means, either field is set to value or the field
 does not exist in
 the document.

Instead of field:[* TO *], you can define a default value in schema.xml, or use 
DefaultValueUpdateProcessorFactory in solrconfig.xml.

With this, the "field does not exist in the document" part becomes 
field:MySpecialDefaultValue.
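
For what it's worth, a hedged solrconfig.xml sketch of that processor, assuming
Solr 4.0 (the chain name and the sentinel value are illustrative; documents must
be reindexed through this chain for the default to appear):

  <updateRequestProcessorChain name="add-defaults" default="true">
    <processor class="solr.DefaultValueUpdateProcessorFactory">
      <str name="fieldName">field</str>
      <str name="value">__none__</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

The disjunction then collapses to field:value OR field:__none__, avoiding the
expensive NOT field:[* TO *] clause entirely.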


Re: Funny behavior in facet query on large dataset

2012-10-08 Thread kevinlieb
Thanks for all the replies. 

I oversimplified the problem for the purposes of making my post small and
concise.  I am really trying to find the counts of documents by a list of 10
different authors that match those keywords.  Of course on looking up a
single author there is no reason to do a facet query.  To be clearer:
Find all documents that contain the word dude or thedude or
anotherdude and count how many of these were written by eldudearino and
zeedudearino and adudearino and beedudearino

I tried facet.query as well as facet.method=fc and neither really helped.

We are constantly adding documents to the solr index and committing, every
few seconds, which is probably why this is not working well.

Seems we need to re-architect the way we are doing this... 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Funny-behavior-in-facet-query-on-large-dataset-tp4012584p4012610.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: long query response time in shards search

2012-10-08 Thread Jason
Hi,

We're using Solr 4.0 and servicing patent search.
Patent search involves very complex queries, including wildcards.
I think an Ngram or EdgeNgram filter is an alternative,
but not every term included in a query has a wildcard,
so we can't use that filter.

If I make an empty core and use it as a main core that just merges search
results, would that be helpful?
Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/long-query-response-time-in-shards-search-tp4012366p4012628.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Shawn Heisey

On 10/8/2012 4:09 PM, kevinlieb wrote:

Thanks for all the replies.

I oversimplified the problem for the purposes of making my post small and
concise.  I am really trying to find the counts of documents by a list of 10
different authors that match those keywords.  Of course on looking up a
single author there is no reason to do a facet query.  To be clearer:
Find all documents that contain the word dude or thedude or
anotherdude and count how many of these were written by eldudearino and
zeedudearino and adudearino and beedudearino

I tried facet.query as well as facet.method=fc and neither really helped.

We are constantly adding documents to the solr index and committing, every
few seconds, which is probably why this is not working well.

Seems we need to re-architect the way we are doing this...


I would definitely consider increasing the amount of time between 
commits.  You can add documents at whatever interval you want, but if 
you only do commits every minute or two, your caches will be much more 
useful.


Your time slice filter query (NOW-5MINUTES) will never be cached, 
because NOW is measured in milliseconds and will therefore be different 
for every query.  You might consider doing NOW/MINUTE-5MINUTES instead 
.. or even [NOW/MINUTE-5MINUTES TO *] so that you actually are dealing 
with a range.  For the space of that minute (at least until the cache 
gets invalidated by a commit), the filter cache entry will be valid.
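
For example, applied to the filter from the original query:

  fq=created_at:[NOW/MINUTE-5MINUTES TO *]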


Some general questions that may matter: How big are all your index 
directories on this server, how much RAM is in the server, and how much 
RAM are you giving to Java?  I'm also curious how big your Solr caches 
are, what the autowarm counts are, and how long it is taking for your 
caches to warm up after each commit.  You can get the warm times from 
the cache statistics in the admin interface.


Thanks,
Shawn



Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Otis Gospodnetic
Hi Kevin,

Right, it's the very frequent commits, most likely.  Change commits
to, say, every 60 or 120 seconds and compare the performance.  I think
you guys use SPM, so check the Cache graphs (hit % specifically)
before and after the above change.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Mon, Oct 8, 2012 at 6:09 PM, kevinlieb ke...@politear.com wrote:
 Thanks for all the replies.

 I oversimplified the problem for the purposes of making my post small and
 concise.  I am really trying to find the counts of documents by a list of 10
 different authors that match those keywords.  Of course on looking up a
 single author there is no reason to do a facet query.  To be clearer:
 Find all documents that contain the word dude or thedude or
 anotherdude and count how many of these were written by eldudearino and
 zeedudearino and adudearino and beedudearino

 I tried facet.query as well as facet.method=fc and neither really helped.

 We are constantly adding documents to the solr index and committing, every
 few seconds, which is probably why this is not working well.

 Seems we need to re-architect the way we are doing this...



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Funny-behavior-in-facet-query-on-large-dataset-tp4012584p4012610.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ 4.0 Beta maxConnectionsPerHost

2012-10-08 Thread Otis Gospodnetic
Hi,

Qs:
* Have you tried StreamingUpdateSolrServer?
* Newer version of Solr(J)?

When things hang, jstack your app that uses SolrJ and Solr a few times
and you should be able to see where they are stuck.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Mon, Oct 8, 2012 at 9:52 PM, Briggs Thompson
w.briggs.thomp...@gmail.com wrote:
 I am running into an issue of a multithreaded SolrJ client application used
 for indexing is getting into a hung state. I responded to a separate thread
 earlier today with someone that had the same error, see
 http://lucene.472066.n3.nabble.com/SolrJ-IOException-td4010026.html

 I did some digging and experimentation and found something interesting.
 When starting up the application, I see the following in Solr logs:
 Creating new http client, config:maxConnections=200&maxConnectionsPerHost=8

 The way I instantiate the HttpSolrServer through SolrJ is like the
 following

 HttpSolrServer solrServer = new HttpSolrServer(serverUrl);
 solrServer.setConnectionTimeout(1000);
 solrServer.setDefaultMaxConnectionsPerHost(100);
 solrServer.setMaxTotalConnections(100);
 solrServer.setParser(new BinaryResponseParser());
 solrServer.setRequestWriter(new BinaryRequestWriter());

 It seems as though the maxConnections and maxConnectionsPerHost are not
 actually getting set. Anyone seen this problem or have an idea how to
 resolve?

 Thanks,
 Briggs Thompson
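
For what it's worth, a hedged sketch (Solr 4.0-era SolrJ; that this is the cause
of the pinned maxConnectionsPerHost=8 is an assumption) that configures the pool
before the server is built, by handing HttpSolrServer a pre-configured
HttpClient instead of calling the setters afterwards:

import org.apache.http.client.HttpClient;
import org.apache.solr.client.solrj.impl.HttpClientUtil;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

public class PooledSolrClient {
    public static HttpSolrServer create(String serverUrl) {
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 100);          // pool total
        params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 100); // per host
        HttpClient client = HttpClientUtil.createClient(params);
        return new HttpSolrServer(serverUrl, client);
    }
}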


Re: long query response time in shards search

2012-10-08 Thread Otis Gospodnetic
Hi,

We've explored this with a few clients a while back.  If I remember
correctly, this doesn't make much difference and I don't expect it
will make any noticable difference for you since all your cores are on
that same 1 server.  If you had 1 server with more CPU cores you would
see better numbers.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Mon, Oct 8, 2012 at 9:43 PM, Jason hialo...@gmail.com wrote:
 Hi,

 We're using Solr 4.0 and servicing patent search.
 Patent search intends to very complex queries including wildcard.
 I think Ngram or EdgeNgram filter is alternative.
 But every terms included a query don't have wildcard.
 So we can't use that filter.

 If I make empty core and use in main core that just merge search results, is
 it helpful?
 Thanks.



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/long-query-response-time-in-shards-search-tp4012366p4012628.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Reloading ExternalFileField blocks Solr

2012-10-08 Thread Otis Gospodnetic
Hi Martin,

Perhaps you could make a small change in Solr to add "don't reload the EFF
if it hasn't been modified since it was last opened".  I assume you
commit pretty often, but don't modify EFF files that often, so this
could save you some needless loading.  That said, I'd be surprised EFF
doesn't already do this... I didn't check.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html
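
A hedged pseudocode sketch of that guard (hypothetical names, not actual Solr
source): remember the file's mtime and skip the reload when it hasn't changed.

import java.io.File;

class ExternalFileGuard {
    private long lastLoadedMtime = -1L;
    private float[] cachedValues;

    float[] load(File externalFile) {
        long mtime = externalFile.lastModified();
        if (cachedValues != null && mtime == lastLoadedMtime) {
            return cachedValues;            // unchanged since the last commit: reuse
        }
        cachedValues = parse(externalFile); // the expensive re-read happens only here
        lastLoadedMtime = mtime;
        return cachedValues;
    }

    private float[] parse(File f) {
        return new float[0]; // stand-in for the real key=value parsing
    }
}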


On Mon, Oct 8, 2012 at 4:55 AM, Martin Koch m...@issuu.com wrote:
 Hi List

 We're using Solr-4.0.0-Beta with a 7M document index running on a single
 host with 16 shards. We'd like to use an ExternalFileField to hold a value
 that changes often. However, we've discovered that the file is apparently
 re-read by every shard/core on *every commit*; the index is unresponsive in
 this period (around 20s on the host we're running on). This is unacceptable
 for our needs. In the future, we'd like to add other values as
 ExternalFileFields, and this will make the problem worse.

 It would be better if the external file were instead read in in the
 background, updating previously read relevant values for each shard as they
 are read in.

 I guess a change in the ExternalFileField code would be required to achieve
 this, but I have no experience here, so suggestions are very welcome.

 Thanks,
 /Martin Koch - Issuu - Senior Systems Architect.


Problem with dataimporter.request

2012-10-08 Thread Zakka Fauzan
I'm quite new to Solr, and I have a question regarding the request for the data
importer.

In my data-config.xml, I have something like this:

<entity name="content" pk="id" query="SELECT * FROM tableX"
        deltaQuery="SELECT max(id) AS id FROM ${dataimporter.request.dataView}"
        deltaImportQuery="SELECT * FROM tableX WHERE ${dataimporter.delta.id} &lt; id">
</entity>

However, every time I execute a delta-import (/dataimport?command=delta-import),
it gives me an exception like this:

Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query:
SELECT max(id) AS id FROM  Processing Document # 1

I believe this error exists because the system didn't recognize
${dataimporter.request.dataView}, but I don't know how to make it
recognized.

*I also asked the very same question in
http://stackoverflow.com/questions/12793025/cannot-get-anything-from-dataimporter-request-on-updating-index,
if you want to get some reputation there too, you can answer there. Thank
you!
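
For what it's worth, ${dataimporter.request.*} variables are resolved from
parameters passed on the dataimport request itself, so the delta-import call
has to carry the parameter, e.g. (host and core illustrative):

  http://localhost:8983/solr/dataimport?command=delta-import&dataView=tableX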

--
Zakka Fauzan