Re: change get to post ??

2010-04-13 Thread Michael Kuhlmann
Hi,

the problem is not the GET request type; the problem is that you are building
a far too complicated query. This won't scale well and looks rather
odd.

Why don't you just add all parent category ids to every document at
index time? Then you could simply filter your request with the topmost
category id, and nothing more.

If you want additional queries filtered by some single category only,
then you should perhaps define two fields for the category ids: one with a
single id token, and another with the ids of all sub-categories.

-Kuli

Am 13.04.2010 13:07, schrieb stockii:
 
 Hello.
 
 My client uses my autocompletion with an normal http-Request to solr. like
 this: http://XXX/solr/suggestpg/select/?q=harry
 
 so, when i want to search in a category with all his childs, my request is
 too long.  
 How can i change from GET to POST ??
 
 my request to solr looks like this. in short ;)
 http://XXX/solr/suggestpg/select/?q=harry&fq=
 cat_id:7994+OR+cat_id:7995+OR+cat_id:8375+OR+cat_id:8465+OR+cat_id:8757+OR+cat_id:8766+OR+cat_id:8792+OR+cat_id:8843
 .
 
 
 thx, for fast support ;) its very important ^^



Re: change get to post ??

2010-04-13 Thread Michael Kuhlmann
You need to change the way your data is imported, or look for an
alternative way to build your query. It depends on your data model and
your import mechanism.

Do you really have hundreds of categories?

BTW, childs is amusing! ;-)

-Michael

Am 13.04.2010 14:12, schrieb stockii:
 
 hi. thx for reply =)
 
 okay i think im little bit stupid.
 
 i dont know how can i filter the right categorys.
 
 i get only the id of one category. 
 
 i importet every childs for each item and the parent_category_id. but it ist
 only one parent for each item and not for each category. 
 
 so an example for dummies ;-)
 
 DOC: id:1; cat_id:5; cat_parent:4; cat_childs: 3,2,1
 
 so how can i filter so that i get all items of categories 1,2,3,4 and 5 when
 my request is only: q=harry potter; fq=cat_id:5
 
 thx
 
 i dont see it XD



Re: change get to post ??

2010-04-13 Thread Michael Kuhlmann
Hi,

Am 13.04.2010 14:52, schrieb stockii:
 some cat, have 300 child-categories.
And that's the reason why you shouldn't add them all to your filter query.

 or, how can i import the cat-data ? 
Again: How do you do it NOW?

-Michael



Re: change get to post ??

2010-04-13 Thread Michael Kuhlmann
I wouldn't do autosuggestion with normal queries anyway. Because of
better performance... :-)

I don't use DIH, so I can't tell what to do in that case. For us, we import data
with a simple PHP script, which was rather easy to write. So we have
full control over Solr's data structure. You somehow have to add
information about the category branch to every document for fast queries.

Michael

Am 13.04.2010 15:22, schrieb stockii:
 
 heya.
 
 okay NOW. 
 
 i import from database with DIH.
 
 every item have cat_id, more not. for the normal search it works to use
 fq and Post the search.
 
 but for my autosuggestion, it didnt work, because our app does not use the
 autosuggestion with our API. Because of better performance ...
 
 
 
 



Re: Turn off request logging for some handlers?

2010-04-15 Thread Michael Kuhlmann
Am 15.04.2010 17:45, schrieb Shawn Heisey:
 Is it possible to turn off request logging for some handlers? 
 Specifically, I'd like to stop logging requests to /admin/ping and
 /replication, which get hit very often.
 
Hi,

you can set logging for nearly every single task here:

http://host:port/solr/admin/logging.jsp

-Michael


Re: is solr ignored my filters ?

2010-04-19 Thread Michael Kuhlmann
Am 19.04.2010 16:09, schrieb stockii:
 so i want to see how it is indexed. 
 
 
Go to the admin panel, open the schema browser, and set the number of
shown tokens to 1 or something.

-Michael



Re: is solr ignored my filters ?

2010-04-19 Thread Michael Kuhlmann
Am 19.04.2010 16:29, schrieb stockii:
 
 oha, yes thx but
 
 we have 800 000 items ... to find the right in this way ? XD 

Then use the TermsComponent: http://wiki.apache.org/solr/TermsComponent

-Michael


Re: Best Open Source

2010-04-20 Thread Michael Kuhlmann
Nice site. Really!

In addition to Dave:
How do I search with tags enabled?
If I search for "Blog", I can see that there's one blog software written
in Java. When I click on the Java tag, my search is discarded and
I get all Java software. When I do my search again, the tag filter is
lost. It seems to be impossible to combine tag filters with a search.

-Michael

Am 20.04.2010 11:00, schrieb solai ganesh:
 Hello all,
 
 We have launched a new site hosting the best open source products and
 libraries across all categories. This site is powered by Solr search. There
 are many open source products available in all categories and it is
 sometimes difficult to identify which is the best. We identify the best. As
 a open source users, you might be using many opensource products and
 libraries , It would be great, if you help us to identify the best.
 
 http://www.findbestopensource.com/
 
 Regards
 Aditya
 



Re: SpellChecking

2010-05-03 Thread Michael Kuhlmann
Am 03.05.2010 16:43, schrieb Jan Kammer:
 Hi,
 
 It worked fine with a normal field. There must something wrong with
 copyfield, or why does dataimporthandler add/update no more documents?

Did you define your destination field as multivalued?

-Michael


Re: Score cutoff

2010-05-04 Thread Michael Kuhlmann
Am 03.05.2010 23:32, schrieb Satish Kumar:
 Hi,
 
 Can someone give clues on how to implement this feature? This is a very
 important requirement for us, so any help is greatly appreciated.
 

Hi,

I just implemented exactly this feature. You need to patch Solr to make
this work.

We at Zalando are planning to set up a technology blog where we'll offer
such tools, but at the moment this is not done. I can make a patch out
of my work and send it to you today.

Greetings,
Michael

 On Tue, Apr 27, 2010 at 5:54 PM, Satish Kumar 
 satish.kumar.just.d...@gmail.com wrote:
 
 Hi,

 For some of our queries, the top xx (five or so) results are of very high
 quality and results after xx are very poor. The difference in score for the
 high quality and poor quality results is high. For example, 3.5 for high
 quality and 0.8 for poor quality. We want to exclude results with score
 value that is less than 60% or so of the first result. Is there a filter
 that does this? If not, can someone please give some hints on how to
 implement this (we want to do this as part of solr relevance ranking so that
 the facet counts, etc will be correct).


 Thanks,
 Satish

 



Re: how to achieve filters

2010-05-18 Thread Michael Kuhlmann
Am 18.05.2010 16:54, schrieb Ahmet Arslan:
 2. Query=rock where bitrate<128 where it should return
 only first and third docs where bitrate<128
 
 q=rock&fq=bitrate:[* TO 128] for this the bitrate field must be tint type.
 

q=rock&fq=bitrate:[* TO 127] would be better, because the bitrate should be
lower than 128.

BTW, a bitrate of 127 is interesting...

@Prakash:
See http://wiki.apache.org/solr/SolrFacetingOverview


Re: Solr 1.4 query fails against all fields, but succeed if field is specified.

2010-05-31 Thread Michael Kuhlmann
Am 31.05.2010 11:50, schrieb olivier sallou:
 Hi,
 I have created in index with several fields.
 If I query my index in the admin section of solr (or via http request), I
 get results for my search if I specify the requested field:
 Query:   note:Aspergillus  (look for Aspergillus in field note)
 However, if I query the same word against all fields  (Aspergillus or
 all:Aspergillus) , I have no match in response from Solr.

Querying Aspergillus without a field only works if you're using the
DisMax handler.

Do you have a field named "all"?

Try *:Aspergillus instead.


Re: Solr 1.4 query fails against all fields, but succeed if field is specified.

2010-05-31 Thread Michael Kuhlmann
Am 31.05.2010 12:36, schrieb olivier sallou:
 Is there any way to query all fields including dynamic ones?

Yes, using the *:term query. (Please note that the asterisk should not
be quoted.)

To answer your question, we need more details on your Solr
configuration, esp. the part of schema.xml that defines your note field.

Greetings,
Michael




Re: Many Tomcat Processes on Server ?!?!?

2010-06-02 Thread Michael Kuhlmann
Am 02.06.2010 16:13, schrieb Paul Libbrecht:
 Is your server Linux?
 In this case this is very normal.. any java application spawns many new
 processes on linux... it's not exactly bound to threads unfortunately.

Uh, no. New threads in Java typically don't spawn new processes at the OS level.

I never had more than one Tomcat process on any Linux machine. In fact,
if there was more than one because a previous Tomcat hadn't shut down
correctly, the new process wouldn't respond to HTTP requests.

55 Tomcat processes shouldn't be normal, at least not if that's what ps
aux reports.


Re: PHP output at a multiValued AND dynamicField

2010-06-02 Thread Michael Kuhlmann
Am 02.06.2010 16:15, schrieb Jörg Agatz:
 yes i done.. but i dont know how i get the information out of the big
 Array...

They're simply the keys of a single response array.


Re: Many Tomcat Processes on Server ?!?!?

2010-06-02 Thread Michael Kuhlmann
Am 02.06.2010 16:39, schrieb Paul Libbrecht:
 This is impressive, I had this in any Linux I've been using: SuSE,
 Ubuntu, Debian, Mandrake, ...
 Maybe there's some modern JDK with a modern Linux where it doesn't happen?
 It surely is not one process per thread though.

I'm not a Linux thread expert, but from what I know, Linux doesn't have
lightweight threads the way other systems do. Instead it uses processes for that.

But these processes aren't top-level processes that show up in top/ps.
Instead, they're grouped hierarchically (AFAIK). Otherwise you would be
able to kill single user threads with their own process id, or kill the
main process and let the spawned threads continue. That would be totally
crazy.

In my configuration, Tomcat doesn't shut down correctly if I call
bin/shutdown.sh, so I have to kill the process manually. I don't know
why. This might be the reason why stockii has 3 Tomcat processes running.


Re: PHP output at a multiValued AND dynamicField

2010-06-02 Thread Michael Kuhlmann
Am 02.06.2010 16:42, schrieb Jörg Agatz:
 i don't understand what you mean!
 
Then you should ask more precisely.


Re: Auto-suggest internal terms

2010-06-03 Thread Michael Kuhlmann
The only solution without doing any custom work would be to perform a
normal query for each suggestion. But you might run into performance
trouble with that, because suggestions are typically performed much
more often than complete searches.

The much faster solution, which needs custom work, would be to build up a
large TreeMap with each word as a key and the matching terms as the
values.
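
A rough sketch of what I mean in plain Java (everything here is made up;
the phrases would of course come from your index or your query logs):

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SuggestIndex {
    // word -> all phrases containing that word
    private final TreeMap<String, Set<String>> byWord =
            new TreeMap<String, Set<String>>();

    public void add(String phrase) {
        for (String word : phrase.toLowerCase().split("\\s+")) {
            Set<String> phrases = byWord.get(word);
            if (phrases == null) {
                phrases = new TreeSet<String>();
                byWord.put(word, phrases);
            }
            phrases.add(phrase);
        }
    }

    // all phrases that contain a word starting with the given prefix
    public Set<String> suggest(String prefix) {
        prefix = prefix.toLowerCase();
        Set<String> result = new LinkedHashSet<String>();
        // the key range [prefix, prefix + '\uffff') covers all words with that prefix
        for (Set<String> phrases : byWord.subMap(prefix, prefix + '\uffff').values()) {
            result.addAll(phrases);
        }
        return result;
    }

    public static void main(String[] args) {
        SuggestIndex index = new SuggestIndex();
        index.add("french wine classes");
        index.add("wine book discounts");
        index.add("burgundy wine");
        System.out.println(index.suggest("wine")); // prints all three phrases
    }
}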

-Michael

Am 02.06.2010 22:01, schrieb Jay Hill:
 I've got a situation where I'm looking to build an auto-suggest where any
 term entered will lead to suggestions. For example, if I type wine I want
 to see suggestions like this:
 
 french *wine* classes
 *wine* book discounts
 burgundy *wine*
 
 etc.
 
 I've tried some tricks with shingles, but the only solution that worked was
 pre-processing my queries into a core in all variations.
 
 Anyone know any tricks to accomplish this in Solr without doing any custom
 work?
 
 -Jay
 


Re: Auto-suggest internal terms

2010-06-03 Thread Michael Kuhlmann
Am 03.06.2010 13:02, schrieb Andrzej Bialecki:
 ..., and deploy this
 index in a separate JVM (to benefit from other CPUs than the one that
 runs your Solr core)

Every known web server is multithreaded by default, so putting different
Solr instances into different JVMs will be of no use.

-Michael


Re: Auto-suggest internal terms

2010-06-03 Thread Michael Kuhlmann
Am 03.06.2010 16:45, schrieb Andrzej Bialecki:
 You are right to a certain degree. Still, there are some contention
 points in Lucene/Solr, how threads are allocated on available CPU-s, and
 how the heap is used, which can make a two-JVM setup perform much better
 than a single-JVM setup given the same number of threads...

Allow me to not believe this! ;-) It's not Solr that allocates threads,
it's the web server (Jetty, Glassfish, or whatever). In a normal
configuration, it will use as many threads as useful, so there's no
need to start a second web server on the same machine.

As for Lucene, there is some magic algorithm that reuses an IndexReader across a
limited number of threads (as far as I have seen in the code, but the
details are unimportant). But at the very least, if you have a multi-core
setup, you'll get separate IndexReader instances from Lucene per core. So
I don't see why you should scatter them across different VMs.

Greetings,
Michael


Analyzer for indexing only, not for queries

2010-03-12 Thread Michael Kuhlmann
Hi all,

I have a field with some kind of category tree as a string. The format
is like this:
prefixfirstsecond#prefixotherfirstothersecond

So, the document is categorized in two categories, separated by '#', and
all categories start with the same prefix which I don't want to use.

For indexing, I have some fields for each category level, filled by
copyFields. For instance, the first level is defined using this type:

<fieldType name="text_first_cat" class="solr.TextField"
    positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.PatternTokenizerFactory"
          pattern="(?:#|^)\w*([\p{L}\d]+)" group="1"/>
   </analyzer>
</fieldType>

This works fine, except for one thing: this analyzer is also being used for
queries, not only for indexing. So, a query for xfirst gets
results, but a query for first alone finds nothing. However, the
latter is what I want.

If I add some pseudo-analyzer that does nothing like this:
   <analyzer type="query">
      <tokenizer class="solr.PatternTokenizerFactory"
          pattern=".*" group="0"/>
   </analyzer>
then I get the result that I want. If I don't add a query analyzer at
all, the index analyzer is used for queries, which is strange and
not what I would expect.

I just want some
take-the-query-as-it-is-and-do-nothing-with-it analyzer, as if I didn't
specify an analyzer at all. However, if I simply add
<analyzer type="query"/>
to it, I get a parser exception from Solr.

Is there a clean solution for this? And why is Solr ignoring the
analyzer type as long as there is only one analyzer defined per type?

Greetings,
Michael


KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries

2010-03-12 Thread Michael Kuhlmann
Hi Erick,

thank you very much for your help. What's confusing me is that another
of my fields does not have any analyzers defined at all, and it's
working fine without problems. So, it must be possible to define a field
type without specifying any analyzers. I don't understand why it
shouldn't be possible any more if either the index or the query analyzer
is specified and the other is not. Maybe it would be clearer if Solr
raised an exception in this case instead of using an analyzer that was
specified for the opposite type.

Anyway, I took your advice and used the KeywordTokenizerFactory instead.
Great! Now it does exactly what I want. Thanks again!

But may I ask another question? As with the categories, I have some
fields that are only used for faceting, so they're only queried through
facet requests. No modification is needed, no lowercasing, nothing. So the
KeywordTokenizerFactory is perfect for them.

Alas, when the value contains spaces, I'm still getting too many
results. I have a field defined like this:

<fieldType name="text_unchanged" class="solr.StrField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

(Using solr.TextField didn't change anything)

When querying for:
fq=label:Aces+of+London

I get the result:
 "facet_fields":{
   "label":[
     "Aces of London",31,
     "Feud London",2,
     "Fly London",2]
 },

I get the same result when taking "Feud London" as the facet value.

When inspecting the index with the schema browser, I can see that all
labels are tokenized correctly as complete strings, i.e. there's no token
"London", but a token "Aces of London". So the KeywordTokenizer seems to
work as expected, at least for indexing. It's only that the facet query
is not narrow enough.

Even the superb Solr book didn't help me here. Do you - or anyone else -
have a clue what I'm doing wrong here?

Greetings,
Michael

On 03/12/10 14:52, Erick Erickson wrote:
 Well, what would you have SOLR do that makes sense if you
 don't define a query analyzer? Very very strange things
 happen if you use different analyzers for indexing
 and querying. At least defaulting that way has a *chance* of
 giving expected results...
 
 Why not use, say, KeywordTokenizerFactory if you really
 want the query analyzer to do nothing? Perhaps lowercasing
 etc. See:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
 
 HTH
 Erick
 
 On Fri, Mar 12, 2010 at 3:00 AM, Michael Kuhlmann 
 michael.kuhlm...@zalando.de wrote:
 


Re: KeywordTokenizer for faceting; was: Re: Analyzer for indexing only, not for queries

2010-03-12 Thread Michael Kuhlmann
Hi Erick,

On 03/12/10 17:09, Erick Erickson wrote:
 What's confusing me is that another
 of my fields does not have any analyzers defined at all, and it's
 working fine without problems.
 
 Field or fieldType?

...one of my fields with a fieldtype that does not have any analyzer
defined at all, ... ;-)

 
  So, it must be possible to define field
 type without specifying any analyzers. 
 
 Truth to tell, I don't know off the top of my head
 what happens if you define no analyzer for a fieldType.
 I think it would be bad practice anyway, *I* want to *know*
 what indexing and analyzing operations are going on so
 I can predict the resutls G. Someone want to chime in?

It looks like the whole string will be used as one token, just as the
KeywordTokenizerFactory does. You're right that it's always
better to explicitly specify what you want. But as I didn't know what
the KeywordTokenizerFactory did before (I assumed that it would
tokenize the string into several keywords), and as long as I'm in the
development phase, the unspecified behaviour was quite okay for me.

Once again, thank you for your help!

Michael


KeywordTokenizer for faceting gives too many results

2010-03-12 Thread Michael Kuhlmann
Hi,

I have some fields that are only used for faceting, so they're only
queried through facet requests. No modification is needed, no lowercasing,
nothing. So the KeywordTokenizerFactory seems to be perfect for them.

Alas, when the value contains spaces, I'm still getting too many
results. I have a field defined like this:

<fieldType name="text_unchanged" class="solr.StrField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

(Using solr.TextField didn't change anything)

When querying for:
fq=label:Aces+of+London

I get the result:
 "facet_fields":{
   "label":[
     "Aces of London",31,
     "Feud London",2,
     "Fly London",2]
 },

I get the same result when taking "Feud London" as the facet value.

When inspecting the index with the schema browser, I can see that all
labels are tokenized correctly as complete strings, i.e. there's no token
"London", but a token "Aces of London". So the KeywordTokenizer seems to
work as expected, at least for indexing. It's only that the facet query
is not narrow enough.

Even the superb Solr book didn't help me here. Does anybody have a clue
what I'm doing wrong here?

Greetings,
Michael


Re: KeywordTokenizer for faceting gives too many results

2010-03-12 Thread Michael Kuhlmann
On 03/12/10 17:51, Ahmet Arslan wrote:
 

 try using Parenthesis with queries that contain more than
 one term. fq=label:(Aces+of+London) 
 Otherwise 
 <defaultSearchField> jumps
 in.
 
 defaultSearchField stuff is correct but I just realized that you need to use 
 quotes in your case. Because the query parser splits on white-spaces. 
 fq=label:"Aces of London" 
 Or you need to escape spaces: fq=label:Aces\ of\ London

Wow! This is absolutely correct. Now it works as expected!

Thank you!

Michael


Re: RegexTransformer

2010-03-15 Thread Michael Kuhlmann
On 03/15/10 08:56, Shalin Shekhar Mangar wrote:
 On Mon, Mar 15, 2010 at 2:12 AM, blargy zman...@hotmail.com wrote:
 

 How would I go about splitting a column by a certain delimiter AND ignore
 all
 empty matches.
[...]
 You will probably have to write a custom Transformer to remove empty values.
 See http://wiki.apache.org/solr/DIHCustomTransformer
 
Shouldn't a PatternTokenizerFactory combined with a LengthFilterFactory
do the job?

See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
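
An untested sketch, assuming a comma as the delimiter (the pattern and the
maximum length would have to be adjusted to your data):

<fieldType name="splitNonEmpty" class="solr.TextField">
  <analyzer>
    <!-- split the column on the delimiter -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
    <!-- drop the empty tokens -->
    <filter class="solr.LengthFilterFactory" min="1" max="10000"/>
  </analyzer>
</fieldType>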

Greetings,
Michael


Re: Switching cores dynamically

2010-03-19 Thread Michael Kuhlmann
On 03/19/10 11:18, muneeb wrote:
 
 Hi,
 
 I have indexed almost 7 million articles on two separate cores, each with
 their own conf/ and data/ folder, i.e. they have their individual index.
 
 What I normally do is, use core0 for querying and core1 for any updates and
 once updates are finished i copy the index of core1 to core0's data folder.
 I know this isn't an efficient way of doing this, since this brings a
 downtime on my search service for a couple of minutes. 

And I wonder why you need two cores at all. Why don't you just update the
one and only core? When you send a commit, Solr automatically makes
the changes public.

 
 I was wondering if its possible to switch between cores dynamically (keeping
 my current setup in mind) in such a way that there is no downtime at all
 during switching.

If you really want to stick with two cores, you could create
symbolic links to your data folders. You would then only need to remove and
recreate the links when you want to switch cores. You'd still have a
downtime, but only for a few milliseconds.



Re: Relevancy and random sorting

2012-01-12 Thread Michael Kuhlmann

Does the random sort function help you here?

http://lucene.apache.org/solr/api/org/apache/solr/schema/RandomSortField.html

However, you will then also get some very old listings, if that's okay for you.

-Kuli

Am 12.01.2012 14:38, schrieb Alexandre Rocco:

Erick,

This document already has a field that indicates the source (site).
The issue we are trying to solve is when we list all documents without any
specific criteria. Since we bring the most recent ones and the ones that
contains images, we end up having a lot of listings from a single site,
since the documents are indexed in batches from the same site. At some
point we have several documents from the same site in the same date/time
and having images. I'm trying to give some random aspect to this search so
other documents can also appear in between that big dataset from the same
source.
Does the grouping help to achieve this?

Alexandre

On Thu, Jan 12, 2012 at 12:31 AM, Erick Erickson erickerick...@gmail.com wrote:


Alexandre:

Have you thought about grouping? If you can analyze the incoming
documents and include a field such that similar documents map
to the same value, than group on that value you'll get output that
isn't dominated by repeated copies of the similar documents. It
depends, though, on being able to do a suitable mapping.

In your case, could the mapping just be the site from which you
got the data?

Best
Erick

On Wed, Jan 11, 2012 at 1:58 PM, Alexandre Rocco alel...@gmail.com
wrote:

Erick,

Probably I really written something silly. You are right on either

sorting

by field or ranking.
I just need to change the ranking to shift things around as you said.

To clarify the use case:
We have a listing aggregator that gets product listings from a lot of
different sites and since they are added in batches, sometimes you see a
lot of pages from the same source (site). We are working on some changes

to

shift things around and reduce this blocking effect, so we can present
mixed sources on the result pages.

I guess I will start with the document random field and later try to
develop a custom plugin to make things better.

Thanks for the pointers.

Regards,
Alexandre

On Wed, Jan 11, 2012 at 1:58 PM, Erick Erickson erickerick...@gmail.com
wrote:


I really don't understand what this means:
random sorting for the records but also preserving the ranking

Either you're sorting on rank or you're not. If you mean you're
trying to shift things around just a little bit, *mostly* respecting
relevance then I guess you can do what you're thinking.

You could create your own function query to do the boosting, see:
http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser

which would keep you from having to re-index your data to get
a different randomness.

You could also consider external file fields, but I think your
own function query would be cleaner. I don't think math.random
is a supported function OOB

Best
Erick


On Wed, Jan 11, 2012 at 8:29 AM, Alexandre Rocco alel...@gmail.com
wrote:

Hello all,

Recently i've been trying to tweak some aspects of relevancy in one

listing

project.
I need to give a higher score to newer documents and also boost the
document based on a boolean field that indicates the listing has

pictures.

On top of that, in some situations we need a random sorting for the

records

but also preserving the ranking.

I tried to combine some techniques described in the Solr Relevancy FAQ
wiki, but when I add the random sorting, the ranking gets messy (as
expected).

This works well:




http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%22haspicture%22&fl=*,score


This does not work, gives a random order on what is already ranked




http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%22haspicture%22&fl=*,score&sort=random_1+desc


The only way I see is to create another field on the schema

containing a

random value and use it to boost the document the same way that was

tone

on

the boolean field.
Anyone tried something like this before and knows some way to get it
working?

Thanks,
Alexandre










Re: java.net.SocketException: Too many open files

2012-01-24 Thread Michael Kuhlmann

Hi Jonty,

no, not really. When we first had such problems, we really thought that 
the number of open files is the problem, so we implemented an algorithm 
that performed an optimize from time to time to force a segment merge. 
Due to some misconfiguration, this ran too often. With the result that 
an optimize was issued before thje previous optimization was finished, 
which is a really bad idea.


We removed the optimization calls, and since then we didn't have any 
more problems.


However, I never found out the initial reason for the exception. Maybe 
there was some bug in Solr's 3.1 version - we're using 3.5 right now -, 
but I couldn't find a hint in the changelog.


At least we haven't had this exception for more than two months now, so 
I'm hoping that its cause has disappeared somehow.


Sorry that I can't help you more.

Greetings,
Kuli

On 24.01.2012 07:48, Jonty Rhods wrote:

Hi Kuli,

Did you get the solution of this problem? I am still facing this problem.
Please help me to overcome this problem.

regards


On Wed, Oct 26, 2011 at 1:16 PM, Michael Kuhlmann k...@solarier.de wrote:


Hi;

we have a similar problem here. We already raised the file ulimit on the
 server to 4096, but this only deferred the problem. We get a
TooManyOpenFilesException every few months.

The problem has nothing to do with real files. When we had the last
TooManyOpenFilesException, we investigated with netstat -a and saw that
there were about 3900 open sockets in Jetty.

Curiously, we only have one SolrServer instance per Solr client, and we
only have three clients (our running web servers).

We have set defaultMaxConnectionsPerHost to 20 and maxTotalConnections
to 100. There should be room enough.

 Sorry that I can't help you, we still have not solved the problem on
our own.

Greetings,
Kuli

Am 25.10.2011 22:03, schrieb Jonty Rhods:

Hi,

I am using solrj and for connection to server I am using instance of the
solr server:

 SolrServer server = new CommonsHttpSolrServer(
     "http://localhost:8080/solr/core0");

I noticed that after few minutes it start throwing exception
java.net.SocketException: Too many open files.
It seems that it related to instance of the HttpClient. How to resolved

the

instances to a certain no. Like connection pool in dbcp etc..

I am not experienced on java so please help to resolved this problem.

  solr version: 3.4

regards
Jonty










Re: Bad Request (Solr + Weblogic + Oracle DB)

2012-02-02 Thread Michael Kuhlmann

Hi rzao!

I think this is the problem:

On 02.02.2012 13:59, rzoao wrote:

UpdateRequest req = new UpdateRequest();

req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false,
false);
req.add(documento);



You create a commit request, but send a document with it - that won't 
work. Either you add documents, or you perform a commit, but you can't 
do both.


Remove the line with setAction(), send the document, and after that, 
call commit() directly on the SolrServer.
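
Roughly like this (just a sketch, assuming "server" is your SolrServer instance):

UpdateRequest req = new UpdateRequest();
req.add(documento);
req.process(server);   // sends the document without any commit action
server.commit();       // commit in a separate request

Or even simpler: server.add(documento); followed by server.commit();.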


If this doesn't help, then have a look into Weblogic's log files. You 
should find an exception there that helps you more.


-Kuli


Re: Help:Solr can't put all pdf files into index

2012-02-09 Thread Michael Kuhlmann
I'd suggest that you check which documents *exactly* are missing in the Solr 
index. Or find at least one that's missing, and try to figure out how 
this document differs from the other ones that can be found in Solr.


Maybe we can then find out what the exact problem is.

Greetings,
-Kuli

On 09.02.2012 16:37, Rong Kang wrote:


Yes, I put all file in one directory and I have tested file names using code.




At 2012-02-09 20:45:49, Jan Høydahl jan@cominvent.com wrote:

Hi,

Are you 100% sure that the filename is globally unique, since you use it as the 
uniqueKey?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 9. feb. 2012, at 08:30, 荣康 wrote:


Hey ,
I am using solr as my search engine to search my pdf files. I have 18219 
files(different file names) and all the files are in one same directory。But 
when I use solr to import the files into index using Dataimport method, solr 
report only import 17233 files. It's very strange. This problem has stoped out 
project for a few days. I can't handle it.


please help me!


Schema.xml


<fields>
   <field name="text" type="text" indexed="true" multiValued="true" termVectors="true"
       termPositions="true" termOffsets="true"/>
   <field name="filename" type="filenametext" indexed="true" required="true" termVectors="true"
       termPositions="true" termOffsets="true"/>
   <field name="id" type="string" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="filename" dest="text"/>


and
<dataConfig>
<dataSource type="BinFileDataSource" name="bin"/>
<document>
<entity name="f" processor="FileListEntityProcessor" recursive="true"
    rootEntity="false"
    dataSource="null" baseDir="H:/pdf/cls_1_16800_OCRed/1"
    fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)" onError="skip">

<entity name="tika-test" processor="TikaEntityProcessor"
    url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
<field column="text" name="text"/>
</entity>
<field column="file" name="id"/>
<field column="file" name="filename"/>
</entity>
</document>
</dataConfig>




sincerecly
Rong Kang









Re: Help:Solr can't put all pdf files into index

2012-02-09 Thread Michael Kuhlmann

I don't know much about Tika, but this seems to be a bug in PDFBox.

See: https://issues.apache.org/jira/browse/PDFBOX-797

You might also have a look at this: 
http://stackoverflow.com/questions/7489206/error-while-parsing-binary-files-mostly-pdf


At least that's what I found when I googled the NPE.

Greetings,
Kuli

On 09.02.2012 17:13, Rong Kang wrote:

I test one file that is missing in Solr index. And solr response as below

[...]


Exception in entity : 
tika-test:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable 
to read content Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:617)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.ParserDecorator$1@190725e
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
... 8 more
Caused by: java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.PDPageNode.getCount(PDPageNode.java:109)
at org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(PDDocument.java:943)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:108)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 10 more


I think this is because tika can't read the pdf file or this  pdf file's format 
has some error. But I can read this pdf file in Adobe Reader.
Regards,

Rong Kang


Re: sort my results alphabetically on facetnames

2012-02-14 Thread Michael Kuhlmann

Hi!

On 14.02.2012 13:09, PeterKerk wrote:

I want to sort my results on the facetnames (not by their number of results).


From the example you gave, I'd assume you don't want to sort by facet 
names but by facet values.


Simply add facet.sort=index to your request; see
http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort
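
For example (the field name is made up):
...&facet=true&facet.field=city&facet.sort=index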

Or simply sort the facet result on your own.

Greetings,
Kuli


Re: Too many open files - lots of sockets

2012-03-14 Thread Michael Kuhlmann

I had the same problem, without auto-commit.

I never really found out what exactly the reason was, but I think it was 
because commits were triggered before a previous commit had the chance 
to finish.


We now commit every minute or after 1000 (quite large) documents, 
whichever comes first. And we never optimize. We haven't had these 
exceptions for months now.


Good luck!
-Kuli

Am 14.03.2012 11:22, schrieb Colin Howe:

Currently using 3.4.0. We have autocommit enabled but we manually do
commits every 100 documents anyway... I can turn it off if you think that
might help.


Cheers,
Colin


On Wed, Mar 14, 2012 at 10:24 AM, Markus Jelsma
markus.jel...@openindex.io wrote:


Are you running trunk and have auto-commit enabled? Then disable
auto-commit. Even if you increase ulimits it will continue to swallow all
available file descriptors.


On Wed, 14 Mar 2012 10:13:55 +, Colin Howe co...@conversocial.com
wrote:


Hello,

We keep hitting the too many open files exception. Looking at lsof we have
a lot (several thousand) of entries like this:

java  19339 root 1619u sock0,70t0
  682291383 can't identify protocol


However, netstat -a doesn't show any of these.

Can anyone suggest a way to diagnose what these socket entries are? Happy
to post any more information as needed.


Cheers,
Colin



--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/**markus17http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350









Re: Sorting on non-stored field

2012-03-14 Thread Michael Kuhlmann

Am 14.03.2012 11:43, schrieb Finotti Simone:

I was wondering: is it possible to sort a Solr result-set on a non-stored value?


Yes, it is. It must be indexed, though.

-Kuli


Re: Too many open files - lots of sockets

2012-03-14 Thread Michael Kuhlmann

Ah, good to know! Thank you!

I already had Jetty under suspicion, but we had this failure quite often 
in October and November, when the bug was not yet reported.


-Kuli

Am 14.03.2012 12:08, schrieb Colin Howe:

After some more digging around I discovered that there was a bug reported
in jetty 6:  https://jira.codehaus.org/browse/JETTY-1458

This prompted me to upgrade to Jetty 7 and things look a bit more stable
now :)


Re: Master/Slave switch on teh fly. Replication

2012-03-16 Thread Michael Kuhlmann

Am 16.03.2012 15:05, schrieb stockii:

i have 8 cores ;-)

i thought that replication is defined in solrconfig.xml and this file is
only load on startup and i cannot change master to slave and slave to master
without restarting the servlet-container ?!?!?!


No, you can reload the whole core at any time, without interruption. 
Even with a new solrconfig.xml.


You can even add a new core at runtime, fill it with data and switch 
cores afterwards.


See http://wiki.apache.org/solr/CoreAdmin for details.

-Kuli


Re: Maybe switching to Solr Cores

2012-03-16 Thread Michael Kuhlmann

Am 16.03.2012 16:42, schrieb Mike Austin:

It seems that the biggest real-world advantage is the ability to control
core creation and replacement with no downtime.  The negative would be the
isolation however the are still somewhat isolated.  What other benefits and
common real-world situations would you use to talk me into switching to
Solr cores?


Different Solr cores already are quite isolated: They use different 
configs, different caches, different readers, different handlers...


In fact, there is not much in common between Solr cores apart from the 
solr.xml configuration.


One additional advantage is that cores have a smaller footprint in Tomcat 
than fully deployed Solr web applications.


I don't see a single drawback of multiple cores compared to multiple 
web apps


...except one, but that has nothing to do with Solr, only with the JVM 
itself: If you have a large hardware environment with lots of RAM, then it 
might be better to have multiple Tomcat instances running in different 
OS processes. The reason is Java's garbage collector, which works better 
with not-so-huge memory.


Sometimes it might be even better to have two or four replicated Solr 
instances in different Tomcat processes than just one. You'll avoid 
longer stop-the-world pauses with Java's GC as well.


However, this depends on the environment and needs to be evaluated as 
well...


-Kuli


Re: is the SolrJ call to add collection of documents a blocking function call ?

2012-03-20 Thread Michael Kuhlmann

Hi Ramdev,

add() is a blocking call. Otherwise it would have to start its own background 
thread, which is not what a library like Solrj should do (how many 
threads at most? At which priority? Which thread group? How long should they 
be kept pooled?)


And, additionally, you might want to know whether the transmission was 
successful, or whether your guinea pig has eaten the network cable just 
in the middle of the transmission.


But it's easy to write your own background task that adds your documents 
to the Solr server. Using Java's ExecutorService class, this is done 
within two minutes.
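
A minimal sketch of what I mean (the class and names are made up; error
handling is up to you):

import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AsyncIndexer {
    private final SolrServer server;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public AsyncIndexer(SolrServer server) {
        this.server = server;
    }

    public void addAsync(final Collection<SolrInputDocument> docs) {
        executor.submit(new Runnable() {
            public void run() {
                try {
                    server.add(docs);   // the blocking call now happens in the background
                } catch (Exception e) {
                    // log it - this is where you'd notice the eaten network cable
                }
            }
        });
    }
}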


Greetings,
Kuli

Am 19.03.2012 16:48, schrieb ramdev.wud...@thomsonreuters.com:

Hi:
I am trying to index a collection of SolrInputDocs to a Solr server. I was 
wondering if the call I make to add the documents (the 
 add(Collection<SolrInputDocument>) call ) is a blocking function call ?

I would also like to know if the add call is a call that would take longer for 
a larger collection of documents


Thanks

Ramdev





Re: RequestHandler versus SearchComponent

2012-03-23 Thread Michael Kuhlmann

Am 23.03.2012 10:29, schrieb Ahmet Arslan:

I'm looking at the following. I want
to (1) map some query fields to
some other query fields and add some things to FL, and then
(2)
rescore.

I can see how to do it as a RequestHandler that makes a
parser to get
the fields, or I could see making a SearchComponent that was
stuck
into the list just after the QueryComponent.

Anyone care to advise in the choice?


I would choose SearchComponent. I read somewhere that customizations are now 
better fit into SC rather than RH.



I would override QueryComponent and modify the normal query instead.

Adding an own SearchComponent after the regular QueryComponent (or 
better as a last-element) is goof when you simply want to modify the 
existing result. But since you want to rescore, you're likely interested 
in documents that fell already out of the original result list.


Greetings,
Kuli


Re: RequestHandler versus SearchComponent

2012-03-23 Thread Michael Kuhlmann

Am 23.03.2012 11:17, schrieb Michael Kuhlmann:

Adding an own SearchComponent after the regular QueryComponent (or
better as a last-element) is goof ...


Of course, I meant good, not goof! ;)



Greetings,
Kuli




Re: DIH NoClassFoundError.

2012-04-25 Thread Michael Kuhlmann

Am 25.04.2012 15:57, schrieb stockii:

is it not fucking possible to import DIH !?!?!? WTF!


It is fucking possible, you just need to either point your goddamn 
classpath to the data import handler jar in the contrib folders, or you 
have to add the appropriate contrib folder into the lib dir entries at 
the beginning of your motherfucking solrconfig.xml.


The pissed example already contains those libs. To stay with your wording...

-Kuli


Re: Boosting fields in SOLR using Solrj

2012-04-26 Thread Michael Kuhlmann

Am 26.04.2012 00:57, schrieb Joe:

Hi,

I'm using the solrj API to query my SOLR 3.6 index. I have multiple text
fields, which I would like to weight differently. From what I've read, I
should be able to do this using the dismax or edismax query types. I've
tried the following:

 SolrQuery query = new SolrQuery();
 query.setQuery("title:apples oranges content:apples oranges");
 query.setQueryType("edismax");
 query.set("qf", "title^10.0 content^1.0");
 QueryResponse rsp = m_Server.query(query);


Why do you try to construct your own query, when you're using an edismax 
query with a defined qf parameter?


What you're searching for is the text title:apples oranges content:apples 
oranges. Depending on your analyzer chain, it might be that title:apples 
and content:apples are kept as one token, so nothing is found because 
there's no such token in the index.


Why don't you simply query for apples oranges? That's what (e)dismax is 
made for. Have a deeper look at http://wiki.apache.org/solr/DisMax.
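
Something like this should be all you need (a sketch; whether you enable
edismax via the defType parameter or in a request handler depends on your
solrconfig.xml):

SolrQuery query = new SolrQuery();
query.setQuery("apples oranges");           // just the user input, no field prefixes
query.set("defType", "edismax");            // assumption: edismax selected per request
query.set("qf", "title^10.0 content^1.0");
QueryResponse rsp = m_Server.query(query);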


BTW, if you used the above query with the Lucene parser, it would look for 
"apples" in the title and content fields, but for "oranges" in your 
default search field. This is because you didn't quote "apples oranges". 
Since you want to use edismax, you can ignore this; it's just that your 
current query won't work as expected in both cases.


-Kuli


Re: Dynamic creation of cores for this use case.

2012-04-26 Thread Michael Kuhlmann

Am 26.04.2012 16:17, schrieb pprabhcisco123:

  The use case is to create a core for each customer as well as partner .
Since its very difficult to create cores statically in solr.xml file for all
4500 customers , is there any way to create the cores dynamically or on the
fly.


Yes there is. Have a look at: http://wiki.apache.org/solr/CoreAdmin#CREATE

I suggest setting the persistent flag in solr.xml to true.

I think all your cores will share the same configuration, so you can 
point all configuration directories to the same one and use unique 
data dirs.
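
For example (the name and paths are made up):

http://host:port/solr/admin/cores?action=CREATE&name=customer4711&instanceDir=/opt/solr/shared&dataDir=/opt/solr/data/customer4711

With persistent set to true in solr.xml, the newly created core also 
survives a restart.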


This should be relatively simple in theory. In practice, you might run into 
performance issues with such a configuration. It should be no big 
problem if at most a few hundred users work in parallel, but as soon as 
most cores are used all together, I predict you'll get bad performance.


Solr has no hard-coded limit on the number of cores, but each core 
has its own caches and readers. Depending on your machine configuration, 
this may be too much.


My suggestion is to try it out. It should work at first, and if you're 
hitting performance limits, then you can modify your configuration.


-Kuli


Re: Bridge between Solr and NoSQL

2012-05-08 Thread Michael Kuhlmann

Am 08.05.2012 04:13, schrieb Jeff Schmidt:

Francois:

Check out DataStax Enterprise 2.0, Solr integrated with Cassandra: 
http://www.datastax.com/docs/datastax_enterprise2.0/search/index

And, Solbase, Solr integrated with HBase: https://github.com/Photobucket/Solbase

I'm sure there are others, but these two come to mind.


I know of Solandra, Solr integrated with Cassandra: 
https://github.com/tjake/Solandra


In contrast to the DataStax solution, this is open source, but DataStax 
should be the better solution (at least regarding the performance).


Integrating Lucene with CouchDB was discussed here: 
http://lucene.472066.n3.nabble.com/Using-Solr-with-CouchDB-td762856.html

and a project is here: https://github.com/rnewson/couchdb-lucene

Greetings,
Kuli


On May 7, 2012, at 5:29 PM, Francois Perron wrote:


Hi all,

  I would like to know if there is some projects to integrate Solr with NoSQl 
like MongoDB.

They already had a link like this between ElasticSearch and CoughDB. (Cough 
River I think)

Thank you.


Re: Partition Question

2012-05-09 Thread Michael Kuhlmann

Am 08.05.2012 23:23, schrieb Lance Norskog:

Lucene does not support more 2^32 unique documents, so you need to
partition.


Just a small note:

I doubt that Solr supports more than 2^31 unique documents, like most 
other Java applications that use int values.


Greetings,
Kuli




Re: Field with attribut in the schema.xml ?

2012-05-10 Thread Michael Kuhlmann

Am 10.05.2012 14:33, schrieb Bruno Mannina:

like that:

<field name="inventor-country">CH</field>
<field name="inventor-country">FR</field>

but in this case I lose the link between inventor and its country?


Of course, you need to index the two inventors into two distinct documents.

Did you mark those fields as multi-valued? That won't make much sense IMHO.

Greetings,
Kuli


Re: Field with attribut in the schema.xml ?

2012-05-10 Thread Michael Kuhlmann
I don't know the details of your schema, but I would create fields like 
name, country, street etc., and a field named role, which contains 
values like inventor, applicant, etc.


How would you do it otherwise? Create only four documents, each field 
containing 80 million values?


Greetings,
Kuli

Am 10.05.2012 14:47, schrieb Bruno Mannina:

But I have more than 80 000 000 documents with many fields with this
kind of description?!

i.e:
inventor
applicant
assignee
attorney

I must create for each document 4 documents ??

Le 10/05/2012 14:41, G.Long a écrit :

When you add data into Solr, you add documents which contain fields.
In your case, you should create a document for each of your inventors
with every attribute they could have.

Here is an example in Java:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("inventor", "Rossi");
doc.addField("country", "FR");
solrServer.add(doc);
...
And then you do the same for all your inventors.

This way, each doc in your index represents one inventor and you can
query them like:
q=inventor:rossi AND country:FR

Le 10/05/2012 14:33, Bruno Mannina a écrit :

like that:

<field name="inventor-country">CH</field>
<field name="inventor-country">FR</field>

but in this case I lose the link between inventor and its country?

if I search an inventor named ROSSI with CH:
q=inventor:rossi and inventor-country=CH

the I will get this result but it's not correct because Rossi is FR.

Le 10/05/2012 14:28, G.Long a écrit :

Hi :)

You could just add a field called country and then add the
information to your document.

Regards,
Gary L.

Le 10/05/2012 14:25, Bruno Mannina a écrit :

Dear,

I can't find how can I define in my schema.xml a field with this
format?

My original format is:

<exch:inventors>

<exch:inventor>
<exch:inventor-name>
<name>WEBER WALTER</name>
</exch:inventor-name>
<residence>
<country>CH</country>
</residence>
</exch:inventor>

<exch:inventor>
<exch:inventor-name>
<name>ROSSI PASCAL</name>
</exch:inventor-name>
<residence>
<country>FR</country>
</residence>
</exch:inventor>

</exch:inventors>

I convert it to:
...
<field name="inventor">WEBER WALTER</field>
<field name="inventor">ROSSI PASCAL</field>
...

but how can I add Country code to the field without losing the link
between inventor?
Can I use an attribut ?

Any idea are welcome :)

Thanks,
Bruno Mannina
















Re: Identify indexed terms of document

2012-05-11 Thread Michael Kuhlmann

Am 10.05.2012 22:27, schrieb Ahmet Arslan:




It's possible to see what terms are indexed for a field of
document that
stored=false?


One way is to use http://wiki.apache.org/solr/LukeRequestHandler


Another approach is this:

- Query for exactly this document, e.g. by using the unique field
- Add this to your URL parameters:
facet=true&facet.field=<your field>&facet.mincount=1

-Kuli


Re: Question about cache

2012-05-11 Thread Michael Kuhlmann

Am 11.05.2012 15:48, schrieb Anderson vasconcelos:

Hi

Analysing the solr server in glassfish with Jconsole, the Heap Memory Usage
don't use more than 4 GB. But, when was executed the TOP comand, the free
memory in Operating system is only 200 MB. The physical memory is only 10GB.

Why machine used so much memory? The cache fields are included in Heap
Memory usage? The other 5,8 GB is the caching of Operating System for
recent open files? Exists some way to tunning this?

Thanks

If the OS is Linux or some other Unix variant, it keeps as much disk 
content in memory as possible. Whenever new memory is needed, it 
automatically gets freed. That doesn't take time, and there's no need to 
tune anything.


Don't look at the free memory in the top command; it's nearly useless. Have 
a look at how much memory your Glassfish process is consuming, and use 
the 'free' command (maybe together with the -m parameter for human 
readability) to find out more about your free memory. The 
"-/+ buffers/cache" line is the relevant one.

Greetings,
Kuli


Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Michael Kuhlmann

Am 14.05.2012 05:56, schrieb arjit:

Thanks Erick for the reply.
I have 6 cores which doesn't contain duplicated data. every core has some
unique data. What I thought was when I read it would read parallel 6 cores
and join the result and return the query. And this would be efficient then
reading one big core.


No, it's not. When you request 10 documents from Solr, it can't know in 
advance which shards contain how many of those documents. It could be that 
each shard only needs to contribute one or two documents to the result, but 
it might be that a single shard contains all ten documents. Therefore, 
Solr needs to request 10 documents from each shard, take only the 
top 10 documents from those 60, and drop the rest. And it gets worse 
when you set an offset of, say, 100.

Sharding is (nearly) always slower than using one big index with 
sufficient hardware resources. Only use sharding when your index is too 
huge to fit into one single machine.


Greetings,
Kuli


Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Michael Kuhlmann

Am 14.05.2012 13:22, schrieb Sami Siren:

Sharding is (nearly) always slower than using one big index with sufficient
hardware resources. Only use sharding when your index is too huge to fit
into one single machine.


If you're not constrained by CPU or IO, in other words have plenty of
CPU cores available together with for example separate hard discs for
each shard splitting your index into smaller shards can in some cases
make a huge difference in one box too.


Do you have an example?

This is hard to believe. If you have several shards on the same machine, 
you'll need so much memory that each shard has enough for all its 
caches and such. With that much memory, a single Solr core should be 
really fast.

If dividing the index is the reason, then a software RAID 0 (striping) 
should be much better.

The only point I see is the concurrent search for one request. Maybe, 
for large requests, this might outweigh the sharding overhead, but only 
for long-running requests without disk I/O. I only see that case when 
using very complicated query functions. And this only stays true as 
long as you don't run multiple concurrent requests.


Greetings,
Kuli


Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Michael Kuhlmann

Am 14.05.2012 16:18, schrieb Otis Gospodnetic:

Hi Kuli,

In a client engagement, I did see this (N shards on 1 beefy box with lots of 
RAM and CPU cores) be faster than 1 big index.



I want to believe you, but I also want to understand. Can you explain 
why? And did this only happen for single requests, or even under heavy load?


Greetings,
Kuli


Re: org.apache.solr.common.SolrException: ERROR: [doc=null] missing required field: id

2012-05-21 Thread Michael Kuhlmann

Am 21.05.2012 12:07, schrieb Tolga:

Hi,

I am getting this error:

[doc=null] missing required field: id


[...]


I've got this entry in schema.xml: field name=id type=string
stored=true indexed=true/
What to do?


Simply make sure that every document you're sending to Solr contains 
this id field.


I assume it's declared as your unique id field, so it's mandatory.

Greetings,
Kuli



Re: org.apache.solr.common.SolrException: ERROR: [doc=null] missing required field: id

2012-05-21 Thread Michael Kuhlmann

Am 21.05.2012 12:40, schrieb Tolga:

How do I verify it exists? I've been crawling the same site and it
wasn't giving an error on Thursday.


It depends on what you're doing.

Are you using nutch?

-Kuli


Re: Stopword filter - refreshing stop word list periodically

2011-10-14 Thread Michael Kuhlmann
Am 14.10.2011 15:10, schrieb Jithin:
 Hi,
 Is it possible to refresh the stop word list periodically say once in 6
 hours. Is this already supported in Solr or are there any work arounds.
 Kindly help me in understanding this.

Hi,

you can send a reload command to the core admin, assuming you're
running a multi-core environment (which I'd recommend anyway).

Simply add
curl "http://host:port/solr/admin/cores?action=RELOAD&core=corename"
to your /etc/crontab file, and set the leading time fields correspondingly.

-Kuli


Re: prefix search

2011-10-25 Thread Michael Kuhlmann
I think what Radha Krishna (is this really her name?) means is different:

She wants to return only the matching token instead of the complete
field value.

Indeed, this is not possible. But you could use highlighting
(http://wiki.apache.org/solr/HighlightingParameters), and then extract
the matching part on your own. This shouldn't be too complicated.
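
For example (the field name is made up), something like

q=t*&hl=true&hl.fl=name&hl.simple.pre=[&hl.simple.post=]

wraps the matching part in the pre/post markers; for prefix queries you 
might also need hl.highlightMultiTerm=true. You can then cut the match 
out on the client side.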

-Kuli

Am 25.10.2011 12:12, schrieb Alireza Salimi:
 That's because the phrases are being tokenized and then indexed by Solr.
 You have to define a new fieldType which is not tokenized.
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
 
 I'm not sure if it would solve your problem
 
 On Tue, Oct 25, 2011 at 5:46 AM, Radha Krishna Reddy 
 radhakrishn...@gmail.com wrote:
 
 Hi,

 when i indexed words like 'Joe Tom' and 'Terry'.When i do prefix query like
 q=t*,i get both 'Joe Tom' and Terry' as the results.But i want the result
 for the complete string that start with 'T'.means i want only 'Terry' as
 the
 result.

 Can i do this?

 Thanks and Regards,
 Radha Krishna.

 
 
 



Re: Query/Delete performance difference between straight HTTP and SolrJ

2011-10-26 Thread Michael Kuhlmann
Hi,

Am 25.10.2011 23:53, schrieb Shawn Heisey:
 On 10/20/2011 11:00 AM, Shawn Heisey wrote:
 [...] I've noticed a performance discrepancy when
 processing every one of my delete records, currently about 25000 of
 them.

I didn't understand what a delete record is. Do you delete records in
Solr? This shouldn't be done using records (what is a record in this
case? A document?); use a query for that.

Or do you add documents that you call delete records?

 I've managed to make this somewhat better by using multiple threads to
 do all the deletes on the six large static indexes at once, but that
 shouldn't be required.  The Perl version doesn't do them at the same time.

Are you sure? I don't know about the Perl client, but maybe it's doing
the network operation in the background?

In a single-threaded environment, the client has to wait when sending each
request until it has been completely sent to the server, doing nothing.
Multiple threads can help you a lot here.

You can check this when you monitor your client's cpu load.

 10:27  cedrichurst the only difference i could see is deserializing
 the java binary object

This is true, but only in theory. Serializing and deserializing is so
fast that this shouldn't have any noticeable impact.

If you really want to be sure, use a SolrInputDocument instead of
annotated classes when sending documents, but as I said, this shouldn't
matter much.

What's more important: Don't send single documents but rather use
add(Collection) with multiple documents at once. At least when I
understood you correctly that you want to send 25000 documents for update...
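
A sketch of what I mean (SolrJ 3.x; MyBean and the field names are made up,
use whatever your own data class provides):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchUpdater {
      // send the documents in chunks instead of one request per document
      public static void addAll(SolrServer server, List<MyBean> beans) throws Exception {
          List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
          for (MyBean bean : beans) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", bean.getId());
              doc.addField("title", bean.getTitle());
              docs.add(doc);
              if (docs.size() >= 1000) {   // one request per 1000 documents
                  server.add(docs);
                  docs.clear();
              }
          }
          if (!docs.isEmpty()) {
              server.add(docs);
          }
          server.commit();
      }
  }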


-Kuli


Re: java.net.SocketException: Too many open files

2011-10-26 Thread Michael Kuhlmann
Hi;

we have a similar problem here. We already raised the file ulimit on the
server to 4096, but this only deferred the problem. We get a
TooManyOpenFilesException every few months.

The problem has nothing to do with real files. When we had the last
TooManyOpenFilesException, we investigated with netstat -a and saw that
there were about 3900 open sockets in Jetty.

Curiously, we only have one SolrServer instance per Solr client, and we
only have three clients (our running web servers).

We have set defaultMaxConnectionsPerHost to 20 and maxTotalConnections
to 100. There should be room enough.
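
For reference, that configuration roughly looks like this (a sketch for
SolrJ 3.x; the URL is an example) - the important part is to create the
instance once and share it for all requests:

  // org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
  CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr/core0");
  server.setDefaultMaxConnectionsPerHost(20);  // connections per Solr host
  server.setMaxTotalConnections(100);          // connections in total
  // reuse this single instance; don't create a new one per request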

Sorry that I can't help you; we still have not solved the problem on
our own.

Greetings,
Kuli

Am 25.10.2011 22:03, schrieb Jonty Rhods:
 Hi,
 
 I am using solrj and for connection to server I am using instance of the
 solr server:
 
 SolrServer server = new CommonsHttpSolrServer(
 "http://localhost:8080/solr/core0");
 
 I noticed that after few minutes it start throwing exception
 java.net.SocketException: Too many open files.
 It seems that it related to instance of the HttpClient. How to resolved the
 instances to a certain no. Like connection pool in dbcp etc..
 
 I am not experienced on java so please help to resolved this problem.
 
  solr version: 3.4
 
 regards
 Jonty
 



Re: Query/Delete performance difference between straight HTTP and SolrJ

2011-10-27 Thread Michael Kuhlmann
Am 26.10.2011 18:29, schrieb Shawn Heisey:
 For inserting, I do use a Collection of SolrInputDocuments.  The delete
 process grabs values from idx_delete, does a query like the above (the
 part that's slow in Java), then if any documents are found, issues a
 deleteByQuery with the same string.

Why do you first query for these documents? Why don't you just delete
them? Solr won't mind if no documents are affected by your delete query,
and you'll get the number of affected documents in your response anyway.

When deleting, Solrj nearly does nothing on its own, it just sends the
POST request and analyzes the simple response. The behaviour in a get
request is similar. We do thousands of update, delete and get requests
in a minute using Solrj without problems; your timing problems must come
from somewhere else.
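
In SolrJ that is just the following (a sketch; the field name and the ids in
the query string are made up):

  // delete directly, no need to check for matching documents first
  server.deleteByQuery("did:(123 OR 456 OR 789)");
  server.commit();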

-Kuli


Re: Query/Delete performance difference between straight HTTP and SolrJ

2011-10-27 Thread Michael Kuhlmann
Sorry, I was wrong.

Am 27.10.2011 09:36, schrieb Michael Kuhlmann:
 and you'll get the number of affected documents in your response anyway.

That's not true, you don't get the affected document count. Anyway, it's
still true that you don't need to check for documents first, at least
not when you don't need this information somewhere else.

-Kuli


Re: Always return total number of documents

2011-10-28 Thread Michael Kuhlmann
Am 28.10.2011 11:16, schrieb Robert Brown:
 Is there no way to return the total number of docs as part of a search?

No, there isn't. Usually this information is of absolutely no value to the
end user.

A workaround would be to add some field to the schema that has the same
value for every document, and use this for facetting.

Greetings,
Kuli


Re: Is SQL Like operator feature available in Apache Solr query

2011-11-01 Thread Michael Kuhlmann

Hi,

this is not exactly true. In Solr, you can't have a wildcard on both 
sides of the search term.


However, you can tokenize your fields and then simply query for the word 
Solr. This is what Solr is made for. :)


-Kuli

Am 01.11.2011 13:24, schrieb François Schiettecatte:

Arshad

Actually it is available, you need to use the ReversedWildcardFilterFactory 
which I am sure you can Google for.

Solr and SQL address different problem sets with some overlaps but there are 
significant differences between the two technologies. Actually '%Solr%' is a 
worse case for SQL but handled quite elegantly in Solr.

Hope this helps!

Cheers

François


On Nov 1, 2011, at 7:46 AM, arshad ansari wrote:


Hi,

Is SQL Like operator feature available in Apache Solr Just like we have it
in SQL.

SQL example below -

*Select * from Employee where employee_name like '%Solr%'*

If not is it a Bug with Solr. If this feature available, please tell the
examples available.

Thanks!

--
Best Regards,
Arshad






Re: Is SQL Like operator feature available in Apache Solr query

2011-11-01 Thread Michael Kuhlmann

Am 01.11.2011 16:06, schrieb Erick Erickson:

NGrams are often used in Solr for this case, but they will also add to
your index size.

It might be worthwhile to look closely at your user requirements
before going ahead
and supporting this functionality

Best
Erick


My opinion. Wildcards are good for peeking into the index, i.e. for 
checking data in the browser. I haven't yet found a real life use case 
for them.


-Kuli


Re: representing latlontype in pojo

2011-11-09 Thread Michael Kuhlmann

Am 08.11.2011 23:38, schrieb Cam Bazz:

How can I store a 2d point and index it to a field type that is
latlontype, if I am using solrj?


Simply use a String field. The format is $latitude,$longitude.

-Kuli



Re: Solr 3.3 Sorting is not working for long fields

2011-11-14 Thread Michael Kuhlmann

Am 14.11.2011 09:33, schrieb rajini maski:

query :
http://localhost:8091/Group/select/?indent=on&q=studyid:120&sort=studyidasc,groupid
asc,subjectid asc&start=0&rows=10


Is it a copy-and-paste error, or did you really sort on studyidasc?

I don't think you have a field studyidasc, and Solr should've given an 
exception that either asc or desc is missing.


-Kuli


Re: two word phrase search using dismax

2011-11-15 Thread Michael Kuhlmann

Am 14.11.2011 21:50, schrieb alx...@aim.com:

Hello,

I use solr3.4 and nutch 1.3. In request handler we have
<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

As fas as I know this means that for two word phrase search match must be 100%.
However, I noticed that in most cases documents with both words are ranked 
around 20 place.
In the first places are documents with one of the words in the phrase.

Any ideas why this happening and is it possible to fix it?


Hi,

are you sure that only one of the words matched in the found documents? 
Have you checked all fields that are listed in the qf parameter? And did 
you check for stemmed versions of your search terms?


If all this is true, you maybe want to give an example.

And AFAIK the mm parameter does not affect the ranking.



Re: creating solr index from nutch segments, no errors, no results

2011-11-15 Thread Michael Kuhlmann
I don't know much about nutch, but it looks like there's simply a commit 
missing at the end.


Try to send a commit, e.g  by executing

curl http://host:port/solr/core/update -H "Content-Type: text/xml" --data-binary '<commit/>'


-Kuli

Am 15.11.2011 09:11, schrieb Armin Schleicher:

hi there,

[...]


Re: Solr 3.3 Sorting is not working for long fields

2011-11-15 Thread Michael Kuhlmann

Hi,

Am 15.11.2011 10:25, schrieb rajini maski:

 <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
omitNorms="true" positionIncrementGap="0"/>


[...]


 <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
omitNorms="true" positionIncrementGap="0"/>


[...]


<field name="studyid" type="long" indexed="true" stored="true"/>


Hmm, why didn't you just change the field type to tlong as you 
mentioned before? Instead you changed the class of the long type. 
There's nothing against this, it's just a bit confusing since long 
fields normally are of type solr.LongField, which is not sortable on its 
own.


You specified a precisionStep of 0, which means that the field would be 
slow in range queries, but it shouldn't harm for sorting. All in all, it 
should work.


So, the only chance I see is to re-index once again (and commit after 
that). I don't really see an error in your config except the confusing 
long type. It should work after reindexing, and it can't work if it 
was indexed with a genuine long type.


-Kuli


Re: Problems installing Solr PHP extension

2011-11-16 Thread Michael Kuhlmann

Am 16.11.2011 17:11, schrieb Travis Low:


If I can't solve this problem then we'll basically have to write our own
PHP Solr client, which would royally suck.


Oh, if you really can't get the library to work, no problem - there are 
several PHP clients out there that don't need a PECL installation.


Personally, I have used http://code.google.com/p/solr-php-client/, it 
works well.


-Kuli


Re: Add copyTo Field without re-indexing?

2011-11-16 Thread Michael Kuhlmann

Am 17.11.2011 08:46, schrieb Kashif Khan:

Please advise how we can reindex SOLR with having fields stored=false. we
can not reindex data from the beginning just want to read and write indexes
from the SOLRJ only. Please advise a solution. I know we can do it using
lucene classes using indexreader and indexwriter but want to index all
fields


This is not possible. At least not when the index is modified in any way 
(stemmed, lowercased, tokenized, etc.).


The original data is not saved when stored is false. You'll need your 
original source data to reindex then.


-Kuli


Re: Aggregated indexing of updating RSS feeds

2011-11-17 Thread Michael Kuhlmann

Am 17.11.2011 11:53, schrieb sbarriba:

The 'params' logging pointer was what I needed. So for reference its not a
good idea to use a 'wget' command directly in a crontab.
I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false


:))

I think the shell handled the ampersand as an instruction to put the wget command 
into the background.


You could put the full URL into quotes, or escape the ampersands with 
backslashes. Then it should work as well.
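
For example, either of these should behave as intended (same URL as above):

  wget "http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false"
  wget http://localhost/solr/myfeed?command=full-import\&rows=5000\&clean=false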


-Kuli


Re: PatternTokenizer failure

2011-11-29 Thread Michael Kuhlmann

Am 29.11.2011 15:20, schrieb Erick Erickson:

Hmmm, I tried this in straight Java, no Solr/Lucene involved and the
behavior I'm seeing is that no example works if it has more than
one whitespace character after the hyphen, including your failure
example.

I haven't lived inside regexes for long enough that I don't know what
the right regex should be, but it doesn't appear to be a Solr problem


Jay,
I think the problem is this:

You're checking whether the character preceding the array of at least 
one whitespace is not a hyphen.


However, when you've more than one whitespace, like this:
foo- \n bar
then there's another run of whitespace - \n  - which is preceded by 
the first whitespace -  .


Therefore, you'll need to not only check for preceding hyphens, but also 
for preceding whitespaces.


I'll leave this as an exercise for you. ;)

-Kuli


Re: Best practise to automatically change a field value for a specific period of time

2011-12-02 Thread Michael Kuhlmann

Hi Mark,

I'm sure you can manage this using function queries somehow, but this is 
rather complicated, esp. if you both want to return the price and sort 
on it.


I'd rather update the index as soon as a campaign starts or ends. At 
least that's how we did it when I worked for online shops. Normally this 
isn't a matter of seconds, and you would need to update Solr anyway when 
you create such a campaign.


As a benefit, you're not limited in the number of running campaigns (at 
least not on the Solr side). Maybe you want to plan a campaign when the 
current one hasn't ended yet, which would be (nearly) impossible when 
you calculate the price at query time.


Greetings,
Kuli

Am 02.12.2011 12:21, schrieb Mark Schoy:

Hi,

I have an solr index for an online shop with a field price which
contains the standard price of a product.
But in the database, the shop owner can specify a period of time with
an alternative price.

For example: standard price is $20.00, but 12/24/11 08:00am to
12/26/11 11:59pm = $12.59

Of course I could use an cronjob to updating the documents. But I
think this is too unstable.
I also could save all price campaigns in a field an then extracting
the correct price. But then I could not sort by price or only by the
standard price.

What I need is an field where I can put a condition like that: if
[current_time between one of the price campains] then [return price of
price campaign]. But (unfortunately) this is not possible.

Thanks for advice.




Re: SolR for time-series data

2011-12-05 Thread Michael Kuhlmann

Hi Alan,

Solr can do this fast and easily, but I wonder if a simple key-value store 
wouldn't suit your needs better.


Do you really only need to query by chart_id, or do you also need to 
query by time range?


In either case, as long as your data fits into an in-memory database, I 
would suggest Redis to you. It's easy to install and use, and it's fast 
as hell.


If you want to query by time ranges, you can use lists and query them by 
range using lrange (http://www.redis.io/commands/lrange), at least when 
you know the first timestamp and the steps are even. Or use a sorted 
set, and make sure that the values differ.
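
A sketch of the sorted set variant (redis-cli syntax; the key name and the
trick of prefixing each member with its timestamp to keep members unique are
just examples):

  ZADD chart:42 1322900000 "1322900000:3.14"
  ZADD chart:42 1322900060 "1322900060:2.71"
  ZRANGEBYSCORE chart:42 1322900000 1322986400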


In my opinion, Solr has too many features that you don't need.

-Kuli

Am 03.12.2011 18:10, schrieb Alan Miller:

Hi,

I have a webapp that plots a bunch of time series data which
is just a series of doubles coupled with a timestamp.

Every chart in my webapp has a chart_id in my db and i am wondering if it
would be
effective to usr solr to serve the data to my app instead of keeping the
data in my rdbms.

Currently I'm using hadoop to calc and generate the report data and the
sticking it in my
rdbms but I could use solrj client to upload the data to a solr index
directly.

I know solr if for indexing text documents but would it be effective to use
solr in this way?

I want to query by chart_id and get back a series of timestamp:double pairs.

Regards
Alan





Re: Replication not done for real on commit?

2011-12-05 Thread Michael Kuhlmann

Am 05.12.2011 14:28, schrieb Per Steffensen:

Hi

Reading http://wiki.apache.org/solr/SolrReplication I notice the
pollInterval (guess it should have been pullInterval) on the slaves.
That indicate to me that indexed information is not really pushed from
master to slave(s) on events defined by replicateAfter (e.g. commit),
but that it only will be made available for pulling by the slaves at
those events. So even though I run with a master with
replicateAfter=commit, I am not sure that I will be able to query a
document that I have just indexed from one of the slaves immediately
after having done the indexing on the master - I will have to wait
pollInterval (+ time for replication). Can anyone confirm that this is
a correct interpretation, or explain how to understand pollInterval if
it is not?


This is totally correct.



I want to acheive this always-in-sync property between master and slaves
(primary and replica if you like). What is the easiest way? Will I just
have to make sure myself that indexing goes on directly on all replica
of a shard, and then drop using the replication explained on
http://wiki.apache.org/solr/SolrReplication?


When committing, Solr will need some time (at least some microseconds, 
maybe much more) to update your changes into its index. In the 
meantime, the existing index readers will still work on the old index 
state, without the new changes. Therefore you'll surely fail when you rely on a 
committed index state immediately after your commit command, even 
without any replication on a single machine.


Why do you need such a feature? I don't think that there's a way to make 
Solr behave like this.


-Kuli


Re: Solr response writer

2011-12-07 Thread Michael Kuhlmann

Am 07.12.2011 14:26, schrieb Finotti Simone:

That's the scenario:
I have an XML that maps words W to URLs; when a search request is issued by my 
web client, a query will be issued to my Solr application. If, after stemming, 
the query matches any in W, the client must be redirected to the associated URL.

I agree that it should be handled outside, but we are currently on progress of 
migrating from Endeca, and it has a feature that allow this scenario. For this 
reason, my boss asked if it was somehow possible to leave that functionality in 
the search engine.


Of course, your customers will never directly connect to your Solr 
server. They instead connect to your web application, which is itself a 
client to Solr.


Therefore, it's useless to return redirect response codes directly from 
Solr, since your customers' browsers will never get them.


Instead, you should handle Solr responses in your web application 
individually, and redirect your customers then.
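
One possible sketch (SolrJ plus a servlet; the redirect_url field is made up
and assumes you index your word-to-URL mapping into Solr as well):

  import javax.servlet.http.HttpServletResponse;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocumentList;

  public void search(SolrServer solr, String userQuery, HttpServletResponse httpResponse) throws Exception {
      QueryResponse rsp = solr.query(new SolrQuery(userQuery));
      SolrDocumentList results = rsp.getResults();
      if (!results.isEmpty() && results.get(0).getFieldValue("redirect_url") != null) {
          // the query matched one of the redirectable words
          httpResponse.sendRedirect((String) results.get(0).getFieldValue("redirect_url"));
      } else {
          // render the regular product result page from "results"
      }
  }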


-Kuli


Re: R: Solr response writer

2011-12-07 Thread Michael Kuhlmann

Am 07.12.2011 15:09, schrieb Finotti Simone:

I got your and Michael's point. Indeed, I'm not very skilled in web devolpment 
so there may be something that I'm missing. Anyway, Endeca does something like 
this:

1. accept a query
2. does the stemming;
3. check if the result of the step 2. matches one of the redirectable words. If 
so, returns an URL, otherwise returns the regular matching documents (our 
products' description).

Do you think that in Solr I will be able to replicate this behaviour without 
writing a custom plugin (request handler, response writer, etc)? Maybe I'm a 
little dense, but I fail to see how it would be possible...


Endeca is not only a search engine, it's part of a web application. You 
can send a query to the Endeca engine and send the response directly to 
the user; it's already fully rendered. (At least when you configured it 
this way.)


Solr can't do this in any way. Solr responses are always pure technical 
data, not meant to be delivered to an end user. An exception to this is 
the VelocityResponseWriter which can fill a web template.


Anything beyond the possibilities of the VelocityResponseWriter must be 
handled by some web application that analyzes Solr's responses.


How do you want to display your product descriptions in the default case? 
I don't think you want to show raw XML data.


Solr is a great search engine, but nothing more. It covers just a small subset of 
what commercial search frameworks like Endeca do. Therefore, you can't simply 
swap Endeca for Solr alone; you'll need some web application as well.


However, you don't need a custom response writer in this case, nor do 
you have to extend Solr in any way. At least not for this requirement.


-Kuli


Re: Copying few field using copyField to non multiValued field

2011-06-15 Thread Michael Kuhlmann
In addition to Bob's response:

Am 15.06.2011 13:59, schrieb Omri Cohen:
[...]
<field name="at_location" type="text" indexed="index"
 stored="true" required="false" />
<field name="at_country" type="text" indexed="index"
 stored="true" required="false" />
<field name="at_city" type="text" indexed="index"
 stored="true" required="false" />
<field name="at_state" type="text" indexed="index"
 stored="true" required="false" />
1. The value for indexed should either be true or false, but not
index.
2. Why do you set all fields to be stored, when they are copied anyway?

Greetings,
Kuli



Re: Copying few field using copyField to non multiValued field

2011-06-16 Thread Michael Kuhlmann
Hi Omri,

there are two limitations:
1. You can't sort on a multiValued field. (Anyway, on which of the
copied fields would you want to sort first?)
2. You can't make the multiValued field the unique key.

Both are no real limitations:
1. Better sort on at_country, at_state, at_city instead.
2. Simply choose another unique key field. (Your location wouldn't be
unique anyway.)

Greetings,
Kuli

Am 16.06.2011 06:40, schrieb Omri Cohen:
 I just don't want to suffer all the limitation a multiValued field has.. (it
 does have some limitations, doesn't it?) I just remember I read somewhere
 that it does.




Re: MultiValued facet behavior question

2011-06-22 Thread Michael Kuhlmann
Am 22.06.2011 05:37, schrieb Bill Bell:
 It can get more complicated. Here is another example:
 
 q=cardiologydefType=dismaxqf=specialties
 
 
 (Cardiology and cardiologist are stems)...
 
 But I don't really know which value in Cardiologist match perfectly.
 
 Again, I only want it to return:
 
 Cardiologist: 3

You would never get Cardiologist: 3 as the facet result, because if
Cardiologist were in your index, it would be impossible to find it when
searching for cardiology (except when you manage to write some strange
tokenizer that translates cardiology to Cardiologist at query time,
including the upper case letter).

Facets are always taken from the index, so when you query for a facet
value, it either matches exactly or not at all.

-Kuli


Re: MultiValued facet behavior question

2011-06-22 Thread Michael Kuhlmann
Am 22.06.2011 09:49, schrieb Bill Bell:
 You can type q=cardiology and match on cardiologist. If stemming did not
 work you can just add a synonym:
 
 cardiology,cardiologist

Okay, synonyms are the only way I can think of a realistic match.

Stemming won't work on a facet field; you wouldn't get Cardiologist: 3
as the result but cardiolog: 3 or something like that instead.

Normally, you declare a facet field explicitly for faceting, and not
for searching, exactly because stemming and tokenizing on facet fields
don't make sense.

And the short answer is: No, that's not possible.

-Kuli


Re: Inconsistent search results

2011-06-27 Thread Michael Kuhlmann
Am 27.06.2011 15:56, schrieb Jihed Amine Maaref:
 - normalizedContents:(EDOUAR* AND une) doesn't return anything

This was discussed few days ago:

http://lucene.472066.n3.nabble.com/Conflict-in-wildcard-query-and-spellchecker-in-solr-search-tt3095198.html

 - normalizedContents:(edouar* AND un) returns the result (although there's
 no un word)
 - normalizedContents:(edouar* AND uned) returns the result (although there's
 no uned word)

text fields are stemmed (solr.SnowballPorterFilterFactory does this,
have a look at your schema.xml). So, both une and uned are stemmed to un.

-Kuli


Re: Include synonys in solr

2011-06-28 Thread Michael Kuhlmann
Am 28.06.2011 09:24, schrieb Romi:
 But as i suppose it would be very hard to include synonyms manually for each
 word as my application has large data.
 
 I want to know is there any way that this synonym.text file generate
 automatically referring to all dictionary words

I don't get the point here. Why should you want to add all dictionary
words to the synonyms? To what shall they translate? Just having all
words in synonyms.txt doesn't make much sense.

If you're asking about some kind of translation into another language:
In that case, you'd rather translate the text at index time and put it
into another field which you query as well.

In my last project, we had multi-valued fields like meta_description
and misspelled, where you could add arbitrary synonyms for each
document - maybe that's what you're asking for?

-Kuli


Re: Regex replacement not working!

2011-06-29 Thread Michael Kuhlmann
Am 29.06.2011 12:30, schrieb samuele.mattiuzzo:
 <fieldType name="salary_min_text" class="solr.TextField">
   <analyzer type="index">
...

 this is the final version of my schema part, but what i get is this:
 
 
 <doc>
 <float name="score">1.0</float>
 <str name="salary">Negotiable</str>
 <str name="salary_max">Negotiable</str>
 <str name="salary_min">Negotiable</str>
 </doc>
...


The mistake is that you assume that the filter applies to the stored result.
This is not true. Index-time filters only affect the indexed terms (as the name
says), not the stored contents.

Therefore, if you have copyFields that are stored, they'll always return
the same value as the original field.

Try inspecting your index data with luke or the admin console. Then
you'll see whether your regex applies.

Greetings,
Kuli


Re: updating existing data in index vs inserting new data in index

2011-07-07 Thread Michael Kuhlmann
Am 07.07.2011 16:14, schrieb Bob Sandiford:
 [...] (Without the optimize, 'deleted' records still show up in query 
 results...)

No, that's not true. The terms remain in the index, but the document
won't show up any more.

Optimize is only for performance (and disk space) optimization, as the
name suggests.

-Kuli


Re: updating existing data in index vs inserting new data in index

2011-07-07 Thread Michael Kuhlmann
Am 07.07.2011 16:52, schrieb Mark juszczec:
 Ok.  That's really good to know because optimization of that kind will be
 important.

Optimization is only important if you had a lot of deletes or updated
docs, or if you want your segments to get merged. (At least that's what I
know about it.)
 
 What of commit?  Does it somehow remove the previous version of an updated
 record?

Somehow, yes. If you don't commit, your changes won't be visible, and
the old documents remain unchanged. Physically they stay in the index
and are purged on optimize, but that's just an implementation detail.

-Kuli


Re: Average PDF index time

2011-07-12 Thread Michael Kuhlmann
Am 12.07.2011 12:03, schrieb alexander sulz:
 Still, why the PHP stops working correctly is beyond me, but it seems to
 be fixed now.

You should mind the max_execution_time parameter in your php.ini.
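
For example (a sketch; pick a limit that fits your indexing job):

  max_execution_time = 300

or call set_time_limit(0) at the top of the import script.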

Greetings,
Kuli


Re: Result list order in case of ties

2011-07-12 Thread Michael Kuhlmann
Am 12.07.2011 12:13, schrieb Lox:
 Hi,
 
 In the case where two or more documents are returned with the same score, is
 there a way to tell Solr to sort them alphabetically?

Yes, add the parameter

sort=score desc,your_field_that_shall_be_sorted_alphabetically asc

to your request.

Greetings,
Kuli


Re: Can I still search documents once updated?

2011-07-13 Thread Michael Kuhlmann
Am 13.07.2011 14:05, schrieb Gabriele Kahlout:
 this is what i was expecting. Otherwise updating a field of a document that
 has an unstored but indexed field is impossible (without losing the unstored
 but indexed field. I call this updating a field of a document AND
 deleting/updating all its unstored but indexed fields).

Not necessarily. The usual use case is that you have some kind of
existing data source from where you fill your Solr index. When you want
to update a field of a document, then you simply re-index from that
source. There's no need to fetch data from Solr before.

Otherwise, if you really don't have such an existing data source because
a horde of typewriting monkeys filled your Solr index, then you should
better declare all your fields as stored. Otherwise you'll never have a
chance to get that data back.

Greeting,
Kuli


Re: Can I still search documents once updated?

2011-07-13 Thread Michael Kuhlmann
Am 13.07.2011 15:37, schrieb Gabriele Kahlout:
 Well, I'm !sure how usual this scenario would be:
 1. In general those using solr with nutch don't store the content field to
 avoid storing the whole web/intranet in their index, twice (1 in the form of
 stored data, and one in the form of indexed data).
 

Not exactly. The indexed form is quite different from the stored form;
only the tokens are stored, each token only once, and some additional
data like the document count and, maybe, shingle information etc..

Hence, indexed data usually needs much less space on disk than the
original data.

There's no practical alternative to storing the content in a stored
field. What would you otherwise display as a search result? The
following web pages have your search term somewhere in their contents,
don't know where, take a look on your own?

Greetings,
Kuli


Re: Can I still search documents once updated?

2011-07-13 Thread Michael Kuhlmann
Am 13.07.2011 16:09, schrieb Gabriele Kahlout:
 Solr is already configured by default not to store more than a
 maxFieldLength anyway. Usually one stores content only to display
 snippets.

Yes, but the snippets must come from somewhere.

For instance, if you're using Solr's highlighting feature, all
highlighted fields must be stored.

See http://www.intellog.com/blog/?p=208 for explanation from someone
else. ;)

Greetings,
Kuli


LockObtainFailedException and open finalizing IndexWriters

2011-07-18 Thread Michael Kuhlmann
Hi,

we are running Solr 3.2.0 on Jetty for a web application. Since we just
went online and are still in beta tests, we don't have very much load on
our servers (indeed, they're currently much oversized for the current
usage), and our index size on file system is just 1.1 MB.

We have one dedicated Solr instance for updates, and two replicated
read-only servers for requests. The update server gets filled by three
different Java web servers, each has a distinct Quartz job for its
updates. Every such Quartz job takes all collected updates, sends them
via Solrj's addBeans() method, and from time to time, they send an
additional commit() after that. Each update job has a
CommonHTTPSolrServer instance, which is a Spring controlled singleton.

We already had LockObtainFailedExceptions before, raising every few
days. Sometimes, we had such an exception before:
org.apache.solr.common.SolrException: java.io.IOException: directory
'/data/solr/data/index' exists and is a directory, but cannot be listed:
list() returned null

This looks as if there were no more file handles from the operating
system. This is strange, since the only index directory never had more
than 100 files, if ever. However, we raised ulimit -n from 1024 to 4096,
and reduced mergeFactor from 10 to 5, which at first helped us with our
problem. Until yesterday.

Again, we had this:
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@solr/main/data/index/write.lock
	at org.apache.lucene.store.Lock.obtain(Lock.java:84)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1114)
	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
	at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:101)
	at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:174)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:222)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
	...


When we deleted the write.lock file without restarting Solr, several
hours later we had 441 of these log entries:

Jul 18, 2011 7:20:29 AM org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a
bug -- POSSIBLE RESOURCE LEAK!!!

Wow, if there really were 441 open IndexWriters trying to access the
index directory, it's no wonder that there will be Lock timeouts sooner
or later! However, I have no clue why there are so many IndexWriters
opened and never closed. The only accessing Solr instances are pure Java
applications using Solrj. Each application only has one SolrServer
instance - and even if not, this shouldn't harm, AFAIK. The update job
is started every five seconds. The installation is a pure 3.2.0 Solr,
without additional jars. And all jars are of the correct revision. The
solrconfig.xml is based on the example configuration, with nothing
special. We currently don't have any own extensions running. There is
absolutely only one jetty instance running on the machine. And I checked
the solr.xml, it's only one core defined, and we don't do any additional
core administration.

I'm using Solr since the beginning of 2010, but never had such a
problem. Any help is welcome.

Greetings,
Kuli


Re: Logically equivalent queries but vastly different no of results?

2011-07-22 Thread Michael Kuhlmann
Am 22.07.2011 14:27, schrieb cnyee:
 I think I know what it is. The second query has higher scores than the first.
 
 The additional condition domain_ids:(0^1.3 OR 1) which evaluates to true
 always - pushed up the scores and allows a LOT more records to pass.

This can't be, because the score doesn't affect the document count. Only
relative score values are relevant, not absolute ones.

Another guess: What's your query handler? If it's dismax, then the
reason is quite obvious.

-Kuli


Re: How many <doc></doc> in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

There is no hard limit for the maximum number of documents per update.

It's only memory dependent. The smaller each document, and the more 
memory Solr can acquire, the more documents can you send in one update.


However, I wouldn't pish it too jard anyway. If you can send, say, 100 
documents per update, then you won't gain much if you send 200 documents 
instead, or even 1000. The number of requests doesn't count that much.


And, if the update fails for some reason, then the whole request will be 
ignored. If you had sent 1000 documents in an update, and one of them 
had a field missing, for example, then it's hard to find out which one.


Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of <doc></doc> ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of <doc></doc>.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<add>
<doc></doc>
</add>

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno

















Re: How many <doc></doc> in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

pish it too jard - sounds funny. :)

I meant push it too hard.

Am 24.05.2012 11:46, schrieb Michael Kuhlmann:

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more
memory Solr can acquire, the more documents can you send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, the you won't gain much if you send 200 documents
instead, or even 1000. The number of requests don't count that much.

And, if the update fails for some reason, then the whole request will be
ignored. If you had sent 1000 documents in an update, and one of them
had a field missing, for example, then it's hard to find out which one.

Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of <doc></doc> ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of <doc></doc>.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<add>
<doc></doc>
</add>

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno



















Re: How many <doc></doc> in the XML source file before indexing?

2012-05-24 Thread Michael Kuhlmann

Just try it!

Maybe you're lucky, and it works with 80M docs. If each document takes 
100 bytes, then it only needs about 8 GB of memory for indexing.


However, I doubt it. I've not been too deeply into the UpdateHandler 
yet, but I think it first needs to parse the complete XML file before it 
starts to index.


But the worst thing that can happen is an OOM exception. And when you 
need to split the xml files, then you can split into smaller chunks as well.
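
Once you have split the data into several files, you could post them one by
one, for example like this (a sketch; host, port and file names are examples):

  for f in chunk_*.xml; do
    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @$f
  done
  curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<commit/>'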


Just a note: In Solr, you're always updating, even in the first 
indexation. There's no difference between updates and inserts.


Greetings,
Michael

Am 24.05.2012 12:37, schrieb Bruno Mannina:

In fact it's not for an update but only for the first indexation.

I mean, I will receive the full database with around 80M docs in some
XML files (one per country in the world).
 From these 80M docs I will generate right XML format for each doc. (I
don't need all fields from the source)

And as actually for my test (12 000 docs), I generate one file per doc,
there is no problem.
But with 80M docs I can't generate one file per doc.

It's for this reason I asked the max number of doc in a file add.

For the first time, if a country file fails, no problem, I will check it
and re-generate it.

Is it bad to create a file with 5M doc ?


Le 24/05/2012 11:46, Michael Kuhlmann a écrit :

There is no hard limit for the maximum nunmber of documents per update.

It's only memory dependent. The smaller each document, and the more
memory Solr can acquire, the more documents can you send in one update.

However, I wouldn't pish it too jard anyway. If you can send, say, 100
documents per update, the you won't gain much if you send 200
documents instead, or even 1000. The number of requests don't count
that much.

And, if the update fails for some reason, then the whole request will
be ignored. If you had sent 1000 documents in an update, and one of
them had a field missing, for example, then it's hard to find out
which one.

Greetings,
Michael

Am 24.05.2012 10:58, schrieb Bruno Mannina:

I can't find my answer concerning the max number of <doc></doc> ?

Can someone can tell me if there is no limit?

Le 24/05/2012 09:55, Bruno Mannina a écrit :

Sorry I just found : http://wiki.apache.org/solr/UpdateXmlMessages

I will take also a look to find the max number of <doc></doc>.

Le 24/05/2012 09:51, Paul Libbrecht a écrit :

Bruno,
see the solrconfig.xml, you have all sorts of tweaks for this kind of
things.

paul


Le 24 mai 2012 à 09:49, Bruno Mannina a écrit :


Hi All,

Just a little question concerning the max number of

<add>
<doc></doc>
</add>

that I can write in the xml source file before indexing? only one,
10, 100, 1000, unlimited...?

I must indexed 80M docs so I can't create one xml file by doc.

thanks,
Bruno























Re: Query elevation / boosting or something else to guarantee document position

2012-05-31 Thread Michael Kuhlmann

Hi Wenca,

I'm a bit late. but maybe you're still interested.

There's no such functionality in standard Solr. With sorting, this is 
not possible, because sort functions only rank each single document, 
they know nothing about the position of the others. And query elevation 
is similar, you'll raise the score of independent documents.


To achive this, you'll need an own QueryComponent. This isn't too 
complicated. You can't change the SolrIndexSearcher easily, this does 
the search job. But you can subclass 
org.apache.solr.handler.component.QueryComponent and overwrite 
process(). Alas the single main line - searcher.search() - is buried 
deeply in the huge monster method process(), and you first have to check 
for shards, grouping and twenty thousand other parameters until you've 
arrived at the code line you may want to expand.


Before calling search(), set the GET_DOCSET flag in your QueryCommand 
object, then execute the search. To check whether there's a document of 
the particular manufacturer in the result list, you can either
a) fetch the appropriate field value from the default field cache for 
every single result document until you find one; or
b) call getDocSet() on the SolrIndexSearcher with the manufacturer query 
as the parameter, and perform and and() operation on the resulting 
DocSet with the DocSet of your main query. (That's why you set the flag 
before.) You can then check which document that matches both the 
manufacturer and the main query fits best.


If you found a matching document, but it's behind pos. 5 in the 
resulting DocList, then you simply have to re-order your list.


If there's no such document within the DocList (which is limited by your 
rows parameter), but there are some in the joined DocSet from strategy 
b), then you can simply choose one of them and ignore the fact that this 
is probably not the best matching one. Or you have to patch Solr and 
modify getDocListNC() in SolrIndexSearcher (or one of the Collector 
classes), which is much more complicated.
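
A very rough skeleton of such a component (just a sketch, not working code;
the manufacturer field and value are made up, and the actual re-ordering of
the DocList is left out):

  import java.io.IOException;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.handler.component.QueryComponent;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.search.DocSet;
  import org.apache.solr.search.SolrIndexSearcher;

  public class PreferredManufacturerComponent extends QueryComponent {

      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
          super.prepare(rb);
          rb.setNeedDocSet(true);   // same effect as setting GET_DOCSET yourself
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
          super.process(rb);
          SolrIndexSearcher searcher = rb.req.getSearcher();
          DocSet manufacturer = searcher.getDocSet(new TermQuery(new Term("manufacturer", "acme")));
          DocSet matching = rb.getResults().docSet.intersection(manufacturer);
          // now check whether one of the first 5 ids in rb.getResults().docList
          // is contained in "matching"; if not, pick the best matching id and
          // build a re-ordered DocList (omitted here)
      }
  }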


Good luck!
-Kuli

Am 29.05.2012 14:26, schrieb Wenca:

Hi all,

I have an index with thousands of products with various fields
(manufacturer, price, popularity, type, color, ...) and I want to
guarantee at least one product by a particular manufacturer to be within
the first 5 results.

The search is done mainly by using filter params and results are ordered
by function e.g.: product(price, popularity) asc or by discount desc

And I need to guarantee that if there is any product matching the given
filters made by a concrete manufacturer, then it will be on the 5th
position at worst, even if the position by the order function is worse.

It seems to me that the Query elevation component is not the right thing
for me. I don't know the query in advance (or the set of filter
criteria) and I don't know concrete product that will be the best for
the criteria within the order.

And also I don't think that I can construct a function with such
requirements to use it directly for ordering the results.

Of course I can make a second query in case there is no desired product
on the first page of results and put it there, but it requires
additional request to solr and complicates results processing and
further pagination.

Can anybody suggest any solution?

Thanks
Wenca




Re: ERROR 400 undefined field

2012-06-07 Thread Michael Kuhlmann

Am 07.06.2012 09:55, schrieb sheethal shreedhar:

http://localhost:8983/solr/select/?q=fruit&version=2.2&start=0&rows=10&indent=on

I get

HTTP ERROR 400

Problem accessing /solr/select/. Reason:

 undefined field text


Look at your schema.xml. You'll find a line like this:

<defaultSearchField>text</defaultSearchField>

Replace text with a field that is defined somewhere in your schema.xml.

Or change your query to something with a field name like this:

http://localhost:8983/solr/select/?q=somefield:fruit

Or use the (e)dismax handler and configure it accordingly. See 
http://wiki.apache.org/solr/DisMaxRequestHandler.


Greetings,
Kuli


Re: timeAllowed flag in the response

2012-06-08 Thread Michael Kuhlmann

Hi Laurent,

alas there is currently no such option. The time limit is handled by an 
internal TimeLimitingCollector, which is used inside SolrIndexSearcher. 
Since the calling method only returns the DocList and doesn't have access 
to the QueryResult, it won't be easy to return this information in a 
beautiful way.


Aborted queries don't feed the caches, so you could maybe check whether 
the cache fill rate has changed. Of course, this is no reasonable 
approach in a production environment.


The only way you can get the information is by patching Solr with a 
dirty hack.


Greetings,
Kuli

Am 07.06.2012 22:14, schrieb Laurent Vaills:

Hi everyone,

We have some grouping queries that are quite long to execute. Some are too
long to execute and are not acceptable. We have setup timeout for the
socket but with this we get no result and the query is still running on the
Solr side.
So, we are now using the timeAllowed parameter which is a good compromise.
However, in the response, how can we know that the query was stopped
because it was too long ?

I need this information for monitoring and to tell the user that the
results are not complete.

Regards,
Laurent




