Re: performance crossover between single index and sharding

2011-08-04 Thread Bernd Fehling

Hi Shawn,

the 0.05 seconds for search time at peak times (3 qps) is my target for Solr.
The numbers for Solr are from Solr's statistics report page. So 39.5 seconds
average per request is definitely too long and I have to change to sharding.

For FAST system the numbers for the search dispatcher are:
 0.042 sec elapsed per normal search, on avg.
 0.053 sec average uncached normal search time (last 100 queries).
 99.898% of searches using < 1 sec
 99.999% of searches using < 3 sec
 0.000% of all requests timed out
 22454567.577 sec time up (that is 259 days)

Is there a report page for those numbers for Solr?

About the RAM: the 32GB RAM are physical for each VM and the 20GB RAM are the
-Xmx for Java.
Yesterday I noticed that we are running out of heap during replication, so I
have to increase -Xmx to about 22g.

The reported 0.6 average requests per second seems right to me because
the Solr system isn't under full load yet. The FAST system is still taking
most of the load. I plan to switch completely to Solr after sharding is up and
running stably. So there will be an additional 3 qps to Solr at peak times.

I don't know if a controlling master like FAST's makes any sense for Solr.
The small VMs with heartbeat and haproxy sound great; that must go on my todo list.

But the biggest problem currently is how to configure the DIH to split up the
content across several indexers. Is there an indexing distributor?

Regards,
Bernd


On 03.08.2011 16:33, Shawn Heisey wrote:

Replies inline.

On 8/3/2011 2:24 AM, Bernd Fehling wrote:

To show that I am comparing apples and oranges, here is my previous FAST Search
setup:
- one master server (controlling, logging, search dispatcher)
- six index servers (4.25 million docs per server, 5 slices per index)
(searching and indexing at the same time, indexing once per week during the
weekend)
- each server has 4GB RAM, all servers are physical on separate machines
- RAM usage controlled by the processes
- total of 25.5 million docs (mainly metadata) from 1500 databases worldwide
- index size is about 67GB per indexer -- about 402GB total
- about 3 qps at peak times
- with average search time of 0.05 seconds at peak times


An average query time of 50 milliseconds isn't too bad. If the number from your
Solr setup below (39.5) is the QTime, then Solr thinks it is
performing better, but Solr's QTime does not include absolutely everything that
has to happen. Do you by chance have 95th and 99th percentile
query times for either system?


And here is now my current Solr setup:
- one master server (indexing only)
- two slave servers (search only), but only one is online; the second is a fallback
- each server has 32GB RAM, all servers are virtual
(master on a separate physical machine, both slaves together on one physical
machine)
- RAM usage is currently 20GB for the Java heap
- total of 31 million docs (all metadata) from 2000 databases worldwide
- index size is 156GB total
- search handler statistics report 0.6 average requests per second
- average time per request 39.5 (is that seconds?)
- building the index from scratch takes about 20 hours


I can't tell whether you mean that each physical host has 32GB or each VM has 
32GB. You want to be sure that you are not oversubscribing your
memory. If you can get more memory in your machines, you really should. Do you 
know whether that 0.6 seconds is most of the delay that a user
sees when making a search request, or are there other things going on that 
contribute more delay? In our webapp, the Solr request time is
usually small compared with everything else the server and the user's browser 
are doing to render the results page. As much as I hate being the
tall pole in the tent, I look forward to the day when the developers can change 
that balance.


The good thing is I have the ability to compare a commercial product and
enterprise system to open source.

I started with my simple Solr setup because of KISS (keep it simple,
stupid).
Actually it is doing excellently as a single index on a single virtual server.
But the average time per request should be reduced now; that's why I started
this discussion.
While searches with a smaller Solr index (3 million docs) showed that it can
stand up to FAST Search, it now shows that it's time to go with sharding.
I think we are already far past the point of the search performance crossover.

What I hope to get with sharding:
- reduce time for building the index
- reduce average time per request


You will probably achieve both of these things by sharding, especially if you
have a lot of CPU cores available. Like mine, your query volume is
very low, so the CPU cores are better utilized by distributing the search.


What I fear with sharding:
- I currently have master/slave; do I then have e.g. 3 masters and 3 slaves?
- the query changes because of sharding (is there a search distributor?)
- how to distribute the content to the indexers with DIH on 3 servers?
- anything else to think about while changing to sharding?


I 

Re: segment.gen file is not replicated

2011-08-04 Thread Bernd Fehling


I have now updated to Solr 3.3 but segment.gen is still not replicated.

Any idea why? Is it a bug or a feature?
Should I file a Jira issue for it?

Regards
Bernd

On 29.07.2011 14:10, Bernd Fehling wrote:

Dear list,

is there a deeper logic behind why the segment.gen file is not
replicated with solr 3.2?

Is it obsolete because I have a single segment?

Regards,
Bernd



Re: Update some fields for all documents: LUCENE-1879 vs. ParallelReader .FilterIndex

2011-08-04 Thread karsten-solr
Hi Erick,

thanks a lot!
This looks like a good idea:
Our queries with the changeable fields fit the join idea from
https://issues.apache.org/jira/browse/SOLR-2272
because
 - we do not need relevance ranking
 - we can separate the query into a conjunction of a query on the changeable
fields and a query on our other stable fields
So we can use something like
q=stablefields:query1&fq={!join from=changeable_fields_doc_id
to=stable_fields_doc_id}changeablefields:query2

The only drawback of the ParallelReader solution is that our stored fields
and term vectors will be split across two Lucene docs, which is OK in our
use case.

Best regards
  Karsten

in context:
http://lucene.472066.n3.nabble.com/Update-some-fields-for-all-documents-LUCENE-1879-vs-ParallelReader-amp-FilterIndex-td3215398.html

 Original Message 
 Date: Wed, 3 Aug 2011 22:11:08 -0400
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Update some fields for all documents: LUCENE-1879 vs. 
 ParallelReader .FilterIndex

 Hmmm, the only thing that comes to mind is the join feature being added
 to Solr 4.x, but I confess I'm not entirely familiar with that functionality,
 so I can't tell if it really solves your problem.
 
 Other than that I'm out of ideas, but then again it's late and I'm tired, so
 maybe I'm not being very creative <G>...
 
 Best
 Erick
 On Aug 3, 2011 11:40 AM, karsten-s...@gmx.de wrote:


Re: segment.gen file is not replicated

2011-08-04 Thread Michael McCandless
This file is actually optional; it's there for redundancy in case the
filesystem is not reliable when listing a directory.  I.e., normally,
we list the directory to find the latest segments_N file; but if this
is wrong (e.g. the file system might have a stale cache) then we
fall back to reading the segments.gen file.

For example this is sometimes needed for NFS.

Likely replication is just skipping it?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:

 I have now updated to Solr 3.3 but segment.gen is still not replicated.

 Any idea why? Is it a bug or a feature?
 Should I file a Jira issue for it?

 Regards
 Bernd

 On 29.07.2011 14:10, Bernd Fehling wrote:

 Dear list,

 is there a deeper logic behind why the segment.gen file is not
 replicated with solr 3.2?

 Is it obsolete because I have a single segment?

 Regards,
 Bernd




Re: A rant about field collapsing

2011-08-04 Thread Martijn v Groningen
The development of the field collapse feature is a long and confusing story.
The main point is that SOLR-236 was never going to scale
and the performance in general was bad. A new approach was needed.
This was implemented in SOLR-1682 and added to the trunk (4.0-dev)
around September last year. Later in LUCENE-1421 the code was moved
from Solr to Lucene as a module / contrib. After that the grouping module
and contrib were wired into Solr 3.3 and 4.0-dev in SOLR-2524 and SOLR-2564.

Field collapsing is not gone; it is just one form of result grouping. The core
SOLR-236 feature is in 3.3 / 4.0-dev. Other features that SOLR-236 offered
will eventually get in, like for example post-grouping facets.
The HTTP parameters and response have changed without keeping them compatible
with the SOLR-236 patches. I think that isn't a problem, since SOLR-236 was
never a committed feature. But a widely used feature should never be left
attached as a patch to a Jira issue for 3+ years.
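
For readers landing on this thread later, a sketch of the committed grouping
request syntax in 3.3 (the host and field name here are illustrative, not from
this thread):

  http://localhost:8983/solr/select?q=*:*&group=true&group.field=manu_exact&group.limit=5

The response then contains a <lst name="grouped"> section instead of the old
SOLR-236 collapse format.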

On 3 August 2011 18:33, baronDodd barond...@googlemail.com wrote:

 I am working on an implementation of search within our application using
 solr.

 About 2 months ago we had the need to group results by a certain field.
 After some searching I came across the JIRA in progress for this - field
 collapsing: https://issues.apache.org/jira/browse/SOLR-236

 It was scheduled for the next Solr release and had a full set of proper JIRA
 subtasks and patch files of almost complete implementations attached. So as
 you can imagine, I was happy to apply this patch, build it into our
 application, and await the next release when it would be part of the main
 trunk.

 Now imagine my surprise when we came around to upgrading, to see that
 suddenly field collapsing has been thrown away in favour of a totally
 different grouping implementation:
 https://issues.apache.org/jira/browse/SOLR-2524

 How was it decided that this would be used instead? It was not made very
 clear that LUCENE-1421 was in progress, which would effectively make the
 field collapsing work irrelevant by fixing the problem in Lucene rather
 than primarily in Solr. This has cost me days of work to now merge our
 custom changes somehow into the new implementation. I guess it is my own
 fault for basing our custom changes on an unresolved enhancement, but as
 SOLR-236 had been 3-4 years in progress and SOLR-2524 did not exist at the
 time, it seemed pretty safe to assume that the same problem was not being
 fixed in 2 totally different ways!

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/A-rant-about-field-collapsing-tp3222798p3222798.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Met vriendelijke groet,

Martijn van Groningen


Re: Solr 3.3 crashes after ~18 hours?

2011-08-04 Thread alexander sulz

Thank you for the many replies!

Like I said, I couldn't find anything in the logs created by Solr.
I just had a look at /var/log/messages and there wasn't anything there
either.


What I mean by crash is that the process is still there and HTTP GET 
pings would return 200,
but when I try visiting /solr/admin, I'd get a blank page! The server 
ignores any incoming updates or commits,
thus throwing no errors, no 503s. It's like the server has a blackout 
and stares blankly into space.


I just allocated more memory as proposed and will keep an eye on 
it to see if the problem persists.


Thank you guys, you are awesome.


On 02.08.2011 15:23, François Schiettecatte wrote:

Assuming you are running on Linux, you might want to check /var/log/messages 
too (the location might vary); I think the kernel logs forced process 
terminations there. I recall that the kernel will usually pick the process 
consuming the most memory, though there may be other factors involved too.

François

On Aug 2, 2011, at 9:04 AM, wakemaster 39 wrote:


Monitor your memory usage.  I used to encounter a problem like this before,
where nothing was in the logs and the process was just gone.

Turned out my system was out of memory and swap got used up because of
another process, which then forced the kernel to start killing off processes.
Google OOM linux and you will find plenty of other programs and people with
a similar problem.

Cameron
On Aug 2, 2011 6:02 AM, alexander sulz a.s...@digiconcept.net wrote:

Hello folks,

I'm using the latest stable Solr release, 3.3, and I encounter strange
phenomena with it.
After about 19 hours it just crashes, but I can't find anything in the
logs: no exceptions, no warnings,
no suspicious info entries.

I have an index-job running from 6am to 8pm every 10 minutes. After each
job there is a commit.
An optimize-job is done twice a day at 12:15pm and 9:15pm.

Does anyone have an idea what could possibly be wrong or where to look
for further debug info?

regards and thank you
alex




Re: segment.gen file is not replicated

2011-08-04 Thread Bernd Fehling



On 04.08.2011 12:52, Michael McCandless wrote:

This file is actually optional; it's there for redundancy in case the
filesystem is not reliable when listing a directory.  I.e., normally,
we list the directory to find the latest segments_N file; but if this
is wrong (e.g. the file system might have a stale cache) then we
fall back to reading the segments.gen file.

For example this is sometimes needed for NFS.

Likely replication is just skipping it?


That was my first idea: if it is not changed or touched, then it will be skipped.

Trying to be smart, I deleted it from the index dir on the slave and then
replicated, but segment.gen was not replicated.
Going by your explanation, NFS could then not be reliable any more.

So my guess is that it is either a bug or a feature, and the experts will know :-)

Regards
Bernd





Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-04 Thread thomas
Concerning the downtime, we found a solution that works well for us. We
already implemented an update mechanism so that when authors change
some content in the CMS, the index entry for that piece of content gets
updated (deleted, then indexed again) as well.

All we had to do is:
1. Change the schema.xml to support the PhoneticFilter in certain field types
(see the sketch after this list)
2. Write a script that finds all individual content items
3. Start the update mechanism for each piece of content, one after another.
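
A sketch of what step 1 could look like; the field type, tokenizer and encoder
here are assumptions for illustration, not the poster's actual schema (for
German content a different encoder may fit better):

<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject="true" keeps the original token next to its phonetic code -->
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  </analyzer>
</fieldType>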

So the index slowly migrates from the old to the new phonetic state without
any noticeable downtime for users using the search function. It's just that
they get somewhat mixed results for the duration of the transition. Sure, it
needs some time, but we can have CMS users working with content the whole
time. If they create or update content during the transition, it will be
indexed or reindexed following the new schema.xml anyway.

If we need to roll back, we just replace the schema.xml with the old version
and start the update process again.

So far this is working, thanks for your support!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/German-language-specific-problem-automatic-Spelling-correction-automatic-Synonyms-tp3216278p3225223.html
Sent from the Solr - User mailing list archive at Nabble.com.


Unbuffered entity enclosing request can not be repeated Invalid chunk header

2011-08-04 Thread Vadim Kisselmann
Hello folks,

I use Solr 1.4.1, and every 2 to 6 hours I get indexing errors in my log
files.

on the client side:
2011-08-04 12:01:18,966 ERROR [Worker-242] IndexServiceImpl - Indexing
failed with SolrServerException.
Details: org.apache.commons.httpclient.ProtocolException: Unbuffered entity
enclosing request can not be repeated.:
Stacktrace: 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:469)
.
.
on the server side:
INFO: [] webapp=/solr path=/update params={wt=javabin&version=1} status=0
QTime=3
04.08.2011 12:01:18 org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {} 0 0
04.08.2011 12:01:18 org.apache.solr.common.SolrException log
SCHWERWIEGEND: org.apache.solr.common.SolrException: java.io.IOException:
Invalid chunk header
.
.
.
I'm indexing ONE document per call, 15-20 documents per second, 24/7.
What may be the problem?

best regards
vadim


Re: performance crossover between single index and sharding

2011-08-04 Thread Peter Keegan
We have 16 shards on 4 physical servers. Shard size was determined by
measuring query response times as a function of doc count. Multiple shards
per server provides parallelism. In a VM environment, I would lean towards 1
shard per VM (with 1/4 the RAM). We implemented our own distributed search
(pre-Solr) and the extra sort/merge processing is not a performance issue.

Peter


On Tue, Aug 2, 2011 at 2:35 PM, Burton-West, Tom tburt...@umich.edu wrote:

 Hi Jonathan and Markus,

 Why 3 shards on one machine instead of one larger shard per machine?

 Good question!

 We made this architectural decision several years ago and I'm not
 remembering the rationale at the moment. I believe we originally made the
 decision due to some tests showing a sweetspot for I/O performance for
 shards with 500,000-600,000 documents, but those tests were made before we
 implemented CommonGrams and when we were still using attached storage.  I
 think we also might have had concerns about Java OOM errors with a really
 large shard/index, but we now know that we can keep memory usage under
 control by tweaking the amount of the terms index that gets read into
 memory.

 We should probably do some tests and revisit the question.

 The reason we don't have 12 shards on 12 machines is that current
 performance is good enough that we can't justify buying 8 more machines:)

 Tom

 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Tuesday, August 02, 2011 2:12 PM
 To: solr-user@lucene.apache.org
 Subject: Re: performance crossover between single index and sharding

 Hi Tom,

 Very interesting indeed! But I keep wondering why some engineers choose to
 store multiple shards of the same index on the same machine; there must be
 significant overhead. The only reason I can think of is ease of maintenance
 in moving shards to a separate physical machine.
 I know that rearranging the shard topology can be a real pain in a large
 existing cluster (e.g. consistent hashing is not consistent anymore and
 having to shuffle docs to their new shards); is this the reason you chose
 this approach?

 Cheers,



RE: performance crossover between single index and sharding

2011-08-04 Thread Bob Sandiford
Dumb question time - you are using a 64 bit Java, and not a 32 bit Java?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com



Re: performance crossover between single index and sharding

2011-08-04 Thread Bernd Fehling


java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)

java: file format elf64-x86-64

Including the -d64 switch.


On 04.08.2011 14:40, Bob Sandiford wrote:

Dumb question time - you are using a 64 bit Java, and not a 32 bit Java?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com




Re: A rant about field collapsing

2011-08-04 Thread baronDodd
Ok, thank you very much for clearing that up a little. I think another reason
I was confused was that the wiki page for grouping was based around the
original field collapsing plan at the time, which led me to the Jira and
hence the patch files. Rant over!

Perhaps you can help clarify whether the current grouping changes work with
SolrJ? In QueryResponse.setResponse() there is a loop which builds up the
results object, but it has no check at present for grouped in the NamedList,
so the SolrJ client gets no results back when searching with grouping
parameters. I assume I can add this in my local working copy and all will be
well?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/A-rant-about-field-collapsing-tp3222798p3225361.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: A rant about field collapsing

2011-08-04 Thread Martijn v Groningen
Well, the original page moved to:
http://wiki.apache.org/solr/FieldCollapsingUncommitted

Assuming that you're using Solr 3.3, you can't get the grouped result
(<lst name="grouped">) with SolrJ.
I added grouping support to SolrJ some time ago and it will be in Solr 3.4.
You can use a nightly 3.x build to get the grouping support now.
You can also use the group.main=true option, which returns a response that is
compatible with the normal search response.
However, you can then only use one group command per request (group.field,
group.func or group.query). Also, there were
some bugs with this response format in Solr 3.3 that have been fixed and
will be included when Solr 3.4 is released.
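
For illustration, such a request might look like this (host, core and field
names are made up):

  http://localhost:8983/solr/select?q=*:*&group=true&group.field=category&group.main=true

With group.main=true the top document of each group comes back in the ordinary
<result> list, so an older SolrJ client can parse it unchanged.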

Martijn

On 4 August 2011 15:24, baronDodd barond...@googlemail.com wrote:

 Ok, thank you very much for clearing that up a little. I think another
 reason
 I was confused was that the wiki page for grouping was based around the
 original field collapsing plan at the time, which led me to the Jira and
 hence the patch files. Rant over!

 Perhaps you can help clarify whether the current grouping changes work with
 SolrJ? In QueryResponse.setResponse() there is a loop which builds up the
 results object, but it has no check at present for grouped in the NamedList,
 so the SolrJ client gets no results back when searching with grouping
 parameters. I assume I can add this in my local working copy and all will
 be
 well?

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/A-rant-about-field-collapsing-tp3222798p3225361.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Met vriendelijke groet,

Martijn van Groningen


Re: Solr 3.3 crashes after ~18 hours?

2011-08-04 Thread Yonik Seeley
On Thu, Aug 4, 2011 at 8:09 AM, alexander sulz a.s...@digiconcept.net wrote:
 Thank you for the many replies!

 Like I said, I couldn't find anything in logs created by solr.
 I just had a look at the /var/logs/messages and there wasn't anything
 either.

 What I mean by crash is that the process is still there and HTTP GET pings
 would return 200,
 but when I try visiting /solr/admin, I'd get a blank page! The server
 ignores any incoming updates or commits,

ignores means what?  The request hangs?  If so, could you get a thread dump?
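
(For reference, one way to grab a thread dump, assuming a Sun/Oracle JDK and
that you know the Solr process id:

  jstack <solr-pid> > /tmp/solr-threads.txt

or send SIGQUIT with kill -3 <solr-pid>, which makes the JVM print the dump
to its stdout.)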

Do queries work (like /solr/select?q=*:*) ?

 thus throwing no errors, no 503s. It's like the server has a blackout and
 stares blankly into space.

Are you using a different servlet container than what is shipped with solr?
If you did start with the solr example server, what jetty
configuration changes have you made?

-Yonik
http://www.lucidimagination.com


RE: Strategies for sorting by array, when you can't sort by array?

2011-08-04 Thread Olson, Ron
For anyone who comes across this topic in the future, I solved the problem 
this way: by agreement with the stakeholders, on the presumption that no one 
would look at more than 5000 records, I modified my search code so that, if the 
user selected to sort by name, I set the row count to return 
(query.setRows) to 5000. I then put all the result records into a list, sort 
it, and then, depending on what page they're on, extract that subset of the 5000 
and return it.
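
A minimal SolrJ sketch of that approach; the field name, the comparator, and
the server/pageStart/pageSize variables are illustrative assumptions, not
Ron's actual code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

    SolrQuery query = new SolrQuery("name:smith");
    query.setRows(5000);  // cap agreed with the stakeholders
    QueryResponse rsp = server.query(query);

    // Copy the results and sort them app-side, since Solr cannot
    // sort on a multi-valued field.
    List<SolrDocument> docs = new ArrayList<SolrDocument>(rsp.getResults());
    Collections.sort(docs, new Comparator<SolrDocument>() {
        public int compare(SolrDocument a, SolrDocument b) {
            return String.valueOf(a.getFirstValue("name"))
                    .compareTo(String.valueOf(b.getFirstValue("name")));
        }
    });

    // Return only the page the user asked for.
    int from = Math.min(pageStart, docs.size());
    int to = Math.min(pageStart + pageSize, docs.size());
    List<SolrDocument> page = docs.subList(from, to);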

There is a small performance hit on initial searching for common names (e.g. 
Smith, Jones, etc.), but the performance is still far more acceptable than the 
legacy system Solr is meant to replace (a few seconds as opposed to twenty(!) 
minutes).

Most certainly there are better ways, but this one worked for me, and I wanted 
to make sure it was added to the pool of options for anyone who comes across 
this problem in the future.

Thanks to everyone who offered suggestions!

Ron

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, August 03, 2011 11:36 AM
To: solr-user@lucene.apache.org
Cc: Olson, Ron
Subject: Re: Strategies for sorting by array, when you can't sort by array?

Not so much that it's a corner case in the sense of being unusual 
necessarily (I'm not sure); it's just something that fundamentally 
doesn't fit well into Lucene's architecture.

I'm not sure that filing a JIRA will be of much use; it's really unclear 
how one would get Lucene to do this, it would be significant work to do, 
and it's unlikely any Solr developer is going to decide to spend 
significant time on it unless they need it for their own clients.

On 8/3/2011 11:40 AM, Olson, Ron wrote:
 *Sigh*...I had thought maybe reversing it would work, but that would require 
 creating a whole new index, on a separate core, as the existing index is used 
 for other purposes. Plus, given the volume of data, that would be a big deal, 
 update-wise. What would be better would be to remove that particular sort 
 option-button on the webpage. ;)

 I'll create a Jira issue, but in the meanwhile I'll have to come up with 
 something else. I guess I didn't realize how much of a corner case this 
 problem is. :)

 Thanks for the suggestions!

 Ron

 -Original Message-
 From: Smiley, David W. [mailto:dsmi...@mitre.org]
 Sent: Wednesday, August 03, 2011 10:26 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Strategies for sorting by array, when you can't sort by array?

 Hi Ron.
 This is an interesting problem you have. One idea would be to create an index 
 with the entity relationship going in the other direction.  So instead of one 
 to many, go many to one.  You would end up with multiple documents with 
 varying names but repeated parent entity information -- perhaps simply using 
 just an ID which is used as a lookup. Do a search on this name field, sorting 
 by a non-tokenized variant of the name field. Use Result Grouping to 
 consolidate multiple matches of a name to the same parent document. This 
 whole idea might very well be academic, since duplicating all the parent 
 entity information for searching on that too might be a bit more than you 
 care to bother with. And I don't think Solr 4's join feature addresses this 
 use case. In the end, I think Solr could be modified to support this, with 
 some work. It would make a good feature request in JIRA.

 ~ David Smiley

 On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote:

 Hi all-

 Well, this is a problem. I have a list of names as a multi-valued field, and 
 I am searching on this field and need to return the results sorted. I know 
 from searching and reading the documentation (and getting the error) that 
 sorting on a multi-valued field isn't possible. Okay, so what I haven't 
 found is any real good solution/workaround to the problem. I was wondering 
 what strategies others have used to overcome this particular situation; 
 collapsing the individual names into a single field with copyField doesn't 
 work because the name searched may not be the first name in the field.

 Thanks for any hints/tips/tricks.

 Ron

 DISCLAIMER: This electronic message, including any attachments, files or 
 documents, is intended only for the addressee and may contain CONFIDENTIAL, 
 PROPRIETARY or LEGALLY PRIVILEGED information.  If you are not the intended 
 recipient, you are hereby notified that any use, disclosure, copying or 
 distribution of this message or any of the information included in or with 
 it is  unauthorized and strictly prohibited.  If you have received this 
 message in error, please notify the sender immediately by reply e-mail and 
 permanently delete and destroy this message and its attachments, along with 
 any copies thereof. This message does not create any contractual obligation 
 on behalf of the sender or Law Bulletin Publishing Company.
 Thank you.



Re: performance crossover between single index and sharding

2011-08-04 Thread Shawn Heisey

On 8/4/2011 12:38 AM, Bernd Fehling wrote:

Hi Shawn,

the 0.05 seconds for search time at peak times (3 qps) is my target 
for Solr.
The numbers for Solr are from Solr's statistics report page. So 39.5 
seconds
average per request is definitely too long and I have to change to 
sharding.


Solr reports all query times in milliseconds.  39.5 would be 0.0395 seconds.


For FAST system the numbers for the search dispatcher are:
 0.042 sec elapsed per normal search, on avg.
 0.053 sec average uncached normal search time (last 100 queries).
 99.898% of searches using < 1 sec
 99.999% of searches using < 3 sec
 0.000% of all requests timed out
 22454567.577 sec time up (that is 259 days)

Is there a report page for those numbers for Solr?


The Solr statistics page normally reports averages, but not percentile 
statistics.  You can add percentile-based statistics (on a limited 
subset of your queries) to a 3.X or trunk (4.0) version with SOLR-1972.  
I am using this patch in production.  Alternatively, you can use INFO 
logging in Solr and crawl the logfiles to gather statistics.  In the 
list below (the standard section on the stats page), the ones that 
start with rolling are provided by the patch; the others are included 
by default.  Remember that all these times are in milliseconds.


handlerStart : 1312433464327
requests : 24112
errors : 547
timeouts : 0
totalTime : 2565584
avgTimePerRequest : 106.40279
avgRequestsPerSecond : 0.7097045
rollingRequests : 16384
rollingTotalTime : 1594420
rollingAvgTimePerRequest : 97.315674
rollingAvgRequestsPerSecond : 0.74394274
rollingMedian : 16
rolling75thPercentile : 35
rolling95thPercentile : 225
rolling99thPercentile : 2202
rollingMax : 9397

About the RAM: the 32GB RAM are physical for each VM and the 20GB RAM 
are the -Xmx for Java.
Yesterday I noticed that we are running out of heap during replication, 
so I have to
increase -Xmx to about 22g.


That doesn't leave much RAM for the OS disk cache, the primary way to 
speed things up with Solr.  You should check how long it takes to warm 
your caches when you commit, you can find that on the stats page.  It's 
probably a good idea to lower your autowarmCount values.


If you sharded, you could drop your Java heap size and get more of your 
index into RAM.  I have a heap size of 3GB for an 18.25GB index (total 
of all shards is about 110GB) and do not expect to be increasing that 
unless we have problems when we start using faceting, spellchecking, and 
suggestions.  I have made particular tweaks to garbage collection and 
wrote about my experiences on this list.  My memory-related java parameters:


-Xms3072M -Xmx3072M
-XX:NewSize=2048M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled


The reported 0.6 average requests per second seems right to me because
the Solr system isn't under full load yet. The FAST system is still 
taking
most of the load. I plan to switch completely to Solr after sharding 
is up and
running stably. So there will be an additional 3 qps to Solr at peak times.


When I read that before, I thought you were saying it was 0.6 seconds 
per request, not requests per second.  My apologies.  A qps of 3 is 
quite low.  I've seen numbers mentioned here above 3 qps, and I'm 
sure some of the list veterans have seen much higher.



I don't know if a controlling master like FAST's makes any sense for Solr.
The small VMs with heartbeat and haproxy sound great; that must go on my 
todo list.


If you don't create a core that automatically adds the shards parameter 
(the master server idea), your application will have to include the 
parameter on every request, which means it must be aware of how you have 
sharded your index. If that's acceptable to you, there's no problem.  
In my case, every single Solr instance has a copy of this broker core.  
I only use it on two of them, the two that the load balancer knows about.
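
A sketch of what such a broker core's handler could look like in
solrconfig.xml; the host and core names here are invented for illustration:

  <requestHandler name="search" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="shards">idx1:8983/solr/shard1,idx2:8983/solr/shard2,idx3:8983/solr/shard3</str>
    </lst>
  </requestHandler>

Clients query the broker core normally, and it fans each request out to the
shards listed in the default.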


But the biggest problem currently is how to configure the DIH to 
split up the
content across several indexers. Is there an indexing distributor?


There is currently no way to have Solr figure out distributed indexing.  
Solr doesn't know how you have sharded your data, and it cannot keep 
track of primary/secondary indexers.  Your build system must figure 
these things out.  My dih-config.xml accepts variables via the URL, 
which I use to tailor my SQL queries.


SELECT * FROM ${dataimporter.request.dataView}
WHERE (
  (
    did > ${dataimporter.request.minDid}
    AND did <= ${dataimporter.request.maxDid}
  )
  ${dataimporter.request.extraWhere}
) AND (crc32(did) % ${dataimporter.request.numShards})
  IN (${dataimporter.request.modVal})
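
For illustration, an import call that feeds those request variables in might
look like this; the host, core and values are made up, only the parameter
mechanism comes from the config above:

  http://idx1:8983/solr/shard1/dataimport?command=full-import&dataView=docs&minDid=0&maxDid=5000000&numShards=6&modVal=0,1,2,3,4,5

Each shard is given the same query with a different modVal, so every document
lands in exactly one shard.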

I index all new content to a smaller index which I have called the 
incremental.  The updates that run every two minutes include a modVal 
for the above query of 0 1 2 3 4 5.  Once a night, I figure out which 
content is older than one week.  I 

Re: Solr 3.3 crashes after ~18 hours?

2011-08-04 Thread Manish Bafna
Check out Physcial memory/virtual memory usage.
RAM usage might be less but Physical memory usage goes up as you index more
documents.
It might be because of MMapDirectory which used MappedByteBuffer.

On Thu, Aug 4, 2011 at 7:38 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Aug 4, 2011 at 8:09 AM, alexander sulz a.s...@digiconcept.net
 wrote:
  Thank you for the many replies!
 
  Like I said, I couldn't find anything in logs created by solr.
  I just had a look at the /var/logs/messages and there wasn't anything
  either.
 
  What I mean by crash is that the process is still there and HTTP GET
 pings
  would return 200,
  but when I try visiting /solr/admin, I'd get a blank page! The server
  ignores any incoming updates or commits,

 ignores means what?  The request hangs?  If so, could you get a thread
 dump?

 Do queries work (like /solr/select?q=*:*) ?

  thus throwing no errors, no 503s. It's like the server has a blackout
 and
  stares blankly into space.

 Are you using a different servlet container than what is shipped with solr?
 If you did start with the solr example server, what jetty
 configuration changes have you made?

 -Yonik
 http://www.lucidimagination.com



RE: Joining on multi valued fields

2011-08-04 Thread matthew . fowler
Hi Yonik

So I tested the join using the sample data below and the latest trunk. I still 
got the same behaviour.

HOWEVER! In this case it had nothing to do with the patch or Solr version. It 
was the tokeniser splitting G1 into G and 1.
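
For anyone hitting the same thing: the join keys need to be in non-tokenised
fields. A schema sketch (the field names are taken from the data below; the
rest is illustrative, not the actual schema in use here):

  <field name="pid_rcs" type="string" indexed="true" stored="true"/>
  <field name="code" type="string" indexed="true" stored="true" multiValued="true"/>

The built-in string type (solr.StrField) indexes the whole value as a single
token, so G1 stays G1.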

So thank you for a nice patch and your suggestions.

I do have a couple of questions for you: at what level does the join happen, and 
what do you expect the performance penalty to be? We might use this extensively 
if the performance penalty isn't great.

Thanks again,

Matt

-Original Message-
From: Fowler, Matthew (Markets Eikon) 
Sent: 03 August 2011 15:04
To: yo...@lucidimagination.com
Cc: solr-user@lucene.apache.org
Subject: RE: Joining on multi valued fields

No I haven't. I will get the latest out of the trunk and report back.

Cheers again,

Matt

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: 03 August 2011 14:51
To: Fowler, Matthew (Markets Eikon)
Cc: solr-user@lucene.apache.org
Subject: Re: Joining on multi valued fields

Hmmm, if these are real responses from a solr server at rest (i.e.
documents not being changed between queries) then what you show
definitely looks like a bug.
That's interesting, since TestJoin implements a random test that
should cover cases like this pretty well.

I assume you are using a version of trunk (4.0-dev) and not just the
actual patch attached to the JIRA issue (which IIRC had at least one bug...
SOLR-2521).
Have you tried a more recent version of trunk?

-Yonik
http://www.lucidimagination.com



On Wed, Aug 3, 2011 at 7:00 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 Sorry for my late reply. I have been trying to get to the bottom of this
 but I'm getting inconsistent behaviour. Here's an example:

 Query = pi:rcs100     (here I am going to use pid_rcs as the join value)

 <result name="response" numFound="1" start="0">
   <doc>
     <str name="pi">rcs100</str>
     <str name="ct">rcs</str>
     <str name="pid_rcs">G1</str>
     <str name="name_rcs">Emerging Market Countries</str>
     <str name="definition_rcs">All business events relating to companies
       and other issuers of securities.</str>
   </doc>
 </result>
 </response>

 Query = code:G1       (see how many docs have G1 in their code field;
 notice that code is multi-valued)

 <result name="response" numFound="2" start="0">
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF3wGpXk+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF7YcLP+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
 </result>
 </response>

 Now for the join: http://10.15.39.137:8983/solr/file/select?q={!join
 from=pid_rcs to=code}pi:rcs100

 <result name="response" numFound="3" start="0">
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF3wGpXk+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:57Z</date>
     <str name="pin">CIF7YcLP+1029782</str>
     <arr name="code">
       <str>G1</str>
       <str>G7U</str>
       <str>GK</str>
       <str>ME7</str>
       <str>ME8</str>
       <str>MN</str>
       <str>MR</str>
     </arr>
   </doc>
   <doc>
     <str name="ct">cat</str>
     <date name="maindocdate">2011-04-22T05:48:58Z</date>
     <str name="pin">CN1763203+1029782</str>
     <arr name="code">
       <str>A2</str>
       <str>A5</str>
       <str>A9</str>
       <str>AN</str>
       <str>B125</str>
       <str>B126</str>
       <str>B130</str>
       <str>BL63</str>
       <str>G41</str>
       <str>GK</str>
       <str>MZ</str>
     </arr>
   </doc>
 </result>
 </response>

 So as you can see, I get back 3 results when only 2 match the criteria,
 i.e. docs where G1 is present in the multi-valued code field. Why should
 the last document be included in the result of the join?

 Thank you,

 Matt


 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: 01 August 2011 18:28
 To: solr-user@lucene.apache.org
 Subject: Re: Joining on multi valued fields

 On Mon, Aug 1, 2011 at 12:58 PM,  matthew.fow...@thomsonreuters.com
 wrote:
 I have been using the JOIN patch
 https://issues.apache.org/jira/browse/SOLR-2272 with great success.

 However I have hit a case where it doesn't seem to be working. It
 doesn't seem to work when joining to a multi-valued field.

 That should work (and the unit tests do test with multi-valued fields).
 Can you come up with a simple example where you are not getting the
 expected results?

 -Yonik
 http://www.lucidimagination.com

 This email was sent to you by Thomson Reuters, the global news and 
 information company. Any views expressed in this message are those of the 
 individual sender, except where the sender specifically states them to be the 
 views of Thomson Reuters.



Indexing tweet and searching @keyword OR #keyword

2011-08-04 Thread Mohammad Shariq
I have indexed around 1 million tweets (using the text dataType).
When I search a tweet with # or @, I don't get the exact result.
E.g. when I search for #ipad or @ipad, I get results where ipad is
mentioned, skipping the # and @.
Please suggest how to tune this, or which filter factories to use to get the
desired result.
I am indexing the tweets as text; below is the text fieldType from my
schema.xml.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
        minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
        protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
        minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
        protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>

-- 
Thanks and Regards
Mohammad Shariq


Re: Joining on multi valued fields

2011-08-04 Thread Yonik Seeley
On Thu, Aug 4, 2011 at 11:21 AM,  matthew.fow...@thomsonreuters.com wrote:
 Hi Yonik

 So I tested the join using the sample data below and the latest trunk. I 
 still got the same behaviour.

  HOWEVER! In this case it had nothing to do with the patch or Solr version. It 
  was the tokeniser splitting G1 into G and 1.

Ah, glad you figured it out!

 So thank you for a nice patch and your suggestions.

  I do have a couple of questions for you: at what level does the join happen, 
  and what do you expect the performance penalty to be? We might use this 
  extensively if the performance penalty isn't great.

With the current implementation, the performance is proportional to
the number of unique terms in the fields being joined.

-Yonik
http://www.lucidimagination.com


Re: Indexing tweet and searching @keyword OR #keyword

2011-08-04 Thread Jonathan Rochkind
It's the WordDelimiterFilterFactory in your filter chain that's removing the 
punctuation entirely from your index, I think.


Read up on what the WordDelimiter filter does and what its settings 
are; decide how you want things to be tokenized in your index to get the 
behavior you want; either get WordDelimiter to do it that way by 
passing it different arguments, or stop using WordDelimiter; come back 
with any questions after trying that!
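
One possible direction, as a sketch only: this assumes a Solr version (3.1+)
where WordDelimiterFilterFactory has the types attribute, which lets you
reclassify characters so # and @ are treated as letters rather than delimiters:

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      splitOnCaseChange="1" types="wdfftypes.txt"/>

with a wdfftypes.txt along these lines (the # must be escaped):

  \# => ALPHA
  @ => ALPHA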



On 8/4/2011 11:22 AM, Mohammad Shariq wrote:

I have indexed around 1 million tweets (using the text dataType).
When I search a tweet with # or @, I don't get the exact result.
E.g. when I search for #ipad or @ipad, I get results where ipad is
mentioned, skipping the # and @.
Please suggest how to tune this, or which filter factories to use to get the
desired result.
I am indexing the tweets as text; below is the text fieldType from my
schema.xml.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
        minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
        protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
        minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
        protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>



How can I create a good autosuggest list with phrases?

2011-08-04 Thread Shawn Heisey
I'm at the point in my Solr deployment where I want to start using it 
for autosuggest, but I've run into a snag.  Because the fields that I 
want to use for autosuggest are tokenized, I can only get single terms 
out of it.  I would like to have it find common phrases that are between 
two and five words long, so that if someone starts typing ang their 
autosuggest list will include Angelina Jolie as well as possibly Brad 
Pitt and Angelina Jolie.


My index is already quite large, so I do not want to add shingles.  I 
tried to use the clustering component, but that will only give you 
halfway decent results if you make the rows= parameter absolutely huge 
and therefore things run very slowly.  Also, it only works against 
stored fields, so I can only run it against the field where we retrieve 
captions, not the full description.  It's impractical to get results 
based on an entire index, much less all seven shards.


I'm OK with offline analysis to generate a list of suggestions, and I'm 
also OK with doing that analysis against the MySQL data source rather 
than Solr.  I just need some pointers about what software and/or 
techniques I can use to generate a good list, and then some idea of how 
to configure Solr to use that list.  Can anyone help?


Thanks,
Shawn



Re: How can I create a good autosuggest list with phrases?

2011-08-04 Thread Sethi, Parampreet
We handled a similar requirement in our product kitchendaily.com by creating a
list of search terms which were frequently searched over a period of time
and then building an auto-suggestion index from this data. Constant updates
of this list will allow you to support a well-formed auto-suggest feature. This
is a good and fast solution if you have application logs to start with and
not a very high volume of data.

Or you can search Solr with the user-entered data, which returns all the
matching results, boost by the field which will be used in the
auto-suggest box, and use the top 5 items in the dynamic div.

Hope it Helps.

-param


On 8/4/11 11:42 AM, Shawn Heisey s...@elyograg.org wrote:

 I'm at the point in my Solr deployment where I want to start using it
 for autosuggest, but I've run into a snag.  Because the fields that I
 want to use for autosuggest are tokenized, I can only get single terms
 out of it.  I would like to have it find common phrases that are between
 two and five words long, so that if someone starts typing ang their
 autosuggest list will include Angelina Jolie as well as possibly Brad
 Pitt and Angelina Jolie.
 
 My index is already quite large, so I do not want to add shingles.  I
 tried to use the clustering component, but that will only give you
 halfway decent results if you make the rows= parameter absolutely huge
 and therefore things run very slowly.  Also, it only works against
 stored fields, so I can only run it against the field where we retrieve
 captions, not the full description.  It's impractical to get results
 based on an entire index, much less all seven shards.
 
 I'm OK with offline analysis to generate a list of suggestions, and I'm
 also OK with doing that analysis against the MySQL data source rather
 than Solr.  I just need some pointers about what software and/or
 techniques I can use to generate a good list, and then some idea of how
 to configure Solr to use that list.  Can anyone help?
 
 Thanks,
 Shawn
 



Re: How can I create a good autosuggest list with phrases?

2011-08-04 Thread Shawn Heisey

On 8/4/2011 10:04 AM, Sethi, Parampreet wrote:

We handled similar requirement in our product kitchendaily.com by creating a
list of Search terms which were frequently searched over a period of time
and then building auto-suggestion index from this data. The constant updates
of this will allow you to support a well formed auto-suggest feature. This
is a good and faster solution if you have application logs to start with and
not very high volume of data.


I do have some separate plans to include data from our query logs, but 
I'd also like to get data from the index itself, more than one term at a 
time.


Thanks,
Shawn



merge factor performance

2011-08-04 Thread Naveen Gupta
Hi,

We have a requirement to index almost 100,000 documents (at least 20 fields
each). No field is longer than 10 KB.

We are also running searches against the same index in parallel.

We found that it is taking almost 3 minutes to index the documents.

Our current strategy is this:

We make a commit after 15000 docs (sent as a single large XML doc).

We have a merge factor of 10 as of now.

I am wondering if increasing the merge factor to 25 or 50 would improve
performance.

Also, what about the RAM buffer size (the default is 32 MB)?

Which other factors do we need to consider?

When should we consider an optimize?

Is there any other deviation from the defaults that would help us achieve the
target?

We are allocating a JVM max heap of 512 MB, and the default concurrent
mark-sweep garbage collector is configured.


Thanks
Naveen
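
mergeFactor and ramBufferSizeMB are set in solrconfig.xml; purely as an
illustration of what those two knobs control underneath, here is a
Lucene-level sketch using the values under discussion (the index path is
invented):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
            Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
        cfg.setRAMBufferSizeMB(128);   // buffer more docs in RAM than the 32 MB default
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(25);         // merge segments less often during bulk indexing
        cfg.setMergePolicy(mp);
        IndexWriter writer =
            new IndexWriter(FSDirectory.open(new File("/tmp/index")), cfg);
        writer.close();
    }
}

A bigger RAM buffer usually helps bulk indexing more than a bigger merge
factor, and committing less often tends to help more than either.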


Re: merge factor performance

2011-08-04 Thread Naveen Gupta
Sorry, a correction: it is 15k docs that are taking 3 minutes.

On Thu, Aug 4, 2011 at 10:07 PM, Naveen Gupta nkgiit...@gmail.com wrote:

 Hi,

 We are having a requirement where we are having almost 100,000 documents to
 be indexed (atleast 20 fields). These fields are not having length greater
 than 10 KB.

 Also we are running parallel search for the same index.

 We found that it is taking almost 3 min to index the entire documents.

 Strategy what we are doing is that

 We are making a commit after  15000 docs (single large xml doc)

 We are having merge factor of 10 as if now

 I am wondering if increasing the merge factor to 25 or 50 would increase
 the performance.

 also what about RAM Size (default is 32 MB) ?

 Which other factors we need to consider ?

 When should we consider optimize ?

 Any other deviation from default would help us in achieving the target.

 We are allocating JVM max heap size allocation 512 MB, default concurrent
 mark sweep is set for garbage collection.


 Thanks
 Naveen






Re: segment.gen file is not replicated

2011-08-04 Thread Michael McCandless
I think we should fix replication to copy it?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2011 at 8:16 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:


 Am 04.08.2011 12:52, schrieb Michael McCandless:

 This file is actually optional; it's there for redundancy in case the
 filesystem is not reliable when listing a directory.  I.e., normally
 we list the directory to find the latest segments_N file; but if this
 is wrong (e.g. the file system might have a stale cache) then we
 fall back to reading the segments.gen file.

 For example this is sometimes needed for NFS.

 Likely replication is just skipping it?

 That was my first idea: if the file is not changed or touched, it will be skipped.

 Trying to be smart, I deleted it from the index dir on the slave and then
 replicated, but segments.gen was not replicated.
 Given your explanation, NFS could then no longer be relied on.

 So my guess: either a bug or a feature, and the experts will know :-)

 Regards
 Bernd


 Mike McCandless

 http://blog.mikemccandless.com

 On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de  wrote:

 I have now updated to solr 3.3 but segment.gen is still not replicated.

 Any idea why, is it a bug or a feature?
 Should I write a jira issue for it?

 Regards
 Bernd

 Am 29.07.2011 14:10, schrieb Bernd Fehling:

 Dear list,

 is there a deeper logic behind why the segment.gen file is not
 replicated with solr 3.2?

 Is it obsolete because I have a single segment?

 Regards,
 Bernd





Re: using distributed search with the suggest component

2011-08-04 Thread mdz-munich
Hi Tobias,

Sadly, it seems you are right.

After a bit of investigation we also noticed that some names are missing (we
use the component for auto-completing author names). And since it is a
distributed setup ...

But I am almost sure it worked with Solr 3.2.



Best regards,

Sebastian 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/using-distributed-search-with-the-suggest-component-tp3197651p3226082.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 3.3 crashes after ~18 hours?

2011-08-04 Thread Stephen Duncan Jr
On Thu, Aug 4, 2011 at 10:08 AM, Yonik Seeley
yo...@lucidimagination.com wrote:

 "ignores" means what?  The request hangs?  If so, could you get a thread dump?

 Do queries work (like /solr/select?q=*:*) ?

 Though throwing no errors, no 503's. It's like the server has a blackout and
 stares blankly into space.

 Are you using a different servlet container than what is shipped with solr?
 If you did start with the solr example server, what jetty
 configuration changes have you made?

 -Yonik
 http://www.lucidimagination.com


We're seeing something similar here.  Not sure exactly what the
circumstances are, but occasionally our Solr 3.3 test instance is
hanging, nothing seems to be happening for several minutes.  It does
seem to be happening while data is being added and continuous queries
are being sent.  It also may be related to an optimize happening (we
attempt to optimize after adding all the new data from our database).
The last log message is:

2011-08-04 13:46:56,418 [qtp30604342-451] INFO
org.apache.solr.core.SolrCore - [report] webapp= path=/update
params={optimize=true&waitSearcher=true&maxSegments=1&waitFlush=true&wt=javabin&version=2}
status=0 QTime=109109

Here is our thread dump:


2011-08-04 13:47:16
Full thread dump Java HotSpot(TM) Client VM (20.1-b02 mixed mode):

RMI TCP Connection(13)-172.16.10.102 daemon prio=6 tid=0x47a4a400
nid=0x1384 runnable [0x4861f000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
- locked 0x183a55a0 (a java.io.BufferedInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:66)
at 
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:517)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
- 0x183a7c68 (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

qtp30604342-451 prio=6 tid=0x475c4800 nid=0x1a58 waiting on
condition [0x4897f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x18214c08 (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
at 
org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:320)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:512)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:38)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:558)
at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
- None

qtp30604342-450 prio=6 tid=0x47ad1c00 nid=0x1ca4 waiting on
condition [0x49d2f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x18214c08 (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
at 
org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:320)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:512)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:38)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:558)
at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
- None

qtp30604342-449 prio=6 tid=0x47a57c00 nid=0xb2c waiting on condition
[0x49c2f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x18214c08 (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
at 

Re: MultiSearcher/ParallelSearcher - searching over multiple cores?

2011-08-04 Thread Ralf Musick

Hi Erik,

I have several types with different properties, but they are supposed
to be combined into one search.
Imagine a book with the property title and a journal with the property name.
(The types in my project of course have more complex properties.)


So I created a new core with combined search fields: the field name is
indexed, title is indexed, and some shared properties like id are indexed.

Furthermore, an additional Solr field type is created.
Of course there are several indexers, one per type. A type-specific
indexer stores only the fields of that type, and additionally stores the
type information, e.g. book.

After indexing, all types are in the same core.

To search over all types, the query has to look like this: ((title: bla)
AND (type: book)) OR ((name: bla) AND (type: journal)).


In the end you get books or journals sorted by boost factor, and you have
the type information as a return field to distinguish the search results.


I hope it is coherent.

Thanks for your answer,
 Best Ralf







Re: Is there anyway to sort differently for facet values?

2011-08-04 Thread Way Cool
Thanks Erick for your reply. I am aware of facet.sort, but I haven't used it.
I will try it, though.

Can it handle the values below in the correct order?
Under 10
10 - 20
20 - 30
Above 30

Or
Small
Medium
Large
XL
...

My second question: if Solr can't do that for the values above using
facet.sort, is there any other way to do it in Solr?

Thanks in advance,

YH

On Wed, Aug 3, 2011 at 8:35 PM, Erick Erickson erickerick...@gmail.comwrote:

 have you looked at the facet.sort parameter? The index value is what I
 think you want.

 Best
 Erick
 On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote:
  Hi, guys,
 
  Is there anyway to sort differently for facet values? For example,
 sometimes
  I want to sort facet values by their values instead of # of docs, and I
 want
  to be able to have a predefined order for certain facets as well. Is that
  possible in Solr we can do that?
 
  Thanks,
 
  YH



What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?

2011-08-04 Thread Way Cool
Hi, guys,

What's the best way (practice) to do index distribution at this moment?
Hadoop? or rsyncd (back to 3 years ago ;-)) ?

Thanks,

Yugang


Re: Is there anyway to sort differently for facet values?

2011-08-04 Thread Jonathan Rochkind

No, it can not. It just sorts alphabetically, actually by raw byte-order.

No other facet sorting functionality is available, and it would be 
tricky to implement in a performant way because of the way lucene 
works.  But it would certainly be useful to me too if someone could 
figure out a way to do it.


On 8/4/2011 2:43 PM, Way Cool wrote:

Thanks Eric for your reply. I am aware of facet.sort, but I haven't used it.
I will try that though.

Can it handle the values below in the correct order?
Under 10
10 - 20
20 - 30
Above 30

Or
Small
Medium
Large
XL
...

My second question is that if Solr can't do that for the values above by
using facet.sort. Is there any other ways in Solr?

Thanks in advance,

YH

On Wed, Aug 3, 2011 at 8:35 PM, Erick Ericksonerickerick...@gmail.comwrote:


have you looked at the facet.sort parameter? The index value is what I
think you want.

Best
Erick
On Aug 3, 2011 7:03 PM, Way Coolway1.wayc...@gmail.com  wrote:

Hi, guys,

Is there anyway to sort differently for facet values? For example,

sometimes

I want to sort facet values by their values instead of # of docs, and I

want

to be able to have a predefined order for certain facets as well. Is that
possible in Solr we can do that?

Thanks,

YH


Re: lucene/solr, raw indexing/searching

2011-08-04 Thread dhastings
I have decided to use Solr for indexing as well.

The types of searches I'm doing are professional/academic, so for example
I need to match all of the following exactly, as they appear in my source data:
3.56,
4 harv. l. rev. 45,
187-532,
3 llm 56,
5 unts 8,
6 u.n.t.s. 78,
father's obligation


I seem to keep running into issues getting this to work.  The searching is
being done on a text field that is not stored.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3226611.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?

2011-08-04 Thread Jonathan Rochkind
I'm not sure what you mean by index distribution, that could possibly 
mean several things.


But Solr has had a replication feature built in since 1.4 that can 
probably handle the same use cases as rsync, only better.  So that may be 
what you want.


There are certainly other experiments going on involving various kinds 
of scaling and distribution, including the sharding feature, that I'm 
not very familiar with. I don't know if anyone has tried to do anything 
with Hadoop.




On 8/4/2011 2:52 PM, Way Cool wrote:

Hi, guys,

What's the best way (practice) to do index distribution at this moment?
Hadoop? or rsyncd (back to 3 years ago ;-)) ?

Thanks,

Yugang



Re: lucene/solr, raw indexing/searching

2011-08-04 Thread Jonathan Rochkind

It depends. Okay, the source contains "4 harv. l. rev. 45".

Do you want a user-entered "harv." to ALSO match "harv" (without the 
period) in the source, and vice versa? Or do you require it NOT to match? Or 
do you not care?


The default filter analysis chain will index "4 harv. l. rev. 45" 
essentially as 4;harv;l;rev;45.  A phrase search for
"4 harv. l. rev. 45" will match it, but so will a phrase search for "4 
harv l rev 45", and in fact so will a phrase search for "4 harv. l. rev45".


That could be good, or it could be bad.

The point of the Solr analysis chain is to apply tokenization and 
transformation at both index time and query time, so queries will match 
source text in the way you want. You can customize this analysis chain 
however you want, in extreme cases even writing your own analyzers in 
Java. If the out-of-the-box default isn't doing what you want, you'll 
have to spend some time thinking about how an inverted index like Lucene 
works and what you want. You would need to provide a lot more 
specifications/details for someone else to figure out what analysis 
chain will do what you want, but I bet you can figure it out yourself 
after reading up a bit and thinking a bit.


See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
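
To see this concretely, here is a small stand-alone sketch that prints the
tokens. It uses Lucene's StandardAnalyzer as a rough stand-in for whatever
chain your schema.xml actually defines:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
        TokenStream ts = analyzer.tokenStream("text",
            new StringReader("4 harv. l. rev. 45"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.toString());  // prints: 4, harv, l, rev, 45
        }
        ts.close();
    }
}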

 On 8/4/2011 4:30 PM, dhastings wrote:

I have decided to use solr for indexing as well.

the types of searches im doing are professional/academic.
so for example, i need to match:
all over the following exactly from my source data:
 3.56,
  4 harv. l. rev. 45,
  187-532,
 3 llm 56,
  5 unts 8,
 6 u.n.t.s. 78,
 father's obligation


i seem to keep running into issues getting this to work.  the searching is
being done on a text field that is not stored.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3226611.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is there anyway to sort differently for facet values?

2011-08-04 Thread Sethi, Parampreet
It could be achieved by creating your own (app-specific) custom comparators for
fields defined in schema.xml, with an extra attribute in the field tag itself
to specify the comparator class. But that would require changes in Solr to
support this feature. (Not sure if it's feasible; just throwing out an idea.)

-param
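
In the meantime, a workaround that needs no Solr changes is to re-order a
small, fixed set of facet values on the client side. A SolrJ sketch under that
assumption, using the size values from the earlier mail:

import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.client.solrj.response.FacetField;

public class FacetReorder {
    // Predefined display order; values not in the list sort last.
    private static final List<String> ORDER =
        Arrays.asList("Small", "Medium", "Large", "XL");

    public static void sortByPredefinedOrder(FacetField field) {
        List<FacetField.Count> counts = field.getValues();
        Collections.sort(counts, new Comparator<FacetField.Count>() {
            public int compare(FacetField.Count a, FacetField.Count b) {
                int ia = ORDER.indexOf(a.getName());
                int ib = ORDER.indexOf(b.getName());
                if (ia < 0) ia = Integer.MAX_VALUE;   // unknown values go last
                if (ib < 0) ib = Integer.MAX_VALUE;
                return ia < ib ? -1 : (ia == ib ? 0 : 1);
            }
        });
    }
}

This only works when the value set is known up front, which is exactly the
range-bucket and size cases from the question.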

On 8/4/11 4:29 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 No, it can not. It just sorts alphabetically, actually by raw byte-order.
 
 No other facet sorting functionality is available, and it would be
 tricky to implement in a performant way because of the way lucene
 works.  But it would certainly be useful to me too if someone could
 figure out a way to do it.
 
 On 8/4/2011 2:43 PM, Way Cool wrote:
 Thanks Eric for your reply. I am aware of facet.sort, but I haven't used it.
 I will try that though.
 
 Can it handle the values below in the correct order?
 Under 10
 10 - 20
 20 - 30
 Above 30
 
 Or
 Small
 Medium
 Large
 XL
 ...
 
 My second question is that if Solr can't do that for the values above by
 using facet.sort. Is there any other ways in Solr?
 
 Thanks in advance,
 
 YH
 
 On Wed, Aug 3, 2011 at 8:35 PM, Erick Ericksonerickerick...@gmail.comwrote:
 
 have you looked at the facet.sort parameter? The index value is what I
 think you want.
 
 Best
 Erick
 On Aug 3, 2011 7:03 PM, Way Coolway1.wayc...@gmail.com  wrote:
 Hi, guys,
 
 Is there anyway to sort differently for facet values? For example,
 sometimes
 I want to sort facet values by their values instead of # of docs, and I
 want
 to be able to have a predefined order for certain facets as well. Is that
 possible in Solr we can do that?
 
 Thanks,
 
 YH



Re: What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?

2011-08-04 Thread Way Cool
Yes, I am talking about the replication feature. I remember I tried rsync 3
years ago with Solr 1.2. I'm just not sure whether anyone has done anything
better than that in the last 3 years. ;-) Personally I am thinking about
using Hadoop and ZooKeeper. Has anyone tried those?
I found a couple links below, but no success on that yet.
http://wiki.apache.org/solr/SolrCloud
http://wiki.apache.org/solr/DeploymentofSolrCoreswithZookeeper

Thanks for your reply Jonathan.

On Thu, Aug 4, 2011 at 2:31 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 I'm not sure what you mean by index distribution, that could possibly
 mean several things.

 But Solr has had a replication feature built into it from 1.4, that can
 probably handle the same use cases as rsync, but better.  So that may be
 what you want.

 There are certainly other experiments going on involving various kinds of
 scaling distribution, that I'm not familiar with, including the sharding
 feature, that I'm not very familiar with. I don't know if anyone's tried to
 do anything with hadoop.




 On 8/4/2011 2:52 PM, Way Cool wrote:

 Hi, guys,

 What's the best way (practice) to do index distribution at this moment?
 Hadoop? or rsyncd (back to 3 years ago ;-)) ?

 Thanks,

 Yugang




deleting index directory/files

2011-08-04 Thread Mark juszczec
Hello all

I'm using multiple cores.  There's a directory named after each core; it
contains a subdir named data, which contains a subdir named index, which
contains the files that hold the data for my index.

Let's say I want to completely rebuild the index from scratch.

Can I delete the dir named index?  I know the next thing I'd have to do is a
full data import, and that's ok.  I want to blow away any traces of the
core's previous existence.

Mark


RE: deleting index directory/files

2011-08-04 Thread Olson, Ron
I ran into a problem when I deleted just the index directory; I deleted the 
entire data directory and it was recreated on the next load. BTW, if you're 
using the DIH, its default behavior is to remove all records on a full import, 
so you can save yourself having to remove any actual files.

-Original Message-
From: Mark juszczec [mailto:mark.juszc...@gmail.com]
Sent: Thursday, August 04, 2011 4:01 PM
To: solr-user@lucene.apache.org
Subject: deleting index directory/files

Hello all

I'm using multiple cores.  I there's a directory named by the core and it
contains a subdir named data that contains a subdir named index that
contains a bunch of files that contain the data for my index.

Let's say I want to completely rebuild the index from scratch.

Can I delete the dir named index?  I know the next thing I'd have to do is a
full data import, and that's ok.  I want to blow away any traces of the
core's previous existence.

Mark



Json update using HttpURLConnection

2011-08-04 Thread Sharath Jagannath
I am trying to post the json update request using
java.net.HttpURLConnection.

Parameters I am using:

url : http://localhost:8983/solr/update/json?commit=true

String data =
    "[{\"id\" : \"TestDoc7\", \"title\" : \"test 7\"}, {\"id\" :
    \"TestDoc8\", \"title\" : \"another test 8\"}]";

uri += "" + data;

String requestType = "POST";


Header info:
"Content-type", "application/json"
"Content-Length", "" + data.length()


I can see the request going to the Solr server, but the data is not being
persisted.

I was able to add the same documents with curl, so I am sure it is something
stupid.

Any pointers on what the mistake might be?

Thanks,
Sharath
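
The usual culprit with this shape of code is that the JSON ends up appended to
the URL instead of in the POST body. A minimal sketch of the body-writing
version, with the URL and data taken from above and everything else assumed:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class JsonUpdate {
    public static void main(String[] args) throws Exception {
        String data = "[{\"id\" : \"TestDoc7\", \"title\" : \"test 7\"}, "
                    + "{\"id\" : \"TestDoc8\", \"title\" : \"another test 8\"}]";
        URL url = new URL("http://localhost:8983/solr/update/json?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);                 // required to send a request body
        conn.setRequestProperty("Content-Type", "application/json");
        OutputStream out = conn.getOutputStream();
        out.write(data.getBytes("UTF-8"));      // byte length, not char length
        out.close();
        // Reading the response actually completes the request.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}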


Re: Minimum Score

2011-08-04 Thread Kissue Kissue
Hi,

I am using Solr 3.1 with the SolrJ client library. I can see that it is
possible to get the maximum score for your search by using the following:

response.getResults().getMaxScore()

I am wondering: is there some simple solution to get the minimum score?

Many thanks.


SOLR Support for Span Queries

2011-08-04 Thread Joshua Harness
How does one issue span queries in Solr (Span, SpanNear, etc.)? I've
done a bit of research and it seems that these are not supported out of the
box. It would seem that I need to implement a QParserPlugin to accomplish
this. Is this the correct path? Surely this has been done before. Does
anybody have links to examples? I had trouble finding anything.

Thanks!

Josh Harness
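
A custom QParserPlugin is indeed the usual path. A hedged sketch against the
Solr 3.x API; the field name, the whitespace splitting, and the slop of 2 are
invented for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class SpanNearQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            public Query parse() {
                // Turn each word into a span term and require them near each other.
                String[] words = getString().split("\\s+");
                SpanQuery[] clauses = new SpanQuery[words.length];
                for (int i = 0; i < words.length; i++) {
                    clauses[i] = new SpanTermQuery(new Term("text", words[i]));
                }
                return new SpanNearQuery(clauses, 2, true);  // slop 2, in order
            }
        };
    }
}

Registered via a queryParser element in solrconfig.xml, it can then be invoked
as q={!spannear}some words.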


Re: Records skipped when using DataImportHandler

2011-08-04 Thread anand sridhar
OK. After analysis, I narrowed the reduced result set down to the fact that the
zipcode field is not indexed 'as is', i.e. the zipcode field values are
broken down into tokens and then indexed. Hence, if there are 10 documents
with zipcode fields ranging from 91000-91009, the zipcode fields are
not indexed as 91000, 91001, etc.; instead the most common recurrences are
grouped together and indexed as tokens, resulting in a reduced
result set.
The net effect is that I cannot search for a value like 91000, since it is not
indexed as-is.

I suspect this has something to do with the type of field the zipcode is
associated with. Right now zipcode is a field of type text_general, where the
StandardTokenizerFactory may be breaking the values into tokens. However, I
want to index them without tokenizing. What's the best field type to do this?

I have already explored the string field type, which is supposed to store the
values as-is, but I see that the values are still being tokenized.


Thanks,
Anand
On Wed, Aug 3, 2011 at 7:24 PM, Erick Erickson erickerick...@gmail.comwrote:

 Sorry, I'm on a restricted machine so can't get the precise URL. But,
 there's a debug page for DIH that might allow you to see what the query
 actually returns. I'd guess one of two things:
 1 you aren't getting the number of rows you think.
 2 you aren't committing the documents you add.

 But that's just a guess.

 Best
 Erick
 On Aug 3, 2011 2:15 PM, anand sridhar anand.for...@gmail.com wrote:
  Hi,
  I am a newbie to Solr and have been trying to learn using
  DataImportHandler.
  I have a query in data-config.xml that fetches about 5 records when i
 fire
  it in SQL Query manager.
  However, when Solr does a full import, it is skipping 4 records and only
  importing 1 record.
  What could be the reason for that. ?
 
  My data-config.xml looks like this -
 
  <dataConfig>
  <dataSource type="JdbcDataSource"
      name="GeoService"
      driver="net.sourceforge.jtds.jdbc.Driver"
      url="jdbc:jtds:sqlserver://10.168.50.104/ZipCodeLookup"
      user="sa"
      password="psiuser"/>
  <document>
  <entity name="city"
      query="select ll.cityId as id, ll.zip as zipCode, c.cityName as
             cityName, st.stateName as state, ct.countryName as country
             from latlonginfo ll, city c, state st, country ct
             where ll.cityId = c.cityID and c.stateID = st.stateID
             and st.countryID = ct.countryID
             order by ll.areacode"
      dataSource="GeoService">
    <field column="zipCode" name="zipCode"/>
    <field column="cityName" name="cityName"/>
    <field column="state" name="state"/>
    <field column="country" name="country"/>
  </entity>
  </document>
  </dataConfig>
 
  My fields definition in schema.xml looks as below -
 
  <field name="CityName" type="text_general" indexed="true" stored="true"/>
  <field name="zipCode" type="text_general" indexed="true" stored="true"/>
  <field name="state" type="text_general" indexed="true" stored="true"/>
  <field name="country" type="text_general" indexed="true" stored="true"/>
 
  One observation I made was the 1 record that is being indexes is the last
  record in the result set. I have verified that there are no duplicate
  records being retreived.
 
  For eg, if the result set from Database is -
 
  zipcode  CityName    state  country
  -------  ----------  -----  -------
  91324    Northridge  CA     USA
  91325    Northridge  CA     USA
  91327    Northridge  CA     USA
  91328    Northridge  CA     USA
  91329    Northridge  CA     USA
  91330    Northridge  CA     USA
 
  The record being indexed is the last record all the time.
 
  Any suggestions are welcome.
 
  Thanks,
  Anand



Re: Minimum Score

2011-08-04 Thread Darren Govoni
Off the top of my head: maybe you can get the number of results and then
look at the last document and check its score. I believe the results
will be ordered by score?
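
A hedged SolrJ sketch of that idea. Note it only gives the minimum of the
returned page; it would equal the true minimum only if rows covers the whole
result set. Asking for the score pseudo-field in fl is required:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class MinScore {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("some query");
        q.setFields("id", "score");   // ask Solr to return the score with each doc
        q.setRows(100);
        QueryResponse rsp = server.query(q);
        SolrDocumentList docs = rsp.getResults();
        if (!docs.isEmpty()) {
            // Default sort is score desc, so the last doc has the lowest score.
            Float min = (Float) docs.get(docs.size() - 1).getFieldValue("score");
            System.out.println("Minimum returned score: " + min);
        }
    }
}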


On 08/04/2011 05:44 PM, Kissue Kissue wrote:

Hi,

I am using Solr 3.1 with the SolrJ client library. I can see that it is
possible to get the maximum score for your search by using the following:

response.getResults().getMaxScore()

I am wondering is there some simple solution to get the minimum score?

Many thanks.





Re: Json update using HttpURLConnection

2011-08-04 Thread Sharath Jagannath
Never mind, it was some stupid bug.
Figured it out.

Cheers,
Sharath



On Thu, Aug 4, 2011 at 2:35 PM, Sharath Jagannath
shotsonclo...@gmail.comwrote:

 I am trying to post the json update request using
 java.net.HttpURLConnection.

 Parameters I am using:

 url : http://localhost:8983/solr/update/json?commit=true

  String data =
      "[{\"id\" : \"TestDoc7\", \"title\" : \"test 7\"}, {\"id\" :
      \"TestDoc8\", \"title\" : \"another test 8\"}]";

 uri += "" + data;

 String requestType = "POST";


 Header info:
 "Content-type", "application/json"
 "Content-Length", "" + data.length()


 I see, the request going to the solr server but the data is not being
 persisted.

 I was able to add the same documents with curl, I am sure there is
 something stupid.

 Any pointers  on what might be the mistake.

 Thanks,
 Sharath



Loading huge synonym list in Solr

2011-08-04 Thread Arun Atreya
Hello,

I would like to know the best way to load a huge synonym list into Solr.

I would like to do concept indexing (a.k.a. category indexing) with Solr. For
example, I want to be able to index all cities and be able to search for all
of them using a special keyword, say 'CONCEPTcity', where 'CONCEPTcity' will
match anything that IS-A city, as specified in the index_synonyms.txt file. I
believe the best way to do this is via the SynonymFilterFactory and do
index-time synonym expansion. Or is there a better alternative?

I would still like to keep the original city names and do not want to
replace them with 'CONCEPTcity', so if someone searches for 'Lake', the city
name 'Salt Lake City' still matches. Also, obviously, I do not want two
different city names to be synonyms of each other.

Is the correct way to specify the index_synonyms.txt file like this?

-
CONCEPTcity, Salt Lake City
CONCEPTcity, New York
CONCEPTcity, San Jose
.
.
.
-

and then keep
expand=true
for SynonymFilterFactory?

I tried to load a synonym file with 10K entries like this, and Solr/Jetty
took a few seconds to start; but if I try to load a synonym file with 1M+
entries, it takes a very long time. What is the best way to do this?

Thanks,
Arun.


Re: Loading huge synonym list in Solr

2011-08-04 Thread Robert Muir
https://issues.apache.org/jira/browse/LUCENE-3233

On Thu, Aug 4, 2011 at 7:24 PM, Arun Atreya my.2.pai...@gmail.com wrote:
 Hello,

 I would like to know the best way to load a huge synonym list into Solr.

 I would like to do concept indexing (a.k.a category indexing) with Solr. For
 example, I want to be able to index all cities and be able to search for all
 of them using a special keyword, say 'CONCEPTcity', where 'CONCEPTcity' will
 match anything that IS-A city, as specified in the index_synonyms.txt file. I
 believe the best way to do this is via the SynonymFilterFactory and do
 index-time synonym expansion. Or is there a better alternative?

 I would still like to keep the original city names and do not want to
 replace them with 'CONCEPTcity', so if someone searches for 'Lake', the city
 name 'Salt Lake City' still matches. Also, obviously, I do not want two
 different city names to be synonyms of each other.

 Is the correct way to specify the index_synonyms.txt file like this?

 -
 CONCEPTcity, Salt Lake City
 CONCEPTcity, New York
 CONCEPTcity, San Jose
 .
 .
 .
 -

 and then keep
 expand=true
 for SynonymFilterFactory?

 I tried to load a synonym file with 10K entries like this, and Solr/Jetty
 took a few seconds to start, but if I try to load a synonym file with 1M+
 entries, then it is taking a long time. What is the best way to do this?

 Thanks,
 Arun.




-- 
lucidimagination.com


Copy Fields while Replication

2011-08-04 Thread Pawan Darira
Hi

I would like to know whether I can add new fields while replicating the index
on the slave. E.g. my master has an index with field F1, which is created with
type "string". Now I don't want F1 as type "string", and I also have the
limitation that I cannot change the field type at the schema level.

Now, if I replicate that index on the slave, can I use the "copyField"
attribute to create a copy of field F1 into a field F2? F2 will be my field
with type "text", which I can use as per my requirement.

Please suggest

Thanks
Pawan


Solr DIH import - Date Question

2011-08-04 Thread solruser@9913
This is perhaps a truly newbie question.  I am processing some files via the
DIH handler/XPath processor.  Some of the date fields in the XML are in
'Java long format', i.e. just a big long number.  I am wondering how to map
them to a Solr date field.  I used the DIH DateFormatTransformer for some
other date fields that were written out in a regular date format.

However I am stumped on this one; I thought it would be simple, but I was not
able to find a solution.

Any help would be much appreciated.

-g

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-DIH-import-Date-Question-tp3227720p3227720.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr DIH import - Date Question

2011-08-04 Thread Lance Norskog
You might have to do this with an external script. The DIH lets you
process fields with JavaScript or Groovy.

Also, somewhere in the DIH you can supply an XSL stylesheet instead of
just an XPath.
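
A third option is a small custom Java transformer; DIH calls it once per row.
A hedged sketch (the field name timestamp and the class name are assumptions),
registered with transformer="com.example.EpochDateTransformer" on the entity:

package com.example;

import java.util.Date;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class EpochDateTransformer extends Transformer {
    public Object transformRow(Map<String, Object> row, Context context) {
        Object raw = row.get("timestamp");   // epoch milliseconds as text or number
        if (raw != null) {
            row.put("timestamp", new Date(Long.parseLong(raw.toString())));
        }
        return row;
    }
}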

On Thu, Aug 4, 2011 at 10:31 PM, solruser@9913 gunaranj...@yahoo.com wrote:
 This is perhaps a 'truly newbie' question.  I am processing some files via
 DIH handler/XPATH Processor.  Some of the date fields in the XML are in
 'Java Long format' i.e. just a big long number.  I am wondering how to map
 them Solr Date field.  I used the DIH  DateFormatTransformer for some other
 'date' fields that were written out in a regular date format.

 However I am stumped on this - thought it would be simple but I was not able
 to find a solution 

 Any help would be much appreciated

 -g

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-DIH-import-Date-Question-tp3227720p3227720.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com