question on word parsing control

2012-05-01 Thread kenf_nc
I have a field that is defined using what I believe is a fairly standard text
fieldType. I have documents with the words 'evaluate', 'evaluating', and
'evaluation' in them. When I search on the whole word it obviously works, but
if I search on 'eval' it finds nothing. However, for some reason, if I search
on 'evalu' it finds all the matches. Is there an indexing or query setting
that tokenizes 'evalu' but not 'eval', and how do I get 'eval' to be a match?

Thanks,
Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-on-word-parsing-control-tp3952925.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: optional nested queries

2011-07-01 Thread kenf_nc
I don't use dismax, but I do something similar with a regular query. I have a
field defined in my schema.xml called 'dummy' (not sure why it's called that,
actually) that defaults to 1 on every document indexed. So, say I want to
give a score bump to documents that have an image; I can do queries like:

q=Some+search+text+AND+(has_image:true^.5 OR dummy:1)

I'm doing that from memory and haven't actually tested it, so my syntax may be
off, but I hope you get the idea. Basically, the first part of the OR query in
parentheses is your optional nested query; if it fails, or even if the
document doesn't have a field called has_image at all, the dummy:1 will
always pass.
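[A sketch of how such a default-valued field might be declared in schema.xml; the name and type here are assumptions, not from the poster's schema:]

```xml
<!-- Hypothetical: a field whose value defaults to 1 for every indexed document,
     giving every document a guaranteed match target for the OR clause above. -->
<field name="dummy" type="int" indexed="true" stored="false" default="1"/>
```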

Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/optional-nested-queries-tp3128847p3129064.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to optimize solr indexes

2011-07-01 Thread kenf_nc
I believe that is not a setting. It's not telling you that you have optimize
'turned on'; it's a state: your index is currently optimized. If you index a
new document or delete an existing document without issuing an optimize
command, then your index should show optimize=false.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-optimize-solr-indexes-tp3125293p3129078.html
Sent from the Solr - User mailing list archive at Nabble.com.


Compound word search not what I expected

2011-06-07 Thread kenf_nc
I have a field defined as:
<field name="content" type="text" indexed="true" stored="false"
       termVectors="true" multiValued="true"/>
where "text" is unmodified from the schema.xml example that came with Solr
1.4.1.

I have documents with some compound words indexed, words like Sandstone, and
in several cases words that are camel case, like MaxSize. If I query using
all lower case (sandstone or maxsize), I get the documents I expect. If I
query with proper case (Sandstone or Maxsize), I get the documents I expect.
However, if the query is camel case (MaxSize or SandStone), it doesn't find
the documents. In the case of MaxSize this is particularly frustrating,
because that is the actual case of the word that was indexed. Is this
expected behavior? The query analyzer definition for the text field type is:
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
          ignoreCase="true" enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
          catenateAll="0" catenateNumbers="0" catenateWords="0"
          generateNumberParts="1" generateWordParts="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
</analyzer>

Is the order of the filters important? If LowerCaseFilterFactory came before
WordDelimiterFilterFactory, would that fix this? Would it break something
else?

Thanks,
Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Compound-word-search-not-what-I-expected-tp3036089p3036089.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Compound word search not what I expected

2011-06-07 Thread kenf_nc
I tried setting catenateWords="1" on the query analyzer and that didn't do
anything. I think what I need is to set my index analyzer to have
preserveOriginal="1" and then re-index everything. That will be a pain, so
I'll do a small test to make sure first. I'm really surprised
preserveOriginal="1" isn't the default. It's like saying "slice and dice
this word so I can search on all kinds of partial matches... but do NOT let
me search on the actual word itself." I know it's not quite that, but it's
close. Anyway, I'm going to try the preserveOriginal parameter on
WordDelimiterFilterFactory, on both the index and query side, and see what
happens.
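[A sketch of the index-side WordDelimiterFilterFactory with preserveOriginal turned on; the other attribute values are assumed to match the stock text type, not taken from the poster's schema:]

```xml
<!-- Sketch: keep the original token (e.g. "MaxSize") in addition to the
     split parts ("Max", "Size"), so the exact indexed form stays searchable. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="1" preserveOriginal="1"/>
```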

Thanks for all the suggestions,
Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Compound-word-search-not-what-I-expected-tp3036089p3037068.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: sorting on date field in facet query

2011-05-19 Thread kenf_nc
This is more speculation than direction; I don't currently use field
collapsing, but my take on it is that it returns the number of docs
collapsed. So instead of faceting, could you do a search returning DocID,
collapsing on DocID and sorting on date? Then the count of collapsed docs
*should* match the facet count.

Just wondering.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/sorting-on-date-field-in-facet-query-tp2956540p2961612.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Anyone having these Replication issues as well?

2011-05-18 Thread kenf_nc
Thanks, Markus, for your patience in getting the response posted, as well as
for the comments.

This is my dev environment; I'm actually going to be setting up a new
master-slave configuration in a different environment today, so I'll see
whether it's environment specific or not. One thing I didn't mention (I
wasn't sure it was germane) is that these servers are in Amazon EC2. Also,
the master is currently on a 32-bit OS while the slaves are on 64-bit OSs;
that's just the order in which the servers are getting upgraded in dev.

The master has autoCommit turned on at 30-second intervals. Even if nothing
is getting indexed, could an autoCommit occurring during a replication
request cause a failed replication?
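[For reference, a sketch of the kind of solrconfig.xml autoCommit setting being described, with the 30-second interval expressed in milliseconds; the exact values are assumptions:]

```xml
<!-- Sketch: auto-commit pending documents every 30 seconds -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>30000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```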

Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Anyone-having-these-Replication-issues-as-well-tp2954365p2957127.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Anyone familiar with Solandra or Lucandra?

2011-05-17 Thread kenf_nc
But I can query Cassandra directly for the documents if I wanted/needed to? 

And, when I need to re-index, I could read from Cassandra, index into Solr,
which will write back to Cassandra overwriting the existing document(s)?

Basically the steps would be, index documents into Solr which would write to
Cassandra. If I need to update a document, I can query it from Solr OR query
it from Cassandra, make my modification and re-index it back to Solr which
will update Cassandra. If I need to drop my Solr index and completely
recreate it I could read all documents from Cassandra and index them into
the clean Solr instance, which will update (with no change) the documents in
Cassandra. If I update a document directly in Cassandra without going thru
Solr indexing, the change would show up on a Solr query of that document,
but the search indexes would not reflect any change.  Is all that correct?

Also, is index and query performance on par with a sharded pure-Solr
implementation?

Thanks for the feedback,
Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Anyone-familiar-with-Solandra-or-Lucendra-tp2927357p2953764.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Anyone familiar with Solandra or Lucandra?

2011-05-17 Thread kenf_nc
Ah. I see. That reduces its usefulness to me some. The multi-master aspect is
still a big draw of course. But I was hoping this also added an integrated
persistence layer to Solr as well. 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Anyone-familiar-with-Solandra-or-Lucendra-tp2927357p2954320.html
Sent from the Solr - User mailing list archive at Nabble.com.


Anyone having these Replication issues as well?

2011-05-17 Thread kenf_nc
Is it just me, or is replication a POS? (Solr 1.4.1, Tomcat 6.0.32)

1) I had set my pollInterval to 60 seconds, but it appeared to fire
constantly, so I set it to 5 minutes. I see in the Tomcat logs that it fires
the replication check anywhere from 2 minutes to 4 1/2 minutes apart, but
never anything remotely consistent and never approaching 5 minutes. What
kind of timer is being used, a sundial?

2) When it does fire, it seems to do the check between slave and master
anywhere from 3 to 8 times for a single poll interval. I have 3 slaves and
1 master, and the master gets pounded by replication check queries: where it
should get 3 every 5 minutes, it gets up to 24 every couple of minutes.

3) Worst of all, there is a replication.properties file on the slaves that
constantly shows errors, yet the Tomcat logs on both the slaves and the
master are error free. Below is a representative sample. The timesFailed
number just keeps climbing; the one below went from 10 to 32 in about 8
minutes on the same server, and it should only attempt once every 5 minutes.

#Replication details
#Tue May 17 17:10:00 EDT 2011
replicationFailedAtList= {some long string of large numbers}
previousCycleTimeInSeconds=0
timesFailed=10
indexReplicatedAtList= {some long string of large numbers}
indexReplicatedAt=130500335
replicationFailedAt=130500335
timesIndexReplicated=10
lastCycleBytesDownloaded=0

Keep in mind, replication actually works! If I add/update a document on the
master, I see it on the slaves eventually. So the errors above are especially
frustrating.

Any help on any or all of these issues would be greatly appreciated.
Thanks,
Ken


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Anyone-having-these-Replication-issues-as-well-tp2954365p2954365.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Schema Design Question

2011-05-15 Thread kenf_nc
Create a separate document for each book-bookshelf combination:
doc 1 = book 1, shelf 1
doc 2 = book 1, shelf 3
doc 3 = book 2, shelf 1
etc.

Then your queries are q=book_id to get all bookshelves a given book is on,
or q=shelf_id to get all books on a given bookshelf.
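[Sketched as Solr add XML; the id and field names here are assumptions for illustration, not from the original schema:]

```xml
<!-- Hypothetical: one document per book/shelf pairing -->
<add>
  <doc>
    <field name="id">book1_shelf1</field>
    <field name="book_id">1</field>
    <field name="shelf_id">1</field>
  </doc>
  <doc>
    <field name="id">book1_shelf3</field>
    <field name="book_id">1</field>
    <field name="shelf_id">3</field>
  </doc>
</add>
```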

The biggest problem people face with Solr schema design is thinking in
object-oriented or RDBMS terms. You need to think differently: Solr/Lucene
find text, and they find it very fast over huge amounts of data.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Schema-Design-Question-tp2939045p2942809.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Anyone familiar with Solandra or Lucendra?

2011-05-12 Thread kenf_nc
I modified the subject to include Lucendra, in case anyone has heard of it by
that name. 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Anyone-familiar-with-Solandra-or-Lucendra-tp2927357p2933051.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to do offline adding/updating index

2011-05-11 Thread kenf_nc
My understanding is that the master has done all the indexing, and that
replication is a series of file copies to a temp directory, then a move and
a commit. The slave only gets hit with the effects of a commit: whatever
warming queries are in place run, and the caches get reset. Doing too many
commits too often is a problem in any situation with Solr, and I wouldn't
recommend it here. However, the original question implied commits would
occur approximately once an hour; that is easily within the capabilities of
the system. Fine tuning of warming queries should minimize any performance
impact. Any effects should also be a relatively linear constant; they should
not be wildly affected by the size of the update or the number of documents.
Warming query results may be slightly different with new documents, but on
the other hand, your new documents are now in cache ready for fast search,
so it's a reasonable trade-off.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2927336.html
Sent from the Solr - User mailing list archive at Nabble.com.


Anyone familiar with Solandra?

2011-05-11 Thread kenf_nc
The recent Amazon outage exposed a weakness in our architecture: we could
really use master-master redundancy. We already have a master with multiple
slaves. I've looked at the various options for converting a slave into a
master, having a repeater (hybrid master/slave) become the master, etc.
But just yesterday I found out about Solandra
(https://github.com/tjake/Solandra#readme). It looks like exactly what I
want, especially considering we already use Cassandra for a different part
of our architecture. And, if I'm reading it correctly, it could replace the
SQL Server data store we use for persistence. It looks like Solandra gives
me multi-master instead of master-slave, and the documents stored in
Cassandra are persistent. If I need to drop and fully re-index the Solr side
of the house, it can do so from the existing documents already in Cassandra.

Two questions:
Is what I just described accurate?
Is Solandra ready for production?

Thanks,
Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Anyone-familiar-with-Solandra-tp2927357p2927357.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to do offline adding/updating index

2011-05-10 Thread kenf_nc
Master/slave replication does this out of the box, easily. Just set the slave
to update on optimize only; then you can update the master as much as you
want. When you are ready to update the slave (the search instance), just
optimize the master. On the slave's next cycle check it will refresh itself
quickly and efficiently, with minimal impact on search performance. No need
to build extra moving parts for swapping search servers or anything like that.
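["Update on optimize only" is driven by the master-side replicateAfter setting; a sketch of the solrconfig.xml fragment, with the confFiles list assumed for illustration:]

```xml
<!-- Sketch: master publishes a new index version only after an optimize -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt</str> <!-- assumed list -->
  </lst>
</requestHandler>
```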

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2924426.html
Sent from the Solr - User mailing list archive at Nabble.com.


Replication question

2011-05-06 Thread kenf_nc
I have replication set up with:
  <str name="pollInterval">00:00:60</str>

I assumed that meant it would poll the master for updates once a minute, but
my logs make it look like it is trying to sync up almost constantly. Below
is an example of my log from just one minute of time. Am I reading this
wrong? This is from one of the slaves; I have 2 of them, so my master's log
file is double this.

Is this normal?

May 6, 2011 1:34:14 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:34:14 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:34:14 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:34:14 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:34:14 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:34:14 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
May 6, 2011 1:35:05 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
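[For reference, the pollInterval line normally sits inside the slave's replication handler config; a sketch, with the masterUrl host assumed:]

```xml
<!-- Sketch: slave-side replication config; polls the master once a minute -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str> <!-- HH:mm:ss -->
  </lst>
</requestHandler>
```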


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replication-question-tp2909157p2909157.html
Sent from the Solr - User mailing list archive at Nabble.com.


Result order when score is the same

2011-04-13 Thread kenf_nc
I'm using version 1.4.1. It appears that when several documents in a result
set have the same score, the secondary sort is by 'indexed_at' ascending.
Can this be altered in the config XML files, for example to make the
secondary sort indexed_at descending, or to use a different field, say
document title?

Thanks,
Ken

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Result-order-when-score-is-the-same-tp2816127p2816127.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Question for large dataset

2011-04-13 Thread kenf_nc
Indexing isn't a problem; it's just disk space, and space is cheap. But if
you facet on all those price columns, that gets put into RAM, which isn't
as cheap or plentiful. Your cache buffers may get overloaded a lot and
performance will suffer.

2000 price columns seems like a lot; could the documents be organized
differently? It's hard to tell from your example.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816377.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Indexing Question for large dataset

2011-04-13 Thread kenf_nc
Is NAME a product name? Why would it be multivalued, and why would it appear
on more than one document? Is each 'document' a package of products, with
the pricing tiers on the package rather than the individual pieces?

So it sounds like you could, potentially, have a PriceListX column for each
user. As your user base grows, the number of columns you need may grow (you
already bumped up from 2000 to 5000 in the space of a couple posts :) ). Is
that right?

How many products (or packages of products) do you have? Could you flip this
on its ear and make a user the document? Then it could have just 3
multivalued fields (beyond any you need to identify the user, like user_id):
product_id
product_name
product_price

The downside is that if a new product is introduced, you have to re-index
all users that have a price point on that product.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Question-for-large-dataset-tp2816344p2816994.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Result order when score is the same

2011-04-13 Thread kenf_nc
Is sort order when 'score' is the same a Lucene thing? Should I ask on the
Lucene forum?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Result-order-when-score-is-the-same-tp2816127p2817330.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Result order when score is the same

2011-04-13 Thread kenf_nc
Au contraire, I have almost 4 million documents, representing businesses in
the US, and having the score be the same is a very common occurrence.

It is quite clear from testing that if the score is the same, it sorts on
indexed_at ascending. It seems silly to make me add a sort to every query;
there should be some configuration to modify this. However, if I make all my
queries include sort=score+desc,indexed_at+desc, will that have a
detrimental performance effect?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Result-order-when-score-is-the-same-tp2816127p2817458.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Result order when score is the same

2011-04-13 Thread kenf_nc
Is a new docID generated every time a doc with the same uniqueID is added to
the index? If so, then docID must be incremental and would look like
indexed_at ascending. What I see (and why it's a problem for me) is the
following.

A search brings back the first 5 documents in a result set of, say, 60. The
scores and titles are as follows (simulated):
1) 6.5, Doc 1
2) 6.3, Doc 2
3) 4.7, Doc 3
4) 4.7, Doc 4
5) 4.7, Doc 5
---
6) 4.7, Doc 6
7) 4.7, Doc 7
8) 4.4, Doc 8

If I query 6 times, the results come back like that every time. However, if
I change a field in Doc 4, a field that is not part of the search, it gets
the same score, but the results are now this:
1) 6.5, Doc 1
2) 6.3, Doc 2
3) 4.7, Doc 3
4) 4.7, Doc 5
5) 4.7, Doc 6
---
6) 4.7, Doc 7
7) 4.7, Doc 4
8) 4.4, Doc 8

So, in a specific situation I'm looking at, a user sees 5 items on a UI
page and clicks a button to 'favorite' document #4. I update Doc 4 and
(because it was architecturally better) re-issue the search. So from the
user's viewpoint, they 'favorited' number 4 and it disappeared from their
screen. Not a good user experience.

If I could modify the secondary sort when the score is the same, then in
the worst case doc 4 would pop to the top of the user's screen but not
disappear. Better would be a secondary sort on title or some other fixed
field that exists on all documents. But I would want the sort to be at the
system level; I don't want the overhead of sorting every query I ever make.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Result-order-when-score-is-the-same-tp2816127p2817766.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index Design Question

2011-02-17 Thread kenf_nc

Some options to reduce performance implications are:
   replication... index your documents in one Solr instance, and query in a
different one. That way the users of the query side will not be as adversely
impacted by frequent changes, and you have better control over when change
occurs.

   separate search from display... one mistake I see a lot is people putting
everything into Solr. Solr is optimized for search, so it sometimes makes
sense to put into a Solr index only those fields you are searching against.
In some architectures this leaves a large amount of data that can be stored
somewhere else: an RDBMS, a file system, a third-party host, whatever. You
search on Solr, and use some identifier to get the rest of the data from
somewhere else. That way, only changes to searchable fields need to be
indexed; the rest just need to be stored. It could minimize the impact on
your Solr documents.

   multi-threading... usually any performance bottleneck is on the sending
side, not the Solr side; Solr handles multiple data inputs gracefully. Be
very aware of how many commits you are doing and what kind of warming
queries you have in place; those are the biggest performance issues from
what I've seen. Having 2 Solr instances, one optimized for indexing (the
master) and one optimized for querying (the slave), with replication, would
help minimize the problem.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-Design-Question-tp2523811p2524424.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Any contribs available for Range field type?

2011-02-15 Thread kenf_nc

I've tried several times to get an active account on
solr-...@lucene.apache.org, but the mailing list won't send me a confirmation
email, and therefore won't let me post because I'm not confirmed. Could
someone who is a member of solr-dev post either my original request in this
thread, or a link to this thread, on the dev mailing list? I was really
hoping for more response to this question; this would be a terrifically
useful field type for just about any Solr index.

Thanks,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Any-contribs-available-for-Range-field-type-tp2473601p2502203.html
Sent from the Solr - User mailing list archive at Nabble.com.


Any contribs available for Range field type?

2011-02-11 Thread kenf_nc

I have a huge need for a new field type. It would be a poly field, similar
to Point or Payload: it would take 2 data elements, and a search would
return a hit if the search term fell within the range of the elements. For
example, let's say I have a document representing an employment record. I
may want to create a field for years_of_service that would take the values
1999,2004. Then in a query, q=years_of_service:2001 would be a hit and
q=years_of_service:2010 would not. The field would need to take a data type
attribute as a parameter; I may need integer ranges, float/double ranges,
or date ranges. I don't see the need now, but heck, maybe even a string
range. This would be useful for things like event dates. An event often
runs across several days (or hours), but the query is something like "what
events are happening today?". If I did q=event_date:NOW (or similar) it
should hit all documents where event_date has a range that is inclusive of
today. Another example would be a product category document: a specific
automobile may have a fixed price, but a category of autos (2010 BMW
3-series, for example) would have a price range.

I hope you get the point. My question (finally) is: does anyone know of an
existing contribution to the public domain that already does this? I'm more
of a .Net/C# developer than a Java developer. I know my way around Java,
but don't really have the right tools to build/test/etc., so I was hoping
to borrow rather than build if I could.
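[The usual Solr 1.4-era workaround, sketched with assumed field names: store each end of the range in its own field and query with two open-ended range clauses.]

```xml
<!-- Hypothetical: the range 1999-2004 stored as two single-valued fields -->
<field name="years_of_service_min" type="tint" indexed="true" stored="true"/>
<field name="years_of_service_max" type="tint" indexed="true" stored="true"/>

<!-- A value V is "in range" when min <= V and V <= max, e.g. for V=2001: -->
<!-- q=years_of_service_min:[* TO 2001] AND years_of_service_max:[2001 TO *] -->
```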

Thanks,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Any-contribs-available-for-Range-field-type-tp2473601p2473601.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Any contribs available for Range field type?

2011-02-11 Thread kenf_nc

True, and that's my temporary solution. But it's ugly code, and even uglier
queries; I may have several such fields in a single query. A PolyField
solution would be so much more elegant and useful. I'm actually shocked more
people don't need/want something like it.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Any-contribs-available-for-Range-field-type-tp2473601p2474055.html
Sent from the Solr - User mailing list archive at Nabble.com.


List of indexed or stored fields

2011-01-25 Thread kenf_nc

I use a lot of dynamic fields, so looking at my schema isn't a good way to
see all the field names that may be indexed across all documents. Is there a
way to query Solr for that information? All field names that are indexed or
stored? Possibly a count by field name? Is there any other metadata about a
field that can be queried?
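[For anyone searching later: the follow-up reply below refers to the LukeRequestHandler. A sketch of the request, with host and port assumed:]

```
http://localhost:8983/solr/admin/luke?numTerms=0
```

The response lists every field actually present in the index, including dynamic ones, with its type, index/store flags, and a per-field docs count; numTerms=0 just skips the top-terms section to keep the response small.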
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: List of indexed or stored fields

2011-01-25 Thread kenf_nc

That's exactly what I wanted, thanks. Any idea what

  <long name="version">1294513299077</long>

refers to under the index section? I have 2 cores on one Tomcat instance,
and 1 on a second instance (a different server), and all 3 have different
numbers for version, so I don't think it's the version of Luke.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2333281.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Single value vs multi value setting in tokenized field

2011-01-20 Thread kenf_nc

Thanks, guys. I read (and actually digested this time) how multivalued
fields work, and now realize my question came from a 'structured
language/DBMS' background. The multivalued field is stored basically as a
single stream of terms with extra position space between values (the
positionIncrementGap previously mentioned).

So the difference between a single-value text field and a multivalued one
is (in a simplification):
single:
 term term term term
multi:
 term term <gap> term term
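[In schema terms, that gap is the positionIncrementGap attribute on the field type; a sketch with an assumed value and a minimal analyzer:]

```xml
<!-- Sketch: a 100-position gap is inserted between successive values -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A gap of 100 means a phrase query ordinarily can't match across two different values of the field, unless its slop exceeds the gap.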

If I wasn't heads down 12 hours a day trying to keep ahead of the
requirement changes...I might actually learn this thing :).
Thanks again.
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Single-value-vs-multi-value-setting-in-tokenized-field-tp2268635p2294658.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solrconfig.xml settings question

2011-01-20 Thread kenf_nc

Is that it? Of all the strange, esoteric, little-understood configuration
settings available in solrconfig.xml, the only things that affect index
speed vs. query speed are turning the query cache on/off and
ramBufferSizeMB? And for the latter, why wouldn't ramBufferSizeMB be the
same for both, that is, as high as you can make it?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solrconfig-xml-settings-question-tp2271594p2294668.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Single value vs multi value setting in tokenized field

2011-01-17 Thread kenf_nc

No, I have both: a single field (for free-form text search) and individual
fields (for directed search). I already duplicate the data and that's not a
problem; disk space is cheap. What I wanted to know was whether it is best
to make the single field multiValued="true" or not. That is, should my
'content' field hold data like:
   some description maybe a paragraph or two
   a product or service title
   tag1
   tag2
   feature1
   feature2
or would it be better to make it a concatenated, single-value field like:
 some description maybe a paragraph or two a product or service title
tag1 tag2 feature1 feature2

My indexing seems to take longer than most; it takes about 2 1/2 hours to
index 3.5 million records. I have a colleague who, in a separate project,
is indexing 70 million records in about 4 hours, albeit with a much simpler
schema. So I'm trying to see if this could be a factor in my indexing
performance. I also wanted to know what impact, in general (not just in
this situation), using a multivalued field versus a single-valued field has
on search results.

I would have thought that having to support both a free-form text search
and a field (directed) search would be a common problem, and I was just
looking for advice.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Single-value-vs-multi-value-setting-in-tokenized-field-tp2268635p2271543.html
Sent from the Solr - User mailing list archive at Nabble.com.


solrconfig.xml settings question

2011-01-17 Thread kenf_nc

In the wiki, in the book by Smiley and Pugh, and in the comments inside the
solrconfig.xml file itself, the various settings are always discussed in
the context of a blended-use Solr index; that is, it's assumed you are
indexing and querying from the same Solr instance. However, if I have a
master-slave setup, I should be able to optimize the master for indexing
data and the slave for querying it. Does anyone have links to information
that discusses this? I want to index as furiously as possible into one Solr
instance without regard to the impact it will have on queries, and to query
another Solr instance that only has to worry about replication, not
constant add/update/delete/commit activity. I want my solrconfig settings
to be as optimal as possible.
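[A starting-point sketch of solrconfig.xml settings that are often tuned differently on the two sides; every value here is an assumption to illustrate the split, not a recommendation:]

```xml
<!-- Master (indexing-heavy): bigger RAM buffer, fewer merge pauses -->
<indexDefaults>
  <ramBufferSizeMB>128</ramBufferSizeMB> <!-- assumed value -->
  <mergeFactor>25</mergeFactor>          <!-- assumed value -->
</indexDefaults>

<!-- Slave (query-heavy): larger, autowarmed caches -->
<query>
  <filterCache class="solr.FastLRUCache"
               size="4096" initialSize="1024" autowarmCount="256"/>
  <useColdSearcher>false</useColdSearcher>
</query>
```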

Links, comments, references to previous forum threads, any and all feedback
is appreciated.
Thanks,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solrconfig-xml-settings-question-tp2271594p2271594.html
Sent from the Solr - User mailing list archive at Nabble.com.


Single value vs multi value setting in tokenized field

2011-01-16 Thread kenf_nc

I have to support both general searches (free form text) and directed
searches (field:val field2:val). To do the general search I have a field
defined as:
   <field name="content" type="text" indexed="true" stored="false"
termVectors="true" multiValued="true" />
and several copyField commands like:
  <copyField source="description" dest="content" />
  <copyField source="title" dest="content" />
  <copyField source="tags" dest="content" />
  <copyField source="features" dest="content" />
Note that tags and features are multi-value themselves. So after indexing I
have a 'general text' bucket with numerous (usually in the 20 to 30 range)
rows of strings. 

My question is would it be better, for indexing speed and search
speed/quality, to concatenate all the text into a single string and store it
in content as one value? What are the implications on search results? If
Description is, say, a couple paragraphs of text and tags are
"Cuisine, Italian, Romantic", would the tags get lost in the muck of the
bigger text?

One thing to keep in mind. I'm sure some of you are going to say 'Dismax'
and in some situations I will, but my index has numerous document types that
have vastly different schemas. Another document may not have title and
features but might have recommendations and location. In a general
query it wouldn't make sense to include every possible field in a dismax
query, I don't even know what all the fields are, new ones are added all the
time.

Has anyone got advice, suggestions on this topic (blending directed search
with general search)? 
Thanks in advance,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Single-value-vs-multi-value-setting-in-tokenized-field-tp2268635p2268635.html


Re: Query : FAQ? Forum?

2011-01-14 Thread kenf_nc

http://wiki.apache.org/solr/FrontPage Solr Wiki 
http://wiki.apache.org/solr/FAQ Solr FAQ 
http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881/ref=sr_1_1?ie=UTF8&qid=1295018231&sr=8-1
A good book on Solr 

And this forum you posted to 
http://lucene.472066.n3.nabble.com/Solr-User-f472068.html (Solr-User)  is
one of the most active and useful Tech forums I've ever used. Don't be
afraid to ask stupid questions, folks here are pretty forgiving and patient,
especially if you attempt to use the Wiki or FAQ first.

Good Luck!
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-FAQ-Forum-tp2254898p2256030.html


Re: Question on deleting all rows for an index

2011-01-13 Thread kenf_nc

If this is a one-time cleanup, not something you need to do programmatically,
you could delete the index directory ( solrDir/data/index ). In my case I
have to stop Tomcat, delete .\index and restart Tomcat. It is very fast and
starts me out with a fresh, empty, index. Noticed you are multi-core, I'm
not, so this could be bogus information for you...but thought I'd toss it
out just in case.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-on-deleting-all-rows-for-an-index-tp2246726p2248332.html


Re: basic document crud in an index

2011-01-13 Thread kenf_nc

A/ You have to update all the fields, if you leave one off, it won't be in
the document anymore. I have my 'persisted' data stored outside of Solr, so
on update I get the stored data, modify it and update Solr with every field
(even if one changed). You could also do a Query/Modify/Update directly in
Solr, just remember to send all fields in the update. There isn't (in 1.4
anyway) a way to update specific fields only.

B/ When you update, it is my understanding that, yes, the old doc is there
deleted and a new doc is in place. You can't get to the old one however and
it will go away at the next Optimize. I've never used it, but when you
Commit you can send an optional parameter 'expungeDeletes' that should
remove deleted docs as well.

C/ Not that I'm aware of

D/ don't know

E/ That is my understanding, but I'm admittedly a little weak on that part.
I just have a job that runs in the middle of the night and runs Optimize
once each night, I don't dig deeper than that into what goes on.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/basic-document-crud-in-an-index-tp2246793p2248422.html


Re: Consequences for using multivalued on all fields

2010-12-21 Thread kenf_nc

I have about 30 million documents and with the exception of the Unique ID,
Type and a couple of date fields, every document is made of dynamic fields.
Now, I only have maybe 1 in 5 being multi-value, but search and facet
performance doesn't look appreciably different from a fixed schema solution.
I don't do some of the fancier things, highlighting, spell check, etc. And I
use a lot more string or lowercase field types than I do Text (so not as
many fully tokenized fields), that probably helps with performance.

The only disadvantage I know of is dealing with field names at runtime.
Depending on your architecture, you don't really know what your document
looks like until you have it in a result set. For what I'm doing, that isn't
a problem.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Consequences-for-using-multivalued-on-all-fields-tp2125867p2126120.html


Re: Solr site not accessible

2010-12-17 Thread kenf_nc

Yep, www.apache.org is down. They tick off the wikihackers too? :)
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-site-not-accessible-tp2105072p2105095.html


Re: Thank you!

2010-12-16 Thread kenf_nc

Hear hear! In the beginning of my journey with Solr/Lucene I couldn't have
done it without this site. Smiley and Pugh's book was useful, but this forum
was invaluable.  I don't have as many questions now, but each new venture,
Geospatial searching, replication and redundancy, performance tuning, brings
me back again and again. This and stackoverflow.com have to be two of the
most useful destinations on the internet for developers. Communities are so
much more relevant than reference materials, and the consistent activity in
this community is impressive.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Thank-you-tp2096329p2098512.html


Re: Multi Word searches in Solr

2010-11-17 Thread kenf_nc

Multi word queries is the bread and butter of Solr/Lucene, so I'm not sure I
understand the complete issue here. For clarity, is 'abstract' the name of
your default text field, or is your query

q=abstract: mouse genome 

if the latter, my thought was: is it possible that the query is being
converted into a query of
q=abstract:mouse genome, where "mouse" is looked for in the field abstract
and "genome" is compared to the default text field? This is a stab in the
dark; I don't know what your data looks like.

You say it doesn't work the way you expect, but you don't really say what
you do see. Are you getting zero results, or fewer than you expected, or
only results that match all fields (the AND proposition)?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Multi-Word-searches-in-Solr-tp1918802p1918915.html


Re: Link to download solr4.0 is not working?

2010-11-15 Thread kenf_nc

While we are on this subject...my company is kind of new to the whole open
source as a production tool concept. I can't push anything to production
that isn't labeled as 'release' or similar designation. So, 1.4.1 is what I
have right now. I can play with other versions but that's about it. I'm
fairly new to open source myself.

I was curious, who decides when 4.0 is ready for Release? The community?
What are the criteria under which that decision is made? Is there a
published timeline, or is it just ready when it's ready?  

I'm happy with 1.4.1, it does the job. But there are some features in 4.0
I'm really looking forward to. Also, does anyone know if Smiley and Pugh
(http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881/ref=sr_1_1?ie=UTF8&s=books&qid=1289830206&sr=8-1)
are planning a 2nd Edition of their book to cover 4.0?

(This was only a partial hijacking of the thread, it felt relevant though,
apologies to any purists that disagree)

Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Link-to-download-solr4-0-is-not-working-tp1886719p1904550.html


Re: Link to download solr4.0 is not working?

2010-11-15 Thread kenf_nc

Thanks Jan. I didn't know about 1.4.2 I'll give it a look. However, your link
is something I've already seen. I understand the different Solr versions, my
question was more on what is the process, and timeline, for the community to
turn the current trunk into a 'release'. From that link, and other forum
comments I basically have determined that 3.x is useless and will be
skipped. 4.x is an important advancement and highly anticipated. 

I just wanted to know how/when 4.0 goes from "next major release (trunk in
svn)" to "latest officially stable release". 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Link-to-download-solr4-0-is-not-working-tp1886719p1904848.html


Re: Query question

2010-11-03 Thread kenf_nc

Unfortunately the default operator is set to AND and I can't change that at
this time. 

If I do  (city:Chicago^10 OR Romantic OR View) it returns way too many
unwanted results.
If I do (city:Chicago^10 OR (Romantic AND View)) it returns less unwanted
results, but still a lot.
iorixxx's solution of (Romantic AND View AND (city:Chicago^10 OR (*:*
-city:Chicago))) does seem to work. Chicago results are at the top, and the
remaining results seem to fit the other search parameters. It's an ugly
query, but does seem to do the trick for now until I master Dismax.

Thanks all!

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-question-tp1828367p1834793.html


Query question

2010-11-02 Thread kenf_nc

I can't seem to find the right formula for this. I have a need to build a
query where one of the fields should boost the score, but not affect the
query if there isn't a match. For example, if I have documents with
restaurants, name, address, cuisine, description, etc.  I want to search on,
say, 

Romantic AND View AND city:Chicago

if city is in fact Chicago it should score higher, but if city is not
Chicago (or even if it's missing the city field), but matches the other
query parameters it should still come back in the results. Is something like
this possible?  It's kind of like  q=(some query) optional boost if
field:value.

Thanks,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-question-tp1828367p1828367.html


Re: Query question

2010-11-02 Thread kenf_nc

Jonathan, Dismax is something I've been meaning to look into, and bq does
seem to fit the bill, although I'm worried about this line in the wiki:
":TODO: That latter part is deprecated behavior but still works. It can be
problematic so avoid it."
It still seems to be the closest to what I want however so I'll play with
it.

Erick, that query would return all restaurants in Chicago, whether they
matched Romantic View or not. Although the scores should sort relevant
results to the top, the results would still contain a lot of things I wasn't
interested in.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-question-tp1828367p1828639.html


Re: Reverse range query

2010-10-29 Thread kenf_nc

I modified the text of this hopefully to make it clearer. I wasn't sure what
I was asking was coming across well. And I'm adding this comment in a
shameless attempt to boost my question back to the top for people to see.
Before I write a messy work around, just wanted to check the community to
see if this was already handled, it seems like a useful, common, data type.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Reverse-range-query-tp1789135p1792126.html


Reverse range search

2010-10-28 Thread kenf_nc

Doing a range search is straightforward. I have a fixed value in a document
field, I search on [x TO y] and if the fixed value is in the range requested
it gets a hit. But, what if I have data in a document where there is a min
value and a max value and my query is a fixed value and I want to get a hit
if the query value is in that range. For example:

Solr Doc1:
field  min_price:100
field  max_price:500

Solr Doc2:
field  min_price:300
field  max_price:500

and my query is price:250. I could create a query of (min_price:[* TO 250]
AND max_price:[250 TO *]) and that should work. It should find only doc 1.
However, if I have several fields like this and complex queries that include
most of those fields, it becomes a very ugly query. Ideally I'd like to do
something similar to what the spatial contrib guys do where they make
lat/long a single point. If I had a min/max field, I could call it Price
(100, 500) or Price (300,500) and just do a query of  Price:250 and Solr
would see if 250 was in the appropriate range.
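The compound clause can at least be generated mechanically. Here is a minimal Python sketch, assuming the min_/max_ field-naming convention from the example above (the helper names are hypothetical, not an existing Solr feature):

```python
def range_clause(field_base, value):
    # Match documents whose stored [min_field, max_field] range contains value.
    return f"(min_{field_base}:[* TO {value}] AND max_{field_base}:[{value} TO *])"

def contains_query(**fields):
    # AND together one clause per field, e.g. contains_query(price=250, sqft=1200).
    return " AND ".join(range_clause(name, val)
                        for name, val in sorted(fields.items()))

print(contains_query(price=250))
# (min_price:[* TO 250] AND max_price:[250 TO *])
```

The query is still ugly on the wire, but at least the application code only deals with "price=250".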

Looong question short...Is there something out there already that does this?
Does anyone else do something like this and have some suggestions?
Thanks,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Reverse-range-search-tp1789135p1789135.html


Re: Stored or indexed?

2010-10-27 Thread kenf_nc

Interesting wiki link, I hadn't seen that table before.

And to answer your specific question about indexed=true, stored=false: this
is most often done when you are using analyzers/tokenizers on your field.
This field is for search only; you would never retrieve its contents for
display. It may in fact be an amalgam of several fields into one 'content'
field. You have your display copy stored in another field marked
indexed=false, stored=true and optionally compressed. I also have simple
string fields set to lowercase so searching is case-insensitive, and have a
duplicate field where the string is normal case. the first one is
indexed/not stored, the second is stored/not indexed. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Stored-or-indexed-tp1782805p1784315.html


Re: Missing facet values for zero counts

2010-09-29 Thread kenf_nc

I don't understand why you would want to show Sweden if it isn't in the
index, what will your UI do if the user selects Sweden?

However, one way to handle this would be to make a second document type.
Have a field called type or some such, and make the new document type be
'dummy' or 'system' or something like that. You can put documents in here
with fields for any pick-lists you want to facet on and include all possible
values from your database.

Do your facets on either just this doc or all docs; either way should work.
However, on your search queries always include fq=-type:system, which
basically excludes all documents of type 'system' from all your searches.
Messy, but should do what you want.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Missing-facet-values-for-zero-counts-tp1602276p1603893.html


Re: Re:The search response time is too loong

2010-09-27 Thread kenf_nc

"mem usage is over 400M": do you mean Tomcat mem size? If you don't give your
cache sizes enough room to grow you will choke the performance. You should
adjust your Tomcat settings to let the cache grow to at least 1GB, or better,
2GB. You may also want to look into warming the cache
(http://wiki.apache.org/solr/SolrCaching) to make the first-time call a
little faster. 

For comparison, I also have about 8GB in my index but only 2.8 million
documents. My search query time on a smaller box than you specify is 6533
milliseconds on an unwarmed (newly rebooted) instance. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Re-The-search-response-time-is-too-loong-tp1587395p1588554.html


Re: Solr Reporting

2010-09-23 Thread kenf_nc

keep in mind that the <str name="id"> paradigm isn't completely useless; the
str is a data type (string), and it can be int, float, double, date, and
others. So to not lose any information you may want to do something like:

<id type="int">123</id>
<title type="str">xyz</title>

Which I agree makes more sense to me. The name of the field is more
important than it's datatype, but I don't want to lose track of the data
type.

Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html


Re: How can I delete the entire contents of the index?

2010-09-23 Thread kenf_nc

Quick tangent... I went to the link you provided, and the delete part makes
sense. But in the next tip, how to re-index after a schema change, what is
the point of step

5. Send an <optimize/> command.

? Why do you need to optimize an empty index? Or is my understanding of
Optimize incorrect?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-delete-the-entire-contents-of-the-index-tp1565548p1567640.html


Re: Searches with a period (.) in the query

2010-09-23 Thread kenf_nc

Do you have any other Analyzers or Formatters involved? I use delimiters in
certain string fields all the time, usually a colon ":" or slash "/", but it
should be the same for a period. I've never seen this behavior. But if you
have any kind of tokenizer or formatter involved beyond 
<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true" /> 
then you may be introducing something extra to the party.

What does your fieldType definition look like?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1567666.html


Re: Searches with a period (.) in the query

2010-09-22 Thread kenf_nc

Could it be a case-sensitivity issue? "The StrField type is not analyzed, but
indexed/stored verbatim" (from the schema comments). If you are looking for
ab.pqr but it is in fact ab.Pqr in the solr document, it wouldn't find it.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1565057.html


Re: getting a list of top page-ranked webpages

2010-09-17 Thread kenf_nc

A slightly different route to take, but one that should help test/refine a
semantic parser, is Wikipedia. They make available their entire corpus, or
any subset you define. The whole thing is like 14 terabytes, but you can get
smaller sets. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/getting-a-list-of-top-page-ranked-webpages-tp1515311p1516649.html


Re: Can i do relavence and sorting together?

2010-09-17 Thread kenf_nc

Those are at least 3 different questions. Easiest first, sorting:
add &sort=ad_post_date+desc (or asc) to your query to sort on date,
descending or ascending.

Check out how Lucene scores by default
(http://www.supermind.org/blog/378/lucene-scoring-for-dummies). It might be
close to what you want. The only thing it isn't doing that you are looking
for is the relative distance between keywords in a document. 

You can add a boost to the ad_title and ad_description fields to make them
more important to your search.
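As a sketch of the boost-plus-sort idea, here is a hedged Python snippet that just builds the query string. ad_title and ad_description come from the thread; ad_body and the boost values 3/2 are assumptions to illustrate the shape, not a recommendation:

```python
from urllib.parse import urlencode

def build_params(terms, sort_field="ad_post_date", ascending=False):
    # Boost title matches 3x and description matches 2x relative to the
    # (assumed) body field; tune the boosts against real queries.
    q = (f"ad_title:({terms})^3 OR "
         f"ad_description:({terms})^2 OR "
         f"ad_body:({terms})")
    order = "asc" if ascending else "desc"
    return urlencode({"q": q, "sort": f"{sort_field} {order}"})

print(build_params("red bicycle"))
```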

My guess is, although I haven't done this myself, the default Scoring
algorithm can be augmented or replaced with your own. That may be a route to
take if you are comfortable with java.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-i-do-relavence-and-sorting-together-tp1516587p1516691.html


Re: DataImportHandler with multiline SQL

2010-09-17 Thread kenf_nc

Sounds like you want the CachedSqlEntityProcessor
(http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor); it
lets you make one query that is cached locally and can be joined to by a
separate query.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-with-multiline-SQL-tp1514893p1516737.html


Re: Get all results from a solr query

2010-09-17 Thread kenf_nc

Chris, I agree, having the ability to make rows something like -1 to bring
back everything would be convenient. However, the 2 call approach
(q=blah&rows=0 followed by q=blah&rows=numFound) isn't that slow, and does
give you more information up front. You can optimize your Array or List
sizes in advance, you could make sure that it isn't a runaway query and you
are about to be overloaded with data, you could split it up into parallel
processes, i.e.:

Thread(q=blah&start=0&rows=numFound/4)
Thread(q=blah&start=numFound/4&rows=numFound/4)
Thread(q=blah&start=numFound/4*2&rows=numFound/4)
Thread(q=blah&start=numFound/4*3&rows=numFound/4)

(not sure my math is right, did it quickly, but you get the point).  Anyway,
having that number can be very useful for more than just knowing max
results.
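The four-way split above can be computed exactly, without off-by-one gaps when numFound isn't divisible by four. A small sketch, not tied to any particular Solr client:

```python
def windows(num_found, parts=4):
    # Split num_found hits into `parts` contiguous (start, rows) pages that
    # cover everything even when num_found isn't divisible by parts.
    base, rem = divmod(num_found, parts)
    pages, start = [], 0
    for i in range(parts):
        rows = base + (1 if i < rem else 0)
        pages.append((start, rows))
        start += rows
    return pages

print(windows(103))  # [(0, 26), (26, 26), (52, 26), (78, 25)]
```

Each (start, rows) pair maps directly onto a q=blah&start=...&rows=... request.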
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Get-all-results-from-a-solr-query-tp1515125p1516751.html


Re: Index partitioned/ Full indexing by MSSQL or MySQL

2010-09-17 Thread kenf_nc

You don't give an indication of size. How large are the documents being
indexed and how many of them are there. However, my opinion would be a
single index with an 'active' flag. In your queries you can use
FilterQueries  (fq=) to optimize on just active if you wish, or just
inactive if that is necessary.

For the RDBMS, do you have any other reason to use a RDBMS besides storing
this data inbetween indexes? Do you need to make relational queries that
Solr can't handle? If not, then I think a file based approach may be better.
Or, as in my case, a small DB for generating/tracking unique_ids and
last_update_datetimes, but the bulk of the data is archived in files and can
easily be updated or read and indexed.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-partitioned-Full-indexing-by-MSSQL-or-MySQL-tp1515572p1516763.html


Re: Does SolrNet support indexing of Database tables and XML files

2010-09-03 Thread kenf_nc

Alok,
I noticed you also posted to the SolrNet forum, and that's a better place
for this question. But basically, SolrNet is a wrapper around Solr
functionality. It lets you build your Solr interactions (Queries, Stats,
Facets, etc) and Inserts/Deletes using .Net objects.

The reading of a data source is up to your application, and the data sources
available are completely up to you, no restrictions from SolrNet. You write
an application that reads from a data source, builds your list of document
objects, and you can send it to Solr, commit, rollback, optimize, etc. But
you are responsible for getting the data from its source.

SolrNet does have a rudimentary connection to nHibernate that may be useful
to you, but it's a little immature and doesn't work for more than basic
situations. But it may be of use to you.

Otherwise, you get out your C# skills and read from your data source however
you see fit, then use SolrNet library to handle all the Gets/Posts/URL
formatting, etc for you.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Does-SolrNet-support-indexing-of-Database-tables-and-XML-files-tp1410353p1412169.html


Re: solr user

2010-09-02 Thread kenf_nc

You are querying for 'branch' and trying to place it in 'skill'.

Also, you have Name and Column backwards, it should be:

<field column="id" name="id"/> 
<field column="name" name="name"/> 
<field column="city" name="city_t"/> 
<field column="skill" name="skill_t"/> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-user-tp1404814p1406343.html


Re: High - Low field value?

2010-09-01 Thread kenf_nc

That's exactly what I want.  I was just searching the wiki using the wrong
terms.
Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/High-Low-field-value-tp1402568p1403164.html


Re: solr

2010-08-31 Thread kenf_nc

We would really need to see more information, but some first things to look
for are:

are your field definitions in the schema.xml set to indexed=true (if you
want to search it) and stored=true (if you want to see it in the return
results)?

is the case of the field names the same in schema.xml and your DIH script? I
believe it is case sensitive. If your field is named 'tags' and in your DIH
script you call it Tags or TAGS, it may be failing to store.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-tp1394412p1394568.html


Re: Problem related to Sorting in Solr1.4

2010-08-27 Thread kenf_nc

the 'text'  fieldType is not suitable for sorting. You need to use the
copyField directive in your schema and at indexing time copy the data to
your TITLE and UPDBY fields, and you need to create 2 new fields:

<field name="TITLE_sort" type="string" indexed="true" stored="true" />
<field name="UPDBY_sort" type="string" indexed="true" stored="true" />

then you Search on TITLE but Sort on TITLE_sort
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-related-to-Sorting-in-Solr1-4-tp1370622p1371739.html


Re: Private data within SOLR Schema

2010-08-27 Thread kenf_nc

my feeling is that private fields in a public document will be the hardest
nut to crack, unless you have an intermediary layer that users call instead
of hitting your solr instance directly. If you front it with a web service
you could handle various authorization scenarios a little easier.

Private documents, the inclusion of a user_id field is an acceptable way to
go IMO.

An individualized schema is actually probably the easiest thing to do. My
schema allows almost any type of document to be stored at the user's
discretion, no schema changes on my part. Something like that, or a slightly
modified version of that, would handle user-defined schemas.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Private-data-within-SOLR-Schema-tp1376174p1376355.html


Re: Schema Definition Question

2010-08-12 Thread kenf_nc

One way I've handled this, and it works only for some types of data,
is to put the searchable part of the sub-doc in a search field
(indexed=true) and put an xml or json representation of the sub-doc in a
stored only field. Then if the main doc is hit via search I can grab the xml
or json, convert it to an object graph and do whatever I want.

If you need to search on a variety of elements in the sub-doc this becomes
less useful an approach. But in some use-cases it worked for me.
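A minimal sketch of the pattern, with a hypothetical review sub-document; the field names are illustrative, not from a real schema:

```python
import json

# Hypothetical sub-document: the searchable text goes in an indexed field,
# the full structure rides along in a stored-only field.
review = {"author": "pat", "stars": 4, "text": "great view"}

doc = {
    "id": "rest-42",
    "content": review["text"],            # indexed=true, stored=false
    "review_json": json.dumps(review),    # indexed=false, stored=true
}

# After a search hit, rebuild the object graph from the stored field:
restored = json.loads(doc["review_json"])
print(restored["stars"])  # 4
```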
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Schema-Definition-Question-tp1049966p1110159.html


Re: Solr Doc Lucene Doc !?

2010-08-12 Thread kenf_nc

Are you just trying to learn the tiny details of how Solr and DIH work? Is
this just an intellectual curiosity? Or are you having some specific problem
that you are trying to solve? If you have a problem, could you describe the
symptoms of the problem? I am using Solr, DIH, and several other related
technologies and have never needed to know the difference between a
SolrDocument and a LuceneDocument or how the UpdateHandler chains. So I'm
curious about what your ultimate goal is with these questions.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Doc-Lucene-Doc-tp995922p1117472.html


Re: Delta-import with solrj client

2010-08-11 Thread kenf_nc

Short answer is no, there isn't a way. Solr doesn't have the concept of
'Update' to an indexed document. You need to add the full document (all
'columns') each time any one field changes. If doing that in your
DataImportHandler logic is difficult you may need to write a separate Update
Service that does:

1) Read UniqueID, UpdatedColumn(s)  from database
2) Using UniqueID Retrieve document from Solr
3) Add/Update field(s) with updated column(s)
4) Add document back to Solr

Although, if you use DIH to do a full import, using the same query in your
Delta-Import to get the whole document shouldn't be that difficult.
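Steps 2-4 amount to merging the changed columns into the stored document and re-adding the whole thing. A minimal Python sketch of that merge (the surrounding fetch/re-add calls depend on your client library and are omitted):

```python
def apply_update(stored_doc, changed_fields):
    # Solr 1.4 has no per-field update: fetch the stored document, merge the
    # changed columns in, and re-add the complete document.
    merged = dict(stored_doc)      # copy so the original stays untouched
    merged.update(changed_fields)  # changed columns win
    return merged

current = {"id": "42", "title": "Widget", "price": 10}
print(apply_update(current, {"price": 12}))
# {'id': '42', 'title': 'Widget', 'price': 12}
```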
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Delta-import-with-solrj-client-tp1085763p1086173.html


Re: Data Import Handler Query

2010-08-11 Thread kenf_nc

It may not be the data config. Are the fields in schema.xml that the image
data is going into set to multiValued="true"?

Although, I would think the last image would be stored, not the first, but
haven't really tested this.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Query-tp1092010p1092917.html


Re: Facet Fields - ID vs. Display Value

2010-08-10 Thread kenf_nc

If your concern is performance, faceting integers versus faceting strings, I
believe Lucene makes the differences negligible. Given that choice I'd go
with string. Now... if you need to keep an association between id and string,
you may want to facet a combined field, "id:string" or some other
delimiter. Then parse it on display. But you can use the id if you need to
hit a database or some other external source. If you don't ever need to
reference the ID, I wouldn't even put it in the index.
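If you do combine them, packing and parsing is trivial; a sketch assuming the hypothetical "id:string" format, with the id guaranteed colon-free:

```python
def combine(id_, label):
    # Pack id and display string into one facet value, e.g. "17:Italian".
    return f"{id_}:{label}"

def split(value):
    # maxsplit=1 keeps any colons inside the label itself intact.
    raw_id, label = value.split(":", 1)
    return int(raw_id), label

print(split(combine(17, "Italian")))  # (17, 'Italian')
```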
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Facet-Fields-ID-vs-Display-Value-tp1062754p1072067.html


Re: delete Problem..

2010-08-10 Thread kenf_nc

I'd try 2 things. 
First do a query
   q=EMAIL_HEADER_FROM:test.de
and make sure some documents are found. If nothing is found, there is
nothing to delete.

Second, how are you testing to see if the document is deleted? The physical
data isn't removed from the index until you Optimize I believe. Is it
possible your delete is working, but your method of verifying isn't telling
you it's marked for deletion?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/delete-Problem-tp1070347p1072581.html


Re: DIH and multivariable fields problems

2010-08-10 Thread kenf_nc

Glad I could help. I also would think it was a very common issue. Personally
my schema is almost all dynamic fields. I have unique_id, content,
last_update_date and maybe one other field specifically defined, the rest 
are all dynamic. This lets me accept an almost endless variety of document
types into the same schema.  So if I planned on using DIH I had to come up
with a way, and stitching together solutions to a couple related issues got
me to my script transform. Mine is more convoluted than the one I gave here,
but obviously you got the gist of the idea.
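
A skeleton of that kind of script transform in data-config.xml might look
like this (entity, query, and field names are purely illustrative, not my
actual config):

```xml
<dataConfig>
  <script><![CDATA[
    function toDynamicFields(row) {
      // copy an arbitrary source column into a *_s dynamic field
      row.put('color_s', row.get('COLOR'));
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" transformer="script:toDynamicFields"
            query="select * from item"/>
  </document>
</dataConfig>
```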




Re: SOLR QUERY

2010-08-06 Thread kenf_nc

In your schema.xml there is a field called 
<defaultSearchField>content</defaultSearchField>
it may be something other than 'content'. This field is the one searched if
you don't specify one in the query. 

You can explicitly put something there with an <add>, or you can have a
copyField directive in your schema to move ap_* fields to the default
search field.
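
In schema.xml terms, the copyField route looks something like this (assuming
'content' is your default search field):

```xml
<defaultSearchField>content</defaultSearchField>
<copyField source="ap_*" dest="content"/>
```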


Re: Best solution to avoiding multiple query requests

2010-08-04 Thread kenf_nc

Not sure the processing would be any faster than just querying again, but in
your original result set the first doc that has a field value matching a
top-10 facet will be the number 1 item if you fq on that facet value. So you
don't need to query it again. You would only need to query those that aren't
in your result set.
ie:
   q=dog&facet=on&facet.field=foo
results 10 docs
   id=1, foo=A
   id=2, foo=A
   id=3, foo=B
   id=4, foo=C
   id=5, foo=B
   id=6, foo=A
   id=7, foo=Z
   id=8, foo=T
   id=9, foo=B
   id=10, foo=J

If your facet results top 10 were (A, B, T, J, D, X, Q, O, P, I)
you already have the number 1 for A (id 1), B (id 3), T (id 8) and J (id 10)
in your very first query. You only need to query D, X, Q, O, P, I. 

If your first query returned 100 instead of 10 you may even have more of the
top 10 represented. Again, the processing steps you would need to do may not
be any faster than re-querying, it depends on the speed of your index and
network etc.

I would think that if your second query was
q=dog&fq=(foo:A OR foo:B OR foo:T ...etc) then you'd have an even greater
chance of having the number 1 result for each of the top 10 in just your
second query.
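
The bookkeeping for "which top facet values already have their number 1 doc
in my first result set" is small; a sketch (the doc/field structure is
assumed):

```python
def facet_values_still_needed(docs, top_facet_values, field="foo"):
    # docs are assumed to be in score order, so the first doc carrying a
    # given facet value is already that value's number 1 result.
    covered = {d[field] for d in docs}
    return [v for v in top_facet_values if v not in covered]
```

Anything the function returns still needs its own fq'd (or OR'd) follow-up
query.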

  


Re: Customize order field list ???

2010-07-30 Thread kenf_nc

I believe they come back alphabetically sorted (not sure if this is language
specific or not), so a quick way might be to change the name from createdate
to zz_createdate or something like that. 

Generally with XML you should not be worried about order however. It's
usually a sign of a design issue somewhere if the order of the fields
matters.


Nabble problems?

2010-07-29 Thread kenf_nc

The Nabble.com page for Solr - User seems to be broken. I haven't seen an
update on it since early this morning. However I'm still getting email
notifications so people are seeing and responding to posts. I'm just
curious, are you just using email and responding to
solr-u...@lucene.apache.org? Or is there a mirror site that *is* working for
the Solr User forum?


Re: Indexing Problem: Where's my data?

2010-07-27 Thread kenf_nc

For STRING_VALUE, I assume there is a property in the 'select *' results
called string_value? If so, I'm not sure why it wouldn't work. If not, then
that's why: it doesn't have anything to put there.

For ATTRIBUTE_NAME, is it possibly a case issue? You called it
'Attribute_Name' in your query, but ATTRIBUTE_NAME in your schema... just
something to check, I guess.

Also, not sure why you are using name= in your fields, for example:

<field column="PARENT_FAMILY" name="Parent Family" />

I thought 'column' was the source field name and 'name' was supposed to be
the schema field name, and if 'name' is missing it would fall back to the
'column' name. You don't have a schema field called 'Parent Family', so it
looks like it's defaulting to the column name too, which is lucky for you I
suppose. But you may want to either remove 'name=' or make it match the
schema. (And I may be completely wrong on this; it's been a while since I
got DIH going.)
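
For what it's worth, a mapping in the shape DIH expects would look something
like this (the lowercase schema field names here are illustrative, not from
your schema):

```xml
<entity name="product" query="select * from product">
  <!-- 'column' = source column, 'name' = schema field -->
  <field column="PARENT_FAMILY" name="parent_family"/>
  <field column="STRING_VALUE"  name="string_value"/>
</entity>
```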




Re: Solr Doc Lucene Doc !?

2010-07-26 Thread kenf_nc

DataImportHandler (DIH) is an add-on to Solr. It lets you import documents
from a number of sources in a flexible way. The only connection DIH has to
Lucene is that Solr uses Lucene as the index engine.

When you work with Solr you naturally talk about Solr Documents, if you were
working with Lucene natively (without Solr) you would talk about Lucene
documents, but they are basically the same thing. 

Are you having a specific issue? Or are you just trying to learn more about
the technology?

If you are mostly trying to understand DIH, then you should think in terms
of Solr and Solr documents. Understand that Lucene is working behind the
scenes, but you really don't need to worry about that all that often.


Re: nested query and number of matched records

2010-07-21 Thread kenf_nc

Parallel calls: simultaneously query q=type:short&rows=10 and
q=type:extensive&rows=1, then merge your results. This would also let you
separate your short docs from your extensive docs into different Solr
instances if you wished; depending on your document architecture this could
speed up one or the other.
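
The shape of the parallel calls, sketched with a stand-in for the actual
Solr request:

```python
from concurrent.futures import ThreadPoolExecutor

def solr_query(params):
    # stand-in: a real version would GET /solr/select with these params
    # and return the parsed docs from the response
    return []

def fetch_short_and_extensive():
    # issue both queries concurrently, then merge the two result lists
    with ThreadPoolExecutor(max_workers=2) as pool:
        short = pool.submit(solr_query, {"q": "type:short", "rows": 10})
        extensive = pool.submit(solr_query, {"q": "type:extensive", "rows": 1})
        return short.result() + extensive.result()
```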


Re: nested query and number of matched records

2010-07-21 Thread kenf_nc

That just gives a count of documents by type. The use-case, I believe, is to
return from a search, 10 documents of type 'short' and 1 document of type
'extensive'. 


Re: how to change the default path of Solr Tomcat

2010-07-21 Thread kenf_nc

Your environment may be different, but this is how I did it. (Apache Tomcat
on Windows 2008)

go to \program files\apache...\Tomcat\conf\catalina\localhost
rename solr.xml to search.xml
recycle Tomcat service



Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-19 Thread kenf_nc

Oh, okay. Got it now. Unfortunately I don't believe Solr supplies a total
count of matching facet values. One way to do this, although performance may
suffer, is to set your limit to -1 and just get back everything, that will
give you the count. You may want to set mincount to 1 so you aren't counting
facet values that aren't in your query, but that really depends on your
need.

...&facet.limit=-1&facet.mincount=1

adding that to any facet query will return all matching facet values.
Depending on how many unique values you have, this could be a lot. But it
will give you what you are looking for. Unless your data changes frequently,
maybe you can call it once and cache the results for some period of time.
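
Reading the distinct count off a JSON response is then one line; with
wt=json, facet_fields comes back as a flat [value, count, value, count, ...]
list per field:

```python
def distinct_facet_count(response, field):
    # facet_fields values arrive flat: [value1, count1, value2, count2, ...]
    flat = response["facet_counts"]["facet_fields"][field]
    return len(flat) // 2

# a tiny hand-made response for illustration
sample = {"facet_counts": {"facet_fields": {"rootId": ["a", 3, "b", 1]}}}
```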


Re: indexing best practices

2010-07-18 Thread kenf_nc

No one has done performance analysis? Or has a link to anywhere where it's
been done?

basically fastest way to get documents into Solr. So many options available,
what's the fastest:
1) file import (xml, csv)  vs  DIH  vs POSTing
2) number of concurrent clients   1   vs 10 vs 100 ...is there a diminishing
returns number?

I have 16 million small (8 to 10 fields, no large text fields) docs that get
updated monthly and 2.5 million largish (20 to 30 fields, a couple html text
fields) that get updated monthly. It currently takes about 20 hours to do a
full import. I would like to cut that down as much as possible.
Thanks,
Ken


Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-16 Thread kenf_nc

It may just be a mis-wording, but if you do distinct on 'unique' IDs, the
count should be the same as response.numFound. But if you didn't mean
'unique', just count of some field in the results, Rebecca is correct,
facets should do the job. Something like:

?q=content:query+text&facet=on&facet.field=rootId


Re: Fwd: send to list

2010-07-16 Thread kenf_nc

If at all possible I like to do any processing work up front and not deal
with extravagant queries. If your grid definitions don't change, or don't
change often, just assign a cell number to each 100 square grid. Then in a
pre-processing step assign the appropriate cell number to your document
along with the specific lat and lon. Then your facet query gets much
simpler.
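
The pre-processing step is cheap; a sketch of assigning a cell number from
lat/lon (the 1-degree cell size is purely for illustration):

```python
import math

CELL_DEG = 1.0  # grid cell size in degrees; an assumption, not from the post

def cell_number(lat, lon, cell_deg=CELL_DEG):
    # shift into non-negative ranges, then flatten (row, col) to a single id
    row = math.floor((lat + 90.0) / cell_deg)
    cols = int(round(360.0 / cell_deg))
    col = math.floor((lon + 180.0) / cell_deg)
    return row * cols + col
```

Store the result in its own indexed field at index time; the facet query
then just groups on that field.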


indexing best practices

2010-07-16 Thread kenf_nc

I was curious if anyone has done work on finding what an optimal (or max)
number of client processes are for indexing. That is, if I have the ability
to spin up N number of processes that construct a POST to add/update a Solr
document, is there a point at which the number of clients posting
simultaneously overloads Solr's ability to keep up with the Add's? I know
this is very hardware dependent, but am looking for ballpark guidelines.
This will be in a Tomcat process running on Windows Server 2008, 2 Solr
instances, one master, one slave standard replication.

Related to this, is there a best practice number of documents to send in a
single POST. (again I know it depends on the complexity of the document,
field types, analyzers/tokenizers etc).

And finally, what do you find to be the best approach to getting data into
Solr. If the technology aspect isn't an issue (except I don't want to use
EmbeddedSolr), you just want to get documents added/updated as quickly as
possible.  POST, xml or csv document upload, DataImportHandler, other?  I'm
just looking for raw speed, not architectural factors.

So, nutshell, all other factors put aside, I'm looking for best approach to
indexing with pure raw speed the only criteria. 

Thanks,
Ken


Re: Tag generation

2010-07-16 Thread kenf_nc

Thanks for all the suggestions! I'm absorbing them as quickly as I can. 


Re: Query help

2010-07-15 Thread kenf_nc

Your example though doesn't show different ContentType, it shows a different
sort order. That would be difficult to achieve in one call. Sounds like your
best bet is asynchronous (multi-threaded) calls if your architecture will
allow for it.


Tag generation

2010-07-15 Thread kenf_nc

A colleague mentioned that he knew of services where you pass some content
and it spits out some suggested Tags or Keywords that would be best suited
to associate with that content.

Does anyone know if there is a contrib to Solr or Lucene that does something
like this? Or a third party tool that can be given a solr index or solr
query and it comes up with some good Tag suggestions?


Re: Strange the when search with dismax

2010-07-14 Thread kenf_nc

Sounds like you want the 'text' fieldType (or equivalent) and are using
'string' or 'lowercase'. Those must match exactly (well, case-insensitively
in the case of 'lowercase'). The TextType field types (like 'text') do
tokenization, so matches will occur under many more conditions.
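
The schema-level difference, roughly (field names are illustrative; the
exact analyzer chain behind 'text' varies by schema):

```xml
<!-- 'string' is matched as one exact token; 'text' is tokenized/analyzed -->
<field name="title_exact" type="string" indexed="true" stored="true"/>
<field name="title"       type="text"   indexed="true" stored="true"/>
```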


Re: MultiValue dynamicField and copyField

2010-07-14 Thread kenf_nc

Yep, my schema does this all day long.


Re: Query: URl too long

2010-07-12 Thread kenf_nc

Frederico, 
You should also pose your question on the SolrNet forum,
http://groups.google.com/group/solrnet?hl=en
Switching from GET to POST isn't a Solr issue, but a SolrNet issue.