Managed schema used with Cloudera MapReduceIndexerTool and morphlines?

2017-03-17 Thread Jay Hill
I've got a very difficult project to tackle. I've been tasked with using
schemaless mode to index JSON files that we receive. The structure of the
json files will always be very different as we're receiving files from
different customers totally unrelated to one another. We are attempting to
build a "one size fits all" approach to receiving documents from a wide
variety of sources and then index them into Solr.

We're running Solr 5.3. The schemaless approach works well enough -
until it doesn't. It seems to fail on type guessing and also gets confused
indexing to different shards. If it was reliable it would be the perfect
solution for our task. But the larger the JSON file the more likely it is
to fail. At a certain size it just doesn't work.

I've been advised by some experts and committers that schemaless is a good
tool for prototyping, but risky to run in production, but we thought we
would try it by doing offline indexing using the Cloudera
MapReduceIndexerTool to build offline indexes - but still using managed
schemas. This map reduce tool uses morphlines, which is a nifty ETL tool
that pipes together a series of commands to transform data. For example a
JSON or CSV file can be processed and loaded into a Solr index with a
"readJSON" command piped to a "loadSolr" command, for a simple example.

But the kite-sdk that manages the morphlines only seems to offer, as its
latest version, Solr *4.10.3*-cdh5.10.0 (their customized version of
4.10.3).

So I can't see any way to integrate schemaless (which depends on features
added after 4.10.3) with the morphlines.

But I thought I would ask here: Anybody had ANY experience with morphlines
to index to Solr? Any info would help me make sense of this.

Cheers to all!


Re: Very long running replication.

2014-02-27 Thread Jay Hill
Bumping this.

I'm seeing the error mentioned earlier in the thread - "Unable to download
<segment filename> completely. Downloaded 0!=<size>" - often in my logs. I'm
dealing with a situation where the maxDoc count is growing at a faster rate
than numDocs and is now almost twice as large. I'm not optimizing but
rather relying on the normal merge process to initiate the purging of
deleted docs. No purging has happened for months now and it snuck up on me.

Slaves are getting the newly indexed docs, but docs marked for delete are
never getting purged.

Index size is 23GB
Indexing about 3K docs an hour
Replication poll time is 60 seconds
Running Solr 3.6 (I know, we should upgrade...working on that)
autocommit every 30 seconds or 5K docs (so usually hitting the 30 second
threshold rather than the doc count)
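(For reference, a commit with expungeDeletes - e.g.
http://master:8983/solr/update?commit=true&expungeDeletes=true, with host and
port as placeholders, and assuming the param is available in 3.6 - should
merge away the segments holding deletes, but I'd still like to understand why
normal merging isn't reclaiming them.)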

Any pointers greatly appreciated!


On Fri, Jan 3, 2014 at 7:14 AM, anand chandak anand.chan...@oracle.com wrote:

 Folks, would really appreciate if somebody can help/throw some light on
 below issue . This issue is blocking our upgrade, we are doing a 3.x to 4.x
 upgrade and indexing around 100g of data.

 Any help would be highly appreciated.

 Thanks,

 Anand



 On 1/3/2014 11:46 AM, anand chandak wrote:

 Thanks Shalin.


 I am facing one issue while replicating: as my replication (very large
 index, 100g) is happening, I am also doing indexing, and I believe the
 segment_N file is changing because of new commits. So would the replication
 fail if the filename is different from what it found when fetching the
 filename list?


 Basically, I am seeing this exception :


 [explicit-fetchindex-cmd] ERROR org.apache.solr.handler.ReplicationHandler -
 SnapPull failed :org.apache.solr.common.SolrException: Unable to
 download _av3.fdt completely. Downloaded 0!=497037
   at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1268)
   at org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1148)
   at org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:743)
   at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:407)
   at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:319)
   at org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:220)


 And I am trying to find the root cause of this issue. Any help ?

 Thanks,

 Anand


 On 1/2/2014 5:32 PM, Shalin Shekhar Mangar wrote:

 Replications won't run concurrently. They are scheduled at a fixed
 rate and if a particular pull takes longer than the time period then
 subsequent executions are delayed until the running one finishes.

 On Tue, Dec 31, 2013 at 4:46 PM, anand chandak anand.chan...@oracle.com
 wrote:

 Quick question about solr replication: what happens if there's a
 replication running for a very large index that takes longer than the
 interval between two replications? Would the automatic runs of replication
 interfere with the currently running one, or would the next iteration of
 replication simply not be spawned? Can somebody throw some light?


Loading custom update request handler on startup

2012-07-09 Thread Jay Hill
I'm writing a custom update request handler that will poll a hot
directory for Solr xml files and index anything it finds there. The custom
class implements Runnable, and when the run method is called the loop
starts to do the polling. How can I tell Solr to load this class on startup
to fire off the run() method?

Thanks,
-Jay


Re: Loading custom update request handler on startup

2012-07-09 Thread Jay Hill
I may have found a good solution. I implemented my own SolrEventListener:

public class DynamicIndexerEventListener
    implements org.apache.solr.core.SolrEventListener {

...

and then called it with a firstSearcher element in solrconfig.xml:

<listener event="firstSearcher"
  class="com.bestbuy.search.foundation.solr.DynamicIndexerEventListener" />

Then in the newSearcher() method I start up the thread for my polling
UpdateRequestHandler.
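Roughly, the listener skeleton looks like this (an untested sketch - the
poller class name is just illustrative, and the exact SolrEventListener
methods to stub out depend on the Solr version):

public class DynamicIndexerEventListener implements org.apache.solr.core.SolrEventListener {

  public void init(org.apache.solr.common.util.NamedList args) { }

  public void postCommit() { }

  // Registered as a firstSearcher listener, so this fires when the first
  // searcher is installed at startup - kick off the polling thread here.
  public void newSearcher(org.apache.solr.search.SolrIndexSearcher newSearcher,
                          org.apache.solr.search.SolrIndexSearcher currentSearcher) {
    Thread poller = new Thread(new HotDirectoryPoller()); // the Runnable described above (name is made up)
    poller.setDaemon(true);
    poller.start();
  }
}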

This seems to work, but if anyone has a better (or more tested) approach
please let us know.


-Jay

On Mon, Jul 9, 2012 at 2:33 PM, Jay Hill jayallenh...@gmail.com wrote:

 I'm writing a custom update request handler that will poll a hot
 directory for Solr xml files and index anything it finds there. The custom
 class implements Runnable, and when the run method is called the loop
 starts to do the polling. How can I tell Solr to load this class on startup
 to fire off the run() method?

 Thanks,
 -Jay



Re: TermsComponent show only terms that matched query?

2012-02-27 Thread Jay Hill
Yes, per-doc. I mentioned TermsComponent but meant TermVectorComponent,
where we get back all the terms in the doc. Just wondering if there was a
way to only get back the terms that matched the query.

Thanks EE,
-Jay


On Sat, Feb 25, 2012 at 2:54 PM, Erick Erickson erickerick...@gmail.com wrote:

 Jay:

 I've seen the this question go 'round before, but don't remember
 a satisfactory solution. Are you talking on a per-document basis
 here? If so, I vaguely remember it being possible to do something
 with highlighting, just counting the tags returned after highlighting.

 Best
 Erick

 On Fri, Feb 24, 2012 at 3:31 PM, Jay Hill jayallenh...@gmail.com wrote:
  I have a situation where I want to show the term counts as is done in the
  TermsComponent, but *only* for terms that are *matched* in a query, so I
  get something returned like this (pseudo code):
 
  q=title:(golf swing)
 
  <doc>
  title: golf legends show how to improve your golf swing on the golf
 course
  ...other fields
  </doc>

  <terms>
  golf (3)
  swing (1)
  </terms>
 
  rather than getting back all of the terms in the doc.
 
  Thanks,
  -Jay



TermsComponent show only terms that matched query?

2012-02-24 Thread Jay Hill
I have a situation where I want to show the term counts as is done in the
TermsComponent, but *only* for terms that are *matched* in a query, so I
get something returned like this (pseudo code):

q=title:(golf swing)

<doc>
title: golf legends show how to improve your golf swing on the golf course
...other fields
</doc>

<terms>
golf (3)
swing (1)
</terms>

rather than getting back all of the terms in the doc.

Thanks,
-Jay


Complex query, need filtering after query not before

2012-01-27 Thread Jay Hill
I have a project where we need to search 1B docs and still have results in
under 700ms. The problem is, we are using geofiltering and that is happening
*before* the queries, so we have to geofilter on the 1B docs to restrict our
set of docs first, and then do the query on a name field. But it seems that
it would be better and faster to run the main query first, and only then
filter out that subset of docs by geo. Here is what a typical query looks
like:

?shards=<list of 20 nodes>
q={!boost
b=sum(recip(geodist(geo_lat_long,38.2493581,-122.0399663),1,1,1))}(given_name:Barack
OR given_name_exact:Barack^4.0) AND family_name:Obama
fq={!geofilt pt=38.2493581,-122.0399663 sfield=geo_lat_long d=120}
fq=(-source:somedatasource)
rows=4
QTime=1040

I've looked at the cache=false param, and the cost= param, but that's
not going to help much because we still have to do the filtering. (We
*will* use
cache=false to avoid the overhead of caching queries that will very
rarely be the same.)

Is there any way to indicate a filter query should happen *after* the other
results? The other fq on source restricts the docset somewhat, but
different variations don't eliminate a high number of docs, so we could use
the cost param to run the fq on source before the fq on geo, but it would
only help very minimally in some cases.
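If I understand the post-filter support added around Solr 3.4 correctly (a
filter with cache=false and cost >= 100 is applied only to documents that
already matched the main query), the geo restriction could be pushed late by
expressing it as an frange over geodist, something like:

fq={!frange l=0 u=120 cache=false cost=200}geodist(geo_lat_long,38.2493581,-122.0399663)

but I haven't verified how that behaves at this scale.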


Thanks,
-Jay


Shard timeouts on large (1B docs) Solr cluster

2012-01-26 Thread Jay Hill
I'm on a project where we have 1B docs sharded across 20 servers. We're not
in production yet and we're doing load tests now. We're sending load to hit
100qps per server. As the load increases we're seeing query times
sporadically increasing to 10 seconds, 20 seconds, etc. at times. What
we're trying to do is set a shard timeout so that responses longer than 2
seconds are discarded. We can live with less results in these cases. We're
not replicating yet as we want to see how the 20 shards perform first (plus
we're waiting on the massive amount of hardware)

I've tried setting the following config in our default req. handler:
<int name="shard-socket-timeout">2000</int>
<int name="shard-connection-timeout">2000</int>

I've just added these, and am testing now, but this doesn't look promising
either:
<int name="timeAllowed">2000</int>
<bool name="partialResults">true</bool>

Couldn't find much on the wiki about these params - I'm looking for more
details about how these work. I'll be happy to update the wiki with more
details based on the discussion here.

Any details about exactly how I can achieve my goal of timing out and
disregarding queries longer that 2 seconds would be greatly appreciated.

The index is insanely lean - no stored fields, no norms, no stop words,
etc. RAM buffer is 128, and we're using the standard search req. handler.
Essentially we're running Solr as a nosql data store, which suits this
project, but we need responses to be no longer than 2 seconds at the max.

Thanks,
-Jay


Re: Shard timeouts on large (1B docs) Solr cluster

2012-01-26 Thread Jay Hill
We're on the trunk:
4.0-2011-10-26_08-46-59 1189079 - hudson - 2011-10-26 08:51:47

Client timeouts are set to 4 seconds.

Thanks,
-Jay

On Thu, Jan 26, 2012 at 1:40 PM, Mark Miller markrmil...@gmail.com wrote:


 On Jan 26, 2012, at 1:28 PM, Jay Hill wrote:

 
  I've tried setting the following config in our default req. handler:
   <int name="shard-socket-timeout">2000</int>
   <int name="shard-connection-timeout">2000</int>
 


 What version are you using Jay? At least on trunk, I took a look and it
 appears at some point these were renamed to socketTimeout and connTimeout.

 What about a timeout on your clients?

 - Mark Miller
 lucidimagination.com




Re: Shard timeouts on large (1B docs) Solr cluster

2012-01-26 Thread Jay Hill
i'm changing the params to socketTimeout and connTimeout and will test this
afternoon. client timeout was actually removed today, which helped a bit.

what about the other params, timeAllowed and partialResults. my
expectation was that these were specifically for distributed search,
meaning if a response wasn't received w/in the timeAllowed, and if
partialResults is true, then that shard would not be waited on for results.
is that correct?

thanks,
-jay


On Thu, Jan 26, 2012 at 2:23 PM, Jay Hill jayallenh...@gmail.com wrote:

 We're on the trunk:
 4.0-2011-10-26_08-46-59 1189079 - hudson - 2011-10-26 08:51:47

 Client timeouts are set to 4 seconds.

 Thanks,
 -Jay


 On Thu, Jan 26, 2012 at 1:40 PM, Mark Miller markrmil...@gmail.com wrote:


 On Jan 26, 2012, at 1:28 PM, Jay Hill wrote:

 
  I've tried setting the following config in our default req. handler:
   <int name="shard-socket-timeout">2000</int>
   <int name="shard-connection-timeout">2000</int>
 


  What version are you using Jay? At least on trunk, I took a look and it
  appears at some point these were renamed to socketTimeout and connTimeout.

 What about a timeout on your clients?

 - Mark Miller
 lucidimagination.com





/no_coord in dismax scoring explain

2012-01-06 Thread Jay Hill
What does /no_coord mean in the dismax scoring output? I've looked
through the wiki, mail archives, lucidfind, and can't find any reference.

-- 
¡jah!


Re: facet search and UnInverted multi-valued field?

2011-05-03 Thread Jay Hill
UnInvertedField is similar to Lucene's FieldCache, except, while the
FieldCache cannot work with multivalued fields, UnInvertedField is designed
for that very purpose. So since your f_dcperson field is multivalued, faceting
on it uses UnInvertedField by default. You're not doing anything wrong; that's
default and normal behavior.

-Jay
http://lucidimagination.com



On Tue, May 3, 2011 at 7:03 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 Dear list,

 we use solr 3.1.0.

 my logs have the following entry:
 May 3, 2011 2:01:39 PM org.apache.solr.request.UnInvertedField uninvert
 INFO: UnInverted multi-valued field

 {field=f_dcperson,memSize=1966237,tindexSize=35730,time=849,phase1=782,nTerms=12,bigTerms=0,termInstances=368008,uses=0}

 The schema.xml has the field:
  <field name="f_dcperson" type="string" indexed="true" stored="true"
  multiValued="true" />

 The query was:
 May 3, 2011 2:01:40 PM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=null path=null
 params={facet=true&fl=score&facet.mincount=1&facet.sort=&start=0&event=firstSearcher&q=text:antigone^200&facet.prefix=&facet.limit=100&facet.field=f_dcperson&facet.field=f_dcsubject&facet.field=f_dcyear&facet.field=f_dccollection&facet.field=f_dctypenorm&facet.field=f_dccontenttype&rows=10}
 hits=1 status=0 QTime=1816

 At first the log entry is an info, but what does it tell me?

 Am I doing something wrong or can something be done better?

 Regards,
 Bernd



Scaling Search with Big Data/Hadoop and Solr now available at Lucene Revolution

2011-04-25 Thread Jay Hill
I've worked with a lot of different Solr implementations, and one area that
is emerging more and more is using Solr in combination with other big data
solutions. My company, Lucid Imagination, has added a two-day course to our
upcoming Lucene Revolution conference, Scaling Search with Big Data and
Solr, that covers Hadoop & Solr, on May 23-24 - it'll be at Lucene
Revolution in San Francisco (the conference is on May 25-26 -- see
lucenerevolution.org).

Description: The class covers Hadoop from the ground up, including
MapReduce, the Hadoop Distributed File System (HDFS), cluster management,
etc., before continuing on to connect it to Solr. Students will study common
use cases for generating search indexes from big data, typical patterns for
the data processing workflow, and how to make it all work reliably at scale.
We will explore in-depth an example of processing 1 billion records to
create a faceted Solr search solution.

This course will be presented on May 23 and 24 at the Lucene Revolution
conference in San Francisco (the conference is on May 25-26 -- see
lucenerevolution.org). Details here:
http://lucenerevolution.org/training#solr-scaling

I've been asked by a lot of Solr users whether Lucid offers anything like
this, so I know there is a lot of interest out there.

-Jay


Re: Multiple Tags and Facets

2011-04-21 Thread Jay Hill
I don't think I understand what you're trying to do. Are you trying to
preserve all facets after a user clicks on a facet, and thereby triggers a
filter query, which excludes the other facets? If that's the case, you can
use local parameters to tag the filter queries so they are not used for the
facets:

Let's say I have the following facets:
- Solr
- Lucene
- Nutch
- Mahout

And I do a search for solr.

All of these links will have a filter query:
- Solr [ ?q=solr&fq=project:solr ]
- Lucene [ ?q=solr&fq=project:lucene ]
- Nutch [ ?q=solr&fq=project:nutch ]
- Mahout [ ?q=solr&fq=project:mahout ]

But if a user clicks on the Solr facet, the resulting query will exclude
the other facets, so you only see this facet:
- Solr

By using local parameters like this:

?q=solr&fq={!tag=myTag}project:solr&facet=on&facet.field={!ex=myTag}project

I can preserve all my facets, so that my query is filtered but all facets
still remain:
- Solr
- Lucene
- Nutch
- Mahout

Hope this helps, but I'm not sure that's what you were after.

-Jay



On Wed, Apr 20, 2011 at 8:03 AM, Em mailformailingli...@yahoo.de wrote:

 Hello,

  I watched an online video with Chris Hostetter from Lucidimagination. He
  showed the possibility of having some facets that exclude *all* filters
  while also having some facets that take care of some of the set filters
  while ignoring other filters.

 Unfortunately the Webinar did not explain how they made this and I wasn't
 able to give a filter/facet more than one tag.

 Here is an example:

 Facets and Filters: DocType, Author

 Facet:
 - Author
 -- George (10)
 -- Brian (12)
 -- Christian (78)
 -- Julia (2)

 -Doctype
 -- PDF (70)
 -- ODT (10)
 -- Word (20)
 -- JPEG (1)
 -- PNG (1)

 When clicking on Julia I would like to achieve the following:
 Facet:
 - Author
 -- George (10)
 -- Brian (12)
 -- Christian (78)
 -- Julia (2)
  Julia's Doctypes:
 -- JPEG (1)
 -- PNG (1)

 -Doctype
 -- PDF (70)
 -- ODT (10)
 -- Word (20)
 -- JPEG (1)
 -- PNG (1)

 Another example which adds special options to your GUI could be as
 following:
 Imagine a fashion store.
 If you search for shirt you get a color-facet:

 colors:
 - red (19)
 - green (12)
 - blue (4)
 - black (2)

 As well as a brand-facet:

 brands:
 - puma (18)
 - nike (19)

 When I click on the red color-facet, I would like to get the following
 back:
 colors:
 - red (19)
 - green (12)*
 - blue (4)*
 - black (2)*

 brands:
 - puma (18)*
 - nike (19)

 All those filters marked by an * could be displayed half-transparent or
 so
 - they just show the user that those filter-options exist for his/her
 search
 but aren't included in the result-set, since he/she excluded them by
 clicking the red filter.

 This case is more interesting, if not all red shirts were from nike.
 This way you can show the user that i.e. 8 of 19 red - shirts are from the
 brand you selected/you see 8 of 19 red shirts.

 I hope I explained what I want to achive.

 Thank you!

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2843130.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Understanding the DisMax tie parameter

2011-04-15 Thread Jay Hill
Looks good, thanks Tom.

-Jay


On Fri, Apr 15, 2011 at 8:55 AM, Burton-West, Tom tburt...@umich.edu wrote:

 Thanks everyone.

 I updated the wiki.  If you have a chance please take a look and check to
 make sure I got it right on the wiki.

 http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29

 Tom



 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: Thursday, April 14, 2011 5:41 PM
 To: solr-user@lucene.apache.org; yo...@lucidimagination.com
 Cc: Burton-West, Tom
 Subject: Re: Understanding the DisMax tie parameter


 : Perhaps the parameter could have had a better name.  It's essentially
 : max(score of matching clauses) + tie * (score of matching clauses that
 : are not the max)
 :
 : So it can be used and thought of as a tiebreak only in the sense that
 : if two docs match a clause (with essentially the same score), then a
 : small tie value will act as a tiebreaker *if* one of those docs also
 : matches some other fields.

 correct.  w/o a tiebreaker value, a dismax query will only look at the
  maximum scoring clause for each doc -- the tie param is named for its
 ability to help break ties when multiple documents have the same score
 from the max scoring clause -- by adding in a small portion of the scores
 (based on the 0-1 ratio of the tie param) from the other clauses.


 -Hoss



Re: Understanding the DisMax tie parameter

2011-04-14 Thread Jay Hill
Dismax works by first selecting the highest scoring sub-query of all the
sub-queries that were run. If I want to search on three fields, manu, name
and features, I can configure dismax like this:

  <requestHandler name="search_dismax" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <float name="tie">0.0</float>
      <str name="qf">manu name features</str>
      <str name="q.alt">*:*</str>
    </lst>
  </requestHandler>

Now I'll use this query:
http://localhost:8983/solr/select/?qt=search_dismax&q=cord

Dismax will search for the term cord on the 3 fields I defined in the qf
parameter like this:
+(features:cord | manu:cord | name:cord)

Of those 3 sub-queries dismax will pick the highest one as the main part
of the score. The tie parameter is used like this:
Final Score = highest scoring sub-query + (*tie* * sum of scores for all
other sub-queries)

So with a tie value of *0*, the max scoring sub-query is added to 0 * other
sub-queries
Final Score = 0.9645969 + (*0* * sum of other sub-queries)

and this results in ONLY the max sub-query being used, hence a disjunction
max.

If I had a value of *1* for the tie parameter I get this:
Final Score = 0.9645969 + (*1* * sum of other sub-queries)

so the sum of all the other sub-queries is multiplied by 1, resulting in a
disjunction sum.

And then, of course, values between 0 and 1 result in the
non-highest-sub-queries being multiplied by a fraction, and factoring into
the scoring that way.
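For example, suppose the max scoring sub-query contributes 0.9 and the two
other sub-queries score 0.3 and 0.2. With tie=0.1 the final score would be
0.9 + (0.1 * (0.3 + 0.2)) = 0.95 - the lesser fields nudge the score without
dominating it.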

-Jay



On Thu, Apr 14, 2011 at 2:04 PM, Burton-West, Tom tburt...@umich.edu wrote:

 Hello,

 I'm having trouble understanding the relationship of the word tie and
 tiebreaker to the explanation of this parameter on the wiki.
 What two (or more things) are in a tie? and how does the number in the
 range from 0 to 1 break the tie?

 http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29

 A value of 0.0 makes the query a pure disjunction max query -- only
 the maximum scoring sub query contributes to the final score. A value of
 1.0 makes the query a pure disjunction sum query where it doesn't matter
 what the maximum scoring sub query is, the final score is the sum of the sub
 scores. Typically a low value (ie: 0.1) is useful.

 Tom Burton-West




Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-13 Thread Jay Hill
As Hoss mentioned earlier in the thread, you can use the statistics page
from the admin console to view the current number of segments. But if you
want to know by looking at the files, each segment will have a unique
prefix, such as _u. There will be one unique prefix for every segment in
the index.

-Jay


On Tue, Apr 12, 2011 at 3:16 PM, Renee Sun renee_...@mcafee.com wrote:

 ok I dug more into this and realize the file extensions can vary depending
 on
 schema, right?
 for instance we dont have *.tvx, *.tvd, *.tvf (not using term vector)...
 and
 I suspect the file extensions
 may change with future lucene releases?

 now it seems we can't just count the file using any formula, we have to
 list
 all files in that directory and count that way... any insight will be
 appreciated.
 thanks
 Renee

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2813561.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: phrase, inidividual term, prefix, fuzzy and stemming search

2011-02-04 Thread Jay Hill
You mentioned that dismax does not support wildcards, but edismax does. Not
sure if dismax would have solved your other problems, or whether you just
had to shift gears because of the wildcard issue, but you might want to have
a look at edismax.

-Jay
http://www.lucidimagination.com


On Mon, Jan 31, 2011 at 2:22 PM, cyang2010 ysxsu...@hotmail.com wrote:


 My current project has the requirement to support search when user inputs
 any
 number of terms across a few index fields (movie title, actor, director).

 In order to maximize result, I plan to support all those searches listed in
 the subject, phrase, individual term, prefix, fuzzy and stemming.  Of
 course, score relevance in the right order is also important.

 I have considered using dismax query.  However, it does not support prefix
 query.  I am not sure if it supports fuzzy query, my guess is does not.

 Therefore, i still need to use standard query.   For example, if someone
 searches deim moer (typo for demi moore), i compare the phrase and terms
 with each searchable fields (title, actor, director):


  title_display:"deim moer"~30    -- OR
  actors:"deim moer"~30
  directors:"deim moer"~30

 title_display: deim-- OR
 actors: deim
 directors: deim

 title_display: deim*   -- OR
 actors: deim*
 directors: deim*

 title_display: deim~0.6   -- OR
 actors: deim~0.6
 directors: deim~0.6

 title_display: moer-- OR
 actors: moer
 directors: moer

 title_display: moer*   -- OR
 actors: moer*
 directors: moer*

 title_display: moer~0.6-- OR
 actors: moer~0.6
 directors: moer~0.6

 The solr relevance score is sum for all those OR.  In that way, i can make
 sure relevance score are in order.  For example, for the exact match (deim
 moer), it will match phrase, term, prefix and fuzzy query all at the same
 time.   Therefore, it will score higher than some input text only matchs
 term, or prefix or fuzzy. At the same time, i can apply boost to a
 particular search field if requirement needs.


 Does it sound right to you?  Is there better ways to achieve the same
 thing?
 My concern is my query is not going to perform, since it tries to do too
 much.  But isn't that what people want to get (maximize result) when they
 just type in a few search words?

 Another question is that:  Can i combine the result of two query together?
 For example, first i query phrase and term match, next I query for prefix
 match.  Can I just append the result for prefix match to that for
 phrase/term match?   I thought two queries have different queryNorm,
 therefore, the score is not comparable to each other so as to combine.  Is
 it correct?


 Thanks.  love to hear what your thought is.


 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/phrase-inidividual-term-prefix-fuzzy-and-stemming-search-tp239p239.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: WordDelimiterFilterFactory

2011-02-04 Thread Jay Hill
You can always try something like this out in the analysis.jsp page,
accessible from the Solr Admin home. Check out that page and see how it
allows you to enter text to represent what was indexed, and text for a
query. You can then see if there are matches. Very handy to see how the
various filters in a field type act on text. Make sure to check verbose
output for both index and query.

For this specific issue, yes, a query for cls500 will match both of those
examples.

To get the exact match to score higher:
- create a text field (or a custom type that uses the
WordDelimiterFilterFactory) (let's name the field foo)
- create a string field  (let's name it foo_string)
- create a copyField with the source being foo and the dest being
foo_string.
- use dismax (or edismax) to search both of those fields

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo foo_string

This should score the string field higher, but you could also add a boost to
it to make sure:

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo foo_string^4.0
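For reference, the schema.xml wiring for that copyField setup would look
roughly like this (type names are just examples - "text" being the type with
the WordDelimiterFilterFactory, "string" for the exact version):

<field name="foo" type="text" indexed="true" stored="true"/>
<field name="foo_string" type="string" indexed="true" stored="false"/>
<copyField source="foo" dest="foo_string"/>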

-Jay
http://lucidimagination.com


On Fri, Feb 4, 2011 at 4:25 PM, John kim hongs...@gmail.com wrote:

 If I use WordDelimiterFilterFactory during indexing and at query time,
 will a search for "cls500" find "cls 500" and "cls500x"?  If so, will
 it find and score exact matches higher?  If not, how do you get exact
 matches to display first?



Re: Tuning Solr

2010-10-05 Thread Jay Hill
Removing those components is not likely to impact performance very much, if
at all. I would focus on other areas when tuning performance, such as
looking memory usage and configuration, query design, etc. But there isn't
any harm in removing them either. Why not do some load tests with the
components included in the configuration, and then some comparison tests
with the components removed from solrconfig.xml?

-Jay
http://www.lucidimagination.com


On Mon, Oct 4, 2010 at 11:36 PM, Floyd Wu floyd...@gmail.com wrote:

 Hi there,

 If I don't need MoreLikeThis, spellcheck, or highlighting,
 can I remove those configuration sections in solrconfig.xml?
 In other words, does Solr load and use these SearchComponents on startup and
 during runtime?

 Will removing this configuration speed up queries?

 Thanks



Creating new Solr cores using relative paths

2010-08-17 Thread Jay Hill
I'm having trouble getting the core CREATE command to work with relative
paths in the solr.xml configuration.

I'm working with a layout like this:
/opt/solr [this is solr.solr.home: $SOLR_HOME]
/opt/solr/solr.xml
/opt/solr/core0/ [this is the template core]
/opt/solr/core0/conf/schema.xml [etc.]

/opt/tomcat/bin [where tomcat is started from: $TOMCAT_HOME/bin]

My very basic solr.xml:
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0/"/>
  </cores>
</solr>

The CREATE core command works fine with absolute paths, but I have a
requirement to use relative paths. I want to be able to create a new core
like this:

http://localhost:8080/solr/admin/cores
?action=CREATE
&name=core1
&instanceDir=core1
&config=core0/conf/solrconfig.xml
&schema=core0/conf/schema.xml
(core1 is the name for the new core to be created, and I want to use the
config and schema from core0 to create the new core).

but the error is always due to the servlet container thinking
$TOMCAT_HOME/bin is the current working directory:
Caused by: java.lang.RuntimeException: Can't find resource
'core0/conf/solrconfig.xml' in classpath or '/opt/solr/core1/conf/',
cwd=/opt/tomcat/bin
Does anyone know how to make this happen?

Thanks,
-Jay


Re: OutOfMemoryErrors

2010-08-17 Thread Jay Hill
A merge factor of 100 is very high and out of the norm. Try starting with a
value of 10. I've never seen a running system with a value anywhere near
this high.

Also, what is your setting for ramBufferSizeMB?
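(For reference, that's the <ramBufferSizeMB> element in the
indexDefaults/mainIndex sections of solrconfig.xml; the stock example config
of that era ships with <ramBufferSizeMB>32</ramBufferSizeMB>, so anything
wildly different is worth calling out.)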

-Jay

On Tue, Aug 17, 2010 at 10:46 AM, rajini maski rajinima...@gmail.com wrote:

 yeah sorry I forgot to mention others...

  <mergeFactor>100</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>10</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>

 above are the values

  Is this because of the values here... initially I had the mergeFactor parameter - 10
  and maxMergeDocs - 1. With the same error I changed them to the above
  values... Yet I got that error after the index was about 2 lakh docs...

 On Tue, Aug 17, 2010 at 11:04 PM, Erick Erickson erickerick...@gmail.com
 wrote:

   There are more merge parameters; what values do you have for these:
 
   <mergeFactor>10</mergeFactor>
   <maxBufferedDocs>1000</maxBufferedDocs>
   <maxMergeDocs>2147483647</maxMergeDocs>
   <maxFieldLength>1</maxFieldLength>
 
  See: http://wiki.apache.org/solr/SolrConfigXml
 
  Hope that formatting comes through the various mail programs OK
 
  Also, what else happens while you're indexing? Do you search
  while indexing? How often do you commit your changes?
 
 
 
  On Tue, Aug 17, 2010 at 1:18 PM, rajini maski rajinima...@gmail.com
  wrote:
 
    <mergeFactor>100</mergeFactor>
    JVM Initial memory pool - 256MB
    Maximum memory pool - 1024MB

    <add>
    <doc>
    <field>long:ID</field>
    <field>str:Body</field>

    12 fields
    </doc>
    </add>
   I have a solr instance in solr folder (D:/Solr) free space in disc is
   24.3GB
   .. How will I get to know what portion of memory is solr using ?
  
  
  
   On Tue, Aug 17, 2010 at 10:11 PM, Erick Erickson 
  erickerick...@gmail.com
   wrote:
  
You shouldn't be getting this error at all unless you're doing
  something
out of the ordinary. So, it'd help if you told us:
   
What parameters you have set for merging
What parameters you have set for the JVM
What kind of documents are you indexing?
   
The memory you have is irrelevant if you only allocate a small
portion of it for the running process...
   
Best
Erick
   
On Tue, Aug 17, 2010 at 7:35 AM, rajini maski rajinima...@gmail.com
 
wrote:
   
 I am getting it while indexing data to solr not while querying...
 Though I have enough memory space upto 40GB and I my indexing data
 is
just
 5-6 GB yet that particular error is seldom observed... (SEVERE
 ERROR
  :
JAVA
 HEAP SPACE , OUT OF MEMORY ERROR )
 I could see one lock file generated in the data/index path just
 after
this
 error.



 On Tue, Aug 17, 2010 at 4:49 PM, Peter Karich peat...@yahoo.de
   wrote:

 
   Is there a way to verify that I have added correctlly?
  
 
  on linux you can do
  ps -elf | grep Boot
  and see if the java command has the parameters added.
 
  @all: why and when do you get those OOMs? while querying? which
   queries
  in detail?
 
  Regards,
  Peter.
 

   
  
 



SolrJ: Setting multiple parameters

2010-06-20 Thread Jay Hill
Working with SolrJ I'm doing a query using the StatsComponent, and the
stats.facet parameter. I'm not able to set multiple fields for the
stats.facet parameter using SolrJ. Here is the query I'm trying to create:

http://localhost:8983/solr/select/?q=*:*&stats=on&stats.field=fieldForStats&stats.facet=fieldA&stats.facet=fieldB&stats.facet=fieldC

This works perfectly, and I'm able to pull the sum value from all three
stats.facet fields, no problem.

Trying in SolrJ I have this:
  SolrQuery solrQuery = new SolrQuery();
  solrQuery.setQuery("*:*");

  solrQuery.setParam("stats", "on");
  solrQuery.setParam("stats.field", "fieldForStats");
  solrQuery.setParam("stats.facet", "fieldA");
  solrQuery.setParam("stats.facet", "fieldB");
  solrQuery.setParam("stats.facet", "fieldC");

But when I try to retrieve the sum values, it seems as if only the LAST
setParam I called on stats.facet is taking. So in this case I can get the
sum for fieldC, but not the other two:

// works
  Map<String, FieldStatsInfo> statsInfoMap = queryResponse.getFieldStatsInfo();
  FieldStatsInfo roomCountElement = statsInfoMap.get("fieldForStats");

  ArrayList fsi = (ArrayList) roomCountElement.getFacets().get("fieldC");
  for (int i = 0; i < fsi.size(); i++) {
    FieldStatsInfo m = (FieldStatsInfo) fsi.get(i);
    System.out.println("-- " + m.getName() + " " + m.getSum());
  }

// doesn't work, get a null pointer as fieldB doesn't seem to have been
// passed to stats.facet
  Map<String, FieldStatsInfo> statsInfoMap = queryResponse.getFieldStatsInfo();
  FieldStatsInfo roomCountElement = statsInfoMap.get("fieldForStats");

  ArrayList fsi = (ArrayList) roomCountElement.getFacets().get("fieldB");
  for (int i = 0; i < fsi.size(); i++) {
    FieldStatsInfo m = (FieldStatsInfo) fsi.get(i);
    System.out.println("-- " + m.getName() + " " + m.getSum());
  }

Is there a way to set multiple values for stats.facet using the setParam
method? I noticed that there is a setGetFieldStatistics method which can
be used to set the stats.field, but there don't seem to be any methods that
reach as deep as setting the stats.facet.
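Digging a bit: setParam() replaces any previous value for a key, which would
explain why only the last stats.facet call sticks. Appending with add(), or
passing all the values in one varargs setParam() call, should produce the
repeated parameter - an untested sketch:

solrQuery.setParam("stats.facet", "fieldA", "fieldB", "fieldC");
// or
solrQuery.add("stats.facet", "fieldA");
solrQuery.add("stats.facet", "fieldB");
solrQuery.add("stats.facet", "fieldC");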

Thanks,
-Jay


Anyone using Solr spatial from trunk?

2010-06-07 Thread Jay Hill
I was wondering about the production readiness of the new-in-trunk spatial
functionality. Is anyone using this in a production environment?

-Jay


Re: Index-time vs. search-time boosting performance

2010-06-04 Thread Jay Hill
I've done a lot of recency boosting to documents, and I'm wondering why you
would want to do that at index time. If you are continuously indexing new
documents, what was recent when it was indexed becomes, over time less
recent. Are you unsatisfied with your current performance with the boost
function? Query-time recency boosting is a fairly common thing to do, and,
if done correctly, shouldn't be a performance concern.
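For reference, the common query-time pattern is a reciprocal function of
document age via the boost function, something like (the date field name is
just an example):

bf=recip(ms(NOW,published_date),3.16e-11,1,1)

where 3.16e-11 is roughly 1/(one year in milliseconds), so a document that is
a year old gets about half the boost of a brand-new one.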

-Jay
http://lucidimagination.com


On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman a...@newscred.com wrote:

 Perhaps I should have been more specific in my initial post.  I'm doing
 date-based boosting on the documents in my index, so as to assign a higher
 score to more recent documents.  Currently I'm using a boost function to
 achieve this.  I'm wondering if there would be a performance improvement if
 instead of using the boost function at search time, I indexed the documents
 with a date-based boost.

 On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Index time boosting is different than search time boosting, so
  asking about performance is irrelevant.
 
  Paraphrasing Hossman from years ago on the Lucene list (from
  memory).
 
  ...index time boosting is a way of saying this documents'
  title is more important than other documents' titles. Search
  time boosting is a way of saying I care about documents
  whose titles contain this term more than other documents
  whose titles may match other parts of this query
 
  HTH
  Erick
 
  On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman a...@newscred.com wrote:
 
   Hi,
  
   What are the performance ramifications for using a function-based boost
  at
   search time (through bf in dismax parser) versus an index-time boost?
   Currently I'm using boost functions on a 15GB index of ~14mm documents.
Our
   queries generally match many thousands of documents.  I'm wondering if
 I
   would see a performance improvement by switching over to index-time
   boosting.
  
   Thanks,
  
   Asif
  
   --
   Asif Rahman
   Lead Engineer - NewsCred
   a...@newscred.com
   http://platform.newscred.com
  
 



 --
 Asif Rahman
 Lead Engineer - NewsCred
 a...@newscred.com
 http://platform.newscred.com



Auto-suggest internal terms

2010-06-02 Thread Jay Hill
I've got a situation where I'm looking to build an auto-suggest where any
term entered will lead to suggestions. For example, if I type wine I want
to see suggestions like this:

french *wine* classes
*wine* book discounts
burgundy *wine*

etc.

I've tried some tricks with shingles, but the only solution that worked was
pre-processing my queries into a core in all variations.
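For anyone curious, the shingle trick amounts to a field type along these
lines (a sketch - names and sizes are just examples):

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>

which indexes word n-grams like "french wine" and "wine classes" as single
terms; the part I'm stuck on is turning those into suggestions around an
arbitrary internal term.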

Anyone know any tricks to accomplish this in Solr without doing any custom
work?

-Jay


Re: field length normalization

2010-03-11 Thread Jay Hill
The fieldNorm is computed like this: fieldNorm = lengthNorm * documentBoost
* documentFieldBoosts

and the lengthNorm is: lengthNorm = 1/sqrt(numTermsInField)
[note that the value is encoded as a single byte, so there is some precision
loss]

So the values are not pre-set for the lengthNorm, but for some counts the
fieldLength value winds up being the same because of the precision los. Here
is a list of lengthNorm values for 1 to 10 term fields:

# of terms    lengthNorm
   1  1.0
   2 .625
   3 .5
   4 .5
   5 .4375
   6 .375
   7 .375
   8 .3125
   9 .3125
  10 .3125

That's why, in your example, the lengthNorm for 3 and 4 is the same.

-Jay
http://www.lucidimagination.com





On Thu, Mar 11, 2010 at 9:50 AM, muneeb muneeba...@hotmail.com wrote:



 :
 : Did you reindex after setting omitNorms to false? I'm not sure whether or
 : not it is needed, but it makes sense.

 Yes i deleted the old index and reindexed it.
 Just to add another fact: the titles' length is less than 10. I am not
 sure if solr has pre-set values for length normalizations, because for
 titles with 3 as well as 4 terms the fieldNorm is coming up as 0.5 (in the
 debugQuery section).


 --
 View this message in context:
 http://old.nabble.com/field-length-normalization-tp27862618p27867025.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Question about fieldNorms

2010-03-07 Thread Jay Hill
Yes, if omitNorms=true, then no lengthNorm calculation will be done, and the
fieldNorm value will be 1.0, and lengths of the field in question will not
be a factor in the score.

To see an example of this you can do a quick test. Add two text fields,
and set omitNorms=true on one of them:

    <field name="foo" type="text" indexed="true" stored="true"/>
    <field name="bar" type="text" indexed="true" stored="true"
omitNorms="true"/>

Index a doc with the same value for both fields:
  <field name="foo">1 2 3 4 5</field>
  <field name="bar">1 2 3 4 5</field>

Set debugQuery=true and do two queries: q=foo:5   q=bar:5

in the explain section of the debug output note that the fieldNorm value
for the foo query is this:

0.4375 = fieldNorm(field=foo, doc=1)

and the value for the bar query is this:

1.0 = fieldNorm(field=bar, doc=1)

A simplified description of how the fieldNorm value is computed: fieldNorm =
lengthNorm * documentBoost * documentFieldBoosts

and the lengthNorm is calculated like this: lengthNorm =
1/sqrt(numTermsInField)
[note that the value is encoded as a single byte, so there is some precision
loss]

When omitNorms=true no norm calculation is done, so fieldNorm will always be
one on those fields.

You can also use the Luke utility to view the document in the index, and it
will show that there is a norm value for the foo field, but not the bar
field.

-Jay
http://www.lucidimagination.com


On Sun, Mar 7, 2010 at 5:55 AM, Siddhant Goel siddhantg...@gmail.com wrote:

 Hi everyone,

 Is the fieldNorm calculation altered by the omitNorms factor? I saw on this
 page (http://old.nabble.com/Question-about-fieldNorm-td17782701.html) the
 formula for calculation of fieldNorms (fieldNorm =
 fieldBoost/sqrt(numTermsForField)).

 Does this mean that for a document containing a string like A B C D E in
 its field, its fieldNorm would be boost/sqrt(5), and for another document
 containing the string A B C in the same field, its fieldNorm would be
 boost/sqrt(3). Is that correct?

 If yes, then is *this* what omitNorms affects?

 Thanks,

 --
 - Siddhant



Re: Free Webinar: Mastering Solr 1.4 with Yonik Seeley

2010-02-26 Thread Jay Hill
Yes, it will be recorded and available to view after the presentation.

-Jay


On Thu, Feb 25, 2010 at 2:19 PM, Bernadette Houghton 
bernadette.hough...@deakin.edu.au wrote:

 Yonik, can you please advise whether this event will be recorded and
 available for later download? (It starts 5am our time ;-)  )

 Regards
 Bern

 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Thursday, 25 February 2010 10:23 AM
 To: solr-user@lucene.apache.org
 Subject: Free Webinar: Mastering Solr 1.4 with Yonik Seeley

 I'd like to invite you to join me for an in-depth review of Solr's
 powerful, versatile new features and functions. The free webinar,
 sponsored by my company, Lucid Imagination, covers an intensive
 how-to for the features you need to make the most of Solr for your
 search application:

* Faceting deep dive, from document fields to performance management
* Best practices for sharding, index partitioning and scaling
* How to construct efficient Range Queries and function queries
* Sneak preview: Solr 1.5 roadmap

 Join us for a free webinar
 Thursday, March 4, 2010
 10:00 AM PST / 1:00 PM EST / 18:00 GMT
 Follow this link to sign up

 http://www.eventsvc.com/lucidimagination/030410?trk=WR-MAR2010-AP

 Thanks,

 -Yonik
 http://www.lucidimagination.com



Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-22 Thread Jay Hill
Looks like multi-threaded support was added to the DIH recently:
http://issues.apache.org/jira/browse/SOLR-1352

-Jay


On Fri, Feb 19, 2010 at 6:27 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Glen may be referring to LuSql indexing with multiple threads?
 Does/can DIH do that, too?


 Otis 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Hadoop ecosystem search :: http://search-hadoop.com/



 - Original Message 
  From: Yonik Seeley yo...@lucidimagination.com
  To: solr-user@lucene.apache.org
  Sent: Fri, February 19, 2010 11:41:07 AM
  Subject: Re: What is largest reasonable setting for ramBufferSizeMB?
 
  On Fri, Feb 19, 2010 at 5:03 AM, Glen Newton wrote:
   You may consider using LuSql[1] to create the indexes, if your source
   content is in a JDBC accessible db. It is quite a bit faster than
   Solr, as it is a tool specifically created and tuned for Lucene
   indexing.
 
  Any idea why it's faster?
  AFAIK, the main purpose of DIH is indexing databases too.  If DIH is
  much slower, we should speed it up!
 
  -Yonik
  http://www.lucidimagination.com




Re: score computation for dismax handler

2010-02-22 Thread Jay Hill
Set the tie parameter to 1.0. This param is set between 0.0 (pure
disjunction maximum) and 1.0 (pure disjunction sum):
http://wiki.apache.org/solr/DisMaxRequestHandler#tie_.28Tie_breaker.29

-Jay

On Thu, Feb 18, 2010 at 4:24 AM, bharath venkatesh 
bharathv6.proj...@gmail.com wrote:

 Hi ,
  When a query is made across multiple fields in the dismax handler using the
  parameter qf, I have observed that with debugQuery enabled the resultant
  score is the max score of the scores of the query across each field. But I
  want the resultant score to be the sum of scores across fields (like the
  standard handler). Can anyone tell me how this can be achieved?



Re: optimize is taking too much time

2010-02-22 Thread Jay Hill
With a mergeFactor set to anything > 1 you would never have only one segment
- unless you optimized. So Lucene will never naturally merge all the
segments into one. Unless, I suppose, the mergeFactor was set to 1, but I've
never tested that. It's hard to picture how that would work.

If I understand correctly, the same actions occur (deleted documents are
removed, etc.) because an optimize is only a multiway merge down to one
segment, whereas normal merging is triggered by the mergeFactor, but does
not have a target segment count to merge down to.

-Jay

On Sun, Feb 21, 2010 at 11:20 AM, David Smiley @MITRE.org dsmi...@mitre.org
 wrote:


 I've always thought that these two events were effectively equivalent.  --
 the results of an optimize vs the results of Lucene _naturally_ merging all
 segments together into one.  If they don't have the same effect then what is
  the difference?

 ~ David Smiley


 Otis Gospodnetic wrote:
 
  Hello,
 
  Solr will never optimize the whole index without somebody explicitly
  asking for it.
  Lucene will merge index segments on the master as documents are indexed.
  How often it does that depends on mergeFactor.
 
  See:
 
 http://search-lucene.com/?q=mergeFactor+segment+merge&fc_project=Lucene&fc_project=Solr&fc_type=mail+_hash_+user
 
 
  Otis 
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Hadoop ecosystem search :: http://search-hadoop.com/
 
 
 
  - Original Message 
  From: mklprasad mklpra...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Fri, February 19, 2010 1:02:11 AM
  Subject: Re: optimize is taking too much time
 
 
 
 
  Jagdish Vasani-2 wrote:
  
   Hi,
  
    You should not optimize the index after each insert of a document; instead
    you should optimize it after inserting some good number of documents,
    because an optimize will merge all segments into one according to the
    settings of the lucene index.
  
   thanks,
   Jagdish
   On Fri, Feb 12, 2010 at 4:01 PM, mklprasad wrote:
  
  
   hi
    In my solr we have 1,42,45,223 records having some 50GB.
    Now when I am loading a new record and it's trying to optimize the docs,
    it's taking too much memory and time.
  
  
   can any body please tell do we have any property in solr to get rid
 of
   this.
  
   Thanks in advance
  
   --
   View this message in context:
  
 
 http://old.nabble.com/optimize-is-taking-too-much-time-tp27561570p27561570.html
   Sent from the Solr - User mailing list archive at Nabble.com.
  
  
  
  
 
  Yes,
  Thanks for reply
  I have removed the optimize() from the code, but I have a doubt:
  1. Will mergeFactor internally do any optimization, or do we have to specify it?

  2. Even if Solr initiates an optimize, if I have large data like 52GB will
  that take a huge time?
 
  Thanks,
  Prasad
 
 
 
  --
  View this message in context:
 
 http://old.nabble.com/optimize-is-taking-too-much-time-tp27561570p27650028.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

 --
 View this message in context:
 http://old.nabble.com/optimize-is-taking-too-much-time-tp27561570p27676881.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: optimize is taking too much time

2010-02-22 Thread Jay Hill
Thanks for clearing that up guys, I misspoke slightly. It's just that, in a
running system, it's probably very rare that there is only a single segment
for any meaningful length of time. Unless that merge-down-to-one occurs
right when indexing stops there will almost always be a new (small) segment
following immediately after the merge. It would be interesting to observe,
over a long time, how often and for how long everything is merged down to a
single segment.

Probably with a very low mergeFactor (2 or 3?) merges-to-one might occur
often enough to make optimizing unnecessary. But I'm guessing that the
merge-to-one happens so infrequently in most situations that optimizing is
more important.

-Jay


On Mon, Feb 22, 2010 at 12:16 PM, Mark Miller markrmil...@gmail.com wrote:

 Also, a mergefactor of 1 is actually invalid - 2 is the lowest you can go.



 --
 - Mark

 http://www.lucidimagination.com






Solr Analysis Webinar Jan 28, 2010

2010-01-20 Thread Jay Hill
My colleague at Lucid Imagination, Tom Hill, will be presenting a free
webinar focused on analysis in Lucene/Solr. If you're interested, please
sign up and join us.

Here is the official notice:

We'd like to invite you to a free webinar our company is offering next
Thursday, 28 January, at 2PM Eastern/11AM Pacific/1900 GMT

Join Lucid Imagination Senior Staff Engineer Tom Hill for a free, in depth
technical workshop to learn how the Lucene/Solr analyzer can grab and index
text and field data, overcome grammatical and semantic variations, and how a
little careful preparation and tuning lets you unleash the full power of
Lucene/Solr Open Source Search.

* Introduction to analysis, including tokens, tokenizers and token
filters
* Tuning tokenization to improve index flexibility and content retrieval
precision
* Avoid common pitfalls by using special troubleshooting tools and
techniques

Thursday, January 28, 2010
11:00 AM PST / 2:00 PM EST / 1900 GMT
Register here:
http://www.eventsvc.com/lucidimagination/012810?trk=WR-JAN2010-AP


Re: solr blocking on commit

2010-01-19 Thread Jay Hill
A couple of follow-up questions:

- What type of garbage collector is in use?
- How often are you optimizing the index?
- In solrconfig.xml, what is the setting for <ramBufferSizeMB> under <mainIndex>?
- Right before and after you see this pause, check the output of
http://host:port/solr/admin/system, specifically the jvm/memory section, and
send this to the list.

If possible definitely watch memory usage with something like JConsole, or
start the JVM with some of these params:
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps

-Jay

On Tue, Jan 19, 2010 at 5:16 PM, Steve Conover scono...@gmail.com wrote:

 I'll play with the GC settings and watch memory usage (I've done a
 little bit of this already), but I have a sense that this isn't the
 problem.

 I should also note that in order to create the really long pauses I
 need to post xml files full of documents that haven't been added in a
 long time / ever.  Once a set of documents is posted to /update, if I
 re-post it solr behaves pretty well - and that's true even if I
 restart solr.

 On Tue, Jan 19, 2010 at 3:05 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  On Tue, Jan 19, 2010 at 5:57 PM, Steve Conover scono...@gmail.com
 wrote:
  I'm using latest solr 1.4 with java 1.6 on linux.  I have a 3M
  document index that's 10+GB.  We currently give solr 12GB of ram to
  play in and our machine has 32GB total.
 
  We're seeing a problem where solr blocks during commit - it won't
  server /select requests - in some cases for more than 15-30 seconds.
  We'd like to somehow configure things such that there's no
  interruption in /select service.
 
  A commit shouldn't cause searches to block.
   Could this perhaps be a stop-the-world GC pause that coincides with the
 commit?
 
  -Yonik
  http://www.lucidimagination.com
 



Re: Solr 1.4 - stats page slow

2010-01-08 Thread Jay Hill
It's definitely still an issue. I've seen this with at least four different
Solr implementations. It clearly seems to be a problem when there is a large
field cache. It would be bad enough if the stats.jsp was just slow to load
(usually takes 1 to 2 minutes), but when monitoring memory usage with
jconsole there is a clear and serious spike as soon as the url for stats.jsp
is hit, on occasions causing OutOfMemory Exceptions.

-Jay


On Fri, Jan 8, 2010 at 9:46 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 I thought this was fixed...
 http://issues.apache.org/jira/browse/SOLR-1292

 http://www.lucidimagination.com/search/document/57103830f0655776/stats_page_slow_in_latest_nightly


 -Yonik
 http://www.lucidimagination.com



Re: Solr 1.4 - stats page slow

2010-01-08 Thread Jay Hill
Actually my cases were all with customers I work with, not just one case. A
common practice is to monitor cache stats to tune the caches properly. Also,
noting the warmup times for new IndexSearchers, etc. I've worked with people
that have excessive auto-warm count values which is causing extremely long
warmup times for the new Searchers. So the stats.jsp page has always been a
handy, simple tool to monitor this stuff and set caches appropriately. But
at some point (around the release of 1.4) I started to notice this problem.
Since it causes the memory spike it pretty much prevents the use of
stats.jsp in production. I've had to resort to log-parsing and other tricks
which is a bit of a waste since it was so simple to do before this surfaced.

-Jay


On Fri, Jan 8, 2010 at 10:41 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 :  2009-05-28) to the Solr 1.4.0 released code.  Every 3 hours we have a
 :  cron task to log some of the data from the stats.jsp page from each
 :  core (about 100 cores, most of which are small indexes).

  1) what stats are you actually interested in? ... in Jay's case the
  LukeRequestHandler made more sense to get the data he wanted anyway.

 2) what does the output of stats.jsp say when you see these load spikes?
 ... it should be fairly lightweight unless it detects some insanity in
 the way the FieldCaches are being used, in which case it does memory
 estimation to make it clear how significant the problem is.


 -Hoss



Re: Indexing the latests MS Office documents

2010-01-05 Thread Jay Hill
The version of Tika in the 1.4 release definitely parses the most current
Office formats (.docx, .pptx, etc.) and they index as expected.

-Jay


On Mon, Jan 4, 2010 at 6:02 PM, Peter Wolanin peter.wola...@acquia.com wrote:

  You must have been searching old documentation - I think Tika 0.3+ has
 support for the new MS formats.  but don't take my word for it - why
 don't you build tika and try it?

 -Peter

 On Sun, Jan 3, 2010 at 7:00 PM, Roland Villemoes r...@alpha-solutions.dk
 wrote:
  Hi All,
 
  Anyone who knows how to index the latest MS office documents like .docx
 and .xlsx  ?
 
  From searching it seems like Tika only supports the earlier formats .doc
 and .xls
 
 
 
  med venlig hilsen/best regards
 
  Roland Villemoes
  Tel: (+45) 22 69 59 62
  E-Mail: mailto:r...@alpha-solutions.dk
 
 



 --
 Peter M. Wolanin, Ph.D.
 Momentum Specialist,  Acquia. Inc.
 peter.wola...@acquia.com



Re: Solr 1.4 - stats page slow

2009-12-24 Thread Jay Hill
I've noticed this as well, usually when working with a large field cache. I
haven't done in-depth analysis of this yet, but it seems like when the stats
page is trying to pull data from a large field cache it takes quite a long
time.

Are you doing a lot of sorting? If so, what are the field types of the
fields you're sorting on? How large is the index both in document count and
file size?

Another approach to get data from the Solr instance would be to use JMX. And
I've been working on a request handler (started by Erik Hatcher) that will
provide the same information as the stats page, but a little more
efficiently. I may try to put up a patch with this soon.

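For anyone who wants to try the JMX route in the meantime, a rough sketch (the
port and flags below are only an example of exposing the platform MBean server
with the Jetty start.jar; adjust for your own container):

  <!-- solrconfig.xml: publish Solr's MBeans (caches, searchers, handlers) -->
  <jmx/>

  # start the JVM with remote JMX enabled so JConsole or a collector can attach
  java -Dcom.sun.management.jmxremote \
       -Dcom.sun.management.jmxremote.port=9999 \
       -Dcom.sun.management.jmxremote.authenticate=false \
       -Dcom.sun.management.jmxremote.ssl=false \
       -jar start.jar

The cache hit ratios, warmup times, etc. then show up as MBean attributes and
can be polled without touching stats.jsp.
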
-Jay


On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss swe...@stylesight.comwrote:

 We've been using Solr 1.4 for a few days now and one slight downside we've
 noticed is the stats page comes up very slowly for some reason - sometimes
 more than 10 seconds.  We call this programmatically to retrieve the last
 commit date so that we can keep users from committing too frequently.  This
 means some of our administration pages are now taking a long time to load.
  Is there anything we should be doing to ensure that this page comes up
 quickly?  I see some notes on this back in October but it looks like that
 update should already be applied by now.  Or, better yet, is there now a
 better way to just retrieve the last commit date from Solr without pulling
 all of the statistics?

 Thanks in advance.

 --
 Steve



Re: Solr 1.4 - stats page slow

2009-12-24 Thread Jay Hill
Also, what is your heap size and the amount of RAM on the machine?

I've also noticed that, when watching memory usage through JConsole or
YourKit while loading the stats page, the memory usage spikes dramatically -
are you seeing this as well?

-Jay

On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill jayallenh...@gmail.com wrote:

 I've noticed this as well, usually when working with a large field cache. I
 haven't done in-depth analysis of this yet, but it seems like when the stats
 page is trying to pull data from a large field cache it takes quite a long
 time.

 Are you doing a lot of sorting? If so, what are the field types of the
 fields you're sorting on? How large is the index both in document count and
 file size?

 Another approach to get data from the Solr instance would be to use JMX.
 And I've been working on a request handler (started by Erik Hatcher) that
 will provide the same information as the stats page, but a little more
 efficiently. I may try to put up a patch with this soon.

 -Jay



 On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss swe...@stylesight.comwrote:

 We've been using Solr 1.4 for a few days now and one slight downside we've
 noticed is the stats page comes up very slowly for some reason - sometimes
 more than 10 seconds.  We call this programmatically to retrieve the last
 commit date so that we can keep users from committing too frequently.  This
 means some of our administration pages are now taking a long time to load.
  Is there anything we should be doing to ensure that this page comes up
 quickly?  I see some notes on this back in October but it looks like that
 update should already be applied by now.  Or, better yet, is there now a
 better way to just retrieve the last commit date from Solr without pulling
 all of the statistics?

 Thanks in advance.

 --
 Steve





Sort fields all look Strings in field cache, no matter schema type

2009-12-19 Thread Jay Hill
I'm on a project where I'm trying to determine the size of the field cache.
We're seeing lots of memory problems, and I suspect that the field cache is
extremely large, but I'm trying to get exact counts on what's in the field
cache.

One thing that struck me as odd in the output of the stats.jsp page is that
the field cache always shows a String type for a field, even if it is not a
String. For example, the output below is for a field cscore that is a
double:

entry#0 : 
'org.apache.lucene.index.readonlydirectoryrea...@6239da8a'='cscore',class

org.apache.lucene.search.FieldCache$StringIndex,null=org.apache.lucene.search.FieldCache$StringIndex#297347471


The index has 4,292,426 documents, so I would expect the field cache size
for this field to be:
cscore: double (8 bytes) x 4,292,426 docs = 34,339,408 bytes

But can someone explain why a double is using FieldCache$StringIndex please?
No matter what the type of the field is in the schema the field cache stats
always show FieldCache$StringIndex.

Thanks,
-Jay


Re: Sort fields all look Strings in field cache, no matter schema type

2009-12-19 Thread Jay Hill
This field is of class type solr.SortableDoubleField.

I'm actually migrating a project from Solr 1.1 to 1.4, and am in the process
of trying to update the schema and solrconfig in stages. Updating the field
to TrieDoubleField w/ precisionStep=0 definitely helped.

Thanks Yonik!
-Jay



On Sat, Dec 19, 2009 at 11:37 AM, Yonik Seeley
yo...@lucidimagination.comwrote:

 On Sat, Dec 19, 2009 at 2:25 PM, Jay Hill jayallenh...@gmail.com wrote:
  One thing that struck me as odd in the output of the stats.jsp page is
 that
  the field cache always shows a String type for a field, even if it is not
 a
  String. For example, the output below is for a field cscore that is a
  double:

 What's the class type of the double?  Older style SortableDouble had
 to use the string index.  Newer style trie-double based should use a
 double[].

 It also matters what the FieldCache entry is being used for... certain
 things like faceting on single valued fields still use the
 StringIndex.  I believe the stats component does too.  Sorting and
 function queries should work as expected.

 -Yonik



Re: Sort fields all look Strings in field cache, no matter schema type

2009-12-19 Thread Jay Hill
Oh, forgot to add (just to keep the thread complete), the field is being
used for a sort, so it was able to use TrieDoubleField.

Thanks again,
-Jay


On Sat, Dec 19, 2009 at 12:21 PM, Jay Hill jayallenh...@gmail.com wrote:

 This field is of class type solr.SortableDoubleField.

 I'm actually migrating a project from Solr 1.1 to 1.4, and am in the
 process of trying to update the schema and solrconfig in stages. Updating
 the field to TrieDoubleField w/ precisionStep=0 definitely helped.

 Thanks Yonik!
 -Jay




 On Sat, Dec 19, 2009 at 11:37 AM, Yonik Seeley yo...@lucidimagination.com
  wrote:

 On Sat, Dec 19, 2009 at 2:25 PM, Jay Hill jayallenh...@gmail.com wrote:
  One thing that struck me as odd in the output of the stats.jsp page is
 that
  the field cache always shows a String type for a field, even if it is
 not a
  String. For example, the output below is for a field cscore that is a
  double:

 What's the class type of the double?  Older style SortableDouble had
 to use the string index.  Newer style trie-double based should use a
 double[].

 It also matters what the FieldCache entry is being used for... certain
 things like faceting on single valued fields still use the
 StringIndex.  I believe the stats component does too.  Sorting and
 function queries should work as expected.

 -Yonik





Re: nested queries

2009-11-19 Thread Jay Hill
I don't think your queries are actually nested queries. Nested queries key
off of the magic field name _query_. You're right however that there is
very little in the way of documentation or examples of nested queries. If
you haven't seen this blog about them yet you might find this a helpful
overview of nested queries:
http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

-Jay


On Thu, Nov 19, 2009 at 6:15 AM, Andrea Campi
andrea.ca...@zephirworks.comwrote:

 Grant,

 Grant Ingersoll wrote:

 On Nov 19, 2009, at 7:02 AM, Andrea Campi wrote:



 To make things easier and more maintainable, I'd like to use nested
 queries for that; I'd like to be able to write:

 q={!boost b=$dateboost v=ftext:$terms^1000 OR
 text:$terms}dateboost=product(...etc.)terms=something

 Or even better:

 q={!boost b=$dateboost v=$qq}qq={!query v=ftext:$terms^1000 OR
 text:$terms}dateboost=product(...etc.)terms=something




 Sounds like you might benefit from using the Dismax Parser.  You can
 specify the field boosting thing in your config and also add the bf (boost
 function) capability.


 I tried that but the customer prefers the lucene syntax for the actual
 query.

 However, now that you mention this, I should probably be able to use Dismax
 but specify the lucene syntax for the actual search on the 'text' field,
 right?
 I will try that, thanks.

 Bye,
Andrea



Re: Wildcards at the Beginning of a Search.

2009-11-16 Thread Jay Hill
There is a text_rev field type in the example schema.xml file in the
official release of 1.4. It uses the ReversedWildcardFilterFactory to reverse
a field. You can do a copyField from the field you want to use for leading
wildcard searches to a field of type text_rev, and then do a regular
trailing wildcard search on the reversed field.

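A rough sketch of that wiring, assuming the text_rev type from the 1.4 example
schema (the field names here are made up):

  <field name="title"     type="text"     indexed="true" stored="true"/>
  <field name="title_rev" type="text_rev" indexed="true" stored="false"/>
  <copyField source="title" dest="title_rev"/>

Because text_rev indexes the reversed tokens, a leading-wildcard search such as
*phone can be run against title_rev, where it behaves like an ordinary trailing
wildcard.
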
-Jay
http://www.lucidimagination.com


On Thu, Nov 12, 2009 at 4:41 AM, Jörg Agatz joerg.ag...@googlemail.comwrote:

 Is there maybe a way in Solr 1.4 to search with a wildcard at the beginning?

 In 1.3 I can't activate it.

 KingArtus



Replication admin page auto-reload

2009-11-16 Thread Jay Hill
The replication admin page on slaves used to have an auto-reload set to
reload every few seconds. In the official 1.4 release this doesn't seem to
be working, but it does in a nightly build from early June. Was this changed
on purpose or is this a bug? I looked through CHANGES.txt to see if anything
was mentioned related to this but didn't see anything. If it's a bug I'll
open an issue in JIRA

-Jay


Re: Sending file to Solr via HTTP POST

2009-11-05 Thread Jay Hill
Here is a brief example of how to use SolrJ with the
ExtractingRequestHandler:

  ContentStreamUpdateRequest req = new
      ContentStreamUpdateRequest("/update/extract");
  req.addFile(fileToIndex);
  req.setParam("literal.id", getId(fileToIndex));
  req.setParam("literal.hostname", getHostname());
  req.setParam("literal.filename", fileToIndex.getName());

  try {
    getSolrServer().request(req);
  } catch (SolrServerException e) {
    e.printStackTrace();
  }

You'll need a request handler configured in solrconfig.xml:

  <!-- Solr Cell Wiki: http://wiki.apache.org/solr/ExtractingRequestHandler -->
  <requestHandler name="/update/extract"
      class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
      startup="lazy">
    <lst name="defaults">
      <!-- All the main content goes into this field... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="map.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>

Note that the example also shows how to use the literal.* parameter to add
metadata fields of your choice to the document.

Hope that helps get you started.

-Jay
http://www.lucidimagination.com


On Tue, Nov 3, 2009 at 10:38 PM, Caroline Tan caroline@gmail.comwrote:

 Hi,
 From the Solr wiki on ExtractingRequestHandler tutorial, when it comes to
 the part to post file to Solr, it always uses the curl command, e.g.
 curl '
 http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true'
 -F myfi...@tutorial.html

 I have never used curl and i was thinking is  there any replacement to such
 method?

 Is there any API that i can use to achieve the same thing in a java
 project without relying on CURL? Does SolrJ have such method? Thanks

 ~caroLine



Re: specify multiple files in lst for DataImportHandler

2009-11-05 Thread Jay Hill
You can set up multiple request handlers each with their own configuration
file. For example, in addition to the config you listed you could add
something like this:

<requestHandler name="/dataimport-two"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-two-config.xml</str>
  </lst>
</requestHandler>

and so on with as many handlers as you need.

-Jay
http://www.lucidimagination.com


On Thu, Nov 5, 2009 at 8:57 AM, javaxmlsoapdev vika...@yahoo.com wrote:


 <requestHandler name="/dataimport"
   class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
   <str name="config">data-config.xml</str>
  </lst>
 </requestHandler>

 Is there a way to list more than one file in the lst configuration above?
 I understand I can have multiple entity elements in the config, but I need to
 keep two data-config files separate and still use the same DIH to create one
 index.
 --
 View this message in context:
 http://old.nabble.com/specify-multiple-files-in-%3Clst%3E-for-DataImportHandler-tp26215805p26215805.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: CPU utilization and query time high on Solr slave when snapshot install

2009-11-02 Thread Jay Hill
So assuming you set up a few sample sort queries to run in the firstSearcher
config, and had very low query volume during that ten minutes so that there
were no evictions before a new Searcher was loaded, would those queries run
by the firstSearcher be passed along to the cache for the next Searcher as
part of the autowarm? If so, it seems like you might want to load a few sort
queries for the firstSearcher, but might not need any included in the
newSearcher?

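For reference, this is the kind of setup I mean (a sketch only - the queries,
cache class and sizes are placeholders):

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
      <lst><str name="q">*:*</str><str name="sort">date desc</str></lst>
    </arr>
  </listener>

  <!-- and no autowarming on the caches, per the advice below -->
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
      autowarmCount="0"/>
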
-Jay


On Mon, Nov 2, 2009 at 4:26 PM, Mark Miller markrmil...@gmail.com wrote:

 Hmm...I think you have to setup warming queries yourself and that autowarm
 just copies entries from the old cache to the new cache, rather than issuing
 queries - the value is how many entries it will copy. Though that's still
 going to take CPU and time.

 - Mark

 http://www.lucidimagination.com (mobile)


 On Nov 2, 2009, at 12:47 PM, Walter Underwood wun...@wunderwood.org
 wrote:

  If you are going to pull a new index every 10 minutes, try turning off
 cache autowarming.

 Your caches are never more than 10 minutes old, so spending a minute
 warming each new cache is a waste of CPU. Autowarm submits queries to the
 new Searcher before putting it in service. This will create a burst of query
 load on the new Searcher, often keeping one CPU pretty busy for several
 seconds.

 In solrconfig.xml, set autowarmCount to 0.

 Also, if you want the slaves to always have an optimized index, create the
 snapshot only in post-optimize. If you create snapshots in both post-commit
 and post-optimize, you are creating a non-optimized index (post-commit),
 then replacing it with an optimized one a few minutes later. A slave might
 get a non-optimized index one time, then an optimized one the next.

 wunder

 On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote:

  Hi Solr Gurus,

 We have solr in 1 master, 2 slave configuration. Snapshot is created post
 commit, post optimization. We have autocommit after 50 documents or 5
 minutes. Snapshot puller runs as a cron every 10 minutes. What we have
 observed is that whenever snapshot is installed on the slave, we see solrj
 client used to query slave solr, gets timedout and there is high CPU
 usage/load avg. on slave server. If we stop snapshot puller, then slaves
 work with no issues. The system has been running since 2 months and this
 issue has started to occur only now  when load on website is increasing.

 Following are some details:

 Solr Details:
 apache-solr Version: 1.3.0
 Lucene - 2.4-dev

 Master/Slave configurations:

 Master:
 - for indexing data HTTPRequests are made on Solr server.
 - autocommit feature is enabled for 50 docs and 5 minutes
 - caching params are disable for this server
 - mergeFactor of 10 is set
 - we were running optimize script after every 2 hours, but now have
 reduced the duration to twice a day but issue still persists

 Slave1/Slave2:
 - standard requestHandler is being used
 - default values of caching are set
 Machine Specifications:

 Master:
 - 4GB RAM
 - 1GB JVM Heap memory is allocated to Solr

 Slave1/Slave2:
 - 4GB RAM
 - 2GB JVM Heap memory is allocated to Solr

 Master and Slave1 (solr1)are on single box and Slave2(solr2) on different
 box. We use HAProxy to load balance query requests between 2 slaves. Master
 is only used for indexing.
 Please let us know if somebody has ever faced similar kind of issue or
 has some insight into it as we guys are literally struck at the moment with
 a very unstable production environment.

 As a workaround, we have started running optimize on master every 7
 minutes. This seems to have reduced the severity of the problem but still
 issue occurs every 2days now. please suggest what could be the root cause of
 this.

 Thanks,
 Bipul








Re: solr web ui

2009-10-30 Thread Jay Hill
Have a look at the VelocityResponseWriter (
http://wiki.apache.org/solr/VelocityResponseWriter). It's in the contrib
area, but the wiki has instructions on how to move it into your core Solr.
Solr uses response writers to return results. The default is XML but
responses can be returned in JSON, Ruby and other formats. The
VelocityResponseWriter enables responses returned using Velocity templates.
It sounds like exactly what you need.

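As a sketch of what the setup looks like once the contrib jar is on your
classpath (the class name below is from the contrib; double-check it against
the README for your version, and the template name is just an example):

  <!-- solrconfig.xml -->
  <queryResponseWriter name="velocity"
      class="org.apache.solr.request.VelocityResponseWriter"/>

A request like

  http://localhost:8983/solr/select?q=*:*&wt=velocity&v.template=search

then renders the results through the Velocity template "search.vm" instead of
returning raw XML.
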
-Jay
http://www.lucidimagination.com


On Thu, Oct 29, 2009 at 6:17 PM, scabbage guans...@gmail.com wrote:


 Hi,

 I'm a new solr user. I would like to know if there are any easy to setup
 web
 UIs for solr. It can be as simple as a search box, term highlighting and
 basic faceting. Basically I'm using solr to store all our automation
 testing
 logs and would like to have a simple searchable UI. I don't wanna spent too
 much time writing my own.

 Thanks.
 --
 View this message in context:
 http://www.nabble.com/solr-web-ui-tp26123604p26123604.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Facets - ORing attribute values

2009-10-29 Thread Jay Hill
1.4 has a good chance of being released next week. There was a hope that it
might make it this week, but another bug in Lucene 2.9.1 was found, pushing
things back just a little bit longer.

-Jay
http://www.lucidimagination.com


On Thu, Oct 29, 2009 at 11:43 AM, beaviebugeater mbr...@jdnholdings.comwrote:


 Do you have any (educated) guess on when 1.4 will be officially released?
 Weeks? Months? Years?




 Yonik Seeley-2 wrote:
 
  Perhaps something like this that's actually running Solr w/
 multi-selecti?
  http://search.lucidimagination.com/
 
 
 http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters
 
  You just need a recent version of Solr 1.4
 
  -Yonik
  http://www.lucidimagination.com
 
 
 
  On Thu, Oct 29, 2009 at 1:51 PM, beaviebugeater mbr...@jdnholdings.com
  wrote:
 
  I have implemented faceting with Solr for an ecommerce project.
 
  However, I'd like to change the default behavior somewhat.  Visualize
  with
  me the left nav that contains:
 
  Attribute A
  value1 (count)
  value2 (count)
  value3 (count)
 
  Attribute B
  value4 (count)
  value5 (count)
 
  The user interface has a checkbox for each attribute value.  As a
  checkbox
  is checked, the list of products is refined to include those with the
  selected attribute(s).
 
  The default behavior is to AND all selected attributes.
 
  What I would like is if I check value1, none of the counts for Attribute
  A
  change, just the product result set.  If I then check value3 the effect
  is
  that I'm saying products with values for Attribute A of value1 OR value3
  (not AND).  Counts for Attribute B do change as usual.
 
  If I then check value4, the effect is to return products with values for
  Attribute A of (value1 OR value3) AND values for Attribute B of value5.
 
  You can see this sort of thing in action here:
 
 
 http://www.beanbags.com/bean-bag-chairs/large/1618+1620+4225.cfm#N=1618+1620+4225+4229+4231Ns=Preferredview=36display=grid_view
 
  Is this doable with Solr out of the box or do I need to build some logic
  around Solr's faceting functionality?
 
  Thanks.
  Matt
 
  --
  View this message in context:
 
 http://www.nabble.com/Facets---ORing-attribute-values-tp26117763p26117763.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/Facets---ORing-attribute-values-tp26117763p26118562.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: DIH: Setting rows= on full-import has no effect

2009-10-09 Thread Jay Hill
As always, you guys rock!

Thanks,
-Jay
http://www.lucidimagination.com


On Fri, Oct 9, 2009 at 2:57 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 FYI - This is fixed in trunk.

 2009/10/9 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

  I have raised an issue http://issues.apache.org/jira/browse/SOLR-1501
 
  On Fri, Oct 9, 2009 at 6:10 AM, Jay Hill jayallenh...@gmail.com wrote:
   In the past setting rows=n with the full-import command has stopped the
  DIH
   importing at the number I passed in, but now this doesn't seem to be
   working. Here is the command I'm using:
   curl '
  
 
 http://localhost:8983/solr/indexer/mediawiki?command=full-importrows=100'
  
   But when 100 docs are imported the process keeps running. Here's the
 log
   output:
  
   Oct 8, 2009 5:23:32 PM org.apache.solr.handler.dataimport.DocBuilder
   buildDocument
   INFO: Indexing stopped at docCount = 100
   Oct 8, 2009 5:23:33 PM org.apache.solr.handler.dataimport.DocBuilder
   buildDocument
   INFO: Indexing stopped at docCount = 200
   Oct 8, 2009 5:23:35 PM org.apache.solr.handler.dataimport.DocBuilder
   buildDocument
   INFO: Indexing stopped at docCount = 300
   Oct 8, 2009 5:23:36 PM org.apache.solr.handler.dataimport.DocBuilder
   buildDocument
   INFO: Indexing stopped at docCount = 400
   Oct 8, 2009 5:23:38 PM org.apache.solr.handler.dataimport.DocBuilder
   buildDocument
   INFO: Indexing stopped at docCount = 500
  
   and so on.
  
   Running on the most recent nightly: 1.4-dev 823366M - jayhill -
  2009-10-08
   17:31:22
  
   I've used that exact url in the past and the indexing stopped at the
 rows
   number as expected, but I haven't run the command for about two months
 on
  a
   build from back in early July.
  
   Here's the dih config:
  
dataConfig
  dataSource
 name=dsFiles
 type=FileDataSource
 encoding=UTF-8/
  document
entity
   name=f
   processor=FileListEntityProcessor
   baseDir=/path/to/files
   fileName=.*xml
   recursive=true
   rootEntity=false
   dataSource=null
  
  entity
 name=wikixml
 processor=XPathEntityProcessor
 forEach=/mediawiki/page
 url=${f.fileAbsolutePath}
 dataSource=dsFiles
 onError=skip
 
field column=id xpath=/mediawiki/page/id/
field column=title xpath=/mediawiki/page/title/
field column=contributor
   xpath=/mediawiki/page/revision/contributor/username/
field column=comment xpath=/mediawiki/page/revision/comment/
field column=text xpath=/mediawiki/page/revision/text/
  
  /entity
/entity
  /document
   /dataConfig
  
  
   -Jay
  
 
 
 
  --
  -
  Noble Paul | Principal Engineer| AOL | http://aol.com
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: concatenating tokens

2009-10-09 Thread Jay Hill
Use copyField to copy to a field with a field type like this:

<fieldType name="special" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z])" replacement=" " replace="all"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z])" replacement=" " replace="all"/>
  </analyzer>
</fieldType>

This works for your example, however I can't be sure if it will work for all
of your content, but give it a try and see.

-Jay
http://www.lucidimagination.com

On Fri, Oct 9, 2009 at 1:34 AM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Hi Joe,

 WordDelimiterFilter removes different delimiters, and creates several token
 strings from the input. It can also concatenate and add that as additional
 token to the stream. Though, it concatenates without space. But maybe you
 can tweak it to your needs?
 You could also use two different fields, one creating the concatenated
 version with spaces, and the other producing the catenated tokens. (Both
 with WordDelimiter and/or RegexPattern filters etc.)

 Cheers,
 Chantal

 Joe Calderon schrieb:

  hello *, im using a combination of tokenizers and filters that give me
 the desired tokens, however for a particular field i want to
 concatenate these tokens back to a single string, is there a filter to
 do that, if not what are the steps needed to make my own filter to
 concatenate tokens?

 for example, i start with Sprocket (widget) - Blue the analyzers
 churn out the tokens [sprocket,widget,blue] i want to end up with the
 string sprocket widget blue, this is a simple example and in the
 general case lowercasing and punctuation removal does not work, hence
 why im looking to concatenate tokens

 --joe




Re: Dynamic Data Import from multiple identical tables

2009-10-09 Thread Jay Hill
You could use separate DIH config files for each of your three tables. This
might be overkill, but it would keep them separate. The DIH is not limited
to one request handler setup, so you could create a unique handler for each
case with a unique name:

   <requestHandler name="/indexer/table1"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">table1-config.xml</str>
     </lst>
   </requestHandler>

   <requestHandler name="/indexer/table2"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">table2-config.xml</str>
     </lst>
   </requestHandler>

   <requestHandler name="/indexer/table3"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">table3-config.xml</str>
     </lst>
   </requestHandler>

When you go to ...solr/admin/dataimport.jsp you should see a list of all
DataImportHandlers that are configured, and can select them individually, if
that works for your needs.

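You can also kick each one off directly (assuming the handler names above and
the usual DIH commands):

  curl 'http://localhost:8983/solr/indexer/table1?command=full-import&clean=false&commit=true'
  curl 'http://localhost:8983/solr/indexer/table2?command=full-import&clean=false&commit=true'
  curl 'http://localhost:8983/solr/indexer/table3?command=full-import&clean=false&commit=true'

With clean=false each import appends to the existing index rather than wiping
it first.
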
-Jay
http://www.lucidimagination.com

On Fri, Oct 9, 2009 at 10:57 AM, solr.searcher solr.searc...@gmail.comwrote:


 Hi all,

 First of all, please accept my apologies if this has been asked and
 answered
 before. I tried my best to search and couldn't find anything on this.

 The problem I am trying to solve is as follows. I have multiple tables with
 identical schema - table_a, table_b, table_c ... and I am trying to create
 one big index with the data from each of these tables. The idea was to
 programatically create the data-config file (just changing the table name)
 and do a reload-config followed by a full-import with clean set to false.
 In
 other words:

 1. publish the data-config file
 2. do a reload-config
 3. do a full-import with clean = false
 4. commit, optimize
 5. repeat with new table name

 I wanted to then follow the same procedure for delta imports. The problem
 is
 that after I do a reload-config and then do a full-import, the old data in
 the index is getting lost.

 What am I missing here? Please note that I am new to solr.

 INFO: [] webapp=/solr path=/dataimport
 params={command=reload-config&clean=false} status=0 QTime=4
 Oct 9, 2009 10:17:30 AM org.apache.solr.core.SolrCore execute
 INFO: [] webapp=/solr path=/dataimport
 params={command=full-import&clean=false} status=0 QTime=1
 Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.SolrWriter
 readIndexerProperties
 INFO: Read dataimport.properties
 Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.DataImporter
 doFullImport
 INFO: Starting Full Import
 Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.SolrWriter
 readIndexerProperties
 INFO: Read dataimport.properties
 Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.JdbcDataSource$1
 call
 INFO: Creating a connection for entity blah blah blah
 Oct 9, 2009 10:17:30 AM org.apache.solr.handler.dataimport.JdbcDataSource$1
 call
 INFO: Time taken for getConnection(): 12
 Oct 9, 2009 10:17:31 AM org.apache.solr.core.SolrDeletionPolicy onInit
 INFO: SolrDeletionPolicy.onInit: commits:num=1


 commit{dir=/blah/blah/index,segFN=segments_1z,version=1255032607825,generation=71,filenames=[segments_1z,
 _cl.cfs]
 Oct 9, 2009 10:17:31 AM org.apache.solr.core.SolrDeletionPolicy
 updateCommits
 INFO: last commit = 1255032607825

 Any help will be greatly appreciated. Is there any other way to
 automaticaly
 slurp data from multiple, identical tables?

 Thanks a lot.

 --
 View this message in context:
 http://www.nabble.com/Dynamic-Data-Import-from-multiple-identical-tables-tp25825381p25825381.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: java -Dsolr.solr.home=core -jar start.jar not working for me

2009-10-09 Thread Jay Hill
Shouldn't that be:  java -Dsolr.solr.home=multicore -jar start.jar

and then hit url: http://localhost:8983/solr/core0/admin/  or
http://localhost:8983/solr/core1/admin/

-Jay
http://www.lucidimagination.com

On Fri, Oct 9, 2009 at 1:17 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:

 I have a fresh checkout from trunk, cd example, after running java
 -Dsolr.solr.home=core -jar start.jar,
 http://localhost:8983/solr/admin yields a 404 error.



Re: java -Dsolr.solr.home=core -jar start.jar not working for me

2009-10-09 Thread Jay Hill
After checking out the latest revision did you do a build? I've made that
mistake myself a few times: check out the latest revision and then fire up
jetty before running ant example - could that be it?

-Jay
http://www.lucidimagination.com


On Fri, Oct 9, 2009 at 1:38 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:

 Jay,

 I tried that as well, still nothing.

 When I run: java -Dsolr.solr.home=solr -jar start.jar
 I see:
 2009-10-09 13:37:04.887::INFO:  Logging to STDERR via
 org.mortbay.log.StdErrLog
 2009-10-09 13:37:05.051::INFO:  jetty-6.1.3
 2009-10-09 13:37:05.096::INFO:  Started SocketConnector @ 0.0.0.0:8983

 And http://localhost:8983/solr/admin yields a 404 error.

 On Fri, Oct 9, 2009 at 1:27 PM, Jay Hill jayallenh...@gmail.com wrote:
  Shouldn't that be:  java -Dsolr.solr.home=multicore -jar start.jar
 
  and then hit url: http://localhost:8983/solr/core0/admin/  or
  http://localhost:8983/solr/core1/admin/
 
  -Jay
  http://www.lucidimagination.com
 
  On Fri, Oct 9, 2009 at 1:17 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com
  wrote:
 
  I have a fresh checkout from trunk, cd example, after running java
  -Dsolr.solr.home=core -jar start.jar,
  http://localhost:8983/solr/admin yields a 404 error.
 
 



DIH: Setting rows= on full-import has no effect

2009-10-08 Thread Jay Hill
In the past setting rows=n with the full-import command has stopped the DIH
importing at the number I passed in, but now this doesn't seem to be
working. Here is the command I'm using:
curl '
http://localhost:8983/solr/indexer/mediawiki?command=full-import&rows=100'

But when 100 docs are imported the process keeps running. Here's the log
output:

Oct 8, 2009 5:23:32 PM org.apache.solr.handler.dataimport.DocBuilder
buildDocument
INFO: Indexing stopped at docCount = 100
Oct 8, 2009 5:23:33 PM org.apache.solr.handler.dataimport.DocBuilder
buildDocument
INFO: Indexing stopped at docCount = 200
Oct 8, 2009 5:23:35 PM org.apache.solr.handler.dataimport.DocBuilder
buildDocument
INFO: Indexing stopped at docCount = 300
Oct 8, 2009 5:23:36 PM org.apache.solr.handler.dataimport.DocBuilder
buildDocument
INFO: Indexing stopped at docCount = 400
Oct 8, 2009 5:23:38 PM org.apache.solr.handler.dataimport.DocBuilder
buildDocument
INFO: Indexing stopped at docCount = 500

and so on.

Running on the most recent nightly: 1.4-dev 823366M - jayhill - 2009-10-08
17:31:22

I've used that exact url in the past and the indexing stopped at the rows
number as expected, but I haven't run the command for about two months on a
build from back in early July.

Here's the dih config:

 <dataConfig>
   <dataSource
       name="dsFiles"
       type="FileDataSource"
       encoding="UTF-8"/>
   <document>
     <entity
         name="f"
         processor="FileListEntityProcessor"
         baseDir="/path/to/files"
         fileName=".*xml"
         recursive="true"
         rootEntity="false"
         dataSource="null">

       <entity
           name="wikixml"
           processor="XPathEntityProcessor"
           forEach="/mediawiki/page"
           url="${f.fileAbsolutePath}"
           dataSource="dsFiles"
           onError="skip">

         <field column="id" xpath="/mediawiki/page/id"/>
         <field column="title" xpath="/mediawiki/page/title"/>
         <field column="contributor"
             xpath="/mediawiki/page/revision/contributor/username"/>
         <field column="comment" xpath="/mediawiki/page/revision/comment"/>
         <field column="text" xpath="/mediawiki/page/revision/text"/>

       </entity>
     </entity>
   </document>
 </dataConfig>


-Jay


Re: TermsComponent or auto-suggest with filter

2009-10-07 Thread Jay Hill
Something like this, building on each character typed:

facet=on&facet.field=tc_query&facet.prefix=be&facet.mincount=1

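Put together as a full request it might look like this (tc_query stands in for
whatever field holds your indexed suggestions, and "be" is whatever the user
has typed so far):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=on&facet.field=tc_query&facet.prefix=be&facet.mincount=1&facet.limit=10

rows=0 keeps the document results out of the response, so all that comes back
is the facet counts - each facet value is a suggestion, already ranked by how
often it occurs.
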
-Jay
http://www.lucidimagination.com


On Tue, Oct 6, 2009 at 5:43 PM, R. Tan tanrihae...@gmail.com wrote:

 Nice. In comparison, how do you do it with faceting?

  Two other approaches are to use either the TermsComponent (new in Solr
  1.4) or faceting.



 On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill jayallenh...@gmail.com wrote:

  Have a look at a blog I posted on how to use EdgeNGrams to build an
  auto-suggest tool:
 
 
 http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
 
  You could easily add filter queries to this approach. Ffor example, the
  query used in the blog could add filter queries like this:
 
  http://localhost:8983/solr/select/?q=user_query:
  ”i”wt=jsonfl=user_queryindent=onechoParams=nonerows=10sort=count
  descfq=yourField:yourQueryfq=anotherField:anotherQuery
 
  -Jay
  http://www.lucidimagination.com
 
 
 
 
  On Tue, Oct 6, 2009 at 4:40 AM, R. Tan tanrihae...@gmail.com wrote:
 
   Hello,
   What's the best way to get auto-suggested terms/keywords that is
 filtered
   by
   one or more fields? TermsComponent should have been the solution but
   filters
   are not supported.
  
   Thanks,
   Rihaed
  
 



Re: ISOLatin1AccentFilter before or after Snowball?

2009-10-07 Thread Jay Hill
Correct me if I'm wrong, but wasn't the ISOLatin1AccentFilterFactory
deprecated in favor of:
<charFilter class="solr.MappingCharFilterFactory"
    mapping="mapping-ISOLatin1Accent.txt"/>

in 1.4?

-Jay
http://www.lucidimagination.com


On Wed, Oct 7, 2009 at 1:44 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Tue, Oct 6, 2009 at 4:33 PM, Chantal Ackermann 
 chantal.ackerm...@btelligent.de wrote:

  Hi all,
 
  from reading through previous posts on that subject, it seems like the
  accent filter has to come before the snowball filter.
 
  I'd just like to make sure this is so. If it is the case, I'm wondering
  whether snowball filters for i.e. French process accented language
  correctly, at all, or whether they remove accents anyway... Or whether
  accents should be removed whenever making use of snowball filters.
 
 
 I'd think so but I'm not sure. Perhaps someone else can weigh in.


 
  And also: it really is meant to take UTF-8 as input, even though it is
  named ISOLatin1AccentFilter, isn't it?
 
 
 See http://markmail.org/message/hi25u5iqusfu542b

 --
 Regards,
 Shalin Shekhar Mangar.



Re: TermsComponent or auto-suggest with filter

2009-10-06 Thread Jay Hill
Have a look at a blog I posted on how to use EdgeNGrams to build an
auto-suggest tool:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

You could easily add filter queries to this approach. For example, the
query used in the blog could add filter queries like this:

http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count
desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery

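The schema side of that approach is roughly the following (a sketch along the
lines of the blog post - the type name, field name and gram sizes are just
illustrative):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="user_query" type="edgytext" indexed="true" stored="true"/>

Each prefix of the stored query gets indexed, so matching what the user has
typed so far is a cheap term match rather than a wildcard.
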
-Jay
http://www.lucidimagination.com




On Tue, Oct 6, 2009 at 4:40 AM, R. Tan tanrihae...@gmail.com wrote:

 Hello,
 What's the best way to get auto-suggested terms/keywords that is filtered
 by
 one or more fields? TermsComponent should have been the solution but
 filters
 are not supported.

 Thanks,
 Rihaed



Batching requests using SolrCell with SolrJ

2009-09-19 Thread Jay Hill
When working with SolrJ I have typically batched a Collection of
SolrInputDocument objects before sending them to the Solr server. I'm
working with the latest nightly build and using the ExtractingRequestHandler
to index documents, and everything is working fine. Except I haven't been
able to figure out how to batch documents when also including literals.
Here's what I've got:

//Looping over a List of Files
  ContentStreamUpdateRequest req = new
      ContentStreamUpdateRequest("/update/extract");
  req.addFile(fileToIndex);
  req.setParam("literal.id", fileToIndex.getCanonicalPath());

  try {
    getSolrServer().request(req);
  } catch (SolrServerException e) {
    e.printStackTrace();
  }

Which works great, except that each document processed in the loop is
sending a separate request. Previously I built a collection of SolrInput
docs and had SolrJ send them in batches of 100 or whatever.

It seems like I could batch documents by continuing to add them to the
request (req.addFile(eachFileUpToACount)), but the literals seem to present
a problem. By sending one at a time the contents and the literals all wind
up in the same document. But in a batch there will just be an array of
params for literal.id (in this example) not matched to the contents.

Can anyone provide a code snippet of how to do this? Or is there no other
approach than sending a request for each document.

Thanks,
-Jay
http://www.lucidimagination.com


Any way to encrypt/decrypt stored fields?

2009-09-16 Thread Jay Hill
For security reasons (say I'm indexing very sensitive data, medical records
for example) is there a way to encrypt data that is stored in Solr? Some
businesses I've encountered have such needs and this is a barrier to them
adopting Solr to replace other legacy systems. Would it require a
custom-written filter to encrypt during indexing and decrypt at query time,
or is there something I'm unaware of already available to do this?

-Jay


Re: Is it possible to query for everything ?

2009-09-14 Thread Jay Hill
Use: ?q=*:*

-Jay
http://www.lucidimagination.com


On Mon, Sep 14, 2009 at 4:18 PM, Jonathan Vanasco jvana...@2xlp.com wrote:

 I'm using Solr for seach and faceted browsing

 Is it possible to have solr search for 'everything' , at least as far as q
 is concerned ?

 The request handlers I've found don't like it if I don't pass in a q
 parameter



Re: Is it possible to query for everything ?

2009-09-14 Thread Jay Hill
With dismax you can use q.alt when the q param is missing:
q.alt=*:*
should work.

-Jay


On Mon, Sep 14, 2009 at 5:38 PM, Jonathan Vanasco jvana...@2xlp.com wrote:

 Thanks Jay  Matt

 I tried *:* on my app, and it didn't work

 I tried it on the solr admin, and it did

 I checked the solr config file, and realized that it works on standard, but
 not on dismax, queries

 So i have my app checking *:* on a standard qt, and then filtering what I
 need on other qts!

 I would never have figured this out without you two!



Re: KStem download

2009-09-14 Thread Jay Hill
The two jar files are all you should need, and the configuration is correct.
However I noticed that you are on Solr 1.3. I haven't tested the Lucid
KStemmer on a non-Lucid-certified distribution of 1.3. I have tested it on
recent versions of 1.4 and it works fine (just tested with the most recent
nightly build).

So there are two options, but I don't know if either will work for you:
1. Move up to Solr 1.4, copy over the jars and configure.
2. Get the free Lucid certified distribution of 1.3 which already has the
Lucid KStemmer (and other fixes which are an improvement over the standard
1.3).

-Jay
http://www.lucidimagination.com


On Mon, Sep 14, 2009 at 6:09 PM, darniz rnizamud...@edmunds.com wrote:


 I was able to declare a field type when I use the Lucid distribution of
 Solr:
 <fieldtype name="lucidkstemmer" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter
       class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
       protected="protwords.txt" />
   </analyzer>
 </fieldtype>

 But if i copy the two jars and put it in lib directory of apache solr
 distribution it still gives me the following error.

 SEVERE: java.lang.NoClassDefFoundError:
 org/apache/solr/util/plugin/ResourceLoaderAware
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at

 org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:375)
at

 org.mortbay.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:337)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at

 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:257)
at

 org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:278)
at

 org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:83)
at

 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at
 org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:781)
at
 org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:56)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:413)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:431)
at

 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at
 org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:440)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:92)
at org.apache.solr.core.SolrCore.init(SolrCore.java:412)
at

 org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
at
 org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
at
 org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at

 org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at

 org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at
 org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at
 org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at

 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at

 org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at

 org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at
 org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at
 org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at 

Re: Highlighting in SolrJ?

2009-09-12 Thread Jay Hill
Will do Shalin.

-Jay
http://www.lucidimagination.com


On Fri, Sep 11, 2009 at 9:23 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Jay, it would be great if you can add this example to the Solrj wiki:

 http://wiki.apache.org/solr/Solrj

 On Fri, Sep 11, 2009 at 5:15 AM, Jay Hill jayallenh...@gmail.com wrote:

  Set up the query like this to highlight a field named content:
 
 SolrQuery query = new SolrQuery();
 query.setQuery(foo);
 
 query.setHighlight(true).setHighlightSnippets(1); //set other params
 as
  needed
 query.setParam(hl.fl, content);
 
 QueryResponse queryResponse =getSolrServer().query(query);
 
  Then to get back the highlight results you need something like this:
 
 IteratorSolrDocument iter = queryResponse.getResults();
 
 while (iter.hasNext()) {
   SolrDocument resultDoc = iter.next();
 
   String content = (String) resultDoc.getFieldValue(content));
   String id = (String) resultDoc.getFieldValue(id); //id is the
  uniqueKey field
 
   if (queryResponse.getHighlighting().get(id) != null) {
 ListString highightSnippets =
  queryResponse.getHighlighting().get(id).get(content);
   }
 }
 
  Hope that gets you what you need.
 
  -Jay
  http://www.lucidimagination.com
 
  On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com
 wrote:
 
   Can somebody point me to some sample code for using highlighting in
   SolrJ?  I understand the highlighted versions of the field comes in a
   separate NamedList?  How does that work?
  
   --
   http://www.linkedin.com/in/paultomblin
  
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Highlighting in SolrJ?

2009-09-11 Thread Jay Hill
It's really just a matter of what you're intentions are. There are an awful
lot of highlighting params and so highlighting is very flexible and
customizable. Regarding snippets, as an example Google presents two snippets
in results, which is fairly common. I'd recommend doing a lot of
experimenting by changing the params on the query string to get what you
want, and then setting them up in SolrJ. The example I sent was intended to
be a generic starting point and mostly just to show how to set highlighting
params and how to get back a List of highlighting results.

-Jay
http://www.lucidimagination.com


On Thu, Sep 10, 2009 at 5:40 PM, Paul Tomblin ptomb...@xcski.com wrote:

 If I set snippets to 9 and mergeContinuous to true, will I get
 the entire contents of the field with all the search terms replaced?
 I don't see what good it would be just getting one line out of the
 whole field as a snippet.

 On Thu, Sep 10, 2009 at 7:45 PM, Jay Hill jayallenh...@gmail.com wrote:
  Set up the query like this to highlight a field named content:
 
 SolrQuery query = new SolrQuery();
 query.setQuery(foo);
 
 query.setHighlight(true).setHighlightSnippets(1); //set other params
 as
  needed
 query.setParam(hl.fl, content);
 
 QueryResponse queryResponse =getSolrServer().query(query);
 
  Then to get back the highlight results you need something like this:
 
 IteratorSolrDocument iter = queryResponse.getResults();
 
 while (iter.hasNext()) {
   SolrDocument resultDoc = iter.next();
 
   String content = (String) resultDoc.getFieldValue(content));
   String id = (String) resultDoc.getFieldValue(id); //id is the
  uniqueKey field
 
   if (queryResponse.getHighlighting().get(id) != null) {
 ListString highightSnippets =
  queryResponse.getHighlighting().get(id).get(content);
   }
 }
 
  Hope that gets you what you need.
 
  -Jay
  http://www.lucidimagination.com
 
  On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com
 wrote:
 
  Can somebody point me to some sample code for using highlighting in
  SolrJ?  I understand the highlighted versions of the field comes in a
  separate NamedList?  How does that work?
 
  --
  http://www.linkedin.com/in/paultomblin
 
 



 --
 http://www.linkedin.com/in/paultomblin



Re: Highlighting in SolrJ?

2009-09-11 Thread Jay Hill
Try adding this param: hl.fragsize=3
(obviously set the fragsize to whatever very high number you need for your
largest doc.)

-Jay


On Fri, Sep 11, 2009 at 7:54 AM, Paul Tomblin ptomb...@xcski.com wrote:

 What I want is the whole text of that field with every instance of the
 search term high lighted, even if the search term only occurs in the
 first line of a 300 page field.  I'm not sure if mergeContinuous will
 do that, or if it will miss everything after the last line that
 contains the search term.

 On Fri, Sep 11, 2009 at 10:42 AM, Jay Hill jayallenh...@gmail.com wrote:
  It's really just a matter of what you're intentions are. There are an
 awful
  lot of highlighting params and so highlighting is very flexible and
  customizable. Regarding snippets, as an example Google presents two
 snippets
  in results, which is fairly common. I'd recommend doing a lot of
  experimenting by changing the params on the query string to get what you
  want, and then setting them up in SolrJ. The example I sent was intended
 to
  be a generic starting point and mostly just to show how to set
 highlighting
  params and how to get back a List of highlighting results.
 
  -Jay
  http://www.lucidimagination.com
 
 
  On Thu, Sep 10, 2009 at 5:40 PM, Paul Tomblin ptomb...@xcski.com
 wrote:
 
  If I set snippets to 9 and mergeContinuous to true, will I get
  the entire contents of the field with all the search terms replaced?
  I don't see what good it would be just getting one line out of the
  whole field as a snippet.
 
  On Thu, Sep 10, 2009 at 7:45 PM, Jay Hill jayallenh...@gmail.com
 wrote:
   Set up the query like this to highlight a field named content:
  
  SolrQuery query = new SolrQuery();
  query.setQuery(foo);
  
  query.setHighlight(true).setHighlightSnippets(1); //set other
 params
  as
   needed
  query.setParam(hl.fl, content);
  
  QueryResponse queryResponse =getSolrServer().query(query);
  
   Then to get back the highlight results you need something like this:
  
  IteratorSolrDocument iter = queryResponse.getResults();
  
  while (iter.hasNext()) {
SolrDocument resultDoc = iter.next();
  
String content = (String) resultDoc.getFieldValue(content));
String id = (String) resultDoc.getFieldValue(id); //id is the
   uniqueKey field
  
if (queryResponse.getHighlighting().get(id) != null) {
  ListString highightSnippets =
   queryResponse.getHighlighting().get(id).get(content);
}
  }
  
   Hope that gets you what you need.
  
   -Jay
   http://www.lucidimagination.com
  
   On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com
  wrote:
  
   Can somebody point me to some sample code for using highlighting in
   SolrJ?  I understand the highlighted versions of the field comes in a
   separate NamedList?  How does that work?
  
   --
   http://www.linkedin.com/in/paultomblin
  
  
 
 
 
  --
  http://www.linkedin.com/in/paultomblin
 
 



 --
 http://www.linkedin.com/in/paultomblin



Re: standard requestHandler components

2009-09-11 Thread Jay Hill
RequestHandlers are configured in solrconfig.xml. If no components are
explicitly declared in the request handler config the the defaults are used.
They are:
- QueryComponent
- FacetComponent
- MoreLikeThisComponent
- HighlightComponent
- StatsComponent
- DebugComponent

If you wanted to have a custom list of components (either omitting defaults
or adding custom) you can specify the components for a handler directly:
<arr name="components">
  <str>query</str>
  <str>facet</str>
  <str>mlt</str>
  <str>highlight</str>
  <str>debug</str>
  <str>someothercomponent</str>
</arr>

You can add components before or after the main ones like this:
<arr name="first-components">
  <str>mycomponent</str>
</arr>

<arr name="last-components">
  <str>myothercomponent</str>
</arr>

and that's how the spell check component can be added:
<arr name="last-components">
  <str>spellcheck</str>
</arr>

Note that a component (other than the defaults) must also be configured in
solrconfig.xml under the name used in the <str> element.

Have a look at the solrconfig.xml in the example directory
(.../example/solr/conf/) for examples on how to set up the spellcheck
component, and on how the request handlers are configured.

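The shape of that component declaration, trimmed down from the example config
(the field and directory names are just placeholders):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
  </searchComponent>

With that in place, adding <str>spellcheck</str> to a handler's last-components
list (as above) turns it on for that handler.
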
-Jay
http://www.lucidimagination.com


On Fri, Sep 11, 2009 at 3:04 PM, michael8 mich...@saracatech.com wrote:


 Hi,

 I have a newbie question about the 'standard' requestHandler in
 solrconfig.xml.  What I like to know is where is the config information for
 this requestHandler kept?  When I go to http://localhost:8983/solr/admin,
 I
 see the following info, but am curious where are the supposedly 'chained'
 components (e.g. QueryComponent, FacetComponent, MoreLikeThisComponent)
 configured for this requestHandler.  I see timing and process debug output
 from these components with debugQuery=true, so somewhere these components
 must have been configured for this 'standard' requestHandler.

 name:standard
 class:  org.apache.solr.handler.component.SearchHandler
 version:$Revision: 686274 $
 description:Search using components:

 org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.DebugComponent,
 stats:  handlerStart : 1252703405335
 requests : 3
 errors : 0
 timeouts : 0
 totalTime : 201
 avgTimePerRequest : 67.0
 avgRequestsPerSecond : 0.015179728


 What I'd like to do with this understanding is to properly integrate the
 spellcheck component into the standard requestHandler, as suggested in a Solr
 spellcheck example.

 Thanks for any info in advance.
 Michael
 --
 View this message in context:
 http://www.nabble.com/%22standard%22-requestHandler-components-tp25409075p25409075.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Pagination with solr json data

2009-09-10 Thread Jay Hill
All you have to do is use the start and rows parameters to get the
results you want. For example, the query for the first page of results might
look like this,
?q=solr&start=0&rows=10 (other params omitted). So you'll start at the
beginning (0) and get 10 results. The next page would be
?q=solr&start=10&rows=10 - start at the 10th result and display the next 10
rows. Then ?q=solr&start=20&rows=10, and so on.

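If you're using SolrJ the same thing is just setStart/setRows - a quick sketch
(the page size and query are arbitrary, and getSolrServer() stands in for
whatever returns your SolrServer instance):

  int pageSize = 10;
  int page = 2;  // zero-based page number

  SolrQuery query = new SolrQuery("solr");
  query.setStart(page * pageSize);  // offset of the first hit to return
  query.setRows(pageSize);          // hits per page

  QueryResponse rsp = getSolrServer().query(query);
  long total = rsp.getResults().getNumFound();  // total hits, for page links

numFound gives you the total match count, which is what you need to render the
pager itself.
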
-Jay
http://www.lucidimagination.com


On Wed, Sep 9, 2009 at 12:24 PM, Elaine Li elaine.bing...@gmail.com wrote:

 Hi,

 What is the best way to do pagination?

 I searched around and only found some YUI utilities can do this. But
 their examples don't have very close match to the pattern I have in
 mind. I would like to have pretty plain display, something like the
 search results from google.

 Thanks.

 Elaine



Re: TermsComponent

2009-09-10 Thread Jay Hill
If you need an alternative to using the TermsComponent for auto-suggest,
have a look at this blog on using EdgeNGrams instead of the TermsComponent.

http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

-Jay
http://www.lucidimagination.com


On Wed, Sep 9, 2009 at 3:35 PM, Todd Benge todd.be...@gmail.com wrote:

 We're using the StandardAnalyzer but I'm fairly certain that's not the
 issue.

 In fact, I there doesn't appear to be any issue with Lucene or Solr.  There
 are many instances of data in which users have removed the whitespace so
 they have a high frequency which means they bubble to the top of the sort.
 The result is that a search for a name shows a first and last name without
 the whitespace.

 One thing I've noticed is that since TermsComponent is working on a single
 Term, there doesn't seem to be a way to query against a phrase.  The same
 example as above applies, so if you're querying for name it'd be prefered
 to
 get multi-term responses back if a first name matches.

 Any suggestions?

 Thanks for all the help.  It's much appreciated.

 Todd


 On Wed, Sep 9, 2009 at 12:11 PM, Grant Ingersoll gsing...@apache.org
 wrote:

  And what Analyzer are you using?  I'm guessing that your words are being
  split up during analysis, which is why you aren't seeing whitespace.  If
 you
  want to keep the whitespace, you will need to use the String field type
 or
  possibly the Keyword Analyzer.
 
  -Grant
 
 
  On Sep 9, 2009, at 11:06 AM, Todd Benge wrote:
 
   It's set as Field.Store.YES, Field.Index.ANALYZED.
 
 
 
  On Wed, Sep 9, 2009 at 8:15 AM, Grant Ingersoll gsing...@apache.org
  wrote:
 
   How are you tokenizing/analyzing the field you are accessing?
 
 
  On Sep 9, 2009, at 8:49 AM, Todd Benge wrote:
 
  Hi Rekha,
 
 
  Here's teh link to the TermsComponent info:
 
  http://wiki.apache.org/solr/TermsComponent
 
  and another link Matt Weber did on autocompletion:
 
 
 
 
 http://www.mattweber.org/2009/05/02/solr-autosuggest-with-termscomponent-and-jquery/
 
  We had to upgrade to the latest nightly to get the TermsComponent to
  work.
 
  Good Luck!
 
  Todd
 
  On Wed, Sep 9, 2009 at 5:17 AM, dharhsana rekha.dharsh...@gmail.com
  wrote:
 
 
   Hi,
 
  I have a requirement on Autocompletion search , iam using solr 1.4.
 
  Could you please tell me how you worked on that Terms component using
  solr
  1.4,
  i could'nt find terms component in solr 1.4 which i have
 downloaded,is
  there
  anyother configuration should be done.
 
  Do you have code for autocompletion, please share wih me..
 
  Regards
  Rekha
 
 
 
  tbenge wrote:
 
 
  Hi,
 
  I was looking at TermsComponent in Solr 1.4 as a way of building a
  autocomplete function.  I have a prototype working but noticed that
  terms
  that have whitespace in them when indexed are absent the whitespace
  when
  returned from the TermsComponent.
 
  Any ideas on why that may be happening?  Am I just missing a
 
   configuration
 
   option?
 
  Thanks,
 
  Todd
 
 
 
   --
  View this message in context:
  http://www.nabble.com/TermsComponent-tp25302503p25362829.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
   --
  Grant Ingersoll
  http://www.lucidimagination.com/
 
  Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
 using
  Solr/Lucene:
  http://www.lucidimagination.com/search
 
 
 
  --
  Grant Ingersoll
  http://www.lucidimagination.com/
 
  Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
  Solr/Lucene:
  http://www.lucidimagination.com/search
 
 



Re: Highlighting in SolrJ?

2009-09-10 Thread Jay Hill
Set up the query like this to highlight a field named content:

SolrQuery query = new SolrQuery();
query.setQuery("foo");

query.setHighlight(true).setHighlightSnippets(1); //set other params as needed
query.setParam("hl.fl", "content");

QueryResponse queryResponse = getSolrServer().query(query);

Then to get back the highlight results you need something like this:

Iterator<SolrDocument> iter = queryResponse.getResults().iterator();

while (iter.hasNext()) {
  SolrDocument resultDoc = iter.next();

  String content = (String) resultDoc.getFieldValue("content");
  String id = (String) resultDoc.getFieldValue("id"); //id is the uniqueKey field

  if (queryResponse.getHighlighting().get(id) != null) {
    List<String> highlightSnippets =
        queryResponse.getHighlighting().get(id).get("content");
  }
}

Hope that gets you what you need.

-Jay
http://www.lucidimagination.com

On Thu, Sep 10, 2009 at 3:19 PM, Paul Tomblin ptomb...@xcski.com wrote:

 Can somebody point me to some sample code for using highlighting in
 SolrJ?  I understand the highlighted versions of the field comes in a
 separate NamedList?  How does that work?

 --
 http://www.linkedin.com/in/paultomblin



Re: Sort a Multivalue field

2009-09-09 Thread Jay Hill
Unfortunately you can't sort on a multi-valued field. In order to sort on a
field it must be indexed but not multi-valued.

Have a look at the FieldOptions wiki page for a good description of what
values to set for different use cases:
http://wiki.apache.org/solr/FieldOptionsByUseCase
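
One common workaround (only a sketch - the field names are made up) is to pick a
single representative value per document at index time and put it in a separate
non-multiValued field used only for sorting:

<field name="aaa"      type="text"   indexed="true" stored="true"  multiValued="true"/>
<field name="aaa_sort" type="string" indexed="true" stored="false" multiValued="false"/>

Note that a plain copyField from the multiValued field won't help here, because the
destination ends up with multiple values too - the single sort value has to be chosen
when the document is built.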

-Jay
www.lucidimagination.com


On Wed, Sep 9, 2009 at 2:37 AM, Jörg Agatz joerg.ag...@googlemail.comwrote:

 Hallo Friends,

 I have a Problem...

 my Search engins Server runs since a lot of weeks...
 Now i gett new XML, and one of the fields ar Multivalue,,

 Ok, i change the Schema.xml, set it to Multivalue and it works :-) no Error
 by the Indexing.. Now i go to the Gui, and will sort this Field, and BAM, i
 cant sort.

 it is impossible to sort a Tokenized field

 Than i think, ok, i doo it in a CopyField and sort the CopyField..
 and voila, i dont get an error, but hie dosent sort realy, i get an output,
 but no change by desc ore asc

 What can i do to sort this Field.. i thinkt, when i soert this field (only
 Numbers) the file comes multible in the output, like this...


 xml:
 field aaa1122/aaa
 field aaa2211/aaa
 field aaa3322/aaa

 sort field aaa

 *1122*
 1134
 1145
 *2211*
 2233
 3355
 3311
 3312
 *3322*
 ...
 ...
 ...

 i hope you have a idea, i am at the end with my ideas

 KingArtus



Re: Field names with whitespaces

2009-08-31 Thread Jay Hill
This seems to work:

?q=field\ name:something

Probably not a good idea to have field names with whitespace though.

-Jay

2009/8/28 Marcin Kuptel marcinkup...@gmail.com

 Hi,

 Is there a way to query solr about fields which names contain whitespaces?
 Indexing such data does not cause any problems but I have been unable to
 retrieve it.

 Regards,
 Marcin Kuptel



Re: MoreLikeThis: How to get quality terms from html from content stream?

2009-08-09 Thread Jay Hill
Solr Cell definitely sounds like it has a place here. But wouldn't it be
needed for as an extracting component earlier in the process for the
MoreLikeThisHandler? The MLT Handler works great when it's directed to a
content stream of plain text. If we could just use Solr Cell to identify the
file type and do the content extraction earlier in the stream that would do
the trick I think. Then whether the URL pointed to HTML, a PDF, or whatever,
MLT would be receiving a stream of extracted content.

-Jay


On Sun, Aug 9, 2009 at 7:17 AM, Grant Ingersoll gsing...@apache.org wrote:

 It's starting to sound like Solr Cell needs a SearchComponent as well, that
 can come before the QueryComponent and can be used to map into the other
 components.  Essentially, take the functionality of the extractOnly option
 and have it feed other SearchComponent.



 On Aug 8, 2009, at 10:42 AM, Ken Krugler wrote:


 On Aug 7, 2009, at 5:23pm, Jay Hill wrote:

  I'm using the MoreLikeThisHandler with a content stream to get documents
 from my index that match content from an html page like this:

 http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi
 ?f=/c/a/2009/08/06/SP5R194Q13.DTLmlt.fl=bodyrows=4debugQuery=true

 But, not surprisingly, the query generated is meaningless because a lot
 of
 the markup is picked out as terms:
 str name=parsedquery_toString
 body:li body:href  body:div body:class body:a body:script body:type
 body:js
 body:ul body:text body:javascript body:style body:css body:h body:img
 body:var body:articl body:ad body:http body:span body:prop
 /str

 Does anyone know a way to transform the html so that the content can be
 parsed out of the content stream and processed w/o the markup? Or do I
 need
 to write my own HTMLParsingMoreLikeThisHandler?


 You'd want to parse the HTML to extract only text first, and use that for
 your index data.

 Both the Nutch and Tika OSS projects have examples of using HTML parsers
 (based on TagSoup or CyberNeko) to generate content suitable for indexing.

 -- Ken

  If I parse the content out to a plain text file and point the stream.url
 param to file:///parsedfile.txt it works great.

 -Jay


 --
 Ken Krugler
 TransPac Software, Inc.
 http://www.transpac.com
 +1 530-210-6378


 --
 Grant Ingersoll
 http://www.lucidimagination.com/

 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
 Solr/Lucene:
 http://www.lucidimagination.com/search




MoreLikeThis: How to get quality terms from html from content stream?

2009-08-07 Thread Jay Hill
I'm using the MoreLikeThisHandler with a content stream to get documents
from my index that match content from an html page like this:
http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true

But, not surprisingly, the query generated is meaningless because a lot of
the markup is picked out as terms:
<str name="parsedquery_toString">
body:li body:href  body:div body:class body:a body:script body:type body:js
body:ul body:text body:javascript body:style body:css body:h body:img
body:var body:articl body:ad body:http body:span body:prop
</str>

Does anyone know a way to transform the html so that the content can be
parsed out of the content stream and processed w/o the markup? Or do I need
to write my own HTMLParsingMoreLikeThisHandler?

If I parse the content out to a plain text file and point the stream.url
param to file:///parsedfile.txt it works great.

-Jay


Re: DIH: Any way to make update on db table?

2009-08-04 Thread Jay Hill
Excellent, thanks Avlesh and Noble.

-Jay

On Mon, Aug 3, 2009 at 9:28 PM, Avlesh Singh avl...@gmail.com wrote:

 
  datasource.getData(update mytable ); //though the name is getData()
  it can execute update commands also
 
 Even when the dataSource is readOnly, Noble?

 Cheers
 Avlesh

 2009/8/4 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

  If your are writing a Transformer (or any other component) you can get
  hold of a dataSource instance .
 
   datasource =Context#getDataSource(name).
  //then you can invoke
  datasource.getData(update mytable );
  //though the name is getData() it can execute update commands also
 
  ensure that you do a
  datasource.close();
  after you are done
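
For illustration, a transformer along those lines might look roughly like this (the
data source name, table and column are made up, and as Avlesh notes the dataSource
would likely need readOnly="false" for the update to go through):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;
import org.apache.solr.handler.dataimport.Transformer;

public class MarkIndexedTransformer extends Transformer {
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    // look up the data source declared in data-config.xml by name
    DataSource ds = context.getDataSource("ds1");
    try {
      // fire the update for the row that was just read
      ds.getData("update tableToIndex set hasBeenIndexed=1 where id=" + row.get("id"));
    } finally {
      ds.close();
    }
    return row;
  }
}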
 
  On Tue, Aug 4, 2009 at 9:40 AM, Avlesh Singhavl...@gmail.com wrote:
   Couple of things -
  
 1. Your dataSource is probably in readOnly mode. It is possible to
 fire
 updates, by specifying readOnly=false in your dataSource.
 2. What you are trying achieve, is typically done using a select for
 update. For MySql, here's the documentation -
 http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
 3. You don't need to create a separate entity for firing updates.
 Writing a database procedure might be a good idea. In that case your
  query
 will simply be  entity name=mainEntity query=call MyProcedure();
  .../.
 All the heavy lifting can be done by this query.
  
   Moreover, update queries, only return the number of rows affected and
 not
  a
   resultSet. DIH expects one and hence the exception.
  
   Cheers
   Avlesh
  
   On Tue, Aug 4, 2009 at 1:49 AM, Jay Hill jayallenh...@gmail.com
 wrote:
  
   Is it possible for the DataImportHandler to update records in the
 table
  it
   is querying? For example, say I have a query like this in my entity:
  
   query=select field1, field2, from someTable where
 hasBeenIndexed=false
  
   Is there a way I can mark each record processed by updating the
   hasBeenIndexed field? Here's a config I tried:
  
   ?xml version=1.0?
   dataConfig
  dataSource
 type=JdbcDataSource
 driver=com.mysql.jdbc.Driver
 url=jdbc:mysql://localhost:3306/solrhacks
 user=user
 password=pass/
  
document name=testingDIHupdate
  entity name=mainEntity
  pk=id
  query=select id, name from tableToIndex where
   hasBeenIndexed=0
field column=id template=dihTestUpdate-${main.id}/
field column=name name=name/
  
entity name=updateEntity
pk=id
query=update tableToIndex set hasBeenIndexed=1 where
   id=${mainEntity.id}
/entity
  /entity
/document
   /dataConfig
  
   It does update the first record, but then an Exception is thrown:
   Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder
   buildDocument
   SEVERE: Exception while processing: mainEntity document :
   SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
   org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
 to
   execute query: update tableToIndex set hasBeenIndexed=1 where id=1
   Processing Document # 1
  at
  
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:250)
  at
  
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
  at
  
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
  at
  
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
  at
  
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
  at
  
  
 
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
  at
  
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
  at
  
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
  at
  
  
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
  at
  
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
  at
  
  
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
  at
  
  
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
   Caused by: java.lang.NullPointerException
  at
  
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:248)
  ... 12 more
  
  
   -Jay
  
  
 
 
 
  --
  -
  Noble Paul | Principal Engineer| AOL | http://aol.com
 



DIH: Any way to make update on db table?

2009-08-03 Thread Jay Hill
Is it possible for the DataImportHandler to update records in the table it
is querying? For example, say I have a query like this in my entity:

query=select field1, field2, from someTable where hasBeenIndexed=false

Is there a way I can mark each record processed by updating the
hasBeenIndexed field? Here's a config I tried:

<?xml version="1.0"?>
<dataConfig>
  <dataSource
     type="JdbcDataSource"
     driver="com.mysql.jdbc.Driver"
     url="jdbc:mysql://localhost:3306/solrhacks"
     user="user"
     password="pass"/>

  <document name="testingDIHupdate">
    <entity name="mainEntity"
            pk="id"
            query="select id, name from tableToIndex where hasBeenIndexed=0">
      <field column="id" template="dihTestUpdate-${main.id}"/>
      <field column="name" name="name"/>

      <entity name="updateEntity"
              pk="id"
              query="update tableToIndex set hasBeenIndexed=1 where id=${mainEntity.id}">
      </entity>
    </entity>
  </document>
</dataConfig>

It does update the first record, but then an Exception is thrown:
Aug 3, 2009 1:15:24 PM org.apache.solr.handler.dataimport.DocBuilder
buildDocument
SEVERE: Exception while processing: mainEntity document :
SolrInputDocument[{id=id(1.0)={1}, name=name(1.0)={John Jones}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
execute query: update tableToIndex set hasBeenIndexed=1 where id=1
Processing Document # 1
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:250)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:207)
at
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:40)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:370)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.NullPointerException
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:248)
... 12 more


-Jay


Re: How can i get lucene index format version information?

2009-07-30 Thread Jay Hill
Check the system request handler: http://localhost:8983/solr/admin/system

Should look something like this:
<lst name="lucene">
  <str name="solr-spec-version">1.3.0.2009.07.28.10.39.42</str>
  <str name="solr-impl-version">1.4-dev 797693M - jayhill - 2009-07-28 10:39:42</str>
  <str name="lucene-spec-version">2.9-dev</str>
  <str name="lucene-impl-version">2.9-dev 794238 - 2009-07-15 18:05:08</str>
</lst>

-Jay


On Thu, Jul 30, 2009 at 10:32 AM, Walter Underwood wun...@wunderwood.orgwrote:

 I think the properties page in the admin UI lists the Lucene version, but I
 don't have a live server to check that on at this instant.

 wunder


 On Jul 30, 2009, at 10:26 AM, Chris Hostetter wrote:


 :  i want to get the lucene index format version from solr web app (as

 : the Luke request handler writes it out:
 :
 :indexInfo.add(version, reader.getVersion());

 that's the index version (as in i have added docs to the index, so the
 version number has changed) the question is about the format version (as
 in: i have upgraded Lucene from 2.1 to 2.3, so the index format has
 changed)

 I'm not sure how Luke get's that ... it's not exposed via a public API on
 an IndexReader.

 Hmm...  SegmentInfos.readCurrentVersion(Directory) seems like it would do
 the trick; but i'm not sure how that would interact with customized
 INdexReader implementations.  i suppose we could always make it non-fatal
 if it throws an exception (just print the exception mesg in place of hte
 number)

 anybody want to submit a patch to add this to the LukeRequestHandler?


 -Hoss





FieldCollapsing: Two response elements returned?

2009-07-27 Thread Jay Hill
I'm doing some testing with field collapsing, and early results look good.
One thing seems odd to me however. I would expect to get back one block of
results, but I get two - the first one contains the collapsed results, the
second one contains the full non-collapsed results:

<result name="response" numFound="11" start="0"> ... </result>
<result name="response" numFound="62" start="0"> ... </result>

This seems somewhat confusing. Is this intended or is this a bug?

Thanks,
-Jay


DIH: On import (full or delta) commit=false seems to not take effect

2009-07-15 Thread Jay Hill
I am trying to run full and delta imports with the commit=false option, but
it doesn't seem to take effect - after the import a commit always happens no
matter what params I send. I've looked at the source and unless I'm missing
something it doesn't seem to process the commit param.

Here's the url I'm using:
curl '
http://localhost:8080/solr/indexer/books?command=full-import&commit=false'

But as soon as the import finishes a commit occurs. I want to set things up
to let autoCommit control all commits as I have a series of DIH-configs
importing data at different times.

I will file an issue in JIRA, but I wanted to check the list first to see if
this has come up for others.

-Jay


Re: spellcheck with misspelled words in index

2009-07-15 Thread Jay Hill
We had the same thing to deal with recently, and a great solution was posted
to the list. Create a stopwords filter on the field your using for your
spell checking, and then populate a custom stopwords file with known
misspelled words:

<fieldType name="textSpell" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="misspelled_words.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Your spell field would look like this:
   <field name="spell" type="textSpell" indexed="true" stored="true"
          multiValued="true"/>

Then add words like "cusine" to misspelled_words.txt

-Jay


On Tue, Jul 14, 2009 at 11:40 PM, Chris Williams cswilli...@gmail.comwrote:

 Hi,
 I'm having some trouble getting the correct results from the
 spellcheck component.  I'd like to use it to suggest correct product
 titles on our site, however some of our products have misspellings in
 them outside of our control.  For example, there's 2 products with the
 misspelled word cusine (and 25k with the correct spelling
 cuisine).  So if someone searches for the word cusine on our site,
 I would like to show the 2 misspelled products, and a suggestion with
 Did you mean cuisine?.

 However, I can't seem to ever get any spelling suggestions when I
 search by the word cusine, and correctlySpelled is always true.
 Misspelled words that don't appear in the index work fine.

 I noticed that setting onlyMorePopular to true will return suggestions
 for the misspelled word, but I've found that it doesn't work great for
 other words and produces suggestions too often for correctly spelled
 words.

 I incorrectly had thought that by setting thresholdTokenFrequency
 higher on my spelling dictionary that these misspellings would not
 appear in my spelling index and thus I would get suggestions for them,
 but as I see now, the spellcheck doesn't quite work like that.

 Is there any way to somehow get spelling suggestions to work for these
 misspellings in my index if they have a low frequency?

 Thanks in advance,
 Chris



Re: DIH: On import (full or delta) commit=false seems to not take effect

2009-07-15 Thread Jay Hill
My bad, I had a configuration setting overriding this value. Sorry for the
mistake.

-Jay

On Wed, Jul 15, 2009 at 12:07 PM, Jay Hill jayallenh...@gmail.com wrote:

 I am trying to run full and delta imports with the commit=false option, but
 it doesn't seem to take effect - after the import a commit always happens no
 matter what params I send. I've looked at the source and unless I'm missing
 something it doesn't seem to process the commit param.

 Here's the url I'm using:
 curl '
 http://localhost:8080/solr/indexer/books?command=full-importcommit=false'

 But as soon as the import finishes a commit occurs. I want to set things up
 to let autoCommit control all commits as I have a series of DIH-configs
 importing data at different times.

 I will file an issue in JIRA, but I wanted to check the list first to see
 if this has come up for others.

 -Jay



Re: DIH: On import (full or delta) commit=false seems to not take effect

2009-07-15 Thread Jay Hill
Actually, my good after all. The parameter does not take effect. If
commit=false is passed in a commit still happens.

Will open a JIRA issue and supply a patch shortly.

-Jay


On Wed, Jul 15, 2009 at 5:50 PM, Jay Hill jayallenh...@gmail.com wrote:

 My bad, I had a configuration setting overriding this value. Sorry for the
 mistake.

 -Jay


 On Wed, Jul 15, 2009 at 12:07 PM, Jay Hill jayallenh...@gmail.com wrote:

 I am trying to run full and delta imports with the commit=false option,
 but it doesn't seem to take effect - after the import a commit always
 happens no matter what params I send. I've looked at the source and unless
 I'm missing something it doesn't seem to process the commit param.

 Here's the url I'm using:
 curl '
 http://localhost:8080/solr/indexer/books?command=full-importcommit=false
 '

 But as soon as the import finishes a commit occurs. I want to set things
 up to let autoCommit control all commits as I have a series of DIH-configs
 importing data at different times.

 I will file an issue in JIRA, but I wanted to check the list first to see
 if this has come up for others.

 -Jay





Spell checking: Is there a way to exclude words known to be wrong?

2009-07-13 Thread Jay Hill
We're building a spell index from a field in our main index with the
following configuration:
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

This works great and re-builds the spelling index on commits as expected.
However, we know there are misspellings in the spell field of our main
index. We could remove these from the spelling index using Luke, however
they will be added again on commits. What we need is something similar to
how the protwords.txt file is used. So that when we notice misspelled words
such as beginnning being pulled from our main index we could add them to
an exclusion file so they are not added to the spelling index again.

Any tricks to make this possible?

-Jay


Re: Creating DataSource for DIH to Oracle Database

2009-07-09 Thread Jay Hill
Francis, your question is a little vague. Are you looking for the
configuration for connecting the DIH to a JNDI datasource set up in
Weblogic?

<dataSource
   name="dsDb"
   jndiName="java:comp/env/jdbc/myWeblogicDatasource"
   type="JdbcDataSource"
   user=""/>

-Jay


On Mon, Jul 6, 2009 at 2:41 PM, Francis Yakin fya...@liquid.com wrote:


 Have any one had experience creating a datasource for DIH to an Oracle
 Database?

 Also, from the Solr side we are running weblogic and deploy the application
 using weblogic.
 I know in weblogic we can create a datasource that can connect to Oracle
 database, has any one had experience with this?


 Thanks

 Francis





Re: about defaultSearchField

2009-07-08 Thread Jay Hill
Just to be sure: You mentioned that you adjusted schema.xml - did you
re-index after making your changes?

-Jay


On Wed, Jul 8, 2009 at 7:07 AM, Yang Lin beckl...@gmail.com wrote:

 Thanks for your reply. But it works not.

 Yang

 2009/7/8 Yao Ge yao...@gmail.com

 
  Try with fl=* or fl=*,score added to your request string.
  -Yao
 
  Yang Lin-2 wrote:
  
   Hi,
   I have some problems.
   For my solr progame, I want to type only the Query String and get all
   field
   result that includ the Query String. But now I can't get any result
   without
   specified field. For example, query with tina get nothing, but
   Sentence:tina could.
  
   I hava adjusted the *schema.xml* like this:
  
   fields
  field name=CategoryNamePolarity type=text indexed=true
   stored=true multiValued=true/
  field name=CategoryNameStrenth type=text indexed=true
   stored=true multiValued=true/
  field name=CategoryNameSubjectivity type=text indexed=true
   stored=true multiValued=true/
  field name=Sentence type=text indexed=true stored=true
   multiValued=true/
  
  field name=allText type=text indexed=true stored=true
   multiValued=true/
   /fields
  
   uniqueKey required=falseSentence/uniqueKey
  
!-- field for the QueryParser to use when an explicit fieldname is
   absent
   --
defaultSearchFieldallText/defaultSearchField
  
!-- SolrQueryParser configuration: defaultOperator=AND|OR --
solrQueryParser defaultOperator=OR/
  
   copyfield source=CategoryNamePolarity dest=allText/
   copyfield source=CategoryNameStrenth dest=allText/
   copyfield source=CategoryNameSubjectivity dest=allText/
   copyfield source=Sentence dest=allText/
  
  
   I think the problem is in defaultSearchField, but I don't know how to
   fix
   it. Could anyone help me?
  
   Thanks
   Yang
  
  
 
  --
  View this message in context:
  http://www.nabble.com/about-defaultSearchField-tp24382105p24384615.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Re: Indexing rich documents from websites using ExtractingRequestHandler

2009-07-08 Thread Jay Hill
I haven't tried this myself, but it sounds like what you're looking for is
enabling remote streaming:
http://wiki.apache.org/solr/ContentStream#head-7179a128a2fdd5dde6b1af553ed41735402aadbf

As the link above shows you should be able to enable remote streaming like
this: <requestParsers enableRemoteStreaming="true"
multipartUploadLimitInKB="2048" /> and then something like this might work:
stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

So you use stream.url instead of stream.file.
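
For example, with remote streaming enabled and assuming the ExtractingRequestHandler
is mapped to /update/extract as in the example solrconfig.xml, a request along these
lines (the literal.id value is made up) should fetch and index the remote PDF:

http://localhost:8983/solr/update/extract?stream.url=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf&literal.id=testfile1&commit=true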

Hope this helps.

-Jay


On Wed, Jul 8, 2009 at 7:40 AM, ahammad ahmed.ham...@gmail.com wrote:


 Hello,

 I can index rich documents like pdf for instance that are on the
 filesystem.
 Can we use ExtractingRequestHandler to index files that are accessible on a
 website?

 For example, there is a file that can be reached like so:
 http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf

 How would I go about indexing that file? I tried using the following
 combinations. I will put the errors in brackets:

 stream.file=http://www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The
 filename, directory name, or volume label syntax is incorrect)
 stream.file=www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The system
 cannot find the path specified)
 stream.file=//www.sub.myDomain.com/files/pdfdocs/testfile.pdf (The format
 of
 the specified network name is invalid)
 stream.file=sub.myDomain.com/files/pdfdocs/testfile.pdf (The system cannot
 find the path specified)
 stream.file=//sub.myDomain.com/files/pdfdocs/testfile.pdf (The network
 path
 was not found)

 I sort of understand why I get those errors. What are the alternative
 methods of doing this? I am guessing that the stream.file attribute doesn't
 support web addresses. Is there another attribute that does?
 --
 View this message in context:
 http://www.nabble.com/Indexing--rich-documents-from-websites-using-ExtractingRequestHandler-tp24392809p24392809.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Indexing XML

2009-07-07 Thread Jay Hill
Mathieu, have a look at Solr's DataImportHandler. It provides a
configuration-based approach to index different types of datasources
including relational databases and XML files. In particular have a look at
the XpathEntityProcessor (
http://wiki.apache.org/solr/DataImportHandler#head-f1502b1ed71d98ef0120671db5762e137e63f9d2)
which allows you to use xpath syntax to map xml data to index fields.
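
As a very rough sketch of what that could look like for files such as yours (the file
path, column names and xpaths below are only illustrative and would have to match your
schema and the exact LOM structure):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="lom"
            processor="XPathEntityProcessor"
            url="/path/to/lom-files/example.xml"
            forEach="/lom">
      <field column="title"       xpath="/lom/general/title/string"/>
      <field column="language"    xpath="/lom/general/language"/>
      <field column="description" xpath="/lom/general/description/string"/>
    </entity>
  </document>
</dataConfig>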

-Jay


On Tue, Jul 7, 2009 at 7:25 AM, Saeli Mathieu saeli.math...@gmail.comwrote:

 Hello.

 I'm a new user of Solr, I already used Lucene to index files and search.
 But my programme was too slow, it's why I was looking for another solution,
 and I thought I found it.

 I said I thought because I don't know if it's possible to use solar with
 this kind of XML files.

  lom xsi:schemaLocation=http://ltsc.ieee.org/xsd/lomv1.0
 http://ltsc.ieee.org/xsd/lomv1.0/lom.xsd;
 general
 identifier
 catalogSTRING HERE/catalog
 entry
 STRING HERE
 /entry
 /identifier
 title
 string language=fr
 STRING HERE
 /string
 /title
 languagefr/language
 description
 string language=fr
 STRING HERE
 /string
 /description
 /general
 lifeCycle
 status
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /status
 contribute
 role
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /role
 entitySTRING HERE
 /entity
 /contribute
 /lifeCycle
 metaMetadata
 identifier
 catalogSTRING HERE/catalog
 entrySTRING HERE/entry
 /identifier
 contribute
 role
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /role
 entitySTRING HERE
 /entity
 date
 dateTimeSTRING HERE/dateTime
 /date
 /contribute
 contribute
 role
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /role
 entitySTRING HERE
 /entity
 entitySTRING HERE/entity
 entitySTRING HERE
 /entity
 date
 dateTimeSTRING HERE/dateTime
 /date
 /contribute
 metadataSchemaSTRING HERE/metadataSchema
 languageSTRING HERE/language
 /metaMetadata
 technical
 locationSTRING HERE
 /location
 /technical
 educational
 intendedEndUserRole
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /intendedEndUserRole
 context
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /context
 typicalAgeRange
 string language=frSTRING HERE/string
 /typicalAgeRange
 description
 string language=frSTRING HERE/string
 /description
 description
 string language=frSTRING HERE/string
 /description
 languageSTRING HERE/language
 /educational
 annotation
 entitySTRING HERE
 /entity
 date
 dateTimeSTRING HERE/dateTime
 /date
 /annotation
 classification
 purpose
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /purpose
 /classification
 classification
 purpose
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /purpose
 taxonPath
 source
 string language=frSTRING HERE/string
 /source
 taxon
 idSTRING HERE/id
 entry
 string language=frSTRING HERE/string
 /entry
 /taxon
 /taxonPath
 /classification
 classification
 purpose
 sourceSTRING HERE/source
 valueSTRING HERE/value
 /purpose
 taxonPath
 source
 string language=frSTRING HERE /string
 /source
 taxon
 idSTRING HERE/id
 entry
 string language=frSTRING HERE/string
 /entry
 /taxon
 /taxonPath
 taxonPath
 source
 string language=frSTRING HERE/string
 /source
 taxon
 idSTRING HERE/id
 entry
 string language=frSTRING HERE/string
 /entry
 /taxon
 /taxonPath
 /classification
 /lom

 I don't know how I can use this kind of file with Solr because the XML
 example are this one.

  add
  doc
  field name=idSOLR1000/field
  field name=nameSolr, the Enterprise Search Server/field
  field name=manuApache Software Foundation/field
  field name=catsoftware/field
  field name=catsearch/field
  field name=featuresAdvanced Full-Text Search Capabilities using
 Lucene/field
  field name=featuresOptimized for High Volume Web Traffic/field
  field name=featuresStandards Based Open Interfaces - XML and
 HTTP/field
  field name=featuresComprehensive HTML Administration
 Interfaces/field
  field name=featuresScalability - Efficient Replication to other Solr
 Search Servers/field
  field name=featuresFlexible and Adaptable with XML configuration and
 Schema/field
  field name=featuresGood unicode support: h#xE9;llo (hello with an
 accent over the e)/field
  field name=price0/field
 field name=popularity10/field
 field name=inStocktrue/field
 field name=incubationdate_dt2006-01-17T00:00:00.000Z/field
 /doc
 /add

 I understood Solr need this kind of architecture, by Architecture I mean
 field + name=keywordValue/field
 or as you can see I can't use this kind of architecture because I'm not
 allow to change my XML files.

 I'm looking forward to read you.

 Mathieu Saeli
 --
 Saeli Mathieu.



Re: DIH: Limited xpath syntax unable to parse all xml elements

2009-07-02 Thread Jay Hill
Thanks Noble, I gave those examples a try.

If I use <field column="body" xpath="/book/body/chapter/p" /> I only get
the text from the last <p> element, not from all elements.

If I use <field column="body" xpath="/book/body/chapter" flatten="true"/>
or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't
get back anything for the body column.

So the first example is close, but it only gets the text for the last <p>
element. If I could get all <p> elements at the same level that would be
what I need. The double-slash (/book/body/chapter//p) doesn't seem to be
supported.

Thanks,
-Jay


2009/7/1 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 complete xpath is not supported

 /book/body/chapter/p

 should work.

 if you wish all the text under chapter irrespective of nesting , tag
 names use this
 field column=body xpath=/book/body/chapter flatten=true/






 On Thu, Jul 2, 2009 at 5:31 AM, Jay Hilljayallenh...@gmail.com wrote:
  I'm using the XPathEntityProcessor to parse an xml structure that looks
 like
  this:
 
  book
 authorJoe Smith/author
 titleWorld Atlas/title
 body
 chapter
 pContent I want is here/p
 pMore content I want is here./p
 pStill more content here./p
 /chapter
 /body
  /book
 
  The author and title parse out fine:   field column=title
  xpath=/book/title/  field column=author xpath=/book/author/
 
  But I can't get at the data inside the p tags. I want to get all
  non-markup text inside the body tag with something like this:
 
  field column=body xpath=/book/body/chapter//p/
 
  but that is not supported.
 
  Does anyone know of a way that I can get the content within the p tags
  without the markup?
 
  Thanks,
  -Jay
 



 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com



Re: DIH: Limited xpath syntax unable to parse all xml elements

2009-07-02 Thread Jay Hill
It is not multivalued. The intention is to get all text under they body
element into one body field in the index that is not multivalued.
Essentially everything within the body element minus the markup.

Thanks,
-Jay


On Thu, Jul 2, 2009 at 8:55 AM, Fergus McMenemie fer...@twig.me.uk wrote:

 Thanks Noble, I gave those examples a try.
 
 If I use field column=body xpath=/book/body/chapter/p /  I only get
 the text from the last p element, not from all elements.

 Hm, I am sure I have done this. In your schema.xml is the
 field body multiValued or not?


 
 If I use field column=body xpath=/book/body/chapter flatten=true/
 or field column=body xpath=/book/body/chapter/ flatten=true/ I
 don't
 get back anything for the body column.
 
 So the first example is close, but it only gets the text for the last p
 element. If I could get all p elements at the same level that would be
 what I need. The double-slash (/book/body/chapter//p) doesn't seem to be
 supported.
 
 Thanks,
 -Jay
 
 
 2009/7/1 Noble Paul ??  Â Ë³Ë noble.p...@corp.aol.com
 
  complete xpath is not supported
 
  /book/body/chapter/p
 
  should work.
 
  if you wish all the text under chapter irrespective of nesting , tag
  names use this
  field column=body xpath=/book/body/chapter flatten=true/
 
 
 
 
 
 
  On Thu, Jul 2, 2009 at 5:31 AM, Jay Hilljayallenh...@gmail.com wrote:
   I'm using the XPathEntityProcessor to parse an xml structure that
 looks
  like
   this:
  
   book
  authorJoe Smith/author
  titleWorld Atlas/title
  body
  chapter
  pContent I want is here/p
  pMore content I want is here./p
  pStill more content here./p
  /chapter
  /body
   /book
  
   The author and title parse out fine:   field column=title
   xpath=/book/title/  field column=author xpath=/book/author/
  
   But I can't get at the data inside the p tags. I want to get all
   non-markup text inside the body tag with something like this:
  
   field column=body xpath=/book/body/chapter//p/
  
   but that is not supported.
  
   Does anyone know of a way that I can get the content within the p
 tags
   without the markup?
  
   Thanks,
   -Jay
  
 
 
 
  --
  -
  Noble Paul | Principal Engineer| AOL | http://aol.com
 

 --

 ===
 Fergus McMenemie   
 Email: fer...@twig.me.uk
 Techmore Ltd   Phone:(UK) 07721 376021

 Unix/Mac/Intranets Analyst Programmer
 ===



Re: DIH: Limited xpath syntax unable to parse all xml elements

2009-07-02 Thread Jay Hill
I'm on the trunk, built on July 2: 1.4-dev 789506

Thanks,
-Jay

On Thu, Jul 2, 2009 at 11:33 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com
 wrote:

  Shalin Shekhar Mangar wrote:
 
 
  It selects all matching nodes. But if the field is not multi-valued, it
  will
  store only the last value. I guess this is what is happening here.
 
 
 
  So do you think it should match them all and add the concatenated text as
  one field?
 
  That would be more Xpath like I think, and less arbitrary than just
  choosing the last one.
 

 I won't call it arbitrary because it creates a SolrInputDocument with
 values
 from all the matching nodes just like you'd create any multi-valued field.
 The problem is that his field is not declared to be multi-valued. The same
 would happen if you posted an XML document to /update with multiple values
 for a single-valued field.

 XPathEntityProcessor provides the flatten=true option if you want to add
 it as concatenated test. Jay mentioned that flatten did not work for him
 which is something we should investigate.

 Jay, which version of Solr are you running? The flatten option is a 1.4
 feature (added with SOLR-1003).
 --
 Regards,
 Shalin Shekhar Mangar.



Re: DIH: Limited xpath syntax unable to parse all xml elements

2009-07-02 Thread Jay Hill
Thanks Fergus, setting the field to multivalued did work:
  <field column="body" xpath="/book/body/chapter/p" flatten="true"/>
gets all the p elements as multivalue fields in the body field.
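
For reference, the schema side of that is just declaring the target field as
multiValued - a sketch (the other attributes are whatever you already use):

<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>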

The only thing is, the body field is used by some other content sources, so
I have to look at the implications setting it to multi-valued will have on
the other data sources. Still, this might do the trick.

Thanks to all that helped on this!

-Jay



On Thu, Jul 2, 2009 at 11:40 AM, Fergus McMenemie fer...@twig.me.uk wrote:

 Shalin Shekhar Mangar wrote:
  On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  It looks like DIH implements its own subset of the Xpath spec.
 
 
 
  Right, DIH has a streaming implementation supporting a subset of XPath
 only.
  The supported things are in the wiki examples.
 
 
 
  I don't see any tests with multiple matching sub nodes, so perhaps DIH
  Xpath does not properly support that and just selects the last matching
  node?
 
 
 
  It selects all matching nodes. But if the field is not multi-valued, it
 will
  store only the last value. I guess this is what is happening here.
 
 
 So do you think it should match them all and add the concatenated text
 as one field?
 
 That would be more Xpath like I think, and less arbitrary than just
 choosing the last one.

 Only when the field in schema.xml in not multiValued. If the field is
 multiValued is should still behave as at present?

 Also... what went wrong with the suggested:-
 field column=body xpath=/book/body/chapter flatten=true/

 Regards Fergus.



DIH: Distributing docs to more than one Solr instance

2009-07-01 Thread Jay Hill
I'm using the DIH to index records from a relational database. No problems,
everything works great. But now, due to the size of index (70GB w/ 25M+
docs) I need to shard and want the DIH to distribute documents evenly
between two shards. Current approach is to modify the sql query in the
config file to get only even numbered ids on one host and odd numbered ids
on the other host. Is there are more elegant way to distribute the
documents? Has anyone else come up with a better way to approach this?
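
For the record, the even/odd split described above goes straight into each shard's
DIH config - a sketch (table and column names are placeholders), with the modulus
flipped to 1 on the other shard:

<entity name="book"
        pk="id"
        query="select id, title from books where mod(id, 2) = 0">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
</entity>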

Thanks,
-Jay


DIH: Limited xpath syntax unable to parse all xml elements

2009-07-01 Thread Jay Hill
I'm using the XPathEntityProcessor to parse an xml structure that looks like
this:

<book>
  <author>Joe Smith</author>
  <title>World Atlas</title>
  <body>
    <chapter>
      <p>Content I want is here</p>
      <p>More content I want is here.</p>
      <p>Still more content here.</p>
    </chapter>
  </body>
</book>

The author and title parse out fine:   <field column="title"
xpath="/book/title"/>  <field column="author" xpath="/book/author"/>

But I can't get at the data inside the <p> tags. I want to get all
non-markup text inside the <body> tag with something like this:

<field column="body" xpath="/book/body/chapter//p"/>

but that is not supported.

Does anyone know of a way that I can get the content within the p tags
without the markup?

Thanks,
-Jay


PlainTextEntitiyProcessor not putting any text into a field in index

2009-06-18 Thread Jay Hill
I'm having some trouble getting the PlainTextEntityProcessor to populate a
field in an index. I'm using the TemplateTransformer to fill 2 fields, and
have a timestamp field in schema.xml, and these fields make it into the
index. Only the plainText data is missing. Here is my configuration:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity
       name="f"
       processor="FileListEntityProcessor"
       baseDir="/Users/jayhill/test/dir"
       fileName=".*txt"
       recursive="true"
       rootEntity="true"
       >

      <entity
         name="pt"
         processor="PlainTextEntityProcessor"
         url="${f.fileAbsolutePath}"
         transformer="RegexTransformer,TemplateTransformer"
         >
        <field column="plainText" name="text"/>
        <field column="datasource" template="textfiles" />
      </entity>

    </entity>
  </document>
</dataConfig>

I've tried adding plainText as a field in schema.xml, but that didn't work
either.

When I look at what the PlainTextEntityProcessor class is doing I see that
it has correctly parsed the file and has the text in a StringWriter:
row.put(PLAIN_TEXT, sw.toString());
I just don't know how to get that text into a field in the index

Any pointers appreciated.

-Jay


Re: query issue /special character and case

2009-06-08 Thread Jay Hill
Regarding being able to search SCHOLKOPF (o with no umlaut) and match
SCHÖLKOPF (with umlaut) try using the ISOLatin1AccentFilterFactory in your
analysis chain:

<filter class="solr.ISOLatin1AccentFilterFactory" />

This filter removes accented chars and replaces them with non-accented
versions. As always, make sure to add it to the analyzer chain for both the
index and query analyzers.
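
Roughly, the filter sits in both analyzer sections of the fieldType - a sketch (the
rest of the chain is whatever you already have):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>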

-Jay

On Fri, Jun 5, 2009 at 11:10 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Sat, May 30, 2009 at 9:48 AM, revas revas...@gmail.com wrote:

  Hi ,
 
  When i give a query like the following ,why does it become a phrase query
  as shown below?
  The field type is the default text field in the schema.
 
  str name=querystringvolker-blanz/str
  str name=parsedqueryPhraseQuery(content:volker blanz)/str
 

 What is the query that was sent to Solr?


  Also when i have special characters in the query as SCHÖLKOPF , i am not
  able to convert the o with spl character  to lower case on my unix
 os/it
  works fine on windows xp OS .Also if i have a spl character in my  query
 ,i
  would like to search for it wihtout the special character as  SCHOLKOPF
  ,this works fine in windows with strtr (string translate php fucntion)
 ,but
  again not in windows OS.
 

 Hmm, not sure. If you are using Tomcat, have you enabled UTF-8?


 http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4

 You can try using the analysis.jsp on the text field with this token and
 see
 how it is being analyzed. See if that gives some hints.

 --
 Regards,
 Shalin Shekhar Mangar.



Re: Query faceting

2009-06-08 Thread Jay Hill
In order to get the values you want for the service field you will need
to change the fieldType definition in schema.xml for service to use
something that doesn't alter your original values. Try the string
fieldType to start and look at the fieldType definition for string. I'm
guessing you have it set to text or something else with a chain of filters
during analysis.

If you don't want back facets with a count of 0 set this param:
facet.mincount=1  Have a look at all the values you can set on facets:
http://wiki.apache.org/solr/SimpleFacetParameters
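
Putting the two together, a sketch might be a copyField into a string-typed field and
then faceting on that (the service_exact name is made up; copyField copies the raw
field value, so "Shuttle Services" stays intact):

<field name="service_exact" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="service" dest="service_exact"/>

...facet=true&facet.field=service_exact&facet.limit=-1&facet.mincount=1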

-Jay

On Mon, Jun 8, 2009 at 2:09 PM, siping liu siping...@hotmail.com wrote:


 Hi,

 I have a field called service with following values:

 - Shuttle Services
 - Senior Discounts
 - Laundry Rooms

 - ...



 When I conduct query with facet=true&facet.field=service&facet.limit=-1,
 I get something like this back:

 - shuttle 2

 - service 3

 - senior 0

 - laundry 0

 - room 3

 - ...



 Questions:

 - How not to break up fields values in words, so I can get something like
 Shuttle Services 2 back?

 - How to tell Solr not to return facet with 0 value? The query takes long
 time to finish, seemingly because of the long list of items with 0 count.



 thanks for any advice.

 _
 Insert movie times and more without leaving Hotmail®.

 http://windowslive.com/Tutorial/Hotmail/QuickAdd?ocid=TXT_TAGLM_WL_HM_Tutorial_QuickAdd_062009



Re: Highlighting and Field options

2009-06-01 Thread Jay Hill
Use the fl param to ask for only the fields you need, but also keep hl=true.
Something like this:

http://localhost:8080/solr/select/?q=bear&version=2.2&start=0&rows=10&indent=on&hl=true&fl=id

Note that fl=id means the only field returned in the XML will be the id
field.

Highlights are still returned in the highlight element, but you won't get
back the unneeded content field.

-Jay


On Mon, Jun 1, 2009 at 9:41 AM, ashokc ash...@qualcomm.com wrote:


 Hi,

 The 'content' field that I am indexing is usually large (e.g. a pdf doc of
 a
 few Mb in size). I need highlighting to be on. This 'seems' to require that
 I have to set the 'content' field to be STORED. This returns the whole
 content field in the search result XML. for each matching document. The
 highlighted text also is returned in a separate block. But I do NOT need
 the
 entire content field to display the search results. I only use the
 highlighted segments to display a brief description of each hit. The fact
 that SOLR returns entire content field, makes the returned XML
 unnecessarily
 huge, and makes for larger response times. How can I have SOLR return ONLY
 the highlighted text for each hit and NOT the entire 'content' filed?
 Thanks
 - ashok
 --
 View this message in context:
 http://www.nabble.com/Highlighting-and-Field-options-tp23818019p23818019.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Question about field types and querying

2009-05-28 Thread Jay Hill
Try using the admin analysis tool
(http://host:port/solr/admin/analysis.jsp)
to see what the analysis chain is doing to your query. Enter the field name
("question" in your case) and for Field value (Index) enter "customize" (since
that's what's in the document). For Field value (Query) enter "customer".
Check "Verbose Output" and click "Analyze". This will show you each filter
in the chain and the actions they are taking on your query. Note that the
highlighted fields show where a match would occur.

Then adjust your fieldTypes and fields to get the results you want. Create a
new fieldType if needed and add/remove filters as needed.

-Jay


On Thu, May 28, 2009 at 12:07 PM, ahammad ahmed.ham...@gmail.com wrote:


 Hello,

 I have a field type of text in my collection called question.

 When I query for the word customer for example in the question field
 (ie
 q=question:customer), the first document with the highest score shows up,
 but does not contain the word customer at all.

 Instead, it contains the word customize.

 What would be a way around this? I tried changing the type to string
 instead
 of text, but that I wouldn't get any results if I don't have the exact
 statement in there...
 --
 View this message in context:
 http://www.nabble.com/Question-about-field-types-and-querying-tp23768061p23768061.html
 Sent from the Solr - User mailing list archive at Nabble.com.




  1   2   >