Re: is indexing single-threaded?

2010-09-23 Thread Ryan McKinley
Multiple threads work well.

If you are using solrj, check the StreamingSolrServer for an
implementation that will keep X number of threads busy.

Your mileage will vary, but in general I find a reasonable thread
count is ~ (number of cores) + 1.


On Wed, Sep 22, 2010 at 5:52 AM, Andy angelf...@yahoo.com wrote:
 Does Solr index data in a single thread or can data be indexed concurrently 
 in multiple threads?

 Thanks
 Andy
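
The pattern Ryan describes can be sketched in a few lines (a plain Python stand-in for what StreamingSolrServer does for you in SolrJ; the send callable and the batches are placeholders for real HTTP posts to /update):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def index_docs(batches, send, num_threads=None):
    """Feed document batches to Solr from several threads at once.

    send() stands in for whatever posts one batch to /update.
    """
    if num_threads is None:
        # rule of thumb from this thread: (number of cores) + 1
        num_threads = (os.cpu_count() or 1) + 1
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() blocks until every batch has been handed to send()
        list(pool.map(send, batches))

sent = []
index_docs(range(10), sent.append, num_threads=3)
print(len(sent))  # 10
```

The thread count is just a starting point; benchmark against your own hardware.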






Re: How can I delete the entire contents of the index?

2010-09-23 Thread Ryan McKinley
<delete><query>*:*</query></delete>

will leave you a fresh index


On Thu, Sep 23, 2010 at 12:50 AM, xu cheng xcheng@gmail.com wrote:
 <delete><query>the query that fetches the data you wanna delete</query></delete>
 I did like this to delete my data
 best regards

 2010/9/23 Igor Chudov ichu...@gmail.com

 Let's say that I added a number of elements to Solr (I use
 Webservice::Solr as the interface to do so).

 Then I change my mind and want to delete them all.

 How can I delete all contents of the database, but leave the database
 itself, just empty?

 Thanks

 i
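
For reference, a tiny sketch of building that delete-all payload (you would POST it to Solr's /update handler and follow with a <commit/> to make the change visible):

```python
import xml.etree.ElementTree as ET

def delete_all_xml():
    # Builds <delete><query>*:*</query></delete> for the /update handler.
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = "*:*"
    return ET.tostring(delete, encoding="unicode")

print(delete_all_xml())  # <delete><query>*:*</query></delete>
```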




Re: Concurrent DB updates and delta import misses few records

2010-09-23 Thread Shashikant Kore
Thanks for the pointer, Shawn. It definitely is useful.

I am wondering if you could retrieve minDid from the solr rather than
storing it externally. Max id from Solr index and max id from DB should
define the lower and upper thresholds, respectively, of the delta range. Am
I missing something?

--shashi

On Wed, Sep 22, 2010 at 6:47 PM, Shawn Heisey s...@elyograg.org wrote:

  On 9/22/2010 1:39 AM, Shashikant Kore wrote:

 Hi,

 I'm using DIH to index records from a database. After every update on
 (MySQL) DB, Solr DIH is invoked for delta import.  In my tests, I have
 observed that if db updates and DIH import is happening concurrently,
 import
 misses few records.

 Here is how it happens.

 The table has a column 'lastUpdated' which has default value of current
 timestamp. Many records are added to database in a single transaction that
 takes several seconds. For example, if 10,000 rows are being inserted, the
 rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
 18:21:26'. These rows become visible only after transaction is committed.
 That happens at, say, '2010-09-20 18:21:30'.

 If Solr's import gets triggered at '18:21:29', it will use the timestamp of the
 last import for the delta query. This import will not see the records added in
 the aforementioned transaction, as the transaction was not committed at that
 instant. After this import, dataimport.properties will have a last index
 time of '18:21:29'. The next import will not be able to get all the rows of the
 previously referred transaction, as some of the rows have a timestamp earlier
 than '18:21:29'.

 While I am testing extreme conditions, there is a possibility of missing
 out
 on some data.

 I could not find any solution in the Solr framework to handle this. The table
 has an auto increment key, and all updates are deletes followed by inserts. So,
 having last_indexed_id would have helped, where last_indexed_id is the max
 value of id fetched in that import. The query would then become 'SELECT id
 WHERE id > last_indexed_id'. I suppose Solr does not have any provision like
 this.

 Two options I could think of are:
 (a) Ensure at the application level that DB updates and DIH import requests
 never run concurrently.
 (b) Use exclusive locking during DB updates

 What is the best way to address this problem?


 Shashi,

 I was not solving the same problem, but perhaps you can adapt my solution
 to yours.  My main problem was that I don't have a modified date in my
 database, and due to the size of the table, it is impractical to add one.
  Instead, I chose to track the database primary key (a simple autoincrement)
 outside of Solr and pass min/max values into DIH for it to use in the SELECT
 statement.  You can see a simplified version of my entity here, with a URL
 showing how to send the parameters in via the dataimport GET:

 http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html

 The update script that runs every two minutes gets MAX(did) from the
 database, retrieves the minDid from a file on an NFS share, and runs a
 delta-import with those two values.  When the import is reported successful,
 it writes the maxDid value to the minDid file on the network share for the
 next run.  If the import fails, it sends an alarm and doesn't update the
 minDid.

 Shawn
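
Shawn's bookkeeping can be sketched roughly like this (the file path and the delta_import callable are invented for illustration; in practice delta_import would issue the DIH request with minDid/maxDid as parameters):

```python
import tempfile
from pathlib import Path

def run_delta(state_file, max_did, delta_import):
    """Run one delta-import over (minDid, maxDid]; advance minDid only on success."""
    path = Path(state_file)
    min_did = int(path.read_text()) if path.exists() else 0
    ok = delta_import(min_did, max_did)  # e.g. hit /dataimport with these bounds
    if ok:
        path.write_text(str(max_did))    # next run starts after this id
    return ok                            # on failure minDid is untouched, so the range is retried

with tempfile.TemporaryDirectory() as d:
    state = Path(d) / "minDid.txt"
    run_delta(state, 500, lambda lo, hi: True)   # pretend the import succeeded
    print(state.read_text())  # 500
    run_delta(state, 900, lambda lo, hi: False)  # failed import: state not advanced
    print(state.read_text())  # 500
```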




Re: Solr Reporting

2010-09-23 Thread Myron Chelyada
Hi Adeel,

I would use the first approach since it is more flexible and easier to use.
Please consider the XsltResponseWriter, which allows you to transform the result
set from Solr's default XML structure into a custom one using a provided XSLT template.

Myron

2010/9/23 Adeel Qureshi adeelmahm...@gmail.com

 This probably isnt directly a solr user type question but its close enough
 so I am gonna post it here. I have been using solr for a few months now and
 it works just out of this world so I definitely love the software (and
 obviously lucene too) .. but I feel that solr output xml is in kind of a weird
 format .. I mean its in a format that simply makes it difficult to plug solr
 output xml into any xml reading tool or api .. this whole concept of using
 <str name="id">123</str>
 instead of
 <id>123</id>
 doesnt make sense to me ..

 what I am trying to do now is setup a reporting system off of solr .. and
 the concept is simply .. let the user do all the searches, facet etc and
 once they have finalized on some results .. simply allow them to export
 those results in an excel or pdf file .. what I have setup right now is I
 simply let the export feature use the same solr query that user used to
 search their results .. send that query to solr again and get all results
 back and simply iterate over xml and dump all data in an excel file

 this has worked fine in most situations but I want to improve this process
 and specifically use jasper reports for reporting .. and I want to use
 ireport to design my report templates ..
 thats where solr output xml format is causing problems .. as I cant figure
 out how to make it work with ireport because of solr xml not having any
 named nodes .. it all looks like the same nodes and ireport cant
 distinguish
 one column from another .. so I am thinking a couple of solutions here and
 wanted to get some suggestions from you guys on how to do it best

 1. receive solr output xml .. convert it to a more readable xml form .. use
 named nodes instead of nodes by data type
 <str name="id">123</str>
 <str name="title">xyz</str>

 =>
 <id>123</id>
 <title>xyz</title>

 and then feed that to jasper report template

 2. use solrJ to receive solr output in the NamedList resultset as it
 returns
  ..I havent tried this method so I am not sure how useful or easy to work
 with this NamedList structure is .. in this I would be feeding a Collection of
 NamedList items to jasper .. havent played around with this so not sure how
 well its gonna work out .. if you have tried something like this please let
 me know how it worked out for u

 I would appreciate absolutely any kind of comments on this

 Thanks
 Adeel
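
Option 1 above can be sketched in a few lines (illustrative only; a real transform would also handle the typed elements like int/float/date and multi-valued fields):

```python
import xml.etree.ElementTree as ET

SOLR_XML = """<response><result>
  <doc><str name="id">123</str><str name="title">xyz</str></doc>
</result></response>"""

def to_named_nodes(solr_xml):
    # Rename each typed child (<str name="id">..) to a node named after the field.
    root = ET.fromstring(solr_xml)
    out = ET.Element("docs")
    for doc in root.iter("doc"):
        row = ET.SubElement(out, "doc")
        for field in doc:
            named = ET.SubElement(row, field.get("name"))
            named.text = field.text
    return ET.tostring(out, encoding="unicode")

print(to_named_nodes(SOLR_XML))
# <docs><doc><id>123</id><title>xyz</title></doc></docs>
```

The same mapping can of course be expressed as an XSLT template and applied server-side by the XsltResponseWriter Myron mentions.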



Re: Autocomplete: match words anywhere in the token

2010-09-23 Thread Chantal Ackermann
On Wed, 2010-09-22 at 20:14 +0200, Arunkumar Ayyavu wrote:
 Thanks for the responses. Now, I included the EdgeNGramFilter. But, I get
 the following results when I search for canon pixma.
 Canon PIXMA MP500 All-In-One Photo Printer
 Canon PowerShot SD500
 
 As you can guess, I'm not expecting the 2nd result entry. Though I
 understand why I'm getting the 2nd entry, I don't know how to ask Solr to
 exclude it (I could filter it in my application though). :-( Looks like I
 should study more of Solr's capabilities to get the solution.
 

This doesn't have that much to do with autosuggest anymore, does it?
I assume you put those quotes in to denote the search input, not to say that
the search input was a phrase. Had you searched for the phrase (quoted),
only the first line should have been found.

If you want the returned hits to include most of the searched
terms (and, in the case of only two input terms, both of them), you can
configure such sophisticated rules with the
http://wiki.apache.org/solr/DisMaxQParserPlugin
Have a look at the mm parameter (Minimum Should Match).

Chantal
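
A sketch of what such a request might look like (parameter values are illustrative, and qf depends on your schema):

```python
from urllib.parse import urlencode

# mm=100% requires every query term to match, which would keep
# "Canon PIXMA MP500 ..." and drop "Canon PowerShot SD500".
params = {
    "defType": "dismax",
    "q": "canon pixma",
    "qf": "name",   # whichever field(s) you search; schema-dependent
    "mm": "100%",
}
print("/select?" + urlencode(params))
```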



Custom Sorting with function queries

2010-09-23 Thread dl
I need to 'rank' the documents in a solr index based on some field values and 
the query. Is this possible using function queries?

Two examples to illustrate what I am trying to achieve:

The index contains two fields min_rooms and max_rooms, both integers, both 
optional. If I query the index for a value (rooms) I would like the documents 
that place this value between min and max to be ranked higher than those that 
don't. The smaller the difference between min and max is, the more exact a 
match the document is and the higher the document will be ranked. If either min 
or max or both are not specified then the document gets a 'negative rank'.
The index contains a float field. If, and only if, the query contains a search 
for this field (field:1 or field:on), then the value of the field affects the 
ranking of the document. (1, on, yes, etc can be solved with synonyms)

Lastly, once this 'custom ranking' works, how do I switch off Solr's built-in 
ranking calculations?

bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Andy
Hi,

I was going thru this LucidImagination presentation on analysis:

http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right

1) on p.31-33, it talks about forming bi-grams for the 32 most common terms 
during indexing. Is there an analyzer that does that?

2) on p. 34, it mentions that the default Solr configuration would turn "L'art" 
into the phrase query "L art", but it is much more efficient to turn it into a 
single token 'L art'. Which analyzer would do that?

Thanks.
Andy


  


Re: Solr Reporting

2010-09-23 Thread kenf_nc

keep in mind that the <str name="id"> paradigm isn't completely useless; the
str is a data type (string), and it can be int, float, double, date, and others.
So to not lose any information you may want to do something like:

<id type="int">123</id> 
<title type="str">xyz</title>

Which I agree makes more sense to me. The name of the field is more
important than its datatype, but I don't want to lose track of the data
type.

Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Steven A Rowe
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory


 -Original Message-
 From: Andy [mailto:angelf...@yahoo.com]
 Sent: Thursday, September 23, 2010 6:05 AM
 To: solr-user@lucene.apache.org
 Subject: bi-grams for common terms - any analyzers do that?
 
 Hi,
 
  I was going thru this LucidImagination presentation on analysis:
 
 http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-
 on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right
 
 1) on p.31-33, it talks about forming bi-grams for the 32 most common
 terms during indexing. Is there an analyzer that does that?
 
  2) on p. 34, it mentions that the default Solr configuration would turn
  "L'art" into the phrase query "L art", but it is much more efficient to
  turn it into a single token 'L art'. Which analyzer would do that?
 
 Thanks.
 Andy
 
 
 


Re: How can I delete the entire contents of the index?

2010-09-23 Thread kenf_nc

Quick tangent... I went to the link you provided, and the delete part makes
sense. But the next tip, how to re-index after a schema change. What is the
point of step

5. Send an <optimize/> command.

? Why do you need to optimize an empty index? Or is my understanding of
Optimize incorrect?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-delete-the-entire-contents-of-the-index-tp1565548p1567640.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Searches with a period (.) in the query

2010-09-23 Thread kenf_nc

Do you have any other Analyzers or Formatters involved? I use delimiters in
certain string fields all the time. Usually a colon ':' or slash '/', but it
should be the same for a period. I've never seen this behavior. But if you
have any kind of tokenizer or formatter involved beyond 
<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true" /> 
then you may be introducing something extra to the party.

What does your fieldType definition look like?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1567666.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: How can I delete the entire contents of the index?

2010-09-23 Thread Jonathan Rochkind
Because even after you've deleted every document from the index, there are 
still actually index _files_ on disk taking up space.  Lucene organizes its 
files for quick access, and a consequence of this is that deleting a document 
does not necessarily reclaim the disk space.   Optimize will reclaim that disk 
space. 

For deleting ALL documents in your index there's actually a shortcut though. 
Delete the entire solr 'data' directory and restart Solr, Solr will recreate 
the data directory with starter index files.  (Note you have to delete the 
directory itself, if you just delete all the files inside it, Solr will get 
unhappy).   I am somewhat suspicious of doing this and would never do it on a 
production index, but for just development playing around where it's not that 
disastrous if something goes wrong, it's a lot lot quicker than an actual 
delete command followed by an optimize. 

From: kenf_nc [ken.fos...@realestate.com]
Sent: Thursday, September 23, 2010 8:22 AM
To: solr-user@lucene.apache.org
Subject: Re: How can I delete the entire contents of the index?

Quick tangent... I went to the link you provided, and the delete part makes
sense. But the next tip, how to re-index after a schema change. What is the
point of step

5. Send an <optimize/> command.

? Why do you need to optimize an empty index? Or is my understanding of
Optimize incorrect?
--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-delete-the-entire-contents-of-the-index-tp1565548p1567640.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Jonathan Rochkind
I've been thinking about the CommonGramsFilter for a while, and am confused 
about how it works. Can anyone provide examples?  Are you meant to include the 
filter at both index and query time?  The description on the wiki says, among 
other things: "The CommonGramsQueryFilter converts the phrase query 'the cat' 
into the single term query the_cat." -- does that mean it _only_ works on 
phrase queries?  If you've indexed with commongrams, what will happen at 
query time to a non-phrase query 'the cat'?   Very confused. 

From: Steven A Rowe [sar...@syr.edu]
Sent: Thursday, September 23, 2010 8:21 AM
To: solr-user@lucene.apache.org
Subject: RE: bi-grams for common terms - any analyzers do that?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory


 -Original Message-
 From: Andy [mailto:angelf...@yahoo.com]
 Sent: Thursday, September 23, 2010 6:05 AM
 To: solr-user@lucene.apache.org
 Subject: bi-grams for common terms - any analyzers do that?

 Hi,

  I was going thru this LucidImagination presentation on analysis:

 http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-
 on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right

 1) on p.31-33, it talks about forming bi-grams for the 32 most common
 terms during indexing. Is there an analyzer that does that?

  2) on p. 34, it mentions that the default Solr configuration would turn
  "L'art" into the phrase query "L art", but it is much more efficient to
  turn it into a single token 'L art'. Which analyzer would do that?

 Thanks.
 Andy





Re: Xpath extract element name

2010-09-23 Thread yklxmas

Great. XSL worked like a charm! Thx lots.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Xpath-extract-element-name-tp1534390p1567809.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can I delete the entire contents of the index?

2010-09-23 Thread Michael McCandless
Lucene has an API for very fast deletion of the index (ie, it removes
the files): IndexWriter.deleteAll().  It's part of the transaction,
ie, you still must call .commit() to make the change visible to
external readers.

But I don't know whether this is exposed in Solr...

Mike

On Thu, Sep 23, 2010 at 8:50 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Because even after you've deleted every document from the index, there are 
 still actually index _files_ on disk taking up space.  Lucene organizes its 
 files for quick access, and a consequence of this is that deleting a document 
 does not necessarily reclaim the disk space.   Optimize will reclaim that 
 disk space.

 For deleting ALL documents in your index there's actually a shortcut though. 
 Delete the entire solr 'data' directory and restart Solr, Solr will recreate 
 the data directory with starter index files.  (Note you have to delete the 
 directory itself, if you just delete all the files inside it, Solr will get 
 unhappy).   I am somewhat suspicious of doing this and would never do it on a 
 production index, but for just development playing around where it's not that 
 disastrous if something goes wrong, it's a lot lot quicker than an actual 
 delete command followed by an optimize.
 
 From: kenf_nc [ken.fos...@realestate.com]
 Sent: Thursday, September 23, 2010 8:22 AM
 To: solr-user@lucene.apache.org
 Subject: Re: How can I delete the entire contents of the index?

 Quick tangent... I went to the link you provided, and the delete part makes
 sense. But the next tip, how to re-index after a schema change. What is the
 point of step

    5. Send an <optimize/> command.

 ? Why do you need to optimize an empty index? Or is my understanding of
 Optimize incorrect?
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-can-I-delete-the-entire-contents-of-the-index-tp1565548p1567640.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Reporting

2010-09-23 Thread Adeel Qureshi
Thank you for your suggestions .. makes sense and I didn't know about the
XsltResponseWriter .. that opens up the door to all kinds of possibilities ..so
its great to know about that

but before I go that route .. what about performance .. In Solr Wiki it
mentions that XSLT transformation isnt so bad in terms of memory usage but I
guess its all relative to the amount of data and obviously system resources
..

my data set will be around 15,000 - 30,000 records at the most ..I do have
about 30 some fields but all fields are either small strings (less than 500
chars) or dates, ints, booleans etc .. so should I be worried about
performance problems while doing the XSLT translations .. secondly for
reports Ill have to request solr to send all 15000 some records at the same
time to be entered in report output files .. is there a way to kind of
stream that process .. well I think Solr native xml is already streamed to
you but sounds like for the translation it will have to load the whole thing
in RAM ..

and again what about SolrJ .. isnt that supposed to provide better
performance since its in java .. well I guess it shouldnt be much different
since it also uses the HTTP calls to communicate to Solr ..

Thanks for your help
Adeel

On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com wrote:


 keep in mind that the <str name="id"> paradigm isn't completely useless;
 the str is a data type (string), and it can be int, float, double, date, and
 others. So to not lose any information you may want to do something like:

 <id type="int">123</id>
 <title type="str">xyz</title>

 Which I agree makes more sense to me. The name of the field is more
 important than its datatype, but I don't want to lose track of the data
 type.

 Ken
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How can I delete the entire contents of the index?

2010-09-23 Thread Chris Hostetter
: Lucene has an API for very fast deletion of the index (ie, it removes
: the files): IndexWriter.deleteAll().  It's part of the transaction,
...
: But I don't know whether this is exposed in Solr...

Solr definitely has optimized the delete *:* case (but I don't know if 
it's using that specific method).

I believe the poster is getting confused because immediately following 
this FAQ...

http://wiki.apache.org/solr/FAQ#How_can_I_delete_all_documents_from_my_index.3F

which says to use <delete><query>*:*</query></delete> and which specifically 
notes: "This has been optimized to be more efficient than deleting by some 
arbitrary query which matches all docs because of the nature of the data."

...was this FAQ...

http://wiki.apache.org/solr/FAQ#How_can_I_rebuild_my_index_from_scratch_if_I_change_my_schema.3F

...which until a moment ago gave outdated advice.

-Hoss

--

http://lucenerevolution.org/  ...  October 7-8, Boston
http://bit.ly/stump-hoss  ...  Stump The Chump!



Re: Searches with a period (.) in the query

2010-09-23 Thread Siddharth Powar
Hey Ken,

The fieldType definition that I am using is:
<fieldType name="string" class="solr.StrField"
sortMissingLast="true" omitNorms="true" />

Thanks,
Sid

On Thu, Sep 23, 2010 at 5:29 AM, kenf_nc ken.fos...@realestate.com wrote:


 Do you have any other Analyzers or Formatters involved? I use delimiters in
 certain string fields all the time. Usually a colon ':' or slash '/', but it
 should be the same for a period. I've never seen this behavior. But if you
 have any kind of tokenizer or formatter involved beyond
 <fieldType name="string" class="solr.StrField" sortMissingLast="true"
 omitNorms="true" />
 then you may be introducing something extra to the party.

 What does your fieldType definition look like?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1567666.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Burton-West, Tom
Hi all,

The CommonGrams filter is designed to only work on phrase queries.  It is 
designed to solve the problem of slow phrase queries with phrases containing 
common words, when you don't want to use stop words.  It would not make sense 
for Boolean queries. Boolean queries just get passed through unchanged. 

For background on the CommonGramsFilter please see: 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

There are two filters, CommonGramsFilter and CommonGramsQueryFilter: you use 
CommonGramsFilter at index time and CommonGramsQueryFilter for query processing.  
CommonGramsFilter outputs both common grams and unigrams so that Boolean queries 
(i.e. non-phrase queries) will work.  For example "the rain" would produce 3 
tokens:
the (position 1)
rain (position 2)
the_rain (position 1)
When you have a phrase query, you want Solr to search for the token the_rain, 
so you don't want the unigrams.
When you have a Boolean query, the CommonGramsQueryFilter only gets one token 
as input and simply outputs it.
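
Tom's token example can be mimicked with a rough, simplified simulation (this is not the actual Lucene filter, just an illustration of emitting unigrams plus common grams at the same positions):

```python
def common_grams(tokens, common={"the", "a", "of"}):
    """Emit each unigram, plus a joined common-gram whenever a pair
    contains a common word; the gram shares the first token's position."""
    out = []
    for i, tok in enumerate(tokens):
        out.append((tok, i))
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append((tok + "_" + tokens[i + 1], i))
    return out

print(common_grams(["the", "rain"]))
# [('the', 0), ('the_rain', 0), ('rain', 1)]
```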

Appended below is a sample config from our schema.xml.

For background on the problem with "l'art" please see: 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance  We 
used a custom filter to change all punctuation to spaces.  You could probably 
use one of the other filters to do this. (See the comments from David Smiley at 
the end of the blog post regarding possible approaches.)  At the time, I just 
couldn't get WordDelimiterFilter to behave as documented with various 
combinations of parameters and was not aware of the other filters David 
mentions.

The problem with "l'art" is actually due to a bug or feature in the 
QueryParser.  Currently the QueryParser interacts with the token chain and 
decides whether the tokens coming back from a tokenfilter should be treated as 
a phrase query, based on whether or not more than one non-synonym token comes 
back from the tokenstream for a single 'queryparser token'.
It also splits on whitespace, which causes all CJK queries to be treated as 
phrase queries regardless of the CJK tokenizer you use. This is a contentious 
issue.  See https://issues.apache.org/jira/browse/LUCENE-2458.  There is a 
semi-workaround using PositionFilter, but it has many undesirable side effects. 
I believe Robert Muir, who is an expert on the various problems involved and 
opened LUCENE-2458, is working on a better fix.

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search



<fieldType name="CommonGramTest" class="solr.TextField"
positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="new400common.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="ISOLatin1AccentFilterFactory"/>
    <filter class="solr.PunctuationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="new400common.txt"/>
  </analyzer>
</fieldType>


Re: is indexing single-threaded?

2010-09-23 Thread Dennis Gearon
I was kind of wondering what magic had been done to achieve multiple writers on 
the index file :-)

BTW, wouldn't it be possible to have separate segments per thread? Set up the 
index with a minimum (desired?) segment count, and write each individually?

Is there any organization in the segments? Or can adjacent data be found in 
different segments?

I seem to remember that the new stuff gets committed to its own segment until 
some sort of 'consolidate' command takes place.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/23/10, Jan Høydahl / Cominvent jan@cominvent.com wrote:

 From: Jan Høydahl / Cominvent jan@cominvent.com
 Subject: Re: is indexing single-threaded?
 To: solr-user@lucene.apache.org
 Date: Thursday, September 23, 2010, 1:42 AM
 SolrJ threads speed up feeding
 throughput. But building the index is still single threaded
 (per core), isn't it? Don't know about analysis. But you
 cannot have two threads write to the same file...
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 
 On 23. sep. 2010, at 08.01, Ryan McKinley wrote:
 
  Multiple threads work well.
  
  If you are using solrj, check the StreamingSolrServer
 for an
  implementation that will keep X number of threads
 busy.
  
  Your mileage will vary, but in general I find a
 reasonable thread
  count is ~ (number of cores)+1
  
  
  On Wed, Sep 22, 2010 at 5:52 AM, Andy angelf...@yahoo.com
 wrote:
  Does Solr index data in a single thread or can
 data be indexed concurrently in multiple threads?
  
  Thanks
  Andy
  
  
  
  
 



Re: matches in result grouping

2010-09-23 Thread Koji Sekiguchi
 (10/09/23 18:14), Koji Sekiguchi wrote:
  I'm using recent committed field collapsing / result grouping
 feature in trunk.

 I'm confused by the matches parameter in the result at the second
 sample output on the Wiki:

 http://wiki.apache.org/solr/FieldCollapsing#Quick_Start

 I cannot understand why there are two matches:5 entries
 in the result. Can anyone explain it?
Probably multiple GroupCollectors are generated for each group.field,
group.func and group.query and match can be counted per collector.

Koji

-- 
http://www.rondhuit.com/en/



Re: matches in result grouping

2010-09-23 Thread Yonik Seeley
2010/9/23 Koji Sekiguchi k...@r.email.ne.jp:
  (10/09/23 18:14), Koji Sekiguchi wrote:
  I'm using recent committed field collapsing / result grouping
 feature in trunk.

  I'm confused by the matches parameter in the result at the second
 sample output on the Wiki:

 http://wiki.apache.org/solr/FieldCollapsing#Quick_Start

 I cannot understand why there are two matches:5 entries
 in the result. Can anyone explain it?
 Probably multiple GroupCollectors are generated for each group.field,
 group.func and group.query and match can be counted per collector.

Correct.  The matches is the doc count before any grouping (and for
group.query that means before the restriction given by group.query is
applied).  It won't always be the same though - for example we might
implement filter excludes like we do with faceting, etc.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Solr Reporting

2010-09-23 Thread Peter Sturge
Hi,

Are you going to generate a report with 30,000 records in it? That will
be a very large report - will anyone really want to read through that?
If you want/need 'summary' reports - i.e. stats on the 30k records -
it is much more efficient to set up faceting and/or server-side
analysis to do this, rather than download
30,000 records to a client, then do statistical analysis on the result.
It will take a while to stream 30,000 records over an http connection,
and, if you're building, say, a PDF table for 30k records, that will
take some time as well.
Doing the analysis server-side and just sending the results will work better, if
that fits your remit for reporting.

Peter



On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com wrote:
 Thank you for your suggestions .. makes sense and I didn't know about the
 XsltResponseWriter .. that opens up the door to all kinds of possibilities ..so
 its great to know about that

 but before I go that route .. what about performance .. In Solr Wiki it
 mentions that XSLT transformation isnt so bad in terms of memory usage but I
 guess its all relative to the amount of data and obviously system resources
 ..

 my data set will be around 15,000 - 30,000 records at the most ..I do have
 about 30 some fields but all fields are either small strings (less than 500
 chars) or dates, ints, booleans etc .. so should I be worried about
 performance problems while doing the XSLT translations .. secondly for
 reports Ill have to request solr to send all 15000 some records at the same
 time to be entered in report output files .. is there a way to kind of
 stream that process .. well I think Solr native xml is already streamed to
 you but sounds like for the translation it will have to load the whole thing
 in RAM ..

 and again what about SolrJ .. isnt that supposed to provide better
 performance since its in java .. well I guess it shouldnt be much different
 since it also uses the HTTP calls to communicate to Solr ..

 Thanks for your help
 Adeel

 On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com wrote:


 keep in mind that the <str name="id"> paradigm isn't completely useless;
 the str is a data type (string), and it can be int, float, double, date, and
 others. So to not lose any information you may want to do something like:

 <id type="int">123</id>
 <title type="str">xyz</title>

 Which I agree makes more sense to me. The name of the field is more
 important than its datatype, but I don't want to lose track of the data
 type.

 Ken
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Robert Muir
On Thu, Sep 23, 2010 at 12:02 PM, Burton-West, Tom tburt...@umich.edu wrote:

 The problem with l'art is actually due to a bug or feature in the
 QueryParser.  Currently the QueryParser interacts with the token chain and
 decides whether the tokens coming back from a tokenfilter should be treated
 as a phrase query based on whether or not more than one non-synonym token
 comes back from the tokestream for a single 'queryparser token'.


Just a note: in solr's trunk or 3x branch you have a lot more flexibility
already with this stuff:

1. for the specific problem of l'art: you can use the ElisionFilterFactory,
its actually designed to address this. But before it was a bit unwieldy to
use (you had to supply your own list of french contractions: l', m', etc):
with trunk or 3x you can just add it to your analyzer, if you don't specify
a list it uses the default list from Lucene's FrenchAnalyzer.

2. if you are using WordDelimiterFilter, you can customize how it splits on
a per-character basis. See https://issues.apache.org/jira/browse/SOLR-2059 ,
a user gave a nice example there of how you can treat '#' and '@' special
for twitter messages.

3. in all cases, if you don't want phrase queries automatically formed
unless the user put them in quotes, you can turn it off in your fieldtype:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="false">
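
A sketch of point 1, in case it helps: an analyzer chain using the ElisionFilterFactory with no articles list, so it falls back to the default French contractions (the field name is made up):

```xml
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- no "articles" attribute: uses the default French list (l', m', d', ...) -->
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, l'art is indexed as the single token art.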

(somewhat related)
Tom, thanks for posting your schema. Given your problems with huge amounts
of terms, I looked at your previous messages and ran some quick math and
guesstimated that your average term length must be quite large.

Yet I notice from your website (
http://www.hathitrust.org/visualizations_languages) it says you have 18,329
Thai books (and you have no ThaiWordFilter in your schema).

Are you sure that your terms are not filled with tons of very long
untokenized Thai sentences? (Thai uses no spaces between words.) Just an idea
:)

-- 
Robert Muir
rcm...@gmail.com


Re: Issue with Solr Boosting

2010-09-23 Thread Jak Akdemir
I think if you don't need to add more categories, just increasing the
boost factor of Electronics would work.
As you said, because of the DocFreq of Mobile Phones, the scoring
algorithm is working as expected.

On Thu, Sep 23, 2010 at 3:42 PM, Jayant Patil
jayan...@peopleinteractive.in wrote:
 Hi,

 We are using Solr for our searches. We are facing issues while applying
 boost on particular fields.
 E.g.
 We have a field Category, which contains values like Electronics,
 Computers, Home Appliances, Mobile Phones etc.
 We want to boost the category Electronics and Mobile Phones, we are using
 the following query
 (category:Electronics^2 OR category:"Mobile Phones"^1 OR category:[* TO *]^0)

 The results are unexpected as Category Mobile Phones gets more boost than
 Electronics even if we are specifying the boost factor 2 for electronics
 and 1 for mobile phones respectively.
 On debugging we found that DocFreq is manipulating the scores and hence
 affecting the overall boost. The no. of docs for mobile phones is much
 lower than that for electronics and solr is giving higher score to mobile
 phones for this reason.

 Please suggest a solution.

 Regards,
 Jayant


 
 People Interactive DISCLAIMER and CONFIDENTIALITY CAUTION
 
 This email and any files transmitted with it are confidential and intended 
 solely for the use of the individual or entity to whom they are addressed. 
 Unauthorized reading, dissemination, distribution or copying of this 
 communication is prohibited. If you are not the intended recipient, any 
 disclosure, copying, distribution or any action taken or omitted to be taken 
 in reliance on it, is prohibited and may be unlawful. If you have received 
 this communication in error, please notify us immediately and promptly 
 destroy the original communication. Thank you for your cooperation.

 Please note that any views or opinions presented in this email are solely 
 those of the author and may not necessarily represent those of the company. 
 Communicating through email is not secure and capable of interception, 
 corruption and delays. Anyone communicating with People Interactive (I) 
 Private Limited by email accepts the risks involved and their consequences. 
 The recipient should check this email and any attachments for the presence of 
 viruses. People Interactive (I) Private Limited accepts no liability for any 
 damage caused by any virus transmitted by this email.




Re: Searches with a period (.) in the query

2010-09-23 Thread Jak Akdemir
Siddharth, did you check the tokenizer and filter behaviour on the
../admin/analysis.jsp page? That would be quite informative.


On Thu, Sep 23, 2010 at 6:42 PM, Siddharth Powar
powar.siddha...@gmail.com wrote:
 Hey Ken,

 The fieldType definition that I am using is:
 <fieldType name="string" class="solr.StrField"
 sortMissingLast="true" omitNorms="true" />

 Thanks,
 Sid

 On Thu, Sep 23, 2010 at 5:29 AM, kenf_nc ken.fos...@realestate.com wrote:


 Do you have any other Analyzers or Formatters involved? I use delimiters in
 certain string fields all the time. Usually a colon (:) or slash (/), but it
 should be the same for a period. I've never seen this behavior. But if you
 have any kind of tokenizer or formatter involved beyond
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
 omitNorms="true" />
 then you may be introducing something extra to the party.

 What does your fieldType definition look like?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1567666.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solr Reporting

2010-09-23 Thread Adeel Qureshi
Hi Peter

I understand what you are saying but I think you are thinking more of report
as graph and analysis and summary kind of data .. for my reports I do need
to include all records that qualify certain criteria .. e.g. a listing of
all orders placed in last 6 months .. now that could be 10,000 orders and yes
I will need probably a report that summarizes all that data but at the same
time .. I need all those 10,000 records to be exported in an excel file ..
those are the reports that I am talking about ..

and 30,000 probably is a stretch .. it might be 10-15000 at the most but I
guess its still the same idea .. and yes I realize that its alot of data to
be transferred over http .. but thats exactly why i am asking for suggestion
on how to do .. I find it hard to believe that this is an unusual
requirement .. I think most companies do reports that dump all records from
databases in excel files ..

so again to clarify I definitely need reports that present statistics and
averages and yes I will be using facets and all kind of stuff there and I am
not so concerned about those reports because like you pointed out, for those
reports there will be very little data transfer but its the full data dump
reports that I am trying to figure out the best way to handle.

Thanks for your help
Adeel



On Thu, Sep 23, 2010 at 11:43 AM, Peter Sturge peter.stu...@gmail.com wrote:

 Hi,

 Are you going to generate a report with 30,000 records in it? That will
 be a very large report - will anyone really want to read through that?
 If you want/need 'summary' reports - i.e. stats on the 30k records,
 it is much more efficient to set up faceting and/or server-side
 analysis to do this, rather than download
 30,000 records to a client, then do statistical analysis on the result.
 It will take a while to stream 30,000 records over an http connection,
 and, if you're building, say, a PDF table for 30k records, that will
 take some time as well.
 Server-side analysis then just send the results will work better, if
 that fits your remit for reporting.

 Peter



 On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com
 wrote:
  Thank you for your suggestions .. makes sense and I didnt knew about the
  XsltResponseWriter .. that opens up door to all kind of possibilities
 ..so
  its great to know about that
 
  but before I go that route .. what about performance .. In Solr Wiki it
  mentions that XSLT transformation isnt so bad in terms of memory usage
 but I
  guess its all relative to the amount of data and obviously system
 resources
  ..
 
  my data set will be around 15000 - 30'000 records at the most ..I do have
  about 30 some fields but all fields are either small strings (less than
 500
  chars) or dates, int, booleans etc .. so should I be worried about
  performances problems while doing the XSLT translations .. secondly for
  reports Ill have to request solr to send all 15000 some records at the
 same
  time to be entered in report output files .. is there a way to kind of
  stream that process .. well I think Solr native xml is already streamed
 to
  you but sounds like for the translation it will have to load the whole
 thing
  in RAM ..
 
  and again what about SolrJ .. isnt that supposed to provide better
  performance since its in java .. well I guess it shouldnt be much
 different
  since it also uses the HTTP calls to communicate to Solr ..
 
  Thanks for your help
  Adeel
 
  On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com
 wrote:
 
 
  keep in mind that the <str name="id"> paradigm isn't completely useless;
  the str is a data type (string), it can be int, float, double, date, and
  others.
  So as not to lose any information you may want to do something like:

  <id type="int">123</id>
  <title type="str">xyz</title>

  Which I agree makes more sense to me. The name of the field is more
  important than its datatype, but I don't want to lose track of the data
  type.
 
  Ken
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Calgary Solr Consultant?

2010-09-23 Thread Ryan Courtnage
Hi,

I'm looking for a Solr expert local to Calgary, Alberta to help us jumpstart
a search project.

Ryan Courtnage

PS: apologies if this is the wrong list for this type of request.


Grouping in solr ?

2010-09-23 Thread Papp Richard
Hi all,

  is it possible somehow to group documents?
  I have services as documents, and I would like to show the filtered
services grouped by company. 
  So I filter services by given criteria, but I show the results grouped by
company.
  If I got 1000 services, maybe I need to show just 100 companies (this will
affect pagination as well), and how could I get the company info? Should I
store the company info in each service (I don't need the company info to be
indexed)?

regards,
  Rich
 

__ Information from ESET NOD32 Antivirus, version of virus signature
database 5419 (20100902) __

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com
 



RE: Grouping in solr ?

2010-09-23 Thread Markus Jelsma
http://wiki.apache.org/solr/FieldCollapsing

https://issues.apache.org/jira/browse/SOLR-236
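
To make the wiki page concrete, a request using the result-grouping parameters described there might look like the sketch below. The host, core path, and field names (company_id, service_type) are made up; the group.* parameters need a trunk/4.x build, and the SOLR-236 patch uses different (collapse.*) parameter names.

```python
from urllib.parse import urlencode

# Sketch of a result-grouping ("field collapsing") request.
params = {
    "q": "*:*",
    "fq": "service_type:massage",  # whatever criteria filter the services
    "group": "true",
    "group.field": "company_id",   # one group per company
    "group.limit": "3",            # up to 3 services shown per company
    "rows": "10",                  # pagination now counts groups (companies)
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Storing the (unindexed, stored-only) company info on each service document then lets you render each group header from its first hit.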

 
-Original message-
From: Papp Richard ccode...@gmail.com
Sent: Thu 23-09-2010 21:29
To: solr-user@lucene.apache.org; 
Subject: Grouping in solr ?

Hi all,

 is it possible somehow to group documents?
 I have services as documents, and I would like to show the filtered
services grouped by company. 
 So I filter services by given criteria, but I show the results grouped by
company.
 If I got 1000 services, maybe I need to show just 100 companies (this will
affect pagination as well), and how could I get the company info? Should I
store the company info in each service (I don't need the company info to be
indexed)?

regards,
 Rich






Re: Solr Reporting

2010-09-23 Thread Peter Sturge
Yes, that makes sense. So, more of a bulk data export requirement.
If the excel data doesn't have to go out on the web, you could export
to a local file (using a local solrj streamer), then publish it,
which might save some external http bandwidth if that's a concern.
We do this all the time using a local solrj client, so if you've got a
big data stream (e.g. an entire core), you don't
have to send it through your outward-facing web servers. Using a
replica to retrieve/export the data might be worth considering as
well.
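
One way to sketch the chunked pull Peter describes, assuming a Solr instance at localhost and plain HTTP paging with start/rows (any client, SolrJ or otherwise, can follow the same pattern):

```python
from urllib.parse import urlencode

def export_urls(base, query, total, page_size=1000):
    """Yield paged request URLs so a large result set can be pulled in
    chunks instead of one huge HTTP response."""
    for start in range(0, total, page_size):
        params = {"q": query, "wt": "json", "start": start, "rows": page_size}
        yield base + "?" + urlencode(params)

# Host, core path, and the 30,000-record total are assumptions for illustration.
urls = list(export_urls("http://localhost:8983/solr/select", "*:*", 30000))
```

Fetching each page in turn and appending rows to the local file keeps client memory flat regardless of the total export size.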


On Thu, Sep 23, 2010 at 7:21 PM, Adeel Qureshi adeelmahm...@gmail.com wrote:
 Hi Peter

 I understand what you are saying but I think you are thinking more of report
 as graph and analysis and summary kind of data .. for my reports I do need
 to include all records that qualify certain criteria .. e.g. a listing of
 all orders placed in last 6 months .. now that could be 10,000 orders and yes
 I will need probably a report that summarizes all that data but at the same
 time .. I need all those 10,000 records to be exported in an excel file ..
 those are the reports that I am talking about ..

 and 30,000 probably is a stretch .. it might be 10-15000 at the most but I
 guess its still the same idea .. and yes I realize that its alot of data to
 be transferred over http .. but thats exactly why i am asking for suggestion
 on how to do .. I find it hard to believe that this is an unusual
 requirement .. I think most companies do reports that dump all records from
 databases in excel files ..

 so again to clarify I definitely need reports that present statistics and
 averages and yes I will be using facets and all kind of stuff there and I am
 not so concerned about those reports because like you pointed out, for those
 reports there will be very little data transfer but its the full data dump
 reports that I am trying to figure out the best way to handle.

 Thanks for your help
 Adeel



 On Thu, Sep 23, 2010 at 11:43 AM, Peter Sturge peter.stu...@gmail.com wrote:

 Hi,

 Are you going to generate a report with 30,000 records in it? That will
 be a very large report - will anyone really want to read through that?
 If you want/need 'summary' reports - i.e. stats on the 30k records,
 it is much more efficient to set up faceting and/or server-side
 analysis to do this, rather than download
 30,000 records to a client, then do statistical analysis on the result.
 It will take a while to stream 30,000 records over an http connection,
 and, if you're building, say, a PDF table for 30k records, that will
 take some time as well.
 Server-side analysis then just send the results will work better, if
 that fits your remit for reporting.

 Peter



 On Thu, Sep 23, 2010 at 4:14 PM, Adeel Qureshi adeelmahm...@gmail.com
 wrote:
  Thank you for your suggestions .. makes sense and I didnt knew about the
  XsltResponseWriter .. that opens up door to all kind of possibilities
 ..so
  its great to know about that
 
  but before I go that route .. what about performance .. In Solr Wiki it
  mentions that XSLT transformation isnt so bad in terms of memory usage
 but I
  guess its all relative to the amount of data and obviously system
 resources
  ..
 
  my data set will be around 15000 - 30'000 records at the most ..I do have
  about 30 some fields but all fields are either small strings (less than
 500
  chars) or dates, int, booleans etc .. so should I be worried about
  performances problems while doing the XSLT translations .. secondly for
  reports Ill have to request solr to send all 15000 some records at the
 same
  time to be entered in report output files .. is there a way to kind of
  stream that process .. well I think Solr native xml is already streamed
 to
  you but sounds like for the translation it will have to load the whole
 thing
  in RAM ..
 
  and again what about SolrJ .. isnt that supposed to provide better
  performance since its in java .. well I guess it shouldnt be much
 different
  since it also uses the HTTP calls to communicate to Solr ..
 
  Thanks for your help
  Adeel
 
  On Thu, Sep 23, 2010 at 7:16 AM, kenf_nc ken.fos...@realestate.com
 wrote:
 
 
  keep in mind that the <str name="id"> paradigm isn't completely useless;
  the str is a data type (string), it can be int, float, double, date, and
  others.
  So as not to lose any information you may want to do something like:

  <id type="int">123</id>
  <title type="str">xyz</title>

  Which I agree makes more sense to me. The name of the field is more
  important than its datatype, but I don't want to lose track of the data
  type.
 
  Ken
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Solr-Reporting-tp1565271p1567604.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 




Re: Autocomplete: match words anywhere in the token

2010-09-23 Thread Jonathan Rochkind
This works with _one_ entry per document, right?   If you've actually 
found a clever trick to use this technique when you have more than one 
entry for auto-suggest per document, do let me know.  Cause I haven't 
been able to come with one.


Jonathan

Chantal Ackermann wrote:

What works very good for me:

1.) Keep the tokenized field (KeywordTokenizerFilter,
WordDelimiterFilter) (like you described you had)
 2.) create an additional field that uses the String type with the
 same content (use a copyField to fill either)
3.) use facet.prefix instead of terms.prefix for searching the
suggestions
4.) to your query add also the String field as a facet, and return the
results from that field as suggestion list. They will include the
complete String canon pixma mp500 for example. The other field can
only return facets based on tokens. You probably never want that as
facets.

So your query was alright and the canon (2) facet count probably is
the two occurrences that you listed, but as the field was tokenized,
only tokens would be returned as facets. You need to have an additional
field of pure String type to get the complete value as a facet back.

In general, it worked out fine for me to create String fields as return
values for facets while using the tokenized fields for searching and the
actual facet queries.

Cheers,
Chantal


On Wed, 2010-09-22 at 16:39 +0200, Jason Rutherglen wrote:
  

This may be what you're looking for.
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

On Wed, Sep 22, 2010 at 4:41 AM, Arunkumar Ayyavu
arunkumar.ayy...@gmail.com wrote:


It's been over a week since I started learning Solr. Now, I'm using the
electronics store example to explore the autocomplete feature in Solr.

When I send the query terms.fl=nameterms.prefix=canon to terms request
handler, I get the following response
lst name=terms
 lst name=name
  int name=canon2/int
 /lst
/lst

But I expect the following results in the response.
canon pixma mp500 all-in-one photo printer
canon powershot sd500

So, I changed the schema for textgen fieldType to use
KeywordTokenizerFactory and also removed WordDelimiterFilterFactory. That
gives me the expected result.

Now, I also want the Solr to return canon pixma mp500 all-in-one photo
printer  when I send the query terms.fl=nameterms.prefix=pixma. Could you
gurus help me get the expected result?

BTW, I couldn't quite understand the behavior of terms.lower and terms.upper
(I tried these with the electronics store example). Could you also help me
understand these 2 query fields?
Thanks.

--
Arun
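
Putting Chantal's recipe together, a suggestion request might look like this sketch; the untokenized copy-field name (name_exact) and the host are assumptions:

```python
from urllib.parse import urlencode

# Autocomplete via facet.prefix: facet on an untokenized String copy of the
# field so whole titles come back as suggestions.
prefix = "canon"
params = {
    "q": "*:*",
    "rows": "0",                  # only the facet list is needed, no docs
    "facet": "true",
    "facet.field": "name_exact",  # String copy of "name" via copyField
    "facet.prefix": prefix,
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Each returned facet value is a complete stored title such as "canon powershot sd500", with its count.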

  



  


Range query not working

2010-09-23 Thread PeterKerk

I have this in my query:
 q=*:*&facet.query=location_rating_total:[3 TO 100]

And this document:
<result name="response" numFound="6" start="0" maxScore="1.0">
  <doc>
    <float name="score">1.0</float>
    <str name="id">1</str>
    <int name="location_rating_total">2</int>
  </doc>

But still my total results equals 6 (total population) and not 0 as I would
expect

Why?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Range-query-not-working-tp1570324p1570324.html
Sent from the Solr - User mailing list archive at Nabble.com.


Search a URL

2010-09-23 Thread Max Lynch
Is there a tokenizer that will allow me to search for parts of a URL?  For
example, the search "google" would match on the data
"http://mail.google.com/dlkjadf"

This tokenizer factory doesn't seem to be sufficient:

<fieldType name="text_standard" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="0" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
        language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="0" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
        language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

Thanks.


Re: Range query not working

2010-09-23 Thread Yonik Seeley
On Thu, Sep 23, 2010 at 4:30 PM, PeterKerk vettepa...@hotmail.com wrote:
 I have this in my query:
  q=*:*&facet.query=location_rating_total:[3 TO 100]

 And this document:
 <result name="response" numFound="6" start="0" maxScore="1.0">
   <doc>
     <float name="score">1.0</float>
     <str name="id">1</str>
     <int name="location_rating_total">2</int>
   </doc>

 But still my total results equals 6 (total population) and not 0 as I would
 expect

 Why?

facet.query will give you the number of docs matching
location_rating_total:[3 TO 100], it does not restrict the results
list.  If you want that, you want a filter.

Try
q=*:*&fq=location_rating_total:[3 TO 100]

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Range query not working

2010-09-23 Thread PeterKerk

Forgot to mention..I tried that too already.

So when I have:
location_rating_total:[0 TO 100]

It shows only the location for which the location_rating_total is EXACTLY
0...locations that have location_rating_total value of 2 are NOT included.

Any other suggestions?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Range-query-not-working-tp1570324p1570502.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search a URL

2010-09-23 Thread dl
LetterTokenizerFactory will use each contiguous sequence of letters and discard
the rest. http, https, com, etc. would need to be stopwords.

Alternatively you can try PatternTokenizerFactory with a regular expression if 
you are looking for a specific part of the URL.
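
A sketch of the PatternTokenizerFactory approach (the split pattern is an assumption; tune it to keep the URL parts you care about):

```xml
<fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on anything that is not a letter or digit:
         "http://mail.google.com/dlkjadf" -> http | mail | google | com | dlkjadf -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^A-Za-z0-9]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this, a query for "google" matches any document whose URL contains google as a component.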

On Sep 23, 2010, at 10:59 PM, Max Lynch wrote:

 Is there a tokenizer that will allow me to search for parts of a URL?  For
 example, the search "google" would match on the data
 "http://mail.google.com/dlkjadf"
 
 This tokenizer factory doesn't seem to be sufficient:
 
 <fieldType name="text_standard" class="solr.TextField"
     positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="0" generateNumberParts="1" catenateWords="1"
         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"
         language="English" protected="protwords.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="0" generateNumberParts="1" catenateWords="1"
         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"
         language="English" protected="protwords.txt"/>
   </analyzer>
 </fieldType>
 
 Thanks.



Re: Range query not working

2010-09-23 Thread PeterKerk

This is the field in my schema.xml:

<field name="location_rating_total" type="integer" indexed="true"
stored="true"/>

Also in the response it clearly shows:
<int name="location_rating_total">0</int>

What else can I do?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Range-query-not-working-tp1570324p1570580.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Generating a sitemap

2010-09-23 Thread Doki

Hi all,

 Hate to bring forward a zombified thread (Mar 2010 though, not too
bad), but I also am tasked to generate a sitemap for items indexed in a Solr
index.  Been at this job for only a few weeks, so Solr and Lucene are all
new to me, but I think my path forward on this is to create a requesthandler
that creates a flat datafile upon request, then program a script (Php) that
calls this request, reformats the data into the appropriate xml format, then
posts it for Google to find and crawl.  Attach this script to a crontab item
(daily, weekly, whatever schedule the Google Webmaster Tools has set for the
site), and Boom!  Problem solved.
 Anyone else try this method?  Any successes, failures, advice, etc?

Dave
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Generating-a-sitemap-tp478346p1570641.html
Sent from the Solr - User mailing list archive at Nabble.com.
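
The reformat-and-post step described above can be sketched as follows (Python rather than PHP for brevity; the per-document URL scheme is an assumption):

```python
from xml.sax.saxutils import escape

def sitemap(page_urls):
    """Render a minimal sitemap.xml body from a list of page URLs
    (one <url><loc> entry per page)."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for u in page_urls:
        lines.append("  <url><loc>%s</loc></url>" % escape(u))
    lines.append("</urlset>")
    return "\n".join(lines)

# Each Solr doc would map to a page URL first; this URL scheme is hypothetical.
print(sitemap(["http://example.com/item/1", "http://example.com/item/2"]))
```

A cron job can then write the result to the path registered in Google Webmaster Tools.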


RE: Grouping in solr ?

2010-09-23 Thread Papp Richard
thank you!
this is really helpful. just tried it and it's amazing.
do you know how trustworthy a nightly build (solr4) is?

Rich

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@buyways.nl] 
Sent: Thursday, September 23, 2010 22:38
To: solr-user@lucene.apache.org
Subject: RE: Grouping in solr ?

http://wiki.apache.org/solr/FieldCollapsing

https://issues.apache.org/jira/browse/SOLR-236

 
-Original message-
From: Papp Richard ccode...@gmail.com
Sent: Thu 23-09-2010 21:29
To: solr-user@lucene.apache.org; 
Subject: Grouping in solr ?

Hi all,

 is it possible somehow to group documents?
 I have services as documents, and I would like to show the filtered
services grouped by company. 
 So I filter services by given criteria, but I show the results grouped by
company.
 If I got 1000 services, maybe I need to show just 100 companies (this will
affect pagination as well), and how could I get the company info? Should I
store the company info in each service (I don't need the company info to be
indexed)?

regards,
 Rich





Re: Range query not working

2010-09-23 Thread Yonik Seeley
On Thu, Sep 23, 2010 at 5:44 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 The field type in a standard schema.xml that's defined as integer is NOT
 sortable.

Right - before 1.4.  There is no integer field type in 1.4 and
beyond in the example schema.

 You can not sort on this and get what you want. (What's the point
 of it even existing then, if it pretty much does the same thing as a string
 field?)

You can sort on it... you just can't do range queries on it because
the term order isn't correct for numerics.
It's there only for support of legacy lucene indexes that indexed
numerics as plain strings.
They are now named "pint" for "plain integer" in 1.4 and above.

Perhaps we should retain support for that, but remove them from the
example schema and only document them somewhere (under "supporting
lucene indexes built by other software" or something?)

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
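
For the range-query problem that started this thread, the usual fix on 1.4+ is a trie-based numeric type as in the example schema; a sketch using the field from the thread (a reindex is needed after changing the type):

```xml
<!-- trie-based int type from the 1.4 example schema; range queries work here -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>

<field name="location_rating_total" type="tint" indexed="true" stored="true"/>
```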


RE: Search a URL

2010-09-23 Thread Dennis Gearon
WDF is not WTF (what I think when I see WDF), right ;-)

What is WDF?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/23/10, Markus Jelsma markus.jel...@buyways.nl wrote:

 From: Markus Jelsma markus.jel...@buyways.nl
 Subject: RE: Search a URL
 To: solr-user@lucene.apache.org
 Date: Thursday, September 23, 2010, 2:11 PM
 Try setting generateWordParts=1 in
 your WDF. Also, having a WhitespaceTokenizer makes little
 sense for URLs; there should be no whitespace in a URL. The
 StandardTokenizer can tokenize a URL. Anyway, the problem is
 your WDF.
  
 -Original message-
 From: Max Lynch ihas...@gmail.com
 Sent: Thu 23-09-2010 23:00
 To: solr-user@lucene.apache.org;
 
 Subject: Search a URL
 
 Is there a tokenizer that will allow me to search for parts
 of a URL?  For example, the search "google" would match on
 the data "http://mail.google.com/dlkjadf"

 This tokenizer factory doesn't seem to be sufficient:

 <fieldType name="text_standard" class="solr.TextField"
     positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="0" generateNumberParts="1" catenateWords="1"
         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"
         language="English" protected="protwords.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
         generateWordParts="0" generateNumberParts="1" catenateWords="1"
         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory"
         language="English" protected="protwords.txt"/>
   </analyzer>
 </fieldType>

 Thanks.



Re: Can Solr do approximate matching?

2010-09-23 Thread Igor Chudov
Eric, it appears that the /solr/mlt handler is missing, at least
based on the URL that I typed.

How can I verify existence of MoreLikeThis handler and install it?

Thanks a lot!

Igor
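
If the handler turns out not to be registered, a minimal solrconfig.xml entry might look like this sketch (the field names are assumptions):

```xml
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">title,body</str>  <!-- similarity fields: assumptions -->
    <int name="mlt.mintf">1</int>        <!-- count terms even at low tf -->
  </lst>
</requestHandler>
```

After a reload, /solr/mlt?q=id:123&mlt.fl=title,body should return the top similar documents.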

On Wed, Sep 22, 2010 at 11:18 AM, Erik Hatcher erik.hatc...@gmail.com wrote:
 http://www.lucidimagination.com/search/?q=%22find+similar%22 (then narrow 
 to wiki to find things in documentation)

 which will get  you to http://wiki.apache.org/solr/MoreLikeThisHandler

        Erik


 On Sep 22, 2010, at 12:12 PM, Li Li wrote:

 It seems there is a MoreLikeThis in Lucene. I don't know whether there is a
 counterpart in Solr. It just uses the found document as a query to find
 similar documents. Or you could just use a boolean OR query, with similar
 questions getting a higher score. Of course, you can analyse the
 question using some NLP techniques, such as identifying entities and ignoring
 less useful words such as "which is" ... but I guess a tf*idf scoring
 function will also work well

 2010/9/22 Igor Chudov ichu...@gmail.com:
 Hi guys. I am new here. So if I am unwittingly violating any rules,
 let me know.

 I am working with Solr because I own algebra.com, where I have a
 database of 250,000 or so answered math questions. I want to use Solr
 to provide approximate matching functionality called similar items.
 So that users looking at a problem could see how similar ones were
 answered.

 And my question is, does Solr support some find similar
 functionality. For example, in my mind, sentence I like tasty
 strawberries is 'similar' to a sentence such as I like yummy
 strawberries, just because both have a few of the same words.

 So, to end my long winded query, how would I implement a find top ten
 similar items to this one functionality?

 Thanks!





TokenFilter that removes payload ?

2010-09-23 Thread Teruhiko Kurosaka
Is there an existing TokenFilter that simply removes
payloads from the token stream?

Teruhiko Kuro Kurosaka
RLP + Lucene  Solr = powerful search for global contents