Re: No group by? looking for an alternative.

2010-08-05 Thread Mickael Magniez

Thanks for your response.

Unfortunately, I don't think it'll be enough. In fact, I have many products
other than shoes in my index, with many other facet fields.

I simplified my schema: in reality, the facets are dynamic fields.


Multiple Facet Dates

2010-08-05 Thread Raphaël Droz

Hi,
I saw this post:
http://lucene.472066.n3.nabble.com/Multiple-Facet-Dates-td495480.html
I didn't see any work in progress or plans for this feature on the list
or in the bug tracker.


Has someone already created a patch, proof of concept, etc. that I wasn't
able to find?
From my naïve point of view, the usefulness-to-added-code-complexity ratio
appears high.


My use-case is to provide, in one request:
- the results count for each of several years (tag-based exclusion)
- the results count for each month of a given year
- the results count for each day of a given month and year

I'm pretty sure someone here has already encountered the above, hasn't anyone?
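
For reference, a rough, untested sketch of a workaround that needs no new Solr
code, under the assumption that year, month, and day are also indexed as
separate plain fields (hypothetical names date_year, date_month, date_day,
populated at index time). Plain field faceting plus the tag/exclusion local
params available in Solr 1.4 would then give all three counts in one request:

http://localhost:8983/solr/select?q=*:*
  &fq={!tag=y}date_year:2010
  &fq={!tag=m}date_month:2010-08
  &facet=true
  &facet.field={!ex=y,m}date_year
  &facet.field={!ex=m}date_month
  &facet.field=date_day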


RE: Indexing fieldvalues with dashes and spaces

2010-08-05 Thread PeterKerk

@Michael, @Erick,

You both mention interesting things that got me thinking.

@Erick:
Your referenced page is very useful. It seems the whitespace tokenizer in
the text_ws fieldtype is causing the issues.

You do mention another interesting thing:
"And do be aware that fields you get back from a request (i.e. a search) are
the stored fields, NOT what's indexed."

On the page you provided, I see this under the Analyzers section: "Analyzers
are components that pre-process input text at index time and/or at search
time."

So I don't completely understand how that sentence is in line with your
comment.


@Michael:
You say: "use the tokenized field to return results, but have a duplicate
field of fieldtype=string to show the untokenized results. E.g. facet on
that field."
I think your comment applies to my requirement: a city field is something
that I want users to search on via text input, so let's say "New Yo" would
give the results for "New York".
But also a facet "Cities" is available, in which "New York" is just one of
the cities that is clickable.
The other facet is "theme", which in my example holds values like
"Gemeentehuis" and "Strand & Zee"; that would not be something that can be
searched via manual input but IS clickable.

Could you please indicate (just for the above fields) what needs to be
changed in my schema.xml, and if so, how that affects the way my request is
built up?


Thanks so much in advance for getting me started!
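
A minimal sketch of the duplicate-field setup Michael describes, with
hypothetical field names; copyField is standard schema.xml syntax, but the
names "city" and "city_facet" are only illustrative:

  <field name="city" type="text" indexed="true" stored="true"/>
  <field name="city_facet" type="string" indexed="true" stored="true"/>
  <copyField source="city" dest="city_facet"/>

Searches would run against the tokenized "city" field, while
facet.field=city_facet returns the untokenized "New York" for display.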


This is my schema.xml:


<?xml version="1.0" encoding="UTF-8" ?>

<schema name="db" version="1.1">

  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
        omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"
        omitNorms="true"/>
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>
    <fieldType name="sint" class="solr.SortableIntField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="date" class="solr.DateField" sortMissingLast="true"
        omitNorms="true"/>
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />
    <fieldType name="text_ws" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="text" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textTight" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="alphaOnlySort" class="solr.TextField"
        sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])"
            replacement="" replace="all" />
      </analyzer>
    </fieldType>
    <fieldtype name="ignored" stored="false"

Re: how to take a value from the query result

2010-08-05 Thread Geert-Jan Brits
you should parse the XML and extract the value. Lots of libraries
undoubtedly exist for PHP to help you with that (I don't know PHP).

Moreover, if all you want from the result is AUC_CAT, you should consider
using the fl parameter, like:
http://172.16.17.126:8983/search/select/?q=AUC_ID:607136&fl=AUC_CAT

to return a document of the form:

<doc>
  <int name="AUC_CAT">576</int>
</doc>

which is more efficient.
You still have to parse the doc as XML, though.




2010/8/5 twojah e...@tokobagus.com


 this is my query in the browser navigation toolbar:
 http://172.16.17.126:8983/search/select/?q=AUC_ID:607136

 and this is the result in the browser page:
 ...
 <doc>
   <int name="AP_AUC_PHOTO_AVAIL">1</int>
   <double name="AUC_AD_PRICE">1.0</double>
   <int name="AUC_CAT">576</int>
   <int name="AUC_CLIENT_ID">27017</int>
   <str name="AUC_DESCR_SHORT">Bracket Ceiling untuk semua merk projector,
 panjang 60-90 cm Bahan Besi Cat Hitam = 325rb Bahan Sta</str>
   <str name="AUC_HTML_DIR_NL">/aksesoris-batere-dan-tripod/update-bracket-projector-dan-lcd-plasma-tv-607136.html</str>
   <int name="AUC_ID">607136</int>
   <str name="AUC_ISNEGO">Nego</str>
   <int name="AUC_LOCATION">7</int>
   <str name="AUC_PHOTO">270/27017/bracket_lcd_plasma_3a-1274291780.JPG</str>
   <str name="AUC_START">2010-05-19 17:56:45</str>
   <str name="AUC_TITLE">[UPDATE] BRACKET Projector dan LCD/PLASMA TV</str>
   <int name="AUC_TYPE">21</int>
   <int name="PRO_BACKGROUND">0</int>
   <int name="PRO_BOLD">0</int>
   <int name="PRO_COLOR">0</int>
   <int name="PRO_GALLERY">0</int>
   <int name="PRO_LINK">0</int>
   <int name="PRO_SPONSOR">0</int>
   <int name="cat_id_sub">0</int>
   <int name="sectioncode">28</int>
 </doc>

 I want to get the AUC_CAT value (576) and use it in my PHP; how can I get
 that value?
 Please help.
 Thanks in advance.



Re: DIH and Cassandra

2010-08-05 Thread Shalin Shekhar Mangar
On Thu, Aug 5, 2010 at 3:07 AM, Dennis Gearon gear...@sbcglobal.net wrote:

 If data is stored in the index, isn't the index of Solr pretty much already
 a 'Big/Cassandra Table', except with tokenized columns to make searching
 easier?

 How are Cassandra/Big/Couch DBs doing text/weighted searching?

 Seems a real duplication to use Cassandra AND Solr. OTOH, I don't know how
 many 'Tables'/indexes one can make using Solr, I'm still a newbie.


I don't think Mark wants to duplicate Solr's functionality through
Cassandra. He is just asking if he can use DIH to import data from
Cassandra into Solr.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Load cores without restarting/reloading Solr

2010-08-05 Thread Karthik K
Can someone please answer this:

Is there a way of creating/adding a core and starting it without having to
reload Solr?


RE: Re: Load cores without restarting/reloading Solr

2010-08-05 Thread Markus Jelsma
http://wiki.apache.org/solr/CoreAdmin
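
A hedged example of the kind of CoreAdmin call that wiki page describes (host,
core name, and path are hypothetical, and the instance directory is assumed to
already contain a conf/ with solrconfig.xml and schema.xml):

http://localhost:8983/solr/admin/cores?action=CREATE&name=core1&instanceDir=/path/to/core1

This creates and registers the new core without restarting Solr.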
 
-Original message-
From: Karthik K karthikkato...@gmail.com
Sent: Thu 05-08-2010 12:00
To: solr-user@lucene.apache.org; 
Subject: Re: Load cores without restarting/reloading Solr

Can someone please answer this:

Is there a way of creating/adding a core and starting it without having to
reload Solr?


Re: Auto suggest with spell check

2010-08-05 Thread Grijesh.singh

Given below are the steps for auto-suggest and spellcheck in a single query.
Make this change in the TermsComponent part of solrconfig.xml:

<searchComponent name="termsComponent"
    class="org.apache.solr.handler.component.TermsComponent"/>
<requestHandler name="/terms"
    class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>termsComponent</str>
    <str>spellcheck</str> <!-- Added for using spellcheck with termcomponent -->
  </arr>
</requestHandler>
Use the query format given below to get auto-suggest and spellcheck
suggestions:
http://localhost:8983/solr/terms?terms.fl=text&terms.prefix=computr&spellcheck.q=computr&spellcheck=true


Re: DIH and Cassandra

2010-08-05 Thread Jon Baer
That is not 100% true.  I would think RDBMS and XML would be the most common
importers, but the real flexibility is with the TikaEntityProcessor [1] that
comes with DIH ...

http://wiki.apache.org/solr/TikaEntityProcessor

I'm pretty sure it would be able to handle any type of serde (in the case of
Cassandra I believe it is Thrift) on its own with the dependency libraries.

I find the TEP to be underutilized sometimes; I think it's because the DIH
docs lack info on what it can do.

[1] - http://tika.apache.org

- Jon

On Aug 4, 2010, at 3:00 PM, Andrei Savu wrote:

 DIH only works with relational databases and XML files [1], you need
 to write custom code in order to index data from Cassandra.
 
 It should be pretty easy to map documents from Cassandra to Solr.
 There are a lot of client libraries available [2] for Cassandra.
 
 [1] http://wiki.apache.org/solr/DataImportHandler
 [2] http://wiki.apache.org/cassandra/ClientOptions
 
 On Wed, Aug 4, 2010 at 6:41 PM, Mark static.void@gmail.com wrote:
 Is it possible to use DIH with Cassandra either out of the box or with
 something more custom? Thanks
 
 
 
 
 -- 
 Indekspot -- http://www.indekspot.com -- Managed Hosting for Apache Solr



Using solr response json

2010-08-05 Thread Rakhi Khatwani
Hi,
I want to query Solr and convert my response object to a JSON string
using SolrJ.

When I query from my browser (with wt=json) I get the following result:
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "response":{"numFound":0,"start":0,"docs":[]
}}


At the moment I am using google-gson (a third-party API) to directly
convert an object into a JSON string,
but somehow when I try converting a QueryResponse object into a JSON string
I get:

{"_header":{"nvPairs":["status",0,"QTime",1]},"_results":[],"elapsedTime":121,"response":{"nvPairs":["responseHeader",{"nvPairs":["status",0,"QTime",1]},"response",[]]}}

 Any pointers?

Regards
Raakhi.
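
One possible approach, sketched below as an untested assumption rather than a
known SolrJ feature: instead of serializing the QueryResponse wrapper itself,
copy each SolrDocument's fields into plain maps so that gson sees ordinary
maps and lists. getResults(), getFieldNames(), and getFieldValue() are
standard SolrJ; the class and method names here are hypothetical.

import com.google.gson.Gson;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonHelper {
    // Serialize only the documents of a QueryResponse, not the whole wrapper.
    public static String docsToJson(QueryResponse rsp) {
        List<Map<String, Object>> docs = new ArrayList<Map<String, Object>>();
        for (SolrDocument d : rsp.getResults()) {
            Map<String, Object> fields = new LinkedHashMap<String, Object>();
            for (String name : d.getFieldNames()) {
                fields.put(name, d.getFieldValue(name)); // field -> value
            }
            docs.add(fields);
        }
        return new Gson().toJson(docs); // e.g. [{"AUC_CAT":576}, ...]
    }
}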


Process entire result set

2010-08-05 Thread Eloi Rocha
Hi everybody,

I would like to know if it makes sense to use Solr in the following
scenario:
  - search for a large amount of data (like 1000, 10000, 100000 records)
  - each record contains four or five fields (strings and integers)
  - every request will ask for the entire result set (I can paginate the
results, but it would be much better to get all results at once)
  - we need to process the entire set in order to decide which ones will be
returned
  - this kind of request will happen frequently on several machines (several
transactions per second)
  - the Solr machines and request machines will be in the same cluster
  - we would like to get the entire result set in less than 500ms.

Thanks in advance,

Eloi


Re: Load cores without restarting/reloading Solr

2010-08-05 Thread Mark Miller
On 8/5/10 5:59 AM, Karthik K wrote:
 Can someone please answer this:

  Is there a way of creating/adding a core and starting it without having to
 reload Solr?
 

Yes, see http://wiki.apache.org/solr/CoreAdmin

- Mark
lucidimagination.com


word delimiter

2010-08-05 Thread j
I have "UPPER12-lower" and would like to be able to find it with the queries
"UPPER" or "lower". What should break this up for the index? A
tokenizer, or a filter such as WordDelimiterFilterFactory?

I have tried various combinations of parameters to
WordDelimiterFilterFactory and can't get it to split properly. Here are
the results from using the standard tokenizer followed directly by the
WordDelimiterFilterFactory markup below (from analysis.jsp):

position:   1             | 2
term text:  UPPER12-lower | lower
            UPPER         |
            12            |

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
    generateNumberParts="0" catenateWords="0" catenateNumbers="1"
    catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
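
One detail worth noting, offered as an observation rather than a confirmed
fix: the split itself appears to be working (UPPER, 12, and lower all show up
as terms above), but a lowercase query like "upper" will only match if the
tokens are also lowercased. The text fieldtype quoted elsewhere in this digest
does exactly that, placing LowerCaseFilterFactory after the word delimiter:

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
      generateNumberParts="0" catenateWords="0" catenateNumbers="1"
      catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>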


Re: No group by? looking for an alternative.

2010-08-05 Thread Mickael Magniez

I've got only one document per shoe, whatever its size or color.

My first try was to create one document per model/size/color, but when I
search for 'converse', for example, the same shoe is retrieved several
times, and I want to show only one record for each model. But I don't
succeed in grouping results by shoe model.

If you look at
http://www.amazon.com/s/ref=nb_sb_noss?url=node%3D679255011&field-keywords=Converse+All+Star+Leather+Hi+Chuck+Taylor+&x=0&y=0&ih=1_0_0_0_0_0_0_0_0_0.4136_1&fsc=-1
(Amazon search for "Converse All Star Leather Hi Chuck Taylor"):
they show the shoe only one time, but if you go to the product details, it
exists in several colors and sizes. Now if you filter on color, there are
fewer sizes available.



get-colt

2010-08-05 Thread Sai . Thumuluri
Hi - I am trying to compile the Solr source, and during the 'ant dist' step
the build times out on

get-colt:
  [get] Getting:
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
  [get] To:
/opt/solr/apache-solr-1.4.0/contrib/clustering/lib/downloads/colt-1.2.0.jar

After a while, the step fails, giving the following message:

BUILD FAILED
/opt/solr/apache-solr-1.4.0/common-build.xml:356: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/common-build.xml:219: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/contrib/clustering/build.xml:79:
java.net.ConnectException: Connection timed out

Any help is greatly appreciated!

Sai Thumuluri




RE: get-colt

2010-08-05 Thread Sai . Thumuluri
This is the message I am getting:

Error getting
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar

-Original Message-
From: sai.thumul...@verizonwireless.com
[mailto:sai.thumul...@verizonwireless.com] 
Sent: Thursday, August 05, 2010 1:15 PM
To: solr-user@lucene.apache.org
Subject: get-colt

Hi - I am trying to compile the Solr source, and during the 'ant dist' step
the build times out on

get-colt:
  [get] Getting:
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
  [get] To:
/opt/solr/apache-solr-1.4.0/contrib/clustering/lib/downloads/colt-1.2.0.jar

After a while, the step fails, giving the following message:

BUILD FAILED
/opt/solr/apache-solr-1.4.0/common-build.xml:356: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/common-build.xml:219: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/contrib/clustering/build.xml:79:
java.net.ConnectException: Connection timed out

Any help is greatly appreciated!

Sai Thumuluri




Re: question about relevance

2010-08-05 Thread Bharat Jain
Thank you for all the help; greatly appreciated. I have seen the related
issues, and I see a lot of patches in the JIRAs mentioned. I am really
confused about which patch to use (please excuse my ignorance). Also, are the
patches production-ready? I would greatly appreciate it if you could point me
to the correct patch, or do I have to apply all the patches to make it work?
Can I apply the patch to Solr 1.3?

Thanks
Bharat Jain


On Sat, Jul 31, 2010 at 2:16 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 May I suggest looking at some of the related issues, say SOLR-1682


 This issue is related to:
  SOLR-1682 Implement CollapseComponent
  SOLR-1311 pseudo-field-collapsing
  LUCENE-1421 Ability to group search results by field
  SOLR-1773 Field Collapsing (lightweight version)
  SOLR-237  Field collapsing



 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Bharat Jain bharat.j...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Fri, July 30, 2010 10:40:19 AM
  Subject: Re: question about relevance
 
  Hi,
  Thanks a lot for the info and your time. I think field collapsing will work
  for us. I looked at https://issues.apache.org/jira/browse/SOLR-236 but
  which file should I use for the patch? We use Solr 1.3.

  Thanks
  Bharat Jain
 
 
  On Fri,  Jul 30, 2010 at 12:53 AM, Chris Hostetter
  hossman_luc...@fucit.orgwrote:
 
  
   : 1. There are user records of type A, B, C etc. (userId field in the
   : index is common to all records)
   : 2. A user can have any number of A, B, C etc. (e.g. think of A being a
   : language; then a user can know many languages like French, English,
   : German etc.)
   : 3. Records are currently stored as a document in the index.
   : 4. A given query can match multiple records for the user
   : 5. If for a user more records are matched (e.g. if he knows both French
   : and German) then he is more relevant and should come top in the UI. This
   : is the reason I wanted to add Lucene scores, assuming a greater score
   : means more relevance.

   if your goal is to get back users from each search, then you should
   probably change your indexing strategy so that each user has a single
   document -- fields like language can be multivalued, etc...

   then a search for language:en language:fr will return users who speak
   English or French, and the ones that speak both will score higher.

   if you really can't change the index structure, then essentially what you
   are looking for is a field collapsing solution on the userId field, where
   you want each collapsed group to get a cumulative score.  I don't know if
   the existing field collapsing patches support this -- if you are already
   willing/capable to do it in the client then that may be the simplest
   thing to support moving forward.

   Adding the scores is certainly one metric you could use -- it's generally
   suspicious to try and imply too much meaning to scores in Lucene/Solr,
   but that's because people typically try to imply broader absolute
   meaning.  In the case of a single query the scores are relative to each
   other, and adding up all the scores for a given userId is approximately
   what would happen in my example above -- except that there is also a
   coord factor that would penalize documents that only match one clause ...
   it's complicated, but as an approximation adding the scores might give
   you what you are looking for -- only you can know for sure based on your
   specific data.

   -Hoss
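
A minimal sketch of the one-document-per-user schema Hoss describes, with
hypothetical field names (multiValued is standard schema.xml syntax):

  <field name="userId" type="string" indexed="true" stored="true"/>
  <field name="language" type="string" indexed="true" stored="true"
         multiValued="true"/>

A query such as q=language:en OR language:fr then returns one document per
user, and users matching both clauses naturally score higher.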
  
  
 



anti-words - exact match

2010-08-05 Thread Satish Kumar
Hi,

We have a requirement to NOT display search results if the user query contains
terms that are in our "anti-words" field. For example, if the user query is "I
have swollen foot" and some records in our index have "swollen foot" in the
anti-words field, we don't want to display those records. How do I go about
implementing this?

NOTE 1: the anti-words field can contain multiple values. Each value can be
one or multiple words (e.g. "swollen foot", "headache", etc.)

NOTE 2: the match must be exact. If the anti-words field contains "swollen
foot" and the user query is "I have swollen foot", the record must be
excluded. If the user query is "My foot is swollen", the record should not be
excluded.

Any pointers are greatly appreciated!


Thanks,
Satish


Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-05 Thread Ravi Kiran
Hello Mr. Horsetter,
I again tried the code from trunk
'https://svn.apache.org/repos/asf/lucene/dev/trunk' on a Solr 1.4 index, and
it gave me the following IndexFormatTooOldException, which in the first place
prompted me to think the indexes are incompatible. Any ideas?

java.lang.RuntimeException: org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported in file '_1d60.fdx': 1 (needs to be between 2 and 2). This version of Lucene only supports indexes created with release 3.0 and later.
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1067)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:582)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:453)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:308)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:198)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:123)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:273)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:385)
    at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:119)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4529)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:5348)
    at com.sun.enterprise.web.WebModule.start(WebModule.java:353)
    at com.sun.enterprise.web.LifecycleStarter.doRun(LifecycleStarter.java:58)
    at com.sun.appserv.management.util.misc.RunnableBase.runSync(RunnableBase.java:304)
    at com.sun.appserv.management.util.misc.RunnableBase.run(RunnableBase.java:341)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported in file '_1d60.fdx': 1 (needs to be between 2 and 2). This version of Lucene only supports indexes created with release 3.0 and later.
    at org.apache.lucene.index.FieldsReader.<init>(FieldsReader.java:109)
    at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:242)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:523)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:494)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:133)
    at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:28)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:98)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:630)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:92)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:415)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:294)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1056)
    ... 21 more




Ravi Kiran Bhaskar

On Tue, Aug 3, 2010 at 11:15 AM, Ravi Kiran ravi.bhas...@gmail.com wrote:

 Hello Mr. Hostetter,
 Thank you very much for the clarification. I do
 remember that when I first deployed the Solr code from trunk on a test
 server I couldn't open the index (created via 1.4), even via the Solr admin
 page; it kept giving me a corrupted-index EOF kind of exception, so I was
 curious. Let me try it out again and report back to you with the exact error.


 On Mon, Aug 2, 2010 at 4:28 PM, Chris Hostetter 
 hossman_luc...@fucit.orgwrote:

  : I am trying to use the solr code from '
  : https://svn.apache.org/repos/asf/lucene/dev/trunk' as my design warrants
  : use of PolyType fields. My understanding is that the indexes are
  : incompatible, am I right? I have about a million docs in my index (indexed
  : via Solr 1.4). Is re-indexing my only option, or is there a tool of some
  : sort to convert the 1.4 index to the 3.1 format?

  a) the trunk is what will ultimately be Solr 4.x, not 3.x ... for the
  3.x line there is a 3x branch...

  http://wiki.apache.org/solr/Solr3.1
  http://wiki.apache.org/solr/Solr4.0

  b) The 3x branch can read indexes created by Solr 1.4 -- the first time
  you add a doc and commit, the new segments will automatically be converted
  to the new format.  I am fairly certain that as of this moment, the 4x
  trunk can also read indexes created by Solr 1.4, with the same automatic
  conversion taking place.

  c)  If/When the trunk can no longer read Solr 1.4 indexes, there will be
  a tool provided for upgrading index versions.

RE: get-colt

2010-08-05 Thread Sai . Thumuluri
Got it working - had to manually copy the jar files under the contrib
directories

-Original Message-
From: sai.thumul...@verizonwireless.com
[mailto:sai.thumul...@verizonwireless.com] 
Sent: Thursday, August 05, 2010 2:00 PM
To: solr-user@lucene.apache.org
Subject: RE: get-colt

This is the message I am getting:

Error getting
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar

-Original Message-
From: sai.thumul...@verizonwireless.com
[mailto:sai.thumul...@verizonwireless.com] 
Sent: Thursday, August 05, 2010 1:15 PM
To: solr-user@lucene.apache.org
Subject: get-colt

Hi - I am trying to compile the Solr source, and during the 'ant dist' step
the build times out on

get-colt:
  [get] Getting:
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
  [get] To:
/opt/solr/apache-solr-1.4.0/contrib/clustering/lib/downloads/colt-1.2.0.jar

After a while, the step fails, giving the following message:

BUILD FAILED
/opt/solr/apache-solr-1.4.0/common-build.xml:356: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/common-build.xml:219: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/contrib/clustering/build.xml:79:
java.net.ConnectException: Connection timed out

Any help is greatly appreciated!

Sai Thumuluri




Re: get-colt

2010-08-05 Thread Koji Sekiguchi

(10/08/06 2:14), sai.thumul...@verizonwireless.com wrote:

Hi - I am trying to compile the Solr source, and during the 'ant dist' step
the build times out on

get-colt:
   [get] Getting:
http://repo1.maven.org/maven2/colt/colt/1.2.0/colt-1.2.0.jar
   [get] To:
/opt/solr/apache-solr-1.4.0/contrib/clustering/lib/downloads/colt-1.2.0.jar

After a while, the step fails, giving the following message:

BUILD FAILED
/opt/solr/apache-solr-1.4.0/common-build.xml:356: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/common-build.xml:219: The following error
occurred while executing this line:
/opt/solr/apache-solr-1.4.0/contrib/clustering/build.xml:79:
java.net.ConnectException: Connection timed out

Any help is greatly appreciated!

Sai Thumuluri



   

Sai,

If there is a proxy in your environment, specify the proxy host
and port (and optionally user and password):

$ ant dist -Dproxy.host=HOST -Dproxy.port=PORT -Dproxy.user=USER 
-Dproxy.password=PASSWORD


Koji

--
http://www.rondhuit.com/en/



Re: No group by? looking for an alternative.

2010-08-05 Thread Geert-Jan Brits
If I understand correctly:
1. products have different product variants (in the case of shoes, a
combination of color and size + some other fields).
2. Each product is shown once in the result set (so no multiple product
variants of the same product are shown).

This would solve that, IMO:

1. create 1 document per product (so not a document per product-variant)
2. create a multivalued field on which to facet, containing all combinations
of: size - color - any other field - yet another field (sketched below)
3. make sure to include combinations in which the user is indifferent to a
particular filter, i.e.: don't care about size (dc) + red -- dc-red
4. filtering on that combination would give you all the products that
satisfy the product-variant constraints (size, color, etc.) + the extra
product constraints ('converse')
5. on the detail page, show all available product-variants not filtered by
the constraints specified. This would likely be something outside of Solr (a
simple SQL select on a single product)

hope that helps,
Geert-Jan
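
A sketch of the combination field from steps 2-4, with hypothetical names
(the multivalued string field is standard Solr; the token scheme is only
illustrative):

  <field name="variant_combo" type="string" indexed="true" stored="false"
         multiValued="true"/>

A red shoe available in sizes 12 and 13 would be indexed with values like
12-red, 13-red, dc-red, 12-dc, 13-dc, dc-dc, and a user who picked color=red
but no size would filter with fq=variant_combo:dc-red.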

2010/8/5 Mickael Magniez mickaelmagn...@gmail.com


 I've got only one document per shoe, whatever its size or color.

 My first try was to create one document per model/size/color, but when I
 search for 'converse', for example, the same shoe is retrieved several
 times, and I want to show only one record for each model. But I don't
 succeed in grouping results by shoe model.

 If you look at

 http://www.amazon.com/s/ref=nb_sb_noss?url=node%3D679255011&field-keywords=Converse+All+Star+Leather+Hi+Chuck+Taylor+&x=0&y=0&ih=1_0_0_0_0_0_0_0_0_0.4136_1&fsc=-1
 (Amazon search for "Converse All Star Leather Hi Chuck Taylor"):
 they show the shoe only one time, but if you go to the product details, it
 exists in several colors and sizes. Now if you filter on color, there are
 fewer sizes available.




Re: No group by? looking for an alternative.

2010-08-05 Thread Jonathan Rochkind

Mickael Magniez wrote:

Thanks for your response.

Unfortunately, I don't think it'll be enough. In fact, I have many products
other than shoes in my index, with many other facet fields.

I simplified my schema: in reality, the facets are dynamic fields.
  


You could change the way you do indexing, so every product-color-size
combo is its own document.

Document1:
   product: running shoe
   size: 12
   color: red

Document2:
   product: running shoe
   size: 13
   color: red

That would let you do the kind of faceting drill-down you want to do.
It would of course make other things more complicated. But it's the only
way I can think of to let you do the kind of facet drill-down you want,
if I understand what you want correctly, which I may not.


Jonathan





Re: anti-words - exact match

2010-08-05 Thread Jonathan Rochkind
This is tricky. You could try doing something with the ShingleFilter
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory)
at _query time_ to turn the user's query

"i have a swollen foot" into:
"i", "i have", "i have a", "i have a swollen", "have", "have a",
"have a swollen"... etc.


I _think_ you can get the ShingleFilter factory to do that.
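
An untested sketch of the kind of fieldType Jonathan describes: the index
side keeps each anti-word value as a single non-tokenized term, and the query
side shingles the incoming text (the type name is hypothetical; whether the
shingles then match the keyword terms depends on the spacing/punctuation
normalization he mentions below):

  <fieldType name="antiwords" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
              outputUnigrams="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>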

But now you only want to exclude if one of those shingles matches the 
ENTIRE anti-word. So maybe index as non-tokenized, so each of those 
shingles will somehow only match on the complete thing.  You'd want to 
normalize spacing and punctuation.


But then you need to turn that into a _negated_ element of your query. 
Perhaps by using an fq with a NOT/- in it? And a query which 'matches' 
(causing 'not' behavior) if _any_ of the shingles match.


I have no idea if it's actually possible to put these things together in
that way. A non-tokenized field? Which still has its queries
shingle-ized at query time? And then works as a negated query, matching
for negation if any of the shingles match?  Not really sure how to put
that together in your solrconfig.xml and/or application logic if needed.
You could try.


Another option would be doing the query-time 'shingling' in your app,
and then it's a somewhat more normal Solr query: fq=-"shingle one"
-"shingle two" -"shingle three" etc.  Or put them in separate fq's,
depending on how you want to use your filter cache. Still searching on a
non-tokenized field, and still normalizing on white-space and
punctuation at both index time and (using the same normalization logic, but
in your application logic this time) query time.  I think that might work.


So I'm not really sure, but maybe that gives you some ideas.

Jonathan



Satish Kumar wrote:

Hi,

We have a requirement to NOT display search results if the user query contains
terms that are in our "anti-words" field. For example, if the user query is "I
have swollen foot" and some records in our index have "swollen foot" in the
anti-words field, we don't want to display those records. How do I go about
implementing this?

NOTE 1: the anti-words field can contain multiple values. Each value can be
one or multiple words (e.g. "swollen foot", "headache", etc.)

NOTE 2: the match must be exact. If the anti-words field contains "swollen
foot" and the user query is "I have swollen foot", the record must be
excluded. If the user query is "My foot is swollen", the record should not be
excluded.

Any pointers are greatly appreciated!


Thanks,
Satish

  


Re: Solr searching performance issues, using large documents

2010-08-05 Thread Peter Spam
I've read through the DataImportHandler page a few times, and still can't 
figure out how to separate a large document into smaller documents.  Any hints? 
:-)  Thanks!

-Peter
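
For what it's worth, a rough, untested sketch of a DIH config along the lines
Lance describes below: FileListEntityProcessor iterates the log files and
LineEntityProcessor emits one Solr document per line (paths and field names
are hypothetical; grouping lines into larger mini-documents would need a
transformer, e.g. the bit of JavaScript Lance mentions):

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="logfile" processor="FileListEntityProcessor"
            baseDir="/var/logs" fileName=".*\.log" rootEntity="false">
      <entity name="line" processor="LineEntityProcessor"
              url="${logfile.fileAbsolutePath}" rootEntity="true">
        <field column="rawLine" name="body"/>
      </entity>
    </entity>
  </document>
</dataConfig>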

On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:

 Spanning won't work- you would have to make overlapping mini-documents
 if you want to support this.
 
 I don't know how big the chunks should be- you'll have to experiment.
 
 Lance
 
 On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam ps...@mac.com wrote:
 What would happen if the search query phrase spanned separate document 
 chunks?
 
 Also, what would the optimal size of chunks be?
 
 Thanks!
 
 
 -Peter
 
 On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
 
 Not that I know of.
 
 The DataImportHandler has the ability to create multiple documents
 from one input stream. It is possible to create a DIH file that reads
 large log files and splits each one into N documents, with the file
 name as a common field. The DIH wiki page tells you in general how to
 make a DIH file.
 
 http://wiki.apache.org/solr/DataImportHandler
 
 From this, you should be able to make a DIH file that puts log files
 in as separate documents. As to splitting files up into
 mini-documents, you might have to write a bit of Javascript to achieve
 this. There is no data structure or software that implements
 structured documents.
 
 On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam ps...@mac.com wrote:
 Thanks for the pointer, Lance!  Is there an example of this somewhere?
 
 
 -Peter
 
 On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
 
 Ah! You're not just highlighting, you're snippetizing. This makes it 
 easier.
 
 Highlighting does not stream- it pulls the entire stored contents into
 one string and then pulls out the snippet.  If you want this to be
 fast, you have to split up the text into small pieces and only
 snippetize from the most relevant text. So, separate documents with a
 common group id for the document it came from. You might have to do 2
 queries to achieve what you want, but the second query for the same
 query will be blindingly fast. Often 1ms.
 
 Good luck!
 
 Lance
 
 On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam ps...@mac.com wrote:
 However, I do need to search the entire document, or else the 
 highlighting will sometimes be blank :-(
 Thanks!
 
 - Peter
 
 ps. sorry for the many responses - I'm rushing around trying to get this 
 working.
 
 On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
 
 Correction - it went from 17 seconds to 10 seconds - I was changing the 
 hl.regex.maxAnalyzedChars the first time.
 Thanks!
 
 -Peter
 
 On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
 
 On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
 
 did you already try other values for hl.maxAnalyzedChars=2147483647
 
 Yes, I tried dropping it down to 21, but it didn't have much of an 
 impact (one search I just tried went from 17 seconds to 15.8 seconds, 
 and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).
 
 ? Also regular expression highlighting is more expensive, I think.
 What does the 'fuzzy' variable mean? If you use this to query via
 ~someTerm instead of someTerm
 then you should try the trunk of solr which is a lot faster for fuzzy 
 or
 other wildcard search.
 
 fuzzy could be set to * but isn't right now.
 
 Thanks for the tips, Peter - this has been very frustrating!
 
 
 - Peter
 
 Regards,
 Peter.
 
 Data set: About 4,000 log files (will eventually grow to millions).  
 Average log file is 850k.  Largest log file (so far) is about 70MB.
 
 Problem: When I search for common terms, the query time goes from 
 under 2-3 seconds to about 60 seconds.  TermVectors etc are enabled. 
  When I disable highlighting, performance improves a lot, but is 
 still slow for some queries (7 seconds).  Thanks in advance for any 
 ideas!
 
 
 -Peter
 
 
 -
 
 4GB RAM server
 % java -Xms2048M -Xmx3072M -jar start.jar
 
 -
 
 schema.xml changes:
 
  <fieldType name="text_pl" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0" generateNumberParts="0" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    </analyzer>
  </fieldType>

 ...

  <field name="body" type="text_pl" indexed="true" stored="true"
      multiValued="false" termVectors="true" termPositions="true"
      termOffsets="true" />
  <field name="timestamp" type="date" indexed="true" stored="true"
      default="NOW" multiValued="false"/>
  <field name="version" type="string" indexed="true" stored="true"
      multiValued="false"/>
  <field name="device" type="string" indexed="true" stored="true"
      multiValued="false"/>
  <field name="filename" type="string" indexed="true" stored="true"
      multiValued="false"/>
  <field 

Re: Process entire result set

2010-08-05 Thread Jonathan Rochkind

Eloi Rocha wrote:

Hi everybody,

I would like to know if it makes sense to use Solr in the following
scenario:
  - search for a large amount of data (like 1000, 10000, 100000 records)
  - each record contains four or five fields (strings and integers)
  - every request will ask for the entire result set (I can paginate the
results, but it would be much better to get all results at once) [...]
  


Depends on what kind of searching you're doing. Are you doing searching
that needs an indexer like Solr? Then Solr is a good tool for your job.
Are you not, and can you do what you want just as easily in an RDBMS
or no-SQL store like MongoDB? Then I wouldn't use Solr.


Assuming you really do need Solr, I think this should work, but I would
not store the actual stored fields in Solr; I'd store those fields in an
external store (key-value store, RDBMS, whatever).   You store only what
you need to index in Solr, you do your search, you get IDs back.  You
ask for the entire result set back, why not.  If you give Solr enough
RAM, and set your cache settings appropriately (really big document and
related caches), then I _think_ it should perform okay. One way to find
out.

What you'd get back is just IDs; then you'd look up each ID in your
external store to get the actual fields you want to operate on. It _may_
not be necessary, and maybe you could do it with Solr stored fields, but
making Solr do only exactly what you really need from it (an index) will
maximize its ability to do what you need in available RAM.


If you don't need Solr/Lucene indexing/faceting behavior, and you can do 
just fine with an rdbms or non-sql store, use that.


Jonathan


Re: Sharing index files between multiple JVMs and replication

2010-08-05 Thread Lance Norskog
Oh yes, replication will not work for shared files. It is about making
your own copy from another machine.

There is no read-only option, but there should be. The files and
directory can be read-only; I've done it. You could use the OS
permission system to enforce read-only. Then you can just do a
commit against the read-only instances, and this will reload the
index without changing it.

Lance

On Wed, Aug 4, 2010 at 10:42 AM, Kelly Taylor wired...@yahoo.com wrote:
 Is anybody else encountering these same issues, if you have a similar setup?
 And is there a way to configure certain Solr web-apps as read-only
 (basically dummy instances) so that index changes are not allowed?



 - Original Message 
 From: Kelly Taylor wired...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Tue, August 3, 2010 5:48:11 PM
 Subject: Re: Sharing index files between multiple JVMs and replication

 Yes, they are on a common file server, and I've been sharing the same index
 directory between the Solr JVMs. But I seem to be hitting a wall when
 attempting to use just one instance for changing the index.

 With Solr replication disabled, I stream updates to the one instance, and this
 process hangs whenever there are additional Solr JVMs started up with the same
 configuration in solrconfig.xml  -  So I then tried, to no avail, using a
 different configuration, solrconfig-readonly.xml where the updateHandler was
 commented out, all /update* requestHandlers removed, mainIndex locktype of
 none, etc.

 And with Solr replication enabled, the Slave seems to hang, or at least report
 unusually long time estimates for the current running replication process to
 complete.


 -Kelly



 - Original Message 
 From: Lance Norskog goks...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, August 3, 2010 4:56:58 PM
 Subject: Re: Sharing index files between multiple JVMs and replication

 Are these files on a common file server? If you want to share them
 that way, it actually does work just to give them all the same index
 directory, as long as only one of them changes it.

 On Tue, Aug 3, 2010 at 4:38 PM, Kelly Taylor wired...@yahoo.com wrote:
 Is there a way to share index files amongst my multiple Solr web-apps, by
 configuring only one of the JVMs as an indexer, and the remaining, as
 read-only
 searchers?

 I'd like to configure in such a way that on startup of the read-only
 searchers,
 missing cores/indexes are not created, and updates are not handled.

 If I can get around the files being locked by the read-only instances, I
 should
 be able to scale wider in a given environment, as well as have less 
 replicated
 copies of my master index (Solr 1.4 Java Replication).

 Then once the commit is issued to the slave, I can fire off a RELOAD script
 for
 each of my read-only cores.

 -Kelly








 --
 Lance Norskog
 goks...@gmail.com








-- 
Lance Norskog
goks...@gmail.com


Re: Support loading queries from external files in QuerySenderListener

2010-08-05 Thread Lance Norskog
You can use an XInclude in solrconfig.xml. Your external query file
has to be in XML format.

Lance
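
A hedged sketch of what that might look like (the file name is hypothetical,
and the included file must itself be well-formed XML with a single root
element). In solrconfig.xml:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <xi:include href="warming-queries.xml"
                xmlns:xi="http://www.w3.org/2001/XInclude"/>
  </listener>

where warming-queries.xml contains the <arr name="queries"> element that
would otherwise be written inline.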

On Wed, Aug 4, 2010 at 7:57 AM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:
 On Wed, Aug 4, 2010 at 3:27 PM, Stanislaw 
 solrgeschic...@googlemail.comwrote:

 Hi all!
 I can't load my custom queries from an external file, as written here:
 https://issues.apache.org/jira/browse/SOLR-784

 This option seems not to be implemented in the current version 1.4.1 of
 Solr.
 Was it deleted, or does it come first with a new version?


 That patch was never committed so it is not available in any release.

 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Lance Norskog
goks...@gmail.com


Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-05 Thread Chris Hostetter

: Hello Mr. Horsetter,

Please, call me Hoss.  Mr. Horsetter is ... well, frankly, I have no idea
who that is.

: I again tried the code from trunk '
: https://svn.apache.org/repos/asf/lucene/dev/trunk' on solr 1.4 index and it

Please note my previous comments...

:  a) the trunk is what will ultimately be Solr 4.x, not 3.x ... for the
:  3.x line there is a 3x branch...
: 
:  http://wiki.apache.org/solr/Solr3.1
:  http://wiki.apache.org/solr/Solr4.0
: 
:  b) The 3x branch can read indexes created by Solr 1.4 -- the first time
:  you add a doc and commit the new segments wil automaticly be converted to
:  the new format.  I am fairly certian that as of this moment, the 4x trunk
:  can also read indexes created by Solr 1.4, with the same automatic
:  converstion taking place.

...apparently I was mistaken about trunk, which has already had the code
for reading Lucene 2.9 indexes (what's used in Solr 1.4) removed (hence
the IndexFormatTooOldException).

But that doesn't change the fact that 3.1 will be able to read Solr 1.4
indexes.  And 4.0 will be able to read 3.1 indexes.

You should, in fact, be able to use the 3x branch code today to open your
Solr 1.4 index and add one document to have it convert to a 3x index; then
use the trunk code to open that index, add one document, and have it
convert to a trunk index.

Of course: there is no guarantee that the index format in the official 4.0
release will be the same as what's on trunk right now -- it hasn't
been officially released.

:  c)  If/When the trunk can no longer read Solr 1.4 indexes, there will be
:  a tool provided for upgrading index versions.

That should still be true in the official 4.0 release (I really should
have said "When 4.0 can no longer read Solr 1.4 indexes") ...
I haven't been following the details closely, but I suspect that tool
hasn't been written yet because there isn't much point until the full
details of the trunk index format are nailed down.


-Hoss



Re: Indexing fieldvalues with dashes and spaces

2010-08-05 Thread Erick Erickson
This confuses lots of people. When you index a field, it's analyzed 10
ways from Sunday. Consider "The World is an unknown Entity". When
you INDEX it, many things happen, depending upon the analyzer.
Stopwords may be removed. Each token may be lowercased. Each token
may be stemmed. It all depends on what's in your analyzer chain. Assume
a simple chain consisting of breaking up tokens on whitespace, lowercasing,
and removing stopwords. The actual tokens INDEXED would be "world",
"unknown", and "entity". That is what is searched against.

However, the string, unchanged, would be STORED if you specified it so.
So when you ask for the field to be returned in a search result, you would
get "The World is an unknown Entity" if you asked for the field to be
returned as part of a search result that matched on, say, "world".

HTH
Erick

On Thu, Aug 5, 2010 at 4:31 AM, PeterKerk vettepa...@hotmail.com wrote:


 @Michael, @Erick,

 You both mention interesting things that got me thinking.

 @Erick:
 Your referenced page is very useful. It seems the whitespace tokenizer in
 the text_ws fieldtype is causing the issues.

 You do mention another interesting thing:
 "And do be aware that fields you get back from a request (i.e. a search)
 are the stored fields, NOT what's indexed."

 On the page you provided, I see this under the Analyzers section:
 "Analyzers are components that pre-process input text at index time and/or
 at search time."

 So I don't completely understand how that sentence is in line with your
 comment.


 @Michael:
 You say: "use the tokenized field to return results, but have a duplicate
 field of fieldtype=string to show the untokenized results. E.g. facet on
 that field."
 I think your comment applies to my requirement: a city field is something
 that I want users to search on via text input, so let's say "New Yo" would
 give the results for "New York".
 But also a facet "Cities" is available, in which "New York" is just one of
 the cities that is clickable.
 The other facet is "theme", which in my example holds values like
 "Gemeentehuis" and "Strand & Zee"; that would not be something that can
 be searched via manual input but IS clickable.

 Could you please indicate (just for the above fields) what needs to be
 changed in my schema.xml, and if so, how that affects the way my request is
 built up?


 Thanks so much in advance for getting me started!


 This is my schema.xml:


 <?xml version="1.0" encoding="UTF-8" ?>

 <schema name="db" version="1.1">

  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
        omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"
        omitNorms="true"/>
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>
    <fieldType name="sint" class="solr.SortableIntField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="date" class="solr.DateField" sortMissingLast="true"
        omitNorms="true"/>
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />
    <fieldType name="text_ws" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="text" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textTight" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter 

Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-05 Thread Robert Muir
On Thu, Aug 5, 2010 at 9:07 PM, Chris Hostetter hossman_luc...@fucit.orgwrote:


 That should still be true in the the official 4.0 release (i really should
 have said When 4.0 can no longer read SOlr 1.4 indexes), ...
 i havne't been following the detials closely, but i suspect that tool
 hasn't been writen yet because there isn't much point until the full
 details of the trunk index format are nailed down.


This is news to me?

File formats are back-compatible between major versions. Version X.N should
be able to read indexes generated by any version after and including version
X-1.0, but may-or-may-not be able to read indexes generated by version
X-2.N.

(And personally I think there is stuff in 2.x, like modified UTF-8, that I
would object to adding support for, with terms now being byte[].)

-- 
Robert Muir
rcm...@gmail.com


Re: XML Format

2010-08-05 Thread twojah

Can somebody help me, please?


Re: Solr searching performance issues, using large documents

2010-08-05 Thread Lance Norskog
You may have to write your own JavaScript to read in the giant field
and split it up.

On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam ps...@mac.com wrote:
 I've read through the DataImportHandler page a few times, and still can't 
 figure out how to separate a large document into smaller documents.  Any 
 hints? :-)  Thanks!

 -Peter

 On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:

 Spanning won't work- you would have to make overlapping mini-documents
 if you want to support this.

 I don't know how big the chunks should be- you'll have to experiment.

 Lance

 On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam ps...@mac.com wrote:
 What would happen if the search query phrase spanned separate document 
 chunks?

 Also, what would the optimal size of chunks be?

 Thanks!


 -Peter

 On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:

 Not that I know of.

 The DataImportHandler has the ability to create multiple documents
 from one input stream. It is possible to create a DIH file that reads
 large log files and splits each one into N documents, with the file
 name as a common field. The DIH wiki page tells you in general how to
 make a DIH file.

 http://wiki.apache.org/solr/DataImportHandler

 From this, you should be able to make a DIH file that puts log files
 in as separate documents. As to splitting files up into
 mini-documents, you might have to write a bit of Javascript to achieve
 this. There is no data structure or software that implements
 structured documents.

 On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam ps...@mac.com wrote:
 Thanks for the pointer, Lance!  Is there an example of this somewhere?


 -Peter

 On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:

 Ah! You're not just highlighting, you're snippetizing. This makes it 
 easier.

 Highlighting does not stream- it pulls the entire stored contents into
 one string and then pulls out the snippet.  If you want this to be
 fast, you have to split up the text into small pieces and only
 snippetize from the most relevant text. So, separate documents with a
 common group id for the document it came from. You might have to do 2
 queries to achieve what you want, but the second query for the same
 query will be blindingly fast. Often 1ms.

 Good luck!

 Lance

 On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam ps...@mac.com wrote:
 However, I do need to search the entire document, or else the 
 highlighting will sometimes be blank :-(
 Thanks!

 - Peter

 ps. sorry for the many responses - I'm rushing around trying to get 
 this working.

 On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:

 Correction - it went from 17 seconds to 10 seconds - I was changing 
 the hl.regex.maxAnalyzedChars the first time.
 Thanks!

 -Peter

 On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:

 On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:

 did you already try other values for hl.maxAnalyzedChars=2147483647

 Yes, I tried dropping it down to 21, but it didn't have much of an 
 impact (one search I just tried went from 17 seconds to 15.8 seconds, 
 and this is an 8-core Mac Pro with 6GB RAM - 4GB for java).

 ? Also regular expression highlighting is more expensive, I think.
 What does the 'fuzzy' variable mean? If you use this to query via
  ~someTerm instead of someTerm
 then you should try the trunk of solr which is a lot faster for 
 fuzzy or
 other wildcard search.

 fuzzy could be set to * but isn't right now.

 Thanks for the tips, Peter - this has been very frustrating!


 - Peter

 Regards,
 Peter.

 Data set: About 4,000 log files (will eventually grow to millions). 
  Average log file is 850k.  Largest log file (so far) is about 70MB.

 Problem: When I search for common terms, the query time goes from 
 under 2-3 seconds to about 60 seconds.  TermVectors etc are 
 enabled.  When I disable highlighting, performance improves a lot, 
 but is still slow for some queries (7 seconds).  Thanks in advance 
 for any ideas!


 -Peter


 -

 4GB RAM server
 % java -Xms2048M -Xmx3072M -jar start.jar

 -

 schema.xml changes:

  <fieldType name="text_pl" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0" generateNumberParts="0" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    </analyzer>
  </fieldType>

 ...

  <field name="body" type="text_pl" indexed="true" stored="true"
      multiValued="false" termVectors="true" termPositions="true"
      termOffsets="true" />
  <field name="timestamp" type="date" indexed="true" stored="true"
      default="NOW" multiValued="false"/>
  <field name="version" type="string" indexed="true" stored="true"
      multiValued="false"/>
  <field name="device" type="string" indexed="true" stored="true"

Re: No group by? looking for an alternative.

2010-08-05 Thread Lance Norskog
I can see how one document per model blows up when you have many
options. But how many models of the shoe do they actually make? They
can't possibly make 5000, one for every metadata combination.

If you go with one document per model, you have to do a second search
on that product ID to get all of the models.

Field Collapsing is exactly for the 'many shoes for one product'
problem, but it is not released, so the second search is what you
want.

On Thu, Aug 5, 2010 at 4:54 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 Mickael Magniez wrote:

 Thanks for your response.

 Unfortunately, I don't think it'll be enough. In fact, I have many products
 other than shoes in my index, with many other facet fields.

 I simplified my schema: in reality, the facets are dynamic fields.


 You could change the way you do indexing, so every product-color-size combo
 is its own document.

 Document1:
   product: running shoe
   size: 12
   color: red

 Document2:
   product: running shoe
   size: 13
   color: red

 That would let you do the kind of faceting drill-down you want to do. It
 would of course make other things more complicated. But it's the only way I
 can think of to let you do the kind of facet drill-down you want, if I
 understand what you want correctly, which I may not.

 Jonathan







-- 
Lance Norskog
goks...@gmail.com


Query Result is not updated based on the new XML files

2010-08-05 Thread twojah

hi everyone,
I run this query from the browser navigation toolbar:
http://172.16.17.126:8983/search/select/?q=AUC_CAT:978

The query is based on cat_978.xml, which was produced by my PHP script,
and I got the correct result, like this:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="q.op">AND</str>
      <str name="fl">AUC_ID,AUC_CAT,AUC_DESCR_SHORT</str>
      <str name="start">0</str>
      <str name="q">AUC_CAT:978</str>
      <str name="rows">1000</str>
    </lst>
  </lst>
  <result name="response" numFound="1575" start="0">
    <doc>
      <int name="AUC_CAT">978</int>
      <str name="AUC_DESCR_SHORT">HP Compaq Presario V3700Core 2 duo webcam
wifi lan HD 160Gb DDR2 1Gb Tas original windows 7 ultimate</str>
      <int name="AUC_ID">618436123</int>
    </doc>
    <doc>
      <int name="AUC_CAT">978</int>
      <str name="AUC_DESCR_SHORT">HP Compaq Presario V3700Core 2 duo webcam
wifi lan HD 160Gb DDR2 1Gb Tas original windows 7 ultimate</str>
      <int name="AUC_ID">618436</int>
    </doc>
  </result>
</response>

Now, I edit the AUC_ID field in cat_978.xml: I change 618436123 to 618436
(see the AUC_ID values above),
and I refresh the browser, but it doesn't update or reflect the changes I
made.
How do I make the query result reflect the cat_978.xml changes exactly?

I really need your help.
Thanks in advance.
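
(A likely cause, offered as an editorial assumption rather than a reply from
the list: Solr does not watch source XML files, so a changed document has to
be re-posted and committed before queries see it, e.g. something like

  curl 'http://172.16.17.126:8983/search/update?commit=true' \
       --data-binary @cat_978.xml -H 'Content-type:text/xml'

assuming cat_978.xml is in Solr's <add><doc>...</doc></add> update format and
the update handler lives under the same /search/ path as /search/select.)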