Re: Boosting of words

2009-10-17 Thread bhaskar chandrasekar
Hi,
 
I am using Solr 1.3.
I access Solr through carrot and use Java.
 
 
Regards
Bhaskar

--- On Thu, 10/15/09, AHMET ARSLAN iori...@yahoo.com wrote:


From: AHMET ARSLAN iori...@yahoo.com
Subject: Re: Boosting of words
To: solr-user@lucene.apache.org
Date: Thursday, October 15, 2009, 8:58 AM


 Hi,
  
 I am able to see the results when I pass the values in the
 query browser.
  
 When I pass the below query I am able to see the difference
 in output.
  
 http://localhost:8983/solr/select/?q=java^100%20technology^1
  
 The user cannot pass these values in the query browser each time
 to see the output.
  
 But where exactly should this value
  
 java^100 technology^1
  
 be set? In which file and at which location,
 to be precise?
  
 Please help me.

Although I do not fully understand your question, you need to URL-encode your
parameter values before you issue an HTTP GET:   parameter=urlencode(value,UTF-8) 

Try this url :
/select/?q=java%5E100+OR+technology%5E1&version=2.2

Note that space is encoded into +.
Also ^ is encoded into %5E. 
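
The encoding described above can be sketched in a few lines. This is only an illustration in Python using the standard library; the helper name build_select_url and the localhost base URL are assumptions made for the example, not part of any Solr client.

```python
from urllib.parse import urlencode

def build_select_url(base, query, version="2.2"):
    """Build a select URL with URL-encoded parameter values.

    urlencode() encodes a space as '+' and '^' as '%5E', and joins
    the parameters with '&'.
    """
    params = {"q": query, "version": version}
    return base + "?" + urlencode(params)

# Hypothetical base URL, taken from the query quoted in this thread.
url = build_select_url("http://localhost:8983/solr/select/",
                       "java^100 OR technology^1")
print(url)
```

The printed URL carries the same q value as the one suggested above, so the boosts reach Solr intact instead of being mangled in transit.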

What kind of Solr client are you using? How are you accessing Solr: from 
Java, PHP, Ruby?







  

Re: Using DIH's special commands....Help needed

2009-10-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
It is strange that LogTransformer did not log the data.

On Fri, Oct 16, 2009 at 5:54 PM, William Pierce evalsi...@hotmail.com wrote:
 Folks:

 Continuing my saga with DIH and use of its special commands.  I have
 verified that the script functionality is indeed working.    I also verified
 that '$skipRow' is working.    But I don't think that '$deleteDocById' is
 working.

 My script now looks as follows:

 <script>
        <![CDATA[
                function DeleteRow(row) {
                    var jid = row.get('Id');
                    var jis = row.get('IndexingStatus');
                    if ( jis == 4 ) {
                        row.put('$deleteDocById', jid);
                        row.remove('Col1');
                        row.put('Col1', jid);
                    }
                    return row;
                }
        ]]>
 </script>

 The theory is that rows whose 'IndexingStatus' value is 4 should be deleted
 from solr index.  Just to be sure that javascript syntax was correct and
 checked out,  I intentionally overwrite a field called 'Col1' in my schema
 with primary key of the document to be deleted.

 On a clean and empty index, I import 47 rows from my dummy db.   Everything
 checks out correctly since IndexingStatus for each row is 1.  There are no
 rows to delete.    I then go into the db and set one row with the
 IndexingStatus = 4.   When I execute the dataimport,  I find that all 47
 documents are imported correctly.   However,  for the row for which
 'IndexingStatus' was set to 4,  the Col1 value is set correctly by the
 script transformer to be the primary key value for that row/document.
 However,  I should not be seeing that document  since the '$deleteDocById'
 should have deleted this from solr.

 Could this be a bug in solr?  Or, am I misunderstanding how $deleteDocById
 works?

 By the way, Noble, I tried to set the LogTransformer, and add logging per
 your suggestion.  That did not work either.  I set logLevel=debug, and
 also turned on solr logging in the admin console to be the max value
 (finest) and still no output.

 Thanks,

 - Bill



 --
 From: Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com
 Sent: Thursday, October 15, 2009 10:05 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Using DIH's special commands....Help needed

 use  LogTransformer to see if the value is indeed set

 <entity name="post" transformer="script:DeleteRow, RegexTransformer,LogTransformer"
         logTemplate="${post}"
         query=" select  Id, a, b, c, IndexingStatus from  prod_table
 where (IndexingStatus = 1 or IndexingStatus = 4) ">

 this should print out the entire row after the transformations



 On Fri, Oct 16, 2009 at 3:04 AM, William Pierce evalsi...@hotmail.com
 wrote:

 Thanks for your reply!  I tried your suggestion.  No luck.  I have verified
 that I have version  1.6.0_05-b13 of java installed.  I am running with the
 nightly bits of October 7.  I am pretty much out of ideas at the present
 time. I'd appreciate any tips/pointers.

 Thanks,

 - Bill

 --
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 Sent: Thursday, October 15, 2009 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Using DIH's special commands....Help needed

 On Fri, Oct 16, 2009 at 12:46 AM, William Pierce
 evalsi...@hotmail.com wrote:

 Thanks for your help.  Here is my DIH config file. I'd appreciate any
 help/pointers you may give me.  No matter what I do the documents are not
 getting deleted from the index.  My db has rows whose 'IndexingStatus' field
 has values of either 1 (which means add it to solr), or 4 (which means
 delete the document with the primary key from SOLR index).  I have two
 transformers running.  Not sure what I am doing wrong.

 <dataConfig>
  <script><![CDATA[
             function DeleteRow(row)    {
                 var jis = row.get('IndexingStatus');
                 var jid = row.get('Id');
                 if ( jis == 4 ) {
                     row.put('$deleteDocById', jid);
                 }
                 return row;
             }
     ]]></script>

  <dataSource type="JdbcDataSource"
           driver="com.mysql.jdbc.Driver"
           url="jdbc:mysql://localhost/db"
           user="**"
           password="***"/>
  <document>
  <entity name="post" transformer="script:DeleteRow, RegexTransformer"
         query=" select  Id, a, b, c, IndexingStatus from  prod_table
 where (IndexingStatus = 1 or IndexingStatus = 4) ">
      <field column="ptype" splitBy="," sourceColName="a" />
      <field column="wauth" splitBy=","  sourceColName="b" />
      <field column="miles" splitBy=","  sourceColName="c" />
  </entity>
  </document>
 </dataConfig>


 One thing I'd try is to use '4' for comparison rather than the number 4 (the
 type would depend on the sql type). Also, for javascript transformers to
 work, you must use JDK 6 which has javascript support. 
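
The type concern above can be illustrated outside DIH. The sketch below is plain Python, not transformer code; is_delete_status is a hypothetical helper, and the assumption is simply that the status column may arrive as either a number or a string depending on the SQL type.

```python
def is_delete_status(value, delete_code="4"):
    """Return True when the status value means 'delete'.

    The value is normalised to a stripped string first, so 4, "4"
    and " 4 " all compare equal to the delete code.
    """
    if value is None:
        return False
    return str(value).strip() == delete_code

# The same check works for numeric and string column types.
print(is_delete_status(4), is_delete_status("4"), is_delete_status(1))
```

Comparing on a normalised string is one way to make the check independent of how the driver maps the column.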

Re: stats page slow in latest nightly

2009-10-17 Thread Yonik Seeley
On Tue, Oct 6, 2009 at 5:51 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : When I was working on it, I was actually going to default to not show
 : the size, and make you click a link that added a param to get the sizes
 : in the display too. But I foolishly didn't bring it up when Hoss made my
 : life easier with his simpler patch.

 we can always turn the size estimator off ... or turn it on only when
 doing the insanity checks (so normal stats are fast, but if anything is
 duplicated you'll get info on the size of the discrepancy)

Is this something we want to do before release?  I'm not at all
familiar with the new size estimator stuff, so I'm not sure how long
it can actually take for a big index.

-Yonik
http://www.lucidimagination.com


urgent need of some basic help

2009-10-17 Thread Naga raja
I am in need of some basic help regarding Solr.

1) In which format is the posted data stored in Solr?
How is the data stored?

2) What is the concept of replication in Solr?

3) Suppose my schema.xml has fields like id, no, name and
I have posted nearly 50 documents.
Now I need to post data in the format id, name, address.
How can I do that?
Do I need to index all the posted files again, or is there some other
option available?


Store tika extracted result as xhtml

2009-10-17 Thread Andy Lam Yin Cong
Dear All,

I have a field defined in schema.xml as below,
<fieldtype name="string"  class="solr.StrField" sortMissingLast="true" 
indexed="true" stored="true" multiValued="false" omitNorms="true"/>
<field name="original" type="string" indexed="false"  />

and in the solrconfig.xml
<str name="fmap.content">original</str>

Basically, when I upload the document via the command below
curl 
'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&commit=true'
 -F fi...@mccm.pdf

and try to display the field via a query, it shows 

Take A Chance On Me  
Take A Chance On Me
Monte Carlo Condensed Matter
A very brief guide to Monte Carlo simulation.
An explanation of what I do.
A chance for far too many ABBA puns
...
The above is not xhtml(!)

However, if I run the command below with extractOnly=true
 curl 
 'http://localhost:8983/solr/info/update/extract?map.content=text_shingle&literal.url=test&extractOnly=true'
  -F fi...@mccm.pdf

I get the result
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
&lt;head&gt;
&lt;title&gt;Take A Chance On Me&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;div&gt;
.
which is an xhtml output.

My objective is to be able to store it as xhtml in the field and to 
retrieve it as cached output. 
Since tika is already producing xhtml output, I wonder why Solr saves it as 
plain text. (Maybe I missed something in the configuration?)

Also, I will be using SolrJ as the application layer, so as a workaround, if 
there is any way that I can get the xhtml result, maybe I can store it 
somewhere else outside of Solr.
Any advice on this will be highly appreciated.
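
For the workaround idea, here is a minimal sketch of turning the escaped extractOnly output back into xhtml so it could be kept outside Solr. It assumes the xhtml arrives as an entity-escaped string, as in the output quoted above; the function name and the shortened sample string are made up for the example, and a real client's XML parser may already do this unescaping.

```python
import html

def unescape_extracted(escaped):
    """Convert entity-escaped markup (&lt; &gt; &amp; ...) back to xhtml text."""
    return html.unescape(escaped)

# Shortened sample modelled on the escaped output shown in this message.
sample = ('&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;'
          '&lt;head&gt;&lt;title&gt;Take A Chance On Me&lt;/title&gt;&lt;/head&gt;'
          '&lt;/html&gt;')
xhtml = unescape_extracted(sample)
print(xhtml)
```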

 Many Thanks & Kind Regards
Andy


  


Re: stats page slow in latest nightly

2009-10-17 Thread Chris Hostetter

:  we can always turn the size estimator off ... or turn it on only when
:  doing the insanity checks (so normal stats are fast, but if anything is
:  duplicated you'll get info on the size of the discrepancy)
: 
: Is this something we want to do before release?  I'm not at all
: familiar with the new size estimator stuff, so I'm not sure how long
: it can actually take for a big index.

crap ... this slipped my mind.  Yeah, we probably ought to do it before 
the release.  I suspect if you've got things tuned for lots of little 
segments it could be so slow as to be worthless.

I won't have access to the code until Monday, but I'm pretty sure this 
should be a fairly trivial change (just un-set the estimator on the 
CacheEntry objects)


-Hoss



Re: urgent need of some basic help

2009-10-17 Thread Bess Sadler

Hi, Naga.

On 17-Oct-09, at 10:18 AM, Naga raja wrote:


I am in need of some basic help regarding Solr.

1) In which format is the posted data stored in Solr?
   How is the data stored?


Once solr has ingested the data, it is stored in binary files in a  
lucene index. You can see the files in the data/index directory of  
your solr instance, and you can open that lucene index with something  
like Luke: http://www.getopt.org/luke/
I find that looking at your index with Luke is a very helpful way of  
understanding exactly what is being stored.



2) what is the concept of replication in SOLR?


Sometimes you want two solr indexes that contain the same data.  For  
example, one common situation is when you want to have one index on a  
machine where documents are processed and written to an index, (which  
can be slow) and a separate index that's only used for searching  
(which you want to be as fast as possible). You could do this by  
replicating index one to index two. Is that what you're asking about?


There are several ways of replicating solr indexes. Make sure you  
check out these pages on the wiki:


http://wiki.apache.org/solr/CollectionDistribution
http://wiki.apache.org/solr/SolrReplication



3) Suppose my schema.xml has fields like id, no, name and
I have posted nearly 50 documents.
Now I need to post data in the format id, name, address.
How can I do that?
Do I need to index all the posted files again, or is there some other
option available?


One concept that people sometimes have trouble with when they start  
using an index like solr instead of a relational database is that your  
fields do not all have to be populated for every document. It's totally  
fine to have some documents that have id, no, name and others that  
have id, name, and address, but all four potential fields (id, no,  
name, and address) will have to be accounted for in your schema.xml.
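
To make that concrete, here is a small sketch in Python. It is only a model of the idea, not how solr validates documents: SCHEMA_FIELDS stands in for the fields declared in schema.xml, and unknown_fields is a hypothetical helper.

```python
# Stand-in for the four fields declared in schema.xml.
SCHEMA_FIELDS = {"id", "no", "name", "address"}

def unknown_fields(doc):
    """Return the field names in doc that the schema does not declare."""
    return sorted(set(doc) - SCHEMA_FIELDS)

old_doc = {"id": "1", "no": "17", "name": "first"}                 # no address
new_doc = {"id": "2", "name": "second", "address": "somewhere"}    # no 'no'
bad_doc = {"id": "3", "phone": "555"}                              # undeclared field

for doc in (old_doc, new_doc, bad_doc):
    print(doc["id"], unknown_fields(doc))
```

Both document shapes pass because every field they use is declared; only the undeclared field is flagged.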


Remember to read the wiki, which has pretty much everything you ever  
need to know about solr: http://wiki.apache.org/solr/

There is also a good solr book available now: 
http://www.packtpub.com/solr-1-4-enterprise-search-server

I hope this helps!

Bess


Elizabeth (Bess) Sadler
Chief Architect for the Online Library Environment
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904



Re: Boosting of words

2009-10-17 Thread AHMET ARSLAN
 I am using Solr 1.3.
 I access Solr through carrot and use Java.

What do you mean by accessing Solr through carrot?
Are you using Solr as an input to carrot? Using 
org.carrot2.source.solr.SolrDocumentSource just to cluster search results?
Can we say that you are interested in clustered search results rather than 
the search results themselves? If so, Solr 1.4 will have Grant Ingersoll's 
ClusteringComponent [1], which uses carrot2 to cluster search results.

[1] http://wiki.apache.org/solr/ClusteringComponent 


  


Re: Using DIH's special commands....Help needed

2009-10-17 Thread Lance Norskog
I had this problem also, but I was using the Jetty example. I fail at
logging configurations about 90% of the time, so I assumed it was my
fault.

2009/10/17 Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com:
 It is strange that LogTransformer did not log the data.




Problem with Query Parser

2009-10-17 Thread Germán Biozzoli
Hi everybody

I have a simple but (for me) annoying problem. I'm a happy user of Solr
1.4 with a small collection of documents. Today one of the users
reported that a query returns documents that are not pertinent to the
expression. I have Spanish, Portuguese and English text inside the
collection. Using the Solr administration interface I found that
she was right: if I search for the Spanish term represion, the search
matches only the word root, I mean it returns every document with the
term repres. Using the admin debug search I found this:


<lst name="debug">
<str name="rawquerystring">description:represion</str>
<str name="querystring">description:represion</str>
<str name="parsedquery">description:repres</str>
<str name="parsedquery_toString">description:repres</str>

The ion part of the term was removed by the query parser. The first
question is: I don't know where I should look to correct this, in
schema.xml or in solrconfig.xml.

In the schema, description is

<field name="description" type="text" indexed="true"
multiValued="true" stored="true"/>

and text is:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

</fieldtype>

The only thing that looks suspicious to me is the EnglishPorter filter. I've
removed it from the configuration but nothing changes. Should I reindex
the collection to see the changes? Should I also remove it from the index
section? What will I lose by removing the English Porter filter?
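
For illustration, here is why an English stemmer would turn represion into repres. This is a toy Python version of a single suffix rule from the Porter family of stemmers (drop "ion" when the remaining stem ends in s or t and is long enough); it is not the Solr filter itself, and the function names are made up for the example.

```python
VOWELS = set("aeiou")

def measure(stem):
    """Count vowel-to-consonant transitions (Porter's measure m).

    Simplification: 'y' is always treated as a consonant here.
    """
    m = 0
    prev_vowel = False
    for ch in stem:
        is_vowel = ch in VOWELS
        if prev_vowel and not is_vowel:
            m += 1
        prev_vowel = is_vowel
    return m

def strip_ion(word):
    """Apply only the '-ion' rule: remove it after s or t when m > 1."""
    if word.endswith("ion"):
        stem = word[:-3]
        if stem and stem[-1] in "st" and measure(stem) > 1:
            return stem
    return word

print(strip_ion("represion"))
```

The stem repres ends in s and has measure 2, so the rule fires, which matches the parsedquery shown in the debug output above.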

Thanks a lot for the help
German


Re: Using DIH's special commands....Help needed

2009-10-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
postImportDeleteQuery is fine in your case.

On Sat, Oct 17, 2009 at 3:16 AM, William Pierce evalsi...@hotmail.com wrote:
 Shalin,

 Many thanks for your tip. But it did not seem to help!

 Do you think I can use postDeleteImportQuery for this task?

 Should I file a bug report?

 Cheers,

 Bill

 --
 From: Shalin Shekhar Mangar shalinman...@gmail.com
 Sent: Friday, October 16, 2009 1:16 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Using DIH's special commands....Help needed

 On Fri, Oct 16, 2009 at 5:54 PM, William Pierce
 evalsi...@hotmail.com wrote:

 Folks:

 Continuing my saga with DIH and use of its special commands.  I have
 verified that the script functionality is indeed working.    I also verified
 that '$skipRow' is working.    But I don't think that '$deleteDocById' is
 working.

 My script now looks as follows:

 <script>
       <![CDATA[
               function DeleteRow(row) {
                   var jid = row.get('Id');
                   var jis = row.get('IndexingStatus');
                   if ( jis == 4 ) {
                       row.put('$deleteDocById', jid);
                       row.remove('Col1');
                       row.put('Col1', jid);
                   }
                   return row;
               }
       ]]>
 </script>

 The theory is that rows whose 'IndexingStatus' value is 4 should be deleted
 from solr index.  Just to be sure that javascript syntax was correct and
 checked out,  I intentionally overwrite a field called 'Col1' in my schema
 with primary key of the document to be deleted.

 On a clean and empty index, I import 47 rows from my dummy db. Everything
 checks out correctly since IndexingStatus for each row is 1.  There are no
 rows to delete.    I then go into the db and set one row with the
 IndexingStatus = 4.   When I execute the dataimport,  I find that all 47
 documents are imported correctly.   However,  for the row for which
 'IndexingStatus' was set to 4,  the Col1 value is set correctly by the
 script transformer to be the primary key value for that row/document.
 However,  I should not be seeing that document  since the '$deleteDocById'
 should have deleted this from solr.

 Could this be a bug in solr?  Or, am I misunderstanding how $deleteDocById
 works?


 Would the row which has IndexingStatus=4 also create a document with the
 same uniqueKey which you would delete using the transformer? If yes, that
 can explain what is happening and you can avoid that by adding a $skipDoc
 flag in addition to the $deleteDocById flag.
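
 The effect described above can be modelled with a toy index. This Python sketch is only an illustration of the suspected behaviour (the delete is applied, then the same row is added again as a document); it is not DIH's actual implementation, and run_import and the dict-based index are assumptions made for the example.

```python
def run_import(index, rows, use_skip_doc):
    """Toy model: apply the delete flag first, then add the row unless skipped."""
    for row in rows:
        doc = dict(row)
        # What the transformer would set for rows marked for deletion.
        if str(doc.get("IndexingStatus")) == "4":
            doc["$deleteDocById"] = doc["Id"]
            if use_skip_doc:
                doc["$skipDoc"] = "true"
        # Apply the delete, if one was requested.
        delete_id = doc.pop("$deleteDocById", None)
        if delete_id is not None:
            index.pop(delete_id, None)
        # A skipped row is not indexed; otherwise it is (re)added by its key.
        if doc.pop("$skipDoc", None) == "true":
            continue
        index[doc["Id"]] = doc
    return index

rows = [{"Id": 1, "IndexingStatus": 1}, {"Id": 2, "IndexingStatus": 4}]
without_skip = run_import({}, rows, use_skip_doc=False)
with_skip = run_import({}, rows, use_skip_doc=True)
print(sorted(without_skip), sorted(with_skip))
```

 In this model the row with status 4 is still present when only the delete flag is set, and it is absent once the skip flag is added as well.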

 I know this is a basic question but you are using Solr 1.4, aren't you?

 --
 Regards,
 Shalin Shekhar Mangar.





-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com