Re: Dataimport: Could not load driver: com.mysql.jdbc.Driver

2010-12-06 Thread stockii

maybe encoding !? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Dataimport-Could-not-load-driver-com-mysql-jdbc-Driver-tp2021616p2027138.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr -File Based Spell Check

2010-12-06 Thread ramzesua

Hi. As far as I know, for file-based spellcheck you need to:
 - configure your spellcheck search component in solrconfig.xml, for example:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="name">file</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>
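To actually use the component, it also needs to be hooked into a request handler via last-components; a minimal sketch, assuming your default handler is named "standard":

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- point the handler at the file-based dictionary defined above -->
    <str name="spellcheck.dictionary">file</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```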

 - then you must get or create spellings.txt, for example:
abaft 
abalone 
abalones 
abandon 
abandoned 
abandonedly 
... 
(one correct word per line)

 - after that you must build your file into the spellcheck index:
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.build
for the build, try this:
http://solr:8983/solr/select?q=*:*&spellcheck=true&spellcheck.build=true

After that you can use spellcheck in your search, for example:
http://solr:8983/solr/select?q=bingo&spellcheck=true

Try this; if there are any errors, post here.
P.S. Please read http://wiki.apache.org/solr/SpellCheckComponent for more
information.


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-File-Based-Spell-Check-tp2025671p2027258.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr -File Based Spell Check

2010-12-06 Thread Erick Erickson
Are you sure you want spellcheck/autosuggest?

Because what you're talking about almost sounds like
synonyms.
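Synonyms, for reference, are handled by a SynonymFilterFactory reading a
synonyms.txt file; a sketch, with hypothetical terms:

```xml
<!-- in the query-side analyzer of the field type -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
<!-- synonyms.txt would then contain lines such as:
       dress, wear
-->
```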

Best
Erick

On Mon, Dec 6, 2010 at 1:37 AM, rajini maski rajinima...@gmail.com wrote:

 How does the solr file based spell check work?

 How do we need to enter data in the spelling.txt...I am not clear about its
 functionality..If anyone know..Please reply.

 I want to index a word = Wear
 But while searching I search as =Dress
 I want to get results for Wear.. How do i apply this functionality..

 Awaiting Reply



Re: Solr -File Based Spell Check

2010-12-06 Thread rajini maski
Yes, I want to use this spell check. I want to create the dictionary
myself and give it as input to Solr, because my indexes also contain
misspelled content, so I want Solr to refer to this file rather than an
auto-generated one. How do I get this done?

I will try the spell check as suggested by  michael...

One more main thing I wanted to know is: how do I extract the dictionary
generated by default? How do I read the .cfs files generated in the index
folder?

Please reply if you know anything related to this.


Awaiting reply




On Mon, Dec 6, 2010 at 7:33 PM, Erick Erickson erickerick...@gmail.com wrote:

 Are you sure you want spellcheck/autosuggest?

 Because what you're talking about almost sounds like
 synonyms.

 Best
 Erick

 On Mon, Dec 6, 2010 at 1:37 AM, rajini maski rajinima...@gmail.com
 wrote:

  How does the solr file based spell check work?
 
  How do we need to enter data in the spelling.txt...I am not clear about
 its
  functionality..If anyone know..Please reply.
 
  I want to index a word = Wear
  But while searching I search as =Dress
  I want to get results for Wear.. How do i apply this functionality..
 
  Awaiting Reply
 



Re: FastVectorHighlighter ignoring fragmenter parameter . . .

2010-12-06 Thread CRB

Koji,

Thank you for the reply.

Being something of a novice with Solr, I would be grateful if you could 
clarify my next steps.


I infer from your reply that there is no current implementation yet 
contributed for the FVH similar to the regex fragmenter.


Thus I need to write my own custom extensions of the FragmentsBuilder
(http://lucene.apache.org/java/3_0_1/api/contrib-fast-vector-highlighter/org/apache/lucene/search/vectorhighlight/FragmentsBuilder.html)
and FragListBuilder
(http://lucene.apache.org/java/3_0_1/api/contrib-fast-vector-highlighter/org/apache/lucene/search/vectorhighlight/FragListBuilder.html)
interfaces to take in and apply the regex.


I would be happy to contribute back what I create.

Appreciate whatever guidance you can offer,

Christopher

On 2:59 PM, Koji Sekiguchi wrote:

(10/12/05 5:53), CRB wrote:
Got the FVH to work in Solr 3.1 (or at least I presume I have, given I
can see multi-color highlighting in the output).

But I am not able to get it to recognize the regex fragmenter. I get no
change in output if I specify the fragmenter. In fact, I can even enter
bogus names for the fragmenter and get no change in the output.

Grateful for any suggestions.

Settings and output below.

Christopher


*Query*

http://localhost:8983/solr/10k-Fragments/select?
q=content%3Aliquidity
&rows=100
&fl=id%2Ccontent
&qt=standard
&hl.fl=content
&hl.useFastVectorHighlighter=true
&hl=true
&hl.fragmentsBuilder=colored
&hl.fragmenter=regex


Christopher,

Because the algorithm of FVH is totally different from the (traditional)
highlighter, FVH doesn't look at hl.fragmenter and hl.formatter, but at
hl.fragListBuilder and hl.fragmentsBuilder instead. I think your settings
and request/response look good, except for hl.fragmenter=regex: FVH simply
ignores that parameter.

Koji
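In other words, a request using the FVH-specific parameters might look like
this (the hl.fragListBuilder value here is illustrative):

```
http://localhost:8983/solr/select?q=content:liquidity&hl=true
  &hl.useFastVectorHighlighter=true
  &hl.fragListBuilder=simple
  &hl.fragmentsBuilder=colored
```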




Re: Query performance very slow even after autowarming

2010-12-06 Thread Alexey Serba
* Do you use EdgeNGramFilter in the index analyzer only, or on the query
side as well?

* What if you create an additional field first_letter (string) and put the
first character/characters (multivalued?) there in your external
processing code? Then during search you can filter all documents that
start with the letter 'a' using an fq=first_letter:a filter query. Would
that solve your performance problems?

* It would help to specify what you are trying to achieve, so that more
people can help you with it.
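A sketch of the second suggestion (the field name and type are illustrative):

```xml
<!-- schema.xml: a cheap exact-match field, populated by your external
     processing code with the first character(s) of each document -->
<field name="first_letter" type="string" indexed="true" stored="false"
       multiValued="true"/>
```

Queries would then combine the filter with the main query, e.g.
/select?q=m&fq=first_letter:m.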

On Fri, Dec 3, 2010 at 10:47 AM, johnnyisrael johnnyi.john...@gmail.com wrote:

 Hi,

 I am using EdgeNGramFilterFactory on Solr 1.4.1 [<filter
 class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>]
 for my indexing.

 Each document will have about 5 fields in it and only one field is indexed
 with EdgeNGramFilterFactory.

 I have about 1.4 million documents in my index now and my index size is
 approx 296MB.

 I made the field that is indexed with EdgeNGramFilterFactory as default
 search field. All my query responses are very slow, some of them taking more
 than 10seconds to respond.

 All my query responses are very slow, Queries with single letters are still
 very slow.

 /select/?q=m

 So I tried query warming as follows.

 <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">a</str></lst>
        <lst><str name="q">b</str></lst>
        <lst><str name="q">c</str></lst>
        <lst><str name="q">d</str></lst>
        <lst><str name="q">e</str></lst>
        <lst><str name="q">f</str></lst>
        <lst><str name="q">g</str></lst>
        <lst><str name="q">h</str></lst>
        <lst><str name="q">i</str></lst>
        <lst><str name="q">j</str></lst>
        <lst><str name="q">k</str></lst>
        <lst><str name="q">l</str></lst>
        <lst><str name="q">m</str></lst>
        <lst><str name="q">n</str></lst>
        <lst><str name="q">o</str></lst>
        <lst><str name="q">p</str></lst>
        <lst><str name="q">q</str></lst>
        <lst><str name="q">r</str></lst>
        <lst><str name="q">s</str></lst>
        <lst><str name="q">t</str></lst>
        <lst><str name="q">u</str></lst>
        <lst><str name="q">v</str></lst>
        <lst><str name="q">w</str></lst>
        <lst><str name="q">x</str></lst>
        <lst><str name="q">y</str></lst>
        <lst><str name="q">z</str></lst>
      </arr>
 </listener>

 The same above is done for firstSearcher as well.

 My cache settings are as follows.

 <filterCache
      class="solr.LRUCache"
      size="16384"
      initialSize="4096"
      autowarmCount="4096"/>

 <queryResultCache
      class="solr.LRUCache"
      size="16384"
      initialSize="4096"
      autowarmCount="1024"/>

 <documentCache
      class="solr.LRUCache"
      size="16384"
      initialSize="16384"
      />

 Still after query warming, a few single-character searches take up to 3
 seconds to respond.

 Am I doing anything wrong in my cache settings or autowarm settings, or
 am I missing anything here?

 Thanks,

 Johnny
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Query-performance-very-slow-even-after-autowarming-tp2010384p2010384.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: dataimports response returns before done?

2010-12-06 Thread Alexey Serba
 After issueing a dataimport, I've noticed solr returns a response prior to 
 finishing the import. Is this correct?   Is there anyway i can make solr not 
 return until it finishes?
Yes, you can add synchronous=true to your request. But be aware that
it could take a long time and you may see an HTTP timeout exception.

 If not, how do I ping for the status whether it finished or not?
See command=status
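For example (host, port and handler path are the usual defaults; adjust to
your setup):

```
http://localhost:8983/solr/dataimport?command=full-import&synchronous=true
http://localhost:8983/solr/dataimport?command=status
```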


On Fri, Dec 3, 2010 at 8:55 PM, Tri Nguyen tringuye...@yahoo.com wrote:
 Hi,

 After issueing a dataimport, I've noticed solr returns a response prior to 
 finishing the import. Is this correct?   Is there anyway i can make solr not 
 return until it finishes?

 If not, how do I ping for the status whether it finished or not?

 thanks,

 tri


How to get all the search results?

2010-12-06 Thread Solr User
Hi,

First off thanks to the group for guiding me to move from default search
handler to dismax.

I have a question related to getting all the search results. In the past
with the default search handler I was getting all the search results (8000)
if I pass q=* as search string but with dismax I was getting only 16 results
instead of 8000 results.

How to get all the search results using dismax? Do I need to configure
anything to make * (asterisk) work?

Thanks,
Solr User


Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Alexey Serba
Hey Juan,

It seems that DataImportHandler is not the right tool for your scenario,
and you'd be better off using the Solr XML update protocol.
* http://wiki.apache.org/solr/UpdateXmlMessages

You can still work around your outdated-GUI-view problem by calling DIH
synchronously, i.e. adding synchronous=true to your request. But that
won't solve the problem of two parallel requests from two users to a
single DIH request handler, because DIH doesn't support that: if the
previous request is still running, it bounces the second request.

HTH,
Alex
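A minimal update message of that kind, with illustrative field names, posted
to /update and followed by a commit:

```xml
<add>
  <doc>
    <field name="id">42</field>
    <field name="state">closed</field>
  </doc>
</add>

<!-- then, as a separate request: -->
<commit/>
```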



On Fri, Dec 3, 2010 at 10:33 PM, Juan Manuel Alvarez naici...@gmail.com wrote:
 Hello everyone! I would like to ask you a question about DIH.

 I am using a database and DIH to sync against Solr, and a GUI to
 display and operate on the items retrieved from Solr.
 When I change the state of an item through the GUI, the following happens:
 a. The item is updated in the DB.
 b. A delta-import command is fired to sync the DB with Solr.
 c. The GUI is refreshed by making a query to Solr.

 My problem comes between (b) and (c). The delta-import operation is
 executed in a new thread, so my call returns immediately, refreshing
 the GUI before the Solr index is updated causing the item state in the
 GUI to be outdated.

 I had two ideas so far:
 1. Querying the status of the DIH after the delta-import operation and
 do not return until it is idle. The problem I see with this is that
 if other users execute delta-imports, the status will be busy until
 all operations are finished.
 2. Use Zoie. The first problem is that configuring it is not as
 straightforward as it seems, so I don't want to spend more time trying
 it until I am sure that this will solve my issue. On the other hand, I
 think that I may suffer the same problem since the delta-import is
 still firing in another thread, so I can't be sure it will be called
 fast enough.

 Am I pointing on the right direction or is there another way to
 achieve my goal?

 Thanks in advance!
 Juan M.



Re: How to get all the search results?

2010-12-06 Thread Savvas-Andreas Moysidis
Hello,

shouldn't that query syntax be *:* ?

Regards,
-- Savvas.

On 6 December 2010 16:10, Solr User solr...@gmail.com wrote:

 Hi,

 First off thanks to the group for guiding me to move from default search
 handler to dismax.

 I have a question related to getting all the search results. In the past
 with the default search handler I was getting all the search results (8000)
 if I pass q=* as search string but with dismax I was getting only 16
 results
 instead of 8000 results.

 How to get all the search results using dismax? Do I need to configure
 anything to make * (asterisk) work?

 Thanks,
 Solr User



Re: How to get all the search results?

2010-12-06 Thread Shawn Heisey
With dismax, I didn't get any results with *:*.  I did the query with 
these options (q is empty) and got the full rowcount:


q=&rows=0&qt=dismax

I have q.alt defined in my dismax handler as *:*, don't know if that is 
required or not.
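For reference, the relevant part of such a handler definition might look like
this (the handler name is illustrative):

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- fallback query used when q is absent or empty -->
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```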


Shawn


On 12/6/2010 9:17 AM, Savvas-Andreas Moysidis wrote:

Hello,

shouldn't that query syntax be *:* ?

Regards,
-- Savvas.

On 6 December 2010 16:10, Solr User solr...@gmail.com wrote:


Hi,

First off thanks to the group for guiding me to move from default search
handler to dismax.

I have a question related to getting all the search results. In the past
with the default search handler I was getting all the search results (8000)
if I pass q=* as search string but with dismax I was getting only 16
results
instead of 8000 results.

How to get all the search results using dismax? Do I need to configure
anything to make * (asterisk) work?

Thanks,
Solr User





Re: How to get all the search results?

2010-12-06 Thread Peter Karich

 for dismax, just pass an empty query (q=) or none at all


Hello,

shouldn't that query syntax be *:* ?

Regards,
-- Savvas.

On 6 December 2010 16:10, Solr User solr...@gmail.com wrote:


Hi,

First off thanks to the group for guiding me to move from default search
handler to dismax.

I have a question related to getting all the search results. In the past
with the default search handler I was getting all the search results (8000)
if I pass q=* as search string but with dismax I was getting only 16
results
instead of 8000 results.

How to get all the search results using dismax? Do I need to configure
anything to make * (asterisk) work?

Thanks,
Solr User




--
http://jetwick.com twitter search prototype



Index version on slave nodes

2010-12-06 Thread Markus Jelsma
Hi,

The indexversion command in the replicationHandler on slave nodes returns 0 
for indexversion and generation while the details command does return the 
correct information. I haven't found an existing ticket on this one although 
https://issues.apache.org/jira/browse/SOLR-1573 has similarities.

Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Stored field value modification

2010-12-06 Thread Emmanuel Bégué
Hello,

Is it possible to manipulate the value of a field before it is stored?

I'm indexing a database where some fields contain raw HTML, including
named character entities.

Using solr.HTMLStripCharFilterFactory in the index analyzer results
in this HTML being correctly stripped, and named character entities
replaced by the corresponding characters, in the index (as verified
when searching, and with Luke).

But the stored values of the documents are stored unmodified, so the
result sets, including highlights, contain HTML tags (that are
escaped) and entities (where the leading '&' is also escaped), which
makes handling the results quite difficult.

So, is it possible to apply some filters to the data before it is
stored in the non-indexed fields?

I couldn't find a part of the documentation that said whether it was
possible or not; I did find this message in the archives of this list:

 From: Noble Paul
 Sent: Tuesday, March 31, 2009 5:41 PM
 Subject: Re: indexed fields vs stored fields

 indexed = can be searched (mean you can use this to query). This
undergoes tokenization filter etc
 stored = can be retrieved. No modification to the data. This is
stored verbatim

which seems to say that it is not possible; but maybe things have
changed since then?

Any other ideas? Given that:
- I have zero control over what is stored in the database
- using the Solr XML update protocol I could probably transform the
data before sending it
- ... but I'd much rather continue using DataImportHandler to access
the database

Thanks,
Regards,
EB


Re: Stored field value modification

2010-12-06 Thread Markus Jelsma
Hi,

You can create a custom update request processor [1] to strip unwanted input 
as it is about to enter the index.

[1]: http://wiki.apache.org/solr/UpdateRequestProcessor
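Wiring such a processor into the update chain would look roughly like this
(the factory class name is hypothetical; you would supply your own):

```xml
<updateRequestProcessorChain name="htmlstrip" default="true">
  <!-- com.example.HtmlStripProcessorFactory is a hypothetical custom factory -->
  <processor class="com.example.HtmlStripProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```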

Cheers,

On Monday 06 December 2010 17:36:09 Emmanuel Bégué wrote:
 Hello,
 
 Is it possible to manipulate the value of a field before it is stored?
 
 I'm indexing a database where some field contain raw HTML, including
 named character entities.
 
 Using solr.HTMLStripCharFilterFactory on the index analyzer, results
 in this HTML being correctly stripped, and named character entities
 replaced by the corresponding characters, in the index (as verified
 when searching, and with Luke).
 
 But, the stored values of the documents are stored unmodified, so the
 result sets, including highlights, contain HTML tags (that are
 escaped) and entities (where the leading '' is also escaped) which
 make handling the results quite difficult.
 
 So, is it possible to apply some filters to the data before it is
 stored in the non-indexed fields?
 
 I couldn't find a part of the documentation that said whether it was
 
 possible or not; I did find this message in the archives of this list:
  From: Noble Paul
  Sent: Tuesday, March 31, 2009 5:41 PM
  Subject: Re: indexed fields vs stored fields
  
  indexed = can be searched (mean you can use this to query). This
 
 undergoes tokenization filter etc
 
  stored = can be retrieved. No modification to the data. This is
 
 stored verbatim
 
 which seems to say that it is not possible; but maybe things have
 changed since then?
 
 Any other idea? given that:
 - I have zero control over what is stored in the database
 - using the Solr XML update protocol i could probably transform the
 data before sending it
 - ... but I'd much rather continue using DataImportHandler to access
 the database
 
 Thanks,
 Regards,
 EB

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Taxonomy and Faceting

2010-12-06 Thread webdev1977

I have been digging through the user lists for Solr and Nutch, as well as
reading lots of blogs, etc. I have yet to find a clear answer (maybe there
is none).

I am trying to find the best way ahead for choosing a technology that will
allow the ability to use a large taxonomy for classifying structured and
unstructured data, and then displaying those categorizations as facets to
the user during search.

There seem to be several approaches, some of which make use of index-time
encoding of the terms found in the text, but I have seen no mention of HOW
to get those terms from the text. Some sort of text classification
software, I am assuming. If this is true, are there any good open source
engines that can process text against a taxonomy?

The other approach seems to be two patches being developed for Solr 3.0,
SOLR-792 and SOLR-64. Again, I think you would have to have some sort of
an engine to give you this information, which could then be added at index
time.

I have also seen some interesting literature on using Drupal and the Solr
module.  

My current architecture uses Nutch (1.2) for crawling, solrindex for
indexing (Solr 1.4.1), and Ajax Solr for my UI.

I have also seen some talk in webinars/etc from Lucid Imagination about
upcoming development on Native Taxonomy Facets, any idea where that
development stands?

I have to use the most stable version of Solr/Nutch/Lucene possible for my
implementation, because, unfortunately, once I choose, going back will be
next to impossible for years to come!

Thanks!




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Taxonomy-and-Faceting-tp2028442p2028442.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to get all the search results?

2010-12-06 Thread Savvas-Andreas Moysidis
Ahhh, right. In dismax, you pre-define the fields that will be searched
upon, is that right? Is it also true that the query is parsed and all
special characters escaped?

On 6 December 2010 16:25, Peter Karich peat...@yahoo.de wrote:

  for dismax just pass an empty query all q= or none at all


  Hello,

 shouldn't that query syntax be *:* ?

 Regards,
 -- Savvas.

On 6 December 2010 16:10, Solr User solr...@gmail.com wrote:

  Hi,

 First off thanks to the group for guiding me to move from default search
 handler to dismax.

 I have a question related to getting all the search results. In the past
 with the default search handler I was getting all the search results
 (8000)
 if I pass q=* as search string but with dismax I was getting only 16
 results
 instead of 8000 results.

 How to get all the search results using dismax? Do I need to configure
 anything to make * (asterisk) work?

 Thanks,
 Solr User



 --
 http://jetwick.com twitter search prototype




Re: Stored field value modification

2010-12-06 Thread Ahmet Arslan
 - I have zero control over what is stored in the database
 - using the Solr XML update protocol i could probably
 transform the
 data before sending it
 - ... but I'd much rather continue using DataImportHandler
 to access
 the database


If you are already using DIH, 
http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer can do what 
you want.
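A sketch of how the transformer is attached to an entity (table and column
names are illustrative):

```xml
<entity name="doc" transformer="HTMLStripTransformer"
        query="SELECT id, body FROM docs">
  <!-- stripHTML="true" tells the transformer to clean this column -->
  <field column="body" stripHTML="true"/>
</entity>
```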


  


Re: Index version on slave nodes

2010-12-06 Thread Xin Li
I think this is expected behavior. You have to issue the details
command to get the real indexversion for slave machines.

Thanks,
Xin

On Mon, Dec 6, 2010 at 11:26 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi,

 The indexversion command in the replicationHandler on slave nodes returns 0
 for indexversion and generation while the details command does return the
 correct information. I haven't found an existing ticket on this one although
 https://issues.apache.org/jira/browse/SOLR-1573 has similarities.

 Cheers,

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350



Dynamically filtering results based on score

2010-12-06 Thread Bryan Barkley
I've seen references to score filtering in the list archives with frange
being the suggested solution, but I have a slightly different problem that I
don't think frange will solve. I basically want to drop a portion of the
results based on their score in relation to the other scores in the result
set. I've found that some queries produce poor results because they are
matching solely based on a field with a very low boost (a product
description in my case). Looking at the scores it's very obvious when the
result set transitions from good matches to just those pulled in by the
description.

I've come up with a solution on the client side of things, but need to move
this to running within solr because it doesn't play well with facets (facet
data is still returned for products that I'm stripping out). The basic
approach is to keep a running average of the highest scores, and when a
document's score is off by an order of magnitude drop it and everything else
(assuming everything is sorted by score desc). This approach seems to work
well because in some cases when users just enter 'long tail' terms I want
results to still be returned, which a static lower bound in frange won't
accommodate.
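For context, the static frange approach being referenced looks something
like this (the cutoff value is illustrative, and a fixed one is exactly
what won't adapt here):

```
fq={!frange l=0.5}query($q)
```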

Does anyone have any suggestions for an approach to this? It doesn't look
like a filter has access to the scores. It doesn't look like I can subclass
SolrIndexSearcher as a number of its methods are private and can't be
overridden. It doesn't look like I can modify the ResponseBuilder's results
docset after the query but before faceting is applied because I don't have
access to the scorer (at least in a SearchComponent). I'm out of ideas for
now.

Thanks for any assistance,
  Bryan


DIH - rdbms to index confusion

2010-12-06 Thread kmf

I'm new to solr (and indexing in general) and am having a hard time making
the transition from rdbms to indexing in terms of the DIH/data-config.xml
file.  I've successfully created a working index (so far) for the simple
queries in my db, but I'm struggling to add a more complex query.  When I
say simple I mean one or two tables and when I say complex I'm referring to
3 plus.

I have a table that contains the data values I'm wanting to return when
someone makes a search.  This table has, in addition to the data values, 3
id's (FKs) pointing to the data/info that I'm wanting the users to be able
to search on (while also returning the data values).

The general rdbms query would be something like:
select f.value, g.gar_name, c.cat_name from foo f, gar g, cat c, dub d
where g.id=f.gar_id
and c.id=f.cat_id
and d.id=f.dub_id

I tried following the item_category entity used in the DIH example here:
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example
and am struggling to get it to work.
 
My current attempt looks like (entity translated to the above rdbms query):
<dataConfig>
   <dataSource ... />
   <document>
     <entity ...>  <!-- simple query, working, for the main entity "cat" -->
       <field ... />

       <entity name="foo" query="SELECT gar_id FROM foo
                                 WHERE cat_id='${cat.id}'">
         <entity name="gar" query="SELECT name FROM gar
                                   WHERE id='${f.gar_id}'">
           <field column="name" name="g_name"/>
         </entity>

         <entity name="dub" query="SELECT name FROM dub
                                   WHERE id='${f.dub_id}'">
           <field column="name" name="dub_name"/>
         </entity>
         <field column="value" name="f_value"/>
       </entity>

       <!-- other working entities -->
     </entity>
   </document>
</dataConfig>

I'm getting some of the data/info back, but it's not what I am expecting. 
I'm hoping for/expecting a document/record to look like:
cat_name 1 : g_name 1 : dub_name 1 : f_value 1
cat_name 1 : g_name 1 : dub_name 2 : f_value 2 
cat_name 1 : g_name 2 : dub_name 1 : f_value 1
cat_name 1 : g_name 2 : dub_name 2 : f_value 2 
cat_name 2 : g_name 1 : dub_name 1 : f_value 1
cat_name 2 : g_name 1 : dub_name 2 : f_value 2 
cat_name 2 : g_name 2 : dub_name 1 : f_value 1
cat_name 2 : g_name 2 : dub_name 2 : f_value 2 

(All but the values are showing up in the index in some form)

Any suggestions on where my logic is failing?
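One thing worth checking in a config like this: DIH resolves ${...}
variables by entity name, so with the outer entity named "foo" the inner
queries would reference ${foo.gar_id} rather than ${f.gar_id}, and the
outer query would need to select every column the inner entities and the
value field use. A sketch with the names aligned:

```xml
<entity name="foo"
        query="SELECT value, gar_id, dub_id FROM foo WHERE cat_id='${cat.id}'">
  <entity name="gar" query="SELECT name FROM gar WHERE id='${foo.gar_id}'">
    <field column="name" name="g_name"/>
  </entity>
  <entity name="dub" query="SELECT name FROM dub WHERE id='${foo.dub_id}'">
    <field column="name" name="dub_name"/>
  </entity>
  <field column="value" name="f_value"/>
</entity>
```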

Thanks

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-rdbms-to-index-confusion-tp2028543p2028543.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Stored field value modification

2010-12-06 Thread Emmanuel Bégué
2010/12/6 Ahmet Arslan iori...@yahoo.com:

 If you are already using DIH, 
 http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer can do 
 what you want.

Indeed it can. Many thanks.


Re: Question about Solr Fieldtypes, Chaining of Tokenizers

2010-12-06 Thread Matthew Hall

Yes, that's my conclusion as well Grant.

As for the example output:

"The symposium of TgThe(RX3fg+and) gene studies"

Should end up tokenizing to:

symposium tg the rx3fg and gene studi

Assuming I guessed right on the stemming.

Anyhow, thanks for the confirmation guys.

Matt

On 12/4/2010 8:18 PM, Grant Ingersoll wrote:

Could you expand on your example and show the output you want?  FWIW, you could 
simply write a token filter that does the same thing as the WhitespaceTokenizer.

-Grant

On Dec 3, 2010, at 1:14 PM, Matthew Hall wrote:


Hey folks, I'm working with a fairly specific set of requirements for our 
corpus that needs a somewhat tricky text type for both indexing and searching.

The chain currently looks like this:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
   pattern="(.*?)(\p{Punct}*)$"
   replacement="$1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
   ignoreCase="true"
   words="stopwords.txt"
   enablePositionIncrements="true"
/>
<filter class="solr.SnowballPorterFilterFactory" language="English"
   protected="protwords.txt"/>
<filter class="solr.PatternReplaceFilterFactory"
   pattern="\p{Punct}"
   replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>

Now you will notice that I'm trying to add in a second tokenizer to this chain 
at the very end, this is due to the final replacement of punctuation to 
whitespace.  At that point I'd like to further break up these tokens to smaller 
tokens.

The reason for this is that we have a corpus mixing normal English words
and scientific terms. For example, you could expect a string like "The
symposium of TgThe(RX3fg+and) gene studies" to be added to the index, and
parts of those phrases to be searched on.

We want to be able to remove the stopwords in the mostly English parts of
these types of statements, which the whitespace tokenizer, followed by
removing trailing punctuation, followed by the stop filter, takes care of.
We do not want to remove references to genetic information contained in
allele symbols and the like.

Sadly as far as I can tell, you cannot chain tokenizers in the schema.xml, so 
does anyone have some suggestions on how this could be accomplished?

Oh, and let me add that the WordDelimiterFilter comes really close to what I want, but since we are 
unwilling to promote our solr version to the trunk (we are on the 1.4x) version atm, the inability 
to turn off the automatic phrase queries makes it a no go.  We need to be able to make searches on 
left/right match right/left.

My searches through the old material on this subject aren't really showing
me much, except some advice on using the copyField attribute. But my
understanding is that this will simply take your original input to the
field and then analyze it in two different ways, depending on the field
definitions. It would be very nice if it were copying the already-analyzed
version of the text... but that's not what it's doing, right?

Thanks for any advice on this matter.

Matt



--
Grant Ingersoll
http://www.lucidimagination.com





Using Saxon 9 as a response writer with Solr 3.1 . . ?

2010-12-06 Thread CRB

Has anyone been able to get Saxon 9 working with Solr3.1?

I was following the wiki page
(http://wiki.apache.org/solr/XsltResponseWriter), placing all the
saxon-*.jars in Jetty's lib/ext folder and starting with


java
-Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl
-jar start.jar

But I get an ugly dump of errors from Jetty:

   2010-12-06 13:29:16.515::WARN:  failed SolrRequestFilter
   java.lang.NoSuchMethodError: net.sf.saxon.dom.DOMEnvelope.getInstance()Lnet/sf/saxon/dom/DOMEnvelope;
        at net.sf.saxon.java.JavaPlatform.initialize(JavaPlatform.java:43)
        at net.sf.saxon.Configuration.<init>(Configuration.java:392)
        at net.sf.saxon.Configuration.<init>(Configuration.java:311)
        at net.sf.saxon.xpath.XPathFactoryImpl.makeConfiguration(XPathFactoryImpl.java:41)
        at net.sf.saxon.xpath.XPathFactoryImpl.<init>(XPathFactoryImpl.java:26)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at java.lang.Class.newInstance0(Unknown Source)
        at java.lang.Class.newInstance(Unknown Source)
        at javax.xml.xpath.XPathFactoryFinder.loadFromService(Unknown Source)
        at javax.xml.xpath.XPathFactoryFinder._newFactory(Unknown Source)
        at javax.xml.xpath.XPathFactoryFinder.newFactory(Unknown Source)
        at javax.xml.xpath.XPathFactory.newInstance(Unknown Source)
        at javax.xml.xpath.XPathFactory.newInstance(Unknown Source)
        at org.apache.solr.core.Config.<clinit>(Config.java:50)
        at org.apache.solr.servlet.SolrDispatchFilter.<init>(SolrDispatchFilter.java:68)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at java.lang.Class.newInstance0(Unknown Source)
        at java.lang.Class.newInstance(Unknown Source)
        at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:153)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:94)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
        at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
        at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
        at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
        at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
        at org.mortbay.jetty.Server.doStart(Server.java:210)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.mortbay.start.Main.invokeMain(Main.java:183)
        at org.mortbay.start.Main.start(Main.java:497)
        at org.mortbay.start.Main.main(Main.java:115)




Re: FastVectorHighlighter ignoring fragmenter parameter . . .

2010-12-06 Thread CRB

Koji,

Thank you for the reply.

Being something of a novice with Solr, I would be grateful if you could 
clarify my next steps.


I infer from your reply that there is no current implementation yet 
contributed for the FVH similar to the regex fragmenter.


Thus I need to write my own custom extensions of the FragmentsBuilder & 
FragListBuilder interfaces to take in and apply the regex.


I would be happy to contribute back what I create.

Appreciate whatever guidance you can offer,

Christopher


Re: DIH - rdbms to index confusion

2010-12-06 Thread Alexey Serba
 I have a table that contains the data values I'm wanting to return when
 someone makes a search.  This table has, in addition to the data values, 3
 id's (FKs) pointing to the data/info that I'm wanting the users to be able
 to search on (while also returning the data values).

 The general rdbms query would be something like:
 select f.value, g.gar_name, c.cat_name from foo f, gar g, cat c, dub d
 where g.id=f.gar_id
 and c.id=f.cat_id
 and d.id=f.dub_id

You can put this general rdbms query as-is into a single DIH entity - no
need to split it.

You would probably want to split it if your main table had a one-to-many
relation with other tables, so that you couldn't retrieve all the data and
have a single result-set row per Solr document.
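As a sketch of the single-entity approach Alexey describes (column and field names taken from the question; the real dataSource settings are omitted), the data-config entity might look like:

```xml
<!-- One DIH entity per Solr document: the whole join in one query. -->
<entity name="foo"
        query="SELECT f.value, g.gar_name, c.cat_name
               FROM foo f, gar g, cat c, dub d
               WHERE g.id=f.gar_id AND c.id=f.cat_id AND d.id=f.dub_id">
  <field column="value"    name="value"/>
  <field column="gar_name" name="gar_name"/>
  <field column="cat_name" name="cat_name"/>
</entity>
```

Each result-set row then maps to exactly one Solr document, with the searchable names and the returned value stored together.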


Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Juan Manuel Alvarez
Alex:

Thanks for the quick reply.

When you say "two parallel requests from two users to single DIH
request handler", what do you mean by "request handler"? Are you
referring to the HTTP request? Would that mean that if I make the
request from different HTTP sessions it would work?

Cheers!
Juan M.

On Mon, Dec 6, 2010 at 1:12 PM, Alexey Serba ase...@gmail.com wrote:
 Hey Juan,

 It seems that DataImportHandler is not a right tool for your scenario
 and you'd better use Solr XML update protocol.
 * http://wiki.apache.org/solr/UpdateXmlMessages

 You still can work around your outdated GUI view problem with calling
 DIH synchronously, by adding synchronous=true to your request. But it
 won't solve the problem with two parallel requests from two users to
 single DIH request handler, because DIH doesn't support that, and if
 previous request is still running it bounces the second request.

 HTH,
 Alex
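
For reference, the update protocol Alexey points to lets the client post a change and commit in one synchronous HTTP call; a minimal message (the document fields below are illustrative, not from this thread) would be POSTed to /solr/update?commit=true:

```xml
<!-- The HTTP response returns only after the commit finishes,
     so an immediate follow-up query sees the new state. -->
<add>
  <doc>
    <field name="id">42</field>
    <field name="state">done</field>
  </doc>
</add>
```

This sidesteps both problems: no background DIH thread, and no single shared import that can bounce a second user's request.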



 On Fri, Dec 3, 2010 at 10:33 PM, Juan Manuel Alvarez naici...@gmail.com 
 wrote:
 Hello everyone! I would like to ask you a question about DIH.

 I am using a database and DIH to sync against Solr, and a GUI to
 display and operate on the items retrieved from Solr.
 When I change the state of an item through the GUI, the following happens:
 a. The item is updated in the DB.
 b. A delta-import command is fired to sync the DB with Solr.
 c. The GUI is refreshed by making a query to Solr.

 My problem comes between (b) and (c). The delta-import operation is
 executed in a new thread, so my call returns immediately, refreshing
 the GUI before the Solr index is updated causing the item state in the
 GUI to be outdated.

 I had two ideas so far:
 1. Querying the status of the DIH after the delta-import operation and
 do not return until it is idle. The problem I see with this is that
 if other users execute delta-imports, the status will be busy until
 all operations are finished.
 2. Use Zoie. The first problem is that configuring it is not as
 straightforward as it seems, so I don't want to spend more time trying
 it until I am sure that this will solve my issue. On the other hand, I
 think that I may suffer the same problem since the delta-import is
 still firing in another thread, so I can't be sure it will be called
 fast enough.

 Am I pointing on the right direction or is there another way to
 achieve my goal?

 Thanks in advance!
 Juan M.




Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Alexey Serba
 When you say two parallel requests from two users to single DIH
 request handler, what do you mean by request handler?
I mean DIH.

 Are you
 referring to the HTTP request? Would that mean that if I make the
 request from different HTTP sessions it would work?
No.

It means that when two users simultaneously change two
objects in the UI, you get two HTTP requests to DIH to pull
changes from the db into the Solr index. If the second request comes while
the first is not fully processed, the second request will be
rejected. As a result your index would be outdated (without the latest
update) until the next update.


Re: Taxonomy and Faceting

2010-12-06 Thread Peter Karich
 I'm unsure but maybe you mean something like clustering? Then carrot^2 
can do this (at index time I think):

http://search.carrot2.org/stable/search?query=jetwickview=visu
(There is a plugin for solr)

Or do you already know the categories of your docs? E.g., do you already 
have a category tree and associated documents?


Regards,
Peter.


I have been digging through the user lists for Solr and Nutch, as well as
reading lots of blogs, etc.  I have yet to find a clear answer (maybe there
is none )

I am trying to find the best way ahead for choosing a technology that will
allow the ability to use a large taxonomy for classifying structured and
unstructured data and then displaying those categorizations as facets to the
user during search.

There seems to be several approaches, some of which make use of index time
for encoding the terms found in the text, but I have seen no mention of HOW
to get those terms from the text.  Some sort of text classification software
I am assuming.  If this is true, are there any good open source engines that
can process text against a taxonomy?

The other approach seems to be two patches being developed for Solr 3.0, 792
and 64.  Again, I think you would have to have some sort of an engine to
give you this information that could then be added at index time.

I have also seen some interesting literature on using Drupal and the Solr
module.

My current architecture uses Nutch (1.2) for crawling, solrindex for indexing
(Solr 1.4.1), and Ajax Solr for my UI.

I have also seen some talk in webinars/etc from Lucid Imagination about
upcoming development on Native Taxonomy Facets, any idea where that
development stands?

I have to use the most stable version of Solr/Nutch/Lucene possible for my
implementation, because, unfortunately, once I choose, going back will be
next to impossible for years to come!

Thanks!




Re: Taxonomy and Faceting

2010-12-06 Thread webdev1977

Thanks for the quick response!  

I was thinking more about the idea of having both structured and unstructured
data coming into a system to be indexed/searched.  I would like these
documents to be processed by some sort of entity/keyword/semantic
processing.  I have a well defined taxonomy for my organization (it is quite
large) and at the moment we use RetrievalWare to give keyword/classification
suggestions.  This does NOT work well though, and RetrievalWare is pretty
much useless to us.  

I want a way to do this process either at index time or search time.  All
documents should be processed against this taxonomy.  I do not want the user
to be able to nominate keywords, it must happen automatically.   I am
assuming it is only natural for these keywords/taxonomy entities to show up
as hierarchical facets?

From what I can tell, there is no way to tell Solr.. here is my taxonomy..
classify my documents and give me back facets and facet counts.. 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Taxonomy-and-Faceting-tp2028442p2029636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Syncing 'delta-import' with 'select' query

2010-12-06 Thread Juan Manuel Alvarez
Thanks for all the help! It is really appreciated.

For now, I can afford the parallel requests problem, but when I put
synchronous=true in the delta import, the call still returns with
outdated items.
Examining the log, it seems that the commit operation is being
executed after the operation returns, even when I am using
commit=true.
Is it possible to also execute the commit synchronously?

Cheers!
Juan M.

On Mon, Dec 6, 2010 at 4:29 PM, Alexey Serba ase...@gmail.com wrote:
 When you say two parallel requests from two users to single DIH
 request handler, what do you mean by request handler?
 I mean DIH.

 Are you
 refering to the HTTP request? Would that mean that if I make the
 request from different HTTP sessions it would work?
 No.

 It means that when you have two users that simultaneously changed two
 objects in the UI then you have two HTTP requests to DIH to pull
 changes from the db into Solr index. If the second request comes when
 the first is not fully processed then the second request will be
 rejected. As a result your index would be outdated (w/o the latest
 update) until the next update.



high CPU usage and SelectCannelConnector threads used a lot

2010-12-06 Thread John Russell
Hi,
I'm using solr and have been load testing it for around 4 days.  We use the
solrj client to communicate with a separate jetty based solr process on the
same box.

After a few days solr's CPU% is now consistently at or above 100% (multiple
processors available) and the application using it is mostly not responding
because it times out talking to solr. I connected visual VM to the solr JVM
and found that out of the many btpool-# threads there are 4 that are pretty
much stuck in the running state 100% of the time. Their names are

btpool0-1-Acceptor1 SelectChannelConnector @0.0.0.0:9983
btpool0-2-Acceptor2 SelectChannelConnector @0.0.0.0:9983
btpool0-3-Acceptor3 SelectChannelConnector @0.0.0.0:9983
btpool0-9-Acceptor0 SelectChannelConnector @0.0.0.0:9983



The stacks are all the same

btpool0-2 - Acceptor2 SelectChannelConnector @ 0.0.0.0:9983 - Thread
t...@27
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <106a644> (a sun.nio.ch.Util$1)
- locked <18dd381> (a java.util.Collections$UnmodifiableSet)
- locked <38d07d> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at
org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:419)
at
org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:169)
at
org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

   Locked ownable synchronizers:
- None

All of the other idle thread pool threads are just waiting for new tasks.
The active threads never seem to change, it's always these 4.  The selector
channel appears to be in the jetty code, receiving requests from our other
process through the solrj client.

Does anyone know what this might mean or how to address it? Are these
running all the time because they are blocked on IO so not actually
consuming CPU? If so, what else might be? Is there a better way to figure
out what is pinning the CPU?

Some more info that might be useful.

32 bit machine ( I know, I know)
2.7GB of RAM for solr process ~2.5 is used
According to visual VM around 25% of CPU time is spent in GC with the rest
in application.

Thanks for the help.

John


Re: DIH - rdbms to index confusion

2010-12-06 Thread kmf

I'm not understanding this response.  My main table does have a one to many
relationship with the other tables.  What should I be anticipating/wanting
for each document if I want to return to the user the values while allowing
them to search on the other terms?  

Thanks.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-rdbms-to-index-confusion-tp2028543p2030456.html
Sent from the Solr - User mailing list archive at Nabble.com.


only index synonyms

2010-12-06 Thread lee carroll
Hi. Can the following use case be achieved?

value to be analysed at index time: "this is a pretty line of text"

synonym list is: pretty => scenic, text => words

value placed in the index: "scenic words"

That is to say, only the matching synonyms. Basically I want to produce a
normalised set of phrases for faceting.

Cheers Lee C


Re: high CPU usage and SelectCannelConnector threads used a lot

2010-12-06 Thread Kent Fitch
Hi John,

sounds like this bug in NIO:

http://jira.codehaus.org/browse/JETTY-937

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933

I think recent versions of jetty work around this bug, or maybe try
the non-NIO socket connector

Kent
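
If you do try the non-NIO route, a rough jetty.xml fragment for Jetty 6's blocking connector (port taken from the thread; treat this as a sketch, not a tested config) would be:

```xml
<!-- Replaces the SelectChannelConnector; bio.SocketConnector uses one
     blocking thread per connection instead of an NIO selector loop. -->
<Call name="addConnector">
  <Arg>
    <New class="org.mortbay.jetty.bio.SocketConnector">
      <Set name="port">9983</Set>
      <Set name="maxIdleTime">30000</Set>
    </New>
  </Arg>
</Call>
```

The trade-off is more threads under high connection counts, but it avoids the spinning-selector bug linked above.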

On Tue, Dec 7, 2010 at 9:10 AM, John Russell jjruss...@gmail.com wrote:
 Hi,
 I'm using solr and have been load testing it for around 4 days.  We use the
 solrj client to communicate with a separate jetty based solr process on the
 same box.

 After a few days solr's CPU% is now consistently at or above 100% (multiple
 processors available) and the application using it is mostly not responding
 because it times out talking to solr. I connected visual VM to the solr JVM
 and found that out of the many btpool-# threads there are 4 that are pretty
 much stuck in the running state 100% of the time. Their names are

 btpool0-1-Acceptor1 SelectChannelConnector @0.0.0.0:9983
 btpool0-2-Acceptor2 SelectChannelConnector @0.0.0.0:9983
 btpool0-3-Acceptor3 SelectChannelConnector @0.0.0.0:9983
 btpool0-9-Acceptor0 SelectChannelConnector @0.0.0.0:9983



 The stacks are all the same

    btpool0-2 - Acceptor2 SelectChannelConnector @ 0.0.0.0:9983 - Thread
 t...@27
    java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked 106a644 (a sun.nio.ch.Util$1)
        - locked 18dd381 (a java.util.Collections$UnmodifiableSet)
        - locked 38d07d (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at
 org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:419)
        at
 org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:169)
        at
 org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
        at
 org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:516)
        at
 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

       Locked ownable synchronizers:
        - None

 All of the other idle thread pool threads are just waiting for new tasks.
 The active threads never seem to change, its always these 4.  The selector
 channel appears to be in the jetty code, receiving requests from our other
 process through the solrj client.

 Does anyone know what this might mean or how to address it? Are these
 running all the time because they are blocked on IO so not actually
 consuming CPU? If so, what else might be? Is there a better way to figure
 out what is pinning the CPU?

 Some more info that might be useful.

 32 bit machine ( I know, I know)
 2.7GB of RAM for solr process ~2.5 is used
 According to visual VM around 25% of CPU time is spent in GC with the rest
 in application.

 Thanks for the help.

 John



Re: only index synonyms

2010-12-06 Thread Erick Erickson
See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

with the "=>" syntax, I think that's what you're looking for

Best
Erick

On Mon, Dec 6, 2010 at 6:34 PM, lee carroll lee.a.carr...@googlemail.comwrote:

 Hi Can the following usecase be achieved.

 value to be analysed at index time this is a pretty line of text

 synonym list is pretty = scenic , text = words

 valued placed in the index is scenic words

 That is to say only the matching synonyms. Basically i want to produce a
 normalised set of phrases for faceting.

 Cheers Lee C



Re: FastVectorHighlighter ignoring fragmenter parameter . . .

2010-12-06 Thread Koji Sekiguchi

(10/12/06 23:52), CRB wrote:

Koji,

Thank you for the reply.

Being something of a novice with Solr, I would be grateful if you could clarify 
my next steps.

I infer from your reply that there is no current implementation yet contributed 
for the FVH similar
to the regex fragmenter.

Thus I need to write my own custom extensions of the FragmentsBuilder
(http://lucene.apache.org/java/3_0_1/api/contrib-fast-vector-highlighter/org/apache/lucene/search/vectorhighlight/FragmentsBuilder.html)
and FragListBuilder
(http://lucene.apache.org/java/3_0_1/api/contrib-fast-vector-highlighter/org/apache/lucene/search/vectorhighlight/FragListBuilder.html)
interfaces to take in and apply the regex.

I would be happy to contribute back what I create.

Appreciate whatever guidance you can offer,

Christopher


Christopher,

Thank you for being interested in FVH!

As I'm not sure a regex-fragmenter-like function can be implemented for FVH,
I cannot give you any advice. Sorry about that.
Basically, contribution back is always welcome!

Thank you,

Koji
--
http://www.rondhuit.com/en/


Re: Taxonomy and Faceting

2010-12-06 Thread Lance Norskog
That is correct. Solr is a search engine, not a text analysis engine.
There are a few open source text analysis systems: Weka, OpenNLP,
UIMA.

Someone is working on integrating UIMA with Solr:
https://issues.apache.org/jira/browse/SOLR-2129

But you should generally assume you will have a batch processing pass
over the data before indexing it.

On Mon, Dec 6, 2010 at 12:04 PM, webdev1977 webdev1...@gmail.com wrote:

 Thanks for the quick response!

 I was thinking more about the idea of having both structured and unstructured
 data coming into a system to be indexed/searched.  I would like these
 documents to be processed by some sort of entity/keyword/semantic
 processing.  I have a well defined taxonomy for my organization (it is quite
 large) and at the moment we use RetrievalWare to give keyword/classification
 suggestions.  This does NOT work well though, and RetrievalWare is pretty
 much useless to us.

 I want a way to do this process either at index time or search time.  All
 documents should be processed against this taxonomy.  I do not want the user
 to be able to nominate keywords, it must happen automatically.   I am
 assuming it is only natural for these keywords/taxonomy entities to show up
 as hierarchical facets?

 From what I can tell, there is no way to tell Solr.. here is my taxonomy..
 classify my documents and give me back facets and facet counts..
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Taxonomy-and-Faceting-tp2028442p2029636.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Solr Newbie - need a point in the right direction

2010-12-06 Thread Mark
Hi,

First time poster here - I'm not entirely sure where I need to look for this
information.

What I'm trying to do is extract some (presumably) structured information
from non-uniform data (e.g., prices from a Nutch crawl) that needs to show in
search queries, and I've come up against a wall.

I've been unable to figure out where is the best place to begin.

I had a look through the solr wiki and did a search via Lucid's search tool
and I'm guessing this is handled at index time through my schema? But I've
also seen dismax being thrown around as a possible solution and this has
confused me.

Basically, if you guys could point me in the right direction for resources
(even as much as saying, you need X, it's over there) that would be a huge
help.

Cheers

Mark


Out of memory error

2010-12-06 Thread sivaprasad

Hi,

When I try to import the data using DIH, I get an out-of-memory
error. Below are the configurations I have:

Database: MySQL
OS: Windows
No. of documents: 15525532
In db-config.xml I set the batch size to -1.

The Solr server is running on a Linux machine with Tomcat.
I set the Tomcat arguments as ./startup.sh -Xms1024M -Xmx2048M

Does anybody have an idea where things are going wrong?

Regards,
JS


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Out-of-memory-error-tp2031761p2031761.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Out of memory error

2010-12-06 Thread Fuad Efendi
Batch size -1??? Strange but could be a problem. 

Note also that you can't pass parameters to the default startup.sh command; you 
should modify setenv.sh instead
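
On a stock Tomcat 6 install, the JVM flags belong in a small setenv.sh that catalina.sh sources on startup; a minimal sketch (heap sizes copied from the post, file path assumed) is:

```shell
# $CATALINA_HOME/bin/setenv.sh -- sourced automatically by catalina.sh (Tomcat 6)
# The heap flags the poster tried to pass to startup.sh belong here instead.
export CATALINA_OPTS="-Xms1024M -Xmx2048M"
```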

--Original Message--
From: sivaprasad
To: solr-user@lucene.apache.org
ReplyTo: solr-user@lucene.apache.org
Subject: Out of memory error
Sent: Dec 7, 2010 12:03 AM


Hi,

When I try to import the data using DIH, I get an out-of-memory
error. Below are the configurations I have:

Database: MySQL
OS: Windows
No. of documents: 15525532
In db-config.xml I set the batch size to -1.

The Solr server is running on a Linux machine with Tomcat.
I set the Tomcat arguments as ./startup.sh -Xms1024M -Xmx2048M

Does anybody have an idea where things are going wrong?

Regards,
JS


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Out-of-memory-error-tp2031761p2031761.html
Sent from the Solr - User mailing list archive at Nabble.com.


Sent on the TELUS Mobility network with BlackBerry

Re: Problem with DIH delta-import delete.

2010-12-06 Thread Matti Oinas
Thanks Koji.

The problem seems to be that the TemplateTransformer is not applied when a delete
is performed.

...
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed ModifiedRowKey for Entity: entry rows obtained : 0
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed DeletedRowKey for Entity: entry rows obtained : 1223
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed parentDeltaQuery for Entity: entry
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder deleteAll
INFO: Deleting stale documents
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
INFO: Deleting document: 787
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
INFO: Deleting document: 786
...

There are entries with ids 787 and 786 in the database, and those are marked
as deleted. The query returns the right number of deleted documents and the right
rows from the database, but the delete fails because Solr uses the plain
numeric id when deleting the document. The same happens with blogs.
Matti
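
Since transformers are not applied to deletedPkQuery results, a commonly suggested workaround is to build the templated key in SQL itself, so the query returns the exact value stored in the uniqueKey field (MySQL CONCAT assumed; a sketch against the config below, not a tested fix):

```xml
<!-- Alias the concatenated value to the entity's pk column so DIH deletes
     by the same "blog-<id>" key the TemplateTransformer wrote at index time. -->
deletedPkQuery="SELECT CONCAT('blog-', id) AS id FROM blogs
                WHERE '${dataimporter.last_index_time}' &lt;= modified AND status=3"
```

The entry entity would need the analogous `CONCAT('entry-', f.id)` form.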


2010/12/4 Koji Sekiguchi k...@r.email.ne.jp:
 (10/11/17 20:18), Matti Oinas wrote:

 Solr does not delete documents from index although delta-import says
 it has deleted n documents from index. I'm using version 1.4.1.

 The schema looks like

 <fields>
    <field name="uuid" type="string" indexed="true" stored="true"
           required="true" />
    <field name="type" type="int" indexed="true" stored="true"
           required="true" />
    <field name="blog_id" type="int" indexed="true" stored="true" />
    <field name="entry_id" type="int" indexed="false" stored="true" />
    <field name="content" type="textgen" indexed="true" stored="true" />
 </fields>
 <uniqueKey>uuid</uniqueKey>


 Relevant fields from database tables:

 TABLE: blogs and entries both have

   Field: id
    Type: int(11)
    Null: NO
     Key: PRI
 Default: NULL
   Extra: auto_increment
 
   Field: modified
    Type: datetime
    Null: YES
     Key:
 Default: NULL
   Extra:
 
   Field: status
    Type: tinyint(1) unsigned
    Null: YES
     Key:
 Default: NULL
   Extra:


 <?xml version="1.0" encoding="UTF-8" ?>
 <dataConfig>
   <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" ... />
   <document>
     <entity name="blog"
             pk="id"
             query="SELECT id,description,1 as type FROM blogs WHERE status=2"
             deltaImportQuery="SELECT id,description,1 as type FROM blogs
                               WHERE status=2 AND id='${dataimporter.delta.id}'"
             deltaQuery="SELECT id FROM blogs WHERE
                         '${dataimporter.last_index_time}' &lt; modified AND status=2"
             deletedPkQuery="SELECT id FROM blogs WHERE
                             '${dataimporter.last_index_time}' &lt;= modified AND status=3"
             transformer="TemplateTransformer">
       <field column="uuid" name="uuid" template="blog-${blog.id}" />
       <field column="id" name="blog_id" />
       <field column="description" name="content" />
       <field column="type" name="type" />
     </entity>
     <entity name="entry"
             pk="id"
             query="SELECT f.id as id,f.content,f.blog_id,2 as type FROM
                    entries f,blogs b WHERE f.blog_id=b.id AND b.status=2"
             deltaImportQuery="SELECT f.id as id,f.content,f.blog_id,2 as type
                               FROM entries f,blogs b WHERE f.blog_id=b.id AND
                               f.id='${dataimporter.delta.id}'"
             deltaQuery="SELECT f.id as id FROM entries f JOIN blogs b ON
                         b.id=f.blog_id WHERE '${dataimporter.last_index_time}' &lt;
                         b.modified AND b.status=2"
             deletedPkQuery="SELECT f.id as id FROM entries f JOIN blogs b ON
                             b.id=f.blog_id WHERE b.status!=2 AND
                             '${dataimporter.last_index_time}' &lt; b.modified"
             transformer="HTMLStripTransformer,TemplateTransformer">
       <field column="uuid" name="uuid" template="entry-${entry.id}" />
       <field column="id" name="entry_id" />
       <field column="blog_id" name="blog_id" />
       <field column="content" name="content" stripHTML="true" />
       <field column="type" name="type" />
     </entity>
   </document>
 </dataConfig>

 Full import and delta import works without problems when it comes to
 adding new documents to the index but when blog is deleted (status is
 set to 3 in database), solr report after delta import is something
 like Indexing completed. Added/Updated: 0 documents. Deleted 81
 documents.. The problem is that documents are still found from solr
 index.

 1. UPDATE blogs SET modified=NOW(),status=3 WHERE id=26;

 2. delta-import =

 <str name="">
 Indexing completed. Added/Updated: 0 documents. Deleted 81 documents.
 </str>
 str 

Re: Solr -File Based Spell Check and Read .cfs file generated

2010-12-06 Thread rajini maski
Does anyone know about it?
 How do I extract the dictionary generated by default? How do I read the
 .cfs files generated in the index folder?


Awaiting reply


On Mon, Dec 6, 2010 at 7:54 PM, rajini maski rajinima...@gmail.com wrote:

 Yeah, I want to use this spellcheck only. I want to create the
 dictionary myself and give it as input to Solr, because my indexes also have
 misspelled content, so I want Solr to refer to this file and not the
 autogenerated one. How do I get this done?

 I will try the spell check as suggested by  michael...

 One more thing I wanted to know: how do I extract the dictionary
 generated by default? How do I read the .cfs files generated in the index
 folder?

 Please reply if you know anything related to this..


 Awaiting reply




 On Mon, Dec 6, 2010 at 7:33 PM, Erick Erickson erickerick...@gmail.comwrote:

 Are you sure you want spellcheck/autosuggest?

 Because what you're talking about almost sounds like
 synonyms.

 Best
 Erick

 On Mon, Dec 6, 2010 at 1:37 AM, rajini maski rajinima...@gmail.com
 wrote:

  How does the solr file based spell check work?
 
  How do we need to enter data in the spelling.txt...I am not clear about
 its
  functionality..If anyone know..Please reply.
 
  I want to index a word = Wear
  But while searching I search as =Dress
  I want to get results for Wear.. How do i apply this functionality..
 
  Awaiting Reply
 





how to config DataImport Scheduling

2010-12-06 Thread Hamid Vahedi
Hi 

I want to configure DataImport scheduling, but I don't know how to do it.
I just created and compiled the scheduling classes with NetBeans, and now I have
Scheduling.jar. 

Q: how do I set it up on Tomcat or Solr?  (I am using Tomcat 6 on Windows 2008)

Thanks in advance



  

Re: only index synonyms

2010-12-06 Thread lee carroll
Hi Erik, thanks for the reply. I only want the synonyms to be in the index;
how can I achieve that? Sorry, probably missing something obvious in the
docs
On 7 Dec 2010 01:28, Erick Erickson erickerick...@gmail.com wrote:
 See:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

 with the = syntax, I think that's what you're looking for

 Best
 Erick

 On Mon, Dec 6, 2010 at 6:34 PM, lee carroll lee.a.carr...@googlemail.com
wrote:

 Hi Can the following usecase be achieved.

 value to be analysed at index time this is a pretty line of text

 synonym list is pretty = scenic , text = words

 valued placed in the index is scenic words

 That is to say only the matching synonyms. Basically i want to produce a
 normalised set of phrases for faceting.

 Cheers Lee C



Re: only index synonyms

2010-12-06 Thread Tom Hill
Hi Lee,


On Mon, Dec 6, 2010 at 10:56 PM, lee carroll
lee.a.carr...@googlemail.com wrote:
 Hi Erik

Nope, Erik is the other one. :-)

 thanks for the reply. I only want the synonyms to be in the index
 how can I achieve that ? Sorry probably missing something obvious in the
 docs

Exactly what he said, use the "=>" syntax. You've already got it. Add the lines

pretty => scenic
text => words

to synonyms.txt, and it will do what you want.

Tom
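
To keep *only* the synonym targets in the index (Lee's stated goal), one approach, sketched here and untested in this thread, is to chain the synonym filter with a keep-word filter, where keepwords.txt lists just the mapped-to terms:

```xml
<fieldType name="syn_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonyms.txt: pretty => scenic / text => words -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true"/>
    <!-- keepwords.txt: scenic / words -- every other token is dropped -->
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```

With this chain, "this is a pretty line of text" first becomes "this is a scenic line of words", and the keep-word filter then discards everything but "scenic words", giving the normalised phrase set for faceting.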

 On 7 Dec 2010 01:28, Erick Erickson erickerick...@gmail.com wrote:
 See:

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

 with the = syntax, I think that's what you're looking for

 Best
 Erick

 On Mon, Dec 6, 2010 at 6:34 PM, lee carroll lee.a.carr...@googlemail.com
wrote:

 Hi Can the following usecase be achieved.

 value to be analysed at index time this is a pretty line of text

 synonym list is pretty = scenic , text = words

 valued placed in the index is scenic words

 That is to say only the matching synonyms. Basically i want to produce a
 normalised set of phrases for faceting.

 Cheers Lee C




Solr JVM performance issue after 2 days

2010-12-06 Thread Hamid Vahedi
Hi,

I am using multi-core Tomcat on 2 servers, 3 languages per server.

I am adding documents to Solr at up to 200 docs/sec. When the update process is 
started, everything is fine (update time is at most 200 ms/doc, with about 
800 MB of memory used and minimal CPU usage). 

After 15-17 hours it becomes very slow (more than 900 sec per update), used heap 
memory is about 15 GB, and GC time grows to more than one hour. 


I don't know what's wrong with it. Can anyone describe the problem to me? 
Does it come from Solr or the JVM? 

Note: when I stop updating, the CPU stays busy for 15-20 min, and when I start 
updating again I have the same issue. But when I stop the Tomcat service and 
start it again, everything is OK.

I am using Tomcat 6 with 18 GB memory on Windows 2008 Server x64, Solr 1.4.1. 

Thanks in advance
Hamid
Hamid