Re: Drop documents when indexing with DIH

2011-03-07 Thread Stefan Matheis
Rosa,

try http://wiki.apache.org/solr/DataImportHandler#Special_Commands
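
That page documents the $skipDoc special command. A minimal data-config sketch,
assuming a ScriptTransformer and hypothetical table/column names (the regex and
field names below are illustrative, not from this thread; the dataSource element
is omitted):

<dataConfig>
    <!-- dataSource element omitted -->
    <script><![CDATA[
        function skipBadWords(row) {
            // hypothetical column; setting $skipDoc makes DIH drop the whole document
            var val = row.get('description');
            if (val != null && /badword1|badword2/i.test(String(val))) {
                row.put('$skipDoc', 'true');
            }
            return row;
        }
    ]]></script>
    <document>
        <entity name="item" query="select id, description from item"
                transformer="script:skipBadWords">
            <field column="id" name="id"/>
            <field column="description" name="description"/>
        </entity>
    </document>
</dataConfig>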

HTH
Stefan

On Fri, Mar 4, 2011 at 9:44 PM, Rosa (Anuncios)
rosaemailanunc...@gmail.com wrote:
 Hi,

 Is it possible to skip documents when indexing with DIH, based on a regex to
 filter certain bad words, for example?

 Thanks for your help,

 rosa



Re: New PHP API for Solr (Logic Solr API)

2011-03-07 Thread Stefan Matheis
Burak,

what's wrong with the existing PHP-Extension
(http://php.net/manual/en/book.solr.php)?

Regards
Stefan

On Sun, Mar 6, 2011 at 11:31 PM, Burak burak...@gmail.com wrote:
 Hello,

 I have recently finished writing a PHP API for Solr and have released it
 under the Apache License. The project is called Logic Solr API and is
 located at https://github.com/buraks78/Logic-Solr-API/wiki. It has good unit
 test coverage (over 90%) but is still in alpha, so I am primarily interested
 in feedback and help with testing if anybody is interested, as my test
 setup is pretty limited with regard to the Solr version (1.4.1), PHP version
 (5.3.5), and Solr setup (data required for fully testing certain features is
 missing). The documentation is located at
 https://github.com/buraks78/Logic-Solr-API/wiki. Although it is pretty weak
 at this point, I believe it can get you started. I also have phpdocs under
 the docs/api folder in the package if needed.

 Burak





Re: New PHP API for Solr (Logic Solr API)

2011-03-07 Thread Lukas Kahwe Smith

On 07.03.2011, at 09:43, Stefan Matheis wrote:

 Burak,
 
 what's wrong with the existing PHP-Extension
 (http://php.net/manual/en/book.solr.php)?


The main issue I see with it is that the API isn't designed much; it just
exposes lots of features with dedicated methods, but doesn't focus on keeping
the API easy to survey (i.e. keep simple things simple and make complex stuff
possible). At the same time, fundamental stuff like quoting is not covered.

That being said, I do not think we really need a proliferation of Solr APIs
for PHP, even if this one is based on PHP 5.3 (namespaces etc.). By the way, there is
already another PHP 5.3-based API, though it also tries to unify other Lucene-based
APIs as much as possible:
https://github.com/dstendardi/Ariadne

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: Solr Autosuggest help

2011-03-07 Thread Ahmet Arslan
 I have added the following line in both the index analyzer section and
 the query analyzer section in schema.xml:

 <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
         outputUnigrams="true" outputUnigramIfNoNgram="true"/>
 
 And reindexed my content. However, if I query Solr for multi-word search
 term suggestions, it only sends the single-word suggestions.
 
 http://localhost:8080/solr/mydata/select?qt=/terms&terms=true&terms.fl=content&terms.lower=java&terms.prefix=java&terms.lower.incl=false&indent=true
 
 It won't return terms like 'java final'; it only returns single words like
 javadoc, javascript.

 Could anyone tell me how to correct this, or what I am missing?

What happens when you add terms.limit=-1 to your search URL?
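
For example, a hedged variant of the URL above with an unlimited term count:

http://localhost:8080/solr/mydata/select?qt=/terms&terms=true&terms.fl=content&terms.prefix=java&terms.limit=-1&indent=true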

Or when you use java plus one blank character in terms.prefix?
terms.prefix=java%20&indent=true

Can you see multi-word terms in admin/schema.jsp page?





StreamingUpdateSolrServer

2011-03-07 Thread Isan Fulia
Hi all,
I am using StreamingUpdateSolrServer with queueSize=5 and threadCount=4.
The number of connections created is the same as threadCount.
Does it create a new connection for every thread?


-- 
Thanks & Regards,
Isan Fulia.


Re: Solr Autosuggest help

2011-03-07 Thread rahul
Hi,

thanks for your replies.

It seems I mistakenly put the ShingleFilterFactory on another field. When I put
the factory on the correct field, it works fine now.

Thanks.
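
For reference, a hedged schema.xml sketch of where the shingle filter sits,
reusing the attributes quoted earlier in this thread (the field type name and
tokenizer choice are illustrative assumptions):

<fieldType name="text_shingle" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- emit two-word shingles alongside the original unigrams -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
                outputUnigrams="true" outputUnigramIfNoNgram="true"/>
    </analyzer>
</fieldType>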

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Autosuggest-help-tp2580944p2645780.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple Blocked threads on UnInvertedField.getUnInvertedField() & SegmentReader$CoreReaders.getTermsReader

2011-03-07 Thread Rachita Choudhary
Hi Yonik,

Thanks for the information, but we are still facing issues related to
slowness and high memory usage.

As per my understanding, the default 'fc' method suits our use case, as we
have about 1.1 million documents in total and the number of unique values for
the facet fields is quite high.
We facet on 5 fields, and the numbers of unique values are:
Field 1 : 19,000
Field 2 : 19,000
Field 3 : 55,000
Field 4 : 474
Field 5 : 27 (the alphabetical faceting)

All the facet fields are of type string and multivalued.

As the enum method will create a bitset for all the unique values, it would
consume more memory than the fc method.
Also, even with a field value cache size of '100', the heap memory (max 6 GB)
is getting consumed pretty fast.

With about 60 parallel requests contributing about 4 million queries, about
25% of our queries have a QTime above 1 sec.
The max QTime shoots up to 55 sec.

Debugging deeper into the Solr and Lucene code, the particular method that
slows us down is IndexSearcher.numDocs, which internally gets the terms by
loading them from the index.
I have not been able to determine the root cause of this.

Any other pointers/suggestions in this regard will be helpful.

Thanks,
Rachita

On Tue, Feb 22, 2011 at 10:42 PM, Yonik Seeley
yo...@lucidimagination.com wrote:

 On Tue, Feb 22, 2011 at 9:13 AM, Rachita Choudhary
 rachita.choudh...@burrp.com wrote:
  Hi Solr Users,
 
  We are upgrading from Solr 1.3 to Solr 1.4.1.
  While using Solr 1.3, we were seeing multiple blocking active threads on
  org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal().

  To utilize the benefits of NIO, on upgrading to Solr 1.4.1, we see another
  type of multiple blocking threads, on
  org.apache.solr.request.UnInvertedField.getUnInvertedField() &
  SegmentReader$CoreReaders.getTermsReader.
  Due to this, the QTime shoots up from a few hundred to thousands of
  msec, even going up to 30-40 secs for a single query.
 
  - The multiple blocking threads show up after a few thousand queries.
  - We do not have faceting and sorting on the same fields.
  - Our facet fields are multivalued text fields, but no large text values
  are present.
  - Index size - around 10 GB
  - We have not specified any method for faceting in our schema.xml.
  - Our field value cache settings are:
   <fieldValueCache
       class="solr.FastLRUCache"
       size="175"
       autowarmCount="0"
       showItems="10"
   />
 
  Can someone please tell us why we are seeing these blocked threads?
  Also, if they are related to our field value cache, then a cache of size
  175 will be filled up by very few initial queries, and right after that we
  should see multiple blocking threads?
  What difference will it make if we use facet.method=enum?

 fc method on a multivalued field instantiates an UnInvertedField (like
 a multi-valued field cache) which can take some time.
 Just like sorting, you may want to use some warming faceting queries
 to make sure that real queries don't pay the cost of the initial entry
 construction.

 From your fieldValueCache statistics, it looks like the number of
 terms is low enough that the enum method may be fine here.

 -Yonik
 http://lucidimagination.com


  Is this all related to fieldValueCache or is there some other
 configuration
  which we need to set to avoid these blocking threads?
 
  Thanks,
  Rachita
 
  *Cache values example:*
  facetField1_27443 :
 
 {field=facet1_27443,memSize=4214884,tindexSize=52,time=22,phase1=15,nTerms=4,bigTerms=0,termInstances=6,uses=1}
 
  facetField1_70 :
 
 {field=facetField1_70,memSize=4223310,tindexSize=308,time=28,phase1=21,nTerms=636,bigTerms=0,termInstances=14404,uses=1}
 
  facetField2 :
 {field=facetField2,memSize=4262644,tindexSize=3156,time=273,phase1=267,nTerms=12188,bigTerms=0,termInstances=1255522,uses=7031}
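
A minimal solrconfig.xml sketch of the warming Yonik suggests above -- a
listener that runs faceting queries when a new searcher opens, so real
queries don't pay the UnInvertedField construction cost (the field names are
illustrative):

<listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
        <lst>
            <str name="q">*:*</str>
            <str name="facet">true</str>
            <!-- list each field you facet on -->
            <str name="facet.field">facetField1_70</str>
            <str name="facet.field">facetField2</str>
        </lst>
    </arr>
</listener>

The same listener can also be registered for the newSearcher event, so warming
happens again after each commit.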



Re: Multiple Blocked threads on UnInvertedField.getUnInvertedField() & SegmentReader$CoreReaders.getTermsReader

2011-03-07 Thread Yonik Seeley
On Mon, Mar 7, 2011 at 9:44 AM, Rachita Choudhary
rachita.choudh...@burrp.com wrote:
 As the enum method will create a bitset for all the unique values

It's more complex than that.
 - small sets will use a sorted int set... not a bitset
 - you can control what gets cached via facet.enum.cache.minDf parameter
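
As a hedged illustration (the values here are arbitrary), both can be set as
request-handler defaults in solrconfig.xml:

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="facet.method">enum</str>
        <!-- terms whose docFreq is below this threshold skip the filter cache -->
        <str name="facet.enum.cache.minDf">100</str>
    </lst>
</requestHandler>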

-Yonik
http://lucidimagination.com


Re: dismax, and too much qf?

2011-03-07 Thread Jonathan Rochkind
I use about that many qf's in Solr 1.4.1. It works. I'm not entirely
sure if it has performance implications -- I do have searching that is
somewhat slower than I'd like, but I'm not sure if the lengthy qf is a
contributing factor, or other things I'm doing (like a dozen different
facet.fields too!). I haven't profiled everything. But it doesn't
grind my Solr to a halt or anything; it works.


Separately, I've also been thinking of other ways to get highlighting
behavior similar to what you describe -- have the highlight response report
the 'field' the match was in -- but I haven't come up with anything
great; if your approach works, that's cool. I've been trying to think
of a way to store a single stored field in a structured format (CSV?
XML?) and somehow have the highlighter return the complete 'field' that
matches, not just the surrounding X words. But I haven't gotten anywhere
on that; it's just an idle thought.


Jonathan

On 3/4/2011 10:09 AM, Jeff Schmidt wrote:

Hello:

I'm working on implementing a requirement where, when a document is returned, we want to
pithily tell the end user why. That is, say, with five documents returned, they may be
returned for similar or different reasons. These reasons are the field(s) in which
matches occurred. Some are more important than others, and I'll have to return just the
most relevant one or two reasons so as not to overwhelm the user.

This is a separate goal from Solr's scoring of the returned documents. That is,
index/query-time boosting can indicate which fields are more significant in computing the
overall document score, but then I need to know which fields matched, and with what
terms. I do have an application that stands between Solr and the end user (RESTful API),
so I figured I can rank the reasons and return more domain-specific names
rather than the Solr field names.

So, I've turned to highlighting, and in the results I can see, for each document ID,
the fields matched, the text in the field, etc. Great. But, to get that to work,
I have to specifically query individual fields. That is, the approach
of copyField'ing a bunch of fields to a common text field for efficiency
purposes is no longer an option. And, using the dismax request handler, I'm querying
a lot of fields:

  <str name="qf">
     n_nameExact^4.0
     n_macromolecule_nameExact^3.0
     n_macromolecule_name^2.0
     n_macromolecule_id^1.8
     n_pathway_nameExact^1.5
     n_top_regulates
     n_top_regulated_by
     n_top_binds
     n_top_role_in_cell
     n_top_disease
     n_molecular_function
     n_protein_family
     n_subcell_location
     n_pathway_name
     n_cell_component
     n_bio_process
     n_synonym^0.5
     n_macromolecule_summary^0.6
     p_nameExact^4.0
     p_name^2.0
     p_description^0.6
  </str>

Is that crazy?  Is telling Solr to look at so many individual fields going to 
be a performance problem?  I'm only prototyping at this stage and it works 
great. :)  I've not run anything yet at scale handling lots of requests.

There are two document types in that shared index, demarcated using a field
named "type". So, when configuring the SolrJ SolrQuery, I do set up
addFilterQuery() to select one or the other type.

Anyway, using dismax with all of those query fields along with highlighting, I 
get the information I need to render meaningful results for the end user.  But, 
it has a sort of smell to it. :)   Shall I look for another way, or am I 
worrying about nothing?

I am currently using Solr 3.1 trunk.

Thanks!

Jeff
--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
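
A hedged solrconfig.xml sketch of the setup described above: dismax over the
individual fields, with hl.requireFieldMatch=true so the highlight response
only reports fields that actually produced a match. The handler name and the
trimmed-down qf are illustrative:

<requestHandler name="/reasons" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="qf">n_nameExact^4.0 n_macromolecule_name^2.0 n_synonym^0.5</str>
        <str name="hl">true</str>
        <str name="hl.fl">*</str>
        <!-- only report a field in the highlight response when it matched the query -->
        <str name="hl.requireFieldMatch">true</str>
    </lst>
</requestHandler>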




Re: New PHP API for Solr (Logic Solr API)

2011-03-07 Thread dan whelan

When are you going to complete the Texis Search API?



On 3/6/11 2:31 PM, Burak wrote:

Hello,

I have recently finished writing a PHP API for Solr and have released
it under the Apache License. The project is called Logic Solr API
and is located at https://github.com/buraks78/Logic-Solr-API/wiki. It
has good unit test coverage (over 90%) but is still in alpha, so I am
primarily interested in feedback and help with testing if anybody
is interested, as my test setup is pretty limited with regard to the
Solr version (1.4.1), PHP version (5.3.5), and Solr setup (data
required for fully testing certain features is missing). The
documentation is located at
https://github.com/buraks78/Logic-Solr-API/wiki. Although it is pretty
weak at this point, I believe it can get you started. I also have
phpdocs under the docs/api folder in the package if needed.


Burak






Re: dismax, and too much qf?

2011-03-07 Thread Jeff Schmidt
Hi Jonathan:

On Mar 7, 2011, at 8:33 AM, Jonathan Rochkind wrote:

 I use about that many qf's in Solr 1.4.1. It works. I'm not entirely sure
 if it has performance implications -- I do have searching that is somewhat
 slower than I'd like, but I'm not sure if the lengthy qf is a contributing
 factor, or other things I'm doing (like a dozen different facet.fields too!).
 I haven't profiled everything. But it doesn't grind my Solr to a halt or
 anything; it works.

Thanks for the feedback on that. I'll learn more about how this performs in the
coming months, but if the approach is doomed from the start, that would be good
to know sooner rather than later, so I could consider doing something else (not
sure what that would be). It is a pretty big customer requirement though, so
perhaps it can be carried out regardless by using more EC2 instances? :)

 Separately, I've also been thinking of other ways to get highlighting
 behavior similar to what you describe -- have the highlight response report
 the 'field' the match was in -- but I haven't come up with anything great;
 if your approach works, that's cool. I've been trying to think of a way to
 store a single stored field in a structured format (CSV? XML?) and somehow
 have the highlighter return the complete 'field' that matches, not just the
 surrounding X words. But I haven't gotten anywhere on that; it's just an
 idle thought.

That's an interesting idea. There are a number of other highlighting-related
parameters I've not played with yet, relating to fragment size, snippets,
max analyzed chars, etc. Could those get you what you need without having to
create a separate structured field?

In my case, most of the fields I'm searching are small, and I just
need to know in which field(s) a match occurred. Often, the actual matched
characters are less important than the fact that the provided terms matched in
that field.

Take it easy,

Jeff

 
 Jonathan
 
 On 3/4/2011 10:09 AM, Jeff Schmidt wrote:
 Hello:
 
 I'm working on implementing a requirement where, when a document is returned,
 we want to pithily tell the end user why. That is, say, with five documents
 returned, they may be returned for similar or different reasons. These reasons
 are the field(s) in which matches occurred. Some are more important than
 others, and I'll have to return just the most relevant one or two reasons so
 as not to overwhelm the user.
 
 This is a separate goal from Solr's scoring of the returned documents. That
 is, index/query-time boosting can indicate which fields are more significant
 in computing the overall document score, but then I need to know which fields
 matched, and with what terms. I do have an application that stands between
 Solr and the end user (RESTful API), so I figured I can rank the reasons
 and return more domain-specific names rather than the Solr field names.
 
 So, I've turned to highlighting, and in the results I can see, for each
 document ID, the fields matched, the text in the field, etc. Great. But,
 to get that to work, I have to specifically query individual fields. That
 is, the approach of copyField'ing a bunch of fields to a common text field
 for efficiency purposes is no longer an option. And, using the dismax
 request handler, I'm querying a lot of fields:
 
  <str name="qf">
     n_nameExact^4.0
     n_macromolecule_nameExact^3.0
     n_macromolecule_name^2.0
     n_macromolecule_id^1.8
     n_pathway_nameExact^1.5
     n_top_regulates
     n_top_regulated_by
     n_top_binds
     n_top_role_in_cell
     n_top_disease
     n_molecular_function
     n_protein_family
     n_subcell_location
     n_pathway_name
     n_cell_component
     n_bio_process
     n_synonym^0.5
     n_macromolecule_summary^0.6
     p_nameExact^4.0
     p_name^2.0
     p_description^0.6
  </str>
 
 Is that crazy?  Is telling Solr to look at so many individual fields going 
 to be a performance problem?  I'm only prototyping at this stage and it 
 works great. :)  I've not run anything yet at scale handling lots of 
 requests.
 
 There are two document types in that shared index, demarcated using a field
 named "type". So, when configuring the SolrJ SolrQuery, I do set up
 addFilterQuery() to select one or the other type.
 
 Anyway, using dismax with all of those query fields along with highlighting, 
 I get the information I need to render meaningful results for the end user.  
 But, it has a sort of smell to it. :)   Shall I look for another way, or am 
 I worrying about nothing?
 
 I am currently using Solr 3.1 trunk.
 
 Thanks!
 
 Jeff
 --
 Jeff Schmidt
 535 Consulting
 j...@535consulting.com
 http://www.535consulting.com
 
 

--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com



Re: Trying to use FieldReaderDataSource in DIH

2011-03-07 Thread Jeff Schmidt
I can see that XPathEntityProcessor.init() is using the no-arg version of
Context.getDataSource(). Since fields are hierarchical, should that not be a
request for the current innermost data source (i.e. fieldSource, which is
a FieldReaderDataSource)? Or should init() be looking at the dataSource
attribute value of the field in order to effectively invoke
Context.getDataSource(fieldSource)?

It seems I'm obsessing over this bug when it's probably some bigger picture 
thing I'm missing.  Given the other examples of using this technique, it's hard 
to believe I'm the first to encounter this issue. :)

Thanks,

Jeff

On Mar 4, 2011, at 10:00 AM, Jeff Schmidt wrote:

 Hello:
 
 I'm trying to make use of FieldReaderDataSource so that I can read an (Oracle)
 database CLOB, and then use XPathEntityProcessor to derive Solr field values
 via XPath notation.
 
 For an extra bit of fun, the CLOB itself is base64-encoded and gzip'd. I
 created a transformer of my own to take care of the encoding and compression,
 and that seems to work. I patterned the new transformer after the existing
 ones (Solr 3.1 trunk). Anyway, I can see my own debug output in catalina.out:
 
 - Processing field: {toWrite=false, clob=true,
 column=SUMMARY_XML, boost=1.0, gzip64=true}
 - Updated field: SUMMARY_XML to type: java.lang.String value:
 '<node id="ING:2ylbg" name="LOC677213" type="gene"><synonym-list><synonym
 name="LOC677213"/></synonym-list><macromolecule-list><macromolecule
 id="677213" source="EG" species="MM" name="similar to U2AF homology motif
 (UHM) kinase 1"
 summary=""/></macromolecule-list><member-of></member-of><molecular-function></molecular-function><biological-process></biological-process><cellular-component></cellular-component><pathway-list></pathway-list><protein-family><term
 name="unknown"/></protein-family><subcellular-location></subcellular-location><top-findings></top-findings><additional-findings></additional-findings><reference-list
 finding-count="0"></reference-list><copyright>&#169;2000-2010 Ingenuity
 Systems, Inc. All rights reserved.</copyright></node>'
 
 So, the transformer replaces the original CLOB extracted by ClobTransformer 
 with a String representing the decoded result. I then want to feed this XML 
 string to XPathEntityProcessor.  So, in my DIH data config file:
 
 <dataConfig>
     <dataSource
         name="ipsDb"
         type="JdbcDataSource"
         driver="oracle.jdbc.driver.OracleDriver"
         url="jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac1-vip)(PORT=1537))(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac2-vip)(PORT=1537))(sdu=8760)(LOAD_BALANCE=yes)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=EIPS1R)))"
         user="user"
         password="password"
     />

     <datasource
         name="fieldSource"
         type="FieldReaderDataSource"
     />

     <document>
         <entity
             rootEntity="false"
             name="ipsNode"
             dataSource="ipsDb"
             query="select SUMMARY_XML from IPS_NODE where ROWNUM &lt; 10"
             transformer="ClobTransformer,com.ingenuity.isec.util.SolrDihGzip64Transformer">

             <field column="SUMMARY_XML" clob="true" gzip64="true"/>

             <entity
                 name="node"
                 dataSource="fieldSource"
                 dataField="ipsNode.SUMMARY_XML"
                 processor="XPathEntityProcessor"
                 forEach="/node">

                 <field column="n_id" xpath="/node/@id"/>
                 <field column="n_name" xpath="/node/@name"/>
                 ...
             </entity>
         </entity>
     </document>
 </dataConfig>
 
 Basically, I'm trying to specify the (former CLOB, now String) SUMMARY_XML 
 field as the data field for the FieldReaderDataSource. I can see it has the 
 ability to simply return a StringReader() for String fields, rather than have 
 to deal with a Clob itself. So, I figured FieldReaderDataSource would be 
 happy with that and it would supply XPathEntityProcessor with XML contained 
 in the field's value.
 
 But, when I do a full import, I see this:
 
 Mar 4, 2011 9:10:26 AM org.apache.solr.handler.dataimport.DataImporter 
 doFullImport
 INFO: Starting Full Import
 Mar 4, 2011 9:10:26 AM org.apache.solr.core.SolrCore execute
 INFO: [ing-nodes] webapp=/solr path=/select
 params={clean=false&commit=true&command=full-import&qt=/dataimport-ips}
 status=0 QTime=31
 Mar 4, 2011 9:10:26 AM org.apache.solr.handler.dataimport.SolrWriter 
 readIndexerProperties
 WARNING: Unable to read: dataimport-ips.properties
 Mar 4, 2011 9:10:26 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 
 call
 INFO: Creating a connection for entity ipsNode with URL: 
 jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac1-vip)(PORT=1537))(ADDRESS=(PROTOCOL=TCP)(HOST=ueipa1rac2-vip)(PORT=1537))(sdu=8760)(LOAD_BALANCE=yes)(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=EIPS1R)))
 Mar 4, 2011 9:10:28 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 
 call
 INFO: Time 

Solr Cell DataImport Tika handler broken - fails to index Zip file contents

2011-03-07 Thread Jayendra Patil
Working with the latest Solr trunk code, it seems the Tika handlers
for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler
(TikaEntityProcessor.java) fail to index zip file contents again;
they just index the file names.
This issue was addressed some time back, late last year, but seems to
have reappeared with the latest code.

I had raised a jira for the Data Import handler part with the patch
and the testcase - https://issues.apache.org/jira/browse/SOLR-2332.
The same fix is needed for the Solr Cell as well.

I can raise a jira and provide the patch for the same, if the above
patch seems good enough.

Regards,
Jayendra


Looking for a Lucene/Solr Contractor

2011-03-07 Thread Drew Kutcharian
Hi Everyone,

We are looking for someone to help us build a similarity engine. Here are some 
preliminary specs for the project.

1) We want to be able to show similar posts when a user posts a new block of 
text. A good example of this is StackOverflow. When a user tries to ask a new 
question, the system displays similar questions.

2) This is for a messaging system, so indexing/analysis should happen 
preferably at the time of posting, not later.

3) The posts are going to be less than 1000 characters.

4) We anticipate having millions of posts, so the solution should consider
sharding techniques to shard the indexes across many machines.

5) The solution can be delivered as a stand alone Java SE solution which can be 
run from the command line, no web development necessary.

6) We expect clean APIs.

Thanks,

Drew

Re: Looking for a Lucene/Solr Contractor

2011-03-07 Thread Jan Høydahl
Please check http://wiki.apache.org/solr/Support and 
http://wiki.apache.org/lucene-java/Support for a list of companies you may 
contact.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 7. mars 2011, at 19.40, Drew Kutcharian wrote:

 Hi Everyone,
 
 We are looking for someone to help us build a similarity engine. Here are 
 some preliminary specs for the project.
 
 1) We want to be able to show similar posts when a user posts a new block of 
 text. A good example of this is StackOverflow. When a user tries to ask a new 
 question, the system displays similar questions.
 
 2) This is for a messaging system, so indexing/analysis should happen 
 preferably at the time of posting, not later.
 
 3) The posts are going to be less than 1000 characters.
 
 4) We anticipate having millions of posts, so the solution should consider
 sharding techniques to shard the indexes across many machines.
 
 5) The solution can be delivered as a stand alone Java SE solution which can 
 be run from the command line, no web development necessary.
 
 6) We expect clean APIs.
 
 Thanks,
 
 Drew



How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread Andy
I have documents that contain both simplified and traditional Chinese 
characters. Is there any way to search across them? For example, if someone 
searches for 类 (simplified Chinese), I'd like to be able to recognize that the 
equivalent character is 類 in traditional Chinese and search for 类 or 類 in the 
documents. 

Is that something that Solr, or any related software, can do? Is there a 
standard approach in dealing with this problem?

Thanks.





Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread François Schiettecatte
I did a little research into this for a client a while ago. The character mapping
is not one-to-one, which complicates things (TC and SC have evolved
independently), and if you want to do a perfect job you will need a dictionary.
However, there are tables out there (I can dig one up for you) that allow
conversion from one to the other. So you would pick either TC or SC as your
canonical Chinese, and just convert all the documents and searches to it.

I will stress that this is very much a brute-force approach; the mapping is not
perfect, and the two character sets have evolved apart (much like UK and US English --
I was brought up in the UK and live in the US).

Hope this helps.

Cheers

François

On Mar 7, 2011, at 5:02 PM, Andy wrote:

 I have documents that contain both simplified and traditional Chinese 
 characters. Is there any way to search across them? For example, if someone 
 searches for 类 (simplified Chinese), I'd like to be able to recognize that 
 the equivalent character is 類 in traditional Chinese and search for 类 or 類 in 
 the documents. 
 
 Is that something that Solr, or any related software, can do? Is there a 
 standard approach in dealing with this problem?
 
 Thanks.
 
 
 



Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread Andy
Thanks. Please tell me more about the tables/software that does the conversion. 
Really appreciate your help.


--- On Mon, 3/7/11, François Schiettecatte fschietteca...@gmail.com wrote:

 From: François Schiettecatte fschietteca...@gmail.com
 Subject: Re: How to handle searches across traditional and simplified Chinese?
 To: solr-user@lucene.apache.org
 Date: Monday, March 7, 2011, 5:24 PM
 I did a little research into this for
 a client a while ago. The character mapping is not one to one
 which complicates things (TC and SC have evolved
 independently) and if you want to do a perfect job you will
 need a dictionary. However there are tables out there (I can
 dig one up for you) that allow conversion from one to the
 other. So you would pick either TC or SC as your canonical
 Chinese, and just convert all the documents and searches to
 it.
 
 I will stress that this is very much a brute force
 approach, the mapping is not perfect and the two character
 sets have evolved (much like UK and US English, I was
 brought up in the UK and live in the US).
 
 Hope this helps.
 
 Cheers
 
 François
 
 On Mar 7, 2011, at 5:02 PM, Andy wrote:
 
  I have documents that contain both simplified and
 traditional Chinese characters. Is there any way to search
 across them? For example, if someone searches for 类
 (simplified Chinese), I'd like to be able to recognize that
 the equivalent character is 類 in traditional Chinese and
 search for 类 or 類 in the documents. 
  
  Is that something that Solr, or any related software,
 can do? Is there a standard approach in dealing with this
 problem?
  
  Thanks.
  
  
  
 
 





Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread François Schiettecatte
Here are a bunch of resources which will help:


This does TC <=> SC conversions:

http://search.cpan.org/~audreyt/Encode-HanConvert-0.35/lib/Encode/HanConvert.pm


This has a TC <=> SC converter in there somewhere:

http://www.mediawiki.org/wiki/MediaWiki


This explains some of the issues behind TC <=> SC conversions:

http://people.w3.org/rishida/scripts/chinese/


Misc tools:

http://mandarintools.com/


François


On Mar 7, 2011, at 7:01 PM, Andy wrote:

 Thanks. Please tell me more about the tables/software that does the 
 conversion. Really appreciate your help.
 
 
 --- On Mon, 3/7/11, François Schiettecatte fschietteca...@gmail.com wrote:
 
 From: François Schiettecatte fschietteca...@gmail.com
 Subject: Re: How to handle searches across traditional and simplified
 Chinese?
 To: solr-user@lucene.apache.org
 Date: Monday, March 7, 2011, 5:24 PM
 I did a little research into this for
 a client a while ago. The character mapping is not one to one
 which complicates things (TC and SC have evolved
 independently) and if you want to do a perfect job you will
 need a dictionary. However there are tables out there (I can
 dig one up for you) that allow conversion from one to the
 other. So you would pick either TC or SC as your canonical
 Chinese, and just convert all the documents and searches to
 it.
 
 I will stress that this is very much a brute force
 approach, the mapping is not perfect and the two character
 sets have evolved (much like UK and US English, I was
 brought up in the UK and live in the US).
 
 Hope this helps.
 
 Cheers
 
 François
 
 On Mar 7, 2011, at 5:02 PM, Andy wrote:
 
 I have documents that contain both simplified and
 traditional Chinese characters. Is there any way to search
 across them? For example, if someone searches for 类
 (simplified Chinese), I'd like to be able to recognize that
 the equivalent character is 類 in traditional Chinese and
 search for 类 or 類 in the documents. 
 
 Is that something that Solr, or any related software,
 can do? Is there a standard approach in dealing with this
 problem?
 
 Thanks.
 
 
 
 
 
 
 
 



Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread Robert Muir
On Mon, Mar 7, 2011 at 7:01 PM, Andy angelf...@yahoo.com wrote:
 Thanks. Please tell me more about the tables/software that does the 
 conversion. Really appreciate your help.


Also, you might be interested in this example:

<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTransformFilterFactory
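
A hedged schema.xml sketch of that filter inside a field type; the type name
and tokenizer choice are illustrative assumptions (the filter requires the
analysis-extras ICU jars):

<fieldType name="text_zh" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- folds Traditional Chinese characters to Simplified at index and query time -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    </analyzer>
</fieldType>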


logical relation among filter queries

2011-03-07 Thread cyang2010
I wonder what the logical relation among filter queries is. I can't find
much documentation on filter queries.

For example, I want to find all titles that are rated either PG-13 or R, through
filter queries. The following query won't give me any results back, so I
suppose that by default the filter query results are intersected?

fq=rating:PG-13&fq=rating:R&q=*:*


How do I change it to a union across the filter query values?

Thanks.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/logical-relation-among-filter-queries-tp2649142p2649142.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: logical relation among filter queries

2011-03-07 Thread Jayendra Patil
You can use boolean operators within the filter query.

e.g. fq=rating:(PG-13 OR R)

Regards,
Jayendra

On Mon, Mar 7, 2011 at 9:25 PM, cyang2010 ysxsu...@hotmail.com wrote:
 I wonder what the logical relation among filter queries is. I can't find
 much documentation on filter queries.

 For example, I want to find all titles that are rated either PG-13 or R,
 through filter queries. The following query won't give me any results back,
 so I suppose that by default the filter query results are intersected?

 fq=rating:PG-13&fq=rating:R&q=*:*


 How do I change it to a union across the filter query values?

 Thanks.






 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/logical-relation-among-filter-queries-tp2649142p2649142.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: New PHP API for Solr (Logic Solr API)

2011-03-07 Thread Burak

On 03/07/2011 12:43 AM, Stefan Matheis wrote:

Burak,

what's wrong with the existing PHP-Extension
(http://php.net/manual/en/book.solr.php)?
I think "wrong" is not the appropriate word here. But if I had to
summarize why I wrote this API:


* Not everybody is enthusiastic about adding another item to an already
long list of server dependencies. I just wanted a pure PHP option.
* I am not a C programmer either, so the ability to understand the source
code and modify it according to my needs is another advantage.
* Yes, a PECL package would be faster. However, in 99% of the cases, 
after everything is said, coded, and byte-code cached, my biggest 
bottlenecks end up being the database and network.

* Last of all, choice is what open source means to me.

Burak










Use of multiple Tomcat instances and shards.

2011-03-07 Thread rajini maski
  Regarding increasing the Java heap memory: I have only 2 GB of RAM, so my
default configuration of --JvmMs 128 --JvmMx 512 is what I have to work with.
I have a single Solr data index of up to 6 GB. Now, if I fire searches against
this index very often, after some time I get a "java heap space out of memory"
error and the search does not return results. What are the possibilities for
fixing this error? (I cannot increase the heap memory.) Would running another
Tomcat instance help (and how would that work?), or is it done by configuring
shards? What might help me fix these search failures?


Rajani
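
For context, a hedged sketch of how shards are typically used: the 6 GB index
would be split across two (or more) Solr instances, and each search fans out
to all of them, so each JVM only has to cache and search its own portion.
Host names and core paths below are placeholders:

http://host1:8080/solr/select?q=*:*&shards=host1:8080/solr,host2:8080/solr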