Re: Store input text after analyzers and token filters

2010-03-15 Thread JCodina
For Solr 1.4 it is basically the same, but IndexSchema (org.apache.solr.schema.IndexSchema) needs to be updated to include the method getFieldTypeByName(String fieldTypeName), which is already in Solr 1.5: /** * Given the name of a {@link org.apache.solr.schema.FieldType} (not to be
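For anyone doing that backport, the missing method amounts to a lookup in the schema's map from declared field-type names to FieldType objects. A minimal, self-contained sketch of that idea (MiniSchema and its inner FieldType are stand-ins, not the real Solr classes):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the schema: it keeps a map from field-type names (as
// declared in schema.xml) to FieldType objects, and getFieldTypeByName
// is a plain lookup in that map.
public class MiniSchema {
    // stand-in for org.apache.solr.schema.FieldType
    public static class FieldType {
        public final String className;
        public FieldType(String className) { this.className = className; }
    }

    private final Map<String, FieldType> fieldTypes = new HashMap<>();

    public void register(String name, FieldType type) {
        fieldTypes.put(name, type);
    }

    /** Given the name of a FieldType, return it (null if undeclared). */
    public FieldType getFieldTypeByName(String fieldTypeName) {
        return fieldTypes.get(fieldTypeName);
    }

    public static void main(String[] args) {
        MiniSchema schema = new MiniSchema();
        schema.register("text", new FieldType("solr.TextField"));
        System.out.println(schema.getFieldTypeByName("text").className);
        // prints: solr.TextField
    }
}
```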

Re: Store input text after analyzers and token filters

2010-03-09 Thread JCodina
Otis, I've been thinking about it, trying to figure out the different solutions: - try to solve it with a bridge between Solr and clustering; - try to solve it before/during indexing. The second option is of course better for performance, but how to do it? I think a good option may be to

Store input text after analyzers and token filters

2010-03-05 Thread JCodina
In a stored field, the content stored is the raw input text. But when the analyzers perform some cleaning or an interesting transformation of the text, it could be useful to store the text after the tokenizer/filter chain. Is there a way to do this? To be able to get back the text of the
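One way to approach this outside Solr is to run the cleaning yourself at ingest time and send the result to a second stored field. A toy stand-in for an analyzer chain (whitespace tokenizer, lowercasing, stopword removal; all names here are hypothetical, not Solr API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Self-contained sketch: apply the analysis steps yourself before
// indexing and store the re-joined token stream in an extra field,
// next to (or instead of) the raw text.
public class AnalyzedCopy {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of", "a"));

    public static String analyzedText(String raw) {
        List<String> out = new ArrayList<>();
        for (String tok : raw.split("\\s+")) {
            String t = tok.toLowerCase();        // lowercase filter
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {  // stopword filter
                out.add(t);
            }
        }
        return String.join(" ", out);            // re-join for storage
    }

    public static void main(String[] args) {
        System.out.println(analyzedText("The Analysis of a Stored Field"));
        // prints: analysis stored field
    }
}
```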

Re: Store input text after analyzers and token filters

2010-03-05 Thread JCodina
Thanks, it can be useful as a workaround, but I get a vector, not a result that I can use wherever I could use the stored text. I'm thinking of clustering. Ahmet Arslan wrote: In a stored field, the content stored is the raw input text. But when the analyzers perform some cleaning or

Clustering from analyzed text instead of raw input

2010-03-03 Thread JCodina
I'm trying to use Carrot2 (for now I started with the workbench) and I can cluster any field, but the text used for clustering is the original raw text, the one that was indexed, without any of the processing performed by the tokenizer or filters. So I get stop words. I also did shingles (after

error in sum function

2010-03-03 Thread JCodina
The sum function or the map one is not parsed correctly. This sort works like a charm: sort=score+desc,sum(Num,map(Num,0,2000,42000))+asc but sort=score+desc,sum(map(Num,0,2000,42000),Num)+asc gives the following exception: SEVERE: org.apache.solr.common.SolrException: Must declare sort
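For context, the arithmetic those function queries compute is order-independent, so the exception comes from parsing the argument order, not from the math. In Solr's function-query syntax, map(x,min,max,target) returns target when min &lt;= x &lt;= max and x otherwise, and sum adds its arguments. A self-contained illustration of those semantics (class and variable names are ours, not Solr's):

```java
// Per-document arithmetic behind the two sort clauses: swapping sum's
// arguments must yield the same value, so only the parser can differ.
public class FunctionQuerySemantics {
    /** Solr-style map: target if min <= x <= max, else x unchanged. */
    static double map(double x, double min, double max, double target) {
        return (x >= min && x <= max) ? target : x;
    }

    static double sum(double a, double b) {
        return a + b;
    }

    public static void main(String[] args) {
        double num = 1000; // a document's Num value, inside [0, 2000]
        System.out.println(sum(num, map(num, 0, 2000, 42000))); // prints: 43000.0
        System.out.println(sum(map(num, 0, 2000, 42000), num)); // prints: 43000.0
    }
}
```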

Re: error in sum function

2010-03-03 Thread JCodina
OK, solved! Joan. Koji Sekiguchi-2 wrote: Can you try the latest trunk? I have just fixed it a couple of days ago. Koji Sekiguchi from mobile. On 2010/03/03, at 18:18, JCodina joan.cod...@barcelonamedia.org wrote: the sum function or the map one is not parsed correctly, doing

Re: Clustering from analyzed text instead of raw input

2010-03-03 Thread JCodina
Thanks Staszek, I'll give the stopword treatment a try, but the problem is that we perform POS tagging and then use payloads to keep only nouns and adjectives, and we thought it could be interesting to perform clustering only with these elements, to avoid meaningless words. Of course it is a

Re: Solr and UIMA

2010-03-02 Thread JCodina
You can test our UIMA-to-Solr CAS consumer. It is based on JulieLab's Lucas and uses their CAS, but transformed to generate XML which can be saved to a file or posted directly to Solr. In the map file you can define which information is generated for each token, and how it is concatenated, allowing the
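As a rough idea of the output side, such a consumer ends up emitting Solr's standard &lt;add&gt;&lt;doc&gt; update XML per document, which can then be written to a file or POSTed to the update handler. A minimal, hypothetical sketch of that serialization (not the actual CAS consumer code; field names are illustrative):

```java
// Sketch: serialize one document as a Solr update-XML fragment.
// The <add><doc><field name="..."> format is Solr's standard update
// message; everything else here is a simplified stand-in.
public class SolrXmlWriter {
    public static String toAddXml(String id, String text) {
        return "<add><doc>"
             + "<field name=\"id\">" + escape(id) + "</field>"
             + "<field name=\"text\">" + escape(text) + "</field>"
             + "</doc></add>";
    }

    // minimal XML escaping so token text cannot break the document
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("doc1", "cats & dogs"));
    }
}
```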

Re: Solr and UIMA

2010-02-11 Thread JCodina
Things are done :-) We have now finished the UIMA CAS consumer for Solr; we are making it public, more news soon. We have also been developing some filters based on payloads. One of the filters removes words whose payloads are in a list; the other one keeps only those tokens
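The two payload filters described can be sketched independently of Lucene. Here payloads are POS tags carried as plain strings, and the Token class is a stand-in for the real token stream, not a Lucene/Solr type:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two filters: one drops tokens whose payload is in a
// given set, the other keeps only those tokens.
public class PayloadFilters {
    public static class Token {
        public final String text, payload;
        public Token(String text, String payload) { this.text = text; this.payload = payload; }
    }

    /** Remove tokens whose payload is in the set. */
    public static List<Token> removeByPayload(List<Token> in, Set<String> payloads) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) if (!payloads.contains(t.payload)) out.add(t);
        return out;
    }

    /** Keep only tokens whose payload is in the set. */
    public static List<Token> keepByPayload(List<Token> in, Set<String> payloads) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) if (payloads.contains(t.payload)) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("big", "ADJ"), new Token("runs", "VERB"), new Token("dog", "NOUN"));
        // keep only nouns and adjectives, as in the POS-tagging use case
        for (Token t : keepByPayload(tokens, new HashSet<>(Arrays.asList("NOUN", "ADJ"))))
            System.out.println(t.text); // prints big, then dog
    }
}
```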

Re: Solr and UIMA

2009-07-24 Thread JCodina
On Jul 21, 2009, at 11:57 AM, JCodina wrote: Let me summarize: we (well, I think Grant?) make changes in the DPTFF (DelimitedPayloadTokenFilterFactory) so that it is able to index at the same position different tokens that may have payloads. 1. token delimiter (#) 2. payload delimiter (|) We
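The proposed delimiter scheme can be illustrated with a toy parser: '#' separates alternative tokens that share one position, and '|' separates a token from its optional payload. Everything below (names, output format) is illustrative, not the actual DPTFF patch:

```java
import java.util.ArrayList;
import java.util.List;

// Parse one whitespace-separated unit like "dogs|NNS#dog|LEMMA" into
// the alternative tokens (with their payloads) that would be indexed
// at the same position.
public class DelimitedParser {
    /** Render each alternative as "token(payload)", or "token" if no payload. */
    public static List<String> parseUnit(String unit) {
        List<String> out = new ArrayList<>();
        for (String alt : unit.split("#")) {
            int bar = alt.indexOf('|');
            if (bar < 0) out.add(alt);                     // no payload
            else out.add(alt.substring(0, bar) + "(" + alt.substring(bar + 1) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parseUnit("dogs|NNS#dog|LEMMA"));
        // prints: [dogs(NNS), dog(LEMMA)]
    }
}
```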

Re: Lemmatisation support in Solr

2009-07-21 Thread JCodina
I think that to get the best results you need some kind of natural language processing. I'm trying to do so using UIMA, but I need to integrate it with Solr, as I explain in this post: http://www.nabble.com/Solr-and-UIMA-tc24567504.html prerna07 wrote: Hi, I am implementing Lemmatisation in

Re: Solr and UIMA

2009-07-21 Thread JCodina
for the right semantic info, but gives them the same increment. Of course the full processing chain must be aware of this. But I must think about multiword tokens. Grant Ingersoll-6 wrote: On Jul 20, 2009, at 6:43 AM, JCodina wrote: D: Break things down. The CAS would only produce XML that Solr can

Solr and UIMA

2009-07-20 Thread JCodina
We are starting to use UIMA as a platform to analyze text. The result of analyzing a document is a UIMA CAS. A CAS is a generic data structure that can contain different data. UIMA processes single documents; it gets the documents from a CAS producer and processes them using a pipe that the

Re: facets and stopwords

2009-07-08 Thread JCodina
hossman wrote: but are you sure that example would actually cause a problem? I suspect if you index that exact sentence as-is you wouldn't see the facet count for si or que increase at all. If you do a query for {!raw field=content}que you bypass the query parsers (which is

Re: facets and stopwords

2009-07-01 Thread JCodina
Sorry, I was too cryptic. If you follow this link http://projecte01.development.barcelonamedia.org/fonetic/ you will see a Top Words list (in Spanish and stemmed); in the list there is the word si, which is in 20649 documents. If you click on this word, the system will perform the query

Top tf_idf in TermVectorComponent

2009-06-25 Thread JCodina
In order to perform any further study of the result set, like clustering, the TermVectorComponent gives the list of words with the corresponding tf and idf, but this list can be huge for each document, and most of the terms may have a low tf or a too high df. Maybe it is useful to compare the
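The pruning being suggested (drop terms whose tf is too low or whose df is too high before handing the vector to clustering) can be sketched as follows; the thresholds and the tf/df layout are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Prune a per-document term vector: keep a term only if it is frequent
// enough in the document (tf >= minTf) and rare enough in the
// collection (df <= maxDf), i.e. discriminative terms survive.
public class TermVectorPruner {
    /** Each value is {tf, df} for the corresponding term. */
    public static Map<String, int[]> prune(Map<String, int[]> tfdf, int minTf, int maxDf) {
        Map<String, int[]> out = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : tfdf.entrySet()) {
            int tf = e.getValue()[0], df = e.getValue()[1];
            if (tf >= minTf && df <= maxDf) out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> v = new LinkedHashMap<>();
        v.put("clustering", new int[]{5, 120});   // frequent here, rare overall: keep
        v.put("the",        new int[]{9, 99999}); // df too high (near-stopword): drop
        v.put("teh",        new int[]{1, 3});     // tf too low (noise): drop
        System.out.println(prune(v, 2, 10000).keySet()); // prints: [clustering]
    }
}
```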

version of lucene

2009-06-15 Thread JCodina
I have the solr-nightly build of last week, and in the lib folder I can find lucene-core-2.9-dev.jar. I need to make some changes to the shingle filter in order to remove stopwords from bigrams, but to do so I need to compile Lucene. The problem is, Lucene is at version 2.4, not 2.9. If I take,
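The intended behavior (emit a bigram only when neither word is a stopword) can be sketched without touching Lucene; this illustrates the target output, not the ShingleFilter code itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Build word bigrams (shingles of size 2) but discard any bigram that
// contains a stopword, instead of filtering stopwords before shingling
// (which would glue together words that were never adjacent).
public class StopwordFreeBigrams {
    public static List<String> bigrams(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String a = tokens.get(i), b = tokens.get(i + 1);
            if (!stopwords.contains(a) && !stopwords.contains(b)) {
                out.add(a + " " + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of"));
        System.out.println(bigrams(Arrays.asList("quality", "of", "service", "level"), stop));
        // prints: [service level]
    }
}
```

Note the design choice in the comment: filtering stopwords first and then shingling would produce the spurious bigram "quality service".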

facets and stopwords

2009-06-09 Thread JCodina
I have a text field from which I remove stop words. As a first approximation I use facets to see the most common words in the text, but stopwords are there, and if I search for documents containing the stopwords, then there are no documents in the answer. You can test it at this address (using

Re: Build Solr to run SolrJS

2008-11-22 Thread JCodina
ironing out this stuff. Erik. On Nov 20, 2008, at 5:44 PM, JCodina wrote: I could not manage to use it yet. :confused: My doubts are: - must I download Solr from svn trunk? - then, must I apply the patches for solrjs and velocity and unzip the files? Or is this already in trunk

DataImportHandler JDBC case problems

2008-11-21 Thread JCodina
I tried to use a DataImportHandler where the column name user and the field name User differ only in the case of the first letter. When performing a full import, I was getting different sorts of errors on that field depending on the case of the names. I tried the four possible

Re: Build Solr to run SolrJS

2008-11-20 Thread JCodina
I could not manage to use it yet. :confused: My doubts are: - must I download Solr from svn trunk? - then, must I apply the patches for solrjs and velocity and unzip the files? Or is this already in trunk? Because trunk contains velocity and javascript in contrib, but does not find the

Re: Build Solr to run SolrJS

2008-11-17 Thread JCodina
To give you more information, the error I get is this one: java.lang.NoClassDefFoundError: org/apache/solr/request/VelocityResponseWriter (wrong name: contrib/velocity/src/main/java/org/apache/solr/request/VelocityResponseWriter) at java.lang.ClassLoader.defineClass1(Native Method) at

Build Solr to run SolrJS

2008-11-16 Thread JCodina
I downloaded solr/trunk and built it. Everything seems to work except that the VelocityResponseWriter is not in the war file, and Tomcat gives a configuration error when using the conf.xml of solrjs. Any suggestion on how to build Solr to work with solrjs? Thanks, Joan Codina --