Re: Copy field a source of copy field

2017-07-25 Thread tstusr
Heh, I also think that! We have some serious gaps in what you explained to me. First, you pointed out that there's no real need to use ShingleFilter; I tried with every Tokenizer and the result is the same, the species are not caught. In the simplest scenario I've got this: PUT YOUR
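(For readers of the archive: a minimal sketch of the kind of chain being discussed, with hypothetical field and file names rather than the original schema. With a plain tokenizer and no shingling, multi-word species entries can never be caught, because each word reaches the keep-words filter as a separate token.)

<!-- Hypothetical sketch: a keep-words field without shingling.
     StandardTokenizerFactory splits "abarema idiopoda" into two tokens,
     so a multi-word entry in species.txt never matches a single token. -->
<fieldType name="species_kept" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>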

Re: Copy field a source of copy field

2017-07-20 Thread tstusr
Well, correct me if I'm wrong. Your suggestion is to use the species field as the source of the genus field. We tried it with this, where species works as described and genus just uses a KWF, like this: But now the problem is different. When we try
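(A rough sketch of that suggestion, with hypothetical names; note the caveat in the comment about how copyField works on raw input rather than analyzed output.)

<!-- Hypothetical sketch: genus is a copy of species, analyzed with its own
     keep-words file. copyField copies the raw source value before analysis,
     so genus receives the original text, not the kept species tokens. -->
<copyField source="species" dest="genus"/>

<fieldType name="kept_genus" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="genus.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
<field name="genus" type="kept_genus" indexed="true" stored="false" multiValued="true"/>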

Re: Copy field a source of copy field

2017-07-19 Thread tstusr
Well, our documents consist of PDF files (between 20 and 200 pages). So we capture words from the whole file; for that we use the extract handler, which is why we have these fields: We capture species across the whole PDF content (in the attr_content field). Captured species are used for ranking purposes. So,
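(Roughly, that wiring looks like the sketch below; apart from attr_content, the field names are placeholders. The extract handler puts the PDF body into attr_content, and a copyField feeds that text into the species field used for ranking.)

<!-- Hypothetical sketch: extracted PDF text lands in attr_content and is
     copied into the species field, whose analyzer keeps only species names. -->
<field name="attr_content" type="text_general" indexed="true" stored="true"  multiValued="true"/>
<field name="species"      type="species_kept" indexed="true" stored="false" multiValued="true"/>
<copyField source="attr_content" dest="species"/>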

Re: Copy field a source of copy field

2017-07-18 Thread tstusr
Well, for me it's kind of strange, because it only works with words that contain blank spaces. It seems that maybe I'm not explaining it well. My field is defined as follows: We have 2 KWF files, "species" and
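(One configuration that would produce exactly that symptom, offered as a guess since the field definition itself is not visible in the archive: a shingle stage with outputUnigrams="false", which drops single tokens so that only multi-word entries can survive the keep-words filter.)

<!-- Hypothetical sketch: with outputUnigrams="false" the shingle stage emits
     only joined two-word tokens (e.g. "abarema_idiopoda") and drops single
     words, so only entries that contain a blank space in the original text
     can reach the keep-words filter. The keep-words file would need the same
     "_"-joined form for those entries to match. -->
<fieldType name="species_shingled" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            tokenSeparator="_" outputUnigrams="false"/>
    <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>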

Re: Copy field a source of copy field

2017-07-18 Thread tstusr
Well, I have no idea why the images displayed the way they did. The correct order is: field analyzer chain, KWF genus file, test output.

Re: Copy field a source of copy field

2017-07-18 Thread tstusr
It seems that it is just taking the last keep-words file. Now, for control purposes, I have this in the genus file: And just is
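(If both keep-words files sit in a single analyzer chain, which is a guess since the snippet is not shown here, the "only the last file" effect can come from the filters running in sequence.)

<!-- Hypothetical sketch: two KeepWordFilterFactory stages in one chain.
     Token filters run in order, so the second filter only sees tokens that
     already survived the first; the result is an intersection of the two
     lists, which can look as if only one of the files is being applied. -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
  <filter class="solr.KeepWordFilterFactory" words="genus.txt" ignoreCase="true"/>
</analyzer>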

Re: Copy field a source of copy field

2017-07-18 Thread tstusr
Ok, I know shingling will join tokens with "_". But that is the behaviour we want. Imagine we have these entries (contained in the species file): abarema idiopoda, abutilon bakerianum. Those become: abarema idiopoda abutilon bakerianum abarema_idiopoda abutilon_bakerianum. But now in my genus file maybe
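(A sketch of a shingle-then-keep chain that yields exactly that token stream; shingle sizes and the separator are assumptions.)

<!-- Hypothetical sketch: shingle adjacent tokens with "_" while keeping the
     unigrams, so "abarema idiopoda" yields abarema, idiopoda and
     abarema_idiopoda, and the keep-words filter can match the joined form. -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
          tokenSeparator="_" outputUnigrams="true"/>
  <filter class="solr.KeepWordFilterFactory" words="species.txt" ignoreCase="true"/>
</analyzer>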

Copy field a source of copy field

2017-07-17 Thread tstusr
Hi. We want to use a copy field as the source for another copy field, or some kind of post-processing of a field. The problem is this: we have text that is captured by a field, like this: which (at the end of the processing) contains just the words.
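(For readers of the archive, a sketch of the idea being asked about, with placeholder field names; the comment states the usual caveat about copyField semantics.)

<!-- Hypothetical sketch: chaining two copyField directives so the destination
     of one copy feeds another. copyField copies the raw input value before
     analysis, so the second destination does not receive the filtered
     "just the words" output of the first field's analyzer. -->
<copyField source="attr_content" dest="kept_words"/>
<copyField source="kept_words" dest="further_processing"/>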

Best way to split text

2017-07-05 Thread tstusr
We are working on a search application for large PDFs (~10-100 MB), which are being indexed correctly. However, we want to do some training in the pipeline, so we are implementing some Spark MLlib algorithms. But now, one requirement is to split documents into either paragraphs or pages.

Re: Query fieldNorm through http

2017-06-08 Thread tstusr
Hi, thanks for the reply. After adding distrib=true, with the query localhost:8983/solr/uda/tvrh?q=usage:stuff={!func}norm(usage)=on=true I get something similar; I append the complete Solr log. 2017-06-08 20:22:02.065 INFO (qtp1205044462-18) [c:uda s:shard2 r:core_node2 x:uda_shard2_replica1]

Query fieldNorm through http

2017-06-08 Thread tstusr
I wanted to ask the proper way to query or get the length of a field in Solr. I'm trying to request and append the fieldNorm in a result field by querying localhost:8983/solr/uda/tvrh?q=usage:stuff={!func}norm(usage)=on=on Nevertheless, the response to this query is: true 500

Re: Solr installdir deleted after set up solr cloud

2017-06-06 Thread tstusr
Hi there. We made a silly but terrible mistake: in solrconfig.xml we replaced ${solr.data.dir:} with ${solr.install.dir}. So, when the SolrCloud configuration took effect it used the install dir as the data dir, erasing the whole Solr instance. Thanks for your help.
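(For anyone hitting the same trap, the stock solrconfig.xml entry is shown below; pointing it at solr.install.dir makes the index live inside the installation tree.)

<!-- The data directory should reference solr.data.dir (with an empty default),
     not solr.install.dir; otherwise index data ends up under the install
     directory. -->
<dataDir>${solr.data.dir:}</dataDir>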

Solr installdir deleted after set up solr cloud

2017-05-31 Thread tstusr
Hi there. There is a strange behavior I'm not able to trace when setting up Solr in cloud mode. I'm able to start Solr in cloud mode by following this tutorial: https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud just following the instructions. We are trying to set up

Re: Modify solr score

2017-04-24 Thread tstusr
We came up with a simple solution. We use termfreq and wrote a simple processor that counts words, so we can build a boost function that just calculates the ratio between the words that hit terms and the whole field length. Some tests are being made,
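(A hedged sketch of what such a boost could look like, assuming the word count is written to a word_count field by the custom processor at index time; the handler name, field names and term are placeholders, not the original configuration.)

<!-- Hypothetical sketch: a handler whose multiplicative boost is the ratio of
     term hits to the total word count stored at index time in word_count. -->
<requestHandler name="/topic" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="df">attr_content</str>
    <str name="boost">div(termfreq(attr_content,'abarema_idiopoda'),field(word_count))</str>
  </lst>
</requestHandler>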

Re: Modify solr score

2017-04-21 Thread tstusr
Well, I know they can change. I think the main problem here is that (at this point) documents completely unrelated to a topic are being ranked as high as related documents. So, in order to penalize them we are trying to use the ratio of term frequency to word length. Nevertheless we aren't able to

Re: Modify solr score

2017-04-21 Thread tstusr
Well, maybe I explained it wrong. We have entry points, each of them related to a topic. That means that when we select the first topic, all information has to be related in some way to that vocabulary. So, it can happen that we select documents not related to the vocabulary of a given entry point.

Re: Modify solr score

2017-04-21 Thread tstusr
Since we report the score, we think there will be some relation between them. As far as we know, scoring (and hence ranking) is calculated based on tf-idf. What we want to do is make a qualitative ranking; that is, according to one topic we will tag documents as "very related", "fairly

Modify solr score

2017-04-21 Thread tstusr
Hi. We are making an application that searches for certain specific topics: the more captured words in a document, the higher the score. We have 2 testing scenarios, the first with documents that users tag as relevant, and another containing documents outside our domain. In the first

Re: Solr performance issue on indexing

2017-03-31 Thread tstusr
Hi, thanks for the feedback. Yes, it is about OOM; indeed, the Solr instance even becomes unavailable. As I was saying, I can't find more relevant information in the logs. We are able to increase the JVM heap, so that's the first thing we'll do. As far as I know, all documents are bound to that

Solr performance issue on indexing

2017-03-31 Thread tstusr
Hi there. We are currently indexing some PDF files. The main handler for indexing is /extract, where we perform simple processing (extract relevant fields and store them in some fields). The PDF files are about 10M~100M in size, and the extracted text has to remain available. So, everything works correctly
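(For context, a typical configuration for that handler along the lines described; the parameter values are placeholders rather than the thread's actual settings.)

<!-- Hypothetical sketch of the extract handler: Tika fields with no matching
     schema field are prefixed with attr_ (hence attr_content for the body). -->
<requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>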