Re: indexing pdf files using post tool

2016-03-19 Thread Francisco Andrés Fernández
Vidya, I'm not sure I understand you correctly, but I think the best way
is to parse your text with a routine outside Solr. You may need to map the
different parts of your document using your domain knowledge, and use that
routine to produce, for example, an XML document with a corresponding tag
for each part you need to differentiate. After that, you could index it
in Solr.
Francisco
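
The suggestion above (parse outside Solr, then emit an XML document with one
tag per part) could be sketched roughly as follows. This is a minimal sketch,
not a definitive implementation: the function name build_solr_add_xml, the
sample values, and the assumption that the fields name, id, title and content
already exist in the Solr schema are all hypothetical. The PDF parsing itself
is left out, since it depends on domain knowledge about the documents.

```python
import xml.etree.ElementTree as ET


def build_solr_add_xml(fields):
    """Return a Solr XML <add> message holding a single document.

    `fields` maps field names (e.g. id, name, title, content) to strings.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical values: in practice they come from your own parsing routine.
parsed = {
    "id": "doc-001",
    "name": "Example name",
    "title": "Example title",
    "content": "Body text extracted from the PDF.",
}
xml_message = build_solr_add_xml(parsed)
print(xml_message)
```

Saved to a file, a message like this can then be sent to Solr as an ordinary
XML update, for instance with the post tool, instead of posting the raw PDF.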

On Wed, Mar 16, 2016 at 04:18, vidya
wrote:

> Sorry for explaining it the wrong way. I want the data in one PDF file to
> be indexed into different fields of a Solr document according to the data
> in it, such as name, id, title, content, etc.
>
> Thanks


Re: words with spaces within

2016-02-23 Thread Francisco Andrés Fernández
Binoy and Walter, many thanks for your answers.
I think I'll go with Walter's suggestion.
Best regards,

Francisco

On Mon, Feb 22, 2016 at 23:43, Walter Underwood <
wun...@wunderwood.org> wrote:

> This happens for fonts where Tika does not have font metrics. Open the
> document in Adobe Reader, then use document info to find the list of fonts.
>
> Then post this question to the Tika list.
>
> Fix it in Tika, don’t patch it in Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Feb 22, 2016, at 6:40 PM, Binoy Dalal <binoydala...@gmail.com> wrote:
> >
> > Is there some set pattern to how these words occur or do they occur
> > randomly in the text, i.e., somewhere it'll be "subtitle" and somewhere
> > "s u b t i t l e"?
> >
> > On Tue, 23 Feb 2016, 05:01 Francisco Andrés Fernández <fra...@gmail.com>
> > wrote:
> >
> >> Hi all,
> >> I'm extracting some text from PDF files. As a result, some important words
> >> end up with spaces between their characters. I know they are words, but I
> >> don't know how to make Solr detect and index them.
> >> For example, I could have the word "Subtitle", which I want to detect,
> >> written as "S u b t i t l e". If I parse the text with a standard
> >> tokenizer, the word will be lost.
> >> How could I make Solr detect this type of word occurrence?
> >> Many thanks,
> >>
> >> Francisco
> >>
> > --
> > Regards,
> > Binoy Dalal
>
>


words with spaces within

2016-02-22 Thread Francisco Andrés Fernández
Hi all,
I'm extracting some text from PDF files. As a result, some important words
end up with spaces between their characters. I know they are words, but I
don't know how to make Solr detect and index them.
For example, I could have the word "Subtitle", which I want to detect,
written as "S u b t i t l e". If I parse the text with a standard
tokenizer, the word will be lost.
How could I make Solr detect this type of word occurrence?
Many thanks,

Francisco
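
The advice in this thread is to fix the extraction in Tika rather than patch
around it downstream. Purely as a stopgap, the extracted text could be cleaned
before it is sent to Solr. The sketch below is an assumption-laden
illustration: the function name collapse_spaced_words is hypothetical, and the
rule (three or more single letters separated by single spaces form one word)
is a heuristic that will also merge two adjacent spaced-out words into one.

```python
import re

# A run of three or more single letters, each separated by exactly one space,
# e.g. "S u b t i t l e". Shorter runs such as "a b" are left untouched.
SPACED_WORD = re.compile(r"\b(?:[^\W\d_] ){2,}[^\W\d_]\b")


def collapse_spaced_words(text):
    """Rejoin words whose letters were extracted with spaces between them."""
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_words('The S u b t i t l e of this leaflet'))
```

Ordinary words are not affected, because the pattern only matches letters
that stand alone between spaces.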


Re: Detect term occurrences

2015-09-13 Thread Francisco Andrés Fernández
Thanks again.
For the moment I think it won't be a problem. I have ~500 documents.
Regards,

Francisco

On Fri, Sep 11, 2015 at 6:08 PM, simon <mtnes...@gmail.com>
wrote:

> +1 on Sujit's recommendation: we have a similar use case (detecting drug
> names / disease entities / MeSH terms) and have been using the
> SolrTextTagger with great success.
>
> We run a separate Solr instance as a tagging service and add the detected
> tags as metadata fields to a document before it is ingested into our main
> Solr collection.
>
> How many documents/product leaflets do you have? The tagger is very fast
> at the Solr level but I'm seeing quite a bit of HTTP overhead.
>
> best
>
> -Simon
>
> On Fri, Sep 11, 2015 at 1:39 PM, Sujit Pal <sujit@comcast.net> wrote:
>
> > Hi Francisco,
> >
> > >> I have many drug product leaflets, each corresponding to 1 product.
> > >> On the other hand, we have a medical dictionary with about 10^5 terms.
> > >> I want to detect all the occurrences of those terms in any leaflet
> > >> document.
> >
> > Take a look at SolrTextTagger for this use case.
> > https://github.com/OpenSextant/SolrTextTagger
> >
> > 10^5 entries are not that large; I am using it for much larger
> > dictionaries at the moment with very good results.
> >
> > It's a project built (at least originally) by David Smiley, who is also
> > quite active in this group.
> >
> > -sujit
> >
> >
> > On Fri, Sep 11, 2015 at 7:29 AM, Alexandre Rafalovitch <arafa...@gmail.com>
> > wrote:
> >
> > > Assuming the medical dictionary is constant, I would do a copyField of
> > > text into a separate field and have that separate field use:
> > >
> > > http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.html
> > > with words coming from the dictionary (normalized).
> > >
> > > That way that new field will ONLY have your dictionary terms from the
> > > text. Then you can do facet against that field or anything else. Or
> > > even search and just be a lot more efficient.
> > >
> > > The main issue would be a gigantic filter, which may mean speed and/or
> > > memory issues. Solr has some ways to deal with such large set matches
> > > by compiling them into a state machine (used for auto-complete), but I
> > > don't know if that's exposed for your purpose.
> > >
> > > But it could be a fun custom filter to build.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 10 September 2015 at 22:21, Francisco Andrés Fernández
> > > <fra...@gmail.com> wrote:
> > > > Yes.
> > > > I have many drug product leaflets, each corresponding to 1 product.
> > > > On the other hand, we have a medical dictionary with about 10^5 terms.
> > > > I want to detect all the occurrences of those terms in any leaflet
> > > > document.
> > > > Could you give me a clue about the best way to do this?
> > > > Perhaps the best way is (as Walter suggests) to run all the queries
> > > > every time, as needed.
> > > > Regards,
> > > >
> > > > Francisco
> > > >
> > > > On Thu, Sep 10, 2015 at 11:14 AM, Alexandre Rafalovitch
> > > > <arafa...@gmail.com> wrote:
> > > >
> > > >> Can you tell us a bit more about the business case? Not the current
> > > >> technical one. Because it is entirely possible Solr can solve the
> > > >> higher level problem out of the box without you doing manual term
> > > >> comparisons. In which case, your problem scope is not quite right.
> > > >>
> > > >> Regards,
> > > >>Alex.
> > > >> 
> > > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > >> http://www.solr-start.com/
> > > >>
> > > >>
> > > >> On 10 September 2015 at 09:58, Francisco Andrés Fernández
> > > >> <fra...@gmail.com> wrote:
> > > >> > Hi all, I'm new to Solr.
> > > >> > I want to detect all occurrences of the terms in a thesaurus
> > > >> > within 1 or more documents.
> > > >> > What's the best strategy for doing this?
> > > >> > Doing a query for each term doesn't seem to be the best way.
> > > >> > Many thanks,
> > > >> >
> > > >> > Francisco
> > > >>
> > >
> >
>


Re: Detect term occurrences

2015-09-11 Thread Francisco Andrés Fernández
Many thanks, folks.
I will try some of those approaches (and come back with new questions)
;)
Best regards,

Francisco

On Fri, Sep 11, 2015 at 5:41 AM, Upayavira <u...@odoko.co.uk>
wrote:

> It sounds to me like you are wanting to *filter* your document to only
> include terms within that medical dictionary. Or to have a keyword field
> based upon those of your 100k terms that appear in that doc.
>
> Synonyms are your saviour, if that's the case. Create a synonyms list
> for your terms, they can be a one-to-one mapping, so:
>
> diabetes => diabetes
>
> is quite okay. Then, in your index time analysis chain, have a
> SynonymFilterFactory followed by a TypeTokenFilterFactory configured to
> only allow SYNONYM tokens through.
>
> Then, in your index, you will have a field that contains all the terms
> from your 100k that are included in that particular document.
>
> Does that get it?
>
> Upayavira
>
> On Fri, Sep 11, 2015, at 03:21 AM, Francisco Andrés Fernández wrote:
> > Yes.
> > I have many drug product leaflets, each corresponding to 1 product.
> > On the other hand, we have a medical dictionary with about 10^5 terms.
> > I want to detect all the occurrences of those terms in any leaflet
> > document.
> > Could you give me a clue about the best way to do this?
> > Perhaps the best way is (as Walter suggests) to run all the queries
> > every time, as needed.
> > Regards,
> >
> > Francisco
> >
> > On Thu, Sep 10, 2015 at 11:14 AM, Alexandre Rafalovitch
> > <arafa...@gmail.com> wrote:
> >
> > > Can you tell us a bit more about the business case? Not the current
> > > technical one. Because it is entirely possible Solr can solve the
> > > higher level problem out of the box without you doing manual term
> > > comparisons. In which case, your problem scope is not quite right.
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 10 September 2015 at 09:58, Francisco Andrés Fernández
> > > <fra...@gmail.com> wrote:
> > > > Hi all, I'm new to Solr.
> > > > I want to detect all occurrences of the terms in a thesaurus
> > > > within 1 or more documents.
> > > > What's the best strategy for doing this?
> > > > Doing a query for each term doesn't seem to be the best way.
> > > > Many thanks,
> > > >
> > > > Francisco
> > >
>
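
The effect Upayavira describes above, a field that ends up holding only the
dictionary terms present in a document, can be illustrated outside Solr. This
is a minimal sketch and not Solr's implementation: the function names, the
lowercase word tokenization, and the sample dictionary and leaflet text are
all hypothetical. Multi-word terms are handled by checking token n-grams up
to the length of the longest term.

```python
import re


def tokenize(text):
    # Assumption: lowercased word tokens are enough normalization here.
    return re.findall(r"\w+", text.lower())


def dictionary_terms_in(text, dictionary):
    """Return the dictionary terms found in `text`, in order of first appearance."""
    terms = {tuple(tokenize(t)) for t in dictionary}
    terms.discard(())  # ignore entries that contain no word characters
    max_len = max((len(t) for t in terms), default=0)
    tokens = tokenize(text)
    found = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            gram = tuple(tokens[start:start + length])
            # Slices near the end of the text may be shorter than `length`.
            if len(gram) == length and gram in terms:
                term = " ".join(gram)
                if term not in found:
                    found.append(term)
    return found


dictionary = ["diabetes", "blood pressure", "insulin"]
leaflet = "Insulin is used to treat diabetes. Monitor blood pressure regularly."
print(dictionary_terms_in(leaflet, dictionary))
```

For about 500 leaflets and 10^5 terms, a set lookup like this is cheap; doing
the same at index time inside Solr keeps the result searchable and facetable.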


Re: Detect term occurrences

2015-09-11 Thread Francisco Andrés Fernández
Thanks!

On Fri, Sep 11, 2015 at 14:39, Sujit Pal <sujit@comcast.net> wrote:

> Hi Francisco,
>
> >> I have many drug product leaflets, each corresponding to 1 product.
> >> On the other hand, we have a medical dictionary with about 10^5 terms.
> >> I want to detect all the occurrences of those terms in any leaflet
> >> document.
>
> Take a look at SolrTextTagger for this use case.
> https://github.com/OpenSextant/SolrTextTagger
>
> 10^5 entries are not that large; I am using it for much larger dictionaries
> at the moment with very good results.
>
> It's a project built (at least originally) by David Smiley, who is also
> quite active in this group.
>
> -sujit
>
>
> On Fri, Sep 11, 2015 at 7:29 AM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > Assuming the medical dictionary is constant, I would do a copyField of
> > text into a separate field and have that separate field use:
> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.html
> > with words coming from the dictionary (normalized).
> >
> > That way that new field will ONLY have your dictionary terms from the
> > text. Then you can do facet against that field or anything else. Or
> > even search and just be a lot more efficient.
> >
> > The main issue would be a gigantic filter, which may mean speed and/or
> > memory issues. Solr has some ways to deal with such large set matches
> > by compiling them into a state machine (used for auto-complete), but I
> > don't know if that's exposed for your purpose.
> >
> > But it could be a fun custom filter to build.
> >
> > Regards,
> >Alex.
> > 
> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > http://www.solr-start.com/
> >
> >
> > On 10 September 2015 at 22:21, Francisco Andrés Fernández
> > <fra...@gmail.com> wrote:
> > > Yes.
> > > I have many drug product leaflets, each corresponding to 1 product.
> > > On the other hand, we have a medical dictionary with about 10^5 terms.
> > > I want to detect all the occurrences of those terms in any leaflet
> > > document.
> > > Could you give me a clue about the best way to do this?
> > > Perhaps the best way is (as Walter suggests) to run all the queries
> > > every time, as needed.
> > > Regards,
> > >
> > > Francisco
> > >
> > > On Thu, Sep 10, 2015 at 11:14 AM, Alexandre Rafalovitch
> > > <arafa...@gmail.com> wrote:
> > >
> > >> Can you tell us a bit more about the business case? Not the current
> > >> technical one. Because it is entirely possible Solr can solve the
> > >> higher level problem out of the box without you doing manual term
> > >> comparisons. In which case, your problem scope is not quite right.
> > >>
> > >> Regards,
> > >>Alex.
> > >> 
> > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > >> http://www.solr-start.com/
> > >>
> > >>
> > >> On 10 September 2015 at 09:58, Francisco Andrés Fernández
> > >> <fra...@gmail.com> wrote:
> > >> > Hi all, I'm new to Solr.
> > >> > I want to detect all occurrences of the terms in a thesaurus
> > >> > within 1 or more documents.
> > >> > What's the best strategy for doing this?
> > >> > Doing a query for each term doesn't seem to be the best way.
> > >> > Many thanks,
> > >> >
> > >> > Francisco
> > >>
> >
>


Detect term occurrences

2015-09-10 Thread Francisco Andrés Fernández
Hi all, I'm new to Solr.
I want to detect all occurrences of the terms in a thesaurus within 1 or
more documents.
What's the best strategy for doing this?
Doing a query for each term doesn't seem to be the best way.
Many thanks,

Francisco


Re: Detect term occurrences

2015-09-10 Thread Francisco Andrés Fernández
Yes.
I have many drug product leaflets, each corresponding to 1 product. On the
other hand, we have a medical dictionary with about 10^5 terms.
I want to detect all the occurrences of those terms in any leaflet
document.
Could you give me a clue about the best way to do this?
Perhaps the best way is (as Walter suggests) to run all the queries every
time, as needed.
Regards,

Francisco

On Thu, Sep 10, 2015 at 11:14 AM, Alexandre Rafalovitch <
arafa...@gmail.com> wrote:

> Can you tell us a bit more about the business case? Not the current
> technical one. Because it is entirely possible Solr can solve the
> higher level problem out of the box without you doing manual term
> comparisons. In which case, your problem scope is not quite right.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 10 September 2015 at 09:58, Francisco Andrés Fernández
> <fra...@gmail.com> wrote:
> > Hi all, I'm new to Solr.
> > I want to detect all occurrences of the terms in a thesaurus within 1
> > or more documents.
> > What's the best strategy for doing this?
> > Doing a query for each term doesn't seem to be the best way.
> > Many thanks,
> >
> > Francisco
>