Steve,

You gave as an example:

Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�
vollständig�sind

This sentence is probably from the PDF form's label content rather than from
its field values. Sometimes the values of a PDF form's fields are kept in a
separate file. I'm 99% sure Tika won't be able to handle that, because it
handles one file at a time. If the field values are in the PDF itself, Tika
should be able to handle it, but may be making some small errors that could
be addressed.
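
To narrow down what kind of small error this is, it may help to see which
code point actually sits where the blank should be, for example U+00A0
(no-break space) versus U+FFFD (replacement character). The following is
only a rough sketch using the Python standard library; the function name
and the sample string are made up for illustration and are not part of
Tika or Solr.

```python
import unicodedata

# Unicode categories that usually point at spacing or encoding trouble:
# space separators, line/paragraph separators, control, format,
# private-use and unassigned code points.
SUSPICIOUS_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf", "Co", "Cn"}
ORDINARY_WHITESPACE = " \n\r\t"


def suspicious_characters(text):
    """Return (index, code point, Unicode name) for each character that is
    either U+FFFD or an unusual space/control character."""
    found = []
    for index, char in enumerate(text):
        if char in ORDINARY_WHITESPACE:
            continue
        category = unicodedata.category(char)
        if char == "\ufffd" or category in SUSPICIOUS_CATEGORIES:
            name = unicodedata.name(char, "UNNAMED")
            found.append((index, "U+%04X" % ord(char), name))
    return found


# Hypothetical sample: a no-break space where a normal blank is expected.
sample = "alle\u00a0Angaben"
for entry in suspicious_characters(sample):
    print(entry)
```

Running this over the stored field value from a Solr response would show
whether the text already contains U+FFFD or merely a non-standard space.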

When you look at the form in Acrobat Reader, can you see whether the
indexed words contain any words from the form fields' values?

If you have a form where the data is not sensitive, I can investigate. If
you are interested, contact me offline at dansm...@gmail.com or
d...@danizen.net.

Thanks,

Dan

On Thu, Apr 23, 2015 at 11:59 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> When you say "they're not indexed correctly", what's your evidence?
> You cannot rely
> on the display in the browser, that's the raw input just as it was
> sent to Solr, _not_
> the actual tokens in the index. What do you see when you go to the admin
> schema browser page and load the actual tokens?
>
> Or use the TermsComponent
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> to see the actual terms in the index as opposed to the stored data you
> see in the browser
> when you look at search results.
>
> If the actual terms don't seem right _in the index_ we need to see
> your analysis chain,
> i.e. your fieldType definition.
>
> I'm 90% sure you're seeing the stored data and your terms are indexed
> just fine, but
> I've certainly been wrong before, more times than I want to remember.....
>
> Best,
> Erick
>
> On Thu, Apr 23, 2015 at 1:18 AM,  <steve.sch...@t-systems.com> wrote:
> > Hey Erick,
> >
> > thanks for your answer. They are not indexed correctly. Also, through
> the Solr admin interface I see the typical question marks within a rhombus
> where a blank space should be.
> > I now figured out the following (not sure if it is relevant at all):
> > - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
> indexed correctly, no issues
> > - PDF documents (with editable form fields) created with "Adobe InDesign
> CS5 (7.0.1)"  are indexed with the blank space issue
> >
> > Best
> > Steve
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Wednesday, April 22, 2015 17:11
> > To: solr-user@lucene.apache.org
> > Subject: Re: Odp.: solr issue with pdf forms
> >
> > Are they not _indexed_ correctly or not being displayed correctly?
> > Take a look at admin UI>>schema browser>> your field and press the "load
> terms" button. That'll show you what is _in_ the index as opposed to what
> the raw data looked like.
> >
> > When you return the field in a Solr search, you get a verbatim,
> un-analyzed copy of your original input. My guess is that your browser
> isn't using a compatible character encoding for display.
> >
> > Best,
> > Erick
> >
> > On Wed, Apr 22, 2015 at 7:08 AM,  <steve.sch...@t-systems.com> wrote:
> >> Thanks for your answer. Maybe my English is not good enough, what are
> you trying to say? Sorry I didn't get the point.
> >> :-(
> >>
> >>
> >> -----Original Message-----
> >> From: LAFK [mailto:tomasz.bo...@gmail.com]
> >> Sent: Wednesday, April 22, 2015 14:01
> >> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
> >> Subject: Odp.: solr issue with pdf forms
> >>
> >> Off the top of my head, I'd look into how writable PDFs are created and encoded.
> >>
> >> @LAFK_PL
> >>   Original message
> >> From: steve.sch...@t-systems.com
> >> Sent: Wednesday, April 22, 2015 12:41
> >> To: solr-user@lucene.apache.org
> >> Reply-To: solr-user@lucene.apache.org
> >> Subject: solr issue with pdf forms
> >>
> >> Hi guys,
> >>
> >> hopefully you can help me with my issue. We are using a solr setup and
> have the following issue:
> >> - usual pdf files are indexed just fine
> >> - pdf files with writable form-fields look like this:
> >> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und
> >> vollständig sind
> >>
> >> Somehow the blank space character is not indexed correctly.
> >>
> >> Is this a known issue? Does anybody have an idea?
> >>
> >> Thanks a lot
> >> Best
> >> Steve
>
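
Erick's suggestion to look at the actual indexed terms through the
TermsComponent could be scripted roughly as below. This is a minimal
sketch, not a tested recipe for this setup: the base URL, the core name
"collection1" and the field name "content" are placeholders, and it assumes
a "/terms" request handler is configured in solrconfig.xml, as it is in the
stock example configuration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def terms_url(base_url, field, limit=20, prefix=None):
    """Build a TermsComponent request URL for one field.

    base_url is the core URL, e.g. http://localhost:8983/solr/collection1
    (a placeholder, adjust to the real installation).
    """
    params = {
        "terms": "true",       # enable the TermsComponent
        "terms.fl": field,     # field whose indexed terms are listed
        "terms.limit": limit,  # maximum number of terms returned
        "wt": "json",          # ask for a JSON response
    }
    if prefix:
        params["terms.prefix"] = prefix  # only terms starting with prefix
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


def fetch_terms(url):
    """Send the request and return the decoded JSON response."""
    with urlopen(url) as response:
        return json.load(response)


# Placeholder core and field names; only the URL is built here.
print(terms_url("http://localhost:8983/solr/collection1", "content"))
```

Calling fetch_terms() on the printed URL against a running Solr returns the
terms as they are in the index, which can then be compared with the stored
text shown in search results.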
