Re: Adding pdf/word file using JSON/XML

Roland Everaert Tue, 11 Jun 2013 07:14:09 -0700

Jan,

Thanks for the answer.


Concerning the use of /extract: if I understand correctly how the interface
works, it seems that the document is recreated every time the URL is
called. That would mean that all metadata must be provided along with the
file every time we want to update the related document, to avoid deleting
the extra fields.
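
To make that concrete, here is a minimal sketch of what "resend everything"
could look like, using the multipart POST that Jan mentions so that no
arguments go in the URL. The function name build_extract_request and the
form field name "file" are my own assumptions for illustration; the id and
name values are the ones from my curl example. It only builds the request
body and does not contact Solr.

```python
import uuid


def build_extract_request(metadata, filename, file_bytes, content_type):
    """Build headers and a multipart/form-data body for /update/extract.

    Every metadata field is sent as a literal.<name> form field on each
    call, on the assumption that the document is replaced as a whole.
    """
    boundary = uuid.uuid4().hex
    body = bytearray()

    # One form field per metadata entry, named literal.<field>.
    for name, value in metadata.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="literal.{name}"\r\n\r\n'
        )
        body += header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n"

    # The file itself; "file" is a hypothetical field name.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    )
    body += file_header.encode("utf-8") + file_bytes + b"\r\n"

    # Closing boundary.
    body += f"--{boundary}--\r\n".encode("utf-8")

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return headers, bytes(body)


# The application keeps the full metadata and resends all of it on update.
stored_metadata = {"id": "doc10", "name": "BLAH"}
headers, body = build_extract_request(
    stored_metadata, "file.pdf", b"placeholder file bytes", "application/pdf"
)
```

The headers and body could then be posted to the /update/extract URL with
any HTTP client. The point is only that the application, not Solr, would
have to remember the metadata between updates.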


Roland.



On Tue, Jun 11, 2013 at 3:31 PM, Jan Høydahl <jan....@cominvent.com> wrote:

> Hi,
>
> You can let your web application where people upload the files take care
> of extracting the text, e.g. using Apache Tika.
> Once you have the text of the PDF, you can add that to your Solr document
> along with all the rest of the metadata, and
> post it to Solr as JSON, XML or whatever you like. You do not need to use
> extracting request handler then, since you do
> the extraction on the client side.
>
> PS: Even if you use /extract, note that you can pass the literal.* params
> as POST if you choose, using 100% standards-based HTTP multipart post.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> 11. juni 2013 kl. 14:48 skrev Roland Everaert <reveatw...@gmail.com>:
>
> > We are working on an application that allows some users to add files
> > (pdf, ms word, odt, etc), located on their local hard disk, to our
> > internal system and allows other users to search for them. So we are
> > considering Solr for the indexing and search functionalities of the
> > system. Along with the file content, we want to index some metadata
> > related to the file.
> >
> > It seems obvious that Solr couldn't import the file from the local disk
> > of the user, so the system will have to import the file into a directory
> > that Solr can reach and instruct Solr to index the file with the
> > metadata, but is it possible to index the file + metadata with a
> > JSON/XML request?
> >
> > It seems that the only way to index a file with some metadata is to
> > build a request that would look like the following example that uses
> > curl. The developer would like to avoid using parameters in the URL to
> > pass arguments.
> >
> > curl "http://localhost:8080/solr/update/extract?literal.id=doc10&literal.name=BLAH&defaultField=text" \
> >   --data-binary @/path/to/file.pdf -H "Content-Type: application/pdf"
> >
> >
> > Additionally, it seems that if a subsequent request is sent to the
> > indexer to update the file, and the metadata are not passed to Solr with
> > the request, they are deleted.
> >
> > Thanks for your help,
> >
> >
> >
> > Roland.
> >
> >
> > On Mon, Jun 10, 2013 at 4:14 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> >> Sorry, but you are STILL not being clear!
> >>
> >> Are you asking if you can pass Solr parameters as XML fields? No.
> >>
> >> Are you asking if the file name and path can be indexed as metadata? To
> >> some degree:
> >>
> >> curl "http://localhost:8983/solr/update/extract?literal.id=doc-1\
> >> &commit=true&uprefix=attr_" -F "HelloWorld.docx=@HelloWorld.docx"
> >>
> >> Then the stream has a name that is indexed as metadata:
> >>
> >> <arr name="attr_meta">
> >> <str>stream_source_info</str>
> >> <str>HelloWorld.docx</str>
> >> <str>stream_content_type</str>
> >> <str>application/octet-stream</str>
> >> <str>stream_size</str>
> >> <str>10096</str>
> >> <str>stream_name</str>
> >> <str>HelloWorld.docx</str>
> >> <str>Content-Type</str>
> >> <str>application/vnd.openxmlformats-officedocument.wordprocessingml.document</str>
> >> </arr>
> >>
> >> and
> >>
> >> <arr name="attr_stream_source_info">
> >> <str>HelloWorld.docx</str>
> >> </arr>
> >>
> >> <arr name="attr_stream_name">
> >> <str>HelloWorld.docx</str>
> >> </arr>
> >>
> >> Or, what is it that you are really trying to do?
> >>
> >> Simply tell us in plain language what problem you are trying to solve.
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Roland Everaert
> >> Sent: Monday, June 10, 2013 9:23 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Adding pdf/word file using JSON/XML
> >>
> >>
> >> Sorry if it was not clear.
> >>
> >> What I would like to know is how to construct an XML/JSON request that
> >> provides any necessary information (supposedly the full path on disk) to
> >> Solr so that it can retrieve and index a pdf/ms word document.
> >>
> >> So, an XML request could look like this:
> >>
> >> <add>
> >> <doc>
> >> <field name="id">doc10</field>
> >> <field name="name">BLAH</field>
> >> <field name="path">/path/to/file.pdf</field>
> >> </doc>
> >> </add>
> >>
> >>
> >> Regards,
> >>
> >>
> >> Roland.
> >>
> >>
> >> On Mon, Jun 10, 2013 at 3:12 PM, Gora Mohanty <g...@mimirtech.com> wrote:
> >>
> >> On 10 June 2013 17:47, Roland Everaert <reveatw...@gmail.com> wrote:
> >>>> Hi,
> >>>>
> >>>> Based on the wiki, below is an example of how I am currently adding a
> >>>> pdf file with an extra field called name:
> >>>> curl "http://localhost:8080/solr/update/extract?literal.id=doc10&literal.name=BLAH&defaultField=text" \
> >>>>   --data-binary @/path/to/file.pdf -H "Content-Type: application/pdf"
> >>>>
> >>>> Is it possible to add a file + any extra fields using a JSON or XML
> >>>> request?
> >>>
> >>> It is not entirely clear what you are asking. Do you mean
> >>> can one do the same as your example above for a PDF
> >>> file, but with an XML or JSON file? If so, yes. Please see
> >>> the examples in example/exampledocs/ of a Solr source
> >>> tree, and http://wiki.apache.org/solr/ExtractingRequestHandler
> >>>
> >>> Regards,
> >>> Gora
> >>>
> >>>
> >>
>
>
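
Jan's suggestion in the quoted reply (extract the text on the client side,
then post it together with the metadata as JSON) could be sketched roughly
as follows. This is only an illustration under assumptions: the text is
taken as already extracted (for example by Tika), the function name
build_json_update is hypothetical, the target field "text" is borrowed from
the defaultField=text parameter in the curl example, and the /solr/update
URL and commit=true parameter follow the examples in the thread. The
request is built but not sent.

```python
import json
import urllib.request


def build_json_update(doc_id, extracted_text, metadata,
                      solr_url="http://localhost:8080/solr/update"):
    """Build a POST request carrying one document as a JSON update.

    extracted_text is assumed to come from client-side extraction;
    metadata is a dict of the extra fields to index with it.
    """
    doc = dict(metadata)          # copy so the caller's dict is untouched
    doc["id"] = doc_id
    doc["text"] = extracted_text  # assumed field name, see lead-in

    # A JSON array of documents is used as the update body.
    payload = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        solr_url + "?commit=true",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_json_update("doc10", "text extracted from the PDF",
                            {"name": "BLAH"})
# urllib.request.urlopen(request) would send it to a running Solr instance.
```

Because the whole document, text and metadata, is sent in one JSON body,
nothing needs to be passed as URL arguments apart from the commit flag.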
