Awesome. One thought would be to take the below and update our wiki with the information on how you are integrating TikaJAXRS and cURL. That seems very useful.
If you wouldn't mind updating the wiki, that would be a great help to
the community!

http://wiki.apache.org/tika/

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Mr Havercamp <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 9, 2013 9:06 PM
To: "[email protected]" <[email protected]>
Subject: Re: Using TikaJAXRS with remote files

>Thanks Chris, good to know I'm on the right track.
>
>I guess the caveat to below is that it does fetch the entire file so
>only grabbing the file's metadata on large files (say a video) can take
>a while.
>
>I did attempt passing on the file's headers to the tika server:
>
>curl -I "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>
>and it does make an attempt to fetch the metadata but it results in very
>little real metadata info:
>
>"Content-Encoding","windows-1252"
>"Content-Type","text/plain; charset=windows-1252"
>
>(understandable as Tika Server is expecting the entire file to do its
>magic).
>
>In the meantime I'm using CURL to obtain the file metadata:
>
>curl -I http://url/to/my.video
>
>HTTP/1.1 200 OK
>Date: Thu, 10 Oct 2013 04:01:15 GMT
>Last-Modified: Thu, 10 Oct 2013 04:01:15 GMT
>ETag: 1381377675619
>Expires: Thu, 10 Oct 2013 04:11:15 GMT
>Cache-Control: public
>Cache-Control: max-age=600
>Cache-Control: s-maxage=600
>x-entity-prefix: bitstreams
>x-entity-reference: /to/my.video
>x-entity-url: /to/myfile.html
>x-entity-format: html
>x-sdata-handler: org.dspace.rest.providers.BitstreamProvider
>x-sdata-url: /bitstreams/2416/download
>Content-Disposition: attachment; filename=my.video
>Content-Type: video/x-ms-wmv;charset=UTF-8
>Content-Length: 243062358
>
>then, if the Content-Type matches my preconfigured list of types I want
>to extract, I make another run through using my tika server:
>
>curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>
>
>On 10/10/13 10:35, Chris Mattmann wrote:
>> Looks good to me! Excellent work and not sure I have
>> a better way atm..
>>
>> ------------------------
>> Chris Mattmann
>> [email protected]
>>
>>
>> -----Original Message-----
>> From: Mr Havercamp <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Wednesday, October 9, 2013 7:27 PM
>> To: <[email protected]>
>> Subject: Re: Using TikaJAXRS with remote files
>>
>>> Success!
>>>
>>> For anybody else interested:
>>>
>>> curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>>>
>>> However would be interested if anybody else has a different/more
>>> efficient way of doing such an operation.
>>>
>>> On 10/10/13 10:11, Mr Havercamp wrote:
>>>> Further to my previous post:
>>>>
>>>> I can send remote files using a combination of the tika app running in
>>>> server mode, curl and nc:
>>>>
>>>> java -jar tika-app-1.3.jar --server 1234
>>>>
>>>> curl "http://url/to/my.file" | nc localhost 1234
>>>>
>>>> So I guess now the only missing piece is being able to send remote
>>>> files to JAXRS for extraction.
>>>>
>>>> On 10/10/13 07:50, Mr Havercamp wrote:
>>>>> Hi
>>>>>
>>>>> Been working with tika jaxrs and it is working great.
>>>>>
>>>>> One thing I'm wondering; the standalone Tika app can extract remote
>>>>> files by providing a URL (both in GUI and CMD mode); I'm wondering if
>>>>> the same is at all possible with TikaJAXRS or the Tika app launched in
>>>>> server mode?
>>>>>
>>>>> The reason being I may run an indexing client on a separate server so
>>>>> it wouldn't necessarily have direct access to the file system where
>>>>> the files to be indexed reside.
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> Hayden
>>
>
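
For anyone wanting to script the two-step approach described in the thread (check the headers with curl -I, then stream the file to the Tika server only when the Content-Type is on a list of wanted types), something along these lines might work. This is only a rough sketch, not the poster's actual script: the function names, the WANTED_TYPES list and the TIKA_META_URL default are placeholders made up for illustration. Only the curl invocations come from the messages above, with -s added to silence progress output.

```shell
#!/bin/sh
# Sketch only. TIKA_META_URL and WANTED_TYPES are illustrative placeholders;
# override them in the environment to suit your own setup.
TIKA_META_URL="${TIKA_META_URL:-http://myserver/tika/meta}"
WANTED_TYPES="${WANTED_TYPES:-video/x-ms-wmv application/pdf}"

# Read HTTP response headers on stdin and print the bare media type,
# lowercased and without parameters such as ";charset=UTF-8".
content_type() {
    tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" {
        v = $2
        sub(/;.*/, "", v)       # drop parameters after the media type
        gsub(/[ \t]/, "", v)    # drop stray whitespace
        print tolower(v)
        exit
    }'
}

# Succeed if media type $1 is in the space-separated WANTED_TYPES list.
is_wanted() {
    [ -n "$1" ] || return 1
    case " $WANTED_TYPES " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# HEAD the URL first; download and forward the body only for wanted types.
extract_meta() {
    ctype=$(curl -sI "$1" | content_type)
    if is_wanted "$ctype"; then
        curl -s "$1" | curl -s -X PUT -T - "$TIKA_META_URL"
    else
        echo "skipping $1 (Content-Type: ${ctype:-unknown})" >&2
    fi
}

# Run only when a URL is given, so the functions can also be sourced.
if [ $# -gt 0 ]; then
    extract_meta "$1"
fi
```

Saved under a name of your choosing (say tika-meta.sh, a hypothetical name), it would be run as: sh tika-meta.sh "http://url/to/my.file". The second step still downloads the whole file, so the caveat about large files raised in the thread applies here too.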
