Re: Parser removes file content and treats it as Metadata

2024-01-25 Thread Tim Allison
I'm sorry for not looking into this and responding sooner. That's the way the RFC822 parser works. It attempts to read the headers and put those into the metadata fields appropriately, and it tries to put the content in the body. The reason that SUBJECT: XYZ is slipping through into t

Re: Parser removes file content and treats it as Metadata

2024-01-25 Thread Gerardo Hernandez
From: Ken Krugler Sent: Wednesday, January 24, 2024 02:40 PM To: user@tika.apache.org Cc: Tim Allison Subject: Re: Parser removes file content and treats it as Metadata

Re: Parser removes file content and treats it as Metadata

2024-01-24 Thread Ken Krugler
Hi Gerardo, What happens if you set the filename in the metadata before calling parse()? E.g. metadata.set(Metadata.RESOURCE_NAME_KEY, filename); I don’t recall whether the Resource Name detector will be called first, before the Mime Magic detector (Tim?). If it is, then having a xxx.txt filename

Re: Parser removes file content and treats it as Metadata

2024-01-23 Thread Tilman Hausherr
On 23.01.2024 20:27, Gerardo Hernandez wrote: Btw we are currently working on 2.7.0 version Please retry with the current version (2.9.1) and tell if that is better. Tilman

Re: Parser removes file content and treats it as Metadata

2024-01-23 Thread Gerardo Hernandez
Btw we are currently working on 2.7.0 version From: Gerardo Hernandez Sent: Tuesday, January 23, 2024 01:26 PM To: user@tika.apache.org Subject: Re: Parser removes file content and treats it as Metadata

Re: Parser removes file content and treats it as Metadata

2024-01-23 Thread Gerardo Hernandez
, January 20, 2024 11:54 AM To: user@tika.apache.org Cc: Mikhail Gushinets Subject: Re: Parser removes file content and treats it as Metadata I ass

Re: Parser removes file content and treats it as Metadata

2024-01-20 Thread Ken Krugler
I assume you are getting the initial lines as metadata because Tika is identifying the file as email. If you include details on your code (how you are calling the parser) and version, I’m confident someone can suggest reasonable work-arounds. Regards, — Ken > On Jan 18, 2024, at 8:44

Re: Parser removes file content and treats it as Metadata

2024-01-18 Thread Gerardo Hernandez
This is the input file; I think it was not uploaded correctly. Best regards, Gerardo From: Gerardo Hernandez Sent: Thursday, January 18, 2024 10:39 PM To: user@tika.apache.org Cc: Mikhail Gushinets Subject: Parser removes file content and treats it as Metadata

Parser removes file content and treats it as Metadata

2024-01-18 Thread Gerardo Hernandez
adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam... (Till the end of the file). and the initial text of the file (FROM, TO, DATE, LOCATION) is not included but registered as metadata: [inline attachment] I wo

Re: How to process metadata returned by the tika server?

2023-11-07 Thread Mark Kerzner SHMsoft, Inc.
Thank you, it helped. Mark Mark Kerzner, SHMsoft

Re: How to process metadata returned by the tika server?

2023-11-03 Thread Tim Allison
A heavier-weight option is to use the tika-serialization module (which uses Jackson databind) and do something like this: https://github.com/apache/tika/blob/main/tika-server/tika-server-standard/src/test/java/org/apache/tika/server/standard/RecursiveMetadataResourceTest.java#L89 On Fri, Nov 3, 2

Re: How to process metadata returned by the tika server?

2023-11-03 Thread Cihad Guzel
Hi Mark, You can use a library like `org.json` to process incoming JSON data. With this library, you can convert the incoming JSON data into a `JSONObject` and read its contents. Regards, Cihad Guzel On Fri, 3 Nov 2023 at 08:27, Mark Kerzner SHMsoft, Inc. < mark.kerz...@shmsoft.c

How to process metadata returned by the tika server?

2023-11-02 Thread Mark Kerzner SHMsoft, Inc.
Hi, https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-RecursiveMetadataandContent says that it returns a JSONified object. How do I process it in Java? I simple-mindedly expected a hashtable. Thank you, Mark Mark Kerzner, SHMsoft
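[Editor's sketch] The replies in this thread suggest a JSON library. The idea is language-independent, so it is sketched here in Python with the standard `json` module rather than in Java. The sample payload is made up; its shape (a JSON array with one object per document, where a key holds either a single value or a list) is an assumption about what the recursive endpoint returns, and `load_metadata` is a hypothetical helper name.

```python
import json

# Made-up sample, assumed to resemble a recursive-metadata response:
# one JSON object per document, container first, then embedded files.
SAMPLE = """
[
  {"Content-Type": "application/pdf",
   "dc:creator": ["Ann", "Bob"],
   "X-TIKA:content": "Lorem ipsum"},
  {"Content-Type": "image/png"}
]
"""

def load_metadata(payload):
    """Parse JSON text into a list of dicts mapping key -> list of values."""
    documents = json.loads(payload)
    normalized = []
    for doc in documents:
        # A key may carry one value or several; normalize everything to lists
        # so callers can treat all keys the same way.
        normalized.append(
            {key: value if isinstance(value, list) else [value]
             for key, value in doc.items()}
        )
    return normalized

docs = load_metadata(SAMPLE)
for index, doc in enumerate(docs):
    print(index, sorted(doc))
```

Each entry then behaves like the hashtable the question expected, with the first entry describing the container document.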

Re: Content-Type Metadata

2023-08-14 Thread Tim Allison
eithrbennett/rika, JRuby >>> wrapper for Tika) to work with current Tika versions and to add a command >>> line executable. >>> >>> I noticed that Rika opens the document's input stream twice; once to >>> call Tika#detect to get its media type, and aga

Re: Content-Type Metadata

2023-08-14 Thread Keith Bennett
he document's input stream twice; once to call >> Tika#detect to get its media type, and again to do the parsing. Is this >> detect call unnecessary? I noticed a Content-Type in the parsed metadata, >> which has the same value as the value returned by Tika#detect. Is >>

Re: Content-Type Metadata

2023-08-14 Thread Tim Allison
ommand > line executable. > > I noticed that Rika opens the document's input stream twice; once to call > Tika#detect to get its media type, and again to do the parsing. Is this > detect call unnecessary? I noticed a Content-Type in the parsed metadata, > which has the same

Content-Type Metadata

2023-08-09 Thread Keith Bennett
o the parsing. Is this detect call unnecessary? I noticed a Content-Type in the parsed metadata, which has the same value as the value returned by Tika#detect. Is Content-Type at least as reliable as Tika#detect? Thanks for any help on this. Also, if you have any interest in rika, feel free to let me kno

Re: Best practice for extracting content and metadata repeatedly

2023-03-06 Thread Nick Burch
On Mon, 6 Mar 2023, Chris Bamford via user wrote: From both performance and thread safety points of view what is the best approach for the use / reuse of the following objects: Tika ParseContext Parser Metadata The Tika object and/or TikaConfig object should only be created once and then re

is there a way to print out just the basic metadata about a file type using tika without too much specifics about the particular file?

2023-03-02 Thread Albretch Mueller
if you use: file --brief --mime "$file" you would get too little, practically unusable info. file --brief "$file" would give you what you need when it comes to pdf files (yes, the structure of pdf files depends on their version), but with images it gives you data relating to the specific file, s

Re: metadata keys

2022-10-07 Thread Tim Allison
I shouldn't be, but I'm disheartened by how many metadata keys are not name-spaced. I don't think we can do anything with these in 2.x, but for 3.x, we should be thinking about namespacing all the keys that don't have natural dc: or other standards. I'm also, frankly, be

Re: metadata keys

2022-10-07 Thread Tim Allison
Is there anything that leaps out that needs attention? On Fri, Oct 7, 2022 at 7:12 AM Markus Jelsma wrote: > > Ah, there are some differences this time, except for MboxParser, of course :) > > Very nice to see this happening, it wasn't present/noticed in the other set > tiff:ImageWidth,727519 > t

Re: metadata keys

2022-10-07 Thread Markus Jelsma
Ah, there are some differences this time, except for MboxParser, of course :) Very nice to see this happening, it wasn't present/noticed in the other set tiff:ImageWidth,727519 tiff:ImageLength,727512 There are this time also quite a few with whitespaces in the keys: Dimension HorizontalPixelSize

Re: metadata keys

2022-10-06 Thread Tim Allison
I reprocessed a million files and wrote proper UTF-8 csv files. This did away with any risk of me botching something via copy/paste from stdout. https://corpora.tika.apache.org/base/share/metadata-keys-1m-20221006.tgz On Mon, Oct 3, 2022 at 4:03 PM Markus Jelsma wrote: > > Hi Tim, >

Re: metadata keys

2022-10-03 Thread Markus Jelsma
pinterest.com/sarcasthttps > 3 > > > > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the > six most frequently used Arabic symbols are not present. I wonder why. But > there is an RTL-script present, Hebrew. It is always strange to meet > terms/

Re: metadata keys

2022-10-03 Thread Tim Allison
> RTL-scripts in an otherwise general LTR-world. > > I was a bit disappointed not to find any obscene terms. The set seemed to be > large enough for at least some general curse words. > > MboxParser is the real winner with 1763 unique keys, this is really absurd! > > Thanks, thi

Re: metadata keys

2022-10-03 Thread Markus Jelsma
fun! Markus On Mon, 3 Oct 2022 at 15:26, Tim Allison wrote: > All, > > I recently extracted metadata keys from 1 million files in our > regression corpus and did a group by. This allows insight into common > metadata keys. > > I've included two views, one looks at o

metadata keys

2022-10-03 Thread Tim Allison
All, I recently extracted metadata keys from 1 million files in our regression corpus and did a group by. This allows insight into common metadata keys. I've included two views, one looks at overall counts, and the other breaks down metadata keys by mime type. Please let us know i
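[Editor's sketch] The two views described in this message (overall key counts, and key counts broken down by mime type) amount to a group-by over per-file metadata. The sketch below shows one way to compute both in Python; the records are made up, `count_keys` is a hypothetical helper, and using `Content-Type` as the mime-type key is an assumption, not a description of how the published CSV files were produced.

```python
from collections import Counter, defaultdict

# Made-up per-file metadata records, for illustration only.
RECORDS = [
    {"Content-Type": "application/pdf", "dc:title": "A", "pdf:PDFVersion": "1.4"},
    {"Content-Type": "application/pdf", "dc:title": "B"},
    {"Content-Type": "image/tiff", "tiff:ImageWidth": "640"},
]

def count_keys(records):
    """Return (overall key counts, key counts grouped by mime type)."""
    overall = Counter()
    by_mime = defaultdict(Counter)
    for record in records:
        mime = record.get("Content-Type", "unknown")
        for key in record:
            overall[key] += 1          # view 1: how often each key occurs
            by_mime[mime][key] += 1    # view 2: the same, per mime type
    return overall, by_mime

overall, by_mime = count_keys(RECORDS)
for key, count in overall.most_common():
    print(key, count)
```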

Re: bug: adding to tika 2.4.2 config.xml truncates metadata return

2022-07-26 Thread PGNet Dev
I'd stared repeatedly at in the docs. seemed reasonable that since TesseractOCRParser *is* the default parser, excluding it made no sense. guess not! with config, + + + false

Re: bug: adding to tika 2.4.2 config.xml truncates metadata return

2022-07-26 Thread Tim Allison
Try something like this: 180 On Tue, Jul 26, 2022 at 6:52 AM PGNet Dev wrote: > removing dovecot from the equation, reduced this to just tika, > reproducible here > > running > > ls -al /srv/tika/tika-server.jar >

bug: adding to tika 2.4.2 config.xml truncates metadata return

2022-07-26 Thread PGNet Dev
removing dovecot from the equation, reduced this to just tika, reproducible here running ls -al /srv/tika/tika-server.jar lrwxrwxrwx 1 root root 50 Jul 26 05:42 /srv/tika/tika-server.jar -> tika-server-standard-2.4.2-20220725.215245-121.jar systemctl status tika

Re: tika-server - get metadata and content with a single round trip time

2021-05-06 Thread Cristian Zamfir
> Isn’t that what /rmeta does? >>> >>>> >>>> On Thu, May 6, 2021 at 8:03 AM Cristian Zamfir >>>> wrote: >>>> >>>>> Cool, this will work! Looking forward to release 1.27. >>>>> It does not work for archives

Re: tika-server - get metadata and content with a single round trip time

2021-05-06 Thread Tim Allison
n wrote: > >> Isn’t that what /rmeta does? >> >>> >>> On Thu, May 6, 2021 at 8:03 AM Cristian Zamfir >>> wrote: >>> >>>> Cool, this will work! Looking forward to release 1.27. >>>> It does not work for archives though, is there a way

Re: tika-server - get metadata and content with a single round trip time

2021-05-06 Thread Cristian Zamfir
ian Zamfir >> wrote: >> >>> Cool, this will work! Looking forward to release 1.27. >>> It does not work for archives though, is there a way to also get >>> recursively the metadata from all the files in the archive using tika/text >>> accept: applic

Re: tika-server - get metadata and content with a single round trip time

2021-05-06 Thread Tim Allison
Isn’t that what /rmeta does? > > On Thu, May 6, 2021 at 8:03 AM Cristian Zamfir > wrote: > >> Cool, this will work! Looking forward to release 1.27. >> It does not work for archives though, is there a way to also get >> recursively the metadata from all the files

Re: tika-server - get metadata and content with a single round trip time

2021-05-06 Thread Cristian Zamfir
Cool, this will work! Looking forward to release 1.27. It does not work for archives though, is there a way to also get recursively the metadata from all the files in the archive using tika/text accept: application/json? I suppose I can always wrap around the Tika library and implement this

Re: tika-server - get metadata and content with a single round trip time

2021-05-06 Thread Tim Allison
> X-TIKA:content is html and I would need plain text. > > What would be ideal would be an option to /tika (text|body) to essentially > > do what /rmeta provides and concatenate in the output the metadata and the > > data. Something like `curl -H "Accept: text/plain"

Re: tika-server - get metadata and content with a single round trip time

2021-05-06 Thread Cristian Zamfir
to /tika (text|body) to essentially do what /rmeta provides and concatenate in the output the metadata and the data. Something like `curl -H "Accept: text/plain" -H "X-Tika-meta: recursive" http://localhost:9998/tika` ? What do you think, does it make sense? Thanks, Cristi

Re: tika-server - get metadata and content with a single round trip time

2021-05-05 Thread Tim Allison
All, I recently added a feature matrix page to our wiki for some of the content +/- metadata endpoints in tika-server: https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared. Please take a look and let me know what you think. Cheers, Tim

Re: tika-server - get metadata and content with a single round trip time

2021-05-05 Thread Tim Allison
>> See also /rmeta. > >> > >> On Wed, May 5, 2021 at 5:20 AM Cristian Zamfir > wrote: > >>> > >>> Hi! > >>> > >>> Is there an option to tika-server to concatenate the metadata and the > content in the same call to localhost:9998/tika, in order to avoid a > separate upload of the file just to get the metadata? > >>> > >>> Thanks! > >>> Cristi >

Re: tika-server - get metadata and content with a single round trip time

2021-05-05 Thread Tim Allison
0 AM Cristian Zamfir wrote: >>> >>> Hi! >>> >>> Is there an option to tika-server to concatenate the metadata and the >>> content in the same call to localhost:9998/tika, in order to avoid a >>> separate upload of the file just to get the metadata? >>> >>> Thanks! >>> Cristi

Re: tika-server - get metadata and content with a single round trip time

2021-05-05 Thread Cristian Zamfir
ry before the 1.27 release. > > > See also /rmeta. > > On Wed, May 5, 2021 at 5:20 AM Cristian Zamfir > wrote: > >> Hi! >> >> Is there an option to tika-server to concatenate the metadata and the >> content in the same call to localhost:9998/tika, in order

Re: tika-server - get metadata and content with a single round trip time

2021-05-05 Thread Tim Allison
there an option to tika-server to concatenate the metadata and the > content in the same call to localhost:9998/tika, in order to avoid a > separate upload of the file just to get the metadata? > > Thanks! > Cristi >

tika-server - get metadata and content with a single round trip time

2021-05-05 Thread Cristian Zamfir
Hi! Is there an option to tika-server to concatenate the metadata and the content in the same call to localhost:9998/tika, in order to avoid a separate upload of the file just to get the metadata? Thanks! Cristi
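[Editor's sketch] The replies point to /rmeta for getting metadata and content in a single upload. The sketch below builds one such request with Python's standard library. The `/rmeta/text` path (plain-text content instead of HTML) and the `Accept` header are assumptions to check against the tika-server documentation for your version; the helper names are hypothetical, and `fetch_metadata_and_text` only works against a running server.

```python
import json
import urllib.request

def build_rmeta_request(file_bytes, base_url="http://localhost:9998"):
    """Build (but do not send) a single PUT carrying the file bytes."""
    return urllib.request.Request(
        url=base_url + "/rmeta/text",   # assumed endpoint path
        data=file_bytes,
        method="PUT",
        headers={"Accept": "application/json"},
    )

def fetch_metadata_and_text(file_bytes, base_url="http://localhost:9998"):
    """Send the request; requires a tika-server listening at base_url."""
    request = build_rmeta_request(file_bytes, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request needs no server, so this part is safe to run anywhere.
request = build_rmeta_request(b"example bytes")
print(request.get_method(), request.full_url)
```

With this approach the file is uploaded once, and both the metadata keys and the extracted content come back in the same JSON response.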

Re: Metadata

2020-12-29 Thread Nick Burch
On Mon, 28 Dec 2020, Peter Kronenberg wrote: For the metadata that comes back from a parse (example below), clearly, the fields are dependent on the file type and information available. Are there any 'standard' fields that come back for all/any files? Such as Author, date, x-pars

Metadata

2020-12-28 Thread Peter Kronenberg
For the metadata that comes back from a parse (example below), clearly, the fields are dependent on the file type and information available. Are there any 'standard' fields that come back for all/any files? Such as Author, date, x-parsed-by, etc. Is there a list of these

Re: Missing XMP Metadata from PDF

2020-05-12 Thread Tim Allison
Yep, that's a problem. Thank you! https://issues.apache.org/jira/browse/TIKA-3101 On Mon, May 11, 2020 at 2:24 PM Tim Allison wrote: > Thank you for letting us know about this and sharing a file. My belief is > that we should be trusting the XMP metadata over the PDFInfo for DC

Re: Missing XMP Metadata from PDF

2020-05-11 Thread Tim Allison
Thank you for letting us know about this and sharing a file. My belief is that we should be trusting the XMP metadata over the PDFInfo for DC metadata keys like TikaCoreProperties.CREATED. I'll take a look. On Mon, May 11, 2020 at 11:40 AM Tucker B wrote: > I have a PDF with XMP metad

Missing XMP Metadata from PDF

2020-05-11 Thread Tucker B
I have a PDF with XMP metadata with two rdf:Description tags with different namespaces. The first namespace is DublinCore the other is XMPSchemaBasic. I can confirm jempbox is able to read the XMP metadata properly and properly identify the namespaces. However, it appears the PDFParser in Tika is

Re: Get file metadata without retrieving entire file with Tika Server

2016-10-14 Thread Mr Havecamp
Thanks for the confirmation. Is this because obtaining information about, for example, the file size requires the entire file? I was under the impression that file metadata was contained in the x bytes of the file but this is probably something I have misunderstood. Thanks Hayden On 13

Re: Get file metadata without retrieving entire file with Tika Server

2016-10-13 Thread Nick Burch
extract the metadata information (we wouldn't be indexing the video content). For a great many file formats, including most video ones, you need the whole file to be able to fully extract all the metadata Nick

Get file metadata without retrieving entire file with Tika Server

2016-10-13 Thread Mr Havecamp
A while back we contributed a workaround we had for extracting metadata/content from remote urls. It wasn't the most ideal way to handle extraction of remote files but it meant we could index full text from files stored on a completely different server from our JAXRS server. We

Re: DATE metadata from email

2016-05-16 Thread Philipp Steinkrüger
> On 15 May 2016, at 20:18 , Nick Burch wrote: > > On Sun, 15 May 2016, Philipp Steinkrüger wrote: >> To begin with, I noticed the following behaviour which might or might not be >> a bug. I asked this question on stackexchange >> (https://stackoverflow.com/quest

Re: DATE metadata from email

2016-05-15 Thread Nick Burch
On Sun, 15 May 2016, Philipp Steinkrüger wrote: To begin with, I noticed the following behaviour which might or might not be a bug. I asked this question on stackexchange (https://stackoverflow.com/questions/37226842/tika-metadata-from-email-misses-date <https://stackoverflow.com/questi

DATE metadata from email

2016-05-15 Thread Philipp Steinkrüger
://stackoverflow.com/questions/37226842/tika-metadata-from-email-misses-date <https://stackoverflow.com/questions/37226842/tika-metadata-from-email-misses-date>) but perhaps this is a better place. I have two email testfiles: • A file that has been created by using "save as" i

Re: file-related metadata

2016-03-28 Thread Brian Young
So, after some debugging I discovered the root cause. The image metadata extractor is producing properties for "file modified date", "file name" and "file size." Unfortunately as mentioned in my original post, the file information is at times misleading since i

file-related metadata

2016-03-25 Thread Brian Young
Hello, I'm having an issue where I'm getting back two or three metadata properties that are related to a temp file that tika is apparently creating under the hood: File Modified Date (the current date) File Name (temp file name: apache-tika-3021300783416279997.tmp) File Size I assu
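[Editor's sketch] If the temp-file properties named in this message cannot be suppressed at the source, one option is to drop them after parsing. The sketch below does that on a plain dictionary in Python. The three key names come from the message; the helper name, the sample values, and the choice to filter downstream rather than change the parser are assumptions.

```python
# Keys reported in the message as describing Tika's temp file rather than
# the original document.
FILE_KEYS = frozenset({"File Modified Date", "File Name", "File Size"})

def drop_file_keys(metadata, unwanted=FILE_KEYS):
    """Return a copy of the metadata without the temp-file properties."""
    return {key: value for key, value in metadata.items()
            if key not in unwanted}

# Made-up sample values for illustration.
parsed = {
    "File Name": "apache-tika-3021300783416279997.tmp",
    "File Size": "12345 bytes",
    "Content-Type": "image/jpeg",
}
print(drop_file_keys(parsed))
```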

TIKA-1768: Document headers and footers in metadata

2015-10-20 Thread Aeham Abushwashi
Hello, Are there any plans to incorporate TIKA-1768 into the next 1.x release? I'd welcome any feedback on the patch and thoughts on better ways for implementing the enhancement. Best regards, Aeham

Re: Customizing Metadata Keys

2014-10-09 Thread Can Duruk
> I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to put in a plug for the RecursiveParserWrapper, which may be of use for you. I’ve been intending to add that to the app commandline and to server…how are you handling embedded document metadata? Would the wrapper

RE: Customizing Metadata Keys

2014-10-09 Thread Allison, Timothy B.
I agree with Nick’s recommendation on post-parsing key mapping, and I’d like to put in a plug for the RecursiveParserWrapper, which may be of use for you. I’ve been intending to add that to the app commandline and to server…how are you handling embedded document metadata? Would the wrapper be

Re: Customizing Metadata Keys

2014-10-09 Thread Can Duruk
ping downstream ContentHandler > that takes in the Metadata object and will reformat > the wrote: > > Perhaps a re-mapping downstream ContentHandler > that takes in the Metadata object and will reformat > the > > > Chris Mattmann > chris.mattm...

Re: Customizing Metadata Keys

2014-10-09 Thread Chris Mattmann
Perhaps a re-mapping downstream ContentHandler that takes in the Metadata object and will reformat the Reply-To: Date: Thursday, October 9, 2014 at 12:32 PM To: Subject: Re: Customizing Metadata Keys >On Wed, 8 Oct 2014, Can Duruk wrote: >> My question is regarding setting the meta

Re: Customizing Metadata Keys

2014-10-09 Thread Nick Burch
On Wed, 8 Oct 2014, Can Duruk wrote: My question is regarding setting the metadata keys coming from the parsers to my own keys. For my application, I am using Tika to extract the metadata for a bunch of files. I am using the embedded HTTP server which I modified for my needs to return instead

Customizing Metadata Keys

2014-10-08 Thread Can Duruk
Hi all, My question is regarding setting the metadata keys coming from the parsers to my own keys. For my application, I am using Tika to extract the metadata for a bunch of files. I am using the embedded HTTP server which I modified for my needs to return instead of CSV. (Hoping to submit that
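[Editor's sketch] The replies in this thread recommend remapping keys after parsing rather than inside the parsers. A minimal Python sketch of that post-parse step follows. The mapping table, the target key names, and `remap_keys` are all hypothetical; which source keys a parser actually emits depends on the file type and Tika version.

```python
# Hypothetical mapping from parser-emitted keys to application keys.
KEY_MAP = {
    "dc:creator": "author",
    "dcterms:created": "created_at",
}

def remap_keys(metadata, key_map=KEY_MAP, keep_unmapped=True):
    """Rename keys after parsing; optionally drop keys with no mapping."""
    remapped = {}
    for key, value in metadata.items():
        if key in key_map:
            remapped[key_map[key]] = value
        elif keep_unmapped:
            remapped[key] = value
    return remapped

# Made-up sample metadata.
parsed = {"dc:creator": "Jane", "Content-Type": "application/pdf"}
print(remap_keys(parsed))
print(remap_keys(parsed, keep_unmapped=False))
```

Keeping the mapping in one table means the parsers stay untouched and the application can change its key names without touching extraction code.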

Extract metadata

2014-02-25 Thread Sudheshna Iyer
Hello, 1. I have a few questions about the extraction of metadata. So I wanted to join the mailing list of the Tika user group. Can you please provide the email address for it? 2. How do I extract the metadata from a file? For example: I need author information. So for different files, author information

Re: extract metadata of pdf files with tika

2013-10-23 Thread Nick Burch
On Wed, 23 Oct 2013, Samuel Desseaux wrote: What language are you writing the rest of your solution in? How are you planning to transform and filter the metadata to get your xml? With java, i think. The simplest way to get started then would be using the Tika Facade helper class: http

Re: extract metadata of pdf files with tika

2013-10-23 Thread Samuel Desseaux
On 23/10/2013 15:12, Nick Burch wrote: On Wed, 23 Oct 2013, Samuel Desseaux wrote: I have many pdf files which i would like to extract metadata, in order to have an xml file (which respect dublin core). Do i have to write a program with tika to do it? What language are you writing the

Re: extract metadata of pdf files with tika

2013-10-23 Thread Nick Burch
On Wed, 23 Oct 2013, Samuel Desseaux wrote: I have many pdf files which i would like to extract metadata, in order to have an xml file (which respect dublin core). Do i have to write a program with tika to do it? What language are you writing the rest of your solution in? How are you

extract metadata of pdf files with tika

2013-10-23 Thread Samuel Desseaux
Hi, I'm a little newbie with tika and would need some help. I have many pdf files which i would like to extract metadata, in order to have an xml file (which respect dublin core). I've followed these links http://www.hascode.com/2012/12/content-detection-metadata-and-content-extra

RE: Extracting Metadata from MS Office (2007 +) Files on Glassfish

2013-09-26 Thread Tad Wimmer
er 26, 2013 12:54 PM To: user@tika.apache.org Subject: Re: Extracting Metadata from MS Office (2007 +) Files on Glassfish On Thu, 26 Sep 2013, Tad Wimmer wrote: > The tika-core-0.7, tika-parsers-0.7, tika-app-0.8, Firstly, tika-0.7 is rather old, you should upgrade. Secondly, the tika-app jar is st

Re: Extracting Metadata from MS Office (2007 +) Files on Glassfish

2013-09-26 Thread Nick Burch
On Thu, 26 Sep 2013, Tad Wimmer wrote: The tika-core-0.7, tika-parsers-0.7, tika-app-0.8, Firstly, tika-0.7 is rather old, you should upgrade. Secondly, the tika-app jar is standalone and shouldn't be included in a webapp. Thirdly, all the jars need to be from the same version of tika, you ca

Extracting Metadata from MS Office (2007 +) Files on Glassfish

2013-09-26 Thread Tad Wimmer
Hello. I'm a Tika newbie, and running into an issue with Tika on Glassfish. I'm using Tika to extract metadata from documents uploaded to a JSF 2.0 Web application using Prime Faces p:fileupload (Prime Faces 3.5) and running on Glassfish 3.2.2. Here are the essentials of my code: pr

Re: Tika metadata values

2013-04-01 Thread Nick Burch
On Mon, 1 Apr 2013, Gary McGath wrote: You can get some idea of the kinds of metadata Tika will return by running the tika-app jar with the --list-met-models option. However, I think that might need a bit of tweaking since the work done to make more of the Tika metadata be properties based

Re: Tika metadata values

2013-04-01 Thread Gary McGath
On 4/1/13 10:48 AM, Nick Burch wrote: > On Mon, 1 Apr 2013, Gary McGath wrote: >> Is there any documentation of the metadata values (e.g., compression >> types) that Tika can return? > > Metadata? Or mime type? If I wanted to know if a file was a .tar.gz > compressed arc

Re: Tika metadata values

2013-04-01 Thread Nick Burch
On Mon, 1 Apr 2013, Gary McGath wrote: Is there any documentation of the metadata values (e.g., compression types) that Tika can return? Metadata? Or mime type? If I wanted to know if a file was a .tar.gz compressed archive, or a .arj one, I'd use the mimetype detection in Tika rather

Tika metadata values

2013-04-01 Thread Gary McGath
Is there any documentation of the metadata values (e.g., compression types) that Tika can return? I've been trying to find them in the source code and not having much luck; a grep of the whole directory turns up, for example, the string "lzw" in the test files but nowhere else,

Re: Improvement in Metadata Class

2013-03-06 Thread Lewis John Mcgibbney
>> >> RE: #3 — it would be great to get Nutch using Tika's metadata container >> — I don't think we have anything special in Nutch that prevents it. >> RE: #2 — I committed your Tika doc patch during ApacheCon NA 2013 so >> thanks! >> >> Th

Re: Improvement in Metadata Class

2013-03-06 Thread Lewis John Mcgibbney
get Nutch using Tika's metadata container > — I don't think we have anything special in Nutch that prevents it. > RE: #2 — I committed your Tika doc patch during ApacheCon NA 2013 so > thanks! > > Thanks! > > Cheers, > Chris > > > From: Lewis John Mcgi

Re: Improvement in Metadata Class

2013-03-03 Thread Mattmann, Chris A (388J)
Hey Lewis, RE: #3 — it would be great to get Nutch using Tika's metadata container — I don't think we have anything special in Nutch that prevents it. RE: #2 — I committed your Tika doc patch during ApacheCon NA 2013 so thanks! Thanks! Cheers, Chris From: Lewis John

Re: Improvement in Metadata Class

2013-02-26 Thread Lewis John Mcgibbney
Very helpful as ever Nick. See you later. Get the lager ready. Lewis On Tue, Feb 26, 2013 at 4:37 PM, Nick Burch wrote: > On Tue, 26 Feb 2013, Lewis John Mcgibbney wrote: > >> 1. >> In Apache Nutch we are using the Metadata class [0] as follows >> if (tik

Re: Improvement in Metadata Class

2013-02-26 Thread Nick Burch
On Tue, 26 Feb 2013, Lewis John Mcgibbney wrote: 1. In Apache Nutch we are using the Metadata class [0] as follows if (tikaMDName.equalsIgnoreCase(Metadata.TITLE)) continue; TITLE value is deprecated and I want to upgrade API usage. What should I be using? I was going to say "Just chec

Improvement in Metadata Class

2013-02-26 Thread Lewis John Mcgibbney
Hi, (This is maybe traffic for dev@ but I hope it is OK here on user@) 1. In Apache Nutch we are using the Metadata class [0] as follows if (tikaMDName.equalsIgnoreCase(Metadata.TITLE)) continue; TITLE value is deprecated and I want to upgrade API usage. What should I be using? 2. I would like

Re: Newlines not escaped in CSV Metadata (Tika Rest Server)

2012-10-27 Thread David James
A quick update: stripping out the carriage returns '\r' before passing to Ruby's CSV.parse does the trick -- nothing else is needed. On Oct 27, 2012, at 11:54 AM, "Mattmann, Chris A (388J)" wrote: > Hey David, > > Thanks man for following this up on list and for the blog post -- great work!

Re: Newlines not escaped in CSV Metadata (Tika Rest Server)

2012-10-27 Thread Mattmann, Chris A (388J)
Hey David, Thanks man for following this up on list and for the blog post -- great work! I love TIKA-593 (our REST server) too! :) Cheers, Chris On Oct 26, 2012, at 2:37 PM, David James wrote: > I have found no evidence that Tika is the problem. I have found reason > to suspect that Ruby 1.9.3

Re: Newlines not escaped in CSV Metadata (Tika Rest Server)

2012-10-26 Thread David James
I have found no evidence that Tika is the problem. I have found reason to suspect that Ruby 1.9.3's CSV is acting funny. This is my work-around for Ruby 1.9.3, maybe it will be useful to someone besides me. class TikaCSV def self.parse(s) s.split(/\n(?="[^"])/).reduce([]) { |a, x| a += CSV.

Re: Newlines not escaped in CSV Metadata (Tika Rest Server)

2012-10-26 Thread David James
not caused any problems with Ruby 1.9.3's CSV.parse. > My problem is that newlines are not escaped in the CSV Metadata from the Tika > Rest Server.

Newlines not escaped in CSV Metadata (Tika Rest Server)

2012-10-26 Thread David James
Tika Users, First off, Tika has been super helpful. The new REST server especially so. My problem is that newlines are not escaped in the CSV Metadata from the Tika Rest Server. Would it make sense to escape the newlines? Any opinions either way? I realize that there is no single CSV standard
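[Editor's sketch] The follow-ups in this thread conclude that Tika's output was not at fault: a newline inside a quoted CSV field is acceptable to a parser that honors quoting. The sketch below illustrates that in Python (not the Ruby used in the thread) with a made-up two-record sample; it is not a reproduction of the server's actual output, and `parse_metadata_csv` is a hypothetical helper.

```python
import csv
import io

# Made-up sample: the second field of the first record spans two lines,
# and records end with CRLF.
SAMPLE = '"title","Line one\nLine two"\r\n"author","Jane"\r\n'

def parse_metadata_csv(text):
    """Parse CSV text, keeping newlines that sit inside quoted fields."""
    # newline="" hands the raw line endings to the csv module, which then
    # decides what is a record boundary and what is field content.
    return list(csv.reader(io.StringIO(text, newline="")))

rows = parse_metadata_csv(SAMPLE)
for row in rows:
    print(row)
```

The quoted newline stays inside the field and only the unquoted line ending closes a record, which is why no extra escaping is required for a quoting-aware reader.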

Re: Parse metadata only

2012-05-29 Thread Nick Burch
hout doing a parse on the whole document (with a content handler, etc.)? At the moment, that's not possible. Most file formats don't have all their metadata in entirely separate places, so you end up having to process almost all of the file anyway. (There has been talk about imple

Parse metadata only

2012-05-29 Thread Thinus Prinsloo
Hey all - I hope this is the right place to ask. Feel free to point me somewhere else if needed. I would like to parse the meta-data of a massive amount of PDF files only. I do not want to extract the text, not yet anyway, only get meta-data information such as "Creation-Date", etc. Is it pos

Re: ForkParser and Metadata

2012-05-01 Thread Michael McCandless
On Tue, May 1, 2012 at 1:47 PM, Jukka Zitting wrote: > Currently the ForkParser doesn't return metadata, though adding that feature > shouldn't be too difficult. My original use case didn't need metadata, so I > never implemented that bit. OK, thanks Jukka. I'll

Re: ForkParser and Metadata

2012-05-01 Thread Jukka Zitting
Hi, Currently the ForkParser doesn't return metadata, though adding that feature shouldn't be too difficult. My original use case didn't need metadata, so I never implemented that bit. Jukka Zitting On 1.5.2012 at 19:26, "Michael McCandless" wrote: > Does anyone know

ForkParser and Metadata

2012-05-01 Thread Michael McCandless
Does anyone know whether ForkParser is expected to return the Metadata back from the child process...? It seems not to now: when I run TikaCLI -m --fork, I seem to get no metadata (but if I don't specify --fork, I do). Mike McCandless http://blog.mikemccandless.com

Re: PDF title metadata

2011-11-10 Thread Nick Burch
On Thu, 10 Nov 2011, Bai Shen wrote: I have a pdf that I'm using tika on, and it's not finding a title. Is there a way to configure tika to set the filename as the title if one isn't found? Tika doesn't currently support that, you'll need to do the check yourself. Shouldn't be too tricky though

PDF title metadata

2011-11-10 Thread Bai Shen
I have a pdf that I'm using tika on, and it's not finding a title. Is there a way to configure tika to set the filename as the title if one isn't found? My google-fu is failing me today. Thanks.
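[Editor's sketch] The reply says the check has to be done by the caller. A minimal version of that check, written in Python against a plain dictionary, might look like the following. The `dc:title` key name is an assumption (the title key differs across Tika versions), and `with_title_fallback` is a hypothetical helper.

```python
import os

def with_title_fallback(metadata, path, title_key="dc:title"):
    """Return a copy of metadata whose title falls back to the file name."""
    result = dict(metadata)
    title = str(result.get(title_key) or "").strip()
    if not title:
        # No usable title was extracted: use the file name instead.
        result[title_key] = os.path.basename(path)
    return result

print(with_title_fallback({}, "/docs/report.pdf"))
print(with_title_fallback({"dc:title": "Annual Report"}, "/docs/report.pdf"))
```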

RE: Metadata extracted by OutlookExtractor

2011-09-28 Thread Swapna Vuppala
Thanks for the info Nick, I'll have a look at that. Best Regards, Swapna. -Original Message- From: Nick Burch [mailto:nick.bu...@alfresco.com] Sent: Wednesday, September 28, 2011 4:29 PM To: user@tika.apache.org Subject: Re: Metadata extracted by OutlookExtractor On Wed, 28 Sep

Re: Metadata extracted by OutlookExtractor

2011-09-28 Thread Nick Burch
On Wed, 28 Sep 2011, Swapna Vuppala wrote: Am new to using Solr and Tika. Am trying to index .msg files (Outlook mails) into Solr. For this, I need a list of metadata extracted by Tika from emails. I would like to know what all fields from a .msg file are extracted by Tika's outlookextr

Metadata extracted by OutlookExtractor

2011-09-28 Thread Swapna Vuppala
Hi, Am new to using Solr and Tika. Am trying to index .msg files (Outlook mails) into Solr. For this, I need a list of metadata extracted by Tika from emails. I would like to know what all fields from a .msg file are extracted by Tika's outlookextractor. Can you please direct me where

Re: org.apache.tika.exception.TikaException while trying to get images' metadata ...

2011-09-09 Thread Albretch Mueller
Hmm! But, why do I have then so many corrupted files? ~ $ ls -l total 240 -rw-r--r-- 1 knoppix knoppix 23772 Sep 9 16:05 26-3.jpg -rw-r--r-- 1 knoppix knoppix 172471 Sep 9 16:05 3997912989_5e666b3a4b.jpg -rwxr-xr-x 1 knoppix knoppix 9620 Sep 9 16:05 6926.jpeg -rwxr-xr-x 1 knoppix knoppix 1

Re: org.apache.tika.exception.TikaException while trying to get images' metadata ...

2011-09-09 Thread Nick Burch
On Thu, 8 Sep 2011, Albretch Mueller wrote: ~ getting metadata from files using tika seemed easy. Right now what I am most interested in is images but all tika gives me is: ~ org.apache.tika.exception.TikaException: Can't read JPEG metada

org.apache.tika.exception.TikaException while trying to get images' metadata ...

2011-09-08 Thread Albretch Mueller
After reading this post: ~ http://blog.jeroenreijn.com/2010/04/metadata-extraction-with-apache-tika.html ~ getting metadata from files using tika seemed easy. Right now what I am most interested in is images but all tika gives me is: ~ org.apache.tika.exception.TikaException: Can't read

Re: Tika as server/daemon for content, metadata and language

2011-06-24 Thread Mattmann, Chris A (388J)
:31 AM, Marian Steinbach wrote: > Hi! > > I have tested the Tika client for extraction of content, metadata and > language and I'm really happy with the results. > > For performance reasons when extracting larger numbers of documents I > think it would be worthwhile to
