Re: tika for touchstone file format

Nick Burch Thu, 08 Oct 2015 03:01:31 -0700

On 06/10/15 12:36, Eva Schlauch wrote:

Thanks for the nice introduction to apache tika at the Budapest
apache:big_data conference!


Glad to see we've inspired you to join the community :)

I am considering using apache tika together with apache oodt. Some of
the files that are generated here (in the mm-wave lab at MPIfR) are of
the touchstone file format, namely .s2p and similar. (I think the
general file would be .snp). I have almost no experience with writing
parsers, but as far as I see, I would need a new parser to be used from
within tika, is that right? Haven't seen any parser in a short web search,
but would be glad to be pointed in the right direction.


I'd suggest first starting with the "5 minute parser quickstart" guide -

http://tika.apache.org/1.10/parser_guide.html - though be aware that
it's only 5 minutes in the same way that a Jamie Oliver 30 minute meal
is 30 minutes, i.e. after practice and with the environment prepared!

How would I go about getting tika to parse the metadata from the header?
Here is a link to the format description:
http://na.support.keysight.com/plts/help/WebHelp/FilePrint/SnP_File_Format.htm

It looks like the header should be fairly easy to spot. First thing to
do would be to create/get a small test file for each of the s#p files.
Open an enhancement in jira for mime detection, and attach those. Next,
follow the guide to add mime entries and mime magic for those, along
with writing a small detection test using those files. Attach all that
to the jira.

With detection working, it's on to the parser. With a plain-text based
format, it should be fairly easy. It looks from that page like using a
BufferedReader to read it line-by-line should work. Tokenize/split to
get the key, map that onto a suitable Tika metadata key, and save the
value (converting if needed). It looks like you might need to have two
sets of logic, one for s1p and one for s2p, as they look to have
slightly different headers, though I guess the data is the same?
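
A minimal sketch of the line-by-line approach, using only the JDK so it
can be tried outside Tika. The class name, method name and the
"touchstone:" key names are hypothetical, not existing Tika metadata
keys. It assumes a version 1 style file where "!" starts a comment and
a single "#" option line carries the frequency unit, parameter type,
data format and reference resistance, and it assumes the defaults are
GHz, S, MA and R 50; please check both against the format description
before relying on them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TouchstoneHeaderSketch {
    // Hypothetical key names, for illustration only
    static final String UNIT = "touchstone:frequency-unit";
    static final String PARAM = "touchstone:parameter";
    static final String FORMAT = "touchstone:format";
    static final String RESISTANCE = "touchstone:reference-resistance";

    /** Reads up to the first "#" option line and returns its fields. */
    public static Map<String, String> readOptionLine(BufferedReader reader)
            throws IOException {
        Map<String, String> meta = new LinkedHashMap<>();
        // Assumed defaults, used for any field the option line omits
        meta.put(UNIT, "GHZ");
        meta.put(PARAM, "S");
        meta.put(FORMAT, "MA");
        meta.put(RESISTANCE, "50");

        String line;
        while ((line = reader.readLine()) != null) {
            int bang = line.indexOf('!');          // drop "!" comments
            if (bang >= 0) {
                line = line.substring(0, bang);
            }
            line = line.trim();
            if (line.isEmpty()) {
                continue;                          // blank or comment-only line
            }
            if (!line.startsWith("#")) {
                break;                             // data reached, keep defaults
            }
            String[] tokens = line.substring(1).trim()
                    .toUpperCase(Locale.ROOT).split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                String t = tokens[i];
                if (t.isEmpty()) {
                    continue;
                }
                if (t.endsWith("HZ")) {
                    meta.put(UNIT, t);
                } else if (t.equals("R")) {
                    if (i + 1 < tokens.length) {
                        meta.put(RESISTANCE, tokens[++i]);
                    }
                } else if (t.equals("MA") || t.equals("DB") || t.equals("RI")) {
                    meta.put(FORMAT, t);
                } else if (t.length() == 1 && "SYZHG".indexOf(t.charAt(0)) >= 0) {
                    meta.put(PARAM, t);
                }
            }
            break;                                 // only one option line is read
        }
        return meta;
    }
}
```

Inside a real Tika parser the same loop would sit in the parse method,
with each value copied into the Metadata object passed to the parser
rather than into a Map.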

Finally, for the values, probably outputting in a table would be the
least worst option.
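
To illustrate the table idea, here is a rough JDK-only sketch that turns
the data lines into an XHTML table string. The class and method names
are hypothetical. As a simplification it emits one table row per
physical line, skipping "!" comments and the "#" option line; files
that wrap one frequency point over several lines would need extra
handling. In an actual Tika parser the same structure would be written
as SAX events to the content handler instead of being built as a
string.

```java
import java.io.BufferedReader;
import java.io.IOException;

public class TouchstoneTableSketch {

    /** Renders each data line as one table row, one cell per value. */
    public static String toXhtmlTable(BufferedReader reader) throws IOException {
        StringBuilder out = new StringBuilder("<table>\n");
        String line;
        while ((line = reader.readLine()) != null) {
            int bang = line.indexOf('!');          // drop "!" comments
            if (bang >= 0) {
                line = line.substring(0, bang);
            }
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;                          // not a data line
            }
            out.append("  <tr>");
            for (String value : line.split("\\s+")) {
                out.append("<td>").append(escape(value)).append("</td>");
            }
            out.append("</tr>\n");
        }
        return out.append("</table>\n").toString();
    }

    // Minimal escaping so unexpected characters cannot break the markup
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```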

Depending on your preferences, you can use a github fork with a custom
branch, or reviewboard, or just patches in jira, to get feedback on your
work.

Oh, and if you're looking for some existing code to "crib" from, you
might be better looking at some of the non-scientific parsers. Quite a
few of the scientific parsers (though not all!) have been written by
other new community members, so may not always follow best-practice.
Don't get me wrong - they work, and it's awesome that we have them! They
just might not always be the best example to learn from.


We'll look forward to your patches soon :)

Thanks
Nick
