On 06/10/15 12:36, Eva Schlauch wrote:
Thanks for the nice introduction to apache tika at the Budapest
apache:big_data conference!

Glad to see we've inspired you to join the community :)

I am considering to use apache tika together with apache oodt. Some of
the files that are generated here (in the mm-wave lab at MPIfR) are of
the touchstone file format, namely .s2p and similar. (I think the
general file would be .snp). I have almost no experience with writing
parsers, but as far as I see, I would need a new parser to be used from
within tika, is that right? Haven't seen any parser in a short web search,
but would be glad to be pointed to the right direction.

I'd suggest first starting with the "5 minute parser quickstart" guide -
http://tika.apache.org/1.10/parser_guide.html - though be aware that it's only 5 minutes in the same way that a Jamie Oliver 30 minute meal is 30 minutes, i.e. after practice and with the environment prepared!

How would I go about to get tika parse the metadata from the header?
Here is a link to the format description:
http://na.support.keysight.com/plts/help/WebHelp/FilePrint/SnP_File_Format.htm

It looks like the header should be fairly easy to spot. The first thing to do is to create or obtain a small test file for each of the s#p file types. Open an enhancement in JIRA for mime detection, and attach those files. Next, follow the guide to add mime entries and mime magic for those types, and write a small detection test that uses those files. Attach all of that to the JIRA issue.

With detection working, it's on to the parser. With a plain-text-based format, it should be fairly easy. From that page, it looks like using a BufferedReader to read the file line by line should work. Tokenize/split each header line to get the key, map that onto a suitable Tika metadata key, and save the value (converting it if needed). You might need two sets of logic, one for s1p and one for s2p, as they appear to have slightly different headers, though I guess the data is the same?
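
To make the line-by-line idea a bit more concrete, here is a minimal sketch in plain Java, with no Tika classes involved. It assumes, from my reading of the linked format page, that comment lines start with "!" and that the header is a single option line starting with "#", such as "# GHz S MA R 50". The class name, the method names and the "touchstone:..." key names are all made up for illustration; in a real parser you would put the values into Tika's Metadata object under whatever keys get agreed on in the JIRA issue.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: pulls the option line out of a
// Touchstone-style text file and splits it into key/value pairs.
class TouchstoneHeaderSketch {

    // Skips blank lines and "!" comments, then parses the first "#" line.
    // Returns an empty map if data starts before any option line is seen.
    static Map<String, String> readHeader(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("!")) {
                continue; // blank line or comment line
            }
            if (trimmed.startsWith("#")) {
                return parseOptionLine(trimmed);
            }
            break; // first data row reached without an option line
        }
        return new LinkedHashMap<>();
    }

    // Tokenizes an option line such as "# GHz S MA R 50".
    // The key names below are illustrative assumptions only.
    static Map<String, String> parseOptionLine(String line) {
        Map<String, String> meta = new LinkedHashMap<>();
        int bang = line.indexOf('!');
        if (bang >= 0) {
            line = line.substring(0, bang); // drop a trailing comment
        }
        String[] tokens = line.substring(1).trim()
                .toUpperCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            switch (tokens[i]) {
                case "HZ": case "KHZ": case "MHZ": case "GHZ":
                    meta.put("touchstone:frequency-unit", tokens[i]);
                    break;
                case "S": case "Y": case "Z": case "H": case "G":
                    meta.put("touchstone:parameter", tokens[i]);
                    break;
                case "DB": case "MA": case "RI":
                    meta.put("touchstone:format", tokens[i]);
                    break;
                case "R":
                    // "R" is followed by the reference resistance value
                    if (i + 1 < tokens.length) {
                        meta.put("touchstone:reference-resistance", tokens[++i]);
                    }
                    break;
                default:
                    break; // unknown token: ignored in this sketch
            }
        }
        return meta;
    }
}
```

Calling readHeader with a Reader over one of your test files should give you a small map of header values to check against what you expect. The sketch does not apply the format's defaults when a token is missing, and it does not look at the data rows at all.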

Finally, for the values, outputting them as a table would probably be the least worst option.


Depending on your preferences, you can use a GitHub fork with a custom branch, or Review Board, or just patches in JIRA, to get feedback on your work.

Oh, and if you're looking for some existing code to "crib" from, you might be better off looking at some of the non-scientific parsers. Quite a few of the scientific parsers (though not all!) have been written by other new community members, so they may not always follow best practice. Don't get me wrong - they work, and it's awesome that we have them! They just might not always be the best examples to learn from.

We'll look forward to your patches soon :)

Thanks
Nick
