Mark,

Thanks for your response.

Maybe I should explain what my little project is.

There is an ancient grammar of Sanskrit, composed in Sanskrit around 600 BC
and called the Aṣṭādhyāyī.  It was originally oral and only later written
down.  It contains some 4,000 sutras (the sutras were later numbered in the
form 1.2.345); the sutras we may loosely call “rules” which, when applied in
a supposedly consistent manner, produce strings of the Sanskrit language.
The grammar also contains technical terms.  Both the sutras and the
technical terms are scattered throughout the literature, which is somewhat
vast.



My little project is an indexing system.  It has the following parts:

1)    A program that reads in a list of technical terms and stores them in
a PostgreSQL database along with the work and the page numbers on which
they occur.  Technical terms will be in Devanāgarī (the name of the writing
system); Devanāgarī looks like this: जैदागह.
The part I am currently working on is this: read in the list (one technical
term per line) and put it in a list ( *list-name [ ] *), with the carriage
returns/line feeds stripped off.
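For what it is worth, the read-in step I have in mind looks roughly like
this (a minimal sketch; the class name and file path are placeholders, and
I am assuming the term file is UTF-8):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TermListReader {

    // Reads one technical term per line.  Files.readAllLines already
    // strips the line terminators (\r\n or \n), so no carriage
    // returns/line feeds end up in the list.
    public static List<String> readTerms(String path) throws IOException {
        List<String> terms = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String term = line.trim();   // drop stray leading/trailing whitespace
            if (!term.isEmpty()) {       // skip blank lines
                terms.add(term);
            }
        }
        return terms;
    }
}
```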

Then run the list of technical terms against the document, which is a
Word .doc file.  The output will be a file consisting of lines in the form
*technical-term page 1, …, page n.*
Note that I need to pick up the page numbers in the document, hence POI or
Tika.  Which do you think would be better for this simple task?
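However the pages are obtained, the matching step itself is simple.  A
sketch of what I intend (assuming the document has already been split into
one string per page; `TermIndexer` and the method names are just
placeholders):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermIndexer {

    // Given the document split into pages (pages[i] holds the text of
    // page i + 1) and the list of technical terms, build a map from
    // each term to the list of pages it occurs on, in order.
    public static Map<String, List<Integer>> index(String[] pages, List<String> terms) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String term : terms) {
            List<Integer> hits = new ArrayList<>();
            for (int i = 0; i < pages.length; i++) {
                if (pages[i].contains(term)) {
                    hits.add(i + 1);     // page numbers are 1-based
                }
            }
            if (!hits.isEmpty()) {       // omit terms that never occur
                result.put(term, hits);
            }
        }
        return result;
    }

    // Formats one output line: "term  page 1, page 2, ..., page n"
    public static String line(String term, List<Integer> pageNumbers) {
        StringBuilder sb = new StringBuilder(term);
        for (int i = 0; i < pageNumbers.size(); i++) {
            sb.append(i == 0 ? "  page " : ", page ").append(pageNumbers.get(i));
        }
        return sb.toString();
    }
}
```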

2)    A second part is analyzing the text (a work about the grammar) for
sutra numbers, which have the form *1.8.123* or *1.8.123-45*.  They always
have this form or, rarely, just *8.123*.
Again, the output file will contain lines like
*1.8.123  page 1, …, page n.*
I have not started this section.
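The sutra-number recognition should be doable with a regular expression.  A
rough sketch (the digit counts are my guesses and may need adjusting):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SutraScanner {

    // Matches "1.8.123" and the range form "1.8.123-45", plus the
    // rarer two-part form "8.123".  The three-part alternative is
    // tried first so the two-part one cannot steal a partial match.
    private static final Pattern SUTRA = Pattern.compile(
        "\\b\\d\\.\\d{1,2}\\.\\d{1,3}(?:-\\d{1,3})?\\b|\\b\\d{1,2}\\.\\d{1,3}\\b");

    // Returns every sutra number found in the page text, in order.
    public static List<String> scan(String pageText) {
        List<String> found = new ArrayList<>();
        Matcher m = SUTRA.matcher(pageText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```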

3)    The above two parts are separate runs on the document.  The screens
are built with Swing, using the NetBeans GUI builder.



A simple indexing system.  I suspect this kind of thing is done all over
the internet on all kinds of material.



Note:  no extracting (no cutting strings out of the text), just recognition
of items and their page numbers.



What tools should I be using?  I am not a computer professional, although I
did teach computer science at the Georgia Institute of Technology and
Kalamazoo College in my younger days and did database and expert system
work at Boeing.



Any help you can offer will be appreciated.

On Wed, Jan 23, 2013 at 3:01 AM, Mark Beardsley <[email protected]> wrote:

> Good to hear that you have the IDE working now.
>
> Secondly, what do you actually want to do with the documents you have? Is
> it
> simply a case of reading their contents? If so, then I would suggest that
> you take a look at a second Apache project called Tika -
> http://tika.apache.org/. It has been created to do just that, process a
> huge
> variety of documents and strip the contents from them. The part that deals
> with Office documents is built upon POI and has been the target of lots of
> work to iron out bugs with errant carriage returns and the like. You may
> see
> on the POI lists the name Nick Burch cropping up; well Nick was the chair
> of
> the POI project for a few years and is very much involved with Tika so
> there
> is considerable overlap.
>
> Give Tika a go if all you need to do is get at the document's content. If
> you need to do some other work with the documents then let us know and we
> can dig deeper to handle those carriage returns and odd characters.
>
> Yours
>
> Mark B
>
