I learned of Ghostscript's -dFILTER options from: https://askubuntu.com/questions/477663/how-to-remove-images-from-a-pdf-file
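For reference, here is a minimal sketch of the kind of call that answer describes, wrapped in Java since that is where a Tika subproject would live. The file names are placeholders, -dFILTERIMAGE/-dFILTERVECTOR need Ghostscript 9.23 or newer, and this assumes "gs" is on the PATH:

    import java.io.IOException;

    public class StripPdfGraphics {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Placeholder paths; substitute your own files.
            String input = "input.pdf";
            String output = "text-only.pdf";
            Process gs = new ProcessBuilder(
                    "gs", "-o", output,      // -o implies -dBATCH -dNOPAUSE
                    "-sDEVICE=pdfwrite",
                    "-dFILTERIMAGE",         // drop raster images
                    "-dFILTERVECTOR",        // drop vector drawings as well
                    input)
                .inheritIO()                 // pass Ghostscript's console output through
                .start();
            System.exit(gs.waitFor());
        }
    }

Dropping -dFILTERVECTOR keeps line art (diagrams, reference tables) if that is what you are after.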
I tested it on what I consider to be more real-life cases:

https://nysl.ptfs.com/data/Library1/Library1/pdf/39007765_US-History-and-Government-Russian-Edition_2004-JAN-28.pdf
https://nysl.ptfs.com/data/Library1/Library1/pdf/39007765_EARTH-SCIENCE-REFERENCE-TABLE-CHINESE-2010.pdf
https://download.archive.org/byte-magazine-1986-02/1986_02_BYTE_11-02_Text_Processing.pdf
https://nysl.ptfs.com/data/Library1/Library1/pdf/7590547_Physical-Geography-Nov-1884.pdf
https://nysl.ptfs.com/data/Library1/Library1/pdf/7590547_Astronomy-Mar-14-1894.pdf

Depending on your expectations, it may not work all that well. The output file still needs heavy "human" eyeballing and intervention. Do those options figure out where an image might be from the pixel arrangement in the page layout, or do they actually work from the page's readily available metadata?

PDF files, contrary to what their name suggests, are neither portable nor documents. Also, there is a plethora of PDF file types, from page-to-page quasi-textual ones to image-based ones, image-based ones that also contain the actual text, ...

If we can't hope to fully and algorithmically textualize PDF files, why not design GUIs to help "humans" pick up where the algorithms end? To me that would be a much-needed Tika subproject: a Java-based "file cleansing/reformatting" GUI.

lbrtchx
