Re: Parsing a PDF file

2016-07-11 Thread [-hh]
Sorry, an essential line is missing:
replace "" with f0 in fScript

Here the full (correct) button-script:

local PDFfolder="/Users/admin/Downloads/precincts"

on mouseUp
  set defaultfolder to PDFfolder
  put the files into ff
  filter ff with "*.pdf"
  put field "AS" into aScript
  repeat for each line f in ff
    put aScript into fScript
    put PDFfolder & "/" & f into f0
    replace "//" with "/" in f0
    replace "" with f0 in fScript
    do fScript as applescript
    go this stack
    set itemdelimiter to "."
    put "txt" into last item of f0
    set itemdelimiter to comma
    put clipboardData["text"] into url ("file:" & f0)
    put f0 & cr before fld "jobsDone"
  end repeat
end mouseUp
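
For anyone adapting the script above, here is a rough sketch of the template-substitution step on its own, with a check on the outcome of the AppleScript call. The marker "@@PDFPATH@@" and the handler names are invented for illustration and are not taken from the post. The check assumes that a failed "do ... as applescript" leaves an error string (such as "execution error") in the result.

```livecode
-- Sketch only. "@@PDFPATH@@" is a hypothetical marker; the AppleScript
-- template would then contain a line like:  set myPath to "@@PDFPATH@@"
function scriptForFile pTemplate, pPath
   -- parameters are copies, so the caller's template stays untouched
   replace "@@PDFPATH@@" with pPath in pTemplate
   return pTemplate
end scriptForFile

on convertOne pTemplate, pPath
   local tResult
   do scriptForFile(pTemplate, pPath) as applescript
   put the result into tResult
   -- assumption: a failed run reports a string containing "error"
   if tResult contains "error" then
      answer "AppleScript failed for" && pPath & ":" && tResult
   end if
end convertOne
```

Called inside the repeat loop as: convertOne aScript, f0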






--
View this message in context: 
http://runtime-revolution.278305.n4.nabble.com/Parsing-a-PDF-file-tp4706466p4706578.html
Sent from the Revolution - User mailing list archive at Nabble.com.

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: Parsing a PDF file

2016-07-11 Thread [-hh]
> Roger E.  wrote:
> > Since this seems to be Mac only, why not "do as Applescript" then select
> > all, and Copy?
Kay C. L. wrote
> Because Preview isn't properly scriptable and you can't "Select All"
> or "Copy". As Richard said, the answer is with Automator. 

Automator is a GUI for "bundled" AppleScript routines. An alternative
here may be to use AppleScript directly, because that's easier to
"adjust-if-needed":

Here is a LC script and an AppleScript that together do the PDF2TXT job.
It's pretty slow, but it delivers tables a little better "formatted"
than pdfToText or Acrobat Reader's "Save as text" do.

I prefer to separate the steps and watch the process (activated apps).
[a] Download all files to a folder.
[b] Convert all pdf files of that folder to text into that folder.
[c] Work on the converted files.

The following works here, running MacOS 10.11.5, with LC 6/7/8.
Probably you need at least MacOS 10.6.

For step [b]:

[1] Allow "Accessibility" access as described in the comments below,
and put the following into a field "AS"
-- begin field
-- Needs assistive access enabled:
--  Before MacOS 10.11:
--   System preferences/Accessibility --> Enable access for assistive d.
--  MacOS 10.11 and later:
--   System preferences/Security/Accessibility --> add Livecode
tell application "Preview"
  activate -- when activated you see menu "Edit" highlighting on/off
  set myPath to ""
  open myPath
  tell application "System Events"
    tell process "Preview"
      tell menu bar 1
        click menu item "Select All" of menu "Edit"
        click menu item "Copy" of menu "Edit"
      end tell
    end tell
  end tell
  close document 1
end tell
-- give Preview some time, else the script may appear "unstable"
delay 7 -- (seconds) adjust to the speed of your machine
tell application "Livecode" to activate
--end field

[2] Make a button "Convert PDFs" with the following script
--begin script
-- the path to the folder where all your PDFs reside
local PDFfolder="/Users/admin/Downloads/precincts"

on mouseUp
  set defaultfolder to PDFfolder
  put the files into ff
  filter ff with "*.pdf"
  put field "AS" into aScript
  repeat for each line f in ff
    put aScript into fScript
    put PDFfolder & "/" & f into f0
    replace "//" with "/" in f0
    do fScript as applescript
    go this stack
    set itemdelimiter to "."
    put "txt" into last item of f0
    set itemdelimiter to comma
    put clipboardData["text"] into url ("file:" & f0)
    -- put f0 & cr before fld "jobsDone" -- for testing
  end repeat
end mouseUp
--end script




--
View this message in context: 
http://runtime-revolution.278305.n4.nabble.com/Parsing-a-PDF-file-tp4706466p4706577.html
Sent from the Revolution - User mailing list archive at Nabble.com.



Re: Parsing a PDF file

2016-07-11 Thread Jim Hurley
Kay Lan wrote:

In this particular case I found it much easier to open the PDF file in Adobe 
Acrobat and do a "Save as" with "Text (Accessible)".

Jim 


Re: Parsing a PDF file

2016-07-10 Thread Kay C Lan
On Mon, Jul 11, 2016 at 9:36 AM, Roger Eller
 wrote:
> Since this seems to be Mac only, why not "do as Applescript" then select
> all, and Copy?
>
Because Preview isn't properly scriptable and you can't "Select All"
or "Copy". As Richard said, the answer is with Automator.

If you open Automator and select a new 'application', then in the left
hand column you'll see "PDFs" as an option. If you click on that and
browse down the middle column you'll see 'Extract PDF Text', and if
you click on that, in its description you'll see that it can extract
Plain or Rich text.

So how can we get this to work with LC?

1) In Automator, drag the 'Extract PDF Text' action into the right
hand workspace window.
a) Choose the output type - most likely Plain Text
b) Select a folder to save to - for convenience we'll use "Desktop"
c) For the Output File Name you probably want to use a Custom Name -
pdf2text or whatever. You do not need to specify the suffix.
d) tick the Replace Existing files box.

2) Back in the left hand column where you clicked on the PDFs icon,
now click on the 'Files & Folders' icon (looks like the Finder icon).
From the middle column drag 'Ask for Finder Items' into the right hand
column, and place it above 'Extract PDF Text'.
a) Set 'Start at:' to a logical location, like Downloads, if that
is where your PDFs are likely to be located.
b) 'Type:' should be left at files, and do NOT tick the Allow Multiple
Selection box as these instructions are for a single file only.


3) From the middle column drag 'Open Finder Items' and place it
'between' the last two actions - so the order will be Ask for Finder
Items, Open Finder Items, Extract PDF Text.
a) Set Open with: to Preview.

4) Optionally, if you don't always have Preview open and you don't
want to be left with the PDF file open, in the left hand column click
Utility, and from the middle column drag 'Quit Application' to the end
of your workflow.
a) set it to "Preview.app"

You can now test this by clicking the Run button in the top right
corner. What should happen is you should get a standard Open File
dialog box to point to a file, you then select a file and shortly
thereafter the Automator log window at the bottom should have all
green ticks.

You should then be able to navigate to the Desktop folder and the file
'pdf2text.txt' should be there.

To complete the LC integration process, save your Automator
workflow and call it something like pdf2text. For this example we'll
also save it to Desktop.

Then in your LC script:

on mouseUp
   set the defaultFolder to specialFolderPath("desktop")
   launch "pdf2text.app"
   --if file is large, consider a wait 1 or more here.
   put textDecode(URL "file:/Users/yourname/Desktop/pdf2text.txt","utf8") into tNotPDF
   --do what you have to after this

   --your Automator app will auto Quit once it's done its thing so
   --there is no need to balance the 'launch' command with a 'kill' command
end mouseUp
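
As a variation on the fixed "wait" suggested in the script comment, the handler could poll for the output file instead. This is only a sketch under a few assumptions: the Automator app and its output both live on the Desktop under the names used in this message, the 120 second limit is arbitrary, and the file may still be partly written when it first appears. "binfile:" is used here so that textDecode receives the raw bytes.

```livecode
on mouseUp
   local tDesktop, tOut, tStart, tNotPDF
   put specialFolderPath("desktop") into tDesktop
   put tDesktop & "/pdf2text.txt" into tOut
   -- remove stale output so a new file means the workflow has run
   if there is a file tOut then delete file tOut
   launch (tDesktop & "/pdf2text.app")
   put the seconds into tStart
   repeat until there is a file tOut
      -- arbitrary limit; the user also needs time to pick a file
      if the seconds - tStart > 120 then
         answer "Timed out waiting for" && tOut
         exit mouseUp
      end if
      wait 500 milliseconds with messages
   end repeat
   -- read raw bytes, then decode them as UTF-8
   put textDecode(URL ("binfile:" & tOut), "utf8") into tNotPDF
   --do what you have to after this
end mouseUp
```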

It should be noted that Automator's Extract PDF Text typically does a
better job of text extraction than manually Select All + Copy + Paste.

Unfortunately I consider both these options about 30% or less accurate
than using my old PPC G5 running Leopard and Devon Technologies' old
PDF2RTFService. I had not previously offered a solution to the OP
because "get a PPC Mac, install Leopard and PDF2TEXTService" is only
really an option if you are handling many large, complex formatted
PDFs day in, day out, as I am. Jim's problem sounds like a one-off.



Re: Parsing a PDF file

2016-07-10 Thread Roger Eller
Since this seems to be Mac only, why not "do as Applescript" then select
all, and Copy?

~Roger


Re: Parsing a PDF file

2016-07-10 Thread Jim Hurley
hh wrote:

> 
> [Description for MacOS, works on Win/Linux similar.]
> 
> The best results for extracting tables from PDF I had with the free "RAW"
> method:
> 
> = Open the file with Preview.
> = Select All (menu Edit). Copy.
> = Go to a LC stack with a field "INCOMING"
> = Use by a button or the message box the line
>put clipboardData["Text"] into fld "INCOMING"
> 
> If you use simply "paste" you get (probably unwanted) styles with your text.
> (If you have a lot of files: Preview is scriptable.)
> 

Thanks, that works well. 

I thought I would try to program your method  within LC:

on mouseUp
   get url "https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/cumulativereport.pdf"
   set the clipboarddata["text"] to it
   put the clipboarddata["text"] into field 1
end mouseUp

But, no dice. I'm guessing there is something about "Copy" in Preview that is 
missing in LC.

Same applies to opening the file in Adobe and doing a "save as" and choosing 
"text (accessible)"

Be nice if LC could beef-up "get url" to do whatever it is that  Preview and 
Adobe do.
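
A note on why the handler above comes back with unreadable content: "get url" returns the PDF file's raw bytes, in which the text is normally stored compressed and laid out for display, so it still has to pass through a converter. A minimal sketch of the download step alone, assuming a handler name (savePDF) and parameters made up for this example:

```livecode
-- Sketch: save the downloaded PDF to disk so a converter can work on it.
on savePDF pURL, pDest
   local tData
   put URL pURL into tData
   if the result is not empty then
      answer "Download failed:" && the result
      exit savePDF
   end if
   -- a PDF file begins with the bytes "%PDF"
   if byte 1 to 4 of tData is not "%PDF" then
      answer "That does not look like a PDF."
      exit savePDF
   end if
   -- binfile: writes the bytes unchanged
   put tData into URL ("binfile:" & pDest)
end savePDF
```

The saved file can then be opened in Preview or Acrobat, or handed to one of the conversion routes discussed in this thread.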

Thanks again,

Jim






Re: Parsing a PDF file

2016-07-10 Thread Jim Hurley
Mark Wieder wrote:
> 
> On 07/09/2016 08:54 AM, Richard Gaskin wrote:
> 
>> Across the US we're beginning to see a revolution in government data
>> sharing.
> 
> 
> Except, of course, when it comes to actual data.

Hi Mark. See my reply to Richard. The actual data I wanted was how the election 
turned out. 
The essential reply was: Sorry, we have limited access to our data.

It is interesting that the County’s IT department is working overtime providing 
county parcel data, even though there has been no pressure to do so.
I think it is because there are programmers there, and they love to program. A 
trait not uncommon among us LC folks.

Jim Hurley



Re: Parsing a PDF file

2016-07-10 Thread Jim Hurley
Richard Gaskin wrote:
> 
> Jim Hurley wrote:
> 
>> Thanks Richard.
>> 
>> You are so right about releasing data in complex formats.
>> I spoke to the elections office about posting election results in PDF
>> format.
>> I knew there was no use fighting them when they told me that it was
>> now County "policy" to post everything in PDF--not unlike those 10
>> policies of renown that were carved in stone--and a metaphor was born.
> 
> Unfortunate, as it renders the data nearly useless.  I agree you need to 
> pick your battles, but it's dismaying in an ostensible democracy when 
> the process of open data for civic-minded citizens is implemented in 
> ways that ultimately deliver the opposite of the intended goal.

Part of the problem in rural areas, such as the county I retired to, is budget.
The Board Of Supervisors is ruled by budget considerations. They see it as the 
central issue in their reelection.
They have cut back the budget for the elections office. That is not a good 
place to economize.
There have been numerous screw ups recently. I get the feeling that the staff 
lives in constant terror of messing up.
I served as the database manager for the current head of the department in his 
last election (it is an elected office—don’t ask) and I’m confident he had no 
idea what I did in that capacity. 
> 
> Across the US we're beginning to see a revolution in government data 
> sharing.  At the municipal level one of the shining examples has been 
> Raleigh, NC, in no small part due to the work of Jason Hibbets.  He 
> works as the Community Manager for Red Hat, and has devoted significant 
> volunteer time working with city officials to make data available so 
> local devs can deliver apps for the community.
> 
> Notes on his work and a link to his excellent book, "The Foundation for 
> an Open Source City" (I got a signed copy when I met him at the SoCal 
> Linux Expo a couple years ago) is here:
> http://theopensourcecity.com/
> 
> The slides from the SCaLE talk where I met him are linked to from this 
> page outlining his presentation:
> http://www.socallinuxexpo.org/scale12x/presentations/open-source-all-cities.html
> 
> 
>> In the County's old system, each of the 50 election precincts were
>> stored in 50 web pages as HTML documents.
>> That was perfect for LiveCode's "get url". It was a matter of seconds
>> to visit all 50 pages, parse the text, and store the data.
> 
> So much for progress. ;)
> 
> Too often we see Cargo Cult thinking in data management, where folks 
> start using a tool or a format only because they hear about it from others, 
> but since they don't actually use the system they're delivering they 
> never come to understand what's useful and what's an impediment.
> 
> 
>> (The other two text options in Adobe are "Rich Text Format" and "Text
>> (Plain)", neither of which works--only "Text (Accessible)"
> 
> What is "Text (Accessible)”?

I don’t know. It apparently is neither RTF nor “Text—plain” .

I tried to save the PDF file as “Text—plain”  and got this response:

Acrobat was unable to make this document accessible because of the 
following error:
Bad PDF; could not read page structure.  [7]
Please note that some pages of this document may have been changed. Because 
of this failure, you are advised to not save these changes.

In the “Text (Accessible)” format there seems to be an implied criticism of the 
 “Text—plain” format. Text (Accessible) really is accessible, the others, not 
so much.
Apparently you can save it as plain text, it's just not accessible. I love tech 
jargon. 

> 
>> I was unaware of Apple's Automator. I'll look into it--but it is
>> unnecessary for this project.
> 
> Warning:  Automator is a lot of fun, and may be addictive.  Be careful 
> playing with it, since you may find yourself experimenting with all 
> sorts of things and before you know it your Saturday is completely gone. :)

Fair warning. Thanks.
Jim Hurley



Re: Parsing a PDF file

2016-07-09 Thread Mark Wieder

On 07/09/2016 08:54 AM, Richard Gaskin wrote:

> Across the US we're beginning to see a revolution in government data
> sharing.

Except, of course, when it comes to actual data.

Many of the laws we as citizens of the US are required to follow are not 
available for us to read without paying a fee. Carl Malamud, Public 
Resource, et al. are working to place the legal system in the public 
domain, but the legal establishment fights back. If you want access to 
the proceedings of U.S. federal courts, PACER is behind a paywall.



The state of Georgia (in the US, not the more reasonable country) makes 
the claim of copyright in its lawsuit against publishing its laws: "Each 
of these annotations is an original and creative work of authorship that 
is protected by copyrights owned by the State of Georgia."






--
 Mark Wieder
 ahsoftw...@gmail.com



Re: Parsing a PDF file

2016-07-09 Thread Richard Gaskin

Jim Hurley wrote:

> Thanks Richard.
>
> You are so right about releasing data in complex formats.
> I spoke to the elections office about posting election results in PDF
> format.
> I knew there was no use fighting them when they told me that it was
> now County "policy" to post everything in PDF--not unlike those 10
> policies of renown that were carved in stone--and a metaphor was born.

Unfortunate, as it renders the data nearly useless.  I agree you need to 
pick your battles, but it's dismaying in an ostensible democracy when 
the process of open data for civic-minded citizens is implemented in 
ways that ultimately deliver the opposite of the intended goal.


Across the US we're beginning to see a revolution in government data 
sharing.  At the municipal level one of the shining examples has been 
Raleigh, NC, in no small part due to the work of Jason Hibbets.  He 
works as the Community Manager for Red Hat, and has devoted significant 
volunteer time working with city officials to make data available so 
local devs can deliver apps for the community.


Notes on his work and a link to his excellent book, "The Foundation for 
an Open Source City" (I got a signed copy when I met him at the SoCal 
Linux Expo a couple years ago) is here:

http://theopensourcecity.com/

The slides from the SCaLE talk where I met him are linked to from this 
page outlining his presentation:

http://www.socallinuxexpo.org/scale12x/presentations/open-source-all-cities.html


> In the County's old system, each of the 50 election precincts were
> stored in 50 web pages as HTML documents.
> That was perfect for LiveCode's "get url". It was a matter of seconds
> to visit all 50 pages, parse the text, and store the data.

So much for progress. ;)

Too often we see Cargo Cult thinking in data management, where folks 
start using a tool or a format only because they hear about it from others, 
but since they don't actually use the system they're delivering they 
never come to understand what's useful and what's an impediment.



> (The other two text options in Adobe are "Rich Text Format" and "Text
> (Plain)", neither of which works--only "Text (Accessible)"

What is "Text (Accessible)"?


> I was unaware of Apple's Automator. I'll look into it--but it is
> unnecessary for this project.

Warning:  Automator is a lot of fun, and may be addictive.  Be careful 
playing with it, since you may find yourself experimenting with all 
sorts of things and before you know it your Saturday is completely gone. :)


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com        http://www.FourthWorld.com




Re: Parsing a PDF file

2016-07-09 Thread Jim Hurley
Thanks Richard. 

You are so right about releasing data in complex formats.
I spoke to the elections office about posting election results in PDF format.
I knew there was no use fighting them when they told me that it was now County 
"policy" to post everything in PDF--not unlike those 10 policies of renown that 
were carved in stone--and a metaphor was born.

In the County's old system, each of the 50 election precincts was stored in its 
own web page as an HTML document.
That was perfect for LiveCode's "get url". It was a matter of seconds to visit 
all 50 pages, parse the text, and store the data.

Thankfully this new PDF web page has all the data for all 50 precincts on the 
one page.
If I save the page to a PDF file, open that file in Adobe Acrobat, and save it 
as "Text (Accessible)", as you suggested, I get a readable text file for LC to 
work its magic on.

(The other two text options in Adobe are "Rich Text Format" and "Text (Plain)", 
neither of which works--only "Text (Accessible)" does.)

I was unaware of Apple's Automator. I'll look into it--but it is unnecessary 
for this project.

Thanks again,

Jim Hurley


> Message: 9
> Date: Fri, 8 Jul 2016 08:44:50 -0700
> From: Richard Gaskin <ambassa...@fourthworld.com>
> To: use-livecode@lists.runrev.com
> Subject: Re: Parsing a PDF file
> Message-ID: <577fca72.2040...@fourthworld.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
> 
> Jim Hurley wrote:
> 
>> My County is now publishing the election results to the web as a PDF
>> file:
>> 
>> 
> https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf
>> 
>> Is there a way to parse these PDF  files?
> 
> It's unfortunate that so many orgs release data useful to analysis in 
> complex formats that inhibit such use.  PDF is great when the goal is to 
> preserve page layout, but a uniquely poor choice for sharing data to be 
> used for analytics.  Alas, that hasn't slowed its unfortunate use in 
> such contexts.
> 
> If this is to be done within an application for others to use, perhaps 
> the smoothest user experience would be via the XPDF external, currently 
> available only in LiveCode Business Edition at $1999/yr.  While that may 
> seem high, for commercial products of such scope it may be a good bargain.
> 
> However, if this is only for use in tools you'll be using yourself, 
> where an extra step or two is less important, there are many options.
> 
> If it's just one file, perhaps the simplest is to use Save As Text from 
> Adobe's PDF Viewer.
> 
> If you'll need to automate this for reuse, here's a way to use Apple's 
> Automator for that:
> <https://www.engadget.com/2013/02/11/mac-101-use-automater-to-extract-text-from-pdfs/>
> 
> I believe there may also be a command line option available on macOS, 
> which could be called from within LC using the shell function.  I don't 
> know the name of the command line tool for that on macOS, but in Linux I 
> use pdftotext, where the syntax is pretty simple:
> 
>   pdftotext <PDF-file> <text-file>
> 
> e.g.:
> 
>   put "/Users/me/folder/SomeFile.pdf" into tSrc
>   put "/Users/me/folder/SomeFile.txt" into tDest
>   get shell("pdftotext "& tSrc && tDest)
> 
> -- 
>  Richard Gaskin
>  Fourth World Systems
>  Software Design and Development for the Desktop, Mobile, and the Web
>  
>  ambassa...@fourthworld.com        http://www.FourthWorld.com
> 
> 
> 
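
The pdftotext call quoted above handles one file and would trip over a path containing spaces. A possible batch version is sketched below, assuming pdftotext is installed and on the shell's PATH. The handler name convertFolder is made up for this example, and the -layout option (which asks pdftotext to keep the physical page layout) is optional.

```livecode
-- Sketch: run pdftotext over every PDF in one folder.
on convertFolder pFolder
   local tOldFolder, tFiles, tFile, tSrc, tDest, tOutput
   put the defaultFolder into tOldFolder
   set the defaultFolder to pFolder
   put the files into tFiles
   set the defaultFolder to tOldFolder
   filter tFiles with "*.pdf"
   set the itemDelimiter to "."
   repeat for each line tFile in tFiles
      put pFolder & "/" & tFile into tSrc
      put tSrc into tDest
      put "txt" into the last item of tDest
      -- single quotes keep paths with spaces intact;
      -- a path that itself contains a single quote would still break
      put shell("pdftotext -layout '" & tSrc & "' '" & tDest & "'") into tOutput
      -- report any file that produced no text output
      if there is not a file tDest then
         put tFile & ":" && tOutput & cr after msg
      end if
   end repeat
end convertFolder
```

Usage, with the folder name from another message in this thread: convertFolder "/Users/admin/Downloads/precincts"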




Re: Parsing a PDF file

2016-07-08 Thread Mark Rauterkus
Hi,

OT Tip: Go to the source of the data, the election board. Tell them you
want the raw data made available PLUS the PDF.

That is not an unreasonable request. Open Government advocates / folks
would support that, IMHO.



--
Ta.


Mark Rauterkus   m...@rauterkus.com


Re: Parsing a PDF file

2016-07-08 Thread Richard Gaskin

Dar Scott wrote:

>> On Jul 8, 2016, at 9:44 AM, Richard Gaskin wrote:
>> It's unfortunate that so many orgs release data useful to analysis
>> in complex formats that inhibit such use.
...
> To make it worse, documents for human consumption are claimed to be
> the same when underneath there are big changes.  Tables are moved
> around, rotated, have zeros converted to blanks, have commas added
> and so on.
>
> You know that party bosses get files in useful forms.  I'd contact
> the right people in the state government and get the right files.

Amen, brother Dar!

For all the people who pass around PDFs, when you ask them where it came 
from they just look at you with that fluoride stare.  But PDF isn't an 
authoring format, it's a delivery format - everything in that format 
began life in something more malleable.



> One thing that has worked for me for onetime analysis is trying
> different file name extensions in downloading.  The right file might
> be there.

Good thought.  Unfortunately with the URL Jim provided both .txt or .csv 
produce merely 404s.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com        http://www.FourthWorld.com




Re: Parsing a PDF file

2016-07-08 Thread Dar Scott

> On Jul 8, 2016, at 9:44 AM, Richard Gaskin  wrote:
> 
> > My County is now publishing the election results to the web as a PDF
> > file:
> >
> > https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf
> >  
> > 
> >
> > Is there a way to parse these PDF  files?
> 
> It's unfortunate that so many orgs release data useful to analysis in complex 
> formats that inhibit such use.  PDF is great when the goal is to preserve 
> page layout, but a uniquely poor choice for sharing data to be used for 
> analytics.  Alas, that hasn't slowed its unfortunate use in such contexts.

To make it worse, documents for human consumption are claimed to be the same 
when underneath there are big changes.  Tables are moved around, rotated, have 
zeros converted to blanks, have commas added and so on.

You know that party bosses get files in useful forms.  I'd contact the right 
people in the state government and get the right files.  

One thing that has worked for me for one-time analysis is trying different file 
name extensions when downloading.  The right file might be there.  

Dar




Re: Parsing a PDF file

2016-07-08 Thread [-hh]
[Description for macOS; it works similarly on Win/Linux.]

I had the best results extracting tables from PDFs with the free "RAW"
method:

= Open the file with Preview.
= Select All (menu Edit). Copy.
= Go to a LC stack with a field "INCOMING".
= From a button or the message box, run the line
put clipboardData["Text"] into fld "INCOMING"

If you simply use "paste" you get (probably unwanted) styles with your text.
(If you have a lot of files: Preview is scriptable.)






Re: Parsing a PDF file

2016-07-08 Thread Richard Gaskin

Paul Dupuis wrote:

> In truth a NEW portable document format needs to be invented that
> connects and preserves content to its appearance, but I suspect that
> people who want to keep both intact and portable are just using HTML5
> and CSS3.

CSS is a wonderful solution.

Being prone to idealism, I like to believe (admittedly in the absence of 
all current evidence) that we're only a few years away from the nearly 
complete abandonment of PDF as a popular format for everything except 
perhaps the subset of documents that truly must remain in a form that 
emulates yesteryear's printed page, an ever-shrinking use-case.


For everything else, content in HTML with formatting in CSS is a 
wonderful option.


Better still might be for the world to adopt LiveCode stacks as a 
universal document format.  Far more programmer-accessible and 
feature-rich than PDF, and fully open source.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com    http://www.FourthWorld.com




Re: Parsing a PDF file

2016-07-08 Thread Mike Bonner
It's ugly, but could you use pdf.js to extract the text in a browser widget
showing the PDF?  http://git.macropus.org/2011/11/pdftotext/example/

Not sure what else is in pdf.js but it looks interesting.


Re: Parsing a PDF file

2016-07-08 Thread Mike Bonner
Might read this one too:
http://stackoverflow.com/questions/1554280/extract-text-from-pdf-in-javascript


Re: Parsing a PDF file

2016-07-08 Thread Paul Dupuis
On 7/8/2016 11:55 AM, Colin Holgate wrote:
> I was trying an export as spreadsheet from Acrobat Pro, but that didn’t work. 
> Doing a Save as Text from Acrobat Reader was more successful, but the columns 
> come out in a different order, and some columns get combined into a single 
> string.

Over the last few years, I have spent a ridiculous amount of time exploring
PDF access via LiveCode in every way possible. Ultimately, for our needs
we created the XPDF external and transferred it to LiveCode, but we also
explored JavaScript extraction from a browser, interapplication
communication, shell command-line tools, etc.

The reality is the PDF format is great for visually representing a
printed page and totally sucks for text content - that is, actually
getting the characters of the document rather than an image of the
characters.

There is NO real mapping of characters to their appearance in the PDF
other than geometric position on the page. You get no font information,
no size, no styles, zip. You get line breaks at the end of every visible
line and you can get line breaks in what appears to be the middle of
content depending upon how the original source document was rendered
into a PDF. Headers and footers end up in the middle of paragraphs. You
have no real way to tell a line break from a paragraph break, and more.

In truth a NEW portable document format needs to be invented that
connects and preserves content to its appearance, but I suspect that
people who want to keep both intact and portable are just using HTML5
and CSS3.



Re: Parsing a PDF file

2016-07-08 Thread Colin Holgate
I was trying an export as spreadsheet from Acrobat Pro, but that didn’t work. 
Doing a Save as Text from Acrobat Reader was more successful, but the columns 
come out in a different order, and some columns get combined into a single 
string.



Re: Parsing a PDF file

2016-07-08 Thread Richard Gaskin

Jim Hurley wrote:

> My County is now publishing the election results to the web as a PDF
> file:
>
> https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf
>
> Is there a way to parse these PDF  files?

It's unfortunate that so many orgs release data useful to analysis in 
complex formats that inhibit such use.  PDF is great when the goal is to 
preserve page layout, but a uniquely poor choice for sharing data to be 
used for analytics.  Alas, that hasn't slowed its unfortunate use in 
such contexts.


If this is to be done within an application for others to use, perhaps 
the smoothest user experience would be via the XPDF external, currently 
available only in LiveCode Business Edition at $1999/yr.  While that may 
seem high, for commercial products of such scope it may be a good bargain.


However, if this is only for use in tools you'll be using yourself, 
where an extra step or two is less important, there are many options.


If it's just one file, perhaps the simplest is to use Save As Text from 
Adobe's PDF Viewer.


If you'll need to automate this for reuse, here's a way to use Apple's 
Automator for that:



I believe there may also be a command line option available on macOS, 
which could be called from within LC using the shell function.  I don't 
know the name of the command line tool for that on macOS, but in Linux I 
use pdftotext, where the syntax is pretty simple:


  pdftotext <PDF-file> <text-file>

e.g.:

  put "/Users/me/folder/SomeFile.pdf" into tSrc
  put "/Users/me/folder/SomeFile.txt" into tDest
  get shell("pdftotext "& tSrc && tDest)
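
A sketch of the same call with both paths quoted, so that folder or file names containing spaces do not split the arguments. The handler name convertPdfToText and its parameters are made up for illustration, and it assumes pdftotext is installed and reachable from the shell:

```livecode
-- Hypothetical helper: runs pdftotext on pSrc and writes the text to pDest.
-- Returns empty on success, or an error string on failure.
command convertPdfToText pSrc, pDest
   local tCmd, tOutput, tExit
   -- Wrap each path in double quotes so spaces in names survive
   put "pdftotext" && quote & pSrc & quote && quote & pDest & quote into tCmd
   put shell(tCmd) into tOutput
   -- After shell(), the result should hold the exit status of the command
   -- (worth verifying against the dictionary for your LiveCode version)
   put the result into tExit
   if tExit is not empty and tExit is not 0 then
      return "pdftotext failed:" && tOutput
   end if
   return empty
end convertPdfToText
```

It could be called as convertPdfToText tSrc, tDest, checking "the result" afterwards for an error string.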

--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com    http://www.FourthWorld.com




Re: Parsing a PDF file

2016-07-08 Thread Dan Brown
> Doesn’t that just let you show PDFs? Would it help to parse the contents?

Using the "pdftotext" component of Xpdf, you can use a shell command to
extract the text from a PDF and place it into, for example, a text file,
which you can then parse.

something like..
___
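-- Build the command line; quote both paths in case they contain spaces
put "C:\pdftotext -layout" && quote & pPDF & quote && quote & pTEXTFILE & quote into tCommand

-- Run it; anything pdftotext prints (including errors) lands in tOutput
put shell(tCommand) into tOutput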
---

pPDF is the location and filename of the PDF file you want to process,
pTEXTFILE is the location and name of the text file you want to create





Re: Parsing a PDF file

2016-07-08 Thread Colin Holgate
Doesn’t that just let you show PDFs? Would it help to parse the contents?

I’m trying another approach, will report back soon.



Re: Parsing a PDF file

2016-07-08 Thread Paul Dupuis
On 7/8/2016 10:11 AM, Jim Hurley wrote:
> My County is now publishing the election results to the web as a PDF file: 
>
>   
> https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf
>
> Is there a way to parse these PDF  files? 
>

The XPDF external will let you extract the text (and images, if there
were any). You will need to write code to parse the fields (separated by
white space) from each line if you want the numbers for analysis.
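
A minimal sketch of that parsing step, assuming the extracted text is already in a variable. The function name linesToTabbedRows is hypothetical. It relies on LiveCode's "word" chunk, which splits on runs of white space, so multi-word values such as candidate names would land in separate columns and need extra handling:

```livecode
-- Hypothetical helper: turns whitespace-separated lines into
-- tab-delimited rows, skipping blank lines.
function linesToTabbedRows pText
   local tRows, tRow, tLine, tWord
   repeat for each line tLine in pText
      put empty into tRow
      -- "word" chunks are delimited by any run of spaces or tabs
      repeat for each word tWord in tLine
         put tWord & tab after tRow
      end repeat
      if tRow is not empty then
         delete the last char of tRow -- drop the trailing tab
         put tRow & return after tRows
      end if
   end repeat
   if tRows is not empty then delete the last char of tRows
   return tRows
end linesToTabbedRows
```

With the itemDelimiter set to tab, each column of a row can then be read as an item, for example item 3 of line 10 of the returned text.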



Re: Parsing a PDF file

2016-07-08 Thread Peter TB Brett

On 08/07/2016 15:11, Jim Hurley wrote:

> My County is now publishing the election results to the web as a PDF file:
>
> https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf
>
> Is there a way to parse these PDF  files?


The XPDF external, which is now available as part of LiveCode Business, 
is probably what you're looking for.


Peter

--
Dr Peter Brett 
LiveCode Technical Project Manager

LiveCode 2016 Conference: https://livecode.com/edinburgh-2016/



Parsing a PDF file

2016-07-08 Thread Jim Hurley
My County is now publishing the election results to the web as a PDF file: 


https://www.mynevadacounty.com/nc/elections/docs/2016%20Elections/June%207%2c%202016%2c%20Presidential%20Primary/Election%20Results/precinctreport.pdf

Is there a way to parse these PDF  files? 

Thanks, Jim