Re: reading text in pdf, some working sample code

2017-11-21 Thread dieter
Daniel Gross  writes:
> I am new to python and jumped right into trying to read out (english) text
> from PDF files.
>
> I tried various libraries (including slate)

You could give "pdfminer" a try.

Note, however, that it may not be possible to extract the text:
PDF is a generic format which works by mapping character codes to glyphs
(i.e. visual symbols); if your PDF uses a special map for this
(especially with non-standard glyph collections (aka "fonts")),
then the text extraction (which in fact extracts sequences
of character codes) can give unusable results.
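
To illustrate the point about character codes, here is a toy sketch in plain
Python. It does not parse a real PDF: the two "fonts" are made-up dicts from
character code to glyph, and the names (standard_font, custom_font,
naive_extract) are hypothetical. It only shows why a page can look fine while
the extracted codes are unusable.

```python
# Toy model only: a "font" here is just a dict from character code to glyph.
# Real PDF fonts and encodings are far more involved.

# A font whose codes happen to match ASCII code points.
standard_font = {code: chr(code) for code in range(32, 127)}

# A hypothetical subsetted font: codes are handed out in order of first use,
# so code 1 is "H", code 2 is "e", and so on.
text = "Hello"
custom_font = {}
for ch in text:
    if ch not in custom_font.values():
        custom_font[len(custom_font) + 1] = ch


def encode(s, font):
    """Return the character codes a PDF writer would store for s."""
    reverse = {glyph: code for code, glyph in font.items()}
    return [reverse[ch] for ch in s]


def render(codes, font):
    """What a viewer displays: each code looked up in the embedded font."""
    return "".join(font[c] for c in codes)


def naive_extract(codes):
    """What a simple extractor does: treat each code as a code point."""
    return "".join(chr(c) for c in codes)


standard_codes = encode(text, standard_font)
custom_codes = encode(text, custom_font)

print(render(custom_codes, custom_font))    # the page still looks right
print(repr(naive_extract(standard_codes)))  # extraction works
print(repr(naive_extract(custom_codes)))    # extraction yields control characters
```

Unless the PDF also carries a mapping back to Unicode for the custom font, an
extractor has nothing better than the raw codes to work with.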

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to Generate dynamic HTML Report using Python

2017-11-21 Thread Chris Angelico
On Wed, Nov 22, 2017 at 4:10 PM, Gregory Ewing wrote:
> Michael Torrie wrote:
>>
>> You also have this header set:
>>
>>> X-Copyright: (C) Copyright 2017 Stefan Ram. All rights reserved.
>>> ... It is forbidden to change
>>> URIs of this article into links...
>
>
> What is "changing a URI into a link" meant to mean? Does it
> include automatically displaying something that looks like
> a URI as a clickable element, as many news and mail clients
> do nowadays? If so, a lot of people will be inadvertently
> violating this copyright condition, including me.
>
> It looks like I'm going to have to filter Mr. Ram's posts
> out of my usenet feed as well, lest I accidentally show one
> of his URIs as a link on my screen.
>

Or, just ignore his copyright altogether, and let him prove its
defensibility in court. Can you actually enforce that EVERY usenet
server carry out your wishes? What if one of them strips off the
(non-standard) X-Copyright header and carries the message further? I
would hope that a server admin is not liable in court years down the
track for setting something up and leaving it to its own devices.

Anyone can write anything. Good luck actually making it mean anything.

Of course, getting all your posts plonked is the safest way to comply
with copyright, so I think that's what's going to happen...

ChrisA


Re: "help( pi )"

2017-11-21 Thread Chris Angelico
On Wed, Nov 22, 2017 at 4:47 PM, Gregory Ewing wrote:
> Cameron Simpson wrote:
>>
>> one could change implementations such that applying a docstring to an
>> object _removed_ it from the magic-shared-singleton pool,
>
>
> That's not sufficient, though. Consider:
>
>    BUFFER_SIZE = 256
>    BUFFER_SIZE.__doc__ = "Size of the buffer"
>
>    TWO_TO_THE_EIGHT = 256
>    TWO_TO_THE_EIGHT.__doc__ = "My favourite power of two"
>
> Before the code is even run, the compiler may have merged the
> two occurrences of the integer literal 256 into one entry in
> co_consts. By the time the docstrings are assigned, it's too
> late to decide that they really needed to be different objects.
>
> So, an int with a docstring needs to be explicitly created as
> a separate object to begin with, one way or another.

class Int(int):
    def __new__(cls, *a, **kw):
        __doc__ = kw.pop("doc", None)
        self = super().__new__(cls, *a, **kw)
        self.__doc__ = __doc__
        return self

BUFFER_SIZE = Int(256, doc="Size of the buffer")
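
As a quick check of the subclass approach, the sketch below repeats the class
(with slightly more conventional local names, which are my own choice) and
creates two constants with the same value. The second constant and the prints
are added for illustration only.

```python
class Int(int):
    """An int subclass whose instances can carry their own docstring."""

    def __new__(cls, *args, **kwargs):
        doc = kwargs.pop("doc", None)
        self = super().__new__(cls, *args, **kwargs)
        self.__doc__ = doc
        return self


BUFFER_SIZE = Int(256, doc="Size of the buffer")
TWO_TO_THE_EIGHT = Int(256, doc="My favourite power of two")

print(BUFFER_SIZE == TWO_TO_THE_EIGHT)  # equal as numbers
print(BUFFER_SIZE is TWO_TO_THE_EIGHT)  # but separate objects
print(BUFFER_SIZE.__doc__)
print(type(BUFFER_SIZE + 1))            # arithmetic gives back a plain int
```

Note that the docstring does not survive arithmetic: the result of
BUFFER_SIZE + 1 is an ordinary int again.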

ChrisA


Re: "help( pi )"

2017-11-21 Thread Gregory Ewing

Cameron Simpson wrote:
> one could change implementations such that applying a docstring to an
> object _removed_ it from the magic-shared-singleton pool,


That's not sufficient, though. Consider:

   BUFFER_SIZE = 256
   BUFFER_SIZE.__doc__ = "Size of the buffer"

   TWO_TO_THE_EIGHT = 256
   TWO_TO_THE_EIGHT.__doc__ = "My favourite power of two"

Before the code is even run, the compiler may have merged the
two occurrences of the integer literal 256 into one entry in
co_consts. By the time the docstrings are assigned, it's too
late to decide that they really needed to be different objects.

So, an int with a docstring needs to be explicitly created as
a separate object to begin with, one way or another.
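
The merging can be observed directly. This is a small sketch for CPython; other
implementations may store constants differently. It also shows that, as things
stand today, a plain int refuses the __doc__ assignment altogether, so the
example above is about a hypothetical change to the language.

```python
source = (
    "BUFFER_SIZE = 256\n"
    "TWO_TO_THE_EIGHT = 256\n"
)
code = compile(source, "<example>", "exec")

# Both assignments load the same constant, so 256 is stored only once.
print(code.co_consts)
occurrences = code.co_consts.count(256)
print(occurrences)

# A plain int currently rejects the assignment outright.
error = None
try:
    exec("BUFFER_SIZE = 256\nBUFFER_SIZE.__doc__ = 'Size of the buffer'", {})
except AttributeError as exc:
    error = exc
print(type(error).__name__, error)
```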

--
Greg


Re: How to Generate dynamic HTML Report using Python

2017-11-21 Thread Gregory Ewing

Michael Torrie wrote:
> You also have this header set:
>
>> X-Copyright: (C) Copyright 2017 Stefan Ram. All rights reserved.
>> ... It is forbidden to change
>> URIs of this article into links...


What is "changing a URI into a link" meant to mean? Does it
include automatically displaying something that looks like
a URI as a clickable element, as many news and mail clients
do nowadays? If so, a lot of people will be inadvertently
violating this copyright condition, including me.

It looks like I'm going to have to filter Mr. Ram's posts
out of my usenet feed as well, lest I accidentally show one
of his URIs as a link on my screen.

--
Greg


Re: reading text in pdf, some working sample code

2017-11-21 Thread Paul Moore
I haven't tried it, but a quick Google search found PyPDF2 -
https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python

You don't give much detail about what you tried and how it failed, so
if the above doesn't work for you, I'd suggest providing more detail
as to what your problem is.

Paul

On 21 November 2017 at 15:18, Daniel Gross  wrote:
> Hi,
>
> I am new to python and jumped right into trying to read out (english) text
> from PDF files.
>
> I tried various libraries (including slate) out there but am running into
> diverse problems, such as with encoding or buffer too small errors -- deep
> inside some decompression code.
>
> Essentially, i want to extract all text and then do some natural language
> processing on the text. Is there some sample code available that works
> together with a clear description of the expected python installation
> environment needed?
>
> In slate btw, i got the buffer error, it seems i must "guess" the right
> encoding of the text included in the PDF when opening the file. Still
> trying to figure out how to get the encoding info out of the PDF ... (if
> available there)
>
> thank you,
>
> Daniel


Re: __hash__ and ordered vs. unordered collections

2017-11-21 Thread Josh B.
On Monday, November 20, 2017 at 3:17:49 PM UTC-5, Chris Angelico wrote:
> Neither is perfect. You have to take your pick between them.

Right on, thanks for weighing in, Chris. Your responses have been very helpful.

I wouldn't feel comfortable claiming the authority to make this call alone. But 
fortunately I reached out to Raymond Hettinger and am delighted to have his 
guidance, pasted below. Great to have this predicament resolved.

In case of interest, I've implemented Raymond's advice in the latest release of 
bidict, the bidirectional map library I authored.
Feedback always welcome.

Thanks,
Josh

-- Forwarded message --
From: Raymond Hettinger 
Date: Mon, Nov 20, 2017 at 4:46 PM
Subject: Re: __hash__ and ordered vs. unordered collections
To: j...@math.brown.edu


If you want to make ordered and unordered collections interoperable, I would 
just let equality be unordered all the time.  You can always provide a separate 
method for an ordered_comparison.

IMO, the design for __eq__ in collections.OrderedDict was a mistake.  It 
violates the Liskov Substitution Principle which would let ordered dicts always 
be substituted wherever regular dicts were expected.  It is simpler to have 
comparisons be unordered everywhere.

But there are no perfect solutions.  A user of an ordered collection may 
rightfully expect an ordered comparison, while a user of both collections may 
rightfully expect them to be mutually substitutable.
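
A short sketch of the behaviour under discussion and of the suggested
alternative. The subclass name UnorderedEqOrderedDict and its equals_ordered()
method are hypothetical, made up for this example; this is not bidict's actual
implementation.

```python
from collections import OrderedDict

a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("y", 2), ("x", 1)])

print(a == b)        # OrderedDict vs OrderedDict: order-sensitive
print(a == dict(b))  # OrderedDict vs dict: order-insensitive


class UnorderedEqOrderedDict(OrderedDict):
    """Hypothetical subclass: == ignores order, equals_ordered() does not."""

    def __eq__(self, other):
        # Fall back to the plain dict comparison, which ignores order.
        return dict.__eq__(self, other)

    def __ne__(self, other):
        return not self == other

    def equals_ordered(self, other):
        """Separate, explicit method for the order-sensitive comparison."""
        return self == other and list(self.items()) == list(other.items())


c = UnorderedEqOrderedDict([("x", 1), ("y", 2)])
d = UnorderedEqOrderedDict([("y", 2), ("x", 1)])

print(c == d)               # equal regardless of order
print(c.equals_ordered(d))  # the ordered comparison is opt-in
```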


Raymond


Re: Posts by Stefan Ram

2017-11-21 Thread Michael Torrie
On 11/21/2017 07:50 AM, Ethan Furman wrote:
> Everyone else, please do not quote Stefan's messages as they may then
> end up on the mailing list possibly violating his copyright.

The good news is, at least, that quoting his messages with attribution
is certainly fair use in all jurisdictions I'm aware of.


Re: How to Generate dynamic HTML Report using Python

2017-11-21 Thread Rick Johnson
On Tuesday, November 21, 2017 at 5:57:42 AM UTC-6, Ned Batchelder wrote:
[...]
> [...]
> I don't understand the motivation for limiting how words
> are distributed, but others on this list also do it. For
> example, Dennis Lee Bieber's messages are not in the
> Python-List archives either. 

I called out Bieber years ago for his X-NO-ARCHIVE business
and nobody else seemed to care. In fact, the mood at the
time was more negative towards me for calling it out than towards
Dennis. 

Sorry i don't have a link to the thread, but i'm sure a
determined person could find it. It was probably somewhere
between 2008 and 2010.

> If something is worth saying, why not let people find it
> later?

That has always been my opinion as well. And if i remember
correctly, Dennis said something to the effect of (paraphrasing)
"I don't want my words archived so that someone can leverage
my words against me for nefarious reasons"

"Nefarious reasons"? 

Paranoid or lame? 

You decide.


reading text in pdf, some working sample code

2017-11-21 Thread Daniel Gross
Hi,

I am new to python and jumped right into trying to read out (english) text
from PDF files.

I tried various libraries (including slate) out there but am running into
diverse problems, such as with encoding or buffer too small errors -- deep
inside some decompression code.

Essentially, i want to extract all text and then do some natural language
processing on the text. Is there some sample code available that works
together with a clear description of the expected python installation
environment needed?

In slate btw, i got the buffer error, it seems i must "guess" the right
encoding of the text included in the PDF when opening the file. Still
trying to figure out how to get the encoding info out of the PDF ... (if
available there)

thank you,

Daniel


Re: How to Generate dynamic HTML Report using Python

2017-11-21 Thread Christopher Reimer
On Nov 21, 2017, at 5:36 AM, Rustom Mody  wrote:
> 
>> On Tuesday, November 21, 2017 at 5:27:42 PM UTC+5:30, Ned Batchelder wrote:
>>> On 11/20/17 9:50 AM, Stefan Ram wrote:
>>> Ned Batchelder  writes:
>>>> Also, why set headers that prevent the Python-List mailing list from
>>>> archiving your messages?
>>>   I am posting to a Usenet newsgroup. I am not aware of any
>>>   "Python-List mailing list".
>>> 
>>>   I am posting specifically to the Usenet, because I am aware
>>>   of its rules and I like it and wish to support it.
>>> 
>>>   I do not post to a "mailing list" because I do not know which
>>>   rules apply for mailing lists and whether mailing lists in
>>>   general or any specific mailing list is an environment that I
>>>   like or wish to support.
>>> 
>> 
>> The dual nature of this online community has long been confusing and 
>> complicated.  It's both a newsgroup and a mailing list.  Add in Google 
>> Groups, and you really have three different faces of the same content.
>> 
>> The fact is, posting to comp.lang.python means that your words are also 
>> being distributed as a mailing list. Because of your messages' headers, 
>> they are not in the archive of that list 
>> (https://mail.python.org/pipermail/python-list/2017-November/thread.html), 
>> or in Google Groups 
>> (https://groups.google.com/forum/#!topic/comp.lang.python/0ejrtZ6ET9g). 
>> It makes for odd reading via those channels.
>> 
>> I don't understand the motivation for limiting how words are 
>> distributed, but others on this list also do it. For example, Dennis Lee 
>> Bieber's messages are not in the Python-List archives either. If 
>> something is worth saying, why not let people find it later?
> 
> To which I would add:
> Setting headers is hardly a working method.
> Somebody quotes Stefan or Dennis and they are on the archives
> And some quote including emails some not
> etc

A troll tried to prove that I was too retarded to program in Python by claiming 
that I asked a question on this list in the archives that could have been 
answered by searching the web. The funny thing is that none of the links that 
the troll provided answered my question.

Chris R. 


Posts by Stefan Ram

2017-11-21 Thread Ethan Furman

On 11/20/2017 10:47 AM, Michael Torrie wrote:
> On 11/20/2017 07:50 AM, Stefan Ram wrote:
>> I am posting to a Usenet newsgroup. I am not aware of any
>> "Python-List mailing list".
>
> As far as I'm concerned, this list is primarily a mailing list, hosted
> by Mailman at python.org, and is mirrored to Usenet via a gateway as a
> service by python.org.  Granted, this is just a matter of perspective.

> You also have this header set:
>> X-Copyright: (C) Copyright 2017 Stefan Ram. All rights reserved.
>> Distribution through any means other than regular usenet
>> channels is forbidden. It is forbidden to publish this
>> article in the world wide web. It is forbidden to change
>> URIs of this article into links. It is forbidden to remove
>> this notice or to transfer the body without this notice.
>
> Looks to me like the mailing list needs to block your messages, lest
> python.org be in violation of your copyright.

Stefan, please look into the Python mailing list [1], and either remove your
copyright or include python.org as an exception (and let us know if you do).

Until then, all your messages will be auto-discarded at the Usenet/mailing list
boundary.

Everyone else, please do not quote Stefan's messages as they may then end up on
the mailing list possibly violating his copyright.


--
~Ethan~

[1] https://mail.python.org/mailman/listinfo/python-list


Re: General Purpose Pipeline library?

2017-11-21 Thread Jason
On Monday, November 20, 2017 at 10:49:01 AM UTC-5, Jason wrote:
> a pipeline can be described as a sequence of functions that are applied to an 
> input with each subsequent function getting the output of the preceding 
> function:
> 
> out = f6(f5(f4(f3(f2(f1(inp))))))
> 
> However this isn't very readable and does not support conditionals.
> 
> Tensorflow has tensor-focused pipelines:
> fc1 = layers.fully_connected(x, 256, activation_fn=tf.nn.relu, scope='fc1')
> fc2 = layers.fully_connected(fc1, 256, activation_fn=tf.nn.relu, scope='fc2')
> out = layers.fully_connected(fc2, 10, activation_fn=None, scope='out')
> 
> I have some code which allows me to mimic this, but with an implied parameter.
> 
> import functools
> from functools import reduce
>
> def executePipeline(steps, collection_funcs=(map, filter, reduce)):
>     results = None
>     for step in steps:
>         func = step[0]
>         params = step[1]
>         if func in collection_funcs:
>             print(func, params[0])
>             # Bind the extra arguments, then apply across the previous results.
>             results = func(functools.partial(params[0], *params[1:]), results)
>         else:
>             print(func)
>             if results is None:
>                 results = func(*params)
>             else:
>                 # The previous results always go in as the last argument.
>                 results = func(*(params + (results,)))
>     return results
> 
> executePipeline( [
>   (read_rows, (in_file,)),
>   (map, (lower_row, field)),
>   (stash_rows, ('stashed_file', )),
>   (map, (lemmatize_row, field)),
>   (vectorize_rows, (field, min_count,)),
>   (evaluate_rows, (weights, None)),
>   (recombine_rows, ('stashed_file', )),
>   (write_rows, (out_file,))
>   ]
> )
> 
> Which gets me close, but I can't control where rows gets passed in. In the 
> above code, it is always the last parameter.
> 
> I feel like I'm reinventing a wheel here.  I was wondering if there's already 
> something that exists?

Why do I want this? Because I'm tired of writing code that is locked away in a 
bespoke function. I'd  have an army of functions all slightly different in 
functionality. I require flexibility in defining pipelines, and I don't want a 
custom pipeline to require any low-level coding. I just want to feed a sequence 
of functions to a script and have it process it. A middle ground between the 
shell | operator and bespoke python code. Sure, I could write many binaries 
bound by shell, but there are some things done far easier in python because of 
its extensive libraries and it can exist throughout the execution of the 
pipeline whereas any temporary persistence has to be through environment 
variables or files.

Well after examining your feedback, it looks like Grapevine has 99% of the 
concepts that I wanted to invent, even if the | operator seems a bit clunky. I 
personally prefer the fluent interface convention. But this should work.
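
For comparison, here is a minimal sketch of what a fluent-style pipeline could
look like in plain Python. The Pipeline class and its method names are
hypothetical, written for this example; they are not the API of Grapevine,
Kamaelia, or any other library mentioned in the thread.

```python
from functools import reduce


class Pipeline:
    """Hypothetical fluent pipeline: each method returns a new Pipeline."""

    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def _extend(self, step):
        return Pipeline(self._steps + (step,))

    def map(self, func, *args):
        return self._extend(lambda rows: [func(row, *args) for row in rows])

    def filter(self, pred, *args):
        return self._extend(lambda rows: [row for row in rows if pred(row, *args)])

    def apply(self, func, *args, **kwargs):
        """Run func on the whole collection; the rows are passed first."""
        return self._extend(lambda rows: func(rows, *args, **kwargs))

    def run(self, rows):
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), self._steps, rows)


pipeline = (
    Pipeline()
    .map(str.lower)
    .filter(lambda row, prefix: row.startswith(prefix), "a")
    .apply(sorted, reverse=True)
)
result = pipeline.run(["Apple", "banana", "Avocado"])
print(result)
```

In this sketch the rows are always passed as the first argument. A step that
needs them elsewhere can be wrapped in a lambda, which gives the caller control
over where the previous result goes.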

Kamaelia could also work, but it seems a little bit more grandiose. 


Thanks everyone who chimed in!


Re: How to Generate dynamic HTML Report using Python

2017-11-21 Thread Rustom Mody
On Tuesday, November 21, 2017 at 7:06:18 PM UTC+5:30, Rustom Mody wrote:
> On Tuesday, November 21, 2017 at 5:27:42 PM UTC+5:30, Ned Batchelder wrote:
> > On 11/20/17 9:50 AM, Stefan Ram wrote:
> > > Ned Batchelder  writes:
> > >> Also, why set headers that prevent the Python-List mailing list from
> > >> archiving your messages?
> > >I am posting to a Usenet newsgroup. I am not aware of any
> > >"Python-List mailing list".
> > >
> > >I am posting specifically to the Usenet, because I am aware
> > >of its rules and I like it and wish to support it.
> > >
> > >I do not post to a "mailing list" because I do not know which
> > >rules apply for mailing lists and whether mailing lists in
> > >general or any specific mailing list is an environment that I
> > >like or wish to support.
> > >
> > 
> > The dual nature of this online community has long been confusing and 
> > complicated.  It's both a newsgroup and a mailing list.  Add in Google 
> > Groups, and you really have three different faces of the same content.
> > 
> > The fact is, posting to comp.lang.python means that your words are also 
> > being distributed as a mailing list. Because of your messages' headers, 
> > they are not in the archive of that list 
> > (https://mail.python.org/pipermail/python-list/2017-November/thread.html), 
> > or in Google Groups 
> > (https://groups.google.com/forum/#!topic/comp.lang.python/0ejrtZ6ET9g). 
> > It makes for odd reading via those channels.
> > 
> > I don't understand the motivation for limiting how words are 
> > distributed, but others on this list also do it. For example, Dennis Lee 
> > Bieber's messages are not in the Python-List archives either. If 
> > something is worth saying, why not let people find it later?
> 
> To which I would add:
> Setting headers is hardly a working method.
> Somebody quotes Stefan or Dennis and they are on the archives
> And some quote including emails some not
> etc

Oh, and one more thing:
If Stefan or Dennis say something to the above, don't expect a response from me
since I would not have seen theirs 😉


Re: General Purpose Pipeline library? (Posting On Python-List Prohibited)

2017-11-21 Thread Jason
On Monday, November 20, 2017 at 4:02:31 PM UTC-5, Lawrence D’Oliveiro wrote:
> On Tuesday, November 21, 2017 at 4:49:01 AM UTC+13, Jason wrote:
> > a pipeline can be described as a sequence of functions that are
> > applied to an input with each subsequent function getting the output
> > of the preceding function:
> > 
> > out = f6(f5(f4(f3(f2(f1(inp))))))
> > 
> > However this isn't very readable and does not support conditionals.
> 
> Do you want a DAG in general?

If the nodes have a __call__, yes?
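
To make that concrete, a rough sketch of a DAG of callables, assuming Python
3.9 or later for graphlib. The run_dag function and the node layout (name
mapped to a callable plus the names of its upstream nodes) are my own
invention for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(nodes, inputs):
    """Evaluate a DAG where nodes maps name -> (callable, [upstream names])."""
    graph = {name: set(deps) for name, (_, deps) in nodes.items()}
    results = dict(inputs)
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # an externally supplied input
        func, deps = nodes[name]
        results[name] = func(*(results[d] for d in deps))
    return results


nodes = {
    "lower": (str.lower, ["raw"]),
    "words": (str.split, ["lower"]),
    "count": (len, ["words"]),
    # A node with two upstream dependencies, which a linear pipeline cannot express.
    "summary": (lambda lower, count: f"{count} words: {lower}", ["lower", "count"]),
}
results = run_dag(nodes, {"raw": "Hello Python List"})
print(results["summary"])
```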


Re: How to Generate dynamic HTML Report using Python

2017-11-21 Thread Rustom Mody
On Tuesday, November 21, 2017 at 5:27:42 PM UTC+5:30, Ned Batchelder wrote:
> On 11/20/17 9:50 AM, Stefan Ram wrote:
> > Ned Batchelder  writes:
> >> Also, why set headers that prevent the Python-List mailing list from
> >> archiving your messages?
> >I am posting to a Usenet newsgroup. I am not aware of any
> >"Python-List mailing list".
> >
> >I am posting specifically to the Usenet, because I am aware
> >of its rules and I like it and wish to support it.
> >
> >I do not post to a "mailing list" because I do not know which
> >rules apply for mailing lists and whether mailing lists in
> >general or any specific mailing list is an environment that I
> >like or wish to support.
> >
> 
> The dual nature of this online community has long been confusing and 
> complicated.  It's both a newsgroup and a mailing list.  Add in Google 
> Groups, and you really have three different faces of the same content.
> 
> The fact is, posting to comp.lang.python means that your words are also 
> being distributed as a mailing list. Because of your messages' headers, 
> they are not in the archive of that list 
> (https://mail.python.org/pipermail/python-list/2017-November/thread.html), 
> or in Google Groups 
> (https://groups.google.com/forum/#!topic/comp.lang.python/0ejrtZ6ET9g). 
> It makes for odd reading via those channels.
> 
> I don't understand the motivation for limiting how words are 
> distributed, but others on this list also do it. For example, Dennis Lee 
> Bieber's messages are not in the Python-List archives either. If 
> something is worth saying, why not let people find it later?

To which I would add:
Setting headers is hardly a working method.
Somebody quotes Stefan or Dennis and they are on the archives
And some quote including emails some not
etc


Re: How to Generate dynamic HTML Report using Python

2017-11-21 Thread Ned Batchelder

On 11/20/17 9:50 AM, Stefan Ram wrote:
> Ned Batchelder  writes:
>> Also, why set headers that prevent the Python-List mailing list from
>> archiving your messages?
>
>    I am posting to a Usenet newsgroup. I am not aware of any
>    "Python-List mailing list".
>
>    I am posting specifically to the Usenet, because I am aware
>    of its rules and I like it and wish to support it.
>
>    I do not post to a "mailing list" because I do not know which
>    rules apply for mailing lists and whether mailing lists in
>    general or any specific mailing list is an environment that I
>    like or wish to support.



The dual nature of this online community has long been confusing and 
complicated.  It's both a newsgroup and a mailing list.  Add in Google 
Groups, and you really have three different faces of the same content.


The fact is, posting to comp.lang.python means that your words are also 
being distributed as a mailing list. Because of your messages' headers, 
they are not in the archive of that list 
(https://mail.python.org/pipermail/python-list/2017-November/thread.html), 
or in Google Groups 
(https://groups.google.com/forum/#!topic/comp.lang.python/0ejrtZ6ET9g). 
It makes for odd reading via those channels.


I don't understand the motivation for limiting how words are 
distributed, but others on this list also do it. For example, Dennis Lee 
Bieber's messages are not in the Python-List archives either. If 
something is worth saying, why not let people find it later?


--Ned.


Re: how to compare and check if two binary(h5) files numerically have the same contents

2017-11-21 Thread Cameron Simpson

On 21Nov2017 02:04, Heli  wrote:
> I am trying to compare the contents of two binary files. I use Python 3.6's
> filecmp module to compare same-named files inside two directories.
>
> results_dummy=filecmp.cmpfiles(dir1, dir2, common, shallow=True)
>
> The above line works for *.bin file I have in both directories, but it does not
> work with h5 files.
>
> When comparing two hdf5 files that contain exactly the same groups/datasets and
> numerical data, filecmp.cmpfiles finds them as mismatch. My hdf files are not
> binary equal but contain the same exact data.
>
> Is there any way to compare the contents of two hdf5 files from within a Python
> script, without using h5diff?


There are several packages on PyPI related to the H5 data format:

 https://pypi.python.org/pypi?%3Aaction=search&term=h5

I imagine what you need to do is to load your 2 H5 data files and then compare 
the data structures within them. Hopefully one of these packages can be used 
for this. This one looks promising:


 https://pypi.python.org/pypi/h5py/2.7.1

If you have pip, you should be able to install it thus:

 pip install --user h5py

to make use of it.
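
A minimal sketch of that idea, assuming h5py and NumPy are installed. The
function names same_contents and hdf5_files_equal are made up for this example.
It compares group names, dataset shapes, dtypes, and values; it ignores HDF5
attributes and does not handle broken links or empty datasets, so treat it as a
starting point and not a replacement for h5diff.

```python
def same_contents(a, b, values_equal):
    """Recursively compare two mapping-like trees.

    h5py files and groups behave like dicts (they have .keys() and support
    indexing by name), so plain nested dicts work here as well.
    """
    a_is_group = hasattr(a, "keys")
    b_is_group = hasattr(b, "keys")
    if a_is_group != b_is_group:
        return False
    if not a_is_group:
        return values_equal(a, b)
    if set(a.keys()) != set(b.keys()):
        return False
    return all(same_contents(a[name], b[name], values_equal) for name in a.keys())


def hdf5_files_equal(path_a, path_b):
    """Compare two HDF5 files by content instead of byte for byte."""
    # Third-party imports are kept local so same_contents() has no dependencies.
    import h5py
    import numpy as np

    def datasets_equal(x, y):
        # x[()] reads the whole dataset into memory as a NumPy value.
        return (
            x.shape == y.shape
            and x.dtype == y.dtype
            and np.array_equal(x[()], y[()])
        )

    with h5py.File(path_a, "r") as fa, h5py.File(path_b, "r") as fb:
        return same_contents(fa, fb, datasets_equal)
```

Usage would be along the lines of hdf5_files_equal("run1.h5", "run2.h5"), with
file names of your own. Note that np.array_equal treats NaN as unequal to NaN
by default, and for floating-point data that should match only approximately,
np.allclose may be a better fit than exact equality.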

Cheers,
Cameron Simpson  (formerly c...@zip.com.au)


how to compare and check if two binary(h5) files numerically have the same contents

2017-11-21 Thread Heli
Dear all, 

I am trying to compare the contents of two binary files. I use Python 3.6's 
filecmp module to compare same-named files inside two directories.

results_dummy=filecmp.cmpfiles(dir1, dir2, common, shallow=True)

The above line works for *.bin file I have in both directories, but it does not 
work with h5 files.

When comparing two hdf5 files that contain exactly the same groups/datasets and 
numerical data, filecmp.cmpfiles finds them as mismatch. My hdf files are not 
binary equal but contain the same exact data. 
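
For reference, a small self-contained sketch of what filecmp.cmpfiles reports.
It returns three lists (match, mismatch, errors) and only ever looks at file
metadata and raw bytes: with shallow=True it accepts files whose os.stat()
signatures are identical and otherwise falls back to comparing contents, while
shallow=False always compares contents. The file names and byte strings below
are invented for the example.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    dir1 = os.path.join(root, "a")
    dir2 = os.path.join(root, "b")
    os.mkdir(dir1)
    os.mkdir(dir2)

    # Same bytes in both directories.
    for d in (dir1, dir2):
        with open(os.path.join(d, "same.bin"), "wb") as f:
            f.write(b"\x00\x01\x02")

    # Same length but different bytes: a stand-in for two HDF5 files that
    # hold the same data in a different on-disk layout.
    with open(os.path.join(dir1, "diff.bin"), "wb") as f:
        f.write(b"\x00\x01\x02")
    with open(os.path.join(dir2, "diff.bin"), "wb") as f:
        f.write(b"\x02\x01\x00")

    common = ["same.bin", "diff.bin", "missing.bin"]
    match, mismatch, errors = filecmp.cmpfiles(dir1, dir2, common, shallow=False)

print(match, mismatch, errors)
```

Since the comparison is byte-level, two HDF5 files with identical datasets but
different internal layout or metadata will always land in the mismatch list.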

Is there any way to compare the contents of two hdf5 files from within a Python 
script, without using h5diff?

Thanks in Advance,