On 07/31/2017 09:39 AM, bruce wrote:
> Hi guys.
>
> Testing getting data from a number of different US-based/targeted
> websites. So the input data source, for the most part, will be "ascii".
> I'm getting a few "weird" chars every now and then, and as far as I can
> tell, they should be utf-8.
>
> However, the following hasn't always worked:
>     s = str(s).decode('utf-8').strip()
>
> So, is there a quick/dirty approach I can use to simply strip out the
> "non-ascii" chars? I know this might not be the "best/pythonic" way,
> and that it might result in loss of some data/chars, but I can live
> with it for now.
>
> Thoughts/comments?
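[As a side note on the quoted snippet: decode() raises UnicodeDecodeError as soon as it hits bytes that aren't valid UTF-8, which would explain it not "always" working. A minimal Python 3 sketch of a more forgiving decode; the byte string here is made up for illustration:

    # A byte string with one invalid UTF-8 byte (0xff) in the middle.
    raw = b'caf\xc3\xa9 \xff latte'

    # Strict decoding raises UnicodeDecodeError on the bad byte.
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        print('strict decode failed:', exc)

    # errors='replace' substitutes U+FFFD for undecodable bytes;
    # errors='ignore' silently drops them.
    print(raw.decode('utf-8', errors='replace'))
    print(raw.decode('utf-8', errors='ignore'))

Note that in Python 2, str(s).decode('utf-8') would also implicitly re-encode a unicode value as ascii first, which fails on its own.]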
It's easy enough to toss chars if you don't care what's being tossed,
which sounds like your case. Something like:

    ''.join(i for i in s if ord(i) < 128)

But there's actually lots to think about here (I'm sure others will
jump in):

- Python 2 strings default to ascii, Python 3 to unicode, so there may
  be some excitement with the use of ord() depending on how the string
  is passed around.

- Websites will tell you their encoding, which you could and probably
  should make use of.

- Web scraping with Python is a pretty well-developed field; perhaps
  you might want to use one of the existing projects?
  (https://scrapy.org/ is pretty famous, certainly not the only one.)

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
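[Postscript: a self-contained Python 3 sketch of the filter suggested above, next to an equivalent encode/decode round-trip; the sample string is made up for illustration:

    # Two ways to strip non-ASCII characters from a Python 3 str.
    s = 'caf\u00e9 latte \u2014 $3'

    # 1. The generator-expression filter from the reply.
    ascii_only = ''.join(ch for ch in s if ord(ch) < 128)

    # 2. Round-trip through ASCII, silently dropping anything
    #    that won't encode.
    ascii_only2 = s.encode('ascii', errors='ignore').decode('ascii')

    assert ascii_only == ascii_only2
    print(ascii_only)  # 'caf latte  $3'

Both drop the accented and dash characters outright, so as the original poster says, this loses data; decoding with the site's declared encoding is the more robust route.]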