Cool! Code! (Gollum voice - we must look at code, yesss....)

> Hi there.
>
> What this section does is take a comma-separated data file and import it
> - there is a unique ID in the first field and a code in the second field
> that corresponds to a certain section of information. What I need from
> this is for the process to roll up, against each unique ID, all section
> holdings without duplicates, report on section combinations, and overall
> section counts. In addition I need the ability to assign a value for
> page count to these sections, and the ability to upload a translation
> file in case a section is identified by multiple values that need to be
> normalized to a single unique value.
>
> Sorry for the lengthy code response - all comments are appreciated - as
> mentioned I am quite new with Python - it is doing what I need it to do
> but I think that it is a mess and needs to be cleaned up a little.
>
> Thanks for any comments.
>
> GTXY20
>
> import sys
> import os
>
> class __analysis:
>     def __init__(self):
>         print '***Analysis Tool***'
>         datafile=raw_input('data file name:')
>         self.datafile=datafile
>         self.parsefile()
Ouch. Usually in OOP, one never puts any user interaction into a class:
any time you want to make an __analysis object, code execution is halted
until the user responds.... Yuck. If necessary, make the data file a
parameter you pass in, and do the raw_input outside of the class
definition before you instantiate the object.

As an aside: setting the filename to datafile and then datafile to
self.datafile is a little wasteful when you can set it directly to
self.datafile.

> # script to import unitID section data and section page count reference
> # and create a sorted dictionary
> # where in uhdata{} key=unitID and value=unitID section holdings
> # where in pgcnt{} key=Section and value=page count
>
>     def parsefile(self):
>         try:
>             uhdatafile = open(self.datafile, 'r')
>             records = uhdatafile.read()
>             uhdatafile.close()
>             lines = records.split()

Ummm, this will split on every space as well. Since you named the result
'lines', I think you were expecting this to split on newline characters;
for that, use uhdatafile.readlines() instead.

>             self.uhdata={}
>             for line in lines:

Here you iterate over a created list. It is easier and more efficient to
iterate over the file itself:

    for line in uhdatafile:
        do_stuff()
    uhdatafile.close()

>                 uh, tf = line.split(',')
>                 if uh in self.uhdata:
>                     f=self.uhdata[uh]
>                     if tf not in f:
>                         f.append(tf)
>                 else:
>                     self.uhdata[uh]=[tf]

This can read:

    f = self.uhdata.setdefault(uh, [])
    if tf not in f:
        f.append(tf)

Also note that there is no error checking anywhere in the algorithm. The
file must be formatted exactly right or this will raise an error.

>             for uh, Sections in self.uhdata.items():
>                 Sections.sort()

A note here: the documentation says items() returns a *copy* of the
(key, value) pairs, but the values in those pairs are references to the
original lists, so the in-place sort does actually work. Still, since uh
is not used, itervalues() is cleaner and avoids building the list of
pairs:
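Putting those suggestions together (the data passed in as a parameter
instead of prompted for inside the class, the input iterated directly,
and duplicates skipped while grouping), the parsing step might look
roughly like this. This is just a sketch in modern Python 3 (the code in
this thread is Python 2), and the function name and sample records are
made up for illustration:

```python
def group_sections(lines):
    """Group section codes by unit ID, skipping duplicate codes.

    `lines` is any iterable of 'unitID,section' strings -- e.g. an open
    file object, so the file is iterated directly instead of being read
    whole and split.
    """
    uhdata = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # a little error tolerance: skip blank lines
        uh, tf = line.split(',')
        holdings = uhdata.setdefault(uh, [])
        if tf not in holdings:
            holdings.append(tf)
    return uhdata

sample = ["U1,A", "U1,B", "U1,A", "U2,C"]
print(group_sections(sample))  # {'U1': ['A', 'B'], 'U2': ['C']}
```

With a real file you would write `with open(name) as f:
uhdata = group_sections(f)`, which also closes the file automatically.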
    for Sections in self.uhdata.itervalues():
        Sections.sort()

>         except IOError:
>             print 'file not found check file name'
>             analysis()
>
>         ftranslateok=raw_input('would you like to translate section codes? (y/n):')
>         if ftranslateok == 'y':
>             self.transFn()
>         else:
>             pass

The else: pass is unnecessary; code execution continues whether you write
it or not.

>         pgcountok=raw_input('would you like to assign section page counts? (y/n):')
>         if pgcountok == 'y':
>             self.setPageCounts()
>         else:
>             missingpgcounts={}
>             fmissingpgcounts=[]
>             for x in self.uhdata:
>                 for f in self.uhdata[x]:
>                     if f not in fmissingpgcounts:
>                         fmissingpgcounts.append(f)

Ahhh... this looks like a set. Note the dictionary's values are lists
(which can't go into a set directly), so flatten them as you build it:

    fmissingpgcounts = set(f for v in self.uhdata.itervalues() for f in v)

>             for x in fmissingpgcounts:
>                 missingpgcounts[x]=0
>             self.pgcounts = missingpgcounts

Oh my. So we have a dictionary, create a unique list of the values, and
then create another dictionary from that? The intermediate variable
missingpgcounts is not necessary:

    self.pgcounts = dict((x, 0) for x in fmissingpgcounts)

>         fdistmodel=raw_input('would you like to define max section distribution cut off? (y/n):')
>         if fdistmodel == 'y':
>             self.fdistmax=raw_input('what is the max distributions before a full book?:')
>             self.fdistmax=int(self.fdistmax)
>             self.Sectiondistmax()
>         else:
>             self.fdistmax=1000000000
>             self.Sectiondistmax()

self.Sectiondistmax() only needs to be typed once, unindented, after the
if/else block. If self.fdistmax will never be zero, then you can do this:

    self.fdistmax = int(self.fdistmax) or 1000000000

Oh, and if the user presses enter without putting in a number then your
program will exit with an error. If you want, you can use this to your
advantage (press enter for the default of "no cutoff"):

    try:
        self.fdistmax = int(self.fdistmax)
    except ValueError:
        self.fdistmax = False

False is a better flag here.
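The "looks like a set" idea can be sketched on its own: flatten the
holding lists into a set of distinct codes, then build the zero-filled
page-count dictionary in one comprehension. A minimal Python 3 sketch,
with a hypothetical function name and data:

```python
def default_page_counts(uhdata):
    """Build a page-count dict that defaults every section to 0.

    The values of `uhdata` are lists, so they are flattened into a set
    of distinct section codes first.
    """
    sections = set()
    for codes in uhdata.values():
        sections.update(codes)
    # one comprehension replaces the intermediate list and the loop
    return {s: 0 for s in sections}

counts = default_page_counts({"U1": ["A", "B"], "U2": ["B", "C"]})
print(counts == {"A": 0, "B": 0, "C": 0})  # True
```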
Though you will probably never have a real value of a billion, you could
never, ever possibly have a real value of False (0), and it helps
readability. See below.

>         sys.exit(1)
>
>     # function to determine number of uniqueID for each section
>     def Sectionqty(self):
>         Sectionqtyoutfile = open('Sectionqty.txt', 'w+')
>         Sectionqtyoutfile.write ('Section\tQTY\n')
>         from collections import defaultdict
>         fcounts=defaultdict(int)
>         flst=[]
>         flst2=[]

Okay, this whole section can be rewritten. The two blocks are basically
the same thing, with one stipulation based on a flag. I will attempt to
rewrite this so the code is not duplicated.

>         if self.fdistmax == 1000000000:
>             for v in self.uhdata.values():
>                 for item in v:
>                     fcounts[item]+=1
>
>             for k,v in sorted(fcounts.items()):
>                 Section=k
>                 fqty=v
>                 Sectionqtyoutfile.write ('%s\t%s\n' % (Section, fqty))
>
>         else:
>             for k,v in self.uhdata.items():
>                 if len(v)<=self.fdistmax:
>                     flst.append(self.uhdata[k])
>             for i in flst:
>                 for x in i:
>                     flst2.append(x)
>             for Sections in flst2:
>                 fcounts[Sections]+=1
>             for k,v in sorted(fcounts.items()):
>                 Section= k
>                 fqty= v
>                 Sectionqtyoutfile.write ('%s\t%s\n' % (Section, fqty))

    for k, v in self.uhdata.items():
        if not self.fdistmax or len(v) <= self.fdistmax:
            for x in v:
                fcounts[x] += 1
    for (Section, fqty) in sorted(fcounts.items()):
        Sectionqtyoutfile.write('%s\t%s\n' % (Section, fqty))

(Note the `not`: a False fdistmax means "no cutoff", so everything gets
counted; otherwise only units within the cutoff do.)

>         Sectionqtyoutfile.close()
>         self.SectionCombqty()
>
>     # function to determine number of uniqueID section combinations and
>     # associated section page counts
>     def SectionCombqty(self):
>         SectionCombqtyoutfile = open('SectionCombqty.txt', 'w+')
>         SectionCombqtyoutfile.write('Combination Qty\tNumber of Sections\tCombination\tCombinationPageCount\tTotalPages\n')
>         fullbook = 'Full Book'
>         fgreater=[]
>         fcheck=0
>         from collections import defaultdict
>         fcomb=defaultdict(int)
>         for uh in self.uhdata.keys():
>             fcomblst=self.uhdata[uh]
>             fcomb[tuple(fcomblst)]+=1

    for value in self.uhdata.itervalues():
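That merged loop can be made runnable as a standalone sketch. This is
Python 3 with hypothetical names (the thread's code is Python 2); with
the False flag, `not fdistmax` reads as "no cutoff", so one loop covers
both the capped and uncapped cases:

```python
from collections import defaultdict

def section_counts(uhdata, fdistmax=False):
    """Count how many units hold each section.

    A falsy `fdistmax` (the False flag) means no cutoff; otherwise units
    holding more than `fdistmax` sections are skipped.
    """
    fcounts = defaultdict(int)
    for holdings in uhdata.values():
        if not fdistmax or len(holdings) <= fdistmax:
            for section in holdings:
                fcounts[section] += 1
    return dict(fcounts)

data = {"U1": ["A", "B"], "U2": ["A"], "U3": ["A", "B", "C"]}
print(section_counts(data))              # {'A': 3, 'B': 2, 'C': 1}
print(section_counts(data, fdistmax=2))  # {'A': 2, 'B': 1}
```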
        fcomb[tuple(value)] += 1

>         if self.fdistmax == 1000000000:
>             for count, items in sorted( ((v,k) for k,v in fcomb.items()),reverse=True):
>                 fpgcounts = sum([self.pgcounts.get(i,i) for i in list(items)])
>                 Sectioncomb = ','.join(items)
>                 holdings = len(items)
>                 totpgcounts = count*fpgcounts
>                 SectionCombqtyoutfile.write ('%s\t%s\t%s\t%s\t%s\n' % (count,holdings,Sectioncomb,fpgcounts,totpgcounts))
>
>         else:
>             for count, items in sorted( ((v,k) for k,v in fcomb.items()),reverse=True):
>                 if len(items) <= self.fdistmax:
>                     fpgcounts = sum([self.pgcounts.get(i,i) for i in list(items)])
>                     Sectioncomb = ','.join(items)
>                     holdings = len(items)
>                     totpgcounts = count*fpgcounts
>                     SectionCombqtyoutfile.write ('%s\t%s\t%s\t%s\t%s\n' % (count,holdings,Sectioncomb,fpgcounts,totpgcounts))
>             for count, items in sorted( ((v,k) for k,v in fcomb.items()),reverse=True):
>                 if len(items)>self.fdistmax:
>                     fgreater.append(count)
>
>             fcheck=sum(fgreater)
>             SectionCombqtyoutfile.write ('%s\t''>''%s\t%s\t%s\t%s\n' % (fcheck,self.fdistmax,fullbook,fullbook,fullbook))

See what I did to the section above for how to make this one block
instead of two copy-and-paste blocks.

>         SectionCombqtyoutfile.close()
>
>     # where in pgcnt{} key=Section and value=page count
>     def setPageCounts(self):
>         pagecountfile=raw_input('page count file name:')
>         self.pagecountfile=pagecountfile
>         try:
>             pagecountinfile = open(self.pagecountfile, 'r')
>             records = pagecountinfile.read()
>             pagecountinfile.close()
>             self.pgcounts={}
>             lines = records.split()
>             for line in lines:
>                 fpg, cnt = line.split(',')
>                 self.pgcounts[fpg]=int(cnt)

This is very similar to the block at the beginning of this post. Make the
same changes here.
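Incidentally, the tuple-key counting idiom in SectionCombqty is exactly
the job of `collections.Counter`. A small Python 3 sketch (hypothetical
names and data):

```python
from collections import Counter

def combination_counts(uhdata):
    """Count identical holding combinations across units.

    Lists can't be dict keys, so each (already-sorted) holdings list is
    frozen into a tuple before counting.
    """
    return Counter(tuple(holdings) for holdings in uhdata.values())

data = {"U1": ["A", "B"], "U2": ["A", "B"], "U3": ["C"]}
print(dict(combination_counts(data)))  # {('A', 'B'): 2, ('C',): 1}
```

Counter also gives you `most_common()` for free, which replaces the
sorted((v,k) ...) trick for reporting combinations by frequency.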
>         except IOError:
>             print 'file not found check file name'
>             analysis()
>
>     # function to determine number of uniqueID distributions and
>     # associated Sections held
>     def Sectiondistmax(self):
>         from collections import defaultdict
>         Sectiondistoutfile = open('Sectiondist.txt', 'w+')
>         Sectiondistoutfile.write ('SectionDistributions\tQTY\n')
>         fgreater=[]
>         fullbook = "Full Book"
>         fcheck=0
>         fcount=defaultdict(int)
>         for uh in self.uhdata.keys():
>             f=self.uhdata[uh]
>             fcount[len(f)]+=1
>         if self.fdistmax == 1000000000:
>             for k,v in sorted(fcount.items()):

You don't have to name these k, v. Any descriptive names are fine:

    for fdist, fqty in sorted(fcount.items()):

>                 fdist=k
>                 fqty=v
>                 Sectiondistoutfile.write ('%s\t%s\n' % (fdist,fqty))
>
>         else:
>             for k,v in sorted(fcount.items()):
>                 if k <= self.fdistmax:
>                     fdist=k
>                     fqty=v

Same naming thing here.

>                     Sectiondistoutfile.write ('%s\t%s\n' % (fdist,fqty))
>
>             for k,v in sorted(fcount.items()):
>                 if k > self.fdistmax:
>                     fgreater.append(fcount[k])
>             fcheck=sum(fgreater)
>             Sectiondistoutfile.write ('%s\t%s\n' % (fullbook,fcheck))
>         Sectiondistoutfile.close()
>         self.Sectionqty()
>
>     # function to translate UnitID Section codes to a normalized
>     # assigned Section code (e.g. parent and multiple child section codes)
>     def transFn(self):
>         transfile=raw_input('Section translate file name:')
>         self.transfile=transfile
>         try:
>             transfilein=open(self.transfile, 'r')
>             records = transfilein.read()
>             transfilein.close()
>             lines = records.split()
>             transDict = {}
>             for line in lines:
>                 key, value = line.split(',')
>                 transDict[key] = value

Again, the same thing here as way up above.

>             for key, value in self.uhdata.items():
>                 self.uhdata[key] = [ transDict.get(i, i) for i in value ]
>             for k in self.uhdata:
>                 self.uhdata[k]=sorted(set(self.uhdata[k]))

Uhhh... Right.

>         except IOError:
>             print 'file not found check file name'
>             analysis()

Just some suggestions. A long piece of code for what you are trying to
accomplish...
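Those two passes in transFn (translate, then dedupe and sort) can be
collapsed into a single dict comprehension. A Python 3 sketch with
made-up section codes:

```python
def normalize(uhdata, trans):
    """Map child section codes to their parent code via `trans`
    (codes with no entry pass through unchanged), then dedupe and
    sort each unit's holdings in one pass."""
    return {uh: sorted(set(trans.get(code, code) for code in codes))
            for uh, codes in uhdata.items()}

data = {"U1": ["A1", "A2", "B"]}
print(normalize(data, {"A1": "A", "A2": "A"}))  # {'U1': ['A', 'B']}
```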
It probably could be shortened, seeing how much duplicate code is within
it. All I did was work through it block by block, changing pieces so that
they make a little more sense and are more efficient. I suspect that a
rework of the complete design would be beneficial. I'm just too tired and
busy to think for that long. ;-)

HTH,
JS
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor