Re: Is there a source for detailed, instruction-level performance info?

2016-01-22 Thread Frank Swarbrick
I'm no assembler expert, but how about
L     R15,COUNTER
AFI   R15,1
ST    R15,COUNTER

> Date: Fri, 22 Jan 2016 14:14:49 -0500
> From: i...@panix.com
> Subject: Re: Is there a source for detailed, instruction-level performance 
> info?
> To: IBM-MAIN@LISTSERV.UA.EDU
> 
> In article <000401d140df$6f05a2a0$4d10e7e0$@att.net> Skip wrote:
> 
> > As a newbie, I got curious about the relative speed of these strategies:
> >
> > 1. L R15, COUNTER
> 2. A R15,=F'1'
> > 3. ST R15, COUNTER
> >
> > 1. L R15, COUNTER
> > 2. LA R15,1(,R15) 
> > 3. ST R15, COUNTER
> >
> > I asked my manager, who encouraged me to delve into the manual Shmuel cites.
> > I decided that LA was faster because there was no storage access. The
> > program ran like a banshee. It ran so fast that it was used to benchmark new
> > hardware. Really!
> >
> > It wasn't till later that I pondered a basic flaw. As written, the program
> > could not handle a counter greater than 16M because it ran in 24 bit mode.
> > This was before XA. At the time I wrote it, the data base was comfortably
> > within that limit, but over time, long after I had moved on, I (still)
> > wonder if the application survived and if any counter ever hit the limit.
> > Moral: whether or not size matters, speed is certainly not a simple metric. 
> 
>    L     R15,COUNTER
>    BCTR  R15,0
>    ST    R15,COUNTER
> 
> gives a 2^31 limit, 2^32 if some care is taken when printing the totals, at
> speed comparable to LA.
> 
> --
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
  


Re: Is there a source for detailed, instruction-level performance info?

2016-01-22 Thread Frank Swarbrick
Or dare I suggest:
ASI  COUNTER,1

> Date: Fri, 22 Jan 2016 14:30:02 -0700
> From: frank.swarbr...@outlook.com
> Subject: Re: Is there a source for detailed, instruction-level performance 
> info?
> To: IBM-MAIN@LISTSERV.UA.EDU
> 
> I'm no assembler expert, but how about
> L R15,COUNTER
> AFI   R15,1
> ST    R15,COUNTER
> 
> > Date: Fri, 22 Jan 2016 14:14:49 -0500
> > From: i...@panix.com
> > Subject: Re: Is there a source for detailed, instruction-level performance 
> > info?
> > To: IBM-MAIN@LISTSERV.UA.EDU
> > 
> > In article <000401d140df$6f05a2a0$4d10e7e0$@att.net> Skip wrote:
> > 
> > > As a newbie, I got curious about the relative speed of these strategies:
> > >
> > > 1. L R15, COUNTER
> > 2. A R15,=F'1'
> > > 3. ST R15, COUNTER
> > >
> > > 1. L R15, COUNTER
> > > 2. LA R15,1(,R15) 
> > > 3. ST R15, COUNTER
> > >
> > > I asked my manager, who encouraged me to delve into the manual Shmuel 
> > > cites.
> > > I decided that LA was faster because there was no storage access. The
> > > program ran like a banshee. It ran so fast that it was used to benchmark 
> > > new
> > > hardware. Really!
> > >
> > > It wasn't till later that I pondered a basic flaw. As written, the program
> > > could not handle a counter greater than 16M because it ran in 24 bit mode.
> > > This was before XA. At the time I wrote it, the data base was comfortably
> > > within that limit, but over time, long after I had moved on, I (still)
> > > wonder if the application survived and if any counter ever hit the limit.
> > > Moral: whether or not size matters, speed is certainly not a simple 
> > > metric. 
> > 
> >    L     R15,COUNTER
> >    BCTR  R15,0
> >    ST    R15,COUNTER
> > 
> > gives a 2^31 limit, 2^32 if some care is taken when printing the totals, at
> > speed comparable to LA.
> > 


Re: Is there a source for detailed, instruction-level performance info?

2016-01-22 Thread Charles Mills
Depending on the newness of your hardware.

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Frank Swarbrick
Sent: Friday, January 22, 2016 1:32 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

Or dare I suggest:
ASI  COUNTER,1



Re: Is there a source for detailed, instruction-level performance info?

2016-01-22 Thread Randy Hudson
In article <000401d140df$6f05a2a0$4d10e7e0$@att.net> Skip wrote:

> As a newbie, I got curious about the relative speed of these strategies:
>
> 1. L R15, COUNTER
> 2. A R15,=F'1'
> 3. ST R15, COUNTER
>
> 1. L R15, COUNTER
> 2. LA R15,1(,R15) 
> 3. ST R15, COUNTER
>
> I asked my manager, who encouraged me to delve into the manual Shmuel cites.
> I decided that LA was faster because there was no storage access. The
> program ran like a banshee. It ran so fast that it was used to benchmark new
> hardware. Really!
>
> It wasn't till later that I pondered a basic flaw. As written, the program
> could not handle a counter greater than 16M because it ran in 24 bit mode.
> This was before XA. At the time I wrote it, the data base was comfortably
> within that limit, but over time, long after I had moved on, I (still)
> wonder if the application survived and if any counter ever hit the limit.
> Moral: whether or not size matters, speed is certainly not a simple metric. 

   L     R15,COUNTER
   BCTR  R15,0
   ST    R15,COUNTER

gives a 2^31 limit, 2^32 if some care is taken when printing the totals, at
speed comparable to LA.



Re: Is there a source for detailed, instruction-level performance info?

2016-01-22 Thread Frank Swarbrick
Works on mine!  :-)

> Date: Fri, 22 Jan 2016 13:33:49 -0800
> From: charl...@mcn.org
> Subject: Re: Is there a source for detailed, instruction-level performance 
> info?
> To: IBM-MAIN@LISTSERV.UA.EDU
> 
> Depending on the newness of your hardware.
> 
> Charles
> 
> -Original Message-
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
> Behalf Of Frank Swarbrick
> Sent: Friday, January 22, 2016 1:32 PM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: Re: Is there a source for detailed, instruction-level performance
> info?
> 
> Or dare I suggest:
> ASI  COUNTER,1
> 


Re: Is there a source for detailed, instruction-level performance info?

2016-01-08 Thread Clark Morris
On 4 Jan 2016 07:26:22 -0800, in bit.listserv.ibm-main you wrote:

>Jerry Callen wrote:
>> I'm really looking to make this core as fast as possible.
>Mike Cowlishaw tells some funny stories about optimization initiatives on
>software projects. Is it certain that the place you are optimizing is your
>real bottleneck?

This brings to mind a run-time reduction for a CPU hog at a place where I
was contracting.  The initial request was to optimize a date routine
which was using 10 percent of the CPU.  Measured by SMF records, I was
able to cut the time used by better than 50 percent, using a test
harness that in effect ran a loop of NOPs, a loop of calculations using
the current method, and a loop of calculations using my revised method.
My revision exposed a bug in the using program that had to be fixed, but
it did virtually nothing to reduce the overall run time, something I had
warned the client about before spending too much time on it.  Because of
the measurements that I was careful to take, and the warning, I
maintained credibility.  The changes were not put into general
production with other programs because that would have required testing,
especially since the revisions gave a consistent way of handling error
conditions rather than leaving random garbage.

Later, I revisited the run and cut substantial time from it by modifying
the customer-file VSAM read/write routine to save the results of all
reads and writes by record type: for the random reads it compared keys
to see whether the requested record was already in memory, and for
writes it compared the entire record to see whether it had already been
written.  This saved over a million reads and a substantial number of
writes, cutting both I/O time and CPU time, since the client was using a
VSAM compression package.  Again because of testing considerations, the
revised subroutine was put into production only for this program, which
was in the critical path, where it made a substantial difference in
elapsed time.  In the routine I logged read and write totals by record
type to SYSOUT.

I also was able to substantially reduce run times for the application in
general by making more aggressive use of BLSR than had previously been
done at the shop, measuring the reduction in EXCPs relative to the
overall number of reads and writes with SMF reports.

Clark Morris   



Re: Is there a source for detailed, instruction-level performance info?

2016-01-04 Thread Jerry Callen
Jack J. Woehr wrote:

> Not sure how relevant that this is to mainframe programming, but years ago
> when I designed and executed with a team of nine a data-heavy server in
> Unix optimized for multiple cores, what we found was that reroutable queuing
> of data from one simplistic processing engine to the next (with reservoirs
> for data accumulation) got the most performance.

I'm not sure I grok this. Are you talking about hardware or software (or a 
combination)? I *think* what you are describing is something akin to what 
happens in a Unix pipeline or message queueing systems. Can you provide a 
reference? It sounds interesting, and similar to other "flow-based" programming 
systems (Volcano, IBM's DataStage EE, Expressor Software's parallel engine, 
various parallel database engines, etc.).

That said: in this case, the core algorithm is very small (under 100 
instructions), and already parallelized across multiple threads with limited 
interaction across threads (data parallel style). Pipeline parallelism isn't 
appropriate WITHIN this core, though the core could sensibly be used as a 
component in a larger pipeline-parallel job. I'm really looking to make this 
core as fast as possible.

The effect of other work in the system has been mentioned; I get that. The 
impact of the "hinting" instructions IBM has provided (PFD, BPP, NIAI) will 
obviously be affected by context switches. But IBM presumably provided them for 
a reason.

-- Jerry



Re: Is there a source for detailed, instruction-level performance info?

2016-01-04 Thread Jack J. Woehr

Jerry Callen wrote:

Jack J. Woehr wrote:


Not sure how relevant that this is to mainframe programming, but years ago
when I designed and executed with a team of nine a data-heavy server in
Unix optimized for multiple cores, what we found was that reroutable queuing
of data from one simplistic processing engine to the next (with reservoirs
for data accumulation) got the most performance.

I'm not sure I grok this. Are you talking about hardware or software (or a 
combination)


Software running on multithreaded Linux on 8-16 cores.

We had all sorts of data processing that happened on a steady high-volume stream
of incoming data before the massaged data got parked in the database.


Our software architecture was based on that fine principle of Naval Engineering 
enunciated so many decades ago:

   "All machines are the same. There's a gozinta, a gozouta, and in the middle 
there's a pocketa-pocketa."

The pocketa-pocketas were simple threads on cores. The gozintas and gozoutas 
were MQSeries queues.

Incoming data was dealt into various MQSeries queues.

Threads were doing simple processing steps.

Each thread took from a queue, processed, and then wrote to one of several queues as appropriate, where the next thread 
for the next appropriate processing step did the same etc., until finally written to the database.


Worked well with the Linux multithreading architecture on multiple cores,
keeping all the cores balanced in load.

--
Jack J. Woehr # Science is more than a body of knowledge. It's a way of
www.well.com/~jax # thinking, a way of skeptically interrogating the universe
www.softwoehr.com # with a fine understanding of human fallibility. - Carl Sagan




Re: Is there a source for detailed, instruction-level performance info?

2016-01-04 Thread Jack J. Woehr

Jerry Callen wrote:

I'm really looking to make this core as fast as possible.
Mike Cowlishaw tells some funny stories about optimization initiatives on software
projects. Is it certain that the place you are optimizing is your real bottleneck?


--
Jack J. Woehr # Science is more than a body of knowledge. It's a way of
www.well.com/~jax # thinking, a way of skeptically interrogating the universe
www.softwoehr.com # with a fine understanding of human fallibility. - Carl Sagan



Re: Is there a source for detailed, instruction-level performance info?

2016-01-04 Thread Joel C. Ewing
On 01/04/2016 09:23 AM, Jack J. Woehr wrote:
> Jerry Callen wrote:
>> Jack J. Woehr wrote:
>>
>>> Not sure how relevant that this is to mainframe programming, but
>>> years ago
>>> when I designed and executed with a team of nine a data-heavy server in
>>> Unix optimized for multiple cores, what we found was that reroutable
>>> queuing
>>> of data from one simplistic processing engine to the next (with
>>> reservoirs
>>> for data accumulation) got the most performance.
>> I'm not sure I grok this. Are you talking about hardware or software
>> (or a combination)
>
> Software running on multithreaded Linux on 8-16 cores.
>
> We had all sorts of data processing that happened on an steady
> hi-volume stream of incoming data before the massaged data got parked
> in the database.
>
> Our software architecture was based on that fine principle of Naval
> Engineering enunciated so many decades ago:
>
>"All machines are the same. There's a gozinta, a gozouta, and in
> the middle there's a pocketa-pocketa."
>
> The pocketa-pocketas were simple threads on cores. The gozintas and
> gozoutas were MQSeries queues.
>
> Incoming data was dealt into various MQSeries queues.
>
> Threads were doing simple processing steps.
>
> Each thread took from a queue, processed, and then wrote to one of
> several queues as appropriate, where the next thread for the next
> appropriate processing step did the same etc., until finally written
> to the database.
>
> Worked well with Linux multithreading architecture on multiple cores.
> Keep all the cores balanced in load.
>
The basic underlying concept here seems to be that data that will soon be
touched by some other process should be kept where it can be accessed
more quickly than less-needed data.  While this can be addressed at the
application design level, it has taken all sorts of forms above the
application level on IBM mainframes over the decades.

Regarding keeping loads on processor cores balanced in a Unix
environment,  I suspect this is not relevant to MVS, which I think is
more concerned with trying to minimize unnecessary CP context switching
to optimize processor cache usage, even though this can result in a very
skewed processor utilization under less than 100% load.

Throwing more buffers at very active VSAM files, using LSR VSAM buffer
pools in CICS regions to share in-memory data across many thousands of
CICS transactions, and throwing large amounts of real memory at DB2
buffer pools to retain recently referenced DB2 table data pages in
memory are all techniques that have been used on IBM mainframes under
MVS to keep data that may be needed again soon in main memory without
modifying application code; and the mainframe hardware support for
virtual memory paging, processor cache, DASD controller cache and DASD
device cache all serve a similar purpose at the hardware level.

I am certain that one can obtain higher performance with fewer physical
resources when an application design explicitly and correctly designates
which output data will be quickly needed again by another process, like
the mentioned approach of designing the application around a system of
internal, in-memory queues; but the approach that IBM seems to have
taken in recent years, addressing performance problems at a higher
level, has the advantage that with somewhat more resources you may be
able to adequately improve the performance of many applications without
having to analyze and potentially redesign individual,
imperfectly-designed applications.

The rather dramatic reduction in the relative cost of hardware over the
last decade or so has increased the attractiveness of letting the
Operating System and hardware approximate which data is likely to
benefit from being kept most accessible, especially when needed insight
to do that explicitly may have been lacking or impractical during
application design.

-- 
Joel C. Ewing,Bentonville, AR   jcew...@acm.org 



Re: Is there a source for detailed, instruction-level performance info?

2016-01-03 Thread Alan Altmark
On Mon, 28 Dec 2015 11:02:15 -0600, Jerry Callen  wrote:

>I'm not really after detailed timing. I'm looking for implementation details 
>of the same sort used by compiler writers to guide selection of instruction 
>sequences, where the guiding principle is, "How can I avoid pipeline stalls?" 
>As I noted, several SHARE presentations contain SOME of this information, 
>which I've already benefited from, but I'm looking for more.

Just remember that your machine runs more than one thing at a time, and trying 
to over-engineer a solution may end up being suboptimal.  Nice for 
single-thread performance, but it may be irrelevant in terms of overall 
throughput.  If caches are sufficiently polluted by other LPARs, you may find 
no advantage.  Someday they may even violate the law of causality.  Who 
knows...

Every processor family has different behaviors.   The longevity of the z 
architecture can be attributed to the fact that we don't get overly carried 
away trying to teach the machines to sit up and beg.  At some point, it's "fast 
enough" and making it go faster is just an academic exercise.   

>I'm still hoping that someone will reveal the existence of a document intended 
>for compiler writers...

Folks who participate in PWD may have access to that kind of information -- I 
don't know.  If you participate in Linux GCC development, then you can see what 
IBM loads upstream for new processors.

Alan Altmark
IBM



Re: Is there a source for detailed, instruction-level performance info?

2016-01-03 Thread Alan Altmark
On Wed, 30 Dec 2015 17:08:26 -0600, Jerry Callen  wrote:
>Bob Rogers (no longer at IBM...) 

Bob rejoined IBM a while back, working in z/VM.

Alan Altmark
IBM



Re: Is there a source for detailed, instruction-level performance info?

2016-01-03 Thread Jack J. Woehr

Alan Altmark wrote:

On Mon, 28 Dec 2015 11:02:15 -0600, Jerry Callen  wrote:


"How can I avoid pipeline stalls?"


Not sure how relevant this is to mainframe programming, but years ago when 
I designed and executed
with a team of nine a data-heavy server in Unix optimized for multiple cores, 
what we found was that
reroutable queuing of data from one simplistic processing engine to the next 
(with reservoirs for data accumulation)
got the most performance. It was a much more productive approach in light of 
memory, bus and i/o considerations than
executing an arbitrary design and trying to speed it up on the processor cores 
themselves.

That architecture is more or less how OS/400 worked on its best days, btw.


--
Jack J. Woehr # Science is more than a body of knowledge. It's a way of
www.well.com/~jax # thinking, a way of skeptically interrogating the universe
www.softwoehr.com # with a fine understanding of human fallibility. - Carl Sagan



Re: Is there a source for detailed, instruction-level performance info?

2015-12-30 Thread Tony Harminc
On 30 December 2015 at 18:42, Charles Mills  wrote:
>On 30 December 2015 at 18:08, Jerry Callen  wrote:
>> How about it, IBM? Surely there must be someone in Poughkeepsie who wants to
>> visit San Antonio in March? :-)

> Or possibly in Toronto!

I'll have you know we were all convinced Toronto wasn't going to have
a winter this year! It was golfing weather until two days ago and ski
places a bit north of here had given up and re-opened their summer
hiking and golf programs. Well, ahem... A green Christmas and a
frozen-clods-of-ice New Year. I guess San Antonio does sound
appealing.

Tony H.



Re: Is there a source for detailed, instruction-level performance info?

2015-12-30 Thread Tony Harminc
On 30 December 2015 at 17:10, Charles Mills  wrote:
> I would assume there is some sort of a compiler/hardware architecture
> liaison group within IBM. I would bet that if someone from that group were
> to put together a SHARE presentation called "Write Machine Code Like a
> Compiler -- How to Write the Fastest Code Possible for the z13 (z14,
> whatever)" that it would be a big hit.

I'm sure it would be. I've wondered over the years just how the
compiler and architecture people interact, and what influences what.
Clearly there's been a change in that many (most?) new hardware
features seem to be geared quite obviously to language features, which
surely wasn't the case in e.g. 1964.

Tony H.



Re: Is there a source for detailed, instruction-level performance info?

2015-12-30 Thread Jerry Callen
Charles Mills wrote:

> I would assume there is some sort of a compiler/hardware architecture
> liaison group within IBM. I would bet that if someone from that group were
> to put together a SHARE presentation called "Write Machine Code Like a
> Compiler -- How to Write the Fastest Code Possible for the z13 (z14,
> whatever)" that it would be a big hit.

I second that!

Bob Rogers (no longer at IBM...) ran a series of "How Do You Do What You Do 
When You're a (z10/z196/z13) CPU?" sessions over the years. But they were more 
focused on overall characteristics of the system, not on specifically how an 
assembly language programmer can best exploit the hardware. A session that 
presumes that background knowledge and really gets down into the weeds would be 
great. And maybe follow it up with a BOF for assembly weenies to address 
specific questions.

How about it, IBM? Surely there must be someone in Poughkeepsie who wants to 
visit San Antonio in March? :-)

-- Jerry



Re: Is there a source for detailed, instruction-level performance info?

2015-12-30 Thread Charles Mills
Or possibly in Toronto!

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Jerry Callen
Sent: Wednesday, December 30, 2015 3:08 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance info?

Charles Mills wrote:

> I would assume there is some sort of a compiler/hardware architecture 
> liaison group within IBM. I would bet that if someone from that group 
> were to put together a SHARE presentation called "Write Machine Code 
> Like a Compiler -- How to Write the Fastest Code Possible for the z13 
> (z14, whatever)" that it would be a big hit.

I second that!

Bob Rogers (no longer at IBM...) ran a series of "How Do You Do What You Do 
When You're a (z10/z196/z13) CPU?" sessions over the years. But they were more 
focused on overall characteristics of the system, not on specifically how an 
assembly language programmer can best exploit the hardware. A session that 
presumes that background knowledge and really gets down into the weeds would be 
great. And maybe follow it up with a BOF for assembly weenies to address 
specific questions.

How about it, IBM? Surely there must be someone in Poughkeepsie who wants to 
visit San Antonio in March? :-)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-30 Thread Charles Mills
I would assume there is some sort of a compiler/hardware architecture
liaison group within IBM. I would bet that if someone from that group were
to put together a SHARE presentation called "Write Machine Code Like a
Compiler -- How to Write the Fastest Code Possible for the z13 (z14,
whatever)" that it would be a big hit.

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Jim Mulder
Sent: Monday, December 28, 2015 9:57 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

> An example: which of these code sequences do you suppose runs faster?
> 
>  la rX,0(rI,rBase)     rX -> array[i]
>  lg rY,0(,rX)          rY = array[i]
>  agsi 0(rX),1          ++array[i]
> * Now do something with rY
> 
> vs:
>  lg rY,0(rI,rBase)     rY = array[i]
>  la rX,1(,rY)          rX = rY + 1
>  stg rX,0(rI,rBase)    effect is ++array[i]
> * Now do something with rY
> 
> The first is substantially faster. I would have GUESSED that the 
> second would be faster, since I need the value in rY anyway. (I'm in 
> 64-bit mode, so using "LOAD ADDRESS" for the increment is safe...)

  "Substantially faster" is probably a cache effect.  Assuming a cache miss
on array[i], in sequence 2, the LG will miss and install the cache line
shared, and then the STG will need to do an upgrade to exclusive.  The AGSI
in sequence 1 will miss and install the cache line
exclusive, avoiding the upgrade to exclusive.   Adding a PFD 2,0(rI,rBase)
before the LG in sequence 2 may make these sequences perform similarly.

 Also, in sequence 1, changing lg rY,0(,rX) to lg rY,0(rI,rBase) may avoid
some Address Generation Interlock effects (although various machines have
various AGI bypasses for various instructions).  And it may just transfer
some of the AGI effect from the LG down to the AGSI. 



Re: Is there a source for detailed, instruction-level performance info?

2015-12-28 Thread Bernd Oppolzer

As Tom has noted, the most dramatic performance enhancements typically
come from a change in strategy or algorithm used.  In my experience you
get better results by looking for ways to accomplish the end result by
having the program do fewer actions rather than concentrating on
micro-optimizing the individual actions.



Very true. As others have pointed out, very often the problem is sequential
search of tables considered to be small which turn out to be large in real
situations.

My example is the DB2 CAF interface module DSNALI: there is a loop that
checks whether a certain module has already been loaded by sequentially
looking up the CDE list, and this is done on every single DB2 action, for
example a fetch of a DB2 row. This works in a batch environment where the
number of modules is low. But we had, at a customer's site, a situation
where DSNALI was used in an environment with some 5000 modules in the CDE
list. The CPU time in the DSNALI loop added up to more than 5 % of the
overall CPU, although this was a region with heavy math load; DSNALI should
normally be invisible.

We talked with IBM about this, but IBM didn't fix it - and we were not
allowed to fix it at the customer's site (it's IBM software). The solution
in the end was: we changed all those processes to RRSAF, that is, DSNRLI;
DSNRLI had no such problem.

Kind regards

Bernd





Re: Is there a source for detailed, instruction-level performance info?

2015-12-28 Thread Shmuel Metz (Seymour J.)
In <3361710c8fdd49d9a5ba76a61a993...@su806104.ad.ing.net>, on
12/28/2015
   at 06:15 AM, "Windt, W.K.F. van der (Fred)"
 said:

>And on newer machines (with the general-instructions-extension) you
>can simply do:

>  ASI COUNTER,1

>I assume it is faster than the sequence of three instructions but
>have not verified that.

I would be very much surprised if it were not faster. Also, it makes
the code clearer.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see  
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-28 Thread Jerry Callen
Back from Christmas break to a lot of responses - thanks to all. Summarizing 
the responses thus far:

* Select and/or tune your algorithm first.

Check.

* Tuning for a specific machine is a bad idea because you'll have to retune for 
every new machine.

Check. This code is sufficiently critical that it's worth doing 
platform-specific tuning. Otherwise I wouldn't be coding in assembler anyway. 
And it's small enough to be manageable.

* The compilers are very good and can probably do better than you can.

Usually, I agree. On this small kernel, though, I'm already beating xlc and (by 
a lesser margin) gcc. Looking at the code they generate has been enlightening; 
gcc is especially aggressive on loop unrolling, and that's on my list of  
things to try.

* Modern machines are too complex to enable detailed timing formulae to be 
published.

I'm not really after detailed timing. I'm looking for implementation details of 
the same sort used by compiler writers to guide selection of instruction 
sequences, where the guiding principle is, "How can I avoid pipeline stalls?" 
As I noted, several SHARE presentations contain SOME of this information, which 
I've already benefited from, but I'm looking for more.

* Just write some timing loops and figure it out yourself.

I've been doing that, using mostly the algorithm itself plus small timing 
experiments (which can be deceiving since they aren't in the context of the 
rest of the algorithm). I've already managed to squeeze out about a 15% 
improvement over my initial code.

An example: which of these code sequences do you suppose runs faster?

 la rX,0(rI,rBase)   rX -> array[i]
 lg rY,0(,rX)           rY = array[i]
 agsi 0(rX),1           ++array[i]
* Now do something with rY

vs:
 lg rY,0(rI,rBase)   rY = array[i]
 la rX,1(,rY)           rX = rY + 1
 stg rX,0(rI,rBase)  effect is ++array[i]
* Now do something with rY

The first is substantially faster. I would have GUESSED that the second would 
be faster, since I need the value in rY anyway. (I'm in 64-bit mode, so using 
"LOAD ADDRESS" for the increment is safe...)

* Suggestions regarding branch prediction.

Spot on, and I already did that, too. Luckily branch prediction is one of the 
few performance characteristics that's actually semi-architectural (see the 
description of "BRANCH PREDICTION PRELOAD" in the Principles of Operation). 
Consider these two code sequences:

loop ds 0h
*Compute a value in r1
 cgrjne r2,r1,loop   continue if not done
vs:

loop ds 0h
*Compute a value in r1
 cgrje r2,r1,loopexit   exit if done
 j loop
loopexit ds 0h

The second sequence is MUCH faster - though I haven't yet tried using "BRANCH 
PREDICTION PRELOAD" to alter the default prediction.
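Compilers express the same layout decision with branch-probability hints. A rough C sketch, assuming GCC/Clang's __builtin_expect (the function and data here are hypothetical, just to show the shape):

```c
#include <stddef.h>

/* A search loop whose exit test is rarely true. Hinting it unlikely
 * (GCC/Clang __builtin_expect) encourages the compiler to keep the
 * loop body on the fall-through path, like the cgrje/j arrangement
 * above, rather than the backward conditional branch. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Returns the index of the first element equal to target, or n. */
size_t find_first(const long *a, size_t n, long target)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (UNLIKELY(a[i] == target))  /* rare exit, predicted not-taken */
            break;
    }
    return i;
}
```

The hint changes code layout only; the result is the same either way.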

* Cache issues

This kernel unfortunately has little (data) cache locality; it's inherent in 
the problem. What I HAVE found is that using "PREFETCH DATA" and specifying 
store intent before the first access DOES help a bit. I haven't fiddled with 
"NEXT INSTRUCTION ACCESS INTENT" yet; that seems like a potentially enormous 
rathole. Using MVCL isn't an option; my data elements are small, aligned 
multiples of doublewords, and I determined very early on that a load/store loop 
way outperforms both MVCL and MVC.

On the I-cache side: the kernel is very small (a few dozen instructions) and 
chugs along for a good long time once it's called, so it's undoubtedly in 
cache. Following David Bond's suggestion to align the tops of loops on 16-byte 
boundaries helped a bit.

* The usual IBM-MAIN drift, reminiscences of old machines/code/operating 
systems/management practices/etc.

Check. :-)

I'm still hoping that someone will reveal the existence of a document intended 
for compiler writers...

-- Jerry



Re: Is there a source for detailed, instruction-level performance info?

2015-12-28 Thread Jim Mulder
> An example: which of these code sequences do you suppose runs faster?
> 
>  la rX,0(rI,rBase)   rX -> array[i]
>  lg rY,0(,rX)           rY = array[i]
>  agsi 0(rX),1           ++array[i]
> * Now do something with rY
> 
> vs:
>  lg rY,0(rI,rBase)   rY = array[i]
>  la rX,1(,rY)           rX = rY + 1
>  stg rX,0(rI,rBase)  effect is ++array[i]
> * Now do something with rY
> 
> The first is substantially faster. I would have GUESSED that the 
> second would be faster, since I need the value in rY anyway. (I'm in
> 64-bit mode, so using "LOAD ADDRESS" for the increment is safe...)

  "Substantially faster" is probably a cache effect.  Assuming a cache 
miss on array[i], in sequence 2, the LG will miss and install
the cache line shared, and then the STG will need to do an upgrade to
exclusive.  The AGSI in sequence 1 will miss and install the cache line
exclusive, avoiding the upgrade to exclusive.   Adding a PFD 2,0(rI,rBase)
before the LG in sequence 2 may make these sequences perform similarly.
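For what it's worth, GCC and Clang expose a similar hint as __builtin_prefetch, whose second argument (1 = write intent) plays roughly the role of a PFD with store intent. A hedged sketch of the read-modify-write pattern under discussion (names are made up; the prefetch affects timing only, never the results):

```c
#include <stddef.h>

/* Walk an array, using each value and incrementing it in place.
 * Prefetching with *write* intent before the first touch asks for the
 * cache line to be installed exclusive, so the later store needs no
 * shared-to-exclusive upgrade. */
long sum_and_bump(long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i], 1);  /* hint only; 1 = write intent */
        long v = a[i];                 /* like: lg rY,...   */
        a[i] = v + 1;                  /* like: agsi ...,1  */
        sum += v;                      /* "do something with rY" */
    }
    return sum;
}
```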

  Also, in sequence 1, changing lg rY,0(,rX) to 
lg rY,0(rI,rBase) may avoid some Address Generation Interlock 
effects (although various machines have various AGI bypasses for various
instructions). And it may just transfer some of the AGI effect from the 
LG down to the AGSI. 
 
Jim Mulder   z/OS System Test   IBM Corp.  Poughkeepsie,  NY





Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Skip Robinson
Sweet. This would have saved me hours of sleeplessness--or at least fitful 
sleep--in the ensuing decades. ;-) If the business ever turned out to be so 
successful as to exceed two billion records of any type, this report would have 
been a joy to (re)write yet again. 

.
.
.
J.O.Skip Robinson
Southern California Edison Company
Electric Dragon Team Paddler 
SHARE MVS Program Co-Manager
323-715-0595 Mobile
jo.skip.robin...@att.net
jo.skip.robin...@gmail.com


> -Original Message-
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU]
> On Behalf Of Tony Harminc
> Sent: Sunday, December 27, 2015 12:14 PM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: [Bulk] Re: Is there a source for detailed, instruction-level 
> performance
> info?
> 
> On 27 December 2015 at 14:47, Skip Robinson <jo.skip.robin...@att.net>
> wrote:
> > As a newbie, I got curious about the relative speed of these strategies:
> >
> > 1. L R15, COUNTER
> > 2. A R15,=F'1'
> > 3. ST R15, COUNTER
> >
> > 1. L R15, COUNTER
> > 2. LA R15,1(,R15)
> > 3. ST R15, COUNTER
> >
> > I asked my manager, who encouraged me to delve into the manual Shmuel
> cites.
> > I decided that LA was faster because there was no storage access. The
> > program ran like a banshee. It ran so fast that it was used to
> > benchmark new hardware. Really!
> >
> > It wasn't till later that I pondered a basic flaw. As written, the
> > program could not handle a counter greater than 16M because it ran in 24 bit
> mode.
> 
> There's a third model for this very common operation:
> 
> LA R15,1
> A  R15,COUNTER
> ST R15,COUNTER
> 
> This handles the full 31-bit range and maintains the advantage of the LA not
> referencing storage, but brings the ST closer in time/cycles/etc. to the A, 
> which
> could conceivably make it wait for the result of the A, whereas the LA could
> perhaps have its result ready faster.
> 
> Nonetheless it in some sense looks "wrong", I suppose because we are thinking
> of COUNTER as needing to have something added to it, and not
> 1 as needing to have something added to *it*, however much we know addition
> to be commutative.
> 
> > Moral: whether or not size matters, speed is certainly not a simple metric.
> 
> Indeed.
> 
> Tony H.



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Tom Marchant
On Sun, 27 Dec 2015 13:21:23 -0800, Anne & Lynn Wheeler wrote:

>later, newer memory for 370/168 was less expensive ... and started to
>see four mbytes as much more common ... aka four mbytes on a 370/165 would
>have meant that a typical MVT customer could have gotten 16 regions ... w/o
>having to resort to virtual memory ... but the decision had already been
>made.

Sure, but even before you got to 16 regions, you had the problem of different 
regions having to use the same storage protection key. As a result, the system 
could no longer ensure that one region was isolated from other regions.

-- 
Tom Marchant



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Anne & Lynn Wheeler
shmuel+ibm-m...@patriot.net (Shmuel Metz, Seymour J.) writes:
> We ran more than that, plus TSO, on a 2 MiB machine.

IBM executives were looking at 370/165 ... where typical customer had
1mbyte ... in part because 165 real memory was very expensive ... and
typical regions were such that they only got four in 1mbytes (after
system real storage requirement)

later, newer memory for 370/168 was less expensive ... and started to
see four mbytes as much more common ... aka four mbytes on a 370/165 would
have meant that a typical MVT customer could have gotten 16 regions ... w/o
having to resort to virtual memory ... but the decision had already been
made.

basic initial transition from os/mvt to os/vs2 svs was MVT laid out in a
single 16mbyte virtual address space ... and a little bit of code to build
the segment/page tables and handle page faults. The biggest code hit
was adding channel program translation in EXCP ... code initially
copied from CP67 CCWTRANS channel program translation.

prior reference/discussion on the justification for 370 virtual memory
http://www.garlic.com/~lynn/2011d.html#73 Multiple Virtual Memory

later transition to os/vs2 MVS with multiple virtual address spaces
... had other problems. The os/360 MVT heritage was heavily based on a
pointer passing API paradigm ... which carried over with the move to MVS.
The first accommodation was putting an 8mbyte image of the MVT kernel into
every application virtual address space ... leaving only 8mbytes (out of 16)
for application use. Then, because subsystems were now in their own
(different) virtual address spaces ... a way was needed for passing
parameters & data back and forth between applications and subsystems
using the pointer passing API. The result was the "common segment" ... a one
mbyte area that also appeared in every virtual address space ... which
could be used to pass arguments/data back and forth between applications and
subsystems (leaving only 7mbytes for applications). The next issue was that
demand for the common segment was somewhat proportional to the number of
concurrent applications and subsystems ... so the common segment area
became the common system area (CSA) as requirements exceeded 1mbyte. Into
the 3033 era, larger operations were pushing CSA to 4&5 mbytes and
threatening to push it to 8mbytes ... leaving no space at all for
applications (of course with the MVS kernel at 8mbytes and CSA at 8mbytes,
there wouldn't be any left for applications ... which drops the demand
for CSA to zero).

Part of the solution to address the OS/360 MVT pointer passing API
problem was included in the original XA architecture (later referred to as
811 ... because the documents were dated Nov1978): access registers ... and
the ability to address/access multiple address spaces. To try and alleviate
the CSA explosion in the 3033 time-frame ... a subset of access registers
was retrofitted to the 3033 as dual-address space mode ... but it provided
only limited help since it still required updating all the subsystems to
support dual-address space mode (instead of CSA).

trivia: the person responsible for retrofitting dual-address space mode to
the 3033 ... later leaves IBM for another vendor and later shows up as one
of the people behind HP Snake and later Itanium.

-- 
virtualization experience starting Jan1968, online at home since Mar1970



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Shmuel Metz (Seymour J.)
In <87bn9fwuo0@garlic.com>, on 12/24/2015
   at 10:47 AM, Anne & Lynn Wheeler  said:

>As a result, a typical 1mbyte 370/165 would only have four regions.

We ran more than that, plus TSO, on a 2 MiB machine.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see  
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Shmuel Metz (Seymour J.)
In , on 12/24/2015 at 02:31 PM, Mike Schwab said:

>https://en.wikipedia.org/wiki/IBM_7030_Stretch
>First computer to implement: Multiprogramming, memory protection,
>generalized interrupts, the eight-bit byte,

FSVO eight equal to all of 1-8.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see  
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Shmuel Metz (Seymour J.)
In <013c01d13e5a$89687c80$9c397580$@mcn.org>, on 12/24/2015
   at 06:51 AM, Charles Mills  said:

>This is true so much that the z13 processors implement a kind of 
>"internal multiprogramming"

IBM calls it Simultaneous Multi-threading, except in PoOps where it is
just "Multithreading facility".

>so that one CPU internal thread can do something useful while 
>another thread is waiting for a storage reference.

Or waiting for other resources.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see  
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Shmuel Metz (Seymour J.)
In , on 12/25/2015
   at 01:23 AM, "Robert A. Rosenberg"  said:

>This story (and the others) reminds me of an incident that occurred 
>early in my programming life.

The classic example is Multics. Early on they redesigned the file
system, replacing some Assembly Language for Multics (ALM) code with
PL/I code; the new version ran faster. As always, an efficient
algorithm trumps micro-optimization. Not that you shouldn't try to write
good code as well ;-)

I ran into this when I had to rewrite an input routine for a PC
application written in Ada. The old version was assembler, and called
BDOS for each character. The new version was in Ada and directly
copied the data from the screen buffer. Major speedup.

>This worked until the Bean Counters wanted to replace the 2540 
>with a 2501

Ah, bean counters. You lucked out; they can be very expensive if they
don't know what they're doing.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see  
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Shmuel Metz (Seymour J.)
In <567b4a30.8050...@yahoo.com>, on 12/23/2015
   at 08:28 PM, Thomas Kern
<0041d919e708-dmarc-requ...@listserv.ua.edu> said:

>Perhaps what might be useful would be an assembler program to run
>loops  of individual instructions and output some timing
>information.

That would work on a simpler machine. Even the timings in, e.g.,
GA22-7011-4, IBM System/370 Model 158 Functional Characteristics, were
too complex.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see  
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Tony Harminc
On 27 December 2015 at 14:47, Skip Robinson  wrote:
> As a newbie, I got curious about the relative speed of these strategies:
>
> 1. L R15, COUNTER
> 2. A R15,=F'1'
> 3. ST R15, COUNTER
>
> 1. L R15, COUNTER
> 2. LA R15,1(,R15)
> 3. ST R15, COUNTER
>
> I asked my manager, who encouraged me to delve into the manual Shmuel cites.
> I decided that LA was faster because there was no storage access. The
> program ran like a banshee. It ran so fast that it was used to benchmark new
> hardware. Really!
>
> It wasn't till later that I pondered a basic flaw. As written, the program
> could not handle a counter greater than 16M because it ran in 24 bit mode.

There's a third model for this very common operation:

LA R15,1
A  R15,COUNTER
ST R15,COUNTER

This handles the full 31-bit range and maintains the advantage of the
LA not referencing storage, but brings the ST closer in
time/cycles/etc. to the A, which could conceivably make it wait for
the result of the A, whereas the LA could perhaps have its result
ready faster.

Nonetheless it in some sense looks "wrong", I suppose because we are
thinking of COUNTER as needing to have something added to it, and not
1 as needing to have something added to *it*, however much we know
addition to be commutative.

> Moral: whether or not size matters, speed is certainly not a simple metric.

Indeed.

Tony H.



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Skip Robinson
Ahead of New Year's resolution mania, I have to confess to a questionable
decision dating back to my absolute first IT job as a programmer trainee.
The company was TRW Credit Data, ancestor of Experian. The application was
Business Credit, which performed B2B reporting analogous to consumer
reporting.. My task was to (re)write a report on all the record types in the
data base. The procedure was simple: read a record, determine the type,
increment the appropriate counter, then read the next record. 

As a newbie, I got curious about the relative speed of these strategies:

1. L R15, COUNTER
2. A R15,=F'1'
3. ST R15, COUNTER

1. L R15, COUNTER
2. LA R15,1(,R15) 
3. ST R15, COUNTER

I asked my manager, who encouraged me to delve into the manual Shmuel cites.
I decided that LA was faster because there was no storage access. The
program ran like a banshee. It ran so fast that it was used to benchmark new
hardware. Really!

It wasn't till later that I pondered a basic flaw. As written, the program
could not handle a counter greater than 16M because it ran in 24 bit mode.
This was before XA. At the time I wrote it, the data base was comfortably
within that limit, but over time, long after I had moved on, I (still)
wonder if the application survived and if any counter ever hit the limit.
Moral: whether or not size matters, speed is certainly not a simple metric. 
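The 24-bit wrap is easy to demonstrate: in 24-bit addressing mode, LA keeps only bits 8-31 of its result, so a counter bumped with LA R15,1(,R15) silently wraps at 16M. A small C emulation of a single increment (the mask is the point; the function name is made up):

```c
#include <stdint.h>

/* Emulate one LA-based increment under 24-bit addressing: the result
 * is truncated to 24 bits, so the counter wraps at 2^24 (16M). */
uint32_t la_increment_amode24(uint32_t counter)
{
    return (counter + 1) & 0x00FFFFFFu;  /* LA keeps only bits 8-31 */
}
```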

.
.
.
J.O.Skip Robinson
Southern California Edison Company
Electric Dragon Team Paddler 
SHARE MVS Program Co-Manager
323-715-0595 Mobile
jo.skip.robin...@att.net
jo.skip.robin...@gmail.com

> -Original Message-
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU]
> On Behalf Of Shmuel Metz (Seymour J.)
> Sent: Thursday, December 24, 2015 12:26 PM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: [Bulk] Re: Is there a source for detailed, instruction-level
performance
> info?
> 
> In <567b4a30.8050...@yahoo.com>, on 12/23/2015
>at 08:28 PM, Thomas Kern
> <0041d919e708-dmarc-requ...@listserv.ua.edu> said:
> 
> >Perhaps what might be useful would be an assembler program to run loops
> >of individual instructions and output some timing information.
> 
> That would work on a simpler machine. Even the timings in, e.g.,
> GA22-7011-4, IBM System/370 Model 158 Functional Characteristics, were
> too complex.
> 
> --
>  Shmuel (Seymour J.) Metz, SysProg and JOAT
>  ISO position; see <http://patriot.net/~shmuel/resume/brief.html>
> We don't care. We don't have to care, we're Congress.
> (S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Shmuel Metz (Seymour J.)
In , on 12/27/2015 at 03:14 PM, Tony Harminc said:

>There's a third model for this very common operation:

>LA R15,1
>A  R15,COUNTER
>ST R15,COUNTER

If you're that concerned about speed:

      LA    R11,1
 LOOP GET   foo
      logic to determine type
      L     R1,COUNTER
      AR    R1,R11
      ST    R1,COUNTER
      B     LOOP
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see  
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-27 Thread Windt, W.K.F. van der (Fred)
> >LA R15,1
> >A  R15,COUNTER
> >ST R15,COUNTER
>
> If you're that concerned about speed:
>
>       LA    R11,1
>  LOOP GET   foo
>       logic to determine type
>       L     R1,COUNTER
>       AR    R1,R11
>       ST    R1,COUNTER
>       B     LOOP

And on newer machines (with the general-instructions-extension) you can simply 
do:

  ASI COUNTER,1

I assume it is faster than the sequence of three instructions but have not 
verified that.

Fred!





Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Anne & Lynn Wheeler
mike.a.sch...@gmail.com (Mike Schwab) writes:
> If branch predicting is a big hang up, the obvious solution is to
> start processing all possible outcomes then keep the one that is
> actually taken.  I. E.  B OUTCOME(R15) where R15 is a return code of
> 0,4,8,12,16.

aka, speculative execution ... instructions executed on path ... that
is not actually taken ... are not committed
https://en.wikipedia.org/wiki/Speculative_execution
and
https://en.wikipedia.org/wiki/Speculative_execution#Eager_execution

Eager execution is a form of speculative execution where both sides of
the conditional branch are executed; however, the results are committed
only if the predicate is true. With unlimited resources, eager execution
(also known as oracle execution) would in theory provide the same
performance as perfect branch prediction. With limited resources eager
execution should be employed carefully since the number of resources
needed grows exponentially with each level of branches executed
eagerly.[7]

... snip ...

https://en.wikipedia.org/wiki/Eager_evaluation

-- 
virtualization experience starting Jan1968, online at home since Mar1970



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Charles Mills
Thanks. It's held up reasonably well I think considering that it is 42 years
old next month. (Other than the references to specific products!)

I did not have a link but the Google do: 
https://books.google.com/books?id=q_IffYrk4VEC=PA18 

I find reading the Computerworld -- especially the ads -- to be fascinating.
Terminals were a big deal!

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Frank Swarbrick
Sent: Thursday, December 24, 2015 4:45 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

Interesting article.  Do you have a link to the article it appears to be a
response to?



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Joel C. Ewing
On 12/24/2015 12:52 PM, Tom Brennan wrote:
> Farley, Peter x23353 wrote:
>> So what is an ordinary programmer to do?
>
> Years ago I guess I had nothing to do so I wrote a program that hooked
> into various LINK/LOAD SVC's and recorded the load module name (like
> Isogon and TADz do).  That huge pile of data ended up on a tape and I
> wrote some code to scan the tape for a particular module, to find out
> who was using it and how often.
>
> The scan took forever, so I worked quite a bit trying to make the main
> loop more efficient.  Co-worker Stuart Holland looked at my logic and
> quickly switched it to using a hashing lookup algorithm, making it run
> probably a thousand times faster.  Oops :)
>
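The kind of hashed lookup Stuart switched to might look like this in C: a small open-addressing table keyed on (up to) 8-character module names, so each event costs a few probes instead of a scan over every name seen so far. Entirely a sketch: the sizing, hash function, and names are hypothetical, and it assumes the table never fills.

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 256            /* hypothetical sizing; power of two */

struct slot { char name[9]; long count; };  /* 8-char module name + NUL */

static unsigned hash_name(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h & (NBUCKETS - 1);
}

/* One short probe sequence per event instead of a full scan. */
void bump(struct slot table[], const char *name)
{
    unsigned i = hash_name(name);
    for (;;) {
        if (table[i].name[0] == '\0') {          /* empty slot: claim it */
            strncpy(table[i].name, name, 8);
            table[i].count = 1;
            return;
        }
        if (strncmp(table[i].name, name, 8) == 0) {
            table[i].count++;                    /* found: just bump */
            return;
        }
        i = (i + 1) & (NBUCKETS - 1);            /* linear probing */
    }
}

long get_count(const struct slot table[], const char *name)
{
    unsigned i = hash_name(name);
    for (;;) {
        if (table[i].name[0] == '\0')
            return 0;                            /* never seen */
        if (strncmp(table[i].name, name, 8) == 0)
            return table[i].count;
        i = (i + 1) & (NBUCKETS - 1);
    }
}
```

The table must be zeroed before first use.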
As Tom has noted, the most dramatic performance enhancements typically
come from a change in strategy or algorithm used.  In my experience you
get better results by looking for ways to accomplish the end result by
having the program do fewer actions rather than concentrating on
micro-optimizing the individual actions.

You may not be able to predict how to micro-manage the mix of
instructions in a loop or a highly-used section of code to optimize its
performance, but if by changing strategy you can significantly reduce
the required number of times the loop or section of code is executed it
is reasonable to always expect better performance.

Although obviously some instructions require more resources than others,
in general if you can reduce the total number of instructions executed
without dramatically changing the program's instruction mix, that too
must have a positive impact.  If you can reduce a program's references
to storage without also increasing the number of instructions the
program executes, that also always has a positive effect.

No matter how fast a processor is, I/O operations are always expensive
both in real time and CP time.  An approach that requires significantly
fewer external records to be read or written is always a significant
improvement.  If you can't reduce the number of logical record
read/writes, perhaps changing buffer management strategies can
significantly reduce the number of physical read/writes and still get
orders of magnitude performance gains.

If writing a highly used section of code directly in assembler, one can
strive for a minimal number of instructions by choosing efficient data
representations for the task at hand and minimize storage references by
using registers wisely, and mostly that gets you close enough for local
code optimization without worrying about whether instruction X might
execute faster than instruction Y.

-- 
Joel C. Ewing,Bentonville, AR   jcew...@acm.org 



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Richard Pinion
Don't use zoned decimal for subscripts or counters, rather use indexes for
subscripts and binary for counter type variables.  And when using conditional
branching, try to code so as to make the branch the exception rather than the
rule.  For large table lookups, use a binary search as opposed to a sequential
search.  

These simple coding techniques can also reduce CPU time.
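The binary-search suggestion is exactly what the C library's bsearch gives you over a sorted table. A minimal sketch (the key type and wrapper are illustrative):

```c
#include <stdlib.h>

/* Comparator for bsearch over long keys. */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* O(log n) probes into a sorted table, vs. O(n) for a sequential scan.
 * Returns a pointer to the matching entry, or NULL if absent. */
long *table_lookup(long key, long *table, size_t n)
{
    return bsearch(&key, table, n, sizeof *table, cmp_long);
}
```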



--- jcew...@acm.org wrote:

From: "Joel C. Ewing" <jcew...@acm.org>
To:   IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance info?
Date: Thu, 24 Dec 2015 15:53:42 -0600

On 12/24/2015 12:52 PM, Tom Brennan wrote:
> Farley, Peter x23353 wrote:
>> So what is an ordinary programmer to do?
>
> Years ago I guess I had nothing to do so I wrote a program that hooked
> into various LINK/LOAD SVC's and recorded the load module name (like
> Isogon and TADz do).  That huge pile of data ended up on a tape and I
> wrote some code to scan the tape for a particular module, to find out
> who was using it and how often.
>
> The scan took forever, so I worked quite a bit trying to make the main
> loop more efficient.  Co-worker Stuart Holland looked at my logic and
> quickly switched it to using a hashing lookup algorithm, making it run
> probably a thousand times faster.  Oops :)
>
As Tom has noted, the most dramatic performance enhancements typically
come from a change in strategy or algorithm used.  In my experience you
get better results by looking for ways to accomplish the end result by
having the program do fewer actions rather than concentrating on
micro-optimizing the individual actions.

You may not be able to predict how to micro-manage the mix of
instructions in a loop or a highly-used section of code to optimize its
performance, but if by changing strategy you can significantly reduce
the required number of times the loop or section of code is executed it
is reasonable to always expect better performance.

Although obviously some instructions require more resources than others,
in general if you can reduce the total number of instructions executed
without dramatically changing the program's instruction mix, that too
must have a positive impact.  If you can reduce a program's references
to storage without also increasing the number of instructions the
program executes, that also always has a positive effect.

No matter how fast a processor is, I/O operations are always expensive
both in real time and CP time.  An approach that requires significantly
fewer external records to be read or written is always a significant
improvement.  If you can't reduce the number of logical record
read/writes, perhaps changing buffer management strategies can
significantly reduce the number of physical read/writes and still get
orders of magnitude performance gains.

If writing a highly used section of code directly in assembler, one can
strive for a minimal number of instructions by choosing efficient data
representations for the task at hand and minimize storage references by
using registers wisely, and mostly that gets you close enough for local
code optimization without worrying about whether instruction X might
execute faster than instruction Y.

-- 
Joel C. Ewing,Bentonville, AR   jcew...@acm.org 








Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Robert A. Rosenberg
At 15:53 -0600 on 12/24/2015, Joel C. Ewing wrote about Re: Is there 
a source for detailed, instruction-level perfo:



As Tom has noted, the most dramatic performance enhancements typically
come from a change in strategy or algorithm used.  In my experience you
get better results by looking for ways to accomplish the end result by
having the program do fewer actions rather than concentrating on
micro-optimzing the individual actions


This story (and the others) reminds me of an incident that occurred 
early in my programming life.


We had an application that read Column Binary data on a 2540 Card 
Reader. The gotcha was that the card was not pure Column Binary but 
half CB with the other half being normal EBCDIC. The program would 
read the card as CB but not eject it leaving the card image in the 
reader's buffer. It would then do a 2nd read (from the buffer) as 
EBCDIC with the bad format flag on and eject the card. The result was 
a 160 byte image of the card as CB and a 80 byte EBCDIC image with 
the CB columns as random junk.


This worked until the Bean Counters wanted to replace the 2540 with a 
2501 (since we were no longer punching output cards so did not need 
the punch capability of the 2540). Since the 2501 was an unbuffered 
Read-and-Eject device you got one crack at reading the card so it 
could only be read as CB (unless we did a second pass of the deck to 
get the EBCDIC data). It was decided to have the program take the CB 
image and convert the EBCDIC section from CB. The task of writing the 
conversion routine was given to another programmer who built a table 
of all the 256 2-byte bit patterns that represented the holes in the 
card. His program would then do a search of the table one column at a 
time (I do not remember if this was a Binary or Hash search). In any 
case the program was slow/inefficient.


I was asked to look at his code and see if I could speed up his code. 
I was able to do so by starting from scratch by using a few TRs and 
an OC. The basic idea was to use a TR to separate the top 6 rows from 
the bottom 6 rows of the card image in the CB buffer. Then TR each of 
the two sets of rows to form a 5-bit map showing whether Row 12/11/0/8/9 
was punched or not and a 3-bit binary number from 0 to 7 showing 
which row in the 1-7 range was punched (ie: Row 5 yielded 101). 
OC'ing the top row over the bottom yielded a value showing which 
punches were on the card. Running the result through one final TR 
converted the Card-Image EBCDIC into the Internal Mapping.


Using the same sequence and a different set of TR Tables, and 
replacing the final TR with a TRT that checked for more than one bit 
on, acted as a sanity check on the EBCDIC part of the CB. My version 
ran VERY fast. The major effort was creating the TR tables (and all 
of the mapping info needed was there since this was done by the 
original programmer when he created his tables).
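The TR technique Robert describes can be sketched in a few lines of Python, where bytes.translate() plays the role of TR: build the 256-entry table once, then convert whole buffers in a single pass instead of searching a pattern table one column at a time. The encoding below is purely illustrative, not the real 2540 column-binary layout.

```python
# Illustrative punch encoding (NOT the real 2540 layout): a column byte
# holds a zone code in bits 4-5 (0=none, 1=12-row, 2=11-row) and a digit
# 1-9 in the low nibble.
def punch(zone: int, digit: int) -> int:
    return (zone << 4) | digit

# Build the 256-entry translate table once, as the TR tables were built once.
TABLE = bytearray(b'?' * 256)
for d in range(1, 10):
    TABLE[punch(0, d)] = ord('0') + d          # bare digit punch -> '1'..'9'
    TABLE[punch(1, d)] = ord('A') + d - 1      # 12-zone + digit  -> 'A'..'I'
    TABLE[punch(2, d)] = ord('J') + d - 1      # 11-zone + digit  -> 'J'..'R'

def decode(columns: bytes) -> bytes:
    # One pass over the whole buffer, the Python analogue of a single TR.
    return columns.translate(bytes(TABLE))

card = bytes([punch(1, 3), punch(2, 6), punch(0, 7)])
print(decode(card).decode())  # -> "CO7"
```

The point is the shape of the fix, not the table contents: all the per-column decision logic moves into table construction, and the per-record work becomes one hardware-assisted sweep.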




Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Frank Swarbrick
Interesting article.  Do you have a link to the article it appears to be a 
response to?

> Date: Thu, 24 Dec 2015 14:42:19 -0500
> From: charl...@mcn.org
> Subject: Re: Is there a source for detailed, instruction-level performance 
> info?
> To: IBM-MAIN@LISTSERV.UA.EDU
> 
> Or as I said in 1974 ... 
> https://books.google.com/books?id=XrgyMRVh128C=PA16 
> 
> (Gawd, I'm turning into Lynn Wheeler ... )
> 
> Charles
> 
> -Original Message-
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
> Behalf Of Skip Robinson
> Sent: Thursday, December 24, 2015 2:27 PM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: Re: Is there a source for detailed, instruction-level performance
> info?
> 
> This is in no way a personal comment on Tom's experience. 
> 
> 'What a programmer is supposed to do' is avoid stupid code. We were once
> tasked with finding the bottleneck in a fairly mundane VSAM application. It
> ran horribly, consuming scads of both CPU and wall clock. It didn't take
> long using an OTS product to discover that for every single I/O, the cluster
> was being opened and closed again even though nothing else happened in the
> meantime. Simply changing that logic slashed resource utilization.
> 
> In another case, we were on the verge of upgrading a CEC when the
> application folks themselves discovered a few grossly inefficient SQL calls.
> Fixing those calls dropped overall LPAR utilization dramatically. 
> 
> What Tom and I are both saying is that focus on instruction timing should be
> seen as more of an avocation than a serious professional pursuit. Like
> playing with model trains at the expense of improving actual rail systems.
> It's interesting, but not much real business depends on the outcome.
> 
> --
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
  


Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Anne & Lynn Wheeler
rpin...@netscape.com (Richard Pinion) writes:
> Don't use zoned decimal for subscripts or counters, rather use indexes
> for subscripts and binary for counter type variables.  And when using
> conditional branching, try to code so as to make the branch the
> exception rather than the rule.  For large table lookups, use a binary
> search as opposed to a sequential search.
>
> These simple coding techniques can also reduce CPU time.

in late 70s we would have friday nights after work ... and discuss a
number of things ... along the lines of what came up in tandem memos
... aka I was blamed for online computer conferencing on the internal
network (larger than arpanet/internet from just about the beginning
until sometime mid-80s) in the late 70s and early 80s. folklore is that
when the corporate executive committee was told about online computer
conferencing (and the internal network), 5 of 6 wanted to fire me. from
IBMJARGON:

[Tandem Memos] n. Something constructive but hard to control; a fresh of
breath air (sic). "That's another Tandem Memos." A phrase to worry
middle management. It refers to the computer-based conference (widely
distributed in 1981) in which many technical personnel expressed
dissatisfaction with the tools available to them at that time, and also
constructively criticized the way products were [are] developed. The
memos are required reading for anyone with a serious interest in quality
products. If you have not seen the memos, try reading the November 1981
Datamation summary.

... snip ...

one of the issues was that the majority of the people inside the company
didn't actually use computers ... and we thought things would be
improved if the people in the company actually had personal experience
using computers, especially managers and executives. So we eventually
came up with the idea of online telephone books ... of (nearly)
everybody in the corporation ... especially if lookup elapsed time was
less than lookup in the paper telephone book.

avg binary search of 256K entries is 18 probes ... aka 2**18. Also
important was that there were nearly 64 entries per physical block ... so
binary search to the correct physical block is 12 reads (i.e. 64 is 2**6, 18-6=12).

However, it is fairly easy to calculate the name letter frequency ... so
instead of doing binary search, do radix search (based on letter
frequency) and can get within the correct physical block within 1-3
physical reads (instead of 12). We also got fancy doing first-two-letter
frequency and partially adjusting the 2nd probe based on how accurate the
first probe was.  In any case, binary search is what you use when the
distribution characteristics are totally unknown.
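A rough Python sketch of the idea (radix probe on the first letter, then binary search only within the remaining window); the names and the one-letter index are illustrative stand-ins for the measured letter-frequency tables:

```python
import bisect

def build_index(names):
    # Record where each first letter starts in the sorted list; this plays
    # the role of the measured letter-frequency table.
    index = {}
    for i, n in enumerate(names):
        index.setdefault(n[0], i)
    return index

def lookup(names, index, target):
    start = index.get(target[0])
    if start is None:
        return None            # no name starts with that letter
    # Binary search only within that letter's region, much narrower than
    # the whole list -- analogous to landing near the right block first.
    end = next((i for i in range(start, len(names))
                if names[i][0] != target[0]), len(names))
    pos = bisect.bisect_left(names, target, start, end)
    return pos if pos < end and names[pos] == target else None

names = sorted(["adams", "baker", "chen", "clark", "cohen", "davis"])
print(lookup(names, build_index(names), "cohen"))  # -> 4
```

With a frequency-weighted first probe like this, the number of "reads" to reach the right region depends on how well the distribution is modeled, not on log2 of the total size.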

So one friday night, we established the criteria, to design, implement,
test and deploy the lookup program had to take less than a person week
of effort ... and less than another person week to design, implement,
test and deploy the process for collecting, formating and distributing
the online telephone books.

trivia ... long ago and far away ... a couple people I had worked with
at Oracle (when I was at IBM and working on cluster scaleup for HA/CMP),
had left and were at a small client/server startup responsible for something
called commerce server. After cluster scaleup was transferred, announced
as IBM supercomputer, and we were told we couldn't work on anything with
more than four processors ... we decided to leave. We were then brought in
as consultants at this small client/server startup because they wanted to
do payment transactions on the server; the startup had also invented
this technology called SSL they wanted to use; the result is now
frequently called "electronic commerce".

The TCP/IP protocol has a session termination process that includes something
called the FINWAIT list. At the time, session termination was a relatively
infrequent process and common TCP/IP implementations used a sequential
search of the FINWAIT list (assuming that there would be few or no
entries on the list). HTTP (& HTTPS) chose to use TCP
... even though HTTP is a datagram protocol rather than a session protocol.
There was a period in the early/mid 90s, as web use was scaling up, when
webservers saturated, spending 90-95% of cpu time doing FINWAIT list
searches ... before the various implementations were upgraded to do
significantly more efficient management of the FINWAIT (session termination)
process.
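The FINWAIT pathology is easy to reproduce in miniature: a structure chosen under the "few or no entries" assumption degrades to O(n) per operation once the workload changes, while a hash-based structure stays O(1) either way. A hedged Python sketch (connection IDs invented):

```python
finwait_list = []   # original design assumption: nearly always empty
finwait_set = set()

def terminate_list(conn):
    # Sequential search per termination: fine when the list is tiny,
    # quadratic in total once terminations become frequent.
    if conn not in finwait_list:
        finwait_list.append(conn)

def terminate_set(conn):
    # Hash lookup/insert: constant time no matter how many sessions
    # are terminating at once.
    finwait_set.add(conn)

for conn in range(10_000):
    terminate_list(conn)   # 10,000 scans of a growing list
    terminate_set(conn)    # 10,000 constant-time inserts

print(len(finwait_list), len(finwait_set))  # -> 10000 10000
```

Same final contents; the difference is how much work was done getting there, which is exactly what the saturated webservers of the early/mid 90s were paying for.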

-- 
virtualization experience starting Jan1968, online at home since Mar1970



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Ed Gould

On Dec 24, 2015, at 4:06 PM, Richard Pinion wrote:

Don't use zoned decimal for subscripts or counters, rather use  
indexes for
subscripts and binary for counter type variables.  And when using  
conditional
branching, try to code so as to make the branch the exception  
rather than the
rule.  For large table lookups, use a binary search as opposed to a  
sequential

search.

These simple coding techniques can also reduce CPU time.



Very true. EREP has one report that does this, and it took a LOT of  
CPU seconds to run.
At the time IBM distributed microfiche for the module in question  
(sorry, I do NOT remember the name).
I went through the module and found that instead of indexing into an  
array, the code did a sequential lookup.
It was unfortunate that the module was written in PLS, so I couldn't  
fix it easily. I opened a PMR with IBM about performance on the  
module and explained what the coder did and how easily it  
could be fixed. I got a laugh out of level 2 and was told to live  
with it.
So I wrote a quick replacement for the report EREP put out, and it ran in  
less than 5 seconds of CPU time. I swapped in the procedure I  
wrote, the CE was happy, and I never bothered to let him know that  
it wasn't an IBM report.
The whole program took maybe 20 minutes to write and  
30 minutes to debug: instant CE happiness.


Ed



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Tom Brennan

Farley, Peter x23353 wrote:

So what is an ordinary programmer to do?


Years ago I guess I had nothing to do so I wrote a program that hooked 
into various LINK/LOAD SVC's and recorded the load module name (like 
Isogon and TADz do).  That huge pile of data ended up on a tape and I 
wrote some code to scan the tape for a particular module, to find out 
who was using it and how often.


The scan took forever, so I worked quite a bit trying to make the main 
loop more efficient.  Co-worker Stuart Holland looked at my logic and 
quickly switched it to using a hashing lookup algorithm, making it run 
probably a thousand times faster.  Oops :)
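The fix Stuart made has the same shape in any language: replace a per-query scan of the whole data set with a hash table built in one pass. A minimal Python sketch (the module names are invented examples, not real log data):

```python
from collections import Counter

records = ["IEFBR14", "IKJEFT01", "IEFBR14", "IEBGENER", "IEFBR14"]

def count_linear(recs, name):
    # Original approach: O(n) scan of the whole tape image for every
    # module queried.
    return sum(1 for r in recs if r == name)

# Hashing approach: one pass builds counts for every module at once,
# then each lookup is O(1).
counts = Counter(records)

print(count_linear(records, "IEFBR14"), counts["IEFBR14"])  # -> 3 3
```

With millions of records and many modules queried, the single Counter pass replaces millions-of-comparisons-per-query with one dictionary probe, which is where the "thousand times faster" comes from.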




Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Tony Harminc
On 23 December 2015 at 10:46, Jerry Callen  wrote:
> I'm in the process of hand-tuning a small, performance critical algorithm on 
> a Z13, and I'm hampered by the lack of detailed information on the 
> instruction-level performance of the machine.

Just to add two thoughts to the several good comments here...

First, what is the nature of your "small, performance critical
algorithm", and why do you see the need to hand tune it? Is it small
in the sense of code size, but the code itself loops, or small but
invoked very frequently, or small and invoked rarely but it really
really has to perform when it is, or...? That's the code; now how
about the data? Lots of it? Dense or sparse? Or a small amount that is
worked on intensively?

To some extent I think answers to this will determine your best course
of action. Not that I think you haven't thought about these things,
but clearly if you have a small routine that's invoked frequently, it
is important to remove as much of the calling overhead as possible so
that it doesn't outweigh your actual code. And if the data is large or
sparse, and the code small, you need to think hard about data cache
performance vs instruction cache, and the cache level at which they
converge/interfere.

Second, in the absence of detailed documentation on the machine (which
I think you will never see for a modern implementation, and which will
in any case change in the next one), you will do well do emulate what
those best informed do: write your routine in (say) C, and look at
what the IBM compiler generates, not just for the latest and greatest
OPT(...) value, but for some lower ones to see what has changed. Of
course this has some problems. In many cases you won't know *why* the
compiler does something, and therefore how to extrapolate to what you
want to do. And the high level languages lack a mechanism to tell the
compiler much of anything about what *you* know about the code and
data that it can't. There are occasional ways of sneaking hints
through the HLL to the optimizer, such as declaring the size of length
fields so as to limit the cases the compiler has to consider, but
these are accidental and incomplete.

As for your specific question

> * Are there any ways to bypass the L1 cache on moves of less than a page, 
> when simply moving
>  data without looking at it?

there is a cache usage hinting scheme when using the MVCL[E]
instructions: a padding byte value of X'B0' during non-padding
execution hints that you won't be referencing the target soon, and by
implication that the cache should be avoided. This has been
documented in the PofO for years, and how a specific model treats it
is, of course, not mentioned there.

There is also the more general Next Instruction Access Intent (NIAI)
instruction that may be of use, though naturally its effects are also
described in somewhat more general terms than you might like.

Tony H.



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Skip Robinson
This is in no way a personal comment on Tom's experience. 

'What a programmer is supposed to do' is avoid stupid code. We were once
tasked with finding the bottleneck in a fairly mundane VSAM application. It
ran horribly, consuming scads of both CPU and wall clock. It didn't take
long using an OTS product to discover that for every single I/O, the cluster
was being opened and closed again even though nothing else happened in the
meantime. Simply changing that logic slashed resource utilization.

In another case, we were on the verge of upgrading a CEC when the
application folks themselves discovered a few grossly inefficient SQL calls.
Fixing those calls dropped overall LPAR utilization dramatically. 

What Tom and I are both saying is that focus on instruction timing should be
seen as more of an avocation than a serious professional pursuit. Like
playing with model trains at the expense of improving actual rail systems.
It's interesting, but not much real business depends on the outcome.

.
.
.
J.O.Skip Robinson
Southern California Edison Company
Electric Dragon Team Paddler 
SHARE MVS Program Co-Manager
323-715-0595 Mobile
jo.skip.robin...@att.net
jo.skip.robin...@gmail.com


> -Original Message-
> From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU]
> On Behalf Of Tom Brennan
> Sent: Thursday, December 24, 2015 10:52 AM
> To: IBM-MAIN@LISTSERV.UA.EDU
> Subject: [Bulk] Re: Is there a source for detailed, instruction-level
performance
> info?
> 
> Farley, Peter x23353 wrote:
> > So what is an ordinary programmer to do?
> 
> Years ago I guess I had nothing to do so I wrote a program that hooked
into
> various LINK/LOAD SVC's and recorded the load module name (like Isogon and
> TADz do).  That huge pile of data ended up on a tape and I wrote some code
to
> scan the tape for a particular module, to find out who was using it and
how
> often.
> 
> The scan took forever, so I worked quite a bit trying to make the main
loop
> more efficient.  Co-worker Stuart Holland looked at my logic and quickly
> switched it to using a hashing lookup algorithm, making it run probably a
> thousand times faster.  Oops :)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Charles Mills
Right ... the point being "don't bother figuring out whether store or store
halfword is faster -- change your algorithm so you have only 1/1000 as many
stores."

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Tom Brennan
Sent: Thursday, December 24, 2015 1:52 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

Farley, Peter x23353 wrote:
> So what is an ordinary programmer to do?

Years ago I guess I had nothing to do so I wrote a program that hooked into
various LINK/LOAD SVC's and recorded the load module name (like Isogon and
TADz do).  That huge pile of data ended up on a tape and I wrote some code
to scan the tape for a particular module, to find out who was using it and
how often.

The scan took forever, so I worked quite a bit trying to make the main loop
more efficient.  Co-worker Stuart Holland looked at my logic and quickly
switched it to using a hashing lookup algorithm, making it run probably a
thousand times faster.  Oops :)



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Charles Mills
Storage is the new DASD and CPU time is the new wall clock time.

I like it.

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Anne & Lynn Wheeler
Sent: Thursday, December 24, 2015 1:47 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

so such accounting measuring CPU time (elapsed instruction time) is
analogous to early accounting which measured by elapsed wall clock time.

cache miss/memory access latency ... when measured in count of processor
cycles is comparable to 60s disk access when measured in in count of 60s
processor cycles.



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Charles Mills
Or as I said in 1974 ... 
https://books.google.com/books?id=XrgyMRVh128C=PA16 

(Gawd, I'm turning into Lynn Wheeler ... )

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Skip Robinson
Sent: Thursday, December 24, 2015 2:27 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

This is in no way a personal comment on Tom's experience. 

'What a programmer is supposed to do' is avoid stupid code. We were once
tasked with finding the bottleneck in a fairly mundane VSAM application. It
ran horribly, consuming scads of both CPU and wall clock. It didn't take
long using an OTS product to discover that for every single I/O, the cluster
was being opened and closed again even though nothing else happened in the
meantime. Simply changing that logic slashed resource utilization.

In another case, we were on the verge of upgrading a CEC when the
application folks themselves discovered a few grossly inefficient SQL calls.
Fixing those calls dropped overall LPAR utilization dramatically. 

What Tom and I are both saying is that focus on instruction timing should be
seen as more of an avocation than a serious professional pursuit. Like
playing with model trains at the expense of improving actual rail systems.
It's interesting, but not much real business depends on the outcome.



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Tom Brennan
And a secondary point is that the original programmer might simply need 
some help.  I'm more of a brute-force-make-it-work type of programmer, 
which may not be best when real Computer Science training is needed.


Charles Mills wrote:

Right ... the point being "don't bother figuring out whether store or store
halfword is faster -- change your algorithm so you have only 1/1000 as many
stores."

Charles

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Tom Brennan
Sent: Thursday, December 24, 2015 1:52 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

Farley, Peter x23353 wrote:


So what is an ordinary programmer to do?



Years ago I guess I had nothing to do so I wrote a program that hooked into
various LINK/LOAD SVC's and recorded the load module name (like Isogon and
TADz do).  That huge pile of data ended up on a tape and I wrote some code
to scan the tape for a particular module, to find out who was using it and
how often.

The scan took forever, so I worked quite a bit trying to make the main loop
more efficient.  Co-worker Stuart Holland looked at my logic and quickly
switched it to using a hashing lookup algorithm, making it run probably a
thousand times faster.  Oops :)







Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Mike Schwab
On Thu, Dec 24, 2015 at 12:47 PM, Anne & Lynn Wheeler  wrote:
>

> risc has been doing cache miss compensation for decades, out-of-order
> execution, branch prediction, speculative execution, hyperthreading ...
> can be viewed as hardware analogy to 60s multitasking ... given the
> processor something else to do while waiting for cache miss. Decade or
> more ago, some of the other non-risc chips started moving to hardware
> layer that translated instructions into risc micro-ops for scheduling
> and execution ... largely mitigating performance difference between
> those CISC architectures and RISC.
>

>
> as an aside, 370/195 pipeline was doing out-of-order execution ...  but
> didn't do branch prediction or speculative execution ... and
> conditional branch would drain the pipeline. careful coding could keep
> the execution units busy getting 10MIPS ... but normal codes typically
> ran around 5MIPS (because of conditional branches). I got sucked into
> helping with hyperthreading 370/195 (which never shipped), it would
> simulate two processors with two instructions streams, sets of
> registers, etc ... assuming two instruction streams, each running at
> 5MIPS would then keep all execution units running at 10MIPS.
>

>
> --
> virtualization experience starting Jan1968, online at home since Mar1970
>
https://en.wikipedia.org/wiki/IBM_7030_Stretch
First computer to implement: Multiprogramming, memory protection,
generalized interrupts, the eight-bit byte, instruction pipelining,
prefetch and decoding, and memory interleaving.

If branch prediction is a big hang-up, the obvious solution is to
start processing all possible outcomes and then keep the one that is
actually taken, i.e. B OUTCOME(R15), where R15 is a return code of
0, 4, 8, 12, or 16.


-- 
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?



Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Farley, Peter x23353
Chris and Charles,

I get that instruction-level tuning isn't even remotely feasible or productive 
any more due to pipelining and complex cache issues.  My own experience in 
performance enhancing projects has validated advice I was given in this forum 
many years ago:  The only way you can test performance is to repeat your 
before-and-after tests 5-10 times and average the results.  That is far from 
reassuring to the average programmer, especially when the development 
environment may be so constrained or loaded that it looks nothing like the 
"real" production environment in which the code to be tuned actually runs.  
There are lies, there are d***ed lies and there are statistics.  Few 
programmers I know want to bet their job on averages.

However, that leaves regular application programmers in an untenable position 
when management says "reduce your MIPS/MSU's to control our costs" or "the 
batch window/the peak online volume" can't be met anymore because the 
application is too slow, so get in there and speed it up!

Performance analysis tools available to ordinary application folk like Strobe 
or TriTune (just to mention two with which I have worked) can help identify 
glaring individual CPU hotspots, but of course tuning the overall application 
program design is always likely to bring more productive results than simple 
hotspot elimination.  But there aren't any *design* analysis tools that I know 
of.  Only years on the job and wide experience of "the way to do things" seems 
to be able to tackle that side of the problem.

I once thought that the "design pattern" paradigm might provide some help, but 
as far as I can see it is useless in any practical sense for regular business 
application programming practice in HLL's and assembler.

Understanding the z/Linux GNU C optimization parameters would probably give a 
careful assembler programmer strong hints on "best practice", but that seems 
untenable to do by hand, even at the level of a one-page "critical process".  
And obviously most HLL programmers don't have the option, or even want to have 
the option, to tune at that level, deferring to the compiler's capabilities.

So what is an ordinary programmer to do?

Peter

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Blaicher, Christopher Y.
Sent: Thursday, December 24, 2015 12:22 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance info?

I have looked at the public documentation on the z13 and had the privilege to 
speak to some of the people behind parts of it, and it is an amazing machine.
The reason you can't say how long an instruction takes is that in many cases 
things are happening A) out of sequence; B) at the same time and C) dependent 
on cache hit ratios.
A z13 can be looking at up to about 50 instructions to see if there is anything 
it can do.  If one of those instructions is not dependent on something yet to 
be done, it will do it and hold on to the result until needed.  It also has 
many more registers than the 16 we think of, so if you have code using R1 and 
that is followed by a LHI R1,n instruction it may use R1 for the leading-in 
code and use one of the extra registers, call it register X27, to hold the 
value of the LHI.  When that value is needed the X27 register is used instead.  
The machine remembers all this.
Also, the z13 can be working on 6 instructions in parallel.
One of the big pains for the z13 is an unpredictable branch.  That is one that 
goes one way this time and the other way the next.  The machine has a lot of 
branch prediction stuff (didn't know what else to call it) in it so that it 
tries to know where it will go, but if it predicts wrong, there is a 26 cycle 
cost, and when you consider that hits at least 6 instructions, that is a 
non-trivial expense.
That brings us to cache hit ratios.  A 1/10th of a second wait may seem like 
less than the blink of an eye, but it is forever in our high-speed machines.  In that 
time all your code and data has probably been purged out of level 1 cache, and 
maybe out of level 2 cache.  Bringing it back into cache takes time, a few 
cycles for level 1, and a factor more for each level away from level 1.
Today it is impossible to say how long an instruction takes.  It is even 
impossible to say how long a process takes because it varies based on what is 
in cache at the time.
Another thing that affects things is whether you get dispatched on the same 
processor or not.  If not, then all the level 1 cache has to be reloaded.
Bottom line, instruction speed is almost meaningless.  You have to look at it 
from a workload perspective.

Chris Blaicher
Technical Architect
Software Development
Syncsort Incorporated
50 Tice Boulevard, Woodcliff Lake, NJ 07677
P: 201-930-8234  |  M: 512-627-3803
E: cblaic...@syncsort.com


-Original Message

Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Blaicher, Christopher Y.
I have looked at the public documentation on the z13 and had the privilege to 
speak to some of the people behind parts of it, and it is an amazing machine.
The reason you can't say how long an instruction takes is that in many cases 
things are happening A) out of sequence; B) at the same time and C) dependent 
on cache hit ratios.
A z13 can be looking at up to about 50 instructions to see if there is anything 
it can do.  If one of those instructions is not dependent on something yet to 
be done, it will do it and hold on to the result until needed.  It also has 
many more registers than the 16 we think of, so if you have code using R1 and 
that is followed by a LHI R1,n instruction it may use R1 for the leading-in 
code and use one of the extra registers, call it register X27, to hold the 
value of the LHI.  When that value is needed the X27 register is used instead.  
The machine remembers all this.
Also, the z13 can be working on 6 instructions in parallel.
One of the big pains for the z13 is an unpredictable branch.  That is one that 
goes one way this time and the other way the next.  The machine has a lot of 
branch prediction stuff (didn't know what else to call it) in it so that it 
tries to know where it will go, but if it predicts wrong, there is a 26 cycle 
cost, and when you consider that hits at least 6 instructions, that is a 
non-trivial expense.
That brings us to cache hit ratios.  A 1/10th of a second wait may seem like 
less than the blink of an eye, but it is forever in our high-speed machines.  In that 
time all your code and data has probably been purged out of level 1 cache, and 
maybe out of level 2 cache.  Bringing it back into cache takes time, a few 
cycles for level 1, and a factor more for each level away from level 1.
Today it is impossible to say how long an instruction takes.  It is even 
impossible to say how long a process takes because it varies based on what is 
in cache at the time.
Another thing that affects things is whether you get dispatched on the same 
processor or not.  If not, then all the level 1 cache has to be reloaded.
Bottom line, instruction speed is almost meaningless.  You have to look at it 
from a workload perspective.

Chris Blaicher
Technical Architect
Software Development
Syncsort Incorporated
50 Tice Boulevard, Woodcliff Lake, NJ 07677
P: 201-930-8234  |  M: 512-627-3803
E: cblaic...@syncsort.com


-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Charles Mills
Sent: Thursday, December 24, 2015 9:51 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance info?

Not so simple anymore.

"How long does a store halfword take?" used to be a question that had an 
answer. It no longer does.

My working rule of thumb (admittedly grossly oversimplified) is "instructions 
take no time, storage references take forever." I have heard it said that 
storage is the new DASD. So much so that the z13 processors implement 
a kind of "internal multiprogramming" so that one CPU internal thread can do 
something useful while another thread is waiting for a storage reference.

Here is an example of how complex it is. I am responsible for an "event" or 
transaction driven program. I of course have test programs that will run events 
through the subject software. How many microseconds does each event consume? 
One surprising factor is how fast you push the events through.
If I max out the speed of event generation (as opposed to, say, one event per 
tenth of a second) then on a real-world shared Z the microseconds of CPU per event 
falls in HALF! Same exact sequence of instructions -- half the CPU time! Why? 
My presumption is that because if the program is running flat out it "owns" the 
caches and there is much less processor "wait" (for instruction and data fetch, 
not ECB type wait) time.

Charles
-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of Thomas Kern
Sent: Wednesday, December 23, 2015 5:28 PM
To: IBM-MAIN@LISTSERV.UA.EDU<mailto:IBM-MAIN@LISTSERV.UA.EDU>
Subject: Re: Is there a source for detailed, instruction-level performance info?

Perhaps what might be useful would be an assembler program to run loops of 
individual instructions and output some timing information.

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to 
lists...@listserv.ua.edu<mailto:lists...@listserv.ua.edu> with the message: 
INFO IBM-MAIN



  




Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Ed Jaffe

On 12/23/2015 7:46 AM, Jerry Callen wrote:

I'm in the process of hand-tuning a small, performance critical algorithm on a Z13, and 
I'm hampered by the lack of detailed information on the instruction-level performance of 
the machine. Back in the day, IBM used to publish a "Functional 
Characteristics" manual for each CPU model that provided this information; those 
seem to have been discontinued. I'm looking for things like:

* Is register renaming used for GPRs and/or FPRs? (affects the need for loop 
unrolling)
* Are there any ways to bypass the L1 cache on moves of less than a page, when 
simply moving data without looking at it?
* Does LMG/STMG outperform a linear sequence of LG/STG? Under what 
circumstances?

At the very least, compiler maintainers need this information to select the 
right instruction sequences for each model. Does anyone know of a source for 
this information, other than writing test kernels and trying things out? I 
already have the SHARE presentations by David Bond and Bob Rogers, which 
contain some of this information, but it's not enough.

I was really excited by the addition of vector registers on the Z13 (yippee, an 
additional 512 bytes of high speed scratch storage!), but the load/store performance 
hasn't turned out to be what I had hoped for. I may well be using the facility "the 
wrong way"; having detailed implementation information would sure help.


There is an instruction-level performance benchmark/report function in 
recent zHISR releases. I'm not sure, but you might need a later release 
than what's currently installed at your location to get that feature.


--
Edward E Jaffe
Phoenix Software International, Inc
831 Parkview Drive North
El Segundo, CA 90245
http://www.phoenixsoftware.com/

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Anne & Lynn Wheeler
charl...@mcn.org (Charles Mills) writes:
> Not so simple anymore.
>
> "How long does a store halfword take?" used to be a question that had an
> answer. It no longer does.
>
> My working rule of thumb (admittedly grossly oversimplified) is
> "instructions take no time, storage references take forever." I have heard
> it said that storage is the new DASD. So much so that the z13
> processors implement a kind of "internal multiprogramming" so that one CPU
> internal thread can do something useful while another thread is waiting for
> a storage reference.
>
> Here is an example of how complex it is. I am responsible for an "event" or
> transaction driven program. I of course have test programs that will run
> events through the subject software. How many microseconds does each event
> consume? One surprising factor is how fast you push the events through.
> If I max out the speed of event generation (as opposed to, say, one event per
> tenth of a second) then on a real-world shared Z the microseconds of CPU per
> event falls in HALF! Same exact sequence of instructions -- half the CPU
> time! Why? My presumption is that because if the program is running flat out
> it "owns" the caches and there is much less processor "wait" (for
> instruction and data fetch, not ECB type wait) time.

so such accounting measuring CPU time (elapsed instruction time) is
analogous to early accounting, which measured elapsed wall-clock time.

cache miss/memory access latency ... when measured in counts of processor
cycles is comparable to 60s disk access when measured in counts of 60s
processor cycles.

There is a lot of analogy between page thrashing when overcommitting real
memory and cache misses. This is an old account of the motivation behind moving
370 to all virtual memory. The issue was that as processors got faster,
they spent more and more time waiting for disk. To keep the processors
busy required increasing levels of multiprogramming to overlap
execution with waiting on disk. At the time, MVT storage allocation was
so bad that region sizes needed to be four times larger than actually
used. As a result, a typical 1mbyte 370/165 would only have four
regions. Going to virtual memory, it would be possible to run 16 regions
in a typical 1mbyte 370/165 with little or no paging ... significantly
increasing aggregate throughput.
http://www.garlic.com/~lynn/2011d.html#73 Multiple Virtual Memory

risc has been doing cache-miss compensation for decades; out-of-order
execution, branch prediction, speculative execution, hyperthreading ...
can be viewed as the hardware analogy to 60s multitasking ... giving the
processor something else to do while waiting for a cache miss. A decade or
more ago, some of the other non-risc chips started moving to a hardware
layer that translated instructions into risc micro-ops for scheduling
and execution ... largely mitigating the performance difference between
those CISC architectures and RISC.

IBM documentation claimed that half the per-processor improvement from
z10->z196 came from the introduction of many of the features that have been
common in risc implementations for decades ... with further refinement in
ec12 and z13.

z10, 64processors, aggregate 30BIPS or 496MIPS/proc
z196, 80processors, aggregate 50BIPS or 625MIPS/proc
EC12, 101 processors, aggregate 75BIPS or 743MIPS/proc

however, z13 claims 30% more throughput than EC12 with 40% more
processors ... which would make it 700MIPS/processor

by comparison, a z10-era E5-2600v1 blade was about 500 BIPS, 16 processors
or 31BIPS/proc. An E5-2600v4 blade is pushing 2000BIPS, 36 processors or
50BIPS/proc.

as an aside, the 370/195 pipeline was doing out-of-order execution ... but
didn't do branch prediction or speculative execution ... and a
conditional branch would drain the pipeline. careful coding could keep
the execution units busy, getting 10MIPS ... but normal codes typically
ran around 5MIPS (because of conditional branches). I got sucked into
helping with hyperthreading the 370/195 (which never shipped); it would
simulate two processors with two instruction streams, sets of
registers, etc ... the assumption being that two instruction streams, each
running at 5MIPS, would together keep all execution units running at 10MIPS.

from account of shutdown of ACS-360
http://people.cs.clemson.edu/~mark/acs_end.html

Sidebar: Multithreading

In summer 1968, Ed Sussenguth investigated making the ACS-360 into a
multithreaded design by adding a second instruction counter and a second
set of registers to the simulator. Instructions were tagged with an
additional "red/blue" bit to designate the instruction stream and
register set; and, as was expected, the utilization of the functional
units increased since more independent instructions were available.

IBM patents and disclosures on multithreading include:

US Patent 3,728,692, J.W. Fennel, Jr., Instruction selection in a
two-program counter instruction unit, filed August 1971, and issued
April 1973.

US Patent 3,771,138, J.O. Celtruda, et al., 

Re: Is there a source for detailed, instruction-level performance info?

2015-12-24 Thread Charles Mills
Not so simple anymore.

"How long does a store halfword take?" used to be a question that had an
answer. It no longer does.

My working rule of thumb (admittedly grossly oversimplified) is
"instructions take no time, storage references take forever." I have heard
it said that storage is the new DASD. So much so that the z13
processors implement a kind of "internal multiprogramming" so that one CPU
internal thread can do something useful while another thread is waiting for
a storage reference.

Here is an example of how complex it is. I am responsible for an "event" or
transaction driven program. I of course have test programs that will run
events through the subject software. How many microseconds does each event
consume? One surprising factor is how fast you push the events through.
If I max out the speed of event generation (as opposed to, say, one event per
tenth of a second) then on a real-world shared Z the microseconds of CPU per
event falls in HALF! Same exact sequence of instructions -- half the CPU
time! Why? My presumption is that because if the program is running flat out
it "owns" the caches and there is much less processor "wait" (for
instruction and data fetch, not ECB type wait) time.

Charles
-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Thomas Kern
Sent: Wednesday, December 23, 2015 5:28 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Is there a source for detailed, instruction-level performance
info?

Perhaps what might be useful would be an assembler program to run loops of
individual instructions and output some timing information.

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Is there a source for detailed, instruction-level performance info?

2015-12-23 Thread Jerry Callen
I'm in the process of hand-tuning a small, performance critical algorithm on a 
Z13, and I'm hampered by the lack of detailed information on the 
instruction-level performance of the machine. Back in the day, IBM used to 
publish a "Functional Characteristics" manual for each CPU model that provided 
this information; those seem to have been discontinued. I'm looking for things 
like:

* Is register renaming used for GPRs and/or FPRs? (affects the need for loop 
unrolling)
* Are there any ways to bypass the L1 cache on moves of less than a page, when 
simply moving data without looking at it?
* Does LMG/STMG outperform a linear sequence of LG/STG? Under what 
circumstances?

At the very least, compiler maintainers need this information to select the 
right instruction sequences for each model. Does anyone know of a source for 
this information, other than writing test kernels and trying things out? I 
already have the SHARE presentations by David Bond and Bob Rogers, which 
contain some of this information, but it's not enough.

I was really excited by the addition of vector registers on the Z13 (yippee, an 
additional 512 bytes of high speed scratch storage!), but the load/store 
performance hasn't turned out to be what I had hoped for. I may well be using 
the facility "the wrong way"; having detailed implementation information would 
sure help.


-- Jerry Callen
   Rocket Software

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: Is there a source for detailed, instruction-level performance info?

2015-12-23 Thread Jerry Callen
BTW - I applaud IBM's provision of the non-privileged ECAG instruction for 
obtaining cache characteristics. Here's the output of a little program I wrote 
to format the information it provides (on a Z13):

level 0: private
  data: line size=256, set associativity=8, total size=128K
  instruction: line size=256, set associativity=6, total size=96K

level 1: private
  data: line size=256, set associativity=8, total size=2048K
  instruction: line size=256, set associativity=8, total size=2048K

level 2: shared
  unified: line size=256, set associativity=16, total size=65536K

level 3: shared
  unified: line size=256, set associativity=30, total size=491520K

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN


Re: Is there a source for detailed, instruction-level performance info?

2015-12-23 Thread Thomas Kern
Perhaps what might be useful would be an assembler program to run loops 
of individual instructions and output some timing information.


/Tom

On 12/23/2015 11:20, Shmuel Metz (Seymour J.) wrote:

In <8970116796168447.wa.jcallennarsil@listserv.ua.edu>, on
12/23/2015
at 09:46 AM, Jerry Callen  said:


I'm in the process of hand-tuning a small, performance critical
algorithm on a Z13,

Lather, rinse, repeat every time you change processors.


and I'm hampered by the lack of detailed
information on the instruction-level performance of the machine.

Be careful what you ask for; you might get it.


Back in the day, IBM used to publish a "Functional Characteristics"
manual for each CPU model that provided this information;

Except when it was in a separate timing manual.

Every time that IBM introduced a new processor the timing formulæ
became more complicated. I suspect that a timing manual for the z13
would be massive.
  


--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN