Re: [galaxy-dev] Local Galaxy concept system: hardware spec questions

2012-08-14 Thread Sebastian Schaaf

Hey Scott,

First of all, thanks for the long reply - to keep it short, I'll answer 
inline:



Scott McManus wrote:

> Hey Sebastian-
>
> It may help to consider other pieces aside from compute nodes
> that you will need, such as nodes for proxies and databases,
> networking gear (such as switches and cables), and so on.
> http://usegalaxy.org/production has some details, and there are
> high-level pieces explained at
> http://wiki.g2.bx.psu.edu/Events/GDC2010?action=AttachFile&do=get&target=GDC2010_building_scalable.pdf

Thanks, I read through it; that gives me some evidence to point to.

> You should also talk to your institution's IT folks about power
> requirements, how those costs are passed on, off-site backup storage
> (though it sounds like you're counting on RAID 5/6), etc.
One non-technical note regarding the organization (techies: skip this): 
This is exactly the point we are at currently - we had a first 
non-technical conversation about a month ago, and in the last few days 
funding was suddenly released, which put us in "zugzwang" (as far as I 
can tell, the term is also used in English for being forced to (re)act).
The structure is roughly as follows: there is the IT provider for the 
complete hospital campus (consisting of several clinics and some medical 
school institutions; we belong to the latter) and our own institute's 
IT, which internally serves science and research. We had hours of chats 
inside our institute and agreed that we are neither able nor willing to 
manage everything on our own (the system is intended for everyone doing 
NGS research on the campus). The main campus IT was not involved in the 
non-technical conversation mentioned above.


Regarding the technical environment, everything is under way; today 
we'll have another meeting (the "main" IT folks are bothered by the 
"custom" hardware we are targeting). Backup is also part of the 
conversations in September - we don't want to count on RAID 6 alone. 
This topic is additionally very politics-driven (who pays for what?)... 
Technically, the need is not in question.


> It also may help if folks could share their experiences with benchmarking
> their own systems along with the tools that they've been using.
> The Galaxy Czars conference call could help - you could bring this
> up at the next meeting.
Fortunately I have been part of the Czars group since the first meeting 
and also took part in the GCC2012 breakout session. You're absolutely 
right. Too bad there is so little time until we have to act - that's the 
reason I included the whole list, hoping that someone has done some 
benchmarking. We planned to, but our first server behaved quite "moodily"...
Sharing experiences or "hard fact values", including system specs, 
would be great for other people who are at the point of ordering hardware 
and have to justify their choices.


> I've answered inline, but in general I think that the bottleneck
> for your planned architecture will be I/O with respect to disk.
> The next bottleneck may be with respect to the network - if you
> have a disk farm with a 1 Gbps (125 MBps) connection, then it
> doesn't matter if your disks can write 400+ MBps. (Nate also
> included this in his presentation.) You may want to consider
> Infiniband over Ethernet - I think the Galaxy Czars call would
> be really helpful in this respect.
For the HDD connection, a RAID controller offering 1 GB/s is planned - 
the array in our first server, by the way, delivered 450 MB/s (measured). 
The network should not be the problem for this concept, as the system is 
intended to be relatively self-contained. Network load will only occur 
while loading data from an archive or from the sequencer itself. A 
10 Gbit/s connection is available. InfiniBand was considered for a short 
time but would exceed the current funding. A cluster is available, but 
its connection speed is quite low (it is used mostly for statistical 
analyses).
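Just to make the arithmetic behind the bottleneck argument explicit (these 
are only the figures already mentioned in this thread plus the theoretical 
link rate; a sketch, not a benchmark):

    # Back-of-envelope: end-to-end throughput is capped by the slowest path.
    link_gbit_per_s = 1.0
    link_mbyte_per_s = link_gbit_per_s * 1000 / 8   # ~125 MB/s for a 1 Gbit/s link
    raid_write_mbyte_per_s = 450                    # measured on our first server's array

    ceiling = min(link_mbyte_per_s, raid_write_mbyte_per_s)
    print(f"network : {link_mbyte_per_s:.0f} MB/s")
    print(f"disks   : {raid_write_mbyte_per_s} MB/s")
    print(f"ceiling : {ceiling:.0f} MB/s (whichever path is slower wins)")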



> > 1. Using the described bioinformatics software: where are the potential
> > system bottlenecks? (connections between CPUs, RAM, HDDs)

> One way to get a better idea is to start with existing resources,
> create a sample workflow or two, and measure performance. Again,
> the Galaxy czars call could be a good bet.
This is what we wanted to do (see above), but we did not get that far due 
to the technical issues mentioned above (RAID controller, HDD crashes, etc.).
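If it helps anyone who is starting fresh, a minimal timing harness could look 
roughly like the sketch below (the bwa command line is only a placeholder for 
whatever tool you actually want to measure; it reports wall-clock time and the 
child process's peak RSS via Python's resource module):

    #!/usr/bin/env python
    # Sketch: run one tool, report wall-clock time and the child's peak memory.
    import resource
    import subprocess
    import time

    # Placeholder command - substitute the mapper/assembler you want to test.
    cmd = ["bwa", "mem", "-t", "8", "reference.fa", "reads_1.fastq", "reads_2.fastq"]

    start = time.time()
    with open("out.sam", "w") as out:
        subprocess.run(cmd, stdout=out, check=True)
    elapsed = time.time() - start

    usage = resource.getrusage(resource.RUSAGE_CHILDREN)  # covers the finished child
    print(f"wall clock : {elapsed:.1f} s")
    print(f"peak RSS   : {usage.ru_maxrss / 1024:.0f} MB")  # ru_maxrss is in KB on Linux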



> > 2. What is the expected relation of integer-based and floating-point-based
> > calculations, which will be loading the CPU cores?

> This also depends on the tools being used. This might be more
> relevant if your architecture were to use more specialized hardware
> (such as GPUs or FPGAs), but this should be a secondary concern.
From plain theory I would expect the Needleman-Wunsch algorithm to be 
highly relevant, which should basically be integer calculation in the 
case of pairwise sequence alignment. MSAs may be different (they may 
require floating-point calculations). Unfortunately, GPU and/or FPGA 
usage is currently far out of range for this first concept, but it has 
been in the back of my mind for a while :). In the standard CPU 
setting/environment I would
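(Purely to illustrate why I expect the inner loop to be integer work, a 
minimal Needleman-Wunsch scoring sketch - real aligners are of course far 
more elaborate, but the dynamic-programming recurrence itself only ever 
touches ints:)

    def nw_score(a, b, match=1, mismatch=-1, gap=-2):
        """Global alignment score (Needleman-Wunsch); integer arithmetic only."""
        rows, cols = len(a) + 1, len(b) + 1
        dp = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):            # gap penalties along the first column
            dp[i][0] = i * gap
        for j in range(1, cols):            # ... and along the first row
            dp[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
        return dp[-1][-1]

    print(nw_score("GATTACA", "GCATGCU"))   # every operation above is on plain ints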

Re: [galaxy-dev] Local Galaxy concept system: hardware spec questions

2012-08-14 Thread Paul-Michael Agapow
Some quick answers in the hopes that more qualified people will chip in:

> I have a couple of questions around the topic "hardware requirements" for
> a server which is intended to be bought and used as a concept machine for
> NGS-related jobs.

First a comment - it sounds a bit like you are where we were 12 months ago in 
developing our Galaxy system and looking at similar needs. I think you'll 
almost always need more of everything, because people will always be analysing 
bigger datasets, building bigger assemblies, etc.

> 1. Using the described bioinformatics software: where are the potential
> system bottlenecks? (connections between CPUs, RAM, HDDs)

While I/O is potentially a bottleneck (due to Galaxy copying and writing the 
datasets etc.), I wonder if in practice this is the case. Many of the NGS tasks 
are so long running that I/O issues may not be a significant hit. However, you 
may have a potential bottleneck in getting data onto the system. How does 
information get from the sequencer into the Galaxy instance? This may need some 
thinking about.
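As a rough illustration of that last point (the run size and link speed below 
are assumptions for the sake of the example, not anyone's real figures):

    # How long does one run's worth of data take to land on the system?
    run_size_gb = 100                 # assumed output of a single sequencing run
    link_mbyte_per_s = 125            # theoretical best case for a 1 Gbit/s link
    seconds = run_size_gb * 1024 / link_mbyte_per_s
    print(f"~{seconds / 60:.0f} min to copy {run_size_gb} GB at {link_mbyte_per_s} MB/s")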

> 2. What is the expected relation of integer-based and floating-point-based
> calculations, which will be loading the CPU cores?

I have no idea what this means.

> 3. Regarding the architectural differences (strengths, weaknesses):
> Would an AMD or an Intel system be more suitable?

I don't think this will make any difference. If it's a question of the number 
of cores, that depends to some extent on how many concurrent users or tasks 
you'll have running. I suspect your number of concurrent users will be low 
(i.e. at any time, very few people are logged in and running stuff under 
Galaxy).

> 6. HDD access (R/W) is mainly in bigger blocks instead of masses of
> short operations - correct?

That's my impression.

--
Paul Agapow (paul-michael.aga...@hpa.org.uk)
Bioinformatics, Health Protection Agency (UK)





Re: [galaxy-dev] Local Galaxy concept system: hardware spec questions

2012-08-13 Thread Scott McManus

Hey Sebastian-

It may help to consider other pieces aside from compute nodes
that you will need, such as nodes for proxies and databases, 
networking gear (such as switches and cables), and so on.
http://usegalaxy.org/production has some details, and there are
high-level pieces explained at 
http://wiki.g2.bx.psu.edu/Events/GDC2010?action=AttachFile&do=get&target=GDC2010_building_scalable.pdf

You should also talk to your institution's IT folks about power
requirements, how those costs are passed on, off-site backup storage
(though it sounds like you're counting on RAID 5/6), etc.

It also may help if folks could share their experiences with benchmarking
their own systems along with the tools that they've been using.
The Galaxy Czars conference call could help - you could bring this
up at the next meeting.

I've answered inline, but in general I think that the bottleneck
for your planned architecture will be I/O with respect to disk.
The next bottleneck may be with respect to the network - if you
have a disk farm with a 1 Gbps (125 MBps) connection, then it 
doesn't matter if your disks can write 400+ MBps. (Nate also 
included this in his presentation.) You may want to consider 
Infiniband over Ethernet - I think the Galaxy Czars call would 
be really helpful in this respect.

> 1. Using the described bioinformatics software: where are the potential
> system bottlenecks? (connections between CPUs, RAM, HDDs)

One way to get a better idea is to start with existing resources, 
create a sample workflow or two, and measure performance. Again,
the Galaxy czars call could be a good bet.

> 2. What is the expected relation of integer-based and floating-point-based
> calculations, which will be loading the CPU cores?

This also depends on the tools being used. This might be more 
relevant if your architecture were to use more specialized hardware
(such as GPUs or FPGAs), but this should be a secondary concern.

> 3. Regarding the architectural differences (strengths, weaknesses):
> Would an AMD or an Intel system be more suitable?

I really can't answer which processor line is more suitable, but 
I think that having enough RAM per core is more important. Nate shows 
that main.g2.bx.psu.edu has 4 GB RAM per core.

> 4. How much I/O (read and write) can be expected at the memory
> controllers? Which tasks are most I/O intensive (regarding RAM and/or
> HDDs)?

Workflows currently write all output to disk and read all input from
disk. This gets back to previous questions on benchmarking.
 
> 5. Roughly separated in mapping and clustering jobs: which amounts of
> main memory can be expected to be required by a single job (given e.g.
> Illumina exome data, 50x coverage)? As far as I know mapping should be
> around 4 GB, clustering much more (may reach high double digits).

Nate's presentation shows that main.g2.bx.psu.edu has 24 to 48 GB per
8 core reservation, and as before it shows that there is 4 GB per core.
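(A trivial sizing check along those lines; the node and core counts below are 
placeholders for illustration, not main.g2.bx.psu.edu's actual layout:)

    # Keep roughly 4 GB of RAM per core, following the figures above.
    gb_per_core = 4
    cores_per_node = 8        # placeholder
    nodes = 4                 # placeholder
    per_node_gb = gb_per_core * cores_per_node
    print(f"per node: {per_node_gb} GB, total: {per_node_gb * nodes} GB")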

> 6. HDD access (R/W) is mainly in bigger blocks instead of masses of
> short operations - correct?

Again, this all depends on the tool being used, and some benchmarks could
help here. This question sounds like it's mostly related to choosing the
filesystem - is that right? If so, then you may want to consider a 
compressing file system such as ZFS or BtrFS. You may also want to consider
filesystems like Ceph or Gluster (now Red Hat). I know that Ceph can
run on top of XFS and BtrFS, but you should look into BtrFS's churn rate -
it might still be evolving quickly. Again, a ping to the Galaxy Czars call 
may help on any and possibly all of these questions.

Good luck!

-Scott


Re: [galaxy-dev] Local Galaxy concept system: hardware spec questions

2012-08-13 Thread Sebastian Schaaf

Yes, thanks, I should have mentioned that.
I posted to both the forum and the dev list, because I don't expect the forum 
members and the dev-list subscribers to be 100% identical...


Sorry for any inconvenience...



Peter Cock wrote:

> On Mon, Aug 13, 2012 at 11:23 AM, Sebastian Schaaf
>  wrote:
>
> > Hi all,
> >
> > I have a couple of questions around the topic "hardware requirements" for a
> > server which is intended to be bought and used as a concept machine for
> > NGS-related jobs. It should be used for development of tools and workflows
> > (using Galaxy, sure) as well as a platform for some "alpha" users, who should
> > learn to work on NGS data, which they have just begun to generate.
> > ...
>
> Duplicate thread on the SEQanswers forum:
> http://seqanswers.com/forums/showthread.php?t=22456
>
> Peter



--
Sebastian Schaaf, M.Sc. Bioinformatics
Chair of Biometry and Bioinformatics
Department of Medical Information Sciences, Biometry and Epidemiology
University of Munich
Marchioninistr. 15, K U1 (postal)
Marchioninistr. 17, U 006 (office)
D-81377 Munich (Germany)
Tel: +49 89 2180-78178



Re: [galaxy-dev] Local Galaxy concept system: hardware spec questions

2012-08-13 Thread Peter Cock
On Mon, Aug 13, 2012 at 11:23 AM, Sebastian Schaaf
 wrote:
> Hi all,
>
> I have a couple of questions around the topic "hardware requirements" for a
> server which is intended to be bought and used as a concept machine for
> NGS-related jobs. It should be used for development of tools and workflows
> (using Galaxy, sure) as well as a platform for some "alpha" users, who should
> learn to work on NGS data, which they have just begun to generate.
> ...

Duplicate thread on the SEQanswers forum:
http://seqanswers.com/forums/showthread.php?t=22456

Peter