Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread Anton B. Rang
 When you have a striped storage device under a
 file system, then the database or file system's view
 of contiguous data is not contiguous on the media.

Right.  That's a good reason to use fairly large stripes.  (The primary 
limiting factor for stripe size is efficient parallel access; using a 100 MB 
stripe size means that an average 100 MB file gets less than two disks' worth 
of throughput.)
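(A quick way to see the "less than two disks" point, as an idealized illustration
only: assume the file is exactly one stripe unit S long, starts at a uniformly
random offset within a unit so it splits into pieces of size x and S - x on two
disks, and both pieces are read in parallel.)

\[
E\!\left[\frac{\max(x,\,S-x)}{B}\right] = \frac{3S}{4B}
\;\Longrightarrow\;
\text{effective speedup} = \frac{S/B}{3S/(4B)} = \frac{4}{3} \approx 1.33\ \text{disks' worth}
\]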

ZFS, of course, doesn't have this problem, since it's handling the layout on 
the media; it can store things as contiguously as it wants.

 There are many different ways to place the data on the media and we would 
 typically
 strive for a diverse stochastic spread.

Err ... why?

A random distribution makes reasonable sense if you assume that future read 
requests are independent, or that they are dependent in unpredictable ways. 
Now, if you've got sufficient I/O streams, you could argue that requests *are* 
independent, but in many other cases they are not, and they're usually 
predictable (particularly after a startup period). Optimizing for the predicted 
access cases makes sense. (Optimizing for observed access may make sense in 
some cases as well.)

-- Anton
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread Louwtjie Burger
 We are all anxiously awaiting data...
   -- richard

Would it be worthwhile to build a test case:

- Build a PostgreSQL database and import 1 000 000 (or more) rows of data.
- Run single and multiple large table-scan queries ... and watch the system.

then,

- Update a column of each row in the database, run the same queries
and watch the system.

Continue updating more columns (to get more fragmentation) until you notice something.

I personally believe that since most people will have hardware LUNs
(with underlying RAID) and cache, it will be difficult to notice
anything, given that those hardware LUNs might be busy with their own
wizardry ;)  You will also have to minimize the effect of the database
cache ...

It will be a tough assignment ... maybe someone has already done this?
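A minimal sketch of the proposed test, assuming a scratch PostgreSQL instance
whose data directory sits on the pool under test (the database and table names,
row count and padding below are made-up placeholders; watch iostat / zpool iostat
alongside each step):

createdb fragtest
psql fragtest -c "CREATE TABLE t (id int PRIMARY KEY, pad text, val int);"
psql fragtest -c "INSERT INTO t SELECT i, repeat('x', 200), 0 FROM generate_series(1, 1000000) i;"

# baseline full-table scan
time psql fragtest -c "SELECT count(*), sum(val) FROM t;"

# rewrite every row so COW relocates the table blocks, then scan again
psql fragtest -c "UPDATE t SET val = val + 1;"
time psql fragtest -c "SELECT count(*), sum(val) FROM t;"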

Thinking about this (very abstract) ... does it really matter?

[8KB-a][8KB-b][8KB-c]

So what if 8KB-b gets updated and moved somewhere else? If the DB gets
a request to read 8KB-a, it needs to do an I/O (assuming all caching is
eliminated). If it gets a request to read 8KB-b, it needs to do an I/O.

Does it matter that b is somewhere else ... it still needs to go and get
it ... only in a very abstract world with read-ahead (whether hardware or
db) would 8KB-b be in cache after 8KB-a was read.

Hmmm... the only way is to get some data :) *hehe*


Re: [zfs-discuss] internal error: Bad file number

2007-11-15 Thread Mark J Musante
On Thu, 15 Nov 2007, Manoj Nayak wrote:

 I am getting the following error message when I run any zfs command. I have
 attached the script I use to create a ramdisk image for Thumper.

 # zfs volinit
 internal error: Bad file number
 Abort - core dumped

This sounds as if you may have somehow lost the /dev/zfs link.  Try 
linking /dev/zfs to ../devices/pseudo/[EMAIL PROTECTED]:zfs assuming the latter 
exists 
at all.  If that doesn't do the trick, could you attach a truss -f output?


Regards,
markm


Re: [zfs-discuss] X4500 device disconnect problem persists

2007-11-15 Thread Peter Eriksson
Speaking of error recovery due to bad blocks - does anyone know if the SATA disks 
that are delivered with the Thumper have enterprise or desktop 
firmware/settings by default? If I'm not mistaken, one of the differences is 
that the enterprise variant gives up on bad blocks more quickly and 
reports them to the operating system, compared to the desktop variant that 
will keep on retrying forever (or almost, at least)...
 
 


Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread Robert Milkowski
Hello can,

Thursday, November 15, 2007, 2:54:21 AM, you wrote:

cyg The major difference between ZFS and WAFL in this regard is that
cyg ZFS batch-writes-back its data to disk without first aggregating
cyg it in NVRAM (a subsidiary difference is that ZFS maintains a
cyg small-update log which WAFL's use of NVRAM makes unnecessary). 
cyg Decoupling the implementation from NVRAM makes ZFS usable on
cyg arbitrary rather than specialized platforms, and that without
cyg doubt  constitutes a significant advantage by increasing the
cyg available options (in both platform and price) for those
cyg installations that require the kind of protection (and ease of
cyg management) that both WAFL and ZFS offer and that don't require
cyg the level of performance that WAFL provides and ZFS often may not
cyg (the latter hasn't gotten much air time here, and while it can be
cyg discussed to some degree in the abstract a better approach would
cyg be to have some impartial benchmarks to look at, because the
cyg on-disk block layouts do differ significantly and sometimes
cyg subtly even if the underlying approaches don't).

Well, ZFS allows you to put its ZIL on a separate device which could
be NVRAM.
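For example (a hedged sketch only - the pool and device names are hypothetical,
and it needs a build recent enough to support separate intent logs):

zpool create tank mirror c1t0d0 c1t1d0 log c3t0d0
# or, for an existing pool:
zpool add tank log c3t0d0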

-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS

2007-11-15 Thread Paul Bartholdi
On 11/15/07, Paul Kraus [EMAIL PROTECTED] wrote:

 Splitting this thread and changing the subject to reflect that...

 On 11/14/07, can you guess? [EMAIL PROTECTED] wrote:

  Another prominent debate in this thread revolves around the question of
  just how significant ZFS's unusual strengths are for *consumer* use.
  WAFL clearly plays no part in that debate, because it's available only
  on closed, server systems.

 I am both a large systems administrator and a 'home user' (I
 prefer that term to 'consumer'). I am also very slow to adopt new
 technologies in either environment. We have started using ZFS at work
 due to performance improvements (for our workload) over UFS (or any
 other FS we tested). At home the biggest reason I went with ZFS for my
 data is ease of management. I split my data up based on what it is ...
 media (photos, movies, etc.), vendor stuff (software, datasheets,
 etc.), home directories, and other misc. data. This gives me a good
 way to control backups based on the data type. I know, this is all
 more sophisticated than the typical home user. The biggest win for me
 is that I don't have to partition my storage in advance. I build one
 zpool and multiple datasets. I don't set quotas or reservations
 (although I could).

 So I suppose my argument for ZFS in home use is not data
 integrity, but much simpler management, both short and long term.


I am in the same situation as you and fully agree, except on data
integrity. At work, a sophisticated backup system keeps many copies of my
files, while at home backups are much more rudimentary, so data integrity
also becomes very important, certainly more so than speed.

Paul

Paul Kraus
 Albacon 2008 Facilities





Paul Bartholdi
Chemin de la Barillette 11
CH-1260 NYON
Suisse  tel +41 22 361 0222



Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread can you guess?
 can you guess? wrote:
   For very read intensive and position sensitive applications, I guess
   this sort of capability might make a difference?
 
   No question about it.  And sequential table scans in databases are among
   the most significant examples, because (unlike things like streaming
   video files which just get laid down initially and non-synchronously in
   a manner that at least potentially allows ZFS to accumulate them in
   large, contiguous chunks - though ISTR some discussion about just how
   well ZFS managed this when it was accommodating multiple such write
   streams in parallel) the tables are also subject to fine-grained,
   often-random update activity.
 
   Background defragmentation can help, though it generates a boatload of
   additional space overhead in any applicable snapshot.
 
 The reason that this is hard to characterize is that there are really two
 very different configurations used to address different performance
 requirements: cheap and fast.  It seems that when most people first
 consider this problem, they do so from the cheap perspective: single disk
 view.  Anyone who strives for database performance will choose the fast
 perspective: stripes.

And anyone who *really* understands the situation will do both.

 Note: data redundancy isn't really an issue for this analysis, but
 consider it done in real life.  When you have a striped storage device
 under a file system, then the database or file system's view of
 contiguous data is not contiguous on the media.

The best solution is to make the data piece-wise contiguous on the media at the 
appropriate granularity - which is largely determined by disk access 
characteristics (the following assumes that the database table is large enough 
to be spread across a lot of disks at moderately coarse granularity, since 
otherwise it's often small enough to cache in the generous amounts of RAM that 
are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for 
high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield 
over 80% of the disk's maximum possible (fully-contiguous layout) sequential 
streaming performance (after the overhead of an 'average' - 1/3 stroke - 
initial seek and partial rotation are figured in:  the latter could be avoided 
by using a chunk size that's an integral multiple of the track size, but on 
today's zoned disks that's a bit awkward).  A 1 MB chunk yields around 50% of 
the maximum streaming performance.  ZFS's maximum 128 KB 'chunk size', if 
effectively used as the disk chunk size as you seem to be suggesting, yields 
only about 15% of the disk's maximum streaming performance (leaving aside an 
additional degradation to a small fraction of even that should you use RAID-Z). 
 And if you match the ZFS block size to a 16 KB database block size and use 
that as the effective unit of distribution across the set of disks, you'll 
obtain a mighty 2% of the potential streaming performance (again, we'll be 
charitable and ignore the further degradation if RAID-Z is used).
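(For reference, a back-of-the-envelope model that roughly reproduces those
percentages - the inputs are illustrative assumptions rather than measurements:
about 10 ms of combined average seek plus rotational delay, and about 78 MB/s of
media rate.)

\[
\mathrm{eff}(c) = \frac{c/B}{t_{ov} + c/B},\qquad t_{ov} \approx 10\ \mathrm{ms},\quad B \approx 78\ \mathrm{MB/s}
\]
\[
\mathrm{eff}(4\,\mathrm{MB}) \approx \frac{51.3}{61.3} \approx 0.84,\quad
\mathrm{eff}(1\,\mathrm{MB}) \approx \frac{12.8}{22.8} \approx 0.56,\quad
\mathrm{eff}(128\,\mathrm{KB}) \approx \frac{1.6}{11.6} \approx 0.14,\quad
\mathrm{eff}(16\,\mathrm{KB}) \approx \frac{0.2}{10.2} \approx 0.02
\]

(All times in milliseconds.)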

Now, if your system is doing nothing else but sequentially scanning this one 
database table, this may not be so bad:  you get truly awful disk utilization 
(2% of its potential in the last case, ignoring RAID-Z), but you can still read 
ahead through the entire disk set and obtain decent sequential scanning 
performance by reading from all the disks in parallel.  But if your database 
table scan is only one small part of a workload which is (perhaps the worst 
case) performing many other such scans in parallel, your overall system 
throughput will be only around 4% of what it could be had you used 1 MB chunks 
(and the individual scan performances will also suck commensurately, of course).

Using 1 MB chunks still spreads out your database admirably for parallel 
random-access throughput:  even if the table is only 1 GB in size (eminently 
cachable in RAM, should that be preferable), that'll spread it out across 1,000 
disks (2,000, if you mirror it and load-balance to spread out the accesses), 
and for much smaller database tables if they're accessed sufficiently heavily 
for throughput to be an issue they'll be wholly cache-resident.  Or another way 
to look at it is in terms of how many disks you have in your system:  if it's 
less than the number of MB in your table size, then the table will be spread 
across all of them regardless of what chunk size is used, so you might as well 
use one that's large enough to give you decent sequential scanning performance 
(and if your table is too small to spread across all the disks, then it may 
well all wind up in cache anyway).

ZFS's problem (well, the one specific to this issue, anyway) is that it tries 
to use its 'block size' to cover two different needs:  performance for 
moderately fine-grained updates (though its need to propagate those updates 
upward to the root of the applicable tree 

Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread Andy Lubel
On 11/15/07 9:05 AM, Robert Milkowski [EMAIL PROTECTED] wrote:

 Hello can,
 
 Thursday, November 15, 2007, 2:54:21 AM, you wrote:
 
 cyg The major difference between ZFS and WAFL in this regard is that
 cyg ZFS batch-writes-back its data to disk without first aggregating
 cyg it in NVRAM (a subsidiary difference is that ZFS maintains a
 cyg small-update log which WAFL's use of NVRAM makes unnecessary).
 cyg Decoupling the implementation from NVRAM makes ZFS usable on
 cyg arbitrary rather than specialized platforms, and that without
 cyg doubt  constitutes a significant advantage by increasing the
 cyg available options (in both platform and price) for those
 cyg installations that require the kind of protection (and ease of
 cyg management) that both WAFL and ZFS offer and that don't require
 cyg the level of performance that WAFL provides and ZFS often may not
 cyg (the latter hasn't gotten much air time here, and while it can be
 cyg discussed to some degree in the abstract a better approach would
 cyg be to have some impartial benchmarks to look at, because the
 cyg on-disk block layouts do differ significantly and sometimes
 cyg subtly even if the underlying approaches don't).
 
 Well, ZFS allows you to put its ZIL on a separate device which could
 be NVRAM.

Like RAMSAN SSD

http://www.superssd.com/products/ramsan-300/

It is the only FC-attached, battery-backed SSD that I know of, and we have
dreams of clusterfication.  Otherwise we would use one of those PCI-Express
based NVRAM cards that are on the horizon.

My initial results for lots of small files were very pleasing.

I dream of a JBOD with lots of disks + something like this built into 3U.
Too bad Sun's forthcoming JBODs probably won't have anything similar to
this...

-Andy



Re: [zfs-discuss] ZFS snapshot send/receive via intermediate device

2007-11-15 Thread Darren J Moffat
Simple answer yes.

Slightly longer answer.

zfs send just writes to stdout; where you put that is up to you - it can be 
a file in some filesystem, a raw disk, a tape, or a pipe to another program 
(such as ssh, compress, or encrypt).

zfs recv reads from stdin, so just do the reverse of whatever you did for the send.
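A few hedged illustrations (the snapshot, pool, path and host names are invented):

zfs send tank/home@monday | gzip > /backup/home-monday.zfs.gz      # to a compressed file
gzip -dc /backup/home-monday.zfs.gz | zfs recv tank/home-restored  # and back again
zfs send tank/home@monday | ssh otherhost zfs recv backup/home     # or straight to another host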

--
Darren J Moffat


Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread can you guess?
...

 Well, ZFS allows you to put its ZIL on a separate
 device which could
 be NVRAM.

And that's a GOOD thing (especially because it's optional rather than requiring 
that special hardware be present).  But if I understand the ZIL correctly, it's 
not as effective as using NVRAM as a more general kind of log for a wider range 
of data sizes and types, as WAFL does.

- bill
 
 


Re: [zfs-discuss] Is ZFS stable in OpenSolaris?

2007-11-15 Thread Mark Phalan

On Thu, 2007-11-15 at 17:20 +, Darren J Moffat wrote:
 hex.cookie wrote:
  In production environment, which platform should we use? Solaris 10 U4 or 
  OpenSolaris 70+?  How should we estimate a stable edition for production? 
  Or OpenSolaris is stable in some build?
 
 It all depends on what you mean by stable.
 
 Do you intend to pay Sun for a service contract?
   If so, S10u4 is likely your best route.
 
 Do you care about patching rather than upgrading?
   If patching, S10u4.
   If you can upgrade (highly recommended IMO)
     using live_upgrade(5), then a Solaris Express build.
 
 For an OpenSolaris based distribution I think the realistic choices are 
 from the following list:
 Solaris Express Community Edition (SX:CE)
 Solaris Express Developer Edition (SX:DE)
 Belenix
 Nexenta
 OpenSolaris Developer Preview (Project Indiana)

or Martux if you want to run on SPARC.

-Mark



Re: [zfs-discuss] cannot mount 'mypool': Input/output error

2007-11-15 Thread Nabeel Saad
I appreciate the different responses that I have gotten.  As some of you may 
have realized I am not a guru in Linux / Solaris... 

I have been trying to figure out what file system my Solaris box was using... I 
got a comment from Paul that, from the fdisk command, he could see that most 
likely the partitions are Solaris UFS... I don't see that information anywhere, 
so I'm wondering if I missed something, or if you are assuming this, Paul?

I am sure I will not use ZFS to its fullest potential at all... right now I'm 
trying to recover the dead disk, so if it works to mount a single disk/boot 
disk, that's all I need; I don't need it to be very functional.  As I 
suggested, I will only be using this to change permissions and then return the 
disk to the appropriate server once I am able to log back into that server.

I will try the zfs import just to give it a go.  I have done modprobe fuse and 
have it loaded... but the fact that allow is not available in the latest 
version clears up why that wasn't working...

Sorry Darren, I was not sure what the CC forums really did and I just chose 
ones that I thought might be related to ZFS, not realizing that Crypto is 
probably another project...

I got another suggestion that the file system is UFS, which would make me think 
that mount -t ufs /dev/sda1 /mnt/mymount should work, but given that that fails 
with 

mount: wrong fs type, bad option, bad superblock on /dev/sda1,
   or too many mounted file systems

something is not right... but that's probably more a Linux community 
discussion topic... thanks though.

Thanks for your suggestion Mark, I will look into the Linux FUSE side, although 
I do have a feeling we downloaded the Solaris FUSE software and put it on a 
Linux box... I'll have to look into that some more.

Thank you for your responses...
 
 


Re: [zfs-discuss] Is ZFS stable in OpenSolaris?

2007-11-15 Thread Darren J Moffat
hex.cookie wrote:
 In production environment, which platform should we use? Solaris 10 U4 or 
 OpenSolaris 70+?  How should we estimate a stable edition for production? Or 
 OpenSolaris is stable in some build?

It all depends on what you mean by stable.

Do you intend to pay Sun for a service contract?
If so, S10u4 is likely your best route.

Do you care about patching rather than upgrading?
If patching, S10u4.
If you can upgrade (highly recommended IMO)
  using live_upgrade(5), then a Solaris Express build.

For an OpenSolaris based distribution I think the realistic choices are 
from the following list:
Solaris Express Community Edition (SX:CE)
Solaris Express Developer Edition (SX:DE)
Belenix
Nexenta
OpenSolaris Developer Preview (Project Indiana)

Another important consideration is what ZFS functionality you need since 
not all features available in OpenSolaris releases were backported to 
Solaris 10u4 (because some of them were completed *after* S10u4 shipped).

-- 
Darren J Moffat


Re: [zfs-discuss] How to create ZFS pool ?

2007-11-15 Thread Mike Dotson
On Thu, 2007-11-15 at 05:25 -0800, Boris Derzhavets wrote:
 Thank you very much Mike for your feedback.
 Just one more question.
 I noticed five device under /dev/rdsk:-
 c1t0d0p0
 c1t0d0p1
 c1t0d0p2
 c1t0d0p3
 c1t0d0p4
 been created by system immediately after installation completed.
 I believe it's x86 limitation (no more then 4 primary partitions)
 If I've got your point right, in case when Other OS partition gets number 3.
 I am supposed to run:-
 # zpool create pool  c1t0d0p3

Yes.  Just make sure it's the correct partition, i.e. partition 3 is
actually where you want the zpool, otherwise you'll corrupt/lose whatever
data is on that partition.  You also need to make sure that partition 3
is defined and you can see it in fdisk, as Solaris creates these p?
devices whether the partitions exist or not.

So if I read your previous email correctly, you'll need to run format,
select your first disk then run fdisk again.  Empty/unused space doesn't
mean a partition has been created.

From there, you'll want to create a new partition and if you're not
familiar with Solaris fdisk, it's a PITA until you get really used to
it.  You'll want to start one (1) cylinder past the end of your last
partition so there's no overlap, then calculate the size of the
partition.  I usually use cylinders for this.

So on one of my systems:

 Total disk size is 17849 cylinders
 Cylinder size is 16065 (512 byte) blocks

   Cylinders
  Partition   StatusType  Start   End   Length%
  =   ==  =   ===   ==   ===
  1   ActiveSolaris2  1  52245224 29



SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)
Enter Selection: 

So the last cylinder used is 5224, so we'll start at 5225; to use the rest
of the disk, take the max cylinders (17849 from the top line) and subtract
5225, which gives you 12624.

Select 1 to create a new partition:
Select the partition type to create:
   1=SOLARIS2  2=UNIX3=PCIXOS 4=Other
   5=DOS12 6=DOS16   7=DOSEXT 8=DOSBIG
   9=DOS16LBA  A=x86 BootB=Diagnostic C=FAT32
   D=FAT32LBA  E=DOSEXTLBA   F=EFI0=Exit? 

Select 4 for Other OS
Specify the percentage of disk to use for this partition
(or type c to specify the size in cylinders). 

Now select c for cylinders (I've never been much one for trusting
percentages;)

Enter starting cylinder number:  5225
Enter partition size in cylinders: 12624
(It'll ask you about making it the active partition - say no here)


 Total disk size is 17849 cylinders
 Cylinder size is 16065 (512 byte) blocks

   Cylinders
  Partition   StatusType  Start   End   Length%
  =   ==  =   ===   ==   ===
  1   ActiveSolaris2  1  52245224 29
  2 Other OS   5225  1784812624 71




SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Exit (update disk configuration and exit)
   6. Cancel (exit without updating disk configuration)

Double check you're not overlapping any of the partitions and select 5
to save the partition.

In this case, the pool would be c1t0d0p2.  Not the most technically
accurate way to put it, but think of p0 as the entire disk; your first
partition is p1, and so forth.
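Once the new Other OS partition shows up in fdisk, the hedged last step
(using the device from the example above - double-check the partition number
against your own table first) would be:

zpool create mypool c1t0d0p2
zpool status mypool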

Hope that helps.  If you want, post your fdisk partition table for a
second set of eyes.

 Boris.
  
 
-- 
Mike Dotson



Re: [zfs-discuss] zfs on a raid box

2007-11-15 Thread Dan Pritts
On Tue, Nov 13, 2007 at 12:25:24PM +0100, Paul Boven wrote:
 Hi everyone,
 
 We're building a storage system that should have about 2TB of storage
 and good sequential write speed. The server side is a Sun X4200 running
 Solaris 10u4 (plus yesterday's recommended patch cluster), the array we
 bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and
 it's connected to the Sun through U320-scsi.

We are doing basically the same thing with similar Western Scientific
(wsm.com) arrays, based on Infortrend controllers.  ZFS notices when we
pull a disk and goes on and does the right thing.

I wonder if you've got a SCSI card/driver problem.  We tried using
an Adaptec card with Solaris with poor results; we switched to LSI and
it just works.

danno
--
Dan Pritts, System Administrator
Internet2
office: +1-734-352-4953 | mobile: +1-734-834-7224


Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS

2007-11-15 Thread can you guess?
...

 At home the biggest reason I went with ZFS for my data is ease of
 management. I split my data up based on what it is ... media (photos,
 movies, etc.), vendor stuff (software, datasheets, etc.), home
 directories, and other misc. data. This gives me a good way to control
 backups based on the data type.

It's not immediately clear why simply segregating the different data types into 
different directory sub-trees wouldn't allow you to do pretty much the same 
thing.

- bill
 
 


Re: [zfs-discuss] [fuse-discuss] cannot mount 'mypool': Input/output error

2007-11-15 Thread Mark Phalan

On Thu, 2007-11-15 at 07:22 -0800, Nabeel Saad wrote:
 Hello, 
 
 I have a question about using ZFS with Fuse.  A little bit of background of 
 what we've been doing first...  We recently had an issue with a Solaris 
 server where the permissions of the main system files in /etc and such were 
 changed.  On server restart, Solaris threw an error and it was not possible 
 to log in, even as root. 
 
 So, given that it's the only Solaris machine we have, we took out the drive 
 and after much trouble trying with different machines, we connected it to 
 Linux 2005 Limited Edition server using a USB to SATA connector.  The linux 
 machine now sees the device in /dev/sda* and I can confirm this by doing the 
 following: 
 
 [root]# fdisk sda 
 
 Command (m for help): p 
 
 Disk sda (Sun disk label): 16 heads, 149 sectors, 65533 cylinders 
 Units = cylinders of 2384 * 512 bytes 
 
 Device FlagStart   EndBlocks   Id  System 
 sda1  1719 11169  112644002  SunOS root 
 sda2  u  0  1719   20490483  SunOS swap 
 sda3 0 65533  781153365  Whole disk 
 sda5 16324 65533  586571288  SunOS home 
 sda6 11169 16324   61447607  SunOS var 
 
 Given that Solaris uses ZFS,

Solaris *can* use ZFS. ZFS root isn't supported by any distro (other
than perhaps Indiana). The filesystem you are trying to mount is
probably UFS.
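If it is Solaris UFS, note that the Linux ufs driver usually has to be told
which flavour it is dealing with; purely as a hedged suggestion (the exact
ufstype depends on the platform the disk came from - sun for SPARC, sunx86 for
x86 - and read-only is safer while experimenting):

mount -t ufs -o ro,ufstype=sun /dev/sda1 /mnt/mymount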

  we figured to be able to change the permissions, we'll need to be able to 
 mount the device.  So, we found Fuse, downloaded, installed it along with 
 ZFS.  Everything went as expected until the creation of the pool for some 
 reason.  We're interested in either sda1, sda3 or sda5, we'll know better 
 once we can mount them...   
 
 So, we do ./run.sh  and then the zpool and zfs commands are available.  My 
 ZFS questions come here, once we run the create command, I get the error 
 directly: 
 
 [root]# zpool create mypool sda 

If you want to destroy the data on /dev/sda then this is a good start.
IF it were ZFS (which it probably isn't) you'd want to be using zpool
import.
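For reference, the non-destructive way to look for ZFS on an attached disk
would be along these lines (the pool name is whatever zpool import reports,
not something you choose):

zpool import           # scan attached devices and list any importable pools
zpool import mypool    # then import one by the name (or numeric id) shown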

 fuse: mount failed: Invalid argument 
 cannot mount 'mypool': Input/output error 
 
 However, if I list the pools, clearly it's been created: 
 
 [root]# zpool list 
 NAMESIZEUSED   AVAILCAP  HEALTH ALTROOT 
 mypool 74.5G 88K   74.5G 0%  ONLINE - 
 
 It seems the issue is with the mounting, and I can't understand why: 
 
 [root]# zfs mount mypool 
 fuse: mount failed: Invalid argument 
 cannot mount 'mypool': Input/output error 
 
 [root]# zfs mount 
 
 I had searched through the source code trying to figure out what argument was 
 considered invalid and found the following: 
 
 477     if (res == -1) {
 478         /*
 479          * Maybe kernel doesn't support unprivileged mounts, in this
 480          * case try falling back to fusermount
 481          */
 482         if (errno == EPERM) {
 483             res = -2;
 484         } else {
 485             int errno_save = errno;
 486             if (mo->blkdev && errno == ENODEV && !fuse_mnt_check_fuseblk())
 487                 fprintf(stderr, "fuse: 'fuseblk' support missing\n");
 488             else
 489                 fprintf(stderr, "fuse: mount failed: %s\n",
 490                         strerror(errno_save));
 491         }
 492 
 493         goto out_close;
 494     }
 
 in the following file: 
 
 http://cvs.opensolaris.org/source/xref/fuse/libfuse/mount.c 

This is the OpenSolaris FUSE code; you're using FUSE on Linux. You
should check with the Linux FUSE community...

-Mark



[zfs-discuss] ZFS implimentations

2007-11-15 Thread Stephen Stogner
Hello,
Does anyone have some real-world examples of using a large ZFS cluster, i.e. 
somewhere with 40+ vdevs in the range of a few hundred or so terabytes?

Thank you.
 
 


[zfs-discuss] zfs mount -a intermittent

2007-11-15 Thread Andre Lue
I have a slimmed-down install of on_b61, and sometimes when the box is rebooted 
it fails to automatically remount the pool. In most cases, if I log in and run 
zfs mount -a it will mount. In some cases I have to reboot again. Can someone 
provide some insight as to what may be going on here?

truss captures the following when it fails 
412:brk(0x0808D000) = 0
412:brk(0x0809D000) = 0
412:brk(0x080AD000) = 0
412:brk(0x080BD000) = 0
412:open(/dev/zfs, O_RDWR)= 3
412:fstat64(3, 0x08047BA0)  = 0
412:d=0x0448 i=95420420 m=0020666 l=1  u=0 g=3 rdev=0x02D800
00
412:at = Nov 15 06:17:13 PST 2007  [ 1195136233 ]
412:mt = Nov 15 06:17:13 PST 2007  [ 1195136233 ]
412:ct = Nov 15 06:17:13 PST 2007  [ 1195136233 ]
412:bsz=8192  blks=0 fs=devfs
412:stat64(/dev/pts/0, 0x08047CB0)= 0
412:d=0x044C i=447105886 m=0020620 l=1  u=0 g=0 rdev=0x00600
000
412:at = Nov 15 06:17:32 PST 2007  [ 1195136252 ]
412:mt = Nov 15 06:17:32 PST 2007  [ 1195136252 ]
412:ct = Nov 15 06:17:32 PST 2007  [ 1195136252 ]
412:bsz=8192  blks=0 fs=dev
412:open(/etc/mnttab, O_RDONLY)   = 4
412:fstat64(4, 0x08047B60)  = 0
412:d=0x04580001 i=2 m=0100444 l=2  u=0 g=0 sz=651
412:at = Nov 15 06:17:38 PST 2007  [ 1195136258 ]
412:mt = Nov 15 06:17:38 PST 2007  [ 1195136258 ]
412:ct = Nov 15 06:17:04 PST 2007  [ 1195136224 ]
412:bsz=512   blks=2 fs=mntfs
412:open(/etc/dfs/sharetab, O_RDONLY) Err#2 ENOENT
412:open(/etc/mnttab, O_RDONLY)   = 5
412:fstat64(5, 0x08047B80)  = 0
412:d=0x04580001 i=2 m=0100444 l=3  u=0 g=0 sz=651
412:at = Nov 15 06:17:38 PST 2007  [ 1195136258 ]
412:mt = Nov 15 06:17:38 PST 2007  [ 1195136258 ]
412:ct = Nov 15 06:17:04 PST 2007  [ 1195136224 ]
412:bsz=512   blks=2 fs=mntfs
412:sysconfig(_CONFIG_PAGESIZE) = 4096
412:ioctl(3, ZFS_IOC_POOL_CONFIGS, 0x08046DA4)  = 0
412:llseek(5, 0, SEEK_CUR)  = 0
412:close(5)= 0
412:close(3)= 0
412:llseek(4, 0, SEEK_CUR)  = 0
412:close(4)= 0
412:_exit(0)

Looking at the ioctl call in libzfs_configs.c, I think 412: ioctl(3, 
ZFS_IOC_POOL_CONFIGS, 0x08046DA4) = 0 matches the section of code 
below.
245         for (;;) {
246                 if (ioctl(zhp->zpool_hdl->libzfs_fd, ZFS_IOC_POOL_STATS,
247                     &zc) == 0) {
248                         /*
249                          * The real error is returned in the zc_cookie field.
250                          */
251                         error = zc.zc_cookie;
252                         break;
253                 }
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread can you guess?
Richard Elling wrote:

...

  there are really two very different configurations used to address
  different performance requirements: cheap and fast.  It seems that when
  most people first consider this problem, they do so from the cheap
  perspective: single disk view.  Anyone who strives for database
  performance will choose the fast perspective: stripes.
 
  And anyone who *really* understands the situation will do both.
 
 I'm not sure I follow.  Many people who do high performance databases use
 hardware RAID arrays which often do not expose single disks.

They don't have to expose single disks:  they just have to use reasonable chunk 
sizes on each disk, as I explained later.

Only very early (or very low-end) RAID used very small per-disk chunks (up to 
64 KB max).  Before the mid-'90s chunk sizes had grown to 128 - 256 KB per disk 
on mid-range arrays in order to improve disk utilization in the array.  From 
talking with one of its architects years ago my impression is that HP's (now 
somewhat aging) EVA series uses 1 MB as its chunk size (the same size I used as 
an example, though today one could argue for as much as 4 MB and soon perhaps 
even more).

The array chunk size is not the unit of update, just the unit of distribution 
across the array:  RAID-5 will happily update a single 4 KB file block within a 
given array chunk and the associated 4 KB of parity within the parity chunk.  
But the larger chunk size does allow files to retain the option of using 
logical contiguity to attain better streaming sequential performance, rather 
than splintering that logical contiguity at fine grain across multiple disks.

...

 A single chunk on an (S)ATA disk today (the analysis is similar for 
 high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size 
 to yield over 80% of the disk's maximum possible (fully-contiguous 
 layout) sequential streaming performance (after the overhead of an 
 'average' - 1/3 stroke - initial seek and partial rotation are figured 
 in:  the latter could be avoided by using a chunk size that's an 
 integral multiple of the track size, but on today's zoned disks that's 
 a bit awkward).  A 1 MB chunk yields around 50% of the maximum 
 streaming performance.  ZFS's maximum 128 KB 'chunk size' if 
 effectively used as the disk chunk size as you seem to be suggesting 
 yields only about 15% of the disk's maximum streaming performance 
 (leaving aside an additional degradation to a small fraction of even 
 that should you use RAID-Z).  And if you match the ZFS block size to a 
 16 KB database block size and use that as the effective unit of 
 distribution across the set of disks, you'll 
 obtain a mighty 2% of the potential streaming performance (again, we'll 
 be charitable and ignore the further degradation if RAID-Z is used).

   
 
 You do not seem to be considering the track cache, which for
 modern disks is 16-32 MBytes.  If those disks are in a RAID array,
 then there is often larger read caches as well.

Are you talking about hardware RAID in that last comment?  I thought ZFS was 
supposed to eliminate the need for that.

  Expecting a seek and
 read for each iop is a bad assumption.

The bad assumption is that the disks are otherwise idle and therefore have the 
luxury of filling up their track caches - especially when I explicitly assumed 
otherwise in the following paragraph in that post.  If the system is heavily 
loaded the disks will usually have other requests queued up (even if the next 
request comes in immediately rather than being queued at the disk itself, an 
even half-smart disk will abort any current read-ahead activity so that it can 
satisfy the new request).

Not that it would necessarily do much good for the case currently under 
discussion even if the disks weren't otherwise busy and they did fill up the 
track caches:  ZFS's COW policies tend to encourage data that's updated 
randomly at fine grain (as a database table often is) to be splattered across 
the storage rather than neatly arranged such that the next data requested from 
a given disk will just happen to reside right after the previous data requested 
from that disk.

 
 Now, if your system is doing nothing else but sequentially scanning 
 this one database table, this may not be so bad:  you get truly awful 
 disk utilization (2% of its potential in the last case, ignoring 
 RAID-Z), but you can still read ahead through the entire disk set and 
 obtain decent sequential scanning performance by reading from all the 
 disks in parallel.  But if your database table scan is only one small 
 part of a workload which is (perhaps the worst case) performing many 
 other such scans in parallel, your overall system throughput will be 
 only around 4% of what it could be had you used 1 MB chunks (and the 
 individual scan performances will also suck commensurately, of course).

...

 Real data would be greatly appreciated.  In my tests, I see
 reasonable media bandwidth speeds 

[zfs-discuss] read/write NFS block size and ZFS

2007-11-15 Thread msl
Hello all...
 I'm migrating an NFS server from Linux to Solaris, and all clients (Linux) are 
using read/write block sizes of 8192. That was the best performance I could 
get, and it's working pretty well (NFSv3). I want to use all of ZFS's 
advantages, and I know I may see a performance loss, so I want to know if 
there is a recommendation for block size on NFS/ZFS, or what you think about it.
Must I test, or is there no need to make such configurations with ZFS?
Thanks very much for your time!
Leal.
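For what it's worth, the knobs involved look roughly like this - the server,
share and mountpoint names are invented, and the values are things to measure
rather than recommendations:

# on the Solaris server: match the dataset recordsize to the dominant I/O size
zfs set recordsize=8K tank/export

# on a Linux client: try several sizes in turn (umount between runs) and compare
mount -o vers=3,rsize=32768,wsize=32768 server:/tank/export /mnt/test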
 
 


Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread can you guess?
Adam Leventhal wrote:
 On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote:
 How so? In my opinion, it seems like a cure for the brain damage of RAID-5.
 Nope.

 A decent RAID-5 hardware implementation has no 'write hole' to worry about, 
 and one can make a software implementation similarly robust with some effort 
 (e.g., by using a transaction log to protect the data-plus-parity 
 double-update or by using COW mechanisms like ZFS's in a more intelligent 
 manner).
 
 Can you reference a software RAID implementation which implements a solution
 to the write hole and performs well.

No, but I described how to use a transaction log to do so and later on in the 
post how ZFS could implement a different solution more consistent with its 
current behavior.  In the case of the transaction log, the key is to use the 
log not only to protect the RAID update but to protect the associated 
higher-level file operation as well, such that a single log force satisfies 
both (otherwise, logging the RAID update separately would indeed slow things 
down - unless you had NVRAM to use for it, in which case you've effectively 
just reimplemented a low-end RAID controller - which is probably why no one has 
implemented that kind of solution in a stand-alone software RAID product).

...
 
 The part of RAID-Z that's brain-damaged is its 
 concurrent-small-to-medium-sized-access performance (at least up to request 
 sizes equal to the largest block size that ZFS supports, and arguably 
 somewhat beyond that):  while conventional RAID-5 can satisfy N+1 
 small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in 
 parallel (though the latter also take an extra rev to complete), RAID-Z can 
 satisfy only one small-to-medium access request at a time (well, plus a 
 smidge for read accesses if it doesn't verify the parity) - effectively 
 providing RAID-3-style performance.
 
 Brain damage seems a bit of an alarmist label.

I consider 'brain damage' to be if anything a charitable characterization.

 While you're certainly right
 that for a given block we do need to access all disks in the given stripe,
 it seems like a rather quaint argument: aren't most environments that matter
 trying to avoid waiting for the disk at all?

Everyone tries to avoid waiting for the disk at all.  Remarkably few succeed 
very well.

 Intelligent prefetch and large
 caches -- I'd argue -- are far more important for performance these days.

Intelligent prefetch doesn't do squat if your problem is disk throughput (which 
in server environments it frequently is).  And all caching does (if you're 
lucky and your workload benefits much at all from caching) is improve your 
system throughput at the point where you hit the disk throughput wall.

Improving your disk utilization, by contrast, pushes back that wall.  And as I 
just observed in another thread, not by 20% or 50% but potentially by around 
two decimal orders of magnitude if you compare the sequential-scan performance 
of multiple randomly-updated database tables on a moderately coarsely-chunked 
conventional RAID with that on a fine-grained ZFS block size (e.g., the 16 KB 
used by the example database) with each block sprayed across several disks.

Sure, that's a worst-case scenario.  But two orders of magnitude is a hell of a 
lot, even if it doesn't happen often - and suggests that in more typical cases 
you're still likely leaving a considerable amount of performance on the table 
even if that amount is a lot less than a factor of 100.
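To put a rough number on the quoted N+1 claim (an idealized, hedged example:
an 8+1-disk group, each disk good for roughly 100 small random reads per
second, and every RAID-Z block striped across the whole group):

\[
\text{conventional RAID-5:}\ (N+1)\times 100 = 9\times 100 = 900\ \text{reads/s},
\qquad
\text{RAID-Z:}\ \approx 1\times 100 = 100\ \text{reads/s}
\]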

 
 The easiest way to fix ZFS's deficiency in this area would probably be to 
 map each group of N blocks in a file as a stripe with its own parity - which 
 would have the added benefit of removing any need to handle parity groups at 
 the disk level (this would, incidentally, not be a bad idea to use for 
 mirroring as well, if my impression is correct that there's a remnant of 
 LVM-style internal management there).  While this wouldn't allow use of 
 parity RAID for very small files, in most installations they really don't 
 occupy much space compared to that used by large files so this should not 
 constitute a significant drawback.
 
 I don't really think this would be feasible given how ZFS is stratified
 today, but go ahead and prove me wrong: here are the instructions for
 bringing over a copy of the source code:
 
   http://www.opensolaris.org/os/community/tools/scm

Now you want me not only to design the fix but code it for you?  I'm afraid 
that you vastly overestimate my commitment to ZFS:  while I'm somewhat 
interested in discussing it and happy to provide what insights I can, I really 
don't personally care whether it succeeds or fails.

But I sort of assumed that you might.

- bill
 
 


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-15 Thread can you guess?
...

 For modern disks, media bandwidths are now getting to be > 100 MBytes/s.
 If you need 500 MBytes/s of sequential read, you'll never get it from
 one disk.

And no one here even came remotely close to suggesting that you should try to.

 You can get it from multiple disks, so the questions are:
 1. How to avoid other bottlenecks, such as a shared fibre channel
    path?  Diversity.
 2. How to predict the data layout such that you can guarantee a wide
    spread?

You've missed at least one more significant question:

3.  How to lay out the data such that this 500 MB/s drain doesn't cripple 
*other* concurrent activity going on in the system (that's what increasing the 
amount laid down on each drive to around 1 MB accomplishes - otherwise, you can 
easily wind up using all the system's disk resources to satisfy that one 
application, or even fall short if you have fewer than 50 disks available, 
since if you spread the data out relatively randomly in 128 KB chunks on a 
system with disks reasonably well-filled with data you'll only be obtaining 
around 10 MB/s from each disk, whereas with 1 MB chunks similarly spread about 
each disk can contribute more like 35 MB/s and you'll need only 14 - 15 disks 
to meet your requirement).
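(Spelling out the arithmetic behind those disk counts, with the per-disk rates
assumed above:)

\[
\frac{500\ \mathrm{MB/s}}{10\ \mathrm{MB/s\ per\ disk}} = 50\ \mathrm{disks}
\qquad\text{vs.}\qquad
\frac{500\ \mathrm{MB/s}}{35\ \mathrm{MB/s\ per\ disk}} \approx 14.3\ \mathrm{disks}
\]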

Use smaller ZFS block sizes and/or RAID-Z and things get rapidly worse.

- bill
 
 


[zfs-discuss] Fwd: ZFS for consumers WAS:Yager on ZFS

2007-11-15 Thread Paul Kraus
Sent from the correct address...

-- Forwarded message --
From: Paul Kraus [EMAIL PROTECTED]
Date: Nov 15, 2007 12:57 PM
Subject: Re: [zfs-discuss] ZFS for consumers WAS:Yager on ZFS
To: zfs-discuss@opensolaris.org


On 11/15/07, can you guess? [EMAIL PROTECTED] wrote:
 ...

  At home the biggest reason I went with ZFS for my data is ease of
  management. I split my data up based on what it is ... media (photos,
  movies, etc.), vendor stuff (software, datasheets, etc.), home
  directories, and other misc. data. This gives me a good way to control
  backups based on the data type.

 It's not immediately clear why simply segregating the different data
 types into different directory sub-trees wouldn't allow you to do pretty
 much the same thing.

An old habit ... I think about backups along the lines of
ufsdumps of entire filesystems, I know, an outdated model.

 I also like being able to see how much space I am using for
each with a simple df rather than a du (which takes a while to run). I
can also tune compression on a per-data-type basis (there's no real point
in trying to compress media files that are already-compressed MPEGs and
JPEGs).
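The corresponding knobs, as a hedged illustration (the dataset names are
invented):

zfs set compression=on tank/home
zfs set compression=off tank/media        # already-compressed MPEGs/JPEGs gain nothing
zfs list -o name,used,avail,compressratio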

--
Paul Kraus
Albacon 2008 Facilities


-- 
Paul Kraus
Albacon 2008 Facilities


[zfs-discuss] Macs compatibility (was Re: Yager on ZFS)

2007-11-15 Thread Anton B. Rang
This is clearly off-topic :-) but perhaps worth correcting --

"Long-time MAC users must be getting used to having their entire world 
disrupted and having to re-buy all their software. This is at least the 
second complete flag-day (no forward or backwards compatibility) change 
they've been through."

Actually, no; a fair number of Macintosh applications written in 1984, for the 
original Macintosh, still run on machines/OSes shipped in 2006. Apple provided 
processor compatibility by emulating the 68000 series on PowerPC, and the 
PowerPC on Intel; and OS compatibility by providing essentially a virtual 
machine running Mac OS 9 inside Mac OS X (up through 10.4).

Sadly, Mac OS 9 applications no longer run on Mac OS 10.5, so it's true that 
the world is disrupted now for those with software written prior to 2000 or 
so.

To make this vaguely Solaris-relevant, it's impressive that SunOS 4.x 
applications still generally run on Solaris 10, at least on SPARC systems, 
though Sun doesn't do processor emulation. Still not very ZFS-relevant. :-)
 
 


Re: [zfs-discuss] Yager on ZFS

2007-11-15 Thread Marc Bevand
can you guess? billtodd at metrocast.net writes:
 
 You really ought to read a post before responding to it:  the CERN study
 did encounter bad RAM (and my post mentioned that) - but ZFS usually can't
 do a damn thing about bad RAM, because errors tend to arise either
 before ZFS ever gets the data or after it has already returned and checked
 it (and in both cases, ZFS will think that everything's just fine).

According to the memtest86 author, corruption most often occurs at the moment 
memory cells are written to, by causing bitflips in adjacent cells. So when a 
disk DMAs data to RAM and corruption occurs as the DMA operation writes to 
the memory cells, ZFS will detect the corruption when it subsequently verifies 
the checksum.

Therefore ZFS is perfectly capable of detecting (and even likely to detect) 
memory corruption during simple read operations from a ZFS pool.

Of course there are other cases where neither ZFS nor any other checksumming 
filesystem is capable of detecting anything (e.g. the sequence of events: data 
is corrupted, checksummed, written to disk).

-- 
Marc Bevand
