[ltp] Re: solid state drive?

Theodore Tso linux-thinkpad@linux-thinkpad.org
Wed, 18 Feb 2009 10:15:13 -0500


On Wed, Feb 18, 2009 at 03:19:21PM +0100, Laurent wrote:
> Hi,
>
>> Write amplification has been addressed by next-generation SSD's (at
>> the moment the Intel SSD's are alone in the market, but I expect other
>> manufacturers will come up with competing products soon.)
>
> The Intel SSDs address one problem and create another. The first
> article I linked benchmarks an Intel drive after several months of
> use. The really bad performance is caused by the anti-amplification
> algorithm. The effect is stronger if you use lots of small writes. I
> have seen the same thing in our test lab: the performance after 2
> months depends on the type of data written. Swap and ext3 + small
> files kill the performance faster than ext2 + big files.

I've been looking into this.  There are a couple of things going on
here, actually.  The first is lack of TRIM support.  This means that
even when you delete a file, the SSD has no way of knowing the file
has been erased, so when you do a small write, it may end up
needlessly copying data blocks that it doesn't need to copy.  This is
why, in the article you linked to, the performance degradation could
be reset by using an ATA SECURITY ERASE --- this allows the disk to
understand that all of the sectors are no longer in use.
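For the record, a SECURITY ERASE can be issued from Linux with hdparm.
A hedged sketch --- the device name and the throwaway password "p" are
placeholders, this destroys all data on the drive, and on many systems
the drive comes up "frozen" and must be unfrozen (e.g. via a
suspend/resume cycle) first; check your hdparm man page:

```shell
# WARNING: wipes the entire drive.  /dev/sdX and password "p"
# are placeholders, not real values.
hdparm --user-master u --security-set-pass p /dev/sdX   # set a temporary password
hdparm --user-master u --security-erase   p /dev/sdX    # issue ATA SECURITY ERASE
```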

The second problem is that the default MS-DOS partition scheme doesn't
align the filesystem on the erase blocks, which makes it almost
impossible for the filesystem to align its writes even if it tried.

As far as the journaling is concerned, the journal is a contiguous
file on disk so writes to it end up being efficient.  Since the
journal wraps and writes to it are done contiguously, the Intel SSD
shouldn't have too many problems dealing with the journal.  I just got
my SSD yesterday, so I need to run some tests to confirm this, but I
don't foresee a big issue here.

The big deal is making sure that the filesystem is aligned on an erase
block, which I can do if I choose partitioning parameters of 224 heads
and 56 sectors/track --- that leads to 12544 sectors/cylinder, or
49*128k per cylinder.  The first partition will still not be 128k
aligned, but that's OK --- we'll just call that /boot, which doesn't
get modified very often.  Subsequent partitions will be aligned on a
cylinder boundary, so will be 128k aligned, and after I do a quick
hack to e2fsprogs, the journal will also be conveniently set up to be
aligned on a 128k boundary.
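The arithmetic is easy to check (assuming the usual 512-byte sectors):

```shell
# 224 heads x 56 sectors/track, 512-byte sectors, 128k = 131072 bytes.
heads=224
spt=56
sectors_per_cyl=$((heads * spt))                       # 12544
bytes_per_cyl=$((sectors_per_cyl * 512))               # 6422528
erase_blocks_per_cyl=$((bytes_per_cyl / 131072))       # 49, with no remainder
echo "$sectors_per_cyl sectors/cylinder = ${erase_blocks_per_cyl}x128k"
```

If I remember right the geometry can be forced at partitioning time
with something like "fdisk -H 224 -S 56 /dev/sdX", but check your
fdisk version.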

Small files are small files, and there isn't too much that can be done
about them; however, the ext4 filesystem *does* have the ability to
understand that files should be aligned on raid stripes, so with the
appropriate mke2fs parameters, it should be possible to align large
files on 128k erase blocks, which should help the write amplification
problem tremendously.
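Since mke2fs already takes RAID geometry via -E, the erase block can
be described the same way.  A sketch, assuming 4k filesystem blocks
and treating the 128k erase block as a single-disk stripe (the device
name is a placeholder, and whether stride should equal stripe-width on
an SSD is my assumption, not gospel):

```shell
# 128k erase block / 4k filesystem block = 32 blocks per "stripe".
fs_block=4096
erase_block=131072
width=$((erase_block / fs_block))
echo "mke2fs -t ext4 -b $fs_block -E stride=$width,stripe-width=$width /dev/sdXn"
```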

As far as your test lab is concerned, what program are you using to
simulate filesystem aging, out of curiosity --- and are you seeing the
performance degradation on writes only, or on reads as well?

> For the current generation it's a matter of erasing the blocks fast
> enough. Most SSDs are fast as long as the blocks written by the OS
> are as big as or bigger than the native block size. The ext3 journal
> is messy and syncs lots of small blocks. That makes the disk a lot
> slower if you use lots of small files (like my email client or my
> dev-app).

Yeah, one of the things I'm looking into is how to tune ext4's
allocation algorithm to work better on SSD's.  In practice this means
turning off some of ext3/ext4's anti-fragmentation code, since if you
are sucking down a large number of small files into a maildir
directory (for example) you want to keep them packed tightly together
so they only consume a single erase block.

So for example, I suspect that for ext2 and ext3, mounting the
filesystem with the "noreservation" mount option would be a very good
thing to help optimize small files.  Ext4 has a different (and far
more complicated) block allocator, so we'll probably need different
tuning parameters to accommodate SSD's.  This is definitely something
I'm looking into --- I figure I can get a paper and a trip to
Linux.conf.au out of it.  :-)
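For ext2/ext3 that tuning is just a mount option; a sketch (device and
mount point are placeholders):

```shell
# Disable ext3's block-reservation windows so small files pack
# tightly into the same erase block.
mount -t ext3 -o noreservation /dev/sdXn /mnt/ssd
```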

						- Ted