[ltp] Re: solid state drive?

Laurent linux-thinkpad@linux-thinkpad.org
Wed, 18 Feb 2009 18:48:31 +0100


Hi,

> I've been looking into this.  There are a couple of things going on
> here, actually.  The first is lack of TRIM support. (...)

TRIM & Co. will be generation 3. Currently nothing supports it. Yes, it
will solve problems in the future, but right now we will have to deal
with the erase-a-whole-block way of doing things and lots of
generation 1 drives.

SSDs will get a lot better once everything knows about the flash.

> The second problem is that the default MS-DOS partition scheme doesn't
> align the filesystem on the erase blocks, which makes it almost
> impossible for the filesystem to do a good job even if it could.

Yes, that is also part of the once-everything-knows-about-the-flash
problem.

BTW: you don't need a partition table under Linux; mkfs.xyz /dev/sdb
works fine, and the filesystem then starts at sector 0, so it is
aligned by definition. But there is no space for a bootloader.

> As far as the journaling is concerned, the journal is a contiguous
> file on disk so writes to it end up being efficient.

Yes and no: let's assume you create a new file and write a bit of data.
The journal needs to know about (I'm not 100% sure about ext3,
extrapolating from DB commit logs here):

- the new file (first write, 10-50 bytes, metadata)
- the change in the directory-node (another small write)
- the change in the access time of the dir-node (another one,
   can be switched off)
- the commit of these changes (...)

(writing file content)

- the change in the free-node list
- changes in the metadata (size, maybe atime (can be disabled))
- the commit of these changes.

Each point is a write. These will not translate into one big
block-device write. There should be at least 2, probably 3, maybe
even more block-device writes. And each write causes a block erase
(in generation 0 and 1 SSDs).
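
A rough toy model in Python of what that costs on a drive that
updates in place (the record sizes are guesses, not ext3's real
journal format):

    # Toy model: every journal record smaller than an erase block still
    # costs a full erase on a gen 0/1 SSD that updates in place.
    ERASE_BLOCK = 128 * 1024                 # typical erase block: 128 KB

    # Hypothetical record sizes for the create-and-write scenario above.
    journal_records = [
        ("new inode",        50),
        ("directory entry",  50),
        ("dir atime",        50),
        ("commit record 1", 512),
        ("free-block list",  50),
        ("inode metadata",   50),
        ("commit record 2", 512),
    ]

    payload = sum(size for _, size in journal_records)
    # Worst case: each record is flushed on its own, so each one
    # triggers a read/erase/program cycle of a full 128 KB block.
    erased = len(journal_records) * ERASE_BLOCK
    print("payload written:", payload, "bytes")
    print("flash erased:   ", erased, "bytes",
          "(~%dx amplification)" % (erased // payload))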

For gen. 2, like the Intels, each write can be rerouted to a
different flash block. That sounds like a very big LUT and a lot of
trouble, but it will be faster and ext3 will no longer suffer.
And these new SSDs do NCQ, which should combine a lot of writes
into one big commit.
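
If the drive really can queue and merge those small writes (NCQ plus
a write cache; how exactly gen 2 firmware uses NCQ is my assumption),
the toy model above collapses to a single erase:

    # Same record sizes as above: merged before hitting flash, the
    # seven journal records share one erase cycle instead of seven.
    ERASE_BLOCK = 128 * 1024
    records = [50, 50, 50, 512, 50, 50, 512]

    unmerged = len(records) * ERASE_BLOCK             # one erase each
    merged = -(-sum(records) // ERASE_BLOCK) * ERASE_BLOCK  # ceil-division
    print(unmerged // 1024, "KB erased unmerged")     # 896 KB
    print(merged // 1024, "KB erased merged")         # 128 KB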

> The big deal is making sure that the filesystem is aligned on an erase
> block, which I can do if I choose partitioning parameters of 224 heads
> and 56 sectors/track --- that leads to 12544 sectors/cylinder, or
> 49*128k per cylinder.  The first partition will still not be 128k
> aligned, but that's OK --- we'll just call that /boot, which doesn't
> get modified very often.  Subsequent partitions will be aligned on a
> cylinder boundary, so will be 128k aligned, and after I do a quick
> hack to e2fsprogs, the journal will also be conveniently set up to be
> aligned on a 128k boundary.

That may or may not help. I'm not sure if the Intels can profit from
that setup.
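
The arithmetic in the quoted scheme does check out, though; a quick
sanity check:

    # 224 heads * 56 sectors/track at 512 bytes/sector should give a
    # cylinder of exactly 49 erase blocks (49 * 128 KB).
    heads, sectors_per_track, sector_size = 224, 56, 512
    cylinder = heads * sectors_per_track * sector_size
    print(cylinder, cylinder // (128 * 1024))   # 6422528 49
    assert cylinder % (128 * 1024) == 0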

> Small files are small files, and there isn't too much that can be done
> about them; however, the ext4 filesystem *does* have the ability to
> understand that files should be aligned on raid stripes, so with the
> appropriate mke2fs parameters, it should be possible to align large
> files on 128k erase blocks, which should help the write amplification
> problem tremendously.

Nope. Example:

The SSD has a block A filled with data:

XXXYYYZZZ

The OS starts a new file; it gets allocated somewhere inside
block A. The file content is:

1122

So the SSD turns A into:

XX1122ZZZ

To get that, the SSD has to (in one way or another; see the sketch
after this list):
- read the block
- erase the whole block
- write back the whole block.
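
In Python terms (a sketch of the principle, not any real firmware):

    # Gen 0/1 in-place update: a 4-byte logical write still costs a
    # read/erase/program cycle of the whole erase block around it.
    def update_in_place(flash, block_no, offset, data):
        block = flash[block_no]                 # read the block
        flash[block_no] = None                  # erase the whole block
        flash[block_no] = (block[:offset] + data
                           + block[offset + len(data):])  # write it all back
        return flash

    flash = {0: "XXXYYYZZZ"}                    # block A from the example
    print(update_in_place(flash, 0, 2, "1122")) # {0: 'XX1122ZZZ'}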

The Intel way is to split the flash blocks into smaller parts
(same example again, with generation 2): the new data (1122)
is written. The old block A still has valid data inside, so the SSD
searches for space in a free, pre-erased block (let's call it B)
and writes the data there:

A:         B:
XXXYYYZZZ  1122FFFFF

The FFFFF part of B is still erased and can be used for the next
small write. The XYYY part of A is now invalid, and access to that
region is rerouted to B. A is erased once all the data in A has
become invalid.

Generation 2 does not need aligned files. It simply does
not care.
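
A sketch of that redirect-on-write idea (again my reading of how it
works, not Intel's actual firmware; the LUT here is absurdly
simplified):

    # Gen 2 redirect-on-write: small writes land in a pre-erased block,
    # and a lookup table reroutes later reads; no erase on the hot path.
    blocks = {"A": "XXXYYYZZZ", "B": "FFFFFFFFF"}   # B is pre-erased
    free_ptr = {"B": 0}             # next free offset per erased block
    lut = {}                        # (logical offset, length) -> location

    def redirect_write(logical_off, data):
        blk = "B"                               # any block with free space
        off = free_ptr[blk]
        b = blocks[blk]
        blocks[blk] = b[:off] + data + b[off + len(data):]
        free_ptr[blk] = off + len(data)
        lut[(logical_off, len(data))] = (blk, off)  # reroute future reads

    redirect_write(2, "1122")
    print(blocks)   # {'A': 'XXXYYYZZZ', 'B': '1122FFFFF'}
    print(lut)      # {(2, 4): ('B', 0)}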

The system fails if you have no more free blocks to play
with. Things go south once all your blocks are 70% valid
and contain old, "overwritten" data. At that point you need
an SSD-internal defrag (garbage collection) to get some fresh
free blocks. And the Intels fail to do that.

TRIM will increase the number of free blocks and solve the
problem, if the fs supports it (and it creates a new one: how
do you recover deleted files if the SSD no longer knows where
they are...).
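
As a sketch, what TRIM adds to the picture (assuming a device that
tracks per-fragment validity, which is my simplification):

    # With TRIM the filesystem reports dead logical ranges, so blocks
    # that would otherwise look "70% valid" forever become erasable.
    valid = {("A", 0): True, ("A", 4): True, ("B", 0): True}

    def trim(block, offset):
        valid[(block, offset)] = False   # device forgets the data

    def erasable(block):
        # A block can be erased once nothing in it is still valid.
        return not any(v for (b, _), v in valid.items() if b == block)

    trim("A", 0)
    trim("A", 4)
    print(erasable("A"))   # True: A returns to the free pool
    print(erasable("B"))   # False: B still holds live data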

> As far as your test lab is concerned, what program are you using to
> simulate filesystem aging, out of curiosity --- and are you seeing the
> performance degradation on writes only, or on reads as well?

The test lab is about running DBs and email servers. We used some
SSDs "for fun" to see if they help or not. No large-scale testing.
So far they all suck for commit logs/journals. Generation 1 wasn't
fast at all. Generation 2 (consumer edition) write speeds go
south after 5-8 days of intensive use. The read speeds suffer,
but not much.

Generation 1 drives are cool for swap if you use big blocks
(512 KB), but the bang/buck ratio still sucks if you can just add
another 4 GB of RAM. The cost of generation 2 is too high for swap
(and I didn't test them after the results from the commit/journal
test).

SSDs are best for read-only DBs containing a lot of small records
read in random order. These rock with all generations.

> I figure I can get a paper and a trip to
> Linux.conf.au out of it.  :-)

I wish you luck and want a copy.

cu