[ltp] Re: solid state drive?

Theodore Tso linux-thinkpad@linux-thinkpad.org
Wed, 18 Feb 2009 14:36:35 -0500


On Wed, Feb 18, 2009 at 06:48:31PM +0100, Laurent wrote:
> Yes and no: let's assume you create a new file and write a bit of data.
> The journal needs to know about (I'm not 100% sure about ext3,
> extrapolating from DB-commit logs here):
>
> ....
>
> Each point is a write. These will not translate into one big
> block-device write. There should be at least 2, probably 3, maybe
> even more block-device writes. And each write causes a block erase
> (in generation 0 and 1 SSDs).

The ext3 journaling combines multiple filesystem operations into a
single commit.  By default, commits take place every 5 seconds (or
every 10 minutes if laptop mode is enabled); commits can happen sooner
if there is memory pressure, or if the application explicitly calls
fsync().  This tends to avoid small writes for the journal in general,
unless there are problematic applications that are calling fsync() all
of the time.
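(The commit interval is tunable per mount, so you can get the
laptop-mode behavior without enabling all of laptop mode.  A sketch,
assuming a root filesystem on ext3; the device and interval are just
examples:)

```shell
# Lengthen the ext3 journal commit interval to 10 minutes (600 s),
# trading a larger crash window for far fewer journal writes --
# the same trade-off laptop mode makes.
mount -o remount,commit=600 /

# To make it persistent, add the option in /etc/fstab, e.g.:
#   /dev/sda2  /  ext3  defaults,commit=600  0  1
```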

>> Small files are small files, and there isn't too much that can be done
>> about them; however, the ext4 filesystem *does* have the ability to
>> understand that files should be aligned on raid stripes, so with the
>> appropriate mke2fs parameters, it should be possible to align large
>> files on 128k erase blocks, which should help the write amplification
>> problem tremendously.
>
> Nope. Example...
>

Sorry, I combined two thoughts into one.  As I said, for small files,
there's not much we can do.  So we agree there.

But for _large_ files, aligning on erase boundaries *should* help.  It
avoids the need to "read the block", "erase the whole block" and
"write back the whole block"; and even if the block was previously
fragmented because of small files, the moment we can allocate the
block for a large file, it should allow the SSD to reassemble it into
a single contiguous block.
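(The RAID-stripe knobs I mentioned can be pressed into service for
this.  A sketch, assuming 4 KiB filesystem blocks and a 128 KiB erase
block -- check your drive's actual erase-block size, and /dev/sdX is a
placeholder:)

```shell
# With 4 KiB blocks, a 128 KiB erase block spans 32 filesystem blocks.
# Declaring that as the "stripe width" makes the ext4 allocator align
# large-file extents on 32-block (i.e. erase-block) boundaries.
mke2fs -t ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sdX
```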

> The system fails if you have no more free blocks to play
> with. Things go south once all your blocks are 70% valid
> and contain old, "overwritten" data. At that point you
> need an SSD-internal defrag to get some fresh free blocks.
> And the Intels fail to do that.

Where did you get the 70% figure from?  That seems to be a pretty low
number.  I've been given the hint from someone who should know that
absent the TRIM command, reserving a 1GB partition which is left
completely unused and untouched is enough to significantly help.
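(One way to set that up; a sketch, assuming an 80 GiB drive and a
placeholder device name -- the point is simply that the last 1 GiB of
LBAs is never formatted, mounted, or written:)

```shell
# Carve out a 1 GiB region at the end of the SSD and never touch it.
# Since the controller never sees writes to those LBAs, its wear
# leveler always has clean erase blocks to draw on -- a crude
# stand-in for TRIM on drives that lack it.
parted /dev/sdX -- mkpart primary 79GiB 100%
# Do NOT run mkfs on it or mount it; leaving it untouched is the point.
```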

> TRIM & Co will be generation 3. Currently nothing supports it. Yes, it
> will solve problems in the future, but right now ... we will have to
> deal with the erase-all way of things and lots of generation 1 drives.

There is at least some claim that TRIM support might be coming with a
firmware update to the X25-M.  We'll see if that is true.  I hope so,
given how much I paid for the silly thing!  ;-)

Personally, I'm not particularly interested in generation 1 drives.
The interesting question is how to tune a filesystem and the storage
stack so that what you call "gen 2" drives (for which the Intel X25-M
is the only shipping example I'm aware of at the moment, although
rumor has it that Sandisk will be shipping a product into this space
soon) will work well --- with or without TRIM support.

>> As far as your test lab is concerned, what program are you using to
>> simulate filesystem aging, out of curiosity --- and are you seeing the
>> performance degradation on writes only, or on reads as well?
>
> The test lab is about running DBs and email servers. We used some
> SSDs "for fun" to see if it helps or not. No large scale testing.
> So far they all suck for commit-logs/journals. Generation 1 didn't
> work fast at all. Generation 2 (consumer edition) writespeeds go
> south after 5-8 days of intensive use. The readspeeds suffer,
> but not much.

When you say intensive use, was this with the DB commit-logs or e-mail
use case, or both?  Both of these will probably be the worst case for
SSD's because of the high fsync() load on the filesystem.  So I can
definitely see how ext3 would be highly problematic for those sorts of
workloads.  
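(You can get a rough feel for how much the synchronous-write pattern
hurts with a quick comparison; a sketch, using `dd`'s `oflag=dsync` as
a stand-in for an application calling fsync() after every record:)

```shell
# Commit-log-style workload: many small writes, each forced to
# stable storage before the next one starts.
dd if=/dev/zero of=testfile bs=4k count=200 oflag=dsync

# The same writes, buffered; the throughput gap between the two runs
# is roughly what the fsync() pressure costs on a given drive.
dd if=/dev/zero of=testfile bs=4k count=200

rm -f testfile
```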

(Of course, using ext2 on an e-mail server where there's no guarantee
the filesystem will be consistent or that there won't be data loss
after a system crash has its own problems.  I suspect the right answer
for e-mail servers is to stick with HDD's, and if you really want to
throw money at the problem and have scalability problems, use multiple
servers and MX records for the front-end hub, and multiple PO/IMAP
servers for the back-end.  I used to be on MIT's network operations
group, so I know something about engineering very large scale mail
server infrastructure.  We played with battery-backed DRAM's for the
spool directory, but *man* that stuff was expensive....)

Personally, I see SSDs as being most useful for laptop workloads,
where the power and shock resistance is worthwhile.  Given the price
issues with SSDs, in enterprise database workloads, RAID will
probably give you better write speeds at a given price level, and for
commit logs and mail servers, write speeds will generally be far more
important than read speeds.  I suppose you're right that they're
useful for read-only or read-mostly DB's, but that's not terribly
interesting, is it?  :-) Most of the read-only tables I can think
about are small enough they can fit in memory --- I suppose the entire
airline pricing rules db as used by Orbitz and company might be one
such example, but even then, it's probably not *that* big.

     	      	       	     	  	       - Ted