Thursday, August 19, 2010

The 2010 Linux Storage and Filesystem Summit, day 2

The second day of the 2010 Linux Storage and Filesystem Summit was held on August 9 in Boston. Those who have not yet read the coverage from day 1 may want to start there. This day's topics were, in general, more detailed and technical and less amenable to summarization here. Nonetheless, your editor will try his best.

Writeback

The first session of the day was dedicated to the writeback issue. Writeback, of course, is the process of writing modified pages of files back to persistent store. There have been numerous complaints over recent years that writeback performance in Linux has regressed; the curious reader can refer to this article for some details, or this bugzilla entry for many, many details. The discussion was less focused on this specific problem, though; instead, the developers considered the problems with writeback as a whole.

Sorin Faibish started with a discussion of some research that he has done in this area. The challenges for writeback are familiar to those who have been watching the industry: the size of our systems - in terms of both memory and storage - has increased, but the speed of those systems has not increased proportionally. As a result, writing back a given percentage of a system's pages takes longer than it once did, and it is ever easier for the writeback code to fall behind the processes that are dirtying pages, leading to poor performance.
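
The arithmetic behind that claim is easy to see. The figures in the sketch below are illustrative assumptions - not numbers from the talk - but they show how a system whose memory grows much faster than its disk bandwidth takes ever longer to flush the same fraction of its pages:

    /* Rough, assumed figures: an older small machine vs. a newer large
     * one.  Memory grows ~64x here, but disk bandwidth only ~3x. */
    #include <stdio.h>

    int main(void)
    {
        double mem_gb[]   = { 1.0, 64.0 };
        double disk_mbs[] = { 40.0, 120.0 };

        for (int i = 0; i < 2; i++) {
            double dirty_mb = mem_gb[i] * 1024 * 0.20;  /* flush 20% of RAM */
            printf("%4.0f GB RAM, %4.0f MB/s disk: %6.1f s to write back 20%%\n",
                   mem_gb[i], disk_mbs[i], dirty_mb / disk_mbs[i]);
        }
        return 0;
    }

On those assumptions, the small machine can flush 20% of its memory in about five seconds; the large one needs nearly two minutes.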

His assertion is that the use of watermarks to control writeback is no longer appropriate for contemporary systems. Writeback should not wait until a certain percentage of memory is dirty; it should start sooner and, crucially, be tied to the rate at which processes are dirtying pages. The system, he says, should work much more aggressively to ensure that the writeback rate matches the dirty rate.
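
A minimal userspace sketch may make the idea concrete. This is not the kernel's writeback code - the structure, function, and numbers below are invented for illustration - but it shows what a rate-matched flusher could look like: rather than waiting for a watermark, it retunes its target every sampling interval to track the observed dirtying rate.

    #include <stdio.h>

    struct wb_state {
        unsigned long dirtied;  /* pages dirtied during the last interval */
        unsigned long written;  /* pages written back during the interval */
        unsigned long wb_rate;  /* flusher's target, in pages per interval */
    };

    /* Run once per sampling interval: nudge the writeback target toward
     * the observed dirtying rate instead of waiting for a watermark. */
    static void wb_tune(struct wb_state *s)
    {
        if (s->dirtied > s->written)
            s->wb_rate += (s->dirtied - s->written) / 2;
        else if (s->wb_rate > s->dirtied)
            s->wb_rate -= (s->wb_rate - s->dirtied) / 2;
        s->dirtied = s->written = 0;
    }

    int main(void)
    {
        struct wb_state s = { 0, 0, 0 };

        for (int t = 0; t < 5; t++) {
            s.dirtied = 1000;       /* workload dirties 1000 pages/interval */
            s.written = s.wb_rate;  /* assume the flusher meets its target */
            wb_tune(&s);
            printf("interval %d: writeback target = %lu pages\n", t, s.wb_rate);
        }
        return 0;
    }

After a few intervals the target converges on the dirtying rate (500, 750, 875, ... pages per interval), which is exactly the matching behavior Faibish argued for.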

From there, the discussion wandered through a number of specific issues. Linux writeback now works by flushing out the pages of one file (inode) at a time, in the hope that those pages will be located near each other on disk. The memory management code will normally ask the filesystem to flush out up to 4MB of data for each inode. One poorly-kept secret of Linux memory management is that filesystems routinely ignore that request: if more dirty pages are available, they typically flush far more data than they were asked to. It is only by generating much larger I/O requests that they can get the best performance.
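
That behavior is easy to caricature in userspace. The types below are simplified stand-ins - not the kernel's real writeback_control structure or any filesystem's actual writepages() implementation - but they capture the pattern: the request is for 1024 pages (4MB, assuming 4KB pages), and the "filesystem" keeps writing as long as the dirty pages form one contiguous extent.

    #include <stdio.h>

    #define MAX_WRITEBACK_PAGES 1024   /* the ~4MB per-inode request */

    struct toy_wbc {
        long nr_to_write;              /* pages the MM layer asked for */
    };

    /* Write 'dirty' pages for one inode; 'contiguous' says whether the
     * dirty pages form a single on-disk extent.  One loop pass stands
     * in for writing one page. */
    static long toy_writepages(struct toy_wbc *wbc, long dirty, int contiguous)
    {
        long written = 0;

        while (dirty-- > 0) {
            written++;
            wbc->nr_to_write--;
            /* A literal-minded filesystem would stop once the request
             * was satisfied; many real ones keep going while the extent
             * lasts, because larger I/O requests perform far better. */
            if (wbc->nr_to_write <= 0 && !contiguous)
                break;
        }
        return written;
    }

    int main(void)
    {
        struct toy_wbc wbc = { .nr_to_write = MAX_WRITEBACK_PAGES };

        printf("asked for %d pages, wrote %ld\n",
               MAX_WRITEBACK_PAGES, toy_writepages(&wbc, 3000, 1));
        return 0;
    }

Running it reports 3000 pages written against a request for 1024 - roughly the behavior the memory management code sees from real filesystems when a large contiguous extent is dirty.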
