============================================== Permanent Data Storage (Flash) ============================================== :TEP: 103 :Group: Core Working Group :Type: Documentary :Status: Final :TinyOS-Version: 2.x :Author: David Gay, Jonathan Hui .. Note:: This memo documents a part of TinyOS for the TinyOS Community, and requests discussion and suggestions for improvements. Distribution of this memo is unlimited. This memo is in full compliance with TEP 1. Abstract ==================================================================== This memo documents a set of hardware-independent interfaces to non-volatile storage for TinyOS 2.x. It describes some design principles for the HPL and HAL layers of various flash chips. 1. Introduction ==================================================================== Flash chips are a form of EEPROM (electrically-erasable, programmable read-only memory), distinguished by a fast erase capability. However, erases can only be done in large units (from 256B to 128kB depending on the flash chip). Erases are the only way to switch bits from 0 to 1, and programming operations can only switch 1's to 0's. Additionally, some chips require that programming only happen once between each erase, or that it be in relatively large units (e.g., 256B). In the table below, we summarise these differences by categorising flash chips by their underlying technology (NOR vs NAND). We also include a column for Atmel's AT45DB flash chip family, as it has significantly different tradeoffs than other flash chips: ============= ================== ================= =================== X NOR AT45DB NAND (ex: ST M25P40, (ex: Samsung Intel PXA27x) K9K1G08R0B) ============= ================== ================= =================== Erase Slow (seconds) Fast (ms) Fast (ms) Erase unit Large (64KB-128KB) Small (256B) Medium (8KB-32KB) Writes Slow (100s kB/s) Slow (60kB/s) Fast (MBs/s) Write unit 1 bit 256B 100's of bytes Bit-errors Low Low High (requires ECC, bad-block mapping) Read Bus limited [*]_ Slow+Bus limited Bus limited Erase cycles 10^4 - 10^5 10^4 [*]_ 10^5 - 10^7 Intended use Code storage Data storage Data storage Energy/byte 1uJ 1uJ .01uJ ============= ================== ================= =================== .. [*] M25P40 reads are limited by the use of a 25MHz SPI bus. The PXA27x flash is memory mapped (reads are very fast and can directly execute code). .. [*] Or infinite? Data sheet just says that every page within a sector must be written every 10^4 writes within that sector The energy/byte is the per-byte cost of erasing plus programming. It is derived from the timing and power consumption of erase and write operations (for NOR flash, values are for the STMicroelectronics M25P family, for NAND flash, values are from the Samsung datasheet). Energy/byte for reads appears to depend mostly on how long the read takes (the power consumptions are comparable), i.e., on the efficiency of the bus + processor. Early TinyOS platforms all used a flash chip from the AT45DB family. In TinyOS 1.x, this chip could be accessed through three different components: - Using a low-level interface (``PageEEPROMC``) which gave direct access to per-page read, write and erase operations. - Using a high-level memory-like interface (``ByteEEPROMC``) with read, write and logging operations. - Using a simple file system (``Matchbox``) with sequential-only files [1_]. Some more recent platforms use different flash chips: the ST M25P family (Telos rev. B, eyes) and the Intel Strataflash (Intel Mote2). None of the three components listed above are supported on these chips: - The ``PageEEPROMC`` component is (and was intended to be) AT45DB-specific - ``ByteEEPROMC`` allows arbitrary rewrites of sections of the flash. This is not readily implementable on a flash chip with large erase units. - The ``Matchbox`` implementation was AT45DB-specific. It was not reimplemented for these other chips, in part because it does not support some applications (e.g., network reprogramming) very well. 2. Storage in TinyOS 2.x ==================================================================== One approach to hiding the differences between different flash chips is to provide a disk-like, block interface (with, e.g., 512B blocks). This is the approach taken by compact flash cards. However, in the context of TinyOS, this approach has several drawbacks: - This approach is protected by patents, making it difficult to provide in a free, open-source operating system. - To support arbitrary block writes where blocks are smaller than the erase unit, and to deal with the limited number of erase cycles/block requires remapping blocks. We believe that maintaining this remapping table is too expensive on many mote-class devices. A second approach is to provide a generic low-level interface providing operations (read, write, erase) corresponding to the basic flash operations, along with information describing the flash chip's layout (minimum write and erase unit, timing information, etc). However, we believe that the large differences between NAND and NOR flash (see the table above), in particular the differences in reliability, write and erase units, make the design of a useful generic low-level interface tricky at best. We thus believe it is best, for now at least, to define high-level storage abstractions that are useful for sensor network applications, and leave their implementation up to each flash chip - such abstractions will be necessary anyway. We leave open the possibility that a future TEP may define portable lower-level flash interfaces (either for all flash chips, or, e.g., for NOR-family flashes). Such low-level interfaces would allow implementations of the storage abstractions defined in this TEP to be used for multiple flash chips. This TEP describes three high-level storage abstractions: large objects written in a single session, small objects with arbitrary reads and writes, and logs. TinyOS 2.x, divides flash chips into separate volumes (with sizes fixed at compile-time). Each volume provides a single storage abstraction (the abstraction defines the format). We prefer the use of single abstractions over fixed-size volumes over the use of a more general filing system (like Matchbox) for several reasons: - TinyOS is currently targeted at running a single application, and many applications know their storage needs in advance: for instance, a little space for configuration data, and everything else for a log of all sampled data. In such cases, the flexibility offered by a filing system (e.g., arbitrary numbers of files) is overkill, - Each abstraction is relatively easy to implement on a new flash chip, and has relatively little overhead. - The problem of dealing with the limited number of erase cycles/block is simplified: it is unlikely that user applications will need to rewrite the same small object 100'000 times, or cycle 100'000 times through their log. Thus the abstractions can mostly ignore the need for "wear levelling" (ensuring that each block of the flash is erased the same number of time, to maximise flash chip lifetime). New abstractions (including a filing system) can easily be added to this framework. The rest of this TEP covers some principles for the organisation of flash chips (Section 3), then describes the flash volumes and storage abstractions in detail (Section 4). 3. HPL/HAL/HIL Architecture ==================================================================== The flash chip architecture follows the three-layer Hardware Abstraction Architecture (HAA), with each chip providing a presentation layer (HPL, Section 3.1), adaptation layer (HAL, Section 3.2) and platform-independent interface layer (HIL, Section 3.3) [2_]. The implementation of these layers SHOULD be found in the ``tos/chips/CHIPNAME`` directory. If a flash chip is part of a larger family with a similar interface, the HAA SHOULD support all family members by relying, e.g., on platform-provided configuration information. Appendix A shows example HPL and HAL specifications for the AT45DB and ST M25P chip families. 3.1 Hardware Presentation Layer (HPL) -------------------------------------------------------------------- The flash HPL has a chip-dependent, platform-independent interface. The implementation of this HPL is platform-dependent. The flash HPL SHOULD be stateless. To remain platform independent, a flash chip's HPL SHOULD connect to platform-specific components providing access to the flash chip; these components SHOULD be placed in the ``tos/platforms/PLATFORM/chips/CHIPNAME`` directory. If the flash chip implementation supports a family of flash chips, this directory MAY also contain a file describing the particular flash chip found on the platform. 3.2 Hardware Adaptation Layer (HAL) -------------------------------------------------------------------- The flash HAL has a chip-dependent, platform-independent interface and implementation. Flash families with a common HPL SHOULD have a common HAL. Flash HAL's SHOULD expose a ``Resource`` interface and automatically power-manage the underlying flash chip. Finally, the flash HAL MUST provide a way to access the volume information specified by the programmer (see Section 3). This allows users to build new flash abstractions that interact cleanly with the rest of the flash system. 3.3 Hardware Interface Layer (HIL) -------------------------------------------------------------------- Each flash chip MUST support at least one of the storage abstractions described in Section 4. These abstractions SHOULD be presented in components named ``ChipAbstractionC``, e.g., ``At45dbLogStorageC``. Additionally, a flash chip implementation MAY support platforms with multiple instances of the same storage chip. The way in which this is achieved is not specified further in this TEP. Each platform MUST have ``AbstractionC`` components (e.g., ``LogStorageC``) implementing the storage abstractions of Section 4 supported by its flash chip(s). On platforms with multiple storage chips SHOULD redirect uses of ``AbstractionC`` to the appropriate storage chip, based on the requested volume. 4. Non-Volatile Storage Abstractions =================================================================== The HIL implementations are platform-independent, but chip (family) dependent. They implement the three storage abstractions and volume structure discussed in the introduction. 4.1. Volumes ------------------------------------------------------------------- The division of the flash chip into fixed-size volumes is specified by an XML file that is placed in the application's directory (where one types 'make'). The XML file specifies the allocation as follows: :: The name and size parameters are required, while base is optional. The name is a string containing one or more characters in [a-zA-Z0-9\_], while size and base are in bytes. Each storage chip MUST provide a compile-time tool that translates the allocation specification to chip-specific nesC code. There is no constraint on how this is done or what code is produced, except that the specification to physical allocation MUST be one-to-one (i.e. a given specification should always have the same resulting physical allocation on a given chip) and the result MUST be placed in the build directory. When not specified, the tool picks a suitable physical location for a volume. If there is any reason that the physical allocation cannot be satisfied, an error should be given at compile time. The tool SHOULD be named ``tos-storage-CHIPNAME`` and be distributed with the other tools supporting a platform. The XML file SHOULD be named ``volumes-CHIPNAME.xml``. The compile-time tool MUST prepend 'VOLUME\_' to each volume name in the XML file and '#define' each resulting name to map to a unique integer. The storage abstractions are accessed by instantiating generic components that take the volume macro as argument: :: components new BlockStorageC(VOLUME_DELUGE0); If the named volume is not in the specification, nesC will give a compile-time error since the symbol will be undefined. A volume MUST NOT be used with more than one storage abstraction instance. 4.2 Large objects ------------------------------------------------------------------ The motivating example for large objects is the transmission or long-term storage of large pieces of data. For instance, programs in a network-reprogramming system, or large data-packets in a reliable data-transmission system. Such objects have an interesting characteristic: each byte in the object is written at most once. This leads to the definition of the ``BlockStorageC`` abstraction for storing large objects or other "write-once" objects: - A large object ranges from a few kilobytes upwards. - A large object is erased before the first write. - A sync ensures that a large object survives a reboot or crash - Reads are unrestricted - Each byte can only be written once between two erases Large objects are accessed by instantiating a BlockStorageC component which takes a volume id argument: :: generic configuration BlockStorageC(volume_id_t volid) { provides { interface BlockWrite; interface BlockRead; } } ... The ``BlockRead`` and ``BlockWrite`` interfaces (briefly presented in Appendix B) contain the following operations (all split-phase, except ``BlockRead.getSize``): - ``BlockWrite.erase``: erase the volume. After a reboot or a commit, a volume MUST be erased before it can be written to. - ``BlockWrite.write``: write some bytes starting at a given offset. Each byte MUST NOT be written more than once between two erases. - ``BlockWrite.sync``: ensure all previous writes are present on a given volume. Sync MUST be called to ensure written data survives a reboot or crash. - ``BlockRead.read``: read some bytes starting at a given offset. - ``BlockRead.computeCrc``: compute the CRC of some bytes starting at a given offset. - ``BlockRead.getSize``: return bytes available for large object storage in volume. For full details on arguments and other considerations, see the comments in the interface definitions. Note that these interfaces contain no direct support for verifying the integrity of the BlockStorage data, but such support can easily be built by using the ``computeCrc`` command and storing the result in a well-defined location, and checking this CRC when desired. 4.3 Logging ------------------------------------------------------------------ Event and result logging is a common requirement in sensor networks. Such logging should be reliable (a mote crash should not lose data). It should also be easy to extract data from the log, either partially or fully. Some logs are *linear* (stop logging when the volume is full), others are *circular* (the oldest data is overwritten when the volume is full). The ``LogStorageC`` abstraction supports these requirements. The log is record based: each call to ``LogWrite.append`` (see below) creates a new record. On failure (crash or reboot), the log MUST only lose whole records from the end of the log. Additionally, once a circular log wraps around, calls to ``LogWrite.append`` MUST only lose whole records from the beginning of the log. Logs are accessed by instantiating a LogStorageC component which takes a volume id and a boolean argument: :: generic configuration LogStorageC(volume_id_t volid, bool circular) { provides { interface LogWrite; interface LogRead; } } ... If the ``circular`` argument is TRUE, the log is circular; otherwise it is linear. The ``LogRead`` and ``LogWrite`` interfaces (briefly presented in Appendix B) contain the following operations (all split-phase except ``LogWrite.currentOffset``, ``LogRead.currentOffset`` and ``LogRead.getSize``): - ``LogWrite.erase``: erase the log. A log MUST be erased (possibly in some previous session) before any other operation can be used. - ``LogWrite.append``: append some bytes to the log. In a circular log, this may overwrite the current read position. In this case, the read position MUST be advanced to the log's current beginning (i.e., as if ``LogRead.seek`` had been called with ``SEEK_BEGINNING``). Additionally, the ``LogWrite.appendDone`` event reports whenever, in a circular log, an append MAY have erased old records. Each append creates a separate record. Log implementations may have a maximum record size; all implementations MUST support records of up to 255 bytes. - ``LogWrite.sync``: guarantee that data written so far will not be lost to a crash or reboot (it can still be overwritten when a circular log wraps around). Using ``sync`` MAY waste some space in the log. - ``LogWrite.currentOffset``: return cookie representing current append position (for use with ``LogRead.seek``). - ``LogRead.read``: read some bytes from the current read position in the log and advance the read position. ``LogStorageC`` implementations MUST include error detection codes to increase the likelihood of detection of corrupted or invalid log data. Data returned by a successful read MUST have passed this error detection check. The behaviour on failure of this check is unspecified (e.g., the at45db believes as if the end of the log has been reached; other implementations may behave differently). - ``LogRead.currentOffset``: return cookie representing current read position (for use with ``LogRead.seek``). - ``LogRead.seek``: set the read position to a value returned by a prior call to ``LogWrite.currentOffset`` or ``LogRead.currentOffset``, or to the special ``SEEK_BEGINNING`` value. In a circular log, if the specified position has been overwritten, behave as if ``SEEK_BEGINNING`` was requested. ``SEEK_BEGINNING`` positions the read position at the beginning of the oldest record still present in the log. After reboot, the current read position is ``SEEK_BEGINNING``. - ``LogRead.getSize``: return an approximation of the log's capacity in bytes. Uses of ``sync`` and other overhead may reduce this number. For full details on arguments, etc, see the comments in the interface definitions. Note that while each call to ``append`` logically creates a separate record, the ``LogStorageC`` API does not report record boundaries. Additionally, crashes, reboots, and appends after wrap-around in a circular log can cause the loss of multiple consecutive records. Taken together, these restrictions mean that a ``LogStorageC`` implementation MAY internally aggregate several consecutive appends into a single record. However, the guarantee that only whole records are lost is sufficient to ensure that applications do not to have worry about incomplete or inconsistent log entries. 4.4 Small objects: ------------------------------------------------------------------ Sensor network applications need to store configuration data, e.g., mote identity, radio frequency, sample rates, etc. Such data is not large, but losing it may lead to a mote misbehaving or losing contact with the network. The ``ConfigStorageC`` abstraction stores a single small object in a volume. It: - Assumes that configuration data is relatively small (a few hundred bytes). - Allows random reads and writes. - Has simple transactional behaviour: each read is a separate transaction, all writes up to a commit form a single transaction. - At reboot, the volume contains the data as of the most recent successful commit. Small objects are accessed by instantiating a ConfigStorageC component which takes a volume id argument: :: generic configuration ConfigStorageC(volume_id_t volid) { provides { interface Mount; interface ConfigStorage; } } ... A small object MUST be mounted (via the ``Mount`` interface) before the first use. The ``Mount`` and ``ConfigStorage`` interfaces (briefly presented in Appendix B) contain the following operations (all split-phase except ``ConfigStorage.getSize`` and ``ConfigStorage.valid``): - ``Mount.mount``: mount the volume. - ``ConfigStorage.valid``: return TRUE if the volume contains a valid small object. - ``ConfigStorage.read``: read some bytes starting at a given offset. Fails if the small object is not valid. Note that this reads the data as of the last successful commit. - ``ConfigStorage.write``: write some bytes to a given offset. - ``ConfigStorage.commit``: make the small object contents reflect all the writes since the last commit. - ``ConfigStorage.getSize``: return the number of bytes that can be stored in the small object. For full details on arguments, etc, see the comments in the interface definitions. 5. Implementations ==================================================================== An AT45DB implementation can be found in tinyos-2.x/tos/chips/at45db. An ST M25P implementation can be found in tinyos-2.x/tos/chips/stm25p. 6. Authors' Addresses ==================================================================== | David Gay | 2150 Shattuck Ave, Suite 1300 | Intel Research | Berkeley, CA 94704 | | phone - +1 510 495 3055 | email - david.e.gay@intel.com | | | Jonathan Hui | 657 Mission St. Ste. 600 | Arched Rock Corporation | San Francisco, CA 94105-4120 | | phone - +1 415 692 0828 | email - jhui@archedrock.com 7. Citations ==================================================================== .. [1] David Gay. "Design of Matchbox, the simple filing system for motes. (version 1.0)." .. [2] TEP 2: Hardware Abstraction Architecture. Appendix A. HAA for some existing flash chips ==================================================================== A.1 AT45DB ------------------------------------------------------------------ The Atmel AT45DB family HPL is: :: configuration HplAt45dbC { provides interface HplAt45db; } ... The ``HplAt45db`` interface has flash->buffer, buffer->flash, compare buffer to flash, erase page, read, compute CRC, and write operations. Most of these operations are asynchronous, i.e., their completion is signaled before the flash chip has completed the operation. The HPL also includes operations to wait for asynchronous operations to complete. A generic, system-independent implementation of the HPL (``HplAt45dbByteC``) is included allowing platforms to just provide SPI and chip selection interfaces. Different members of the AT45DB family are supported by specifying a few constants (number of pages, page size). The AT45DB HAL has two components, one for chip access and the other providing volume information: :: component At45dbC { provides { interface At45db; interface Resource[uint8_t client]; interface ResourceController; interface ArbiterInfo; } } ... configuration At45dbStorageManagerC { provides interface At45dbVolume[volume_id_t volid]; } ... Note that the AT45DB HAL resource management is independent of the underlying HPL's power management. The motivation for this is that individual flash operations may take a long time, so it may be desirable to release the flash's bus during long-running operations. The ``At45db`` interface abstracts from the low-level HPL operations by: - using the flash's 2 RAM buffers as a cache to allow faster reads and writes - hiding the asynchronous nature of the HPL operations - verifying that all writes were successful It provides cached read, write and CRC computation, and page erase and copy. It also includes flush and sync operations to manage the cache. The ``At45dbVolume`` interface has operations to report volume size and map volume-relative pages to absolute pages. A.2 ST M25P ------------------------------------------------------------------ The ST M25P family HPL is: :: configuration Stm25pSpiC { provides interface Init; provides interface Resource; provides interface Stm25pSpi; } The ``Stm25pSpi`` interface has read, write, compute CRC, sector erase and block erase operations. The implementation of this HPL is system-independent, built over a few system-dependent components providing SPI and chip selection interfaces. Note that these two examples have different resource management policies: the AT45DB encapsulates resource acquisition and release within each operation, while the M25P family requires that HPL users acquire and release the resource itself. The ST M25P HAL is: :: configuration Stm25pSectorC { provides interface Resource as ClientResource[storage_volume_t volume]; provides interface Stm25pSector as Sector[storage_volume_t volume]; provides interface Stm25pVolume as Volume[storage_volume_t volume]; } The ``Stm25pSector`` interface provides volume-relative operations similar to those from the HPL interface: read, write, compute CRC and erase. Additionally, it has operations to report volume size and remap volume-relative addresses. Clients of the ST M25P HAL must implement the ``getVolumeId`` event of the ``Stm25pVolume`` interface so that the HAL can obtain the volume id of each of its clients. Appendix B. Storage interfaces ==================================================================== These interfaces are presented briefly here for reference; please refer to the TinyOS documentation for a full description of the commands, events and their parameters. B.1 BlockStorage interfaces ------------------------------------------------------------------ The BlockStorage interfaces are: :: interface BlockRead { command error_t read(storage_addr_t addr, void* buf, storage_len_t len); event void readDone(storage_addr_t addr, void* buf, storage_len_t len, error_t error); command error_t computeCrc(storage_addr_t addr, storage_len_t len, uint16_t crc); event void computeCrcDone(storage_addr_t addr, storage_len_t len, uint16_t crc, error_t error); command storage_len_t getSize(); } interface BlockWrite { command error_t write(storage_addr_t addr, void* buf, storage_len_t len); event void writeDone(storage_addr_t addr, void* buf, storage_len_t len, error_t error); command error_t erase(); event void eraseDone(error_t error); command error_t sync(); event void syncDone(error_t error); } B.2 ConfigStorage interfaces ------------------------------------------------------------------ The ConfigStorage interfaces are: :: interface Mount { command error_t mount(); event void mountDone(error_t error); } interface ConfigStorage { command error_t read(storage_addr_t addr, void* buf, storage_len_t len); event void readDone(storage_addr_t addr, void* buf, storage_len_t len, error_t error); command error_t write(storage_addr_t addr, void* buf, storage_len_t len); event void writeDone(storage_addr_t addr, void* buf, storage_len_t len, error_t error); command error_t commit(); event void commitDone(error_t error); command storage_len_t getSize(); command bool valid(); } B.3 LogStorage interfaces ------------------------------------------------------------------ The LogStorage interfaces are: :: interface LogRead { command error_t read(void* buf, storage_len_t len); event void readDone(void* buf, storage_len_t len, error_t error); command storage_cookie_t currentOffset(); command error_t seek(storage_cookie_t offset); event void seekDone(error_t error); command storage_len_t getSize(); } interface LogWrite { command error_t append(void* buf, storage_len_t len); event void appendDone(void* buf, storage_len_t len, bool recordsLost, error_t error); command storage_cookie_t currentOffset(); command error_t erase(); event void eraseDone(error_t error); command error_t sync(); event void syncDone(error_t error); }