Diff: draft-ietf-nfsv4-minorversion1-21.txt - draft-ietf-nfsv4-minorversion1-22.txt
 draft-ietf-nfsv4-minorversion1-21.txt   draft-ietf-nfsv4-minorversion1-22.txt 
NFSv4 S. Shepler NFSv4 S. Shepler
Internet-Draft M. Eisler Internet-Draft M. Eisler
Intended status: Standards Track D. Noveck Intended status: Standards Track D. Noveck
Expires: August 28, 2008 Editors Expires: September 14, 2008 Editors
February 25, 2008 March 13, 2008
NFS Version 4 Minor Version 1 NFS Version 4 Minor Version 1
draft-ietf-nfsv4-minorversion1-21.txt draft-ietf-nfsv4-minorversion1-22.txt
Status of this Memo Status of this Memo
By submitting this Internet-Draft, each author represents that any By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79. aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on August 28, 2008. This Internet-Draft will expire on September 14, 2008.
Copyright Notice Copyright Notice
Copyright (C) The IETF Trust (2008). Copyright (C) The IETF Trust (2008).
Abstract Abstract
This Internet-Draft describes NFS version 4 minor version one, This Internet-Draft describes NFS version 4 minor version one,
including features retained from the base protocol and protocol including features retained from the base protocol and protocol
extensions made subsequently. Major extensions introduced in NFS extensions made subsequently. Major extensions introduced in NFS
skipping to change at page 6, line 15 skipping to change at page 6, line 15
11.7.6. The Change Attribute and File System Transitions . . 229 11.7.6. The Change Attribute and File System Transitions . . 229
11.7.7. Lock State and File System Transitions . . . . . . . 230 11.7.7. Lock State and File System Transitions . . . . . . . 230
11.7.8. Write Verifiers and File System Transitions . . . . 234 11.7.8. Write Verifiers and File System Transitions . . . . 234
11.7.9. Readdir Cookies and Verifiers and File System 11.7.9. Readdir Cookies and Verifiers and File System
Transitions . . . . . . . . . . . . . . . . . . . . 234 Transitions . . . . . . . . . . . . . . . . . . . . 234
11.7.10. File System Data and File System Transitions . . . . 234 11.7.10. File System Data and File System Transitions . . . . 234
11.8. Effecting File System Referrals . . . . . . . . . . . . 236 11.8. Effecting File System Referrals . . . . . . . . . . . . 236
11.8.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 236 11.8.1. Referral Example (LOOKUP) . . . . . . . . . . . . . 236
11.8.2. Referral Example (READDIR) . . . . . . . . . . . . . 240 11.8.2. Referral Example (READDIR) . . . . . . . . . . . . . 240
11.9. The Attribute fs_locations . . . . . . . . . . . . . . . 242 11.9. The Attribute fs_locations . . . . . . . . . . . . . . . 242
11.10. The Attribute fs_locations_info . . . . . . . . . . . . 244 11.10. The Attribute fs_locations_info . . . . . . . . . . . . 245
11.10.1. The fs_locations_server4 Structure . . . . . . . . . 248 11.10.1. The fs_locations_server4 Structure . . . . . . . . . 248
11.10.2. The fs_locations_info4 Structure . . . . . . . . . . 253 11.10.2. The fs_locations_info4 Structure . . . . . . . . . . 253
11.10.3. The fs_locations_item4 Structure . . . . . . . . . . 254 11.10.3. The fs_locations_item4 Structure . . . . . . . . . . 254
11.11. The Attribute fs_status . . . . . . . . . . . . . . . . 256 11.11. The Attribute fs_status . . . . . . . . . . . . . . . . 256
12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 260 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 260
12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 260 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 260
12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 262 12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 262
12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 262 12.2.1. Metadata . . . . . . . . . . . . . . . . . . . . . . 262
12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 262 12.2.2. Metadata Server . . . . . . . . . . . . . . . . . . 262
12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 263 12.2.3. pNFS Client . . . . . . . . . . . . . . . . . . . . 263
skipping to change at page 6, line 46 skipping to change at page 6, line 46
12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 267 12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 267
12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 269 12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 269
12.5.3. Layout Stateid . . . . . . . . . . . . . . . . . . . 270 12.5.3. Layout Stateid . . . . . . . . . . . . . . . . . . . 270
12.5.4. Committing a Layout . . . . . . . . . . . . . . . . 271 12.5.4. Committing a Layout . . . . . . . . . . . . . . . . 271
12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 274 12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 274
12.5.6. Revoking Layouts . . . . . . . . . . . . . . . . . . 281 12.5.6. Revoking Layouts . . . . . . . . . . . . . . . . . . 281
12.5.7. Metadata Server Write Propagation . . . . . . . . . 281 12.5.7. Metadata Server Write Propagation . . . . . . . . . 281
12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 281 12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 281
12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 283 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 283
12.7.1. Recovery from Client Restart . . . . . . . . . . . . 283 12.7.1. Recovery from Client Restart . . . . . . . . . . . . 283
12.7.2. Dealing with Lease Expiration on the Client . . . . 283 12.7.2. Dealing with Lease Expiration on the Client . . . . 284
12.7.3. Dealing with Loss of Layout State on the Metadata 12.7.3. Dealing with Loss of Layout State on the Metadata
Server . . . . . . . . . . . . . . . . . . . . . . . 284 Server . . . . . . . . . . . . . . . . . . . . . . . 285
12.7.4. Recovery from Metadata Server Restart . . . . . . . 285 12.7.4. Recovery from Metadata Server Restart . . . . . . . 285
12.7.5. Operations During Metadata Server Grace Period . . . 287 12.7.5. Operations During Metadata Server Grace Period . . . 287
12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 287 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 288
12.8. Metadata and Storage Device Roles . . . . . . . . . . . 288 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 288
12.9. Security Considerations for pNFS . . . . . . . . . . . . 288 12.9. Security Considerations for pNFS . . . . . . . . . . . . 288
13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 289 13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 289
13.1. Client ID and Session Considerations . . . . . . . . . . 289 13.1. Client ID and Session Considerations . . . . . . . . . . 290
13.1.1. Sessions Considerations for Data Servers . . . . . . 292 13.1.1. Sessions Considerations for Data Servers . . . . . . 292
13.2. File Layout Definitions . . . . . . . . . . . . . . . . 292 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 292
13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 293 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 293
13.4. Interpreting the File Layout . . . . . . . . . . . . . . 297 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 297
13.4.1. Determining the Stripe Unit Number . . . . . . . . . 297 13.4.1. Determining the Stripe Unit Number . . . . . . . . . 297
13.4.2. Interpreting the File Layout Using Sparse Packing . 297 13.4.2. Interpreting the File Layout Using Sparse Packing . 297
13.4.3. Interpreting the File Layout Using Dense Packing . . 300 13.4.3. Interpreting the File Layout Using Dense Packing . . 300
13.4.4. Sparse and Dense Stripe Unit Packing . . . . . . . . 302 13.4.4. Sparse and Dense Stripe Unit Packing . . . . . . . . 302
13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 304 13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 304
13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 305 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 305
skipping to change at page 8, line 31 skipping to change at page 8, line 31
18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 406 18.11. Operation 13: LOCKT - Test For Lock . . . . . . . . . . 406
18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 408 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 408
18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 409 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 409
18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 411 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 411
18.15. Operation 17: NVERIFY - Verify Difference in 18.15. Operation 17: NVERIFY - Verify Difference in
Attributes . . . . . . . . . . . . . . . . . . . . . . . 412 Attributes . . . . . . . . . . . . . . . . . . . . . . . 412
18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 413 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 413
18.17. Operation 19: OPENATTR - Open Named Attribute 18.17. Operation 19: OPENATTR - Open Named Attribute
Directory . . . . . . . . . . . . . . . . . . . . . . . 432 Directory . . . . . . . . . . . . . . . . . . . . . . . 432
18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 433 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 433
18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 434 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 435
18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 435 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . . 435
18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 437 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 437
18.22. Operation 25: READ - Read from File . . . . . . . . . . 437 18.22. Operation 25: READ - Read from File . . . . . . . . . . 437
18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 440 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 440
18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 443 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 444
18.25. Operation 28: REMOVE - Remove File System Object . . . . 444 18.25. Operation 28: REMOVE - Remove File System Object . . . . 445
18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 447 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 447
18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 450 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 451
18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 451 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 452
18.29. Operation 33: SECINFO - Obtain Available Security . . . 452 18.29. Operation 33: SECINFO - Obtain Available Security . . . 452
18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 455 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 455
18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 458 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 458
18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 459 18.32. Operation 38: WRITE - Write to File . . . . . . . . . . 459
18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 464 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel control . . 464
18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 465 18.34. Operation 41: BIND_CONN_TO_SESSION . . . . . . . . . . . 465
18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 468 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 468
18.36. Operation 43: CREATE_SESSION - Create New Session and 18.36. Operation 43: CREATE_SESSION - Create New Session and
Confirm Client ID . . . . . . . . . . . . . . . . . . . 484 Confirm Client ID . . . . . . . . . . . . . . . . . . . 484
18.37. Operation 44: DESTROY_SESSION - Destroy existing 18.37. Operation 44: DESTROY_SESSION - Destroy existing
skipping to change at page 113, line 37 skipping to change at page 113, line 37
The fs_layout_type attribute (see Section 3.3.13) applies to a file The fs_layout_type attribute (see Section 3.3.13) applies to a file
system and indicates what layout types are supported by the file system and indicates what layout types are supported by the file
system. When the client encounters a new fsid, the client SHOULD system. When the client encounters a new fsid, the client SHOULD
obtain the value for the fs_layout_type attribute associated with the obtain the value for the fs_layout_type attribute associated with the
new file system. This attribute is used by the client to determine new file system. This attribute is used by the client to determine
if the layout types supported by the server match any of the client's if the layout types supported by the server match any of the client's
supported layout types. supported layout types.
5.11.2. Attribute 66: layout_alignment 5.11.2. Attribute 66: layout_alignment
When a client has layouts for a file system, the layout_alignment When a client holds layouts on files of a file system, the
attribute indicates the preferred alignment for I/O to files on that layout_alignment attribute indicates the preferred alignment for I/O
file system. Where possible, the client should send READ and WRITE to files on that file system. Where possible, the client should send
operations with offsets that are whole multiples of the READ and WRITE operations with offsets that are whole multiples of
layout_alignment attribute. the layout_alignment attribute.
5.11.3. Attribute 65: layout_blksize 5.11.3. Attribute 65: layout_blksize
When a client has layouts for a file system, the layout_blksize When a client holds layouts on files of a file system, the
attribute indicates the preferred block size for I/O to files on that layout_blksize attribute indicates the preferred block size for I/O
file system. Where possible, the client should send READ operations to files on that file system. Where possible, the client should send
with a count argument that is a whole multiple of layout_blksize, and READ operations with a count argument that is a whole multiple of
WRITE operations with a data argument of size that is a whole layout_blksize, and WRITE operations with a data argument of size
multiple of layout_blksize. that is a whole multiple of layout_blksize.
5.11.4. Attribute 63: layout_hint 5.11.4. Attribute 63: layout_hint
The layout_hint attribute (see Section 3.3.19) may be set on newly The layout_hint attribute (see Section 3.3.19) may be set on newly
created files to influence the metadata server's choice for the created files to influence the metadata server's choice for the
file's layout. If possible, this attribute is one of those set in file's layout. If possible, this attribute is one of those set in
the initial attributes within the OPEN operation. The metadata the initial attributes within the OPEN operation. The metadata
server may choose to ignore this attribute. The layout_hint server may choose to ignore this attribute. The layout_hint
attribute is a sub-set of the layout structure returned by LAYOUTGET. attribute is a sub-set of the layout structure returned by LAYOUTGET.
For example, instead of specifying particular devices, this would be For example, instead of specifying particular devices, this would be
skipping to change at page 117, line 26 skipping to change at page 117, line 26
of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of
administration retention holds does not prevent the enabling of administration retention holds does not prevent the enabling of
event-based or non-event-based retention. event-based or non-event-based retention.
6. Access Control Attributes 6. Access Control Attributes
Access Control Lists (ACLs) are file attributes that specify fine Access Control Lists (ACLs) are file attributes that specify fine
grained access control. This chapter covers the "acl", "dacl", grained access control. This chapter covers the "acl", "dacl",
"sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and "sacl", "aclsupport", "mode", "mode_set_masked" file attributes, and
their interactions. Note that file attributes may apply to any file their interactions. Note that file attributes may apply to any file
system objects. system object.
6.1. Goals 6.1. Goals
ACLs and modes represent two well established models for specifying ACLs and modes represent two well established models for specifying
permissions. This chapter specifies requirements that attempt to permissions. This chapter specifies requirements that attempt to
meet the following goals: meet the following goals:
o If a server supports the mode attribute, it should provide o If a server supports the mode attribute, it should provide
reasonable semantics to clients that only set and retrieve the reasonable semantics to clients that only set and retrieve the
mode attribute. mode attribute.
skipping to change at page 122, line 28 skipping to change at page 122, line 28
const ACE4_WRITE_RETENTION_HOLD = 0x00000400; const ACE4_WRITE_RETENTION_HOLD = 0x00000400;
const ACE4_DELETE = 0x00010000; const ACE4_DELETE = 0x00010000;
const ACE4_READ_ACL = 0x00020000; const ACE4_READ_ACL = 0x00020000;
const ACE4_WRITE_ACL = 0x00040000; const ACE4_WRITE_ACL = 0x00040000;
const ACE4_WRITE_OWNER = 0x00080000; const ACE4_WRITE_OWNER = 0x00080000;
const ACE4_SYNCHRONIZE = 0x00100000; const ACE4_SYNCHRONIZE = 0x00100000;
Note that some masks have coincident values, for example, Note that some masks have coincident values, for example,
ACE4_READ_DATA and ACE4_LIST_DIRECTORY. The mask entries ACE4_READ_DATA and ACE4_LIST_DIRECTORY. The mask entries
ACE4_LIST_DIRECTORY, ACE4_ADD_SUBDIRECTORY, and ACE4_TRAVERSE are ACE4_LIST_DIRECTORY, ACE4_ADD_FILE, and ACE4_ADD_SUBDIRECTORY are
intended to be used with directory objects, while ACE4_READ_DATA, intended to be used with directory objects, while ACE4_READ_DATA,
ACE4_WRITE_DATA, and ACE4_EXECUTE are intended to be used with non- ACE4_WRITE_DATA, and ACE4_APPEND_DATA are intended to be used with
directory objects. non-directory objects.
6.2.1.3.1. Discussion of Mask Attributes 6.2.1.3.1. Discussion of Mask Attributes
ACE4_READ_DATA ACE4_READ_DATA
Operation(s) affected: Operation(s) affected:
READ READ
OPEN OPEN
skipping to change at page 149, line 39 skipping to change at page 149, line 39
With the exception of special stateids, to be discussed later, each With the exception of special stateids, to be discussed later, each
stateid represents locking objects of one of a set of types defined stateid represents locking objects of one of a set of types defined
by the NFSv4.1 protocol. Note that in all these cases, where we by the NFSv4.1 protocol. Note that in all these cases, where we
speak of guarantee, it is understood there are situations such as a speak of guarantee, it is understood there are situations such as a
client restart, or lock revocation, that allow the guarantee to be client restart, or lock revocation, that allow the guarantee to be
voided. voided.
o Stateids may represent opens of files. o Stateids may represent opens of files.
Each stateid in this case represents the open for a given Each stateid in this case represents the open for a given
clientid/open-owner/filehandle triple. Such tateids are subject clientid/open-owner/filehandle triple. Such stateids are subject
to change (with consequent bumping of the seqid) in response to to change (with consequent bumping of the seqid) in response to
OPENs that result in upgrade and OPEN_DOWNGRADE operations. OPENs that result in upgrade and OPEN_DOWNGRADE operations.
o Stateids may represent sets of byte-range locks. o Stateids may represent sets of byte-range locks.
All locks held on a particular file by a particular owner and all All locks held on a particular file by a particular owner and all
gotten under the aegis of a particular open file are associated gotten under the aegis of a particular open file are associated
with a single stateid with the seqid being bumped as LOCK and with a single stateid with the seqid being bumped as LOCK and
LOCKU operations affect that set of locks. LOCKU operations affect that set of locks.
skipping to change at page 154, line 24 skipping to change at page 154, line 24
analyzed by this procedure. analyzed by this procedure.
If server restart has resulted in an invalid client ID or a sessionid If server restart has resulted in an invalid client ID or a sessionid
which is invalid, SEQUENCE will return an error and the operation which is invalid, SEQUENCE will return an error and the operation
that takes a stateid as an argument will never be processed. that takes a stateid as an argument will never be processed.
If there has been a server restart where there is a persistent If there has been a server restart where there is a persistent
session, and all leased state has been lost, then the session in session, and all leased state has been lost, then the session in
question will, although valid, be marked as dead, and any operation question will, although valid, be marked as dead, and any operation
not satisfied by means of the reply cache will receive the error not satisfied by means of the reply cache will receive the error
NFS4ERR_DEADSESSION, and thus not be processed as indicated below NFS4ERR_DEADSESSION, and thus not be processed as indicated below.
either.
When a stateid is being tested, and the "other" field is all zeros or When a stateid is being tested, and the "other" field is all zeros or
all ones, a check that the "other" and "seqid" fields match a defined all ones, a check that the "other" and "seqid" fields match a defined
combination for a special stateid is done and the results determined combination for a special stateid is done and the results determined
as follows: as follows:
o If the "other" and "seqid" fields do not match a defined o If the "other" and "seqid" fields do not match a defined
combination associated with a special stateid, the error combination associated with a special stateid, the error
NFS4ERR_BAD_STATEID is returned. NFS4ERR_BAD_STATEID is returned.
skipping to change at page 158, line 15 skipping to change at page 158, line 14
SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock
revocation events. When these bits are set, the client should use revocation events. When these bits are set, the client should use
TEST_STATEID to find what stateids have been revoked and use TEST_STATEID to find what stateids have been revoked and use
FREE_STATEID to acknowledge loss of the associated state. FREE_STATEID to acknowledge loss of the associated state.
o The status bit SEQ4_STATUS_LEASE_MOVE indicates that o The status bit SEQ4_STATUS_LEASE_MOVE indicates that
responsibility for lease renewal has been transferred to one or responsibility for lease renewal has been transferred to one or
more new servers. more new servers.
o The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that o The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that
due to server restart or restart the client must reclaim locking due to server restart the client must reclaim locking state.
state.
o The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates server has o The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates server has
encountered an unrecoverable fault with the backchannel (e.g. it encountered an unrecoverable fault with the backchannel (e.g. it
has lost track of a sequence id for a slot in the backchannel). has lost track of a sequence id for a slot in the backchannel).
8.4. Crash Recovery 8.4. Crash Recovery
A critical requirement in crash recovery is that both the client and A critical requirement in crash recovery is that both the client and
the server know when the other has failed. Additionally, it is the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server required that a client sees a consistent view of data across server
skipping to change at page 174, line 29 skipping to change at page 174, line 29
write delegation and WRITE conflicts with a read delegation. write delegation and WRITE conflicts with a read delegation.
When a client holds a delegation, it is particularly important to When a client holds a delegation, it is particularly important to
make sure that the stateid sent conveys the association of operation make sure that the stateid sent conveys the association of operation
with the delegation, to avoid the delegation from being avoidably with the delegation, to avoid the delegation from being avoidably
recalled. When the delegation stateid, or a stateid open associated recalled. When the delegation stateid, or a stateid open associated
with that delegation, or a stateid representing byte-range locks with that delegation, or a stateid representing byte-range locks
derived from such an open is used, the server knows that the READ, derived from such an open is used, the server knows that the READ,
WRITE, or SETATTR does not conflict with the delegation, but is sent WRITE, or SETATTR does not conflict with the delegation, but is sent
under the aegis of the delegation. Even though it is possible for under the aegis of the delegation. Even though it is possible for
the server to determine from the clientid (gotten from the sessionid) the server to determine from the clientid (via the sessionid) that
that the client does in fact have a delegation, the server is not the client does in fact have a delegation, the server is not obliged
obliged to check this, so using a special stateid can result in to check this, so using a special stateid can result in avoidable
avoidable recall of the delegation. recall of the delegation.
9.2. Lock Ranges 9.2. Lock Ranges
The protocol allows a lock-owner to request a lock with a byte range The protocol allows a lock-owner to request a lock with a byte range
and then either upgrade, downgrade, or unlock a sub-range of the and then either upgrade, downgrade, or unlock a sub-range of the
initial lock, or a range that consists of a range which overlaps, initial lock, or a range that consists of a range which overlaps,
fully or partially, that initial lock or a combination of a set of fully or partially, that initial lock or a combination of a set of
existing locks for the same lock-owner. It is expected that this existing locks for the same lock-owner. It is expected that this
will be an uncommon type of request. In any case, servers or server will be an uncommon type of request. In any case, servers or server
file systems may not be able to support sub-range lock semantics. In file systems may not be able to support sub-range lock semantics. In
skipping to change at page 186, line 33 skipping to change at page 186, line 33
however, the server may extend the period in which conflicting however, the server may extend the period in which conflicting
requests are held off. Eventually the occurrence of a conflicting requests are held off. Eventually the occurrence of a conflicting
request from another client will cause revocation of the delegation. request from another client will cause revocation of the delegation.
A loss of the backchannel (e.g. by later network configuration A loss of the backchannel (e.g. by later network configuration
change) will have the same effect. A recall request will fail and change) will have the same effect. A recall request will fail and
revocation of the delegation will result. revocation of the delegation will result.
A client normally finds out about revocation of a delegation when it A client normally finds out about revocation of a delegation when it
uses a stateid associated with a delegation and receives one of the uses a stateid associated with a delegation and receives one of the
errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or
MFS4ERR_DELEG_REVOKED. It also may find out about delegation NFS4ERR_DELEG_REVOKED. It also may find out about delegation
revocation after a client restart when it attempts to reclaim a revocation after a client restart when it attempts to reclaim a
delegation and receives that same error. Note that in the case of a delegation and receives that same error. Note that in the case of a
revoked write open delegation, there are issues because data may have revoked write open delegation, there are issues because data may have
been modified by the client whose delegation is revoked and been modified by the client whose delegation is revoked and
separately by other clients. See Section 10.5.1 for a discussion of separately by other clients. See Section 10.5.1 for a discussion of
such issues. Note also that when delegations are revoked, such issues. Note also that when delegations are revoked,
information about the revoked delegation will be written by the information about the revoked delegation will be written by the
server to stable storage (as described in Section 8.4.3). This is server to stable storage (as described in Section 8.4.3). This is
done to deal with the case in which a server restarts after revoking done to deal with the case in which a server restarts after revoking
a delegation but before the client holding the revoked delegation is a delegation but before the client holding the revoked delegation is
skipping to change at page 230, line 4 skipping to change at page 230, line 4
each of the target file systems. each of the target file systems.
11.7.6. The Change Attribute and File System Transitions 11.7.6. The Change Attribute and File System Transitions
Since the change attribute is defined as a server-specific one, Since the change attribute is defined as a server-specific one,
change attributes fetched from one server are normally presumed to be change attributes fetched from one server are normally presumed to be
invalid on another server. Such a presumption is troublesome since invalid on another server. Such a presumption is troublesome since
it would invalidate all cached change attributes, requiring it would invalidate all cached change attributes, requiring
refetching. Even more disruptive, the absence of any assured refetching. Even more disruptive, the absence of any assured
continuity for the change attribute means that even if the same value continuity for the change attribute means that even if the same value
is gotten on refetch no conclusions can be drawn as to whether the is retrieved on refetch no conclusions can be drawn as to whether the
object in question has changed. The identical change attribute could object in question has changed. The identical change attribute could
be merely an artifact of a modified file with a different change be merely an artifact of a modified file with a different change
attribute construction algorithm, with that new algorithm just attribute construction algorithm, with that new algorithm just
happening to result in an identical change value. happening to result in an identical change value.
When the two file systems have consistent change attribute formats, When the two file systems have consistent change attribute formats,
and this fact is communicated to the client by reporting as in the and this fact is communicated to the client by reporting as in the
same _change_ class, the client may assume a continuity of change same _change_ class, the client may assume a continuity of change
attribute construction and handle this situation just as it would be attribute construction and handle this situation just as it would be
handled without any file system transition. handled without any file system transition.
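The client-side decision described above can be sketched as follows.
This is a minimal illustration in Python, not protocol text. The
function name and its parameters are hypothetical, and how the client
learns that two file systems report the same _change_ class is reduced
here to a single boolean.

```python
def cached_data_usable(cached_change: int, refetched_change: int,
                       same_change_class: bool) -> bool:
    """Decide whether data cached before a transition may still be used."""
    if not same_change_class:
        # No assured continuity: an equal value may be a coincidence of a
        # different change attribute construction algorithm.
        return False
    # Same class: compare exactly as if no transition had occurred.
    return cached_change == refetched_change
```

Without a shared _change_ class the sketch discards cached data even
when the refetched value is identical, which is the conservative
reading of the paragraph above.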
skipping to change at page 237, line 48 skipping to change at page 237, line 48
that op but was moved between the last LOOKUP and the GETFH (since that op but was moved between the last LOOKUP and the GETFH (since
COMPOUND is not atomic). Even if we had the fsids for all of the COMPOUND is not atomic). Even if we had the fsids for all of the
intermediate directories, we could have no way of knowing that /this/ intermediate directories, we could have no way of knowing that /this/
is/the/path was the root of a new file system, since we don't yet is/the/path was the root of a new file system, since we don't yet
have its fsid. have its fsid.
In order to get the necessary information, let us re-send the chain In order to get the necessary information, let us re-send the chain
of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we
can be sure where the appropriate file system boundaries are. The can be sure where the appropriate file system boundaries are. The
client could choose to get fs_locations_info at the same time but in client could choose to get fs_locations_info at the same time but in
most cases the client will have a good guess as to where fs most cases the client will have a good guess as to where file system
boundaries are (because of where NFS4ERR_MOVED was gotten and where boundaries are (because of where and where not NFS4ERR_MOVED was
not) making fetching of fs_locations_info unnecessary. received) making fetching of fs_locations_info unnecessary.
OP01: PUTROOTFH --> NFS_OK OP01: PUTROOTFH --> NFS_OK
- Current fh is root of pseudo-fs. - Current fh is root of pseudo-fs.
OP02: GETATTR(fsid) --> NFS_OK OP02: GETATTR(fsid) --> NFS_OK
- Just for completeness. Normally, clients will know the fsid of - Just for completeness. Normally, clients will know the fsid of
the pseudo-fs as soon as they establish communication with a the pseudo-fs as soon as they establish communication with a
server. server.
skipping to change at page 239, line 31 skipping to change at page 239, line 31
in fact the fsid we have for this file system might be a valid in fact the fsid we have for this file system might be a valid
fsid of a different file system on that new server. fsid of a different file system on that new server.
- In this particular case, we are pretty sure anyway that what has - In this particular case, we are pretty sure anyway that what has
moved is /this/is/the/path rather than /this/is/the since we have moved is /this/is/the/path rather than /this/is/the since we have
the fsid of the latter and it is that of the pseudo-fs, which the fsid of the latter and it is that of the pseudo-fs, which
presumably cannot move. However, in other examples, we might not presumably cannot move. However, in other examples, we might not
have this kind of information to rely on (e.g. /this/is/the might have this kind of information to rely on (e.g. /this/is/the might
be a non-pseudo file system separate from /this/is/the/path), so be a non-pseudo file system separate from /this/is/the/path), so
we need to have another reliable source information on the we need to have another reliable source information on the
boundary of the fs which is moved. If, for example, the file boundary of the file system which is moved. If, for example, the
system "/this/is" had moved we would have a case of migration file system "/this/is" had moved we would have a case of migration
rather than referral and once the boundaries of the migrated file rather than referral and once the boundaries of the migrated file
system were clear we could fetch fs_locations_info. system were clear we could fetch fs_locations_info.
- We are fetching fs_locations_info because the fact that we got an - We are fetching fs_locations_info because the fact that we got an
NFS4ERR_MOVED at this point means that it is most likely that this is NFS4ERR_MOVED at this point means that it is most likely that this is
a referral and we need the destination. Even if it is the case a referral and we need the destination. Even if it is the case
that "/this/is/the" is a file system which has migrated, we will that "/this/is/the" is a file system which has migrated, we will
still need the location information for that file system. still need the location information for that file system.
OP14: GETFH --> NFS4ERR_MOVED OP14: GETFH --> NFS4ERR_MOVED
skipping to change at page 242, line 27 skipping to change at page 242, line 27
o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid,
size, time_modify) --> NFS_OK. The attributes will be as shown size, time_modify) --> NFS_OK. The attributes will be as shown
below. below.
The attributes for "path" will only contain The attributes for "path" will only contain
o rdattr_error (value: NFS_OK) o rdattr_error (value: NFS_OK)
o fs_locations_info o fs_locations_info
o mounted_on_fileid (value: unique fileid within referring fs) o mounted_on_fileid (value: unique fileid within referring file
system)
o fsid (value: unique value within referring server) o fsid (value: unique value within referring server)
The attribute entry for "path" will not contain size or time_modify The attribute entry for "path" will not contain size or time_modify
because these attributes are not available within an absent file because these attributes are not available within an absent file
system. system.
11.9. The Attribute fs_locations 11.9. The Attribute fs_locations
The fs_locations attribute is structured in the following way: The fs_locations attribute is structured in the following way:
skipping to change at page 266, line 19 skipping to change at page 266, line 19
the same layout type and client ID again. This requirement is the same layout type and client ID again. This requirement is
feasible because the device ID is 16 bytes long, leaving sufficient feasible because the device ID is 16 bytes long, leaving sufficient
room to store a generation number if the server's implementation requires room to store a generation number if the server's implementation requires
most of the rest of the device ID's content to be reused. This most of the rest of the device ID's content to be reused. This
requirement is necessary because otherwise the race conditions requirement is necessary because otherwise the race conditions
between asynchronous notification of device ID addition and deletion between asynchronous notification of device ID addition and deletion
would be too difficult to sort out. would be too difficult to sort out.
Device ID to device address mappings are not leased, and can be Device ID to device address mappings are not leased, and can be
changed at any time. (Note that while device ID to device address changed at any time. (Note that while device ID to device address
mappings are likely to change after the metadata server restarts the mappings are likely to change after the metadata server restarts, the
server is not required to change the mappings.) A server has two server is not required to change the mappings.) A server has two
choices for changing mappings. It can recall all layouts referring choices for changing mappings. It can recall all layouts referring
to the device ID or it can use a notification mechanism. to the device ID or it can use a notification mechanism.
The NFSv4.1 protocol has no optimal way to recall all layouts that The NFSv4.1 protocol has no optimal way to recall all layouts that
referred to a particular device ID (unless the server associates a referred to a particular device ID (unless the server associates a
single device ID with a single fsid or a single client ID; in which single device ID with a single fsid or a single client ID; in which
case, CB_LAYOUTRECALL has options for recalling all layouts case, CB_LAYOUTRECALL has options for recalling all layouts
associated with the fsid, client ID pair or just the client ID). associated with the fsid, client ID pair or just the client ID).
skipping to change at page 270, line 29 skipping to change at page 270, line 29
CB_LAYOUTRECALL request. When the client fully processes the CB_LAYOUTRECALL request. When the client fully processes the
response to a LAYOUTGET or LAYOUTRETURN, or fully processes the response to a LAYOUTGET or LAYOUTRETURN, or fully processes the
arguments of a CB_LAYOUTRECALL, it MUST use the seqid of the stateid arguments of a CB_LAYOUTRECALL, it MUST use the seqid of the stateid
of the reply from LAYOUTGET and LAYOUTRETURN, or the seqid of the of the reply from LAYOUTGET and LAYOUTRETURN, or the seqid of the
stateid in the arguments of CB_LAYOUTRECALL, on subsequent calls to stateid in the arguments of CB_LAYOUTRECALL, on subsequent calls to
LAYOUTGET or LAYOUTRETURN. The client and server use the "seqid" of LAYOUTGET or LAYOUTRETURN. The client and server use the "seqid" of
the layout stateid for the following purposes: the layout stateid for the following purposes:
o Permit the client to send parallel LAYOUTGET operations on the o Permit the client to send parallel LAYOUTGET operations on the
same file. As with parallel opens (see Section 9.10) the use of same file. As with parallel opens (see Section 9.10) the use of
the sequence ID allows a client to avoid serializing LAYOUTGET the stateid's seqid allows a client to avoid serializing LAYOUTGET
operations. If LAYOUTGETs were serialized, especially non- operations. If LAYOUTGETs were serialized, especially non-
overlapping LAYOUTGETs, then non-overlapping I/Os to storage overlapping LAYOUTGETs, then non-overlapping I/Os to storage
devices would in turn be effectively serialized with each other. devices would in turn be effectively serialized with each other.
In the event parallel LAYOUTGET operations are sent with a non- In the event parallel LAYOUTGET operations are sent with a non-
layout stateid (because the client does not yet have a layout layout stateid (because the client does not yet have a layout
stateid), the successful responses MUST have the same "other" stateid), the successful responses MUST have the same "other"
field in the LAYOUTSTATEID, and each response with a unique field in the LAYOUTSTATEID, and each response with a unique
"seqid", where the lowest "seqid" is one, and the highest "seqid" "seqid", where the lowest "seqid" is one, and the highest "seqid"
is equal to the count of parallel LAYOUTGET operations invoked on is equal to the count of parallel LAYOUTGET operations invoked on
the non-layout stateid. the non-layout stateid.
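The numbering rule for parallel LAYOUTGET replies can be illustrated
with a short sketch. It is an assumption-laden model rather than a
server implementation: the function name is hypothetical, a stateid is
modelled as a (seqid, other) tuple, and the "other" field is filled
with random bytes purely for illustration.

```python
import os
from typing import List, Tuple

OTHER_SIZE = 12  # size in bytes of the "other" field of a stateid4


def first_layout_stateids(parallel_count: int) -> List[Tuple[int, bytes]]:
    """Model the stateids returned for parallel LAYOUTGETs that were all
    sent with the same non-layout stateid."""
    # Every successful reply carries the same "other" field.
    other = os.urandom(OTHER_SIZE)
    # Each reply gets a unique seqid: lowest is 1, highest is the count.
    return [(seqid, other) for seqid in range(1, parallel_count + 1)]
```

A client that sent three such operations would therefore expect to see
seqid values 1, 2, and 3, in whatever order the replies arrive.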
skipping to change at page 272, line 48 skipping to change at page 272, line 48
update time_modify at LAYOUTCOMMIT. At LAYOUTCOMMIT completion, the update time_modify at LAYOUTCOMMIT. At LAYOUTCOMMIT completion, the
updated attributes should be visible if that file was modified since updated attributes should be visible if that file was modified since
the latest previous LAYOUTCOMMIT or LAYOUTGET. the latest previous LAYOUTCOMMIT or LAYOUTGET.
12.5.4.2. LAYOUTCOMMIT and size 12.5.4.2. LAYOUTCOMMIT and size
The size of a file may be updated when the LAYOUTCOMMIT operation is The size of a file may be updated when the LAYOUTCOMMIT operation is
used by the client. One of the fields in the argument to used by the client. One of the fields in the argument to
LAYOUTCOMMIT is loca_last_write_offset; this field indicates the LAYOUTCOMMIT is loca_last_write_offset; this field indicates the
highest byte offset written but not yet committed with the highest byte offset written but not yet committed with the
LAYOUTCOMMIT operation. The data type of lora_last_write_offset is LAYOUTCOMMIT operation. The data type of loca_last_write_offset is
newoffset4 and is switched on a boolean value, no_newoffset, that newoffset4 and is switched on a boolean value, no_newoffset, that
indicates if a previous write occurred or not. If no_newoffset is indicates if a previous write occurred or not. If no_newoffset is
FALSE, an offset is not given. If the client has a layout with FALSE, an offset is not given. If the client has a layout with
LAYOUTIOMODE4_RW iomode on the file, with an lo_offset and lo_length LAYOUTIOMODE4_RW iomode on the file, with an lo_offset and lo_length
that overlaps loca_last_write_offset, then the client MAY set that overlaps loca_last_write_offset, then the client MAY set
no_newoffset to TRUE and provide an offset that will update the file no_newoffset to TRUE and provide an offset that will update the file
size. Keep in mind that offset is not the same as length, though size. Keep in mind that offset is not the same as length, though
they are related. For example, a loca_last_write_offset value of they are related. For example, a loca_last_write_offset value of
zero means that one byte was written at offset zero, and so the zero means that one byte was written at offset zero, and so the
length of the file is at least one byte. length of the file is at least one byte.
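The relationship between offset and length can be shown with a small
sketch. The function name is hypothetical. The no_newoffset flag
follows the text above: when it is FALSE no offset is given, so no
size is implied.

```python
from typing import Optional


def implied_min_size(no_newoffset: bool, offset: int = 0) -> Optional[int]:
    """Return the smallest file length implied by loca_last_write_offset,
    or None when no offset was supplied."""
    if not no_newoffset:
        # FALSE means the client did not provide an offset.
        return None
    # The byte at 'offset' was written, so the file holds offset + 1 bytes
    # at minimum.
    return offset + 1
```

For example, implied_min_size(True, 0) evaluates to 1, matching the
one-byte case described above.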
skipping to change at page 273, line 41 skipping to change at page 273, line 41
The results of LAYOUTCOMMIT contain a new size value in the form of a The results of LAYOUTCOMMIT contain a new size value in the form of a
newsize4 union data type. If the file's size is set as a result of newsize4 union data type. If the file's size is set as a result of
LAYOUTCOMMIT, the metadata server must reply with the new size; LAYOUTCOMMIT, the metadata server must reply with the new size;
otherwise the new size is not provided. If the file size is updated, otherwise the new size is not provided. If the file size is updated,
the metadata server SHOULD update the storage devices such that the the metadata server SHOULD update the storage devices such that the
new file size is reflected when LAYOUTCOMMIT processing is complete. new file size is reflected when LAYOUTCOMMIT processing is complete.
For example, the client should be able to READ up to the new file For example, the client should be able to READ up to the new file
size. size.
If the client wants to explicitly zero-extend or truncate a file, the The client can extend the length of a file or truncate a file by
SETATTR operation MUST be used; SETATTR use is not required when sending a SETATTR operation to the metadata server with the size
simply writing past EOF via WRITE. attribute specified. If the size specified is larger than the
current size of the file, the file is "zero extended", i.e., zeroes
are implicitly added between the file's previous EOF and the new EOF.
(In many implementations the zero extended region of the file
consists of unallocated holes in the file.) When the client writes
past EOF via WRITE, the SETATTR operation does not need to be used.
12.5.4.3. LAYOUTCOMMIT and layoutupdate 12.5.4.3. LAYOUTCOMMIT and layoutupdate
The LAYOUTCOMMIT argument contains a loca_layoutupdate field The LAYOUTCOMMIT argument contains a loca_layoutupdate field
(Section 18.42.1) of data type layoutupdate4 (Section 3.3.18). This (Section 18.42.1) of data type layoutupdate4 (Section 3.3.18). This
argument is a layout type-specific structure. The structure can be argument is a layout type-specific structure. The structure can be
used to pass arbitrary layout type-specific information from the used to pass arbitrary layout type-specific information from the
client to the metadata server at LAYOUTCOMMIT time. For example, if client to the metadata server at LAYOUTCOMMIT time. For example, if
using a block/volume layout, the client can indicate to the metadata using a block/volume layout, the client can indicate to the metadata
server which reserved or allocated blocks the client used or did not server which reserved or allocated blocks the client used or did not
skipping to change at page 277, line 26 skipping to change at page 277, line 35
12.5.5.2. Sequencing of Layout Operations 12.5.5.2. Sequencing of Layout Operations
As with other stateful operations, pNFS requires the correct As with other stateful operations, pNFS requires the correct
sequencing of layout operations. PNFS uses the "seqid" in the layout sequencing of layout operations. PNFS uses the "seqid" in the layout
stateid to provide the correct sequencing between regular operations stateid to provide the correct sequencing between regular operations
and callbacks. It is the server's responsibility to avoid and callbacks. It is the server's responsibility to avoid
inconsistencies regarding the layouts provided and the client's inconsistencies regarding the layouts provided and the client's
responsibility to properly serialize its layout requests and layout responsibility to properly serialize its layout requests and layout
returns. returns.
12.5.5.2.1. Recall/Return Sequencing 12.5.5.2.1. Layout Recall and Return Sequencing
Section 2.10.5.3 describes the sessions mechanism for allowing the One critical issue with regard to layout operations sequencing
client to detect such situations in order to delay processing such a concerns callbacks. The protocol must defend against races between
CB_LAYOUTRECALL. The server MUST reference all conflicting LAYOUTGET the reply to a LAYOUTGET or LAYOUTRETURN operation and a subsequent
operations in the CB_SEQUENCE that precedes the CB_LAYOUTRECALL. A CB_LAYOUTRECALL. A client MUST NOT process a CB_LAYOUTRECALL that
zero length array of referenced operations is used by the server to implies one or more outstanding LAYOUTGET or LAYOUTRETURN operations
tell the client that the server does not know of any LAYOUTGET to which the client has not yet received a reply. The client detects
operations that conflict with the recall. such a CB_LAYOUTRECALL by examining the "seqid" field of the recall's
layout stateid. If the "seqid" is not one higher than what the
client currently has recorded, and the client has at least one
LAYOUTGET and/or LAYOUTRETURN operation outstanding, the client knows
the server sent the CB_LAYOUTRECALL after sending a response to an
outstanding LAYOUTGET or LAYOUTRETURN. The client MUST wait before
processing such a CB_LAYOUTRECALL until it processes all replies for
outstanding LAYOUTGET and LAYOUTRETURN operations for the
corresponding file with seqid less than the seqid given by
CB_LAYOUTRECALL (lor_stateid, see Section 20.3.)
While referencing conflicting operations in CB_SEQUENCE conveys to In addition to the seqid-based mechanism, Section 2.10.5.3 describes
the client that the server is aware of races, one critical issue with the sessions mechanism for allowing the client to detect callback
regard to operation sequencing concerns callbacks. The protocol must race conditions and delay processing such a CB_LAYOUTRECALL. The
defend against races between the reply to a LAYOUTGET or LAYOUTRETURN server MAY reference conflicting operations in the CB_SEQUENCE that
operation and a subsequent CB_LAYOUTRECALL. A client MUST NOT precedes the CB_LAYOUTRECALL. Because the server has already sent
process a CB_LAYOUTRECALL that implies one or more outstanding replies for these operations before issuing the callback, the replies
LAYOUTGET or LAYOUTRETURN operations to which the client has not yet may race with the CB_LAYOUTRECALL. The client MUST wait for all the
received a reply. The client detects such a CB_LAYOUTRECALL by referenced calls to complete and update its view of the layout state
examining the "seqid" field of the recall's layout stateid. If the before processing the CB_LAYOUTRECALL.
"seqid" is not one higher than what the client currently has
recorded, and the client has at least one LAYOUTGET and/or
LAYOUTRETURN operation outstanding, the client knows the server sent
the CB_LAYOUTRECALL after the server sent a response to an
outstanding LAYOUTGET or LAYOUTRETURN.
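The seqid test that a client applies to an incoming CB_LAYOUTRECALL
can be sketched as follows. This is a simplified illustration: the
function and parameter names are hypothetical, and seqid wraparound is
deliberately ignored.

```python
def must_delay_recall(recorded_seqid: int, recall_seqid: int,
                      outstanding_ops: int) -> bool:
    """Return True if the client has to wait for outstanding LAYOUTGET or
    LAYOUTRETURN replies before processing the recall."""
    # A recall that is exactly one higher than the recorded seqid follows
    # directly from the state the client already knows about.
    next_in_sequence = recall_seqid == recorded_seqid + 1
    # Otherwise, if replies are still outstanding, the server must have
    # answered at least one of them before sending the recall.
    return (not next_in_sequence) and outstanding_ops > 0
```

With a recorded seqid of 3 and two operations outstanding, a recall
carrying seqid 4 can be processed at once, while a recall carrying
seqid 6 has to wait for the missing replies.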
12.5.5.2.1.1. Get/Return Sequencing 12.5.5.2.1.1. Get/Return Sequencing
The protocol allows the client to send concurrent LAYOUTGET and The protocol allows the client to send concurrent LAYOUTGET and
LAYOUTRETURN operations to the server. The protocol does not provide LAYOUTRETURN operations to the server. The protocol does not provide
any means for the server to process the requests in the same order in any means for the server to process the requests in the same order in
which they were created. However, through the use of the "seqid" which they were created. However, through the use of the "seqid"
field in the layout stateid, the client can determine the order in field in the layout stateid, the client can determine the order in
which parallel outstanding operations were processed by the server. which parallel outstanding operations were processed by the server.
Thus, when a layout retrieved by an outstanding LAYOUTGET operation Thus, when a layout retrieved by an outstanding LAYOUTGET operation
skipping to change at page 284, line 46 skipping to change at page 285, line 12
the lease expires, but arrive after the lease expires. See the lease expires, but arrive after the lease expires. See
Section 12.7.3. Section 12.7.3.
12.7.3. Dealing with Loss of Layout State on the Metadata Server 12.7.3. Dealing with Loss of Layout State on the Metadata Server
This is a description of the case where all of the following are This is a description of the case where all of the following are
true: true:
o the metadata server has not restarted o the metadata server has not restarted
o a pNFS client's device ID to layouts have been discarded (usually o a pNFS client's layouts have been discarded (usually because the
because the client's lease expired) and are invalid client's lease expired) and are invalid
o an I/O from the pNFS client arrives at the storage device o an I/O from the pNFS client arrives at the storage device
The metadata server and its storage devices MUST solve this by The metadata server and its storage devices MUST solve this by
fencing the client. In other words, prevent the execution of I/O fencing the client. In other words, prevent the execution of I/O
operations from the client to the storage devices after layout state operations from the client to the storage devices after layout state
loss. The details of how fencing is done are specific to the layout loss. The details of how fencing is done are specific to the layout
type. The solution for NFSv4.1 file-based layouts is described in type. The solution for NFSv4.1 file-based layouts is described in
(Section 13.11), and for other layout types in their respective (Section 13.11), and for other layout types in their respective
external specification documents. external specification documents.
12.7.4. Recovery from Metadata Server Restart 12.7.4. Recovery from Metadata Server Restart
The pNFS client will discover that the metadata server has restarted The pNFS client will discover that the metadata server has restarted
(e.g. restarted) via the methods described in Section 8.4.2 and via the methods described in Section 8.4.2 and discussed in a pNFS-
discussed in a pNFS-specific context in Paragraph 2, of specific context in Paragraph 2, of Section 12.7.2. The client MUST
Section 12.7.2. The client MUST stop using layouts and delete the stop using layouts and delete the device ID to device address
device ID to device address mappings it previously received from the mappings it previously received from the metadata server. Having
metadata server. Having done that, if the client wrote data to the done that, if the client wrote data to the storage device without
storage device without committing the layouts via LAYOUTCOMMIT, then committing the layouts via LAYOUTCOMMIT, then the client has
the client has additional work to do in order to have the client, additional work to do in order to have the client, metadata server
metadata server and storage device(s) all synchronized on the state and storage device(s) all synchronized on the state of the data.
of the data.
o If the client has data still modified and unwritten in the o If the client has data still modified and unwritten in the
client's memory, the client has only two choices. client's memory, the client has only two choices.
1. The client can obtain a layout via LAYOUTGET after the 1. The client can obtain a layout via LAYOUTGET after the
server's grace period and write the data to the storage server's grace period and write the data to the storage
devices. devices.
2. The client can write that data through the metadata server 2. The client can write that data through the metadata server
using the WRITE (Section 18.32) operation, and then obtain using the WRITE (Section 18.32) operation, and then obtain
skipping to change at page 424, line 15 skipping to change at page 424, line 15
| CLAIM_DELEG_PREV_FH | granted to a previous client instance; | | CLAIM_DELEG_PREV_FH | granted to a previous client instance; |
| | used after the client restarts. The server | | | used after the client restarts. The server |
| | MAY support CLAIM_DELEGATE_PREV or | | | MAY support CLAIM_DELEGATE_PREV or |
| | CLAIM_DELEG_PREV_FH (new to NFSv4.1). If | | | CLAIM_DELEG_PREV_FH (new to NFSv4.1). If |
| | it does support either open type, | | | it does support either open type, |
| | CREATE_SESSION MUST NOT remove the | | | CREATE_SESSION MUST NOT remove the |
| | client's delegation state, and the server | | | client's delegation state, and the server |
| | MUST support the DELEGPURGE operation. | | | MUST support the DELEGPURGE operation. |
+----------------------+--------------------------------------------+ +----------------------+--------------------------------------------+
For OPEN requests whose claim type is other than CLAIM_PREVIOUS (i.e. For OPEN requests that reach the server during the grace period, the
requests other than those devoted to reclaiming opens after a server server returns an error of NFS4ERR_GRACE. The following claim types
restart) that reach the server during its grace or lease expiration are exceptions:
period, the server returns an error of NFS4ERR_GRACE.
o OPEN requests specifying the claim type CLAIM_PREVIOUS are devoted
to reclaiming opens after a server reboot and are typically only
valid during the grace period.
o OPEN requests specifying the claim types CLAIM_DELEGATE_CUR and
CLAIM_DELEG_CUR_FH are valid both during and after the grace
period. Since the granting of the delegation that they are
subordinate to assures that there is no conflict with locks to be
reclaimed by other clients, the server need not return
NFS4ERR_GRACE when these are received during the grace period.
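The grace period handling of OPEN by claim type can be summarised in a
short sketch. It models a server that makes use of the exceptions
listed above. The function name is hypothetical, claim types are
represented as plain strings, and all other OPEN processing is
omitted.

```python
# Claim types that are not rejected merely because the server is in its
# grace period.
GRACE_EXEMPT_CLAIMS = frozenset({
    "CLAIM_PREVIOUS",
    "CLAIM_DELEGATE_CUR",
    "CLAIM_DELEG_CUR_FH",
})


def rejects_with_grace(claim_type: str, in_grace_period: bool) -> bool:
    """Return True if the OPEN is answered with NFS4ERR_GRACE."""
    return in_grace_period and claim_type not in GRACE_EXEMPT_CLAIMS
```

The sketch only covers the NFS4ERR_GRACE decision. It does not model
the fact that CLAIM_PREVIOUS is typically valid only during the grace
period.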
For any OPEN request, the server may return an open delegation, which For any OPEN request, the server may return an open delegation, which
allows further opens and closes to be handled locally on the client allows further opens and closes to be handled locally on the client
as described in Section 10.4. Note that delegation is up to the as described in Section 10.4. Note that delegation is up to the
server to decide. The client should never assume that delegation server to decide. The client should never assume that delegation
will or will not be granted in a particular instance. It should will or will not be granted in a particular instance. It should
always be prepared for either case. A partial exception is the always be prepared for either case. A partial exception is the
reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed.
In this case, delegation will always be granted, although the server In this case, delegation will always be granted, although the server
may specify an immediate recall in the delegation structure. may specify an immediate recall in the delegation structure.
skipping to change at page 429, line 23 skipping to change at page 429, line 31
use time_modify_set or time_access_set to store the verifier. The use time_modify_set or time_access_set to store the verifier. The
server SHOULD NOT store the verifier in the following attributes: acl server SHOULD NOT store the verifier in the following attributes: acl
(it is desirable for access control to be established at creation), (it is desirable for access control to be established at creation),
dacl (ditto), mode (ditto), owner (ditto), owner_group (ditto), dacl (ditto), mode (ditto), owner (ditto), owner_group (ditto),
retentevt_set (it may be desired to establish retention at creation), retentevt_set (it may be desired to establish retention at creation),
retention_hold (ditto), retention_set (ditto), sacl (it is desirable retention_hold (ditto), retention_set (ditto), sacl (it is desirable
for auditing control to be established at creation), size (on some for auditing control to be established at creation), size (on some
servers, size may have a limited range of values), mode_set_masked servers, size may have a limited range of values), mode_set_masked
(as with mode), and time_creation (a meaningful file creation time should (as with mode), and time_creation (a meaningful file creation time should
be set when the file is created). Another alternative for the server be set when the file is created). Another alternative for the server
is to use named attribute to store the verifier. is to use a named attribute to store the verifier.
Because the EXCLUSIVE4 create method does not specify initial Because the EXCLUSIVE4 create method does not specify initial
attributes, when processing an EXCLUSIVE4 create, the server attributes, when processing an EXCLUSIVE4 create, the server
o SHOULD set the owner of the file to that corresponding to the o SHOULD set the owner of the file to that corresponding to the
credential of the request's RPC header. credential of the request's RPC header.
o SHOULD NOT leave the file's access control to anyone but the owner o SHOULD NOT leave the file's access control to anyone but the owner
of the file. of the file.
skipping to change at page 462, line 45 skipping to change at page 462, line 45
The definition of stable storage has been historically a point of The definition of stable storage has been historically a point of
contention. The following expected properties of stable storage may contention. The following expected properties of stable storage may
help in resolving design issues in the implementation. Stable storage help in resolving design issues in the implementation. Stable storage
is persistent storage that survives: is persistent storage that survives:
1. Repeated power failures. 1. Repeated power failures.
2. Hardware failures (of any board, power supply, etc.). 2. Hardware failures (of any board, power supply, etc.).
3. Repeated software crashes, including restart cycle. 3. Repeated software crashes and restarts.
This definition does not address failure of the stable storage module This definition does not address failure of the stable storage module
itself. itself.
The verifier is defined to allow a client to detect different The verifier is defined to allow a client to detect different
instances of an NFSv4.1 protocol server over which cached, instances of an NFSv4.1 protocol server over which cached,
uncommitted data may be lost. In the most likely case, the verifier uncommitted data may be lost. In the most likely case, the verifier
allows the client to detect server restarts. This information is allows the client to detect server restarts. This information is
required so that the client can safely determine whether the server required so that the client can safely determine whether the server
could have lost cached data. If the server fails unexpectedly and could have lost cached data. If the server fails unexpectedly and
the client has uncommitted data from previous WRITE requests (done the client has uncommitted data from previous WRITE requests (done
with the stable argument set to UNSTABLE4 and in which the result with the stable argument set to UNSTABLE4 and in which the result
committed was returned as UNSTABLE4 as well) it may not have flushed committed was returned as UNSTABLE4 as well) it may not have flushed
cached data to stable storage. The burden of recovery is on the cached data to stable storage. The burden of recovery is on the
client and the client will need to retransmit the data to the server. client and the client will need to retransmit the data to the server.
A suggested verifier would be to use the time that the server was A suggested verifier would be to use the time that the server was
booted or the time the server was last started (if restarting the last started (if restarting the server results in lost buffers).
server without a restart results in lost buffers).
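The verifier check described above can be sketched from the client's point of view. This is a minimal illustration, not text from the draft: the WriteTracker class and its method names are hypothetical, and the verifier is modelled as an opaque byte string that the client only compares for equality.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# stable_how4 values as defined for WRITE
UNSTABLE4, DATA_SYNC4, FILE_SYNC4 = 0, 1, 2


@dataclass
class WriteTracker:
    """Hypothetical client-side record of writes not yet on stable storage."""
    last_verifier: Optional[bytes] = None
    uncommitted: List[Tuple[int, bytes]] = field(default_factory=list)

    def on_write_reply(self, offset: int, data: bytes,
                       committed: int, verifier: bytes):
        """Process one WRITE reply and return the writes to retransmit."""
        resend = []
        if self.last_verifier is not None and verifier != self.last_verifier:
            # A changed verifier signals a different server instance, so
            # every earlier uncommitted write may have been lost.
            resend, self.uncommitted = self.uncommitted, []
        self.last_verifier = verifier
        if committed == UNSTABLE4:
            # Keep the data cached until a later COMMIT succeeds.
            self.uncommitted.append((offset, data))
        return resend
```

A client using this sketch would retransmit whatever on_write_reply returns, which matches the statement above that the burden of recovery is on the client.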
The committed field in the results allows the client to do more The committed field in the results allows the client to do more
effective caching. If the server is committing all WRITE requests to effective caching. If the server is committing all WRITE requests to
stable storage, then it should return with committed set to stable storage, then it should return with committed set to
FILE_SYNC4, regardless of the value of the stable field in the FILE_SYNC4, regardless of the value of the stable field in the
arguments. A server that uses an NVRAM accelerator may choose to arguments. A server that uses an NVRAM accelerator may choose to
implement this policy. The client can use this to increase the implement this policy. The client can use this to increase the
effectiveness of the cache by discarding cached data that has already effectiveness of the cache by discarding cached data that has already
been committed on the server. been committed on the server.
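The policy for the committed field can be illustrated with two small helpers. Both function names and the server_always_syncs parameter are assumptions made for this sketch. When the server does not always sync, it simply echoes the requested stability level, which is only one permissible choice.

```python
# stable_how4 values as defined for WRITE
UNSTABLE4, DATA_SYNC4, FILE_SYNC4 = 0, 1, 2


def committed_level(requested_stable: int, server_always_syncs: bool) -> int:
    """Server side: choose the committed value returned in the WRITE result."""
    if server_always_syncs:
        # For example, an NVRAM-backed server that commits every WRITE.
        return FILE_SYNC4
    # Assumption for this sketch: otherwise echo what the client asked for.
    return requested_stable


def may_discard(committed: int) -> bool:
    """Client side: cached data can be dropped once it is fully committed.

    Only the FILE_SYNC4 case from the text is covered; handling of
    DATA_SYNC4 is deliberately left out of this sketch.
    """
    return committed == FILE_SYNC4
```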
skipping to change at page 522, line 50 skipping to change at page 522, line 50
SEQ4_STATUS_LEASE_MOVED SEQ4_STATUS_LEASE_MOVED
When set indicates that responsibility for lease renewal has been When set indicates that responsibility for lease renewal has been
transferred to one or more new servers. This condition will transferred to one or more new servers. This condition will
continue until the client receives an NFS4ERR_MOVED error and the continue until the client receives an NFS4ERR_MOVED error and the
server receives the subsequent GETATTR for the fs_locations or server receives the subsequent GETATTR for the fs_locations or
fs_locations_info attribute for an access to each file system for fs_locations_info attribute for an access to each file system for
which a lease has been moved to a new server. See which a lease has been moved to a new server. See
Section 11.7.7.1. Section 11.7.7.1.
SEQ4_STATUS_RESTART_RECLAIM_NEEDED SEQ4_STATUS_RESTART_RECLAIM_NEEDED
When set indicates that due to server restart or restart the When set indicates that due to server restart the client must
client must reclaim locking state. Until the client sends a reclaim locking state. Until the client sends a global
global RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation will
will return SEQ4_STATUS_RESTART_RECLAIM_NEEDED. return SEQ4_STATUS_RESTART_RECLAIM_NEEDED.
SEQ4_STATUS_BACKCHANNEL_FAULT SEQ4_STATUS_BACKCHANNEL_FAULT
The server has encountered an unrecoverable fault with the The server has encountered an unrecoverable fault with the
backchannel (e.g. it has lost track of the sequence id for a slot backchannel (e.g. it has lost track of the sequence id for a slot
in the backchannel). The client MUST stop sending more requests in the backchannel). The client MUST stop sending more requests
on the session's fore channel, wait for all outstanding requests on the session's fore channel, wait for all outstanding requests
to complete on the fore and back channel, and then destroy the to complete on the fore and back channel, and then destroy the
session. session.
SEQ4_STATUS_DEVID_CHANGED SEQ4_STATUS_DEVID_CHANGED
skipping to change at page 524, line 25 skipping to change at page 524, line 25
The server MUST maintain a mapping of sessionid to client ID in order The server MUST maintain a mapping of sessionid to client ID in order
to validate any operations that follow SEQUENCE that take a stateid to validate any operations that follow SEQUENCE that take a stateid
as an argument and/or result. as an argument and/or result.
If the client establishes a persistent session, then a SEQUENCE done If the client establishes a persistent session, then a SEQUENCE done
after a server restart may encounter requests performed and recorded after a server restart may encounter requests performed and recorded
in a persistent reply cache before the server restart. In this case, in a persistent reply cache before the server restart. In this case,
SEQUENCE will be processed successfully, while requests which were SEQUENCE will be processed successfully, while requests which were
not processed previously are rejected with NFS4ERR_DEADSESSION. not processed previously are rejected with NFS4ERR_DEADSESSION.
Depending on the operations within the COMPOUND successfully Depending on which of the operations within the COMPOUND were
performed before the server restart, these operations will also have successfully performed before the server restart, these operations
replies sent from the server reply cache. Note that when these will also have replies sent from the server reply cache. Note that
operations establish locking state it is locking state that applies when these operations establish locking state it is locking state
to the previous server instance and to the previous client ID, even that applies to the previous server instance and to the previous
though the server restart, which logically happened after these client ID, even though the server restart, which logically happened
operations eliminated that state. In the case of a partially after these operations, eliminated that state. In the case of a
executed COMPOUND, processing may reach an operation not processed partially executed COMPOUND, processing may reach an operation not
during the earlier server instance, making this operation a new one processed during the earlier server instance, making this operation a
and not performable on the existing session. In this case new one and not performable on the existing session. In this case,
NFS4ERR_DEADSESSION will be returned from that operation. NFS4ERR_DEADSESSION will be returned from that operation.
18.47. Operation 54: SET_SSV - Update SSV for a Client ID 18.47. Operation 54: SET_SSV - Update SSV for a Client ID
18.47.1. ARGUMENT 18.47.1. ARGUMENT
struct ssa_digest_input4 { struct ssa_digest_input4 {
SEQUENCE4args sdi_seqargs; SEQUENCE4args sdi_seqargs;
}; };
skipping to change at page 529, line 14 skipping to change at page 529, line 14
18.49.1. ARGUMENT 18.49.1. ARGUMENT
union deleg_claim4 switch (open_claim_type4 dc_claim) { union deleg_claim4 switch (open_claim_type4 dc_claim) {
/* /*
* No special rights to object. Ordinary delegation * No special rights to object. Ordinary delegation
* request of the specified object. Object identified * request of the specified object. Object identified
* by filehandle. * by filehandle.
*/ */
case CLAIM_FH: /* new to v4.1 */ case CLAIM_FH: /* new to v4.1 */
/* CURRENT_FH: object being delegated */
void; void;
/* /*
* Right to file based on a delegation granted * Right to file based on a delegation granted
* to a previous boot instance of the client. * to a previous boot instance of the client.
* File is specified by filehandle. * File is specified by filehandle.
*/ */
case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ case CLAIM_DELEG_PREV_FH: /* new to v4.1 */
/* CURRENT_FH: object being delegated */ /* CURRENT_FH: object being delegated */
void; void;
skipping to change at page 530, line 21 skipping to change at page 530, line 21
This operation allows a client to This operation allows a client to
o get a delegation on all types of files except directories. The o get a delegation on all types of files except directories. The
server MAY support this operation. If the server does not support server MAY support this operation. If the server does not support
this operation, it MUST return NFS4ERR_NOTSUPP. this operation, it MUST return NFS4ERR_NOTSUPP.
o register a "want" for a delegation for the specified file object, o register a "want" for a delegation for the specified file object,
and be notified via a callback when the delegation is available. and be notified via a callback when the delegation is available.
The server MAY support notifications of availability via The server MAY support notifications of availability via
callbacks. If the server does not support registration of wants callbacks. If the server does not support registration of wants
it MUST NOT return an error to indicate that. When the server it MUST NOT return an error to indicate that, and instead MUST
indicates that it will notify the server by means of a callback, return ond_why set to WND4_CONTENTION or WND4_RESOURCE and
it will either provide the delegation using a CB_PUSH_DELEG ond_server_will_push_deleg or ond_server_will_signal_avail set to
operation, or cancel its promise by sending a CB_WANTS_CANCELLED FALSE. When the server indicates that it will notify the client
operation. by means of a callback, it will either provide the delegation
using a CB_PUSH_DELEG operation, or cancel its promise by sending
a CB_WANTS_CANCELLED operation.
o cancel a want for a delegation. o cancel a want for a delegation.
The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set
OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server MUST OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server MUST
ignore them. ignore them.
The meanings of the following flags in wda_want are the same as they The meanings of the following flags in wda_want are the same as they
are in OPEN: are in OPEN:
skipping to change at page 556, line 41 skipping to change at page 556, line 41
protocol error must result. See Section 18.46.3 for a description of protocol error must result. See Section 18.46.3 for a description of
how slots are processed. how slots are processed.
If csa_cachethis is TRUE, then the server is requesting that the If csa_cachethis is TRUE, then the server is requesting that the
client cache the reply in the callback reply cache. The client MUST client cache the reply in the callback reply cache. The client MUST
cache the reply (see Section 2.10.5.1.3). cache the reply (see Section 2.10.5.1.3).
The csa_referring_call_lists array is the list of COMPOUND requests, The csa_referring_call_lists array is the list of COMPOUND requests,
identified by sessionid, slot id and sequence id. These are requests identified by sessionid, slot id and sequence id. These are requests
that the client previously sent to the server. These previous that the client previously sent to the server. These previous
requests created state that some operation(s) in the in the same requests created state that some operation(s) in the same CB_COMPOUND
CB_COMPOUND as the csa_referring_call_lists is identifying. A as the csa_referring_call_lists is identifying. A sessionid is
sessionid is included because leased state is tied to a client ID, included because leased state is tied to a client ID, and a client ID
and a client ID can have multiple sessions. See Section 2.10.5.3. can have multiple sessions. See Section 2.10.5.3.
The value of csa_sequenceid argument relative to the cached sequence The value of csa_sequenceid argument relative to the cached sequence
id on the slot falls into one of three cases. id on the slot falls into one of three cases.
o If the difference between csa_sequenceid and the client's cached o If the difference between csa_sequenceid and the client's cached
sequence id at the slot id is two (2) or more, or if sequence id at the slot id is two (2) or more, or if
csa_sequenceid is less than the cached sequence id (accounting for csa_sequenceid is less than the cached sequence id (accounting for
wraparound of the unsigned sequence id value), then the client wraparound of the unsigned sequence id value), then the client
MUST return NFS4ERR_SEQ_MISORDERED. MUST return NFS4ERR_SEQ_MISORDERED.
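The comparison of csa_sequenceid with the cached sequence id can be sketched with modular arithmetic on a 32-bit unsigned value. Only the misordered case appears in this excerpt. The other two outcomes in the sketch (a retransmission when the values are equal, a new request when csa_sequenceid is one greater) are assumptions based on the usual slot rules. The function name and the "REPLAY" and "NEW" labels are hypothetical.

```python
SEQ_MODULUS = 2 ** 32  # sequence ids are unsigned 32-bit values


def classify_callback_sequence(csa_sequenceid: int,
                               cached_sequenceid: int) -> str:
    """Classify a CB_SEQUENCE sequence id against the slot's cached value."""
    # The difference modulo 2**32 handles wraparound: an id that is
    # "less than" the cached one shows up as a very large delta.
    delta = (csa_sequenceid - cached_sequenceid) % SEQ_MODULUS
    if delta == 0:
        return "REPLAY"   # assumed: retransmission, answer from the cache
    if delta == 1:
        return "NEW"      # assumed: next request in order on this slot
    return "NFS4ERR_SEQ_MISORDERED"
```

With this formulation, a difference of two or more and a sequence id below the cached one both reach the final return, so the two conditions in the text need no separate tests.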
 End of changes. 43 change blocks. 
116 lines changed or deleted 135 lines changed or added
