--- 1/draft-pre-ch-19.txt 2008-05-03 17:15:23.007662900 -0700 +++ 2/draft-ietf-nfsv4-minorversion1-23.txt 2008-05-03 17:15:23.408246900 -0700 @@ -1,19 +1,19 @@ NFSv4 S. Shepler Internet-Draft M. Eisler Intended status: Standards Track D. Noveck -Expires: November 2, 2008 Editors - May 1, 2008 +Expires: November 4, 2008 Editors + May 3, 2008 NFS Version 4 Minor Version 1 - draft-ietf-nfsv4-minorversion1-22.txt + draft-ietf-nfsv4-minorversion1-23.txt Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that @@ -24,21 +24,21 @@ and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. - This Internet-Draft will expire on November 2, 2008. + This Internet-Draft will expire on November 4, 2008. Copyright Notice Copyright (C) The IETF Trust (2008). Abstract This Internet-Draft describes NFS version 4 minor version one, including features retained from the base protocol and protocol extensions made subsequently. Major extensions introduced in NFS @@ -267,36 +267,36 @@ 12.2.8. Layout . . . . . . . . . . . . . . . . . . . . . . . 269 12.2.9. Layout Iomode . . . . . . . . . . . . . . . . . . . 269 12.2.10. Device IDs . . . . . . . . . . . . . . . . . . . . . 270 12.3. pNFS Operations . . . . . . . . . . . . 
. . . . . . . . 271 12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 272 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 272 12.5.1. Guarantees Provided by Layouts . . . . . . . . . . . 272 12.5.2. Getting a Layout . . . . . . . . . . . . . . . . . . 273 12.5.3. Layout Stateid . . . . . . . . . . . . . . . . . . . 274 12.5.4. Committing a Layout . . . . . . . . . . . . . . . . 275 - 12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 279 + 12.5.5. Recalling a Layout . . . . . . . . . . . . . . . . . 278 12.5.6. Revoking Layouts . . . . . . . . . . . . . . . . . . 287 12.5.7. Metadata Server Write Propagation . . . . . . . . . 287 12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 287 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 289 12.7.1. Recovery from Client Restart . . . . . . . . . . . . 289 - 12.7.2. Dealing with Lease Expiration on the Client . . . . 290 + 12.7.2. Dealing with Lease Expiration on the Client . . . . 289 12.7.3. Dealing with Loss of Layout State on the Metadata - Server . . . . . . . . . . . . . . . . . . . . . . . 291 + Server . . . . . . . . . . . . . . . . . . . . . . . 290 12.7.4. Recovery from Metadata Server Restart . . . . . . . 291 12.7.5. Operations During Metadata Server Grace Period . . . 293 - 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 294 + 12.7.6. Storage Device Recovery . . . . . . . . . . . . . . 293 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 294 12.9. Security Considerations for pNFS . . . . . . . . . . . . 294 13. PNFS: NFSv4.1 File Layout Type . . . . . . . . . . . . . . . 295 - 13.1. Client ID and Session Considerations . . . . . . . . . . 296 + 13.1. Client ID and Session Considerations . . . . . . . . . . 295 13.1.1. Sessions Considerations for Data Servers . . . . . . 298 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 298 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 299 13.4. 
Interpreting the File Layout . . . . . . . . . . . . . . 303 13.4.1. Determining the Stripe Unit Number . . . . . . . . . 303 13.4.2. Interpreting the File Layout Using Sparse Packing . 303 13.4.3. Interpreting the File Layout Using Dense Packing . . 306 13.4.4. Sparse and Dense Stripe Unit Packing . . . . . . . . 308 13.5. Data Server Multipathing . . . . . . . . . . . . . . . . 310 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 311 @@ -388,73 +388,73 @@ delegation . . . . . . . . . . . . . . . . . . . . . . . 510 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 514 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for a File System . . . . . . . . . . . . . . . . . . . 516 18.42. Operation 49: LAYOUTCOMMIT - Commit writes made using a layout . . . . . . . . . . . . . . . . . . . . . . . . 518 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 521 18.44. Operation 51: LAYOUTRETURN - Release Layout Information . . . . . . . . . . . . . . . . . . . . . . 526 18.45. Operation 52: SECINFO_NO_NAME - Get Security on - Unnamed Object . . . . . . . . . . . . . . . . . . . . . 530 + Unnamed Object . . . . . . . . . . . . . . . . . . . . . 531 18.46. Operation 53: SEQUENCE - Supply per-procedure - sequencing and control . . . . . . . . . . . . . . . . . 531 - 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 537 + sequencing and control . . . . . . . . . . . . . . . . . 532 + 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 538 18.48. Operation 55: TEST_STATEID - Test stateids for - validity . . . . . . . . . . . . . . . . . . . . . . . . 539 - 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 541 + validity . . . . . . . . . . . . . . . . . . . . . . . . 540 + 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 542 18.50. Operation 57: DESTROY_CLIENTID - Destroy existing - client ID . . . . . . . . . . . . . . . . . . . . . . . 545 + client ID . . . . . . . . . . . 
. . . . . . . . . . . . 546 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims - Finished . . . . . . . . . . . . . . . . . . . . . . . . 545 - 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 548 - 19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 548 - 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 549 - 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 549 - 20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 553 - 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 553 - 20.2. Operation 4: CB_RECALL - Recall an Open Delegation . . . 554 + Finished . . . . . . . . . . . . . . . . . . . . . . . . 546 + 18.52. Operation 10044: ILLEGAL - Illegal operation . . . . . . 549 + 19. NFSv4.1 Callback Procedures . . . . . . . . . . . . . . . . . 549 + 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 550 + 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 550 + 20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 554 + 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 554 + 20.2. Operation 4: CB_RECALL - Recall a Delegation . . . . . . 555 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from - Client . . . . . . . . . . . . . . . . . . . . . . . . . 555 - 20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 559 + Client . . . . . . . . . . . . . . . . . . . . . . . . . 556 + 20.4. Operation 6: CB_NOTIFY - Notify directory changes . . . 560 20.5. Operation 7: CB_PUSH_DELEG - Offer Delegation to - Client . . . . . . . . . . . . . . . . . . . . . . . . . 563 - 20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 564 + Client . . . . . . . . . . . . . . . . . . . . . . . . . 564 + 20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations . . 565 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal - Resources for Recallable Objects . . . . . . . . . . . . 566 + Resources for Recallable Objects . . . . . . . . . . . . 
568 20.8. Operation 10: CB_RECALL_SLOT - change flow control - limits . . . . . . . . . . . . . . . . . . . . . . . . . 567 + limits . . . . . . . . . . . . . . . . . . . . . . . . . 569 20.9. Operation 11: CB_SEQUENCE - Supply backchannel - sequencing and control . . . . . . . . . . . . . . . . . 568 + sequencing and control . . . . . . . . . . . . . . . . . 570 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending - Delegation Wants . . . . . . . . . . . . . . . . . . . . 570 + Delegation Wants . . . . . . . . . . . . . . . . . . . . 572 20.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible - lock availability . . . . . . . . . . . . . . . . . . . 571 + lock availability . . . . . . . . . . . . . . . . . . . 573 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify device ID - changes . . . . . . . . . . . . . . . . . . . . . . . . 573 + changes . . . . . . . . . . . . . . . . . . . . . . . . 575 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback - Operation . . . . . . . . . . . . . . . . . . . . . . . 575 - 21. Security Considerations . . . . . . . . . . . . . . . . . . . 575 - 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 577 - 22.1. Named Attribute Definitions . . . . . . . . . . . . . . 577 - 22.2. ONC RPC Network Identifiers (netids) . . . . . . . . . . 577 - 22.3. Defining New Notifications . . . . . . . . . . . . . . . 578 - 22.4. Defining New Layout Types . . . . . . . . . . . . . . . 578 - 22.5. Path Variable Definitions . . . . . . . . . . . . . . . 580 - 22.5.1. Path Variable Values . . . . . . . . . . . . . . . . 580 - 22.5.2. Path Variable Names . . . . . . . . . . . . . . . . 580 - 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 580 - 23.1. Normative References . . . . . . . . . . . . . . . . . . 580 - 23.2. Informative References . . . . . . . . . . . . . . . . . 582 - Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 584 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 
586 - Intellectual Property and Copyright Statements . . . . . . . . . 587 + Operation . . . . . . . . . . . . . . . . . . . . . . . 577 + 21. Security Considerations . . . . . . . . . . . . . . . . . . . 577 + 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 579 + 22.1. Named Attribute Definitions . . . . . . . . . . . . . . 579 + 22.2. ONC RPC Network Identifiers (netids) . . . . . . . . . . 579 + 22.3. Defining New Notifications . . . . . . . . . . . . . . . 580 + 22.4. Defining New Layout Types . . . . . . . . . . . . . . . 580 + 22.5. Path Variable Definitions . . . . . . . . . . . . . . . 582 + 22.5.1. Path Variable Values . . . . . . . . . . . . . . . . 582 + 22.5.2. Path Variable Names . . . . . . . . . . . . . . . . 582 + 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 582 + 23.1. Normative References . . . . . . . . . . . . . . . . . . 582 + 23.2. Informative References . . . . . . . . . . . . . . . . . 584 + Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 586 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 588 + Intellectual Property and Copyright Statements . . . . . . . . . 589 1. Introduction 1.1. The NFS Version 4 Minor Version 1 Protocol The NFS version 4 minor version 1 (NFSv4.1) protocol is the second minor version of the NFS version 4 (NFSv4) protocol. The first minor version, NFSv4.0 is described in [21]. It generally follows the guidelines for minor versioning model listed in Section 10 of RFC 3530. However, it diverges from guidelines 11 ("a client and server @@ -2368,27 +2368,27 @@ CB_SEQUENCE (e.g. BIND_CONN_TO_SESSION), then the RPC XID is needed for correct operation to match the reply to the request. o The SEQUENCE or CB_SEQUENCE operation may generate an error. If so, the embedded slot id, sequence id, and sessionid (if present) in the request will not be in the reply, and the requester has only the XID to match the reply to the request. 
Given that well formulated XIDs continue to be required, this begs the question why SEQUENCE and CB_SEQUENCE replies have a sessionid, - slot id and sequence id? Having the sessionid in the reply means the - requester does not have to use the XID to lookup the sessionid, which - would be necessary if the connection were associated with multiple - sessions. Having the slot id and sequence id in the reply means - requester does not have to use the XID to lookup the slot id and - sequence id. Furthermore, since the XID is only 32 bits, it is too - small to guarantee the re-association of a reply with its request + slot id and sequence id? Having the session id in the reply means + the requester does not have to use the XID to lookup the session id, + which would be necessary if the connection were associated with + multiple sessions. Having the slot id and sequence id in the reply + means requester does not have to use the XID to lookup the slot id + and sequence id. Furthermore, since the XID is only 32 bits, it is + too small to guarantee the re-association of a reply with its request ([27]); having sessionid, slot id, and sequence id in the reply allows the client to validate that the reply in fact belongs to the matched request. The SEQUENCE (and CB_SEQUENCE) operation also carries a "highest_slotid" value which carries additional requester slot usage information. The requester must always indicate the slot id representing the outstanding request with the highest-numbered slot value. The requester should in all cases provide the most conservative value possible, although it can be increased somewhat @@ -2457,43 +2457,44 @@ entries at least as large as the old value of maximum requests outstanding, until it can infer that the requester has seen a reply containing the new granted highest_slotid. The replier can infer that the requester has seen such a reply when it receives a new request with the same slotid as the request replied to and the next higher sequenceid. 
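The inference rule in the preceding paragraph can be sketched as follows. This is a minimal illustration rather than text from the draft: the PendingReply record and the function name are hypothetical, and the sketch assumes that sequence ids are 32-bit unsigned values that wrap to zero.

```python
from dataclasses import dataclass

SEQ_MODULUS = 2 ** 32  # assumption: sequence ids are 32-bit unsigned and wrap


@dataclass(frozen=True)
class PendingReply:
    """Hypothetical replier-side record of the reply that carried the
    newly granted highest_slotid."""
    slot_id: int      # slot of the request that was replied to
    sequence_id: int  # sequence id of that request


def requester_has_seen_reply(pending: PendingReply,
                             slot_id: int, sequence_id: int) -> bool:
    """True if a newly received request proves the requester saw the reply.

    The proof is a new request on the same slot whose sequence id is the
    next higher one (with wraparound). Until this returns True, the replier
    keeps slot entries up to the old maximum.
    """
    next_seq = (pending.sequence_id + 1) % SEQ_MODULUS
    return slot_id == pending.slot_id and sequence_id == next_seq
```

A retry of the original request (same slot, same sequence id) does not satisfy the rule, so the replier would keep the larger slot table until a genuinely new request arrives on that slot.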
2.10.5.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies When a SEQUENCE or CB_SEQUENCE operation is successfully executed, - its reply MUST always be cached. Specifically, sessionid, - sequenceid, and slotid MUST be cached in the reply cache. The reply - from SEQUENCE also includes the highest slotid, target highest - slotid, and status flags. Instead of caching these values, the - server MAY re-compute the values from the current state of the fore - channel, session and/or client ID as appropriate. Similarly, the - reply from CB_SEQUENCE includes a highest slotid and target highest - slotid. The client MAY re-compute the values from the current state - of the session as appropriate. + its reply MUST always be cached. Specifically, session id, sequence + id, and slot id MUST be cached in the reply cache. The reply from + SEQUENCE also includes the highest slot id, target highest slot id, + and status flags. Instead of caching these values, the server MAY + re-compute the values from the current state of the fore channel, + session and/or client ID as appropriate. Similarly, the reply from + CB_SEQUENCE includes a highest slot id and target highest slot id. + The client MAY re-compute the values from the current state of the + session as appropriate. Regardless of whether a replier is re-computing highest slotid, - target slotid, and status on replies to retries or not, the requester - MUST NOT assume the values are being re-computed whenever it receives - a reply after a retry is sent, since it has no way of knowing whether - the reply it has received was sent by the server in response to the - retry, or is a delayed response to the original request. Therefore, - it may be the case that highest slotid, target slotid, or status bits - may reflect the state of affairs when the request was first executed. - Although acting based on such delayed information is valid, it may - cause the receiver to do unneeded work. 
Requesters MAY choose to - send additional requests to get the current state of affairs or use - the state of affairs reported by subsequent requests, in preference - to acting immediately on data which may be out of date. + target slot id, and status on replies to retries or not, the + requester MUST NOT assume the values are being re-computed whenever + it receives a reply after a retry is sent, since it has no way of + knowing whether the reply it has received was sent by the server in + response to the retry, or is a delayed response to the original + request. Therefore, it may be the case that highest slot id, target + slot id, or status bits may reflect the state of affairs when the + request was first executed. Although acting based on such delayed + information is valid, it may cause the receiver to do unneeded work. + Requesters MAY choose to send additional requests to get the current + state of affairs or use the state of affairs reported by subsequent + requests, in preference to acting immediately on data which may be + out of date. 2.10.5.1.2. Errors from SEQUENCE and CB_SEQUENCE Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence id of the slot MUST NOT change. The replier MUST NOT modify the reply cache entry for the slot whenever an error is returned from SEQUENCE or CB_SEQUENCE. 2.10.5.1.3. Optional Reply Caching @@ -2585,44 +2586,44 @@ client may have been granted a delegation to a file it has opened, but the reply to the OPEN (informing the client of the granting of the delegation) may be delayed in the network. If a conflicting operation arrives at the server, it will recall the delegation using the backchannel, which may be on a different transport connection, perhaps even a different network, or even a different session associated with the same client ID. The presence of a session between client and server alleviates this issue. 
When a session is in place, each client request is uniquely - identified by its { sessionid, slot id, sequence id } triple. By the - rules under which slot entries (reply cache entries) are retired, the - server has knowledge whether the client has "seen" each of the + identified by its { session id, slot id, sequence id } triple. By + the rules under which slot entries (reply cache entries) are retired, + the server has knowledge whether the client has "seen" each of the server's replies. The server can therefore provide sufficient information to the client to allow it to disambiguate between an erroneous or conflicting callback race condition. For each client operation which might result in some sort of server callback, the server SHOULD "remember" the { sessionid, slot id, sequence id } triple of the client request until the slot id retirement rules allow the server to determine that the client has, in fact, seen the server's reply. Until the time the { sessionid, slot id, sequence id } request triple can be retired, any recalls of the associated object MUST carry an array of these referring identifiers (in the CB_SEQUENCE operation's arguments), for the benefit of the client. After this time, it is not necessary for the server to provide this information in related callbacks, since it is certain that a race condition can no longer occur. The CB_SEQUENCE operation which begins each server callback carries a list of "referring" { sessionid, slot id, sequence id } triples. If - the client finds the request corresponding to the referring - sessionid, slot id and sequence id to be currently outstanding (i.e. - the server's reply has not been seen by the client), it can determine + the client finds the request corresponding to the referring session + id, slot id and sequence id to be currently outstanding (i.e. the + server's reply has not been seen by the client), it can determine that the callback has raced the reply, and act accordingly. 
If the client does not find the request corresponding to the referring triple to be outstanding (including the case of a sessionid referring to a destroyed session), then there is no race with respect to this triple. The server SHOULD limit the referring triples to requests that refer to just those that apply to the objects referred to in the CB_COMPOUND procedure. The client must not simply wait forever for the expected server reply to arrive before responding to the CB_COMPOUND that won the race, @@ -2748,23 +2749,23 @@ sequence id) MUST be rejected with NFS4ERR_DEADSESSION (returned by SEQUENCE). Such a session is considered dead. A server MAY re-animate a session after a server restart so that the session will accept new requests as well as retries. To re-animate a session the server needs to persist additional information through server restart: o The client ID. This is a prerequisite to let the client create more sessions associated with the same client ID as the - o The client ID's sequenceid that is used for creating sessions (see - Section 18.35 and Section 18.36). This is a prerequisite to let - the client create more sessions. + o The client ID's sequence id that is used for creating sessions + (see Section 18.35 and Section 18.36). This is a prerequisite to + let the client create more sessions. o The principal that created the client ID. This allows the server to authenticate the client when it sends EXCHANGE_ID. o The SSV, if SP4_SSV state protection was specified when the client ID was created (see Section 18.35). This lets the client create new sessions, and associate connections with the new and existing sessions. o The properties of the client ID as defined in Section 18.35. @@ -3527,22 +3528,22 @@ o A catastrophe that causes the reply cache to be corrupted or lost on the media it was stored on. This applies even if the replier indicated in the CREATE_SESSION results that it would persist the cache. 
o The server purges the session of a client that has been inactive for a very extended period of time. Loss of reply cache is equivalent to loss of session. The replier indicates loss of session to the requester by returning - NFS4ERR_BADSESSION on the next operation that uses the sessionid that - refers to the lost session. + NFS4ERR_BADSESSION on the next operation that uses the session id + that refers to the lost session. After an event like a server restart, the client may have lost its connections. The client assumes for the moment that the session has not been lost. It reconnects, and if it specified connection association enforcement when the session was created, it invokes BIND_CONN_TO_SESSION using the sessionid. Otherwise, it invokes SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns NFS4ERR_BADSESSION, the client knows the session was lost. If the connection survives session loss, then the next SEQUENCE operation the client sends over the connection will get back @@ -12863,26 +12864,24 @@ is incapable of providing this check in the presence of mandatory file locks, the metadata server then MUST NOT grant layouts and mandatory file locks simultaneously. 12.5.2. Getting a Layout A client obtains a layout with the LAYOUTGET operation. The metadata server will grant layouts of a particular type (e.g., block/volume, object, or file). The client selects an appropriate layout type that the server supports and the client is prepared to use. The layout - returned to the client may not exactly align with the requested byte - range. A field within the LAYOUTGET request, loga_minlength, - specifies the minimum length of the layout. The loga_minlength field - should be at least one. As needed a client may make multiple - LAYOUTGET requests; these will result in multiple overlapping, non- - conflicting layouts. + returned to the client might not exactly match the requested byte + range as described in Section 18.43.3. 
As needed a client may make + multiple LAYOUTGET requests; these might result in multiple + overlapping, non-conflicting layouts (see Section 12.2.8). In order to get a layout, the client must first have opened the file via the OPEN operation. When a client has no layout on a file, it MUST present a stateid as returned by OPEN, a delegation stateid, or a byte-range lock stateid in the loga_stateid argument. A successful LAYOUTGET result includes a layout stateid. The first successful LAYOUTGET processed by the server using a non-layout stateid as an argument MUST have the "seqid" field of the layout stateid in the response set to one. Thereafter, the client uses a layout stateid (see Section 12.5.3) on future invocations of LAYOUTGET on the file, @@ -12944,21 +12943,21 @@ correct "seqid" is defined as the highest "seqid" value from responses of fully processed LAYOUTGET or LAYOUTRETURN operations or arguments of a fully processed CB_LAYOUTRECALL operation. Since the server is incrementing the "seqid" value on each layout operation, the client may determine the order of operation processing by inspecting the "seqid" value. In the case of overlapping layout ranges, the ordering information will provide the client the knowledge of which layout ranges are held. Note that overlapping layout ranges may occur because of the client's specific requests or because the server is allowed to expand the range of a requested - layout and notify the client in the LAYOUTRETURN results Additional + layout and notify the client in the LAYOUTRETURN results. Additional layout stateid sequencing requirements are provided in Section 12.5.5.2. The client's receipt of a "seqid" is not sufficient for subsequent use. The client must fully process the operations before the "seqid" can be used. For LAYOUTGET results, if the client is not using the forgetful model (Section 12.5.5.1), it MUST first update its record of what ranges of the file's layout it has before using the seqid. 
For LAYOUTRETURN results, the client MUST delete the range from its record of what ranges of the file's layout it had before using the @@ -23376,31 +23375,50 @@ records introduced in the description of EXCHANGE_ID is used with the following addition: clientid_arg: The value of the csa_clientid field of the CREATE_SESSION4args structure of the current request. Since CREATE_SESSION is a non-idempotent operation, we must consider the possibility that retries may occur as a result of a client restart, network partition, malfunctioning router, etc. For each client ID created by EXCHANGE_ID, the server maintains a separate - reply cache similar to the session reply cache used for SEQUENCE - operations, with two distinctions. + reply cache (called the CREATE_SESSION reply cache) similar to the + session reply cache used for SEQUENCE operations, with two + distinctions. o First this is a reply cache just for detecting and processing CREATE_SESSION requests for a given client ID. o Second, the size of the client ID reply cache is of one slot (and as a result, the CREATE_SESSION request does not carry a slot number). This means that at most one CREATE_SESSION request for a given client ID can be outstanding. + As previously stated, CREATE_SESSION can be sent with or without a + preceding SEQUENCE operation. Even if SEQUENCE precedes + CREATE_SESSION, the server MUST maintain the CREATE_SESSION reply + cache, which is separate from the reply cache for the session + associated with SEQUENCE. If CREATE_SESSION was originally sent by + itself, the client MAY send a retry of the CREATE_SESSION operation + within a COMPOUND preceded by SEQUENCE. If CREATE_SESSION was + originally sent in a COMPOUND that started with SEQUENCE, then the + client SHOULD send a retry in a COMPOUND that starts with SEQUENCE + that has the same session id as the SEQUENCE of the original request. 
+ However, the client MAY send a retry in a COMPOUND that either has no + preceding SEQUENCE, or has a preceding SEQUENCE that refers to a + different session than the original CREATE_SESSION. This might be + necessary if the client sends a CREATE_SESSION in a COMPOUND preceded + by a SEQUENCE with session id X, and session X no longer exists. + Regardless, any retry of CREATE_SESSION, with or without a preceding + SEQUENCE, MUST use the same value of csa_sequence as the original. + When a client sends a successful EXCHANGE_ID and it is returned an unconfirmed client ID, the client is also returned eir_sequenceid, and the client is expected to set the value of csa_sequenceid in the client ID-confirming-CREATE_SESSION it sends with that client ID to the value of eir_sequenceid. When EXCHANGE_ID returns a new, unconfirmed client ID, the server initializes the client ID slot to be equal to eir_sequenceid - 1 (accounting for underflow), and records a contrived CREATE_SESSION result with a "cached" result of NFS4ERR_SEQ_MISORDERED. With the slot thus initialized, the processing of the CREATE_SESSION operation is divided into four @@ -24195,161 +24214,194 @@ the sessionid in the preceding SEQUENCE operation), current filehandle, layout type (loga_layout_type), and the layout stateid (loga_stateid). The use of the loga_iomode field depends upon the layout type, but should reflect the client's data access intent. If the metadata server is in a grace period, and does not persist layouts and device ID to device address mappings, then it MUST return NFS4ERR_GRACE (see Section 8.4.2.1). The LAYOUTGET operation returns layout information for the specified - byte range: a layout. To get a layout from a specific offset through - the end-of-file, regardless of the file's length, a loga_length field - set to NFS4_UINT64_MAX is used. 
If loga_length is zero, or if a - loga_length which is not NFS4_UINT64_MAX is specified, and the sum of - loga_length and loga_offset exceeds NFS4_UINT64_MAX, the error - NFS4ERR_INVAL will result. + byte range: a layout. The client actually specifies two ranges, both + starting at the offset in the loga_offset field. The first range is + between loga_offset and loga_offset + loga_length - 1 inclusive. + This range indicates the desired range the client wants the layout to + cover. The second range is between loga_offset and loga_offset + + loga_minlength - 1 inclusive. This range indicates the required + range the client needs the layout to cover. Thus, loga_minlength + MUST be less than or equal to loga_length. - The loga_minlength field specifies the minimum length of layout the - server MUST return with two exceptions: + When a length field is set to NFS4_UINT64_MAX, this indicates a + desire (when loga_length is NFS4_UINT64_MAX) or requirement (when + loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset + through the end-of-file, regardless of the file's length. - 1. The argument loga_iomode was set to LAYOUTIOMODE_READ, and - loga_offset plus loga_minlength goes past the end of the file. + If loga_length or loga_minlength is zero, the metadata server MUST + return NFS4ERR_INVAL. If the sum of loga_offset and loga_minlength + exceeds NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the + error NFS4ERR_INVAL will result. If the sum of loga_offset and + loga_length exceeds NFS4_UINT64_MAX, and loga_length is not + NFS4_UINT64_MAX, the error NFS4ERR_INVAL will result. - 2. The range from loga_offset through loga_offset + loga_minlength - - 1 overlaps two or more striping patterns. In which case, - logr_layout will contain two or more elements, and the sum of the - lo_length fields of each element MUST be at least loga_minlength - unless the first exception also applies. 
+ Any layout the metadata server returns: - If this requirement cannot be met, the server MUST NOT return a - layout and the error NFS4ERR_BADLAYOUT MUST be returned. + MUST + + start no higher than an offset of loga_offset. + + MAY + + start less than an offset of loga_offset. + + MUST + + have a length no less than loga_minlength, unless the field loga_iomode + was set to LAYOUTIOMODE4_READ, and the sum of loga_offset and + loga_minlength goes past the end of the file. + + MAY + + have a length longer than loga_minlength. + + SHOULD + + have a length no less than loga_length. + + MAY + have a length longer than loga_length. + + If the metadata server cannot return a layout with an offset no + higher than loga_offset, and a length no smaller than loga_minlength, + the metadata server MUST NOT return a layout and the error + NFS4ERR_BADLAYOUT MUST be returned. The loga_stateid field specifies a valid stateid. If a layout is not currently held by the client, the loga_stateid field represents a stateid reflecting the correspondingly valid open, byte-range lock, or delegation stateid. Once a layout is held by the client for the - file, the loga_stateid field is a stateid as returned from a previous - LAYOUTGET or LAYOUTRETURN operation or provided by a CB_LAYOUTRECALL - operation (see Section 12.5.3). + file, the loga_stateid field MUST be a stateid as returned from a + previous LAYOUTGET or LAYOUTRETURN operation or provided by a + CB_LAYOUTRECALL operation (see Section 12.5.3). The loga_maxcount field specifies the maximum layout size (in bytes) that the client can handle. If the size of the layout structure exceeds the size specified by loga_maxcount, the metadata server will return the NFS4ERR_TOOSMALL error. The returned layout is expressed as an array, logr_layout, with each element of type layout4. If a file has a single striping pattern, then logr_layout will contain just one entry. 
Otherwise, if the requested range overlaps more than one striping pattern, logr_layout will contain the required number of entries. The elements of logr_layout MUST be sorted in ascending order of the value of the lo_offset field of each element. There MUST be no gaps or overlaps in the range between two successive elements of logr_layout. The lo_iomode field in each element of logr_layout MUST be the same. - The metadata server may adjust the range of the returned layout based - on the usage implied by the loga_iomode. The client MUST be prepared - to get a layout that does not align exactly with its request. See - Section 12.5.2 for more details. + The length of the returned layout is considered to be the sum of the + lo_length fields of each element. Thus, the sum of the lo_length + fields MUST be no less than loga_minlength, and SHOULD be no less + than loga_length. - The metadata server may also return a layout with an lo_iomode other - than that requested by the client. If it does so, it MUST ensure - that the lo_iomode is more permissive than the loga_iomode requested. - For example, this behavior allows an implementation to upgrade read- - only requests to read/write requests at its discretion, within the - limits of the layout type specific protocol. A lo_iomode of either + The metadata server MAY return a layout with an lo_iomode other than + that requested by the client. If it does so, it MUST ensure that the + lo_iomode is more permissive than the loga_iomode requested. For + example, this behavior allows an implementation to upgrade read-only + requests to read/write requests at its discretion, within the limits + of the layout type specific protocol. A lo_iomode of either LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned. The logr_return_on_close result field is a directive to return the - layout before closing the file. 
When the server sets this return - value to TRUE, it MUST be prepared to recall the layout in the case - the client fails to return the layout before close. For the server - that knows a layout must be returned before a close of the file, this - return value can be used to communicate the desired behavior to the - client and thus remove one extra step from the client's and server's - interaction. + layout before closing the file. When the metadata server sets this + return value to TRUE, it MUST be prepared to recall the layout in the + case the client fails to return the layout before close. For the + metadata server that knows a layout must be returned before a close + of the file, this return value can be used to communicate the desired + behavior to the client and thus remove one extra step from the + client's and metadata server's interaction. The logr_stateid stateid is returned to the client for use in subsequent layout related operations. See Section 8.2, Section 12.5.3, and Section 12.5.5.2 for a further discussion and requirements. The format of the returned layout (lo_content) is specific to the layout type. The value of the layout type (lo_content.loc_type) for - each of the elements of the array of layouts returned by the server - (logr_layout) MUST be equal to the loga_layout_type specified by the - client. If it is not equal, the client SHOULD ignore the response as - invalid and behave as if the server returned an error, even if the - client does have support for the layout type returned. + each of the elements of the array of layouts returned by the metadata + server (logr_layout) MUST be equal to the loga_layout_type specified + by the client. If it is not equal, the client SHOULD ignore the + response as invalid and behave as if the metadata server returned an + error, even if the client does have support for the layout type + returned. 
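The rules on the '+' side for the returned logr_layout array (ascending order, no gaps or overlaps, one lo_iomode, a layout type equal to loga_layout_type, a start no higher than loga_offset, and a summed length of at least loga_minlength) can be sketched as a client-side sanity check. This is only an illustrative sketch: the Layout class, the function name, and the problem strings are hypothetical and not part of the protocol, and the length check deliberately ignores the end-of-file exception for read layouts.

```python
from dataclasses import dataclass

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF  # "through end-of-file" marker for lengths


@dataclass
class Layout:
    """Hypothetical stand-in for one layout4 element of logr_layout."""
    lo_offset: int
    lo_length: int
    lo_iomode: int
    loc_type: int  # stands in for lo_content.loc_type


def check_layoutget_reply(loga_offset, loga_minlength, loga_layout_type, logr_layout):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if not logr_layout:
        return ["logr_layout is empty"]
    first = logr_layout[0]
    if first.lo_offset > loga_offset:
        problems.append("layout starts above loga_offset")
    if any(seg.lo_iomode != first.lo_iomode for seg in logr_layout):
        problems.append("lo_iomode differs between elements")
    if any(seg.loc_type != loga_layout_type for seg in logr_layout):
        problems.append("layout type differs from loga_layout_type")
    # Successive elements must be ascending with no gaps or overlaps.
    for prev, cur in zip(logr_layout, logr_layout[1:]):
        open_ended = prev.lo_length == NFS4_UINT64_MAX
        if open_ended or prev.lo_offset + prev.lo_length != cur.lo_offset:
            problems.append("gap or overlap before offset %d" % cur.lo_offset)
    # The length of the layout is the sum of the lo_length fields.
    unbounded = any(seg.lo_length == NFS4_UINT64_MAX for seg in logr_layout)
    if loga_minlength != NFS4_UINT64_MAX and not unbounded:
        total = sum(seg.lo_length for seg in logr_layout)
        if total < loga_minlength:
            # May still be legal for a read layout that reaches end-of-file.
            problems.append("total length below loga_minlength")
    return problems
```

A client that finds any problem could, as the text suggests for a layout type mismatch, treat the reply as if the metadata server had returned an error.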
If layouts are not supported for the requested file or its containing - file system the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If - the layout type is not supported, the metadata server should return - NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout - matches the client provided layout identification, the server should - return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is specified, or - a loga_iomode of LAYOUTIOMODE4_ANY is specified, the server should - return NFS4ERR_BADIOMODE. + file system the metadata server MUST return + NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the + metadata server MUST return NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts + are supported but no layout matches the client provided layout + identification, the metadata server MUST return NFS4ERR_BADLAYOUT. + If an invalid loga_iomode is specified, or a loga_iomode of + LAYOUTIOMODE4_ANY is specified, the metadata server MUST return + NFS4ERR_BADIOMODE. If the layout for the file is unavailable due to transient - conditions, e.g. file sharing prohibits layouts, the server MUST - return NFS4ERR_LAYOUTTRYLATER. + conditions, e.g. file sharing prohibits layouts, the metadata server + MUST return NFS4ERR_LAYOUTTRYLATER. If the layout request is rejected due to an overlapping layout - recall, the server MUST return NFS4ERR_RECALLCONFLICT. See + recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. See Section 12.5.5.2 for details. If the layout conflicts with a mandatory byte range lock held on the file, and if the storage devices have no method of enforcing mandatory locks, other than through the restriction of layouts, the - metadata server should return NFS4ERR_LOCKED. + metadata server SHOULD return NFS4ERR_LOCKED. If client sets loga_signal_layout_avail to TRUE, then it is registering with the client a "want" for a layout in the event the - layout cannot be obtained due to resource exhaustion. 
If the server - supports and will honor the "want", the results will have - logr_will_signal_layout_avail set to TRUE. If so the client should - expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a layout - is available. + layout cannot be obtained due to resource exhaustion. If the + metadata server supports and will honor the "want", the results will + have logr_will_signal_layout_avail set to TRUE. If so the client + should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a + layout is available. On success, the current filehandle retains its value and the current stateid is updated to match the value as returned in the results. 18.43.4. IMPLEMENTATION Typically, LAYOUTGET will be called as part of a COMPOUND request after an OPEN operation and results in the client having location information for the file; this requires that loga_stateid be set to - the special stateid that tells the server to use the current stateid, - which is set by OPEN (see Section 16.2.3.1.2) . A client may also - hold a layout across multiple OPENs. The client specifies a layout - type that limits what kind of layout the server will return. This - prevents servers from issuing layouts that are unusable by the - client. + the special stateid that tells the metadata server to use the current + stateid, which is set by OPEN (see Section 16.2.3.1.2) . A client + may also hold a layout across multiple OPENs. The client specifies a + layout type that limits what kind of layout the metadata server will + return. This prevents metadata servers from granting layouts that + are unusable by the client. Once the client has obtained a layout referring to a particular - device ID, the server MUST NOT delete the device ID until the layout - is returned or revoked. + device ID, the metadata server MUST NOT delete the device ID until + the layout is returned or revoked. CB_NOTIFY_DEVICEID can race with LAYOUTGET. 
One race scenario is that LAYOUTGET returns a device ID the client does not have device - address mappings for, and the server sends a CB_NOTIFY_DEVICEID to - add the device ID to the client's awareness and meanwhile the client - sends GETDEVICEINFO on the device ID. This scenario is discussed in - Section 18.40.4. Another scenario is that the CB_NOTIFY_DEVICEID is - processed by the client before it processes the results from - LAYOUTGET. The client will send a GETDEVICEINFO on the device ID. - If the results from GETDEVICEINFO are received before the client gets - results from LAYOUTGET, then there is no longer a race. If the - results from LAYOUTGET are received before the results from - GETDEVICEINFO, the client can either wait for results of + address mappings for, and the metadata server sends a + CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and + meanwhile the client sends GETDEVICEINFO on the device ID. This + scenario is discussed in Section 18.40.4. Another scenario is that + the CB_NOTIFY_DEVICEID is processed by the client before it processes + the results from LAYOUTGET. The client will send a GETDEVICEINFO on + the device ID. If the results from GETDEVICEINFO are received before + the client gets results from LAYOUTGET, then there is no longer a + race. If the results from LAYOUTGET are received before the results + from GETDEVICEINFO, the client can either wait for results of GETDEVICEINFO, or send another one to get possibly more up to date device address mappings for the device ID. 18.44. Operation 51: LAYOUTRETURN - Release Layout Information 18.44.1. ARGUMENT /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ const LAYOUT4_RET_REC_FILE = 1; const LAYOUT4_RET_REC_FSID = 2; @@ -25386,24 +25438,25 @@ 19.1.1. ARGUMENTS void; 19.1.2. RESULTS void; 19.1.3. DESCRIPTION - Standard NULL procedure. Void argument, void response.
Even though - there is no direct functionality associated with this procedure, the - server will use CB_NULL to confirm the existence of a path for RPCs - from server to client. + CB_NULL is the standard ONC RPC NULL procedure, with the standard + void argument, and void response. Even though there is no direct + functionality associated with this procedure, the server will use + CB_NULL to confirm the existence of a path for RPCs from the server + to client. 19.1.4. ERRORS None. 19.2. Procedure 1: CB_COMPOUND - Compound Operations 19.2.1. ARGUMENTS enum nfs_cb_opnum4 { @@ -25508,37 +25561,37 @@ nfs_cb_resop4 resarray<>; }; 19.2.3. DESCRIPTION The CB_COMPOUND procedure is used to combine one or more of the callback procedures into a single RPC request. The main callback RPC program has two main procedures: CB_NULL and CB_COMPOUND. All other operations use the CB_COMPOUND procedure as a wrapper. - In the processing of the CB_COMPOUND procedure, the client may find - that it does not have the available resources to execute any or all - of the operations within the CB_COMPOUND sequence. This is discussed - in Section 2.10.5.4. + During the processing of the CB_COMPOUND procedure, the client may + find that it does not have the available resources to execute any or + all of the operations within the CB_COMPOUND sequence. Refer to + Section 2.10.5.4 for details. The minorversion field of the arguments MUST be the same as the minorversion of the COMPOUND procedure used to create the client ID and session. For NFSv4.1, minorversion MUST be set to 1. Contained within the CB_COMPOUND results is a 'status' field. This status must be equivalent to the status of the last operation that was executed within the CB_COMPOUND procedure. Therefore, if an operation incurred an error then the 'status' value will be the same error value as is being returned for the operation that failed.
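The relationship between the CB_COMPOUND 'status' field and the per-operation results can be sketched as follows. This is a minimal illustration, assuming that evaluation stops at the first operation that does not return NFS4_OK (as for the forward channel COMPOUND procedure); the function name and the use of plain callables for operations are hypothetical.

```python
NFS4_OK = 0
NFS4ERR_DELAY = 10008


def run_cb_compound(operations):
    """Run callables that each return an nfsstat4 value, in order.

    Returns (status, resarray), where status equals the status of the
    last operation that was actually executed.
    """
    resarray = []
    status = NFS4_OK
    for op in operations:
        status = op()
        resarray.append(status)
        if status != NFS4_OK:
            break  # assumption: later operations are not executed
    return status, resarray
```

With three operations where the second returns NFS4ERR_DELAY, the overall status is NFS4ERR_DELAY and resarray holds two entries.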
- For a description of the "tag" field, see Section 16.2.3 where the - corresponding forward channel procedure is described. + The "tag" field is handled the same way as that of COMPOUND procedure + (see Section 16.2.3). Illegal operation codes are handled in the same way as they are handled for the COMPOUND procedure. 19.2.4. IMPLEMENTATION The CB_COMPOUND procedure is used to combine individual operations into a single RPC request. The client interprets each of the operations in turn. If an operation is executed by the client and the status of that operation is NFS4_OK, then the next operation in @@ -25602,67 +25655,67 @@ 20.1.3. DESCRIPTION The CB_GETATTR operation is used by the server to obtain the current modified state of a file that has been write delegated. The attributes size and change are the only ones guaranteed to be serviced by the client. See Section 10.4.3 for a full description of how the client and server are to interact with the use of CB_GETATTR. If the filehandle specified is not one for which the client holds a - write open delegation, an NFS4ERR_BADHANDLE error is returned. + write delegation, an NFS4ERR_BADHANDLE error is returned. 20.1.4. IMPLEMENTATION The client returns attrmask bits and the associated attribute values only for the change attribute, and attributes that it may change (time_modify, and size). -20.2. Operation 4: CB_RECALL - Recall an Open Delegation +20.2. Operation 4: CB_RECALL - Recall a Delegation 20.2.1. ARGUMENT struct CB_RECALL4args { stateid4 stateid; bool truncate; nfs_fh4 fh; }; 20.2.2. RESULT struct CB_RECALL4res { nfsstat4 status; }; 20.2.3. DESCRIPTION - The CB_RECALL operation is used to begin the process of recalling an - open delegation and returning it to the server. + The CB_RECALL operation is used to begin the process of recalling a + delegation and returning it to the server. - The truncate flag is used to optimize recall for a file which is - about to be truncated to zero. 
When it is set, the client is freed - of obligation to propagate modified data for the file to the server, - since this data is irrelevant. + The truncate flag is used to optimize recall for a file object which + is a regular file and is about to be truncated to zero. When it is + TRUE, the client is freed of the obligation to propagate modified + data for the file to the server, since this data is irrelevant. - If the handle specified is not one for which the client holds an open + If the handle specified is not one for which the client holds a delegation, an NFS4ERR_BADHANDLE error is returned. If the stateid specified is not one corresponding to an open delegation for the file specified by the filehandle, an NFS4ERR_BAD_STATEID is returned. 20.2.4. IMPLEMENTATION - The client should reply to the callback immediately. Replying does - not complete the recall except when an error was returned. The - recall is not complete until the delegation is returned using a - DELEGRETURN. + The client SHOULD reply to the callback immediately. Replying does + not complete the recall except when an error other than + NFS4ERR_DELAY is returned. The recall is not complete until the + delegation is returned using a DELEGRETURN operation. 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from Client 20.3.1. ARGUMENT /* * NFSv4.1 callback arguments and results */ enum layoutrecall_type4 { @@ -25697,45 +25750,45 @@ 20.3.2. RESULT struct CB_LAYOUTRECALL4res { nfsstat4 clorr_status; }; 20.3.3. DESCRIPTION The CB_LAYOUTRECALL operation is used by the server to recall layouts from the client; as a result, the client will begin the process of - returning layouts with LAYOUTRETURN. The CB_LAYOUTRECALL operation + returning layouts via LAYOUTRETURN. The CB_LAYOUTRECALL operation specifies one of three forms of recall processing with the value of layoutrecall_type4. The recall is either for a specific layout (by file), for an entire file system (FSID), or for all file systems (ALL).
The behavior of the operation varies based on the value of the layoutrecall_type4. The value and behaviors are: LAYOUTRECALL4_FILE - For a layout to match the recall request, the following fields - must match in value with the layout: clora_type, clora_iomode, - lor_fh, and the byte range specified by lor_offset, and - lor_length. The clora_iomode field may have a special value of - LAYOUTIOMODE4_ANY. The LAYOUTIOMODE4_ANY will match any value - originally returned in a layout; therefore it acts as a wild card - for iomode. The other special value used is for lor_length. If - lor_length has a value of NFS4_MAXFILELEN, the lor_length field - means the maximum possible file size. If a matching layout is - found, it MUST be returned using the LAYOUTRETURN operation, see - Section 18.44. An example of the field's special value use is if - clora_iomode is LAYOUTIOMODE4_ANY, lor_offset is zero, and - lor_length is NFS4_MAXFILELEN, then the entire layout is to be - returned. + For a layout to match the recall request, the values of the + following fields must match those of the layout: clora_type, + clora_iomode, lor_fh, and the byte range specified by lor_offset + and lor_length. The clora_iomode field may have a special value + of LAYOUTIOMODE4_ANY. The special value LAYOUTIOMODE4_ANY will + match any iomode originally returned in a layout; therefore it + acts as a wild card. The other special value used is for + lor_length. If lor_length has a value of NFS4_UINT64_MAX, the + lor_length field means the maximum possible file size. If a + matching layout is found, it MUST be returned using the + LAYOUTRETURN operation, see Section 18.44. An example of the + field's special value use is if clora_iomode is LAYOUTIOMODE4_ANY, + lor_offset is zero, and lor_length is NFS4_UINT64_MAX, then the + entire layout is to be returned. 
The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the client does not hold layouts for the file or if the client does not have any overlapping layouts for the specification in the layout recall. LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL If LAYOUTRECALL4_FSID is specified, the fsid specifies the file system for which any outstanding layouts MUST be returned. If @@ -25746,65 +25799,64 @@ respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL acknowledges to the server that the client invalidated the said device mappings. See Section 12.5.5.2.1.5 for considerations with "bulk" recall of layouts. The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the client does not hold layouts and does not have valid deviceid mappings. In processing the layout recall request, the client also varies its - behavior on the value of the clora_changed field. This field is used - by the server to provide additional context for the reason why the - layout is being recalled. A FALSE value for clora_changed indicates - that no change in the layout is expected and the client may write - modified data to the storage devices involved; this must be done - prior to returning the layout via LAYOUTRETURN. A TRUE value for - clora_changed indicates that the server is changing the layout. - Examples of layout changes and reasons for a TRUE indication are: + behavior based on the value of the clora_changed field. This field + is used by the server to provide additional context for the reason + why the layout is being recalled. A FALSE value for clora_changed + indicates that no change in the layout is expected and the client may + write modified data to the storage devices involved; this must be + done prior to returning the layout via LAYOUTRETURN. A TRUE value + for clora_changed indicates that the server is changing the layout. 
+ Examples of layout changes and reasons for a TRUE indication are: the metadata server is restriping the file or a permanent error has occurred on a storage device and the metadata server would like to provide a new layout for the file. Therefore, a clora_changed value of TRUE indicates some level of change for the layout and the client SHOULD NOT write and commit modified data to the storage devices. In this case, the client writes and commits data through the metadata server. See Section 12.5.3 for a description of how the lor_stateid field in the arguments is to be constructed. Note that the "seqid" field of lor_stateid MUST NOT be zero. See Section 8.2, Section 12.5.3, and Section 12.5.5.2 for a further discussion and requirements. 20.3.4. IMPLEMENTATION The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL - (recall of file delegations) in that straightforward processing of - the layout recall done and the client responds to the request before - actually returning layouts with the LAYOUTRETURN operation. While - the client responds to the CB_LAYOUTRECALL immediately, the operation - is not considered complete (i.e. considered pending) until all - affected layouts are returned to the server with the LAYOUTRETURN - operation. + (recall of file delegations) in that the client responds to the + request before actually returning layouts via the LAYOUTRETURN + operation. While the client responds to the CB_LAYOUTRECALL + immediately, the operation is not considered complete (i.e. + considered pending) until all affected layouts are returned to the + server via the LAYOUTRETURN operation. - Before returning the layout to the server with LAYOUTRETURN, the + Before returning the layout to the server via LAYOUTRETURN, the client should wait for the response from in-process or in-flight READ, WRITE, or COMMIT operations that use the recalled layout. 
- If the client is holding modified data which is effected by a + If the client is holding modified data which is affected by a recalled layout, the client has various options for writing the data to the server. As always, the client may write the data through the metadata server. In fact, the client may not have a choice other than writing to the metadata server when the clora_changed argument is TRUE and a new layout is unavailable from the server. However, the client may be able to write the modified data to the storage device if the clora_changed argument is FALSE; this needs to be done - before returning the layout with LAYOUTRETURN. If the client were to + before returning the layout via LAYOUTRETURN. If the client were to obtain a new layout covering the modified data's range, then writing to the storage devices is an available alternative. Note that before obtaining a new layout, the client must first return the original layout. In the case of modified data being written while the layout is held, the client must use LAYOUTCOMMIT operations at the appropriate time; as required LAYOUTCOMMIT must be done before the LAYOUTRETURN. If a large amount of modified data is outstanding, the client may send LAYOUTRETURNs for portions of the recalled layout; this allows the @@ -25912,57 +25964,58 @@ to clients about changes to delegated directories The registration of notifications for the directories occurs when the delegation is established using GET_DIR_DELEGATION. These notifications are sent over the backchannel. The notification is sent once the original request has been processed on the server. The server will send an array of notifications for changes that might have occurred in the directory. The notifications are sent as list of pairs of bitmaps and values. See Section 3.3.7 for a description of how NFSv4.1 bitmaps work. 
- If the server has more notifications then can fit in the CB_COMPOUND + If the server has more notifications than can fit in the CB_COMPOUND request, it SHOULD send a sequence of serial CB_COMPOUND requests so that the client's view of the directory does not become confused. E.g. If the server indicates a file named "foo" is added, and that - the file "foo" is removed, the order it which the client receives - these notifications are processed needs to be the same as the order - in which corresponding operations occurred on the server. + the file "foo" is removed, the order in which the client receives + these notifications needs to be the same as the order in which + corresponding operations occurred on the server. If the client holding the delegation makes any changes in the directory that cause files or sub directories to be added or removed, the server will notify that client of the resulting change(s). If the client holding the delegation is making attribute or cookie verifier changes only, the server does not need to send notifications to that client. The server will send the following information for each operation: NOTIFY4_ADD_ENTRY The server will send information about the new directory entry being created along with the cookie for that entry. The entry information (data type notify_add4) includes the component name of the entry and attributes. The server will send this type of entry when a file is actually being created, when an entry is being added to a directory as a result of a rename across directories (see below), and when a hard link is being created to an existing file. If this entry is added to the end of the directory, the - server will set the nad_last_entry flag to true. If the file is + server will set the nad_last_entry flag to TRUE. If the file is added such that there is at least one entry before it, the server will also return the previous entry information (nad_prev_entry, a variable length array of up to one element. 
If the array is of zero length, there is no previous entry), along with its cookie. - This is to help clients find the right location in their DNLC or - directory caches where this entry should be cached. If the new - entry's cookie is available, it will be in nad_new_entry_cookie - (another variable length array of up to one element). If the - addition of the entry causes another entry to be deleted (which - can only happen in the rename case) atomically with the addition, - then information on this entry is reported in nad_old_entry. + This is to help clients find the right location in their file name + caches and directory caches where this entry should be cached. If + the new entry's cookie is available, it will be in the + nad_new_entry_cookie (another variable length array of up to one + element) field. If the addition of the entry causes another entry + to be deleted (which can only happen in the rename case) + atomically with the addition, then information on this entry is + reported in nad_old_entry. NOTIFY4_REMOVE_ENTRY The server will send information about the directory entry being deleted. The server will also send the cookie value for the deleted entry so that clients can get to the cached information for this entry. NOTIFY4_RENAME_ENTRY The server will send information about both the old entry and the new entry. This includes name and attributes for each entry. In @@ -26013,57 +26066,56 @@ 20.5.2. RESULT struct CB_PUSH_DELEG4res { nfsstat4 cpdr_status; }; 20.5.3. DESCRIPTION CB_PUSH_DELEG is used by the server to both signal to the client that - the delegation it wants is available and to simultaneously offer the - delegation to the client. The client has the choice of accepting the - delegation by returning NFS4_OK to the server, delaying the decision - to accept the offered delegation by returning NFS4ERR_DELAY or - permanently rejecting the offer of the delegation by returning - NFS4ERR_REJECT_DELEG. 
When a delegation is rejected in this fashion, - the want previously established is permanently deleted. - - The server MUST send in cpda_delegation a delegation which satisfies - a request made in an OPEN or WANT_DELEGATION operation. + the delegation it wants (previously indicated via a want established + from an OPEN or WANT_DELEGATION operation) is available and to + simultaneously offer the delegation to the client. The client has + the choice of accepting the delegation by returning NFS4_OK to the + server, delaying the decision to accept the offered delegation by + returning NFS4ERR_DELAY or permanently rejecting the offer of the + delegation by returning NFS4ERR_REJECT_DELEG. When a delegation is + rejected in this fashion, the want previously established is + permanently deleted and the delegation is subject to acquisition by + another client. 20.5.4. IMPLEMENTATION If the client does return NFS4ERR_DELAY and there is a conflicting delegation request, the server MAY process it at the expense of the client that returned NFS4ERR_DELAY. The client's want will typically not be cancelled, but MAY be processed behind other delegation requests or registered wants. When a client returns a status other than NFS4_OK, NFS4ERR_DELAY, or NFS4ERR_REJECT_DELEG, the want remains pending, although servers may decide to cancel the want by sending a CB_WANTS_CANCELLED. 20.6. Operation 8: CB_RECALL_ANY - Keep any N delegations - Notify client to return delegation and keep N of them. + Notify client to return all but N delegations. 20.6.1.
ARGUMENT const RCA4_TYPE_MASK_RDATA_DLG = 0; const RCA4_TYPE_MASK_WDATA_DLG = 1; const RCA4_TYPE_MASK_DIR_DLG = 2; const RCA4_TYPE_MASK_FILE_LAYOUT = 3; - const RCA4_TYPE_MASK_BLK_LAYOUT_MIN = 4; - const RCA4_TYPE_MASK_BLK_LAYOUT_MAX = 7; + const RCA4_TYPE_MASK_BLK_LAYOUT = 4; const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; - const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 11; + const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN = 12; const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX = 15; struct CB_RECALL_ANY4args { uint32_t craa_objects_to_keep; bitmap4 craa_type_mask; }; 20.6.2. RESULT @@ -26097,37 +26149,67 @@ resource pools for layouts and for delegations, or further separate resources by types of delegations. When a given resource pool is over-utilized, the server can send a CB_RECALL_ANY to clients holding recallable objects of the types involved, allowing it to keep a certain number of such objects and return any excess. A mask specifies which types of objects are to be limited. The client chooses, based on its own knowledge of current usefulness, which of the objects in that class should be returned. - For NFSv4.1, a number of bits are defined. For some of these, ranges - are defined and it is up to the definition of the storage protocol to - specify how these are to be used. There are ranges for blocks-based - storage protocols, for object-based storage protocols and a reserved - range for other experimental storage protocols. The RFC defining - such a storage protocol needs to specify how particular bits within - its range are to be used. For example, it may specify a mapping - between attributes of the layout (read vs. write, size of area) and - the bit to be used or it may define a field in the layout where the - associated bit position is made available by the server to the - client. + A number of bits are defined. For some of these, ranges are defined + and it is up to the definition of the storage protocol to specify how + these are to be used. 
There are ranges reserved for object-based + storage protocols and for other experimental storage protocols. The + RFC defining such a storage protocol needs to specify how particular + bits within its range are to be used. For example, it may specify a + mapping between attributes of the layout (read vs. write, size of + area) and the bit to be used or it may define a field in the layout + where the associated bit position is made available by the server to + the client. - When an undefined bit is set in the type mask, NFS4ERR_INVAL should - be returned. If a client does not support an object of the specified - type, if the bit is defined, NFS4ERR_INVAL should not be returned. - Future minor versions of NFSv4 may expand the set of valid type mask - bits. + RCA4_TYPE_MASK_RDATA_DLG + + The client is to return read delegations on non-directory file + objects. + + RCA4_TYPE_MASK_WDATA_DLG + + The client is to return write delegations on regular file objects. + + RCA4_TYPE_MASK_DIR_DLG + + The client is to return directory delegations. + + RCA4_TYPE_MASK_FILE_LAYOUT + + The client is to return layouts of type LAYOUT4_NFSV4_1_FILES. + + RCA4_TYPE_MASK_BLK_LAYOUT + + See [31] for a description. + + RCA4_TYPE_MASK_OBJ_LAYOUT_MIN to RCA4_TYPE_MASK_OBJ_LAYOUT_MAX + + See [30] for a description. + + RCA4_TYPE_MASK_OTHER_LAYOUT_MIN to RCA4_TYPE_MASK_OTHER_LAYOUT_MAX + + This range is reserved for telling the client to recall layouts of + experimental or site specific layout types (see Section 3.3.13). + + When a bit is set in the type mask that corresponds to an undefined + type of recallable object, NFS4ERR_INVAL MUST be returned. When a + bit is set that corresponds to a defined type of object, but the + client does not support an object of the type, NFS4ERR_INVAL MUST NOT + be returned. Future minor versions of NFSv4 may expand the set of + valid type mask bits. 
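The type mask handling on the '+' side can be sketched as below: a bit outside the defined set yields NFS4ERR_INVAL, while a defined bit the client does not support is simply ignored. This is an illustrative sketch only. The bit numbers come from the '+' side constants above; the helper names, the representation of bitmap4 as a list of 32-bit words, and the supported_bits parameter are assumptions made for the example.

```python
NFS4_OK = 0
NFS4ERR_INVAL = 22

# Bits 0-4, the object layout range 8-9, and the "other" range 12-15.
DEFINED_BITS = {0, 1, 2, 3, 4} | set(range(8, 10)) | set(range(12, 16))


def set_bits(bitmap_words):
    """Yield the bit numbers set in a bitmap4 given as a list of 32-bit words."""
    for word_index, word in enumerate(bitmap_words):
        for bit in range(32):
            if word & (1 << bit):
                yield word_index * 32 + bit


def check_type_mask(craa_type_mask, supported_bits):
    """Return (status, bits the client should act on)."""
    requested = set(set_bits(craa_type_mask))
    if requested - DEFINED_BITS:
        return NFS4ERR_INVAL, set()  # undefined type of recallable object
    # Defined but unsupported types are not an error; they are skipped.
    return NFS4_OK, requested & set(supported_bits)
```

For example, a client that supports only read and write data delegations (bits 0 and 1) and receives a mask with bits 0, 1, and 3 set would act on bits 0 and 1 and return NFS4_OK.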
CB_RECALL_ANY specifies a count of objects that the client may keep as opposed to a count that the client must return. This is to avoid potential race between a CB_RECALL_ANY that had a count of objects to free with a set of client-originated operations to return layouts or delegations. As a result of the race, the client and server would have differing ideas as to how many objects to return. Hence the client could mistakenly free too many. If resource demands prompt it, the server may send another @@ -26185,26 +26267,38 @@ nfsstat4 croa_status; }; 20.7.3. DESCRIPTION CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client that the server has resources to grant recallable objects that might previously have been denied by OPEN, WANT_DELEGATION, GET_DIR_DELEG, or LAYOUTGET. - The argument, objects_to_keep means the total number of recallable - objects of the types indicated in the argument type_mask that the - server believes it can allow the client to have, including the number - of such objects the client already has. A client that tries to - acquire more recallable objects than the server informs it can have - runs the risk of having objects recalled. + The argument craa_objects_to_keep means the total number of + recallable objects of the types indicated in the argument type_mask + that the server believes it can allow the client to have, including + the number of such objects the client already has. A client that + tries to acquire more recallable objects than the server informs it + can have runs the risk of having objects recalled. + + The server is not obligated to reserve the difference between the + number of the objects the client currently has and the value of + craa_objects_to_keep, nor does delaying the reply to + CB_RECALLABLE_OBJ_AVAIL prevent the server from using the resources + of the recallable objects for another purpose. 
Indeed, if a client + responds slowly to CB_RECALLABLE_OBJ_AVAIL, the server might + interpret the client as having reduced capability to manage + recallable objects, and so cancel or reduce any reservation it is + maintaining. Thus if the client desires to acquire more recallable + objects, it needs to reply quickly to CB_RECALLABLE_OBJ_AVAIL, and + then send the appropriate operations to acquire recallable objects. 20.8. Operation 10: CB_RECALL_SLOT - change flow control limits Change flow control limits 20.8.1. ARGUMENT struct CB_RECALL_SLOT4args { slotid4 rsa_target_highest_slotid; }; @@ -26212,24 +26306,25 @@ 20.8.2. RESULT struct CB_RECALL_SLOT4res { nfsstat4 rsr_status; }; 20.8.3. DESCRIPTION The CB_RECALL_SLOT operation requests the client to return session slots, and if applicable, transport credits (e.g. RDMA credits for - connections associated with the operations channel) to the server. - CB_RECALL_SLOT specifies rsa_target_highest_slotid, the target - highest_slot the server wants for the session. The client, should - then work toward reducing the highest_slot to the target. + connections associated with the operations channel) of the session's + fore channel. CB_RECALL_SLOT specifies rsa_target_highest_slotid, + the value of the target highest slot id the server wants for the + session. The client should then work toward reducing the session's + highest slot id to the target value. If the session has only non-RDMA connections associated with its operations channel, then the client need only wait for all outstanding requests with a slotid > rsa_target_highest_slotid to complete, then send a single COMPOUND consisting of a single SEQUENCE operation, with the sa_highestslot field set to rsa_target_highest_slotid. If there are RDMA-based connections associated with operation channel, then the client needs to also send enough zero-length RDMA Sends to take the total RDMA credit count to rsa_target_highest_slotid + 1 or below. 
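The client-side steps described for CB_RECALL_SLOT can be sketched as a small planning function. This is an informal illustration, not text from either draft. The name plan_slot_recall, its parameters, and the returned dictionary keys are hypothetical, and the sketch assumes that each zero-length RDMA Send lowers the total credit count by exactly one, which the draft text does not state.

```python
def plan_slot_recall(outstanding_slotids, target_highest_slotid, rdma_credits=None):
    """Hypothetical helper: decide what a client does after CB_RECALL_SLOT.

    outstanding_slotids   -- slot ids with requests still in flight
    target_highest_slotid -- rsa_target_highest_slotid from the callback
    rdma_credits          -- current total RDMA credit count, or None when
                             the session has only non-RDMA connections
    """
    # Requests on slots above the target must complete before the
    # client sends the single SEQUENCE that lowers sa_highestslot.
    wait_for = sorted(s for s in outstanding_slotids if s > target_highest_slotid)

    zero_length_sends = 0
    if rdma_credits is not None:
        # Assumption: one zero-length Send reduces the credit count by one.
        # The goal is a credit count of target + 1 or below.
        zero_length_sends = max(0, rdma_credits - (target_highest_slotid + 1))

    return {
        "wait_for_slots": wait_for,
        "sa_highestslot": target_highest_slotid,
        "zero_length_rdma_sends": zero_length_sends,
    }
```

For a non-RDMA session with requests outstanding on slots 0, 3, 7 and 9 and a target of 4, the plan is to wait for slots 7 and 9 and then send SEQUENCE with sa_highestslot set to 4.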
@@ -26285,42 +26380,42 @@ case NFS4_OK: CB_SEQUENCE4resok csr_resok4; default: void; }; 20.9.3. DESCRIPTION The CB_SEQUENCE operation is used to manage operational accounting for the backchannel of the session on which a request is sent. The - contents include the session to which this request belongs, slot id - and sequence id used by the server to implement session request - control and exactly once semantics, and exchanged slot maximums which - are used to adjust the size of the reply cache. This operation MUST - appear once as the first operation in each CB_COMPOUND request or a - protocol error must result. See Section 18.46.3 for a description of - how slots are processed. + contents include the session id to which this request belongs, the + slot id and sequence id used by the server to implement session + request control and exactly once semantics, and exchanged slot id + maxima which are used to adjust the size of the reply cache. This + operation will appear once as the first operation in each CB_COMPOUND + request or a protocol error MUST result. See Section 18.46.3 for a + description of how slots are processed. If csa_cachethis is TRUE, then the server is requesting that the client cache the reply in the callback reply cache. The client MUST cache the reply (see Section 2.10.5.1.3). The csa_referring_call_lists array is the list of COMPOUND requests, identified by sessionid, slot id and sequencid. These are requests that the client previously sent to the server. These previous requests created state that some operation(s) in the same CB_COMPOUND - as the csa_referring_call_lists is identifying. A sessionid is + as the csa_referring_call_lists are identifying. A session id is included because leased state is tied to a client ID, and a client ID can have multiple sessions. See Section 2.10.5.3. - The value of csa_sequenceid argument relative to the cached sequence - id on the slot falls into one of three cases. 
+ The value of the csa_sequenceid argument relative to the cached + sequence id on the slot falls into one of three cases. o If the difference between csa_sequenceid and the client's cached sequence id at the slot id is two (2) or more, or if csa_sequenceid is less than the cached sequence id (accounting for wraparound of the unsigned sequence id value), then the client MUST return NFS4ERR_SEQ_MISORDERED. o If csa_sequenceid and the cached sequence id are the same, this is a retry, and the client returns the CB_COMPOUND request's cached reply. @@ -26343,22 +26438,20 @@ id, cached reply) MUST NOT change. The client returns two "highest_slotid" values: csr_highest_slotid, and csr_target_highest_slotid. The former is the highest slot id the client will accept in a future CB_SEQUENCE operation, and SHOULD NOT be less than the value of csa_highest_slotid (but see Section 2.10.5.1 for an exception). The latter is the highest slot id the client would prefer the server use on a future CB_SEQUENCE operation. -20.9.4. IMPLEMENTATION - 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending Delegation Wants Retracts promise to signal delegation availability. 20.10.1. ARGUMENT struct CB_WANTS_CANCELLED4args { bool cwca_contended_wants_cancelled; bool cwca_resourced_wants_cancelled; @@ -26422,37 +26515,37 @@ The server can use this operation to indicate that a lock for the given file and lock-owner, previously requested by the client via an unsuccessful LOCK request, might be available. This callback is meant to be used by servers to help reduce the latency of blocking locks in the case where they recognize that a client which has been polling for a blocking lock may now be able to acquire the lock. If the server supports this callback for a given file, it MUST set the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when responding to successful opens for that file. 
This does not commit - the server to use of CB_NOTIFY_LOCK, but the client may use this as a - hint to decide how frequently to poll for locks derived from that - open. + the server to the use of CB_NOTIFY_LOCK, but the client may use this + as a hint to decide how frequently to poll for locks derived from + that open. If an OPEN operation results in an upgrade, in which the stateid returned has an "other" value matching that of a stateid already allocated, with a new "seqid" indicating a change in the lock being represented, then the value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when responding to that new OPEN controls handling from that point going forward. When parallel OPENs are done on the same file and open-owner, the ordering of the "seqid" field of the returned stateid (subject to wraparound) are to be used to select the controlling value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag. 20.11.4. IMPLEMENTATION - The server must not grant the lock to the client unless and until it + The server MUST NOT grant the lock to the client unless and until it receives an actual lock request from the client. Similarly, the client receiving this callback cannot assume that it now has the lock, or that a subsequent request for the lock will be successful. The server is not required to implement this callback, and even if it does, it is not required to use it in any particular case. Therefore the client must still rely on polling for blocking locks, as described in Section 9.6. Similarly, the client is not required to implement this callback, and @@ -26493,53 +26586,52 @@ 20.12.2. RESULT struct CB_NOTIFY_DEVICEID4res { nfsstat4 cndr_status; }; 20.12.3. DESCRIPTION The CB_NOTIFY_DEVICEID operation is used by the server to send notifications to clients about changes to pNFS device IDs. The - registration of device ID notifications occurs when the device - mapping stateid is established using GETDEVICEINFO or GETDEVICELIST. - These notifications are sent over the backchannel. 
The notification - is sent once the original request has been processed on the server. - The server will send an array of notifications, cnda_changes, as a - list of pairs of bitmaps and values. See Section 3.3.7 for a - description of how NFSv4.1 bitmaps work. + registration of device ID notifications is optional and is done via + GETDEVICEINFO. These notifications are sent over the backchannel + once the original request has been processed on the server. The + server will send an array of notifications, cnda_changes, as a list + of pairs of bitmaps and values. See Section 3.3.7 for a description + of how NFSv4.1 bitmaps work. As with CB_NOTIFY (Section 20.4.3), it is possible the server has more notifications than can fit in a CB_COMPOUND, thus requiring multiple CB_COMPOUNDs. Unlike CB_NOTIFY, serialization is not an issue because unlike directory entries, device IDs cannot be re-used after being deleted (Section 12.2.10). All device ID notifications contain a device ID and a layout type. The layout type is necessary because two different layout types can share the same device ID, and the common device ID can have completely different mappings for each layout type. The server will send the following notifications: NOTIFY_DEVICEID4_CHANGE A previously provided device ID to device address mapping has - changed and the client uses GETDEVICEINFO or GETDEVICELIST to - obtain the updated mapping. The notification is encoded in a - value of data type notify_deviceid_change4. This data type also - contains a boolean field, ndc_immediate, which if TRUE indicates - that the change will be enforced immediately, and so the client - might not be able to complete any pending I/O to the device ID. - If ndc_immediate is FALSE, then for an indefinite time, the client - can complete pending I/O. After pending I/O is complete, the - client SHOULD get the new device ID to device address mappings - before issuing new I/O to the device ID. 
+ changed and the client uses GETDEVICEINFO to obtain the updated + mapping. The notification is encoded in a value of data type + notify_deviceid_change4. This data type also contains a boolean + field, ndc_immediate, which if TRUE indicates that the change will + be enforced immediately, and so the client might not be able to + complete any pending I/O to the device ID. If ndc_immediate is + FALSE, then for an indefinite time, the client can complete + pending I/O. After pending I/O is complete, the client SHOULD get + the new device ID to device address mappings before issuing new + I/O to the device ID. NOTIFY4_DEVICEID_DELETE Deletes a device ID from the mappings. This notification MUST NOT be sent if the client has a layout that refers to the device ID. In other words if the server is sending a delete device ID notification, one of the following is true for layouts associated with the layout type: * The client never had a layout referring to that device ID. @@ -26564,34 +26656,34 @@ /* * CB_ILLEGAL: Response for illegal operation numbers */ struct CB_ILLEGAL4res { nfsstat4 status; }; 20.13.3. DESCRIPTION This operation is a placeholder for encoding a result to handle the - case of the client sending an operation code within COMPOUND that is - not defined in the NFSv4.1 specification. See Section 16.2.3 for + case of the server sending an operation code within CB_COMPOUND that + is not defined in the NFSv4.1 specification. See Section 19.2.3 for more details. The status field of CB_ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL. 20.13.4. IMPLEMENTATION A server will probably not send an operation with code OP_CB_ILLEGAL but if it does, the response will be CB_ILLEGAL4res just as it would be with any other invalid operation code. Note that if the client gets an illegal operation code that is not OP_ILLEGAL, and if the client checks for legal operation codes during the XDR decode phase, - then the CB_ILLEGAL4res would not be returned. 
+ then an instance of data type CB_ILLEGAL4res will not be returned. 21. Security Considerations NFS has historically used a model where, from an authentication perspective, the client was the entire machine, or at least the source network address of the machine. The NFS server relied on the NFS client to make the proper authentication of the end-user. The NFS server in turn shared its files only to specific clients, as identified by the client's source network address. Given this model, the AUTH_SYS RPC security flavor simply identified the end-user using @@ -26924,26 +27016,26 @@ [27] Werme, R., "RPC XID Issues", USENIX Conference Proceedings , February 1996. [28] Nowicki, B., "NFS: Network File System Protocol specification", RFC 1094, March 1989. [29] Bhide, A., Elnozahy, E., and S. Morgan, "A Highly Available Network Server", USENIX Conference Proceedings , January 1991. [30] Halevy, B., Welch, B., and J. Zelenka, "Object-based pNFS - Operations", September 2007, . + Operations", April 2008, . [31] Black, D., Fridella, S., and J. Glasgow, "pNFS Block/Volume - Layout", November 2007, . + Layout", April 2008, . [32] Callaghan, B., "WebNFS Client Specification", RFC 2054, October 1996. [33] Callaghan, B., "WebNFS Server Specification", RFC 2055, October 1996. [34] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, June 1999.