NFS version 4 Working Group
INTERNET-DRAFT
Document: draft-ietf-nfsv4-rfc3010bis-02.txt
August 2002
Expires: February 2003

S. Shepler, Sun Microsystems, Inc.
C. Beame, Hummingbird Ltd.
B. Callaghan, Sun Microsystems, Inc.
M. Eisler, Network Appliance, Inc.
D. Noveck, Network Appliance, Inc.
D. Robinson, Sun Microsystems, Inc.
R. Thurlow, Sun Microsystems, Inc.

NFS version 4 Protocol

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

NFS version 4 is a distributed filesystem protocol which owes heritage to NFS protocol versions 2 [RFC1094] and 3 [RFC1813]. Unlike earlier versions, the NFS version 4 protocol supports traditional file access while integrating support for file locking and the mount protocol. In addition, support for strong security (and its negotiation), compound operations, client caching, and internationalization has been added. Of course, attention has been applied to making NFS version 4 operate well in an Internet environment.

Copyright

Copyright (C) The Internet Society (2000-2002). All Rights Reserved.
Key Words

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

Table of Contents

1. Introduction
1.1. Inconsistencies of this Document with Section 18
1.2. Overview of NFS version 4 Features
1.2.1. RPC and Security
1.2.2. Procedure and Operation Structure
1.2.3. Filesystem Model
1.2.3.1. Filehandle Types
1.2.3.2. Attribute Types
1.2.3.3. Filesystem Replication and Migration
1.2.4. OPEN and CLOSE
1.2.5. File locking
1.2.6. Client Caching and Delegation
1.3. General Definitions
2. Protocol Data Types
2.1. Basic Data Types
2.2. Structured Data Types
3. RPC and Security Flavor
3.1. Ports and Transports
3.1.1. Client Retransmission Behavior
3.2. Security Flavors
3.2.1. Security mechanisms for NFS version 4
3.2.1.1. Kerberos V5 as a security triple
3.2.1.2. LIPKEY as a security triple
3.2.1.3. SPKM-3 as a security triple
3.3. Security Negotiation
3.3.1. SECINFO
3.3.2. Security Error
3.4. Callback RPC Authentication
4. Filehandles
4.1. Obtaining the First Filehandle
4.1.1. Root Filehandle
4.1.2. Public Filehandle
4.2. Filehandle Types
4.2.1. General Properties of a Filehandle
4.2.2. Persistent Filehandle
4.2.3. Volatile Filehandle
4.2.4. One Method of Constructing a Volatile Filehandle
4.3. Client Recovery from Filehandle Expiration
5. File Attributes
5.1. Mandatory Attributes
5.2. Recommended Attributes
5.3. Named Attributes
5.4. Classification of Attributes
5.5. Mandatory Attributes - Definitions
5.6. Recommended Attributes - Definitions
5.7. Time Access
5.8. Interpreting owner and owner_group
5.9. Character Case Attributes
5.10. Quota Attributes
5.11. Access Control Lists
5.11.1. ACE type
5.11.2. ACE Access Mask
5.11.3. ACE flag
5.11.4. ACE who
5.11.5. Mode Attribute
5.11.6. Mode and ACL Attribute
5.11.7. mounted_on_fileid
6. Filesystem Migration and Replication
6.1. Replication
6.2. Migration
6.3. Interpretation of the fs_locations Attribute
6.4. Filehandle Recovery for Migration or Replication
7. NFS Server Name Space
7.1. Server Exports
7.2. Browsing Exports
7.3. Server Pseudo Filesystem
7.4. Multiple Roots
7.5. Filehandle Volatility
7.6. Exported Root
7.7. Mount Point Crossing
7.8. Security Policy and Name Space Presentation
8. File Locking and Share Reservations
8.1. Locking
8.1.1. Client ID
8.1.2. Server Release of Clientid
8.1.3. lock_owner and stateid Definition
8.1.4. Use of the stateid and Locking
8.1.5. Sequencing of Lock Requests
8.1.6. Recovery from Replayed Requests
8.1.7. Releasing lock_owner State
8.1.8. Use of Open Confirmation
8.2. Lock Ranges
8.3. Upgrading and Downgrading Locks
8.4. Blocking Locks
8.5. Lease Renewal
8.6. Crash Recovery
8.6.1. Client Failure and Recovery
8.6.2. Server Failure and Recovery
8.6.3. Network Partitions and Recovery
8.7. Recovery from a Lock Request Timeout or Abort
8.8. Server Revocation of Locks
8.9. Share Reservations
8.10. OPEN/CLOSE Operations
8.10.1. Close and Retention of State Information
8.11. Open Upgrade and Downgrade
8.12. Short and Long Leases
8.13. Clocks, Propagation Delay, and Calculating Lease Expiration
8.14. Migration, Replication and State
8.14.1. Migration and State
8.14.2. Replication and State
8.14.3. Notification of Migrated Lease
8.14.4. Migration and the Lease_time Attribute
9. Client-Side Caching
9.1. Performance Challenges for Client-Side Caching
9.2. Delegation and Callbacks
9.2.1. Delegation Recovery
9.3. Data Caching
9.3.1. Data Caching and OPENs
9.3.2. Data Caching and File Locking
9.3.3. Data Caching and Mandatory File Locking
9.3.4. Data Caching and File Identity
9.4. Open Delegation
9.4.1. Open Delegation and Data Caching
9.4.2. Open Delegation and File Locks
9.4.3. Handling of CB_GETATTR
9.4.4. Recall of Open Delegation
9.4.5. Delegation Revocation
9.5. Data Caching and Revocation
9.5.1. Revocation Recovery for Write Open Delegation
9.6. Attribute Caching
9.7. Name Caching
9.8. Directory Caching
10. Minor Versioning
11. Internationalization
11.1. Universal Versus Local Character Sets
11.2. Overview of Universal Character Set Standards
11.3. Difficulties with UCS-4, UCS-2, Unicode
11.4. UTF-8 and its solutions
11.5. Normalization
11.6. UTF-8 Related Errors
12. Error Definitions
13. NFS version 4 Requests
13.1. Compound Procedure
13.2. Evaluation of a Compound Request
13.3. Synchronous Modifying Operations
13.4. Operation Values
14. NFS version 4 Procedures
14.1. Procedure 0: NULL - No Operation
14.2. Procedure 1: COMPOUND - Compound Operations
14.2.1. Operation 3: ACCESS - Check Access Rights
14.2.2. Operation 4: CLOSE - Close File
14.2.3. Operation 5: COMMIT - Commit Cached Data
14.2.4. Operation 6: CREATE - Create a Non-Regular File Object
14.2.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery
14.2.6. Operation 8: DELEGRETURN - Return Delegation
14.2.7. Operation 9: GETATTR - Get Attributes
14.2.8. Operation 10: GETFH - Get Current Filehandle
14.2.9. Operation 11: LINK - Create Link to a File
14.2.10. Operation 12: LOCK - Create Lock
14.2.11. Operation 13: LOCKT - Test For Lock
14.2.12. Operation 14: LOCKU - Unlock File
14.2.13. Operation 15: LOOKUP - Lookup Filename
14.2.14. Operation 16: LOOKUPP - Lookup Parent Directory
14.2.15. Operation 17: NVERIFY - Verify Difference in Attributes
14.2.16. Operation 18: OPEN - Open a Regular File
14.2.17. Operation 19: OPENATTR - Open Named Attribute Directory
14.2.18. Operation 20: OPEN_CONFIRM - Confirm Open
14.2.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
14.2.20. Operation 22: PUTFH - Set Current Filehandle
14.2.21. Operation 23: PUTPUBFH - Set Public Filehandle
14.2.22. Operation 24: PUTROOTFH - Set Root Filehandle
14.2.23. Operation 25: READ - Read from File
14.2.24. Operation 26: READDIR - Read Directory
14.2.25. Operation 27: READLINK - Read Symbolic Link
14.2.26. Operation 28: REMOVE - Remove Filesystem Object
14.2.27. Operation 29: RENAME - Rename Directory Entry
14.2.28. Operation 30: RENEW - Renew a Lease
14.2.29. Operation 31: RESTOREFH - Restore Saved Filehandle
14.2.30. Operation 32: SAVEFH - Save Current Filehandle
14.2.31. Operation 33: SECINFO - Obtain Available Security
14.2.32. Operation 34: SETATTR - Set Attributes
14.2.33. Operation 35: SETCLIENTID - Negotiate Clientid
14.2.34. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid
14.2.35. Operation 37: VERIFY - Verify Same Attributes
14.2.36. Operation 38: WRITE - Write to File
14.2.37. Operation 39: RELEASE_LOCKOWNER - Release Lockowner State
14.2.38. Operation 10044: ILLEGAL - Illegal operation
15. NFS version 4 Callback Procedures
15.1. Procedure 0: CB_NULL - No Operation
15.2. Procedure 1: CB_COMPOUND - Compound Operations
15.2.1. Operation 3: CB_GETATTR - Get Attributes
15.2.2. Operation 4: CB_RECALL - Recall an Open Delegation
15.2.3. Operation 10044: CB_ILLEGAL - Illegal Callback Operation
16. Security Considerations
17. IANA Considerations
17.1. Named Attribute Definition
17.2. ONC RPC Network Identifiers (netids)
18. RPC definition file
19. Bibliography
20. Authors
20.1. Editor's Address
20.2. Authors' Addresses
20.3. Acknowledgements
21. Full Copyright Statement

1. Introduction

The NFS version 4 protocol is a further revision of the NFS protocol defined already by versions 2 [RFC1094] and 3 [RFC1813]. It retains the essential characteristics of previous versions: design for easy recovery, independence from transport protocols, operating systems and filesystems, simplicity, and good performance. The NFS version 4 revision has the following goals:

o  Improved access and good performance on the Internet.

   The protocol is designed to transit firewalls easily, perform well where latency is high and bandwidth is low, and scale to very large numbers of clients per server.

o  Strong security with negotiation built into the protocol.

   The protocol builds on the work of the ONCRPC working group in supporting the RPCSEC_GSS protocol. Additionally, the NFS version 4 protocol provides a mechanism that allows clients and servers to negotiate security, and it requires clients and servers to support a minimal set of security schemes.

o  Good cross-platform interoperability.

   The protocol features a filesystem model that provides a useful, common set of features that does not unduly favor one filesystem or operating system over another.

o  Designed for protocol extensions.

   The protocol is designed to accept standard extensions that do not compromise backward compatibility.

1.1. Inconsistencies of this Document with Section 18

Section 18, RPC Definition File, contains the definitions in XDR description language of the constructs used by the protocol. Prior to Section 18, several of the constructs are reproduced for purposes of explanation. The reader is warned of the possibility of errors in the reproduced constructs outside of Section 18. For any part of the document that is inconsistent with Section 18, Section 18 is to be considered authoritative.

1.2. Overview of NFS version 4 Features

To provide a reasonable context for the reader, the major features of the NFS version 4 protocol will be reviewed in brief. This will be done to provide an appropriate context for both the reader who is familiar with the previous versions of the NFS protocol and the reader who is new to the NFS protocols. For the reader new to the NFS protocols, some fundamental knowledge is still expected. The reader should be familiar with the XDR and RPC protocols as described in [RFC1832] and [RFC1831], respectively. A basic knowledge of filesystems and distributed filesystems is expected as well.

1.2.1. RPC and Security

As with previous versions of NFS, the External Data Representation (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS version 4 protocol are those defined in [RFC1832] and [RFC1831], respectively. To meet end to end security requirements, the RPCSEC_GSS framework [RFC2203] will be used to extend the basic RPC security. With the use of RPCSEC_GSS, various mechanisms can be provided to offer authentication, integrity, and privacy to the NFS version 4 protocol. Kerberos V5 will be used as described in [RFC1964] to provide one security framework.
The LIPKEY GSS-API mechanism described in [RFC2847] will be used to provide for the use of user password and server public key by the NFS version 4 protocol. With the use of RPCSEC_GSS, other mechanisms may also be specified and used for NFS version 4 security.

To enable in-band security negotiation, the NFS version 4 protocol has added a new operation which provides the client a method of querying the server about its policies regarding which security mechanisms must be used for access to the server's filesystem resources. With this, the client can securely match the security mechanism that meets the policies specified at both the client and server.

1.2.2. Procedure and Operation Structure

A significant departure from the previous versions of the NFS protocol is the introduction of the COMPOUND procedure. For the NFS version 4 protocol, there are two RPC procedures, NULL and COMPOUND. The COMPOUND procedure is defined in terms of operations and these operations correspond more closely to the traditional NFS procedures. With the use of the COMPOUND procedure, the client is able to build simple or complex requests. These COMPOUND requests allow for a reduction in the number of RPCs needed for logical filesystem operations. For example, without previous contact with a server a client will be able to read data from a file in one request by combining LOOKUP, OPEN, and READ operations in a single COMPOUND RPC. With previous versions of the NFS protocol, this type of single request was not possible.

The model used for COMPOUND is very simple. There is no logical OR or ANDing of operations. The operations combined within a COMPOUND request are evaluated in order by the server. Once an operation returns a failing result, the evaluation ends and the results of all evaluated operations are returned to the client.
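
As an illustration only, the in-order evaluation of a COMPOUND request described above can be sketched as follows. The sketch is written in Python and is not part of the protocol definition: the Operation class, the evaluate_compound function, and the handler callables are hypothetical names introduced here. NFS4_OK and NFS4ERR_NOENT follow the naming of the nfsstat4 status codes, and the numeric values shown are assumptions that should be checked against Section 18.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Status codes named after nfsstat4 values. The numeric values are
# assumptions for this sketch; Section 18 is authoritative.
NFS4_OK = 0
NFS4ERR_NOENT = 2


@dataclass
class Operation:
    """Hypothetical stand-in for one operation inside a COMPOUND request."""
    name: str
    run: Callable[[], int]  # handler that returns a status code


def evaluate_compound(ops: List[Operation]) -> Tuple[int, List[Tuple[str, int]]]:
    """Evaluate operations in order and stop at the first failing result.

    Returns the status of the last operation evaluated together with the
    results of every operation that was evaluated. Operations that follow
    a failure are never run and produce no result.
    """
    results: List[Tuple[str, int]] = []
    status = NFS4_OK
    for op in ops:
        status = op.run()
        results.append((op.name, status))
        if status != NFS4_OK:
            break  # no logical OR/AND: evaluation simply ends here
    return status, results
```

For example, a request built from LOOKUP, OPEN, and READ handlers in which OPEN fails would yield results for LOOKUP and OPEN only; the READ handler would not be invoked.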
The NFS version 4 protocol continues to have the client refer to a file or directory at the server by a "filehandle". The COMPOUND procedure has a method of passing a filehandle from one operation to another within the sequence of operations. There is a concept of a "current filehandle" and "saved filehandle". Most operations use the "current filehandle" as the filesystem object to operate upon. The "saved filehandle" is used as temporary filehandle storage within a COMPOUND procedure as well as an additional operand for certain operations.

1.2.3. Filesystem Model

The general filesystem model used for the NFS version 4 protocol is the same as previous versions. The server filesystem is hierarchical with the regular files contained within being treated as opaque byte streams. In a slight departure, file and directory names are encoded with UTF-8 to deal with the basics of internationalization.

The NFS version 4 protocol does not require a separate protocol to provide for the initial mapping between path name and filehandle. Instead of using the older MOUNT protocol for this mapping, the server provides a ROOT filehandle that represents the logical root or top of the filesystem tree provided by the server. The server provides multiple filesystems by glueing them together with pseudo filesystems. These pseudo filesystems provide for potential gaps in the path names between real filesystems.

1.2.3.1. Filehandle Types

In previous versions of the NFS protocol, the filehandle provided by the server was guaranteed to be valid or persistent for the lifetime of the filesystem object to which it referred. For some server implementations, this persistence requirement has been difficult to meet.
For the NFS version 4 protocol, this requirement has been relaxed by introducing another type of filehandle, volatile. With persistent and volatile filehandle types, the server implementation can match the abilities of the filesystem at the server along with the operating environment. The client will have knowledge of the type of filehandle being provided by the server and can be prepared to deal with the semantics of each.

1.2.3.2. Attribute Types

The NFS version 4 protocol introduces three classes of filesystem or file attributes. Like the additional filehandle type, the classification of file attributes has been done to ease server implementations along with extending the overall functionality of the NFS protocol. This attribute model is structured to be extensible such that new attributes can be introduced in minor revisions of the protocol without requiring significant rework.

The three classifications are: mandatory, recommended and named attributes. This is a significant departure from the previous attribute model used in the NFS protocol. Previously, the attributes for the filesystem and file objects were a fixed set of mainly UNIX attributes. If the server or client did not support a particular attribute, it would have to simulate the attribute the best it could.

Mandatory attributes are the minimal set of file or filesystem attributes that must be provided by the server and must be properly represented by the server. Recommended attributes represent different filesystem types and operating environments. The recommended attributes will allow for better interoperability and the inclusion of more operating environments. The mandatory and recommended attribute sets are traditional file or filesystem attributes.
The third type of attribute is the named attribute. A named attribute is an opaque byte stream that is associated with a directory or file and referred to by a string name. Named attributes are meant to be used by client applications as a method to associate application specific data with a regular file or directory.

One significant addition to the recommended set of file attributes is the Access Control List (ACL) attribute. This attribute provides for directory and file access control beyond the model used in previous versions of the NFS protocol. The ACL definition allows for specification of user and group level access control.

1.2.3.3. Filesystem Replication and Migration

With the use of a special file attribute, the ability to migrate or replicate server filesystems is enabled within the protocol. The filesystem locations attribute provides a method for the client to probe the server about the location of a filesystem. In the event of a migration of a filesystem, the client will receive an error when operating on the filesystem and it can then query as to the new filesystem location. Similar steps are used for replication: the client is able to query the server for the multiple available locations of a particular filesystem. From this information, the client can use its own policies to access the appropriate filesystem location.

1.2.4. OPEN and CLOSE

The NFS version 4 protocol introduces OPEN and CLOSE operations. The OPEN operation provides a single point where file lookup, creation, and share semantics can be combined. The CLOSE operation also provides for the release of state accumulated by OPEN.

1.2.5. File locking

With the NFS version 4 protocol, the support for byte range file locking is part of the NFS protocol.
The file locking support is structured so that an RPC callback mechanism is not required. This is a departure from the previous versions of the NFS file locking protocol, Network Lock Manager (NLM). The state associated with file locks is maintained at the server under a lease-based model. The server defines a single lease period for all state held by an NFS client. If the client does not renew its lease within the defined period, all state associated with the client's lease may be released by the server. The client may renew its lease with use of the RENEW operation or implicitly by use of other operations (primarily READ).

1.2.6. Client Caching and Delegation

The file, attribute, and directory caching for the NFS version 4 protocol is similar to previous versions. Attributes and directory information are cached for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the related filesystem object has been updated.

For file data, the client checks its cache validity when the file is opened. A query is sent to the server to determine if the file has been changed. Based on this information, the client determines if the data cache for the file should be kept or released. Also, when the file is closed, any modified data is written to the server.

If an application wants to serialize access to file data, file locking of the file data ranges in question should be used.

The major addition to NFS version 4 in the area of caching is the ability of the server to delegate certain responsibilities to the client. When the server grants a delegation for a file to a client, the client is guaranteed certain semantics with respect to the sharing of that file with other clients. At OPEN, the server may provide the client either a read or write delegation for the file.
If the client is granted a read delegation, it is assured that no other client has the ability to write to the file for the duration of the delegation. If the client is granted a write delegation, the client is assured that no other client has read or write access to the file.

Delegations can be recalled by the server. If another client requests access to the file in such a way that the access conflicts with the granted delegation, the server is able to notify the initial client and recall the delegation. This requires that a callback path exist between the server and client. If this callback path does not exist, then delegations cannot be granted. The essence of a delegation is that it allows the client to locally service operations such as OPEN, CLOSE, LOCK, LOCKU, READ, and WRITE without immediate interaction with the server.

1.3. General Definitions

The following definitions are provided to give the reader an appropriate context.

Client
   The "client" is the entity that accesses the NFS server's resources. The client may be an application which contains the logic to access the NFS server directly. The client may also be the traditional operating system client that provides remote filesystem services for a set of applications.

   In the case of file locking the client is the entity that maintains a set of locks on behalf of one or more applications. This client is responsible for crash or failure recovery for those locks it manages.

   Note that multiple clients may share the same transport and multiple clients may exist on the same network node.
Clientid
   A 64-bit quantity used as a unique, short-hand reference to a client supplied Verifier and ID. The server is responsible for supplying the Clientid.
Lease
   An interval of time defined by the server for which the client is irrevocably granted a lock. At the end of a lease period the lock may be revoked if the lease has not been extended. The lock must be revoked if a conflicting lock has been granted after the lease interval.

   All leases granted by a server have the same fixed interval. Note that the fixed interval was chosen to alleviate the expense a server would have in maintaining state about variable length leases across server failures.
Lock
   The term "lock" is used to refer to both record (byte-range) locks as well as share reservations unless specifically stated otherwise.
Server
   The "Server" is the entity responsible for coordinating client access to a set of filesystems.
Stable Storage
   NFS version 4 servers must be able to recover without data loss from multiple power failures (including cascading power failures, that is, several power failures in quick succession), operating system failures, and hardware failure of components other than the storage medium itself (for example, disk, nonvolatile RAM).

   Some examples of stable storage that are allowable for an NFS server include:

   1. Media commit of data, that is, the modified data has been successfully written to the disk media, for example, the disk platter.

   2. An immediate reply disk drive with battery-backed on-drive intermediate storage or uninterruptible power system (UPS).

   3. Server commit of data with battery-backed intermediate storage and recovery software.

   4. Cache commit with uninterruptible power system (UPS) and recovery software.
Stateid  A 128-bit quantity returned by a server that uniquely defines the open and locking state provided by the server for a specific open or lock owner for a specific file.

Stateids composed of all bits 0 or all bits 1 have special meaning and are reserved values.

Verifier  A 64-bit quantity generated by the client that the server can use to determine if the client has restarted and lost all previous lock state.

2. Protocol Data Types

The syntax and semantics to describe the data types of the NFS version 4 protocol are defined in the XDR [RFC1832] and RPC [RFC1831] documents. The next sections build upon the XDR data types to define types and structures specific to this protocol.

2.1. Basic Data Types

Data Type     Definition
_____________________________________________________________________
int32_t       typedef int             int32_t;

uint32_t      typedef unsigned int    uint32_t;

int64_t       typedef hyper           int64_t;

uint64_t      typedef unsigned hyper  uint64_t;

attrlist4     typedef opaque          attrlist4<>;
              Used for file/directory attributes

bitmap4       typedef uint32_t        bitmap4<>;
              Used in attribute array encoding.
changeid4     typedef uint64_t        changeid4;
              Used in definition of change_info

clientid4     typedef uint64_t        clientid4;
              Shorthand reference to client identification

component4    typedef utf8string      component4;
              Represents path name components

count4        typedef uint32_t        count4;
              Various count parameters (READ, WRITE, COMMIT)

length4       typedef uint64_t        length4;
              Describes LOCK lengths

linktext4     typedef utf8string      linktext4;
              Symbolic link contents

mode4         typedef uint32_t        mode4;
              Mode attribute data type

nfs_cookie4   typedef uint64_t        nfs_cookie4;
              Opaque cookie value for READDIR

nfs_fh4       typedef opaque          nfs_fh4<NFS4_FHSIZE>;
              Filehandle definition; NFS4_FHSIZE is defined as 128

nfs_ftype4    enum                    nfs_ftype4;
              Various defined file types

nfsstat4      enum                    nfsstat4;
              Return value for operations

offset4       typedef uint64_t        offset4;
              Various offset designations (READ, WRITE, LOCK, COMMIT)

pathname4     typedef component4      pathname4<>;
              Represents path name for LOOKUP, OPEN and others

qop4          typedef uint32_t        qop4;
              Quality of protection designation in SECINFO

sec_oid4      typedef opaque          sec_oid4<>;
              Security Object Identifier
              The sec_oid4 data type is not really opaque.
              Instead, it contains an ASN.1 OBJECT IDENTIFIER as used
              by GSS-API in the mech_type argument to
              GSS_Init_sec_context. See [RFC2743] for details.

seqid4        typedef uint32_t        seqid4;
              Sequence identifier used for file locking

utf8string    typedef opaque          utf8string<>;
              UTF-8 encoding for strings

verifier4     typedef opaque          verifier4[NFS4_VERIFIER_SIZE];
              Verifier used for various operations (COMMIT, CREATE,
              OPEN, READDIR, SETCLIENTID, SETCLIENTID_CONFIRM, WRITE)
              NFS4_VERIFIER_SIZE is defined as 8

2.2. Structured Data Types

nfstime4

   struct nfstime4 {
           int64_t seconds;
           uint32_t nseconds;
   }

The nfstime4 structure gives the number of seconds and nanoseconds since midnight or 0 hour January 1, 1970 Coordinated Universal Time (UTC).
Values greater than zero for the seconds field denote dates after the 0 hour January 1, 1970. Values less than zero for the seconds field denote dates before the 0 hour January 1, 1970. In both cases, the nseconds field is to be added to the seconds field for the final time representation. For example, if the time to be represented is one-half second before 0 hour January 1, 1970, the seconds field would have a value of negative one (-1) and the nseconds field would have a value of one-half second (500000000). Values greater than 999,999,999 for nseconds are considered invalid.

This data type is used to pass time and date information. A server converts to and from its local representation of time when processing time values, preserving as much accuracy as possible. If the precision of timestamps stored for a filesystem object is less than defined, loss of precision can occur. An adjunct time maintenance protocol is recommended to reduce client and server time skew.

time_how4

   enum time_how4 {
           SET_TO_SERVER_TIME4 = 0,
           SET_TO_CLIENT_TIME4 = 1
   };

settime4

   union settime4 switch (time_how4 set_it) {
    case SET_TO_CLIENT_TIME4:
            nfstime4       time;
    default:
            void;
   };

The above definitions are used as the attribute definitions to set time values. If set_it is SET_TO_SERVER_TIME4, then the server uses its local representation of time for the time value.

specdata4

   struct specdata4 {
           uint32_t specdata1; /* major device number */
           uint32_t specdata2; /* minor device number */
   };

This data type represents additional information for the device file types NF4CHR and NF4BLK.
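The nfstime4 rule given above (nseconds is always added to seconds, and must stay below one second) can be sketched as follows. This is only an illustrative sketch in Python; the function names to_nfstime4 and from_nfstime4 are hypothetical and are not part of the protocol.

```python
NSEC_PER_SEC = 1_000_000_000

def to_nfstime4(total_nanoseconds):
    """Split a signed nanosecond offset from the epoch into (seconds, nseconds).

    Floor division keeps nseconds in 0..999,999,999 even for times
    before the epoch, so nseconds is always *added* to seconds.
    """
    seconds, nseconds = divmod(total_nanoseconds, NSEC_PER_SEC)
    return seconds, nseconds

def from_nfstime4(seconds, nseconds):
    """Recombine an nfstime4 pair into a signed nanosecond offset."""
    if not 0 <= nseconds < NSEC_PER_SEC:
        # Values greater than 999,999,999 are considered invalid.
        raise ValueError("nseconds out of range")
    return seconds * NSEC_PER_SEC + nseconds

# One-half second before 0 hour January 1, 1970:
half_second_before_epoch = to_nfstime4(-500_000_000)
```

With this convention the example from the text comes out as seconds = -1 and nseconds = 500000000, rather than as a negative nanosecond count.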
fsid4

   struct fsid4 {
           uint64_t major;
           uint64_t minor;
   };

This type is the filesystem identifier that is used as a mandatory attribute.

fs_location4

   struct fs_location4 {
           utf8string    server<>;
           pathname4     rootpath;
   };

fs_locations4

   struct fs_locations4 {
           pathname4     fs_root;
           fs_location4  locations<>;
   };

The fs_location4 and fs_locations4 data types are used for the fs_locations recommended attribute which is used for migration and replication support.

fattr4

   struct fattr4 {
           bitmap4       attrmask;
           attrlist4     attr_vals;
   };

The fattr4 structure is used to represent file and directory attributes.

The bitmap is a counted array of 32 bit integers used to contain bit values. The position of the integer in the array that contains bit n can be computed from the expression (n / 32) and its bit within that integer is (n mod 32).

                     0            1
   +-----------+-----------+-----------+--
   |  count    | 31  ..  0 | 63  .. 32 |
   +-----------+-----------+-----------+--

change_info4

   struct change_info4 {
           bool          atomic;
           changeid4     before;
           changeid4     after;
   };

This structure is used with the CREATE, LINK, REMOVE, RENAME operations to let the client know the value of the change attribute for the directory in which the target filesystem object resides.

clientaddr4

   struct clientaddr4 {
           /* see struct rpcb in RFC1833 */
           string r_netid<>;    /* network id */
           string r_addr<>;     /* universal address */
   };

The clientaddr4 structure is used as part of the SETCLIENTID operation to either specify the address of the client that is using a clientid or as part of the callback registration. The r_netid and r_addr fields are specified in [RFC1833], but they are underspecified in [RFC1833] as far as what they should look like for specific protocols.
For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the US-ASCII string:

   h1.h2.h3.h4.p1.p2

The prefix, "h1.h2.h3.h4", is the standard textual form for representing an IPv4 address, which is always four octets long. Assuming big-endian ordering, h1, h2, h3, and h4 are, respectively, the first through fourth octets each converted to ASCII-decimal. Assuming big-endian ordering, p1 and p2 are, respectively, the first and second octets of the port, each converted to ASCII-decimal. For example, if a host, in big-endian order, has an address of 0x0A010307 and there is a service listening on, in big-endian order, port 0x020F (decimal 527), then the complete universal address is "10.1.3.7.2.15".

For TCP over IPv4 the value of r_netid is the string "tcp". For UDP over IPv4 the value of r_netid is the string "udp".

For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the US-ASCII string:

   x1:x2:x3:x4:x5:x6:x7:x8.p1.p2

The suffix "p1.p2" is the service port, and is computed the same way as with universal addresses for TCP and UDP over IPv4. The prefix, "x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for representing an IPv6 address as defined in Section 2.2 of [RFC1884]. Additionally, the two alternative forms specified in Section 2.2 of [RFC1884] are also acceptable.

For TCP over IPv6 the value of r_netid is the string "tcp6". For UDP over IPv6 the value of r_netid is the string "udp6".

cb_client4

   struct cb_client4 {
           unsigned int  cb_program;
           clientaddr4   cb_location;
   };

This structure is used by the client to inform the server of its callback address; it includes the program number and client address.

nfs_client_id4

   struct nfs_client_id4 {
           verifier4     verifier;
           opaque        id<NFS4_OPAQUE_LIMIT>;
   };

This structure is part of the arguments to the SETCLIENTID operation. NFS4_OPAQUE_LIMIT is defined as 1024.
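As an illustration of the r_addr and r_netid formats described above for clientaddr4, the following Python sketch builds a universal address from a host address and a port. The helper names universal_address and netid are hypothetical, and the choice to emit the fully expanded IPv6 form (rather than one of the alternative forms) is an assumption made only for this example.

```python
import ipaddress

def universal_address(host, port):
    """Build the r_addr string: textual address followed by .p1.p2."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in two octets")
    addr = ipaddress.ip_address(host)
    p1 = port >> 8        # first (high-order) octet of the port
    p2 = port & 0xFF      # second (low-order) octet of the port
    # .exploded yields h1.h2.h3.h4 for IPv4 and x1:...:x8 for IPv6.
    return f"{addr.exploded}.{p1}.{p2}"

def netid(host, transport):
    """Return the r_netid string for 'tcp' or 'udp' over IPv4 or IPv6."""
    if transport not in ("tcp", "udp"):
        raise ValueError("only tcp and udp are covered by this sketch")
    version = ipaddress.ip_address(host).version
    return transport if version == 4 else transport + "6"

example_addr = universal_address("10.1.3.7", 527)   # port 0x020F
example_netid = netid("10.1.3.7", "tcp")
```

For the example in the text (address 0x0A010307, port 527), this yields "10.1.3.7.2.15" with an r_netid of "tcp".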
open_owner4

   struct open_owner4 {
           clientid4     clientid;
           opaque        owner<NFS4_OPAQUE_LIMIT>;
   };

This structure is used to identify the owner of open state. NFS4_OPAQUE_LIMIT is defined as 1024.

lock_owner4

   struct lock_owner4 {
           clientid4     clientid;
           opaque        owner<NFS4_OPAQUE_LIMIT>;
   };

This structure is used to identify the owner of file locking state. NFS4_OPAQUE_LIMIT is defined as 1024.

open_to_lock_owner4

   struct open_to_lock_owner4 {
           seqid4        open_seqid;
           stateid4      open_stateid;
           seqid4        lock_seqid;
           lock_owner4   lock_owner;
   };

This structure is used for the first LOCK operation done for an open_owner4. It provides both the open_stateid and lock_owner such that the transition is made from a valid open_stateid sequence to that of the new lock_stateid sequence. Using this mechanism avoids the confirmation of the lock_owner/lock_seqid pair since it is tied to established state in the form of the open_stateid/open_seqid.

stateid4

   struct stateid4 {
           uint32_t      seqid;
           opaque        other[12];
   };

This structure is used for the various state sharing mechanisms between the client and server. For the client, this data structure is read-only. The starting value of the seqid field is undefined. The server is required to increment the seqid field monotonically at each transition of the stateid. This is important since the client will inspect the seqid in OPEN stateids to determine the order of OPEN processing done by the server.

3. RPC and Security Flavor

The NFS version 4 protocol is a Remote Procedure Call (RPC) application that uses RPC version 2 and the corresponding eXternal Data Representation (XDR) as defined in [RFC1831] and [RFC1832]. The RPCSEC_GSS security flavor as defined in [RFC2203] MUST be used as the mechanism to deliver stronger security for the NFS version 4 protocol.

3.1. Ports and Transports

Historically, NFS version 2 and version 3 servers have resided on port 2049. The registered port 2049 [RFC1700] for the NFS protocol should be the default configuration. Using the registered port for NFS services means the NFS client will not need to use the RPC binding protocols as described in [RFC1833]; this will allow NFS to transit firewalls.

The transport used by the RPC service for the NFS version 4 protocol MUST provide congestion control comparable to that defined for TCP in [RFC2581]. If the operating environment implements TCP, the NFS version 4 protocol SHOULD be supported over TCP. The NFS client and server MAY use other transports if they support congestion control as defined above and in those cases a mechanism may be provided to override TCP usage in favor of another transport.

If TCP is used as the transport, the client and server SHOULD use persistent connections. This will prevent the weakening of TCP's congestion control via short lived connections and will improve performance for the WAN environment by eliminating the need for SYN handshakes.

Note that for various timers, the client and server should avoid inadvertent synchronization of those timers. For further discussion of the general issue refer to [Floyd].

3.1.1. Client Retransmission Behavior

When processing a request received over a reliable transport such as TCP, the NFS version 4 server MUST NOT silently drop the request, except if the transport connection has been broken. Given such a contract between NFS version 4 clients and servers, clients MUST NOT retry a request unless one or both of the following are true:

o The transport connection has been broken

o The procedure being retried is the NULL procedure

Since transports, including TCP, do not always synchronously inform a peer when the other peer has broken the connection (for example, when an NFS server reboots), the NFS version 4 client may want to actively "probe" the connection to see if it has been broken. Use of the NULL procedure is one recommended way to do so. So, when a client experiences a remote procedure call timeout (of some arbitrary implementation specific amount), rather than retrying the remote procedure call, it could instead issue a NULL procedure call to the server. If the server has died, the transport connection break will eventually be indicated to the NFS version 4 client. The client can then reconnect, and then retry the original request. If the NULL procedure call gets a response, the connection has not broken. The client can decide to wait longer for the original request's response, or it can break the transport connection and reconnect before re-sending the original request.

For callbacks from the server to the client, the same rules apply, but the server doing the callback becomes the client, and the client receiving the callback becomes the server.

3.2. Security Flavors

Traditional RPC implementations have included AUTH_NONE, AUTH_SYS, AUTH_DH, and AUTH_KRB4 as security flavors. With [RFC2203] an additional security flavor of RPCSEC_GSS has been introduced which uses the functionality of GSS-API [RFC2743].
This allows for the use of various security mechanisms by the RPC layer without the additional implementation overhead of adding RPC security flavors. For NFS version 4, the RPCSEC_GSS security flavor MUST be used to enable the mandatory security mechanism. Other flavors, such as AUTH_NONE, AUTH_SYS, and AUTH_DH, MAY be implemented as well.

3.2.1. Security mechanisms for NFS version 4

The use of RPCSEC_GSS requires selection of: mechanism, quality of protection, and service (authentication, integrity, privacy). The remainder of this document will refer to these three parameters of the RPCSEC_GSS security as the security triple.

3.2.1.1. Kerberos V5 as a security triple

The Kerberos V5 GSS-API mechanism as described in [RFC1964] MUST be implemented and provide the following security triples.

column descriptions:

1 == number of pseudo flavor
2 == name of pseudo flavor
3 == mechanism's OID
4 == mechanism's algorithm(s)
5 == RPCSEC_GSS service

1      2     3                    4              5
-----------------------------------------------------------------------
390003 krb5  1.2.840.113554.1.2.2 DES MAC MD5    rpc_gss_svc_none
390004 krb5i 1.2.840.113554.1.2.2 DES MAC MD5    rpc_gss_svc_integrity
390005 krb5p 1.2.840.113554.1.2.2 DES MAC MD5    rpc_gss_svc_privacy
                                  for integrity,
                                  and 56 bit DES
                                  for privacy.

Note that the pseudo flavor is presented here as a mapping aid to the implementor. Because this NFS protocol includes a method to negotiate security and it understands the GSS-API mechanism, the pseudo flavor is not needed. The pseudo flavor is needed for NFS version 3 since the security negotiation is done via the MOUNT protocol.

For a discussion of NFS' use of RPCSEC_GSS and Kerberos V5, please see [RFC2623].

3.2.1.2. LIPKEY as a security triple

The LIPKEY GSS-API mechanism as described in [RFC2847] MUST be implemented and provide the following security triples. The definition of the columns matches the previous subsection "Kerberos V5 as security triple".

1      2        3             4          5
-----------------------------------------------------------------------
390006 lipkey   1.3.6.1.5.5.9 negotiated rpc_gss_svc_none
390007 lipkey-i 1.3.6.1.5.5.9 negotiated rpc_gss_svc_integrity
390008 lipkey-p 1.3.6.1.5.5.9 negotiated rpc_gss_svc_privacy

The mechanism algorithm is listed as "negotiated". This is because LIPKEY is layered on SPKM-3 and in SPKM-3 [RFC2847] the confidentiality and integrity algorithms are negotiated. Since SPKM-3 specifies HMAC-MD5 for integrity as MANDATORY, 128 bit cast5CBC for confidentiality for privacy as MANDATORY, and further specifies that HMAC-MD5 and cast5CBC MUST be listed first before weaker algorithms, specifying "negotiated" in column 4 does not impair interoperability. In the event an SPKM-3 peer does not support the mandatory algorithms, the other peer is free to accept or reject the GSS-API context creation.

Because SPKM-3 negotiates the algorithms, subsequent calls to LIPKEY's GSS_Wrap() and GSS_GetMIC() by RPCSEC_GSS will use a quality of protection value of 0 (zero). See section 5.2 of [RFC2025] for an explanation.

LIPKEY uses SPKM-3 to create a secure channel in which to pass a user name and password from the client to the server. Once the user name and password have been accepted by the server, calls to the LIPKEY context are redirected to the SPKM-3 context. See [RFC2847] for more details.

3.2.1.3. SPKM-3 as a security triple

The SPKM-3 GSS-API mechanism as described in [RFC2847] MUST be implemented and provide the following security triples.
The definition of the columns matches the previous subsection "Kerberos V5 as security triple".

1      2      3               4          5
-----------------------------------------------------------------------
390009 spkm3  1.3.6.1.5.5.1.3 negotiated rpc_gss_svc_none
390010 spkm3i 1.3.6.1.5.5.1.3 negotiated rpc_gss_svc_integrity
390011 spkm3p 1.3.6.1.5.5.1.3 negotiated rpc_gss_svc_privacy

For a discussion as to why the mechanism algorithm is listed as "negotiated", see the previous section "LIPKEY as a security triple."

Because SPKM-3 negotiates the algorithms, subsequent calls to SPKM-3's GSS_Wrap() and GSS_GetMIC() by RPCSEC_GSS will use a quality of protection value of 0 (zero). See section 5.2 of [RFC2025] for an explanation.

Even though LIPKEY is layered over SPKM-3, SPKM-3 is specified as a mandatory set of triples to handle the situations where the initiator (the client) is anonymous or where the initiator has its own certificate. If the initiator is anonymous, there will not be a user name and password to send to the target (the server). If the initiator has its own certificate, then using passwords is superfluous.

3.3. Security Negotiation

With the NFS version 4 server potentially offering multiple security mechanisms, the client needs a method to determine or negotiate which mechanism is to be used for its communication with the server. The NFS server may have multiple points within its filesystem name space that are available for use by NFS clients. In turn the NFS server may be configured such that each of these entry points may have different or multiple security mechanisms in use.

The security negotiation between client and server must be done with a secure channel to eliminate the possibility of a third party intercepting the negotiation sequence and forcing the client and server to choose a lower level of security than required or desired.
See the section "Security Considerations" for further discussion.

3.3.1. SECINFO

The new SECINFO operation will allow the client to determine, on a per filehandle basis, what security triple is to be used for server access. In general, the client will not have to use the SECINFO operation except during initial communication with the server or when the client crosses policy boundaries at the server. It is possible that the server's policies change during the client's interaction, therefore forcing the client to negotiate a new security triple.

3.3.2. Security Error

Based on the assumption that each NFS version 4 client and server must support a minimum set of security (i.e. LIPKEY, SPKM-3, and Kerberos-V5 all under RPCSEC_GSS), the NFS client will start its communication with the server with one of the minimal security triples. During communication with the server, the client may receive an NFS error of NFS4ERR_WRONGSEC. This error allows the server to notify the client that the security triple currently being used is not appropriate for access to the server's filesystem resources. The client is then responsible for determining what security triples are available at the server and choosing one which is appropriate for the client. See the section for the "SECINFO" operation for further discussion of how the client will respond to the NFS4ERR_WRONGSEC error and use SECINFO.

3.4. Callback RPC Authentication

Except as noted elsewhere in this section, the callback RPC (described later) MUST mutually authenticate the NFS server to the principal that acquired the clientid (also described later), using the security flavor the original SETCLIENTID operation used.

For AUTH_NONE, there are no principals, so this is a non-issue.

AUTH_SYS has no notions of mutual authentication or a server principal, so the callback from the server simply uses the AUTH_SYS credential that the user used when setting up the delegation.

For AUTH_DH, one commonly used convention is that the server uses the credential corresponding to this AUTH_DH principal:

   unix.host@domain

where host and domain are variables corresponding to the name of the server host and the directory services domain in which it lives, such as a Network Information System domain or a DNS domain.

Because LIPKEY is layered over SPKM-3, it is permissible for the server to use SPKM-3 and not LIPKEY for the callback even if the client used LIPKEY for SETCLIENTID.

Regardless of what security mechanism under RPCSEC_GSS is being used, the NFS server MUST identify itself in GSS-API via a GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE names are of the form:

   service@hostname

For NFS, the "service" element is

   nfs

Implementations of security mechanisms will convert nfs@hostname to various different forms.
For Kerberos V5 and LIPKEY, the following form is RECOMMENDED:

   nfs/hostname

For Kerberos V5, nfs/hostname would be a server principal in the Kerberos Key Distribution Center database. For LIPKEY, this would be the username passed to the target (the NFS version 4 client that receives the callback).

It should be noted that LIPKEY may not work for callbacks, since the LIPKEY client uses a user id/password. If the NFS client receiving the callback can authenticate the NFS server's user name/password pair, and if the user that the NFS server is authenticating to has a public key certificate, then it works.

In situations where the NFS client uses LIPKEY and uses a per-host principal for the SETCLIENTID operation, instead of using LIPKEY for SETCLIENTID, it is RECOMMENDED that SPKM-3 with mutual authentication be used. This effectively means that the client will use a certificate to authenticate and identify the initiator to the target on the NFS server. Using SPKM-3 and not LIPKEY has the following advantages:

o When the server does a callback, it must authenticate to the principal used in the SETCLIENTID. Even if LIPKEY is used, because LIPKEY is layered over SPKM-3, the NFS client will need to have a certificate that corresponds to the principal used in the SETCLIENTID operation. From an administrative perspective, having a user name, password, and certificate for both the client and server is redundant.

o LIPKEY was intended to minimize additional infrastructure requirements beyond a certificate for the target, and the expectation is that existing password infrastructure can be leveraged for the initiator. In some environments, a per-host password does not exist yet. If certificates are used for any per-host principals, then additional password infrastructure is not needed.
o In cases when a host is both an NFS client and server, it can share the same per-host certificate.

4. Filehandles

The filehandle in the NFS protocol is a per server unique identifier for a filesystem object. The contents of the filehandle are opaque to the client. Therefore, the server is responsible for translating the filehandle to an internal representation of the filesystem object.

4.1. Obtaining the First Filehandle

The operations of the NFS protocol are defined in terms of one or more filehandles. Therefore, the client needs a filehandle to initiate communication with the server. With the NFS version 2 protocol [RFC1094] and the NFS version 3 protocol [RFC1813], there exists an ancillary protocol to obtain this first filehandle. The MOUNT protocol, RPC program number 100005, provides the mechanism of translating a string based filesystem path name to a filehandle which can then be used by the NFS protocols.

The MOUNT protocol has deficiencies in the area of security and use via firewalls. This is one reason that the use of the public filehandle was introduced in [RFC2054] and [RFC2055].
With the use of the public filehandle in combination with the LOOKUP operation in the NFS version 2 and 3 protocols, it has been demonstrated that the MOUNT protocol is unnecessary for viable interaction between NFS client and server.

Therefore, the NFS version 4 protocol will not use an ancillary protocol for translation from string based path names to a filehandle. Two special filehandles will be used as starting points for the NFS client.

4.1.1. Root Filehandle

The first of the special filehandles is the ROOT filehandle. The ROOT filehandle is the "conceptual" root of the filesystem name space at the NFS server. The client uses or starts with the ROOT filehandle by employing the PUTROOTFH operation. The PUTROOTFH operation instructs the server to set the "current" filehandle to the ROOT of the server's file tree. Once this PUTROOTFH operation is used, the client can then traverse the entirety of the server's file tree with the LOOKUP operation. A complete discussion of the server name space is in the section "NFS Server Name Space".

4.1.2. Public Filehandle

The second special filehandle is the PUBLIC filehandle. Unlike the ROOT filehandle, the PUBLIC filehandle may be bound to or represent an arbitrary filesystem object at the server. The server is responsible for this binding. It may be that the PUBLIC filehandle and the ROOT filehandle refer to the same filesystem object. However, it is up to the administrative software at the server and the policies of the server administrator to define the binding of the PUBLIC filehandle and server filesystem object. The client may not make any assumptions about this binding. The client uses the PUBLIC filehandle via the PUTPUBFH operation.

4.2. Filehandle Types

In the NFS version 2 and 3 protocols, there was one type of filehandle with a single set of semantics. This type of filehandle is termed "persistent" in NFS Version 4. The semantics of a persistent filehandle remain the same as before. A new type of filehandle introduced in NFS Version 4 is the "volatile" filehandle, which attempts to accommodate certain server environments.

The volatile filehandle type was introduced to address server functionality or implementation issues which make correct implementation of a persistent filehandle infeasible. Some server environments do not provide a filesystem level invariant that can be used to construct a persistent filehandle. The underlying server filesystem may not provide the invariant or the server's filesystem programming interfaces may not provide access to the needed invariant. Volatile filehandles may ease the implementation of server functionality such as hierarchical storage management or filesystem reorganization or migration. However, the volatile filehandle increases the implementation burden for the client.

Since the client will need to handle persistent and volatile filehandles differently, a file attribute is defined which may be used by the client to determine the filehandle types being returned by the server.

4.2.1. General Properties of a Filehandle

The filehandle contains all the information the server needs to distinguish an individual file. To the client, the filehandle is opaque.
The client stores filehandles for use in a later request and can compare two filehandles from the same server for equality by doing a byte-by-byte comparison. However, the client MUST NOT otherwise interpret the contents of filehandles. If two filehandles from the same server are equal, they MUST refer to the same file. Servers SHOULD try to maintain a one-to-one correspondence between filehandles and files but this is not required. Clients MUST use filehandle comparisons only to improve performance, not for correct behavior. All clients need to be prepared for situations in which it cannot be determined whether two filehandles denote the same object and in such cases, avoid making invalid assumptions which might cause incorrect behavior. Further discussion of filehandle and attribute comparison in the context of data caching is presented in the section "Data Caching and File Identity".

As an example, in the case that two different path names when traversed at the server terminate at the same filesystem object, the server SHOULD return the same filehandle for each path. This can occur if a hard link is used to create two file names which refer to the same underlying file object and associated data. For example, if paths /a/b/c and /a/d/c refer to the same file, the server SHOULD return the same filehandle for both path name traversals.

4.2.2. Persistent Filehandle

A persistent filehandle is defined as having a fixed value for the lifetime of the filesystem object to which it refers.
Once the server creates the filehandle for a filesystem object, the server MUST accept the same filehandle for the object for the lifetime of the object. If the server restarts or reboots, the NFS server must honor the same filehandle value as it did in the server's previous instantiation. Similarly, if the filesystem is migrated, the new NFS server must honor the same filehandle as the old NFS server.

The persistent filehandle will become stale or invalid when the filesystem object is removed. When the server is presented with a persistent filehandle that refers to a deleted object, it MUST return an error of NFS4ERR_STALE. A filehandle may become stale when the filesystem containing the object is no longer available. The filesystem may become unavailable if it exists on removable media and the media is no longer available at the server, or the filesystem in whole has been destroyed, or the filesystem has simply been removed from the server's name space (i.e. unmounted in a UNIX environment).

4.2.3. Volatile Filehandle

A volatile filehandle does not share the same longevity characteristics of a persistent filehandle. The server may determine that a volatile filehandle is no longer valid at many different points in time. If the server can definitively determine that a volatile filehandle refers to an object that has been removed, the server should return NFS4ERR_STALE to the client (as is the case for persistent filehandles). In all other cases where the server determines that a volatile filehandle can no longer be used, it should return an error of NFS4ERR_FHEXPIRED.

The mandatory attribute "fh_expire_type" is used by the client to determine what type of filehandle the server is providing for a particular filesystem.
This attribute is a bitmask with the following values:

FH4_PERSISTENT
   The value of FH4_PERSISTENT is used to indicate a persistent filehandle, which is valid until the object is removed from the filesystem. The server will not return NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined as a value in which none of the bits specified below are set.

FH4_VOLATILE_ANY
   The filehandle may expire at any time, except as specifically excluded (i.e. FH4_NOEXPIRE_WITH_OPEN).

FH4_NOEXPIRE_WITH_OPEN
   May only be set when FH4_VOLATILE_ANY is set. If this bit is set, then the meaning of FH4_VOLATILE_ANY is qualified to exclude any expiration of the filehandle when it is open.

FH4_VOL_MIGRATION
   The filehandle will expire as a result of migration. If FH4_VOLATILE_ANY is set, FH4_VOL_MIGRATION is redundant.

FH4_VOL_RENAME
   The filehandle will expire during rename. This includes a rename by the requesting client or a rename by any other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant.

Servers which provide volatile filehandles that may expire while open (i.e. if FH4_VOL_MIGRATION or FH4_VOL_RENAME is set or if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set), should deny a RENAME or REMOVE that would affect an OPEN file or any of the components leading to the OPEN file.
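As a rough illustration of the fh_expire_type bits described above, a client or server might test the mask along the following lines. The numeric bit values are assumed from the protocol's constant definitions and should be checked against them; the function names are hypothetical.

```python
# Bit values assumed from the protocol's constant definitions; verify before use.
FH4_PERSISTENT = 0x00000000
FH4_NOEXPIRE_WITH_OPEN = 0x00000001
FH4_VOLATILE_ANY = 0x00000002
FH4_VOL_MIGRATION = 0x00000004
FH4_VOL_RENAME = 0x00000008


def is_persistent(fh_expire_type: int) -> bool:
    """FH4_PERSISTENT is the value in which none of the bits are set."""
    return fh_expire_type == FH4_PERSISTENT


def is_valid(fh_expire_type: int) -> bool:
    """FH4_NOEXPIRE_WITH_OPEN may only be set together with FH4_VOLATILE_ANY."""
    noexpire = bool(fh_expire_type & FH4_NOEXPIRE_WITH_OPEN)
    volatile_any = bool(fh_expire_type & FH4_VOLATILE_ANY)
    return volatile_any or not noexpire


def may_expire_while_open(fh_expire_type: int) -> bool:
    """Mirror the 'may expire while open' condition stated in the text."""
    if fh_expire_type & (FH4_VOL_MIGRATION | FH4_VOL_RENAME):
        return True
    if fh_expire_type & FH4_VOLATILE_ANY:
        return not (fh_expire_type & FH4_NOEXPIRE_WITH_OPEN)
    return False
```

A server for which may_expire_while_open() is true is the kind of server that the text says should deny a RENAME or REMOVE affecting an OPEN file.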
In addition, the server should deny all RENAME or REMOVE requests during the grace period upon server restart.

Note that the bits FH4_VOL_MIGRATION and FH4_VOL_RENAME allow the client to determine that expiration has occurred whenever a specific event occurs, without an explicit filehandle expiration error from the server. FH4_VOLATILE_ANY does not provide this form of information. In situations where the server will expire many, but not all filehandles upon migration (e.g. all but those that are open), FH4_VOLATILE_ANY (in this case with FH4_NOEXPIRE_WITH_OPEN) is a better choice since the client may not assume that all filehandles will expire when migration occurs, and it is likely that additional expirations will occur (as a result of file CLOSE) that are separated in time from the migration event itself.

4.2.4. One Method of Constructing a Volatile Filehandle

As mentioned, in some instances a filehandle is stale (no longer valid; perhaps because the file was removed from the server) or it is expired (the underlying file is valid but since the filehandle is volatile, it may have expired).
Thus the server needs to be able to return NFS4ERR_STALE in the former case and NFS4ERR_FHEXPIRED in the latter case. This can be done by careful construction of the volatile filehandle. One possible implementation follows.

A volatile filehandle, while opaque to the client, could contain:

   [volatile bit = 1 | server boot time | slot | generation number]

   o  slot is an index in the server volatile filehandle table

   o  generation number is the generation number for the table entry/slot

If the server boot time is less than the current server boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return NFS4ERR_BADHANDLE. If the generation number does not match, return NFS4ERR_FHEXPIRED.

When the server reboots, the table is gone (it is volatile).

If volatile bit is 0, then it is a persistent filehandle with a different structure following it.

4.3. Client Recovery from Filehandle Expiration

If possible, the client SHOULD recover from the receipt of an NFS4ERR_FHEXPIRED error. The client must take on additional responsibility so that it may prepare itself to recover from the expiration of a volatile filehandle. If the server returns persistent filehandles, the client does not need these additional steps.

For volatile filehandles, most commonly the client will need to store the component names leading up to and including the filesystem object in question. With these names, the client should be able to recover by finding a filehandle in the name space that is still available or by starting at the root of the server's filesystem name space.

If the expired filehandle refers to an object that has been removed from the filesystem, obviously the client will not be able to recover from the expired filehandle.

It is also possible that the expired filehandle refers to a file that has been renamed.
If the file was renamed by another client, again it is possible that the original client will not be able to recover. However, in the case that the client itself is renaming the file and the file is open, it is possible that the client may be able to recover. The client can determine the new path name based on the processing of the rename request. The client can then regenerate the new filehandle based on the new path name. The client could also use the compound operation mechanism to construct a set of operations like:

   RENAME A B
   LOOKUP B
   GETFH

Note that the COMPOUND procedure does not provide atomicity. This example only reduces the overhead of recovering from an expired filehandle.

5. File Attributes

To meet the requirements of extensibility and increased interoperability with non-UNIX platforms, attributes must be handled in a flexible manner. The NFS version 3 fattr3 structure contains a fixed list of attributes that not all clients and servers are able to support or care about. The fattr3 structure cannot be extended as new needs arise and it provides no way to indicate non-support. With the NFS version 4 protocol, the client is able to query what attributes the server supports and construct requests with only those supported attributes (or a subset thereof).

To this end, attributes are divided into three groups: mandatory, recommended, and named. Both mandatory and recommended attributes are supported in the NFS version 4 protocol by a specific and well-defined encoding and are identified by number. They are requested by setting a bit in the bit vector sent in the GETATTR request; the server response includes a bit vector to list what attributes were returned in the response.
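To make the bit vector concrete, the sketch below builds and decodes an attribute mask. It assumes the vector is carried as an array of 32-bit words in which attribute number n occupies bit (n mod 32) of word (n div 32); that layout should be checked against the bitmap definition in the section "Protocol Data Types". The constant and function names are illustrative only; the attribute numbers are those given in the attribute definition tables.

```python
from typing import Iterable, List

# Attribute numbers as listed in the attribute definition tables.
# The Python names are illustrative, not protocol identifiers.
SUPP_ATTR = 0
TYPE = 1
SIZE = 4
MODE = 33
TIME_MODIFY = 53


def encode_bitmap(attr_numbers: Iterable[int]) -> List[int]:
    """Set one bit per requested attribute (assumed 32-bit word layout)."""
    words: List[int] = []
    for n in attr_numbers:
        word, bit = divmod(n, 32)
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def decode_bitmap(words: Iterable[int]) -> List[int]:
    """Return the attribute numbers whose bits are set, in ascending order."""
    return [
        index * 32 + bit
        for index, word in enumerate(words)
        for bit in range(32)
        if word & (1 << bit)
    ]


def supported_subset(wanted: Iterable[int], supported_words: Iterable[int]) -> List[int]:
    """Keep only the wanted attributes that the server reports as supported."""
    supported = set(decode_bitmap(supported_words))
    return [n for n in wanted if n in supported]
```

A client would typically fetch supp_attr once per filesystem, then pass each request through supported_subset() before encoding it, so that it only asks for attributes the server has said it supports.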
New mandatory or recommended attributes may be added to the NFS protocol between major revisions by publishing a standards-track RFC which allocates a new attribute number value and defines the encoding for the attribute. See the section "Minor Versioning" for further discussion.

Named attributes are accessed by the new OPENATTR operation, which accesses a hidden directory of attributes associated with a filesystem object. OPENATTR takes a filehandle for the object and returns the filehandle for the attribute hierarchy. The filehandle for the named attributes is a directory object accessible by LOOKUP or READDIR and contains files whose names represent the named attributes and whose data bytes are the value of the attribute. For example:

   LOOKUP     "foo"       ; look up file
   GETATTR    attrbits
   OPENATTR               ; access foo's named attributes
   LOOKUP     "x11icon"   ; look up specific attribute
   READ       0,4096      ; read stream of bytes

Named attributes are intended for data needed by applications rather than by an NFS client implementation. NFS implementors are strongly encouraged to define their new attributes as recommended attributes by bringing them to the IETF standards-track process.

The set of attributes which are classified as mandatory is deliberately small since servers must do whatever it takes to support them. A server should support as many of the recommended attributes as possible but by their definition, the server is not required to support all of them. Attributes are deemed mandatory if the data is both needed by a large number of clients and is not otherwise reasonably computable by the client when support is not provided on the server.

Note that the hidden directory returned by OPENATTR is a convenience for protocol processing.
The client should not make any assumptions about the server's implementation of named attributes and whether the underlying filesystem at the server has a named attribute directory or not. Therefore, operations such as SETATTR and GETATTR on the named attribute directory are undefined.

5.1. Mandatory Attributes

These MUST be supported by every NFS version 4 client and server in order to ensure a minimum level of interoperability. The server must store and return these attributes and the client must be able to function with an attribute set limited to these attributes. With just the mandatory attributes some client functionality may be impaired or limited in some ways. A client may ask for any of these attributes to be returned by setting a bit in the GETATTR request and the server must return their value.

5.2. Recommended Attributes

These attributes are understood well enough to warrant support in the NFS version 4 protocol. However, they may not be supported on all clients and servers. A client may ask for any of these attributes to be returned by setting a bit in the GETATTR request but must handle the case where the server does not return them. A client may ask for the set of attributes the server supports and should not request attributes the server does not support. A server should be tolerant of requests for unsupported attributes and simply not return them rather than considering the request an error. It is expected that servers will support all attributes they comfortably can and only fail to support attributes which are difficult to support in their operating environments. A server should provide attributes whenever it does not have to "tell lies" to the client. For example, a file modification time should be either an accurate time or should not be supported by the server.
This will not always be comfortable to clients but the client is better positioned to decide whether and how to fabricate or construct an attribute or whether to do without the attribute.

5.3. Named Attributes

These attributes are not supported by direct encoding in the NFS version 4 protocol but are accessed by string names rather than numbers and correspond to an uninterpreted stream of bytes which are stored with the filesystem object. The name space for these attributes may be accessed by using the OPENATTR operation. The OPENATTR operation returns a filehandle for a virtual "attribute directory" and further perusal of the name space may be done using READDIR and LOOKUP operations on this filehandle. Named attributes may then be examined or changed by normal READ and WRITE and CREATE operations on the filehandles returned from READDIR and LOOKUP. Named attributes may have attributes.

It is recommended that servers support arbitrary named attributes. A client should not depend on the ability to store any named attributes in the server's filesystem. If a server does support named attributes, a client which is also able to handle them should be able to copy a file's data and meta-data with complete transparency from one location to another; this would imply that names allowed for regular directory entries are valid for named attribute names as well.

Names of attributes will not be controlled by this document or other IETF standards track documents. See the section "IANA Considerations" for further discussion.

5.4. Classification of Attributes

Each of the Mandatory and Recommended attributes can be classified in one of three categories: per server, per filesystem, or per filesystem object.
Note that it is possible that some per filesystem attributes may vary within the filesystem. See the "homogeneous" attribute for its definition. Note that the attributes time_access_set and time_modify_set are not listed below because they are write-only attributes used in a special instance of SETATTR.

   o  The per server attribute is:

         lease_time

   o  The per filesystem attributes are:

         supp_attr, fh_expire_type, link_support, symlink_support, unique_handles, aclsupport, cansettime, case_insensitive, case_preserving, chown_restricted, files_avail, files_free, files_total, fs_locations, homogeneous, maxfilesize, maxname, maxread, maxwrite, no_trunc, space_avail, space_free, space_total, time_delta

   o  The per filesystem object attributes are:

         type, change, size, named_attr, fsid, rdattr_error, filehandle, ACL, archive, fileid, hidden, maxlink, mimetype, mode, numlinks, owner, owner_group, rawdev, space_used, system, time_access, time_backup, time_create, time_metadata, time_modify, mounted_on_fileid

For quota_avail_hard, quota_avail_soft, and quota_used see their definitions below for the appropriate classification.

5.5. Mandatory Attributes - Definitions

Name               #    Data Type      Access
      Description
___________________________________________________________________
supp_attr          0    bitmap         READ
      The bit vector which would retrieve all mandatory and recommended attributes that are supported for this object. The scope of this attribute applies to all objects with a matching fsid.

type               1    nfs4_ftype     READ
      The type of the object (file, directory, symlink, etc.)

fh_expire_type     2    uint32         READ
      Server uses this to specify filehandle expiration behavior to the client. See the section "Filehandles" for additional description.

change             3    uint64         READ
      A value created by the server that the client can use to determine if file data, directory contents or attributes of the object have been modified. The server may return the object's time_metadata attribute for this attribute's value but only if the filesystem object cannot be updated more frequently than the resolution of time_metadata.

size               4    uint64         R/W
      The size of the object in bytes.

link_support       5    bool           READ
      True, if the object's filesystem supports hard links.

symlink_support    6    bool           READ
      True, if the object's filesystem supports symbolic links.

named_attr         7    bool           READ
      True, if this object has named attributes. In other words, the object has a non-empty named attribute directory.

fsid               8    fsid4          READ
      Unique filesystem identifier for the filesystem holding this object. fsid contains major and minor components each of which are uint64.

unique_handles     9    bool           READ
      True, if two distinct filehandles are guaranteed to refer to two different filesystem objects.

lease_time         10   nfs_lease4     READ
      Duration of leases at server in seconds.

rdattr_error       11   enum           READ
      Error returned from getattr during readdir.

filehandle         19   nfs_fh4        READ
      The filehandle of this object (primarily for readdir requests).

5.6. Recommended Attributes - Definitions

Name               #    Data Type      Access
      Description
___________________________________________________________________
ACL                12   nfsace4<>      R/W
      The access control list for the object.

aclsupport         13   uint32         READ
      Indicates what types of ACLs are supported on the current filesystem.

archive            14   bool           R/W
      True, if this file has been archived since the time of last modification (deprecated in favor of time_backup).

cansettime         15   bool           READ
      True, if the server is able to change the times for a filesystem object as specified in a SETATTR operation.

case_insensitive   16   bool           READ
      True, if filename comparisons on this filesystem are case insensitive.

case_preserving    17   bool           READ
      True, if filename case on this filesystem is preserved.

chown_restricted   18   bool           READ
      If TRUE, the server will reject any request to change either the owner or the group associated with a file if the caller is not a privileged user (for example, "root" in UNIX operating environments or in Windows 2000 the "Take Ownership" privilege).

fileid             20   uint64         READ
      A number uniquely identifying the file within the filesystem.

files_avail        21   uint64         READ
      File slots available to this user on the filesystem containing this object - this should be the smallest relevant limit.

files_free         22   uint64         READ
      Free file slots on the filesystem containing this object - this should be the smallest relevant limit.

files_total        23   uint64         READ
      Total file slots on the filesystem containing this object.

fs_locations       24   fs_locations   READ
      Locations where this filesystem may be found. If the server returns NFS4ERR_MOVED as an error, this attribute MUST be supported.

hidden             25   bool           R/W
      True, if the file is considered hidden with respect to the Windows API.

homogeneous        26   bool           READ
      True, if this object's filesystem is homogeneous, i.e. the per filesystem attributes are the same for all of the filesystem's objects.

maxfilesize        27   uint64         READ
      Maximum supported file size for the filesystem of this object.

maxlink            28   uint32         READ
      Maximum number of links for this object.

maxname            29   uint32         READ
      Maximum filename size supported for this object.

maxread            30   uint64         READ
      Maximum read size supported for this object.

maxwrite           31   uint64         READ
      Maximum write size supported for this object. This attribute SHOULD be supported if the file is writable. Lack of this attribute can lead to the client either wasting bandwidth or not receiving the best performance.

mimetype           32   utf8<>         R/W
      MIME body type/subtype of this object.

mode               33   mode4          R/W
      UNIX-style mode and permission bits for this object.

no_trunc           34   bool           READ
      True, if, when a name longer than name_max is used, an error is returned and the name is not truncated.

numlinks           35   uint32         READ
      Number of hard links to this object.

owner              36   utf8<>         R/W
      The string name of the owner of this object.

owner_group        37   utf8<>         R/W
      The string name of the group ownership of this object.

quota_avail_hard   38   uint64         READ
      For definition see "Quota Attributes" section below.

quota_avail_soft   39   uint64         READ
      For definition see "Quota Attributes" section below.

quota_used         40   uint64         READ
      For definition see "Quota Attributes" section below.

rawdev             41   specdata4      READ
      Raw device identifier. UNIX device major/minor node information. If the value of type is not NF4BLK or NF4CHR, the value returned SHOULD NOT be considered useful.

space_avail        42   uint64         READ
      Disk space in bytes available to this user on the filesystem containing this object - this should be the smallest relevant limit.

space_free         43   uint64         READ
      Free disk space in bytes on the filesystem containing this object - this should be the smallest relevant limit.

space_total        44   uint64         READ
      Total disk space in bytes on the filesystem containing this object.

space_used         45   uint64         READ
      Number of filesystem bytes allocated to this object.

system             46   bool           R/W
      True, if this file is a "system" file with respect to the Windows API.

time_access        47   nfstime4       READ
      The time of last access to the object by a read that was satisfied by the server.

time_access_set    48   settime4       WRITE
      Set the time of last access to the object. SETATTR use only.

time_backup        49   nfstime4       R/W
      The time of last backup of the object.

time_create        50   nfstime4       R/W
      The time of creation of the object. This attribute does not have any relation to the traditional UNIX file attribute "ctime" or "change time".

time_delta         51   nfstime4       READ
      Smallest useful server time granularity.

time_metadata      52   nfstime4       R/W
      The time of last meta-data modification of the object.

time_modify        53   nfstime4       READ
      The time of last modification to the object.

time_modify_set    54   settime4       WRITE
      Set the time of last modification to the object. SETATTR use only.

mounted_on_fileid  55   uint64         READ
      Like fileid, but if the target filehandle is the root of a filesystem return the fileid of the underlying directory.

5.7. Time Access

As defined above, the time_access attribute represents the time of last access to the object by a read that was satisfied by the server.
The notion of what is an "access" depends on the server's operating environment and/or the server's filesystem semantics. For example, for servers obeying POSIX semantics, time_access would be updated only by the READLINK, READ, and READDIR operations and not any of the operations that modify the content of the object. Of course, setting the corresponding time_access_set attribute is another way to modify the time_access attribute.

Whenever the file object resides on a writeable filesystem, the server should make best efforts to record time_access into stable storage. However, to mitigate the performance effects of doing so, and most especially whenever the server is satisfying the read of the object's content from its cache, the server MAY cache access time updates and lazily write them to stable storage. It is also acceptable to give administrators of the server the option to disable time_access updates.

5.8. Interpreting owner and owner_group

The recommended attributes "owner" and "owner_group" (and also users and groups within the "acl" attribute) are represented in terms of a UTF-8 string. To avoid a representation that is tied to a particular underlying implementation at the client or server, the use of the UTF-8 string has been chosen. Note that section 6.1 of [RFC2624] provides additional rationale. It is expected that the client and server will have their own local representation of owner and owner_group that is used for local storage or presentation to the end user. Therefore, it is expected that when these attributes are transferred between the client and server, the local representation is translated to a syntax of the form "user@dns_domain".
This will allow a client and server that do not use the same local representation to translate to a common syntax that can be interpreted by both.

Similarly, security principals may be represented in different ways by different security mechanisms. Servers normally translate these representations into a common format, generally that used by local storage, to serve as a means of identifying the users corresponding to these security principals. When these local identifiers are translated to the form of the owner attribute, associated with files created by such principals, they identify, in a common format, the users associated with each corresponding set of security principals.

The translation used to interpret owner and group strings is not specified as part of the protocol. This allows various solutions to be employed. For example, a local translation table may be consulted that maps a numeric id to the user@dns_domain syntax. A name service may also be used to accomplish the translation.

A server may provide a more general service, not limited by any particular translation (which would only translate a limited set of possible strings), by storing the owner and owner_group attributes in local storage without any translation, or it may augment a translation method by storing the entire string for attributes for which no translation is available while using the local representation for those cases in which a translation is available.
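One way to picture the translation-table approach just described is the sketch below, which maps local numeric ids to the user@dns_domain syntax and keeps the entire string when no translation exists. The class name OwnerMap, the table contents, and the example.org domain are all hypothetical; the protocol does not specify any particular translation mechanism.

```python
from typing import Dict, Optional, Union


class OwnerMap:
    """Hypothetical local translation table for a single DNS domain."""

    def __init__(self, domain: str, uid_to_name: Dict[int, str]) -> None:
        self.domain = domain
        self.uid_to_name = dict(uid_to_name)
        self.name_to_uid = {name: uid for uid, name in uid_to_name.items()}

    def to_wire(self, uid: int) -> Optional[str]:
        """Translate a local id to user@dns_domain, or None if unknown.

        What to send when there is no translation is left to the caller.
        """
        name = self.uid_to_name.get(uid)
        return None if name is None else f"{name}@{self.domain}"

    def to_local(self, owner: str) -> Union[int, str]:
        """Return the local id when a translation exists.

        Otherwise return the string unchanged, modelling a server that
        stores the entire string for owners it cannot translate.
        """
        user, sep, domain = owner.rpartition("@")
        if sep and domain == self.domain and user in self.name_to_uid:
            return self.name_to_uid[user]
        return owner
```

A server that does not want to store untranslated strings would instead reject the value at the point where to_local() falls through to its final return.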
Servers that do not provide support for all possible values of the owner and owner_group attributes should return an error (NFS4ERR_BADOWNER) when a string is presented that has no translation, as the value to be set for a SETATTR of the owner, owner_group, or acl attributes. When a server does accept an owner or owner_group value as valid on a SETATTR (and similarly for the owner and group strings in an acl), it is promising to return that same string when a corresponding GETATTR is done. Configuration changes and ill-constructed name translations (those that contain aliasing) may make that promise impossible to honor. Servers should make appropriate efforts to avoid a situation in which these attributes have their values changed when no real change to ownership has occurred.

The "dns_domain" portion of the owner string is meant to be a DNS domain name. For example, user@ietf.org. Servers should accept as valid a set of users for at least one domain. A server may treat other domains as having no valid translations. A more general service is provided when a server is capable of accepting users for multiple domains, or for all domains, subject to security constraints.

In the case where there is no translation available to the client or server, the attribute value must be constructed without the "@".
Therefore, the absence of the @ from the owner or owner_group attribute signifies that no translation was available at the sender and that the receiver of the attribute should not use that string as a basis for translation into its own internal format. Even though the attribute value cannot be translated, it may still be useful. In the case of a client, the attribute string may be used for local display of ownership.

To provide a greater degree of compatibility with previous versions of NFS (i.e. v2 and v3), which identified users and groups by 32-bit unsigned uid's and gid's, owner and group strings that consist of decimal numeric values with no leading zeros can be given a special interpretation by clients and servers which choose to provide such support. The receiver may treat such a user or group string as representing the same user as would be represented by a v2/v3 uid or gid having the corresponding numeric value. A server is not obligated to accept such a string, but may return an NFS4ERR_BADOWNER instead. To avoid this mechanism being used to subvert user and group translation, so that a client might pass all of the owners and groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER error when there is a valid translation for the user or owner designated in this way. In that case, the client must use the appropriate name@domain string and not the special form for compatibility.

The owner string "nobody" may be used to designate an anonymous user, which will be associated with a file created by a security principal that cannot be mapped through normal means to the owner attribute.

5.9. Character Case Attributes

With respect to the case_insensitive and case_preserving attributes, each UCS-4 character (which UTF-8 encodes) has a "long descriptive name" [RFC1345] which may or may not include the word "CAPITAL" or "SMALL".
The presence of SMALL or CAPITAL allows an NFS server to implement unambiguous and efficient table driven mappings for case insensitive comparisons, and non-case-preserving storage. For general character handling and internationalization issues, see the section "Internationalization".

5.10. Quota Attributes

For the attributes related to filesystem quotas, the following definitions apply:

quota_avail_soft
   The value in bytes which represents the amount of additional disk space that can be allocated to this file or directory before the user may reasonably be warned. It is understood that this space may be consumed by allocations to other files or directories though there is a rule as to which other files or directories.

quota_avail_hard
   The value in bytes which represents the amount of additional disk space beyond the current allocation that can be allocated to this file or directory before further allocations will be refused. It is understood that this space may be consumed by allocations to other files or directories.

quota_used
   The value in bytes which represents the amount of disk space used by this file or directory and possibly a number of other similar files or directories, where the set of "similar" meets at least the criterion that allocating space to any file or directory in the set will reduce the "quota_avail_hard" of every other file or directory in the set.

   Note that there may be a number of distinct but overlapping sets of files or directories for which a quota_used value is maintained, e.g. "all files with a given owner", "all files with a given group owner", etc. The server is at liberty to choose any of those sets but should do so in a repeatable way. The rule may be configured per-filesystem or may be "choose the set with the smallest quota".

5.11. Access Control Lists

The NFS version 4 ACL attribute is an array of access control entries (ACE). There are various access control entry types, as defined in the Section "ACE type". The server is able to communicate which ACE types are supported by returning the appropriate value within the aclsupport attribute. Each ACE covers one or more operations on a file or directory as described in the Section "ACE Access Mask". It may also contain one or more flags that modify the semantics of the ACE as defined in the Section "ACE flag".

The NFS ACE attribute is defined as follows:

   typedef uint32_t        acetype4;
   typedef uint32_t        aceflag4;
   typedef uint32_t        acemask4;

   struct nfsace4 {
           acetype4        type;
           aceflag4        flag;
           acemask4        access_mask;
           utf8string      who;
   };

To determine if a request succeeds, each nfsace4 entry is processed in order by the server. Only ACEs which have a "who" that matches the requester are considered. Each ACE is processed until all of the bits of the requester's access have been ALLOWED. Once a bit (see below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer considered in the processing of later ACEs. If an ACCESS_DENIED_ACE is encountered where the requester's access still has unALLOWED bits in common with the "access_mask" of the ACE, the request is denied.
However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT ACE types do not affect a requester's access, and instead are for triggering events as a result of a requester's access attempt. Therefore, all AUDIT and ALARM ACEs are processed until the end of the ACL.

The NFS version 4 ACL model is quite rich. Some server platforms may provide access control functionality that goes beyond the UNIX-style mode attribute, but which is not as rich as the NFS ACL model. So that users can take advantage of this more limited functionality, the server may indicate that it supports ACLs as long as it follows the guidelines for mapping between its ACL model and the NFS version 4 ACL model.

The situation is complicated by the fact that a server may have multiple modules that enforce ACLs. For example, the enforcement for NFS version 4 access may be different from the enforcement for local access, and both may be different from the enforcement for access through other protocols such as SMB. So it may be useful for a server to accept an ACL even if not all of its modules are able to support it.

The guiding principle in all cases is that the server must not accept ACLs that appear to make the file more secure than it really is.

5.11.1. ACE type

   Type    Description
   _____________________________________________________
   ALLOW   Explicitly grants the access defined in
           acemask4 to the file or directory.

   DENY    Explicitly denies the access defined in
           acemask4 to the file or directory.

   AUDIT   LOG (system dependent) any access attempt to a
           file or directory which uses any of the access
           methods specified in acemask4.

   ALARM   Generate a system ALARM (system dependent)
           when any access attempt is made to a file or
           directory for the access methods specified in
           acemask4.

A server need not support all of the above ACE types.
The bitmask constants used to represent the above definitions within the aclsupport attribute are as follows:

   const ACL4_SUPPORT_ALLOW_ACL    = 0x00000001;
   const ACL4_SUPPORT_DENY_ACL     = 0x00000002;
   const ACL4_SUPPORT_AUDIT_ACL    = 0x00000004;
   const ACL4_SUPPORT_ALARM_ACL    = 0x00000008;

The semantics of the "type" field follow the descriptions provided above.

The constants used for the type field (acetype4) are as follows:

   const ACE4_ACCESS_ALLOWED_ACE_TYPE      = 0x00000000;
   const ACE4_ACCESS_DENIED_ACE_TYPE       = 0x00000001;
   const ACE4_SYSTEM_AUDIT_ACE_TYPE        = 0x00000002;
   const ACE4_SYSTEM_ALARM_ACE_TYPE        = 0x00000003;

Clients should not attempt to set an ACE unless the server claims support for that ACE type. If the server receives a request to set an ACE that it cannot store, it must reject the request with NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE that it can store but cannot enforce, the server SHOULD reject the request.

Example: suppose a server can enforce NFS ACLs for NFS access but cannot enforce ACLs for local access. If arbitrary processes can run on the server, then the server SHOULD NOT indicate ACL support. On the other hand, if only trusted administrative programs run locally, then the server may indicate ACL support.

5.11.2. ACE Access Mask

The access_mask field contains values based on the following:

   Access                 Description
   _______________________________________________________________
   READ_DATA              Permission to read the data of the file
   LIST_DIRECTORY         Permission to list the contents of a
                          directory
   WRITE_DATA             Permission to modify the file's data
   ADD_FILE               Permission to add a new file to a
                          directory
   APPEND_DATA            Permission to append data to a file
   ADD_SUBDIRECTORY       Permission to create a subdirectory in a
                          directory
   READ_NAMED_ATTRS       Permission to read the named attributes
                          of a file
   WRITE_NAMED_ATTRS      Permission to write the named attributes
                          of a file
   EXECUTE                Permission to execute a file
   DELETE_CHILD           Permission to delete a file or directory
                          within a directory
   READ_ATTRIBUTES        The ability to read basic attributes
                          (non-acls) of a file
   WRITE_ATTRIBUTES       Permission to change basic attributes
                          (non-acls) of a file
   DELETE                 Permission to delete the file
   READ_ACL               Permission to read the ACL
   WRITE_ACL              Permission to write the ACL
   WRITE_OWNER            Permission to change the owner
   SYNCHRONIZE            Permission to access file locally at the
                          server with synchronous reads and writes

The bitmask constants used for the access mask field are as follows:

   const ACE4_READ_DATA            = 0x00000001;
   const ACE4_LIST_DIRECTORY       = 0x00000001;
   const ACE4_WRITE_DATA           = 0x00000002;
   const ACE4_ADD_FILE             = 0x00000002;
   const ACE4_APPEND_DATA          = 0x00000004;
   const ACE4_ADD_SUBDIRECTORY     = 0x00000004;
   const ACE4_READ_NAMED_ATTRS     = 0x00000008;
   const ACE4_WRITE_NAMED_ATTRS    = 0x00000010;
   const ACE4_EXECUTE              = 0x00000020;
   const ACE4_DELETE_CHILD         = 0x00000040;
   const ACE4_READ_ATTRIBUTES      = 0x00000080;
   const ACE4_WRITE_ATTRIBUTES     = 0x00000100;

   const ACE4_DELETE               = 0x00010000;
   const ACE4_READ_ACL             = 0x00020000;
   const ACE4_WRITE_ACL            = 0x00040000;
   const ACE4_WRITE_OWNER          = 0x00080000;
   const ACE4_SYNCHRONIZE          = 0x00100000;

Server implementations need not provide the granularity of control that is implied by this list of masks. For example, POSIX-based systems might not distinguish APPEND_DATA (the ability to append to a file) from WRITE_DATA (the ability to modify existing contents); both masks would be tied to a single "write" permission. When such a server returns attributes to the client, it would show both APPEND_DATA and WRITE_DATA if and only if the write permission is enabled.

If a server receives a SETATTR request that it cannot accurately implement, it should err in the direction of more restricted access. For example, suppose a server cannot distinguish overwriting data from appending new data, as described in the previous paragraph. If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is not (or vice versa), the server should reject the request with NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type DENY, the server may silently turn on the other bit, so that both APPEND_DATA and WRITE_DATA are denied.
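The ACE processing order from the section "Access Control Lists" and the write/append rule above can be illustrated with a short sketch. It is not part of the protocol definition. The function names, the tuple layout, and the matches_who callback are hypothetical, and the treatment of bits that are neither ALLOWED nor DENIED by the end of the ACL (refused here) is an assumption that this section does not state.

```python
# Constants copied from this section; everything else is illustrative.
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001

ACE4_READ_DATA = 0x00000001
ACE4_WRITE_DATA = 0x00000002
ACE4_APPEND_DATA = 0x00000004


def evaluate_acl(acl, matches_who, requested):
    """Process ACEs in order; return True if the request is granted.

    acl         -- ordered list of (type, flag, access_mask, who) tuples
    matches_who -- hypothetical callback: does "who" match the requester?
    requested   -- bitmask of the access being requested
    """
    remaining = requested  # bits that have not yet been ALLOWED
    for ace_type, _flag, access_mask, who in acl:
        if remaining == 0:
            break  # every requested bit is ALLOWED
        if not matches_who(who):
            continue  # only ACEs whose "who" matches are considered
        if ace_type == ACE4_ACCESS_ALLOWED_ACE_TYPE:
            remaining &= ~access_mask  # ALLOWED bits drop out
        elif ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
            if remaining & access_mask:
                return False  # an unALLOWED bit is denied
        # AUDIT and ALARM ACEs do not affect the outcome.
    # Assumption: bits never ALLOWED are treated as refused.
    return remaining == 0


def normalize_write_bits(ace_type, access_mask):
    """Sketch of a server that has only a single "write" permission."""
    both = ACE4_WRITE_DATA | ACE4_APPEND_DATA
    if access_mask & both in (0, both):
        return access_mask  # nothing to reconcile
    if ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
        return access_mask | both  # err toward more restricted access
    raise ValueError("NFS4ERR_ATTRNOTSUPP")  # stands in for the error reply
```

In this sketch a DENY ACE is widened and an ALLOW ACE is rejected, because widening an ALLOW would grant access the client did not ask for.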
5.11.3. ACE flag

The "flag" field contains values based on the following descriptions.

ACE4_FILE_INHERIT_ACE
   Can be placed on a directory and indicates that this ACE should be added to each new non-directory file created.

ACE4_DIRECTORY_INHERIT_ACE
   Can be placed on a directory and indicates that this ACE should be added to each new directory created.

ACE4_INHERIT_ONLY_ACE
   Can be placed on a directory but does not apply to the directory, only to newly created files/directories as specified by the above two flags.

ACE4_NO_PROPAGATE_INHERIT_ACE
   Can be placed on a directory. Normally when a new directory is created and an ACE exists on the parent directory which is marked ACE4_DIRECTORY_INHERIT_ACE, two ACEs are placed on the new directory: one for the directory itself and one which is an inheritable ACE for newly created directories. This flag tells the server to not place an ACE on the newly created directory which is inheritable by subdirectories of the created directory.
ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
ACE4_FAILED_ACCESS_ACE_FLAG
   The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE (ALARM) ACE types. If during the processing of the file's ACL, the server encounters an AUDIT or ALARM ACE that matches the principal attempting the OPEN, the server notes that fact, and the presence, if any, of the SUCCESS and FAILED flags encountered in the AUDIT or ALARM ACE. Once the server completes the ACL processing, and the share reservation processing, and the OPEN call, it then notes if the OPEN succeeded or failed. If the OPEN succeeded, and if the SUCCESS flag was set for a matching AUDIT or ALARM, then the appropriate AUDIT or ALARM event occurs. If the OPEN failed, and if the FAILED flag was set for the matching AUDIT or ALARM, then the appropriate AUDIT or ALARM event occurs. Clearly either or both of the SUCCESS or FAILED can be set, but if neither is set, the AUDIT or ALARM ACE is not useful.

   The previously described processing applies to that of the ACCESS operation as well. The difference is that "success" or "failure" does not mean whether ACCESS returns NFS4_OK or not. Success means whether ACCESS returns all requested and supported bits. Failure means whether ACCESS failed to return a bit that was requested and supported.

ACE4_IDENTIFIER_GROUP
   Indicates that the "who" refers to a GROUP as defined under UNIX.

The bitmask constants used for the flag field are as follows:

   const ACE4_FILE_INHERIT_ACE             = 0x00000001;
   const ACE4_DIRECTORY_INHERIT_ACE        = 0x00000002;
   const ACE4_NO_PROPAGATE_INHERIT_ACE     = 0x00000004;
   const ACE4_INHERIT_ONLY_ACE             = 0x00000008;
   const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG   = 0x00000010;
   const ACE4_FAILED_ACCESS_ACE_FLAG       = 0x00000020;
   const ACE4_IDENTIFIER_GROUP             = 0x00000040;

A server need not support any of these flags. If the server supports flags that are similar to, but not exactly the same as, these flags, the implementation may define a mapping between the protocol-defined flags and the implementation-defined flags. Again, the guiding principle is that the file must not appear to be more secure than it really is.

For example, suppose a client tries to set an ACE with ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the server does not support any form of ACL inheritance, the server should reject the request with NFS4ERR_ATTRNOTSUPP. If the server supports a single "inherit ACE" flag that applies to both files and directories, the server may reject the request (i.e., requiring the client to set both the file and directory inheritance flags). The server may also accept the request and silently turn on the ACE4_DIRECTORY_INHERIT_ACE flag.

5.11.4. ACE who

There are several special identifiers ("who") which need to be understood universally, rather than in the context of a particular DNS domain. Some of these identifiers cannot be understood when an NFS client accesses the server, but have meaning when a local process accesses the file. The ability to display and modify these permissions is permitted over NFS, even if none of the access methods on the server understands the identifiers.

   Who                    Description
   _______________________________________________________________
   "OWNER"                The owner of the file.
   "GROUP"                The group associated with the file.
   "EVERYONE"             The world.
   "INTERACTIVE"          Accessed from an interactive terminal.
   "NETWORK"              Accessed via the network.
   "DIALUP"               Accessed as a dialup user to the server.
   "BATCH"                Accessed from a batch job.
   "ANONYMOUS"            Accessed without any authentication.
   "AUTHENTICATED"        Any authenticated user (opposite of
                          ANONYMOUS)
   "SERVICE"              Access from a system service.

To avoid conflict, these special identifiers are distinguished by an appended "@" and should appear in the form "xxxx@" (note: no domain name after the "@"). For example: ANONYMOUS@.

5.11.5. Mode Attribute

The NFS version 4 mode attribute is based on the UNIX mode bits.
The following bits are defined:

   const MODE4_SUID = 0x800;  /* set user id on execution */
   const MODE4_SGID = 0x400;  /* set group id on execution */
   const MODE4_SVTX = 0x200;  /* save text even after use */
   const MODE4_RUSR = 0x100;  /* read permission: owner */
   const MODE4_WUSR = 0x080;  /* write permission: owner */
   const MODE4_XUSR = 0x040;  /* execute permission: owner */
   const MODE4_RGRP = 0x020;  /* read permission: group */
   const MODE4_WGRP = 0x010;  /* write permission: group */
   const MODE4_XGRP = 0x008;  /* execute permission: group */
   const MODE4_ROTH = 0x004;  /* read permission: other */
   const MODE4_WOTH = 0x002;  /* write permission: other */
   const MODE4_XOTH = 0x001;  /* execute permission: other */

Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and MODE4_XGRP apply to the principals identified in the owner_group attribute. Bits MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH apply to any principal that does not match that in the owner attribute, and does not have a group matching that of the owner_group attribute.

The remaining bits are not defined by this protocol and MUST NOT be used. The minor version mechanism must be used to define further bit usage.

Note that in UNIX, if a file has the MODE4_SGID bit set and no MODE4_XGRP bit set, then READ and WRITE must use mandatory file locking.

5.11.6. Mode and ACL Attribute

The server that supports both mode and ACL must take care to synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the ACEs which have respective who fields of "OWNER@", "GROUP@", and "EVERYONE@" so that the client can see semantically equivalent access permissions whether the client asks for owner, owner_group and mode attributes, or for just the ACL.

Because the mode attribute includes bits (e.g. MODE4_SVTX) that have nothing to do with ACL semantics, it is permitted for clients to specify both the ACL attribute and mode in the same SETATTR operation. However, because there is no prescribed order for processing the attributes in a SETATTR, the client must ensure that the ACL attribute, if specified without mode, would produce the desired mode bits, and conversely, that the mode attribute, if specified without ACL, would produce the desired "OWNER@", "GROUP@", and "EVERYONE@" ACEs.

5.11.7. mounted_on_fileid

UNIX-based operating environments connect a filesystem into the namespace by connecting (mounting) the filesystem onto the existing file object (the mount point, usually a directory) of an existing filesystem. When the mount point's parent directory is read via an API like readdir(), the return results are directory entries, each with a component name and a fileid. The fileid of the mount point's directory entry will be different from the fileid that the stat() system call returns. The stat() system call is returning the fileid of the root of the mounted filesystem, whereas readdir() is returning the fileid stat() would have returned before any filesystems were mounted on the mount point.

Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request to cross other filesystems. The client detects the filesystem crossing whenever the filehandle argument of LOOKUP has an fsid attribute different from that of the filehandle returned by LOOKUP. A UNIX-based client will consider this a "mount point crossing". UNIX has a legacy scheme for allowing a process to determine its current working directory. This relies on readdir() of a mount point's parent and stat() of the mount point returning fileids as previously described.
The mounted_on_fileid attribute corresponds to the fileid that readdir() would have returned as described previously.

While the NFS version 4 client could simply fabricate a fileid corresponding to what mounted_on_fileid provides (and if the server does not support mounted_on_fileid, the client has no choice), there is a risk that the client will generate a fileid that conflicts with one that is already assigned to another object in the filesystem. Instead, if the server can provide the mounted_on_fileid, the potential for client operational problems in this area is eliminated.

If the server detects that there is no mount point at the target file object, then the value for mounted_on_fileid that it returns is the same as that of the fileid attribute.

The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD provide it if possible, and for a UNIX-based server, this is straightforward.
Usually, mounted_on_fileid will be requested during a READDIR operation, in which case it is trivial (at least for UNIX-based servers) to return mounted_on_fileid since it is equal to the fileid of a directory entry returned by readdir(). If mounted_on_fileid is requested in a GETATTR operation, the server should obey an invariant that has it returning a value that is equal to the file object's entry in the object's parent directory, i.e. what readdir() would have returned. Some operating environments allow a series of two or more filesystems to be mounted onto a single mount point. In this case, for the server to obey the aforementioned invariant, it will need to find the base mount point, and not the intermediate mount points.

6. Filesystem Migration and Replication

With the use of the recommended attribute "fs_locations", the NFS version 4 server has a method of providing filesystem migration or replication services. For the purposes of migration and replication, a filesystem will be defined as all files that share a given fsid (both major and minor values are the same).

The fs_locations attribute provides a list of filesystem locations. These locations are specified by providing the server name (either DNS domain or IP address) and the path name representing the root of the filesystem.
Depending on the type of service being provided, the list will provide a new location or a set of alternate locations for the filesystem. The client will use this information to redirect its requests to the new server.

6.1. Replication

It is expected that filesystem replication will be used in the case of read-only data. Typically, the filesystem will be replicated on two or more servers. The fs_locations attribute will provide the list of these locations to the client. On first access of the filesystem, the client should obtain the value of the fs_locations attribute. If, in the future, the client finds the server unresponsive, the client may attempt to use another server specified by fs_locations.

If applicable, the client must take the appropriate steps to recover valid filehandles from the new server. This is described in more detail in the following sections.

6.2. Migration

Filesystem migration is used to move a filesystem from one server to another. Migration is typically used for a filesystem that is writable and has a single copy. The expected use of migration is for load balancing or general resource reallocation. The protocol does not specify how the filesystem will be moved between servers. This server-to-server transfer mechanism is left to the server implementor. However, the method used to communicate the migration event between client and server is specified here.

Once the servers participating in the migration have completed the move of the filesystem, the error NFS4ERR_MOVED will be returned for subsequent requests received by the original server.
The NFS4ERR_MOVED error is returned for all operations except PUTFH and GETATTR. Upon receiving the NFS4ERR_MOVED error, the client will obtain the value of the fs_locations attribute. The client will then use the contents of the attribute to redirect its requests to the specified server. To facilitate the use of GETATTR, operations such as PUTFH must also be accepted by the server for the migrated filesystem's filehandles. Note that if the server returns NFS4ERR_MOVED, the server MUST support the fs_locations attribute.

If the client requests more attributes than just fs_locations, the server may return fs_locations only. This is to be expected since the server has migrated the filesystem and may not have a method of obtaining additional attribute data.

The server implementor needs to be careful in developing a migration solution. The server must consider all of the state information clients may have outstanding at the server. This includes but is not limited to locking/share state, delegation state, and asynchronous file writes which are represented by WRITE and COMMIT verifiers. The server should strive to minimize the impact on its clients during and after the migration process.

6.3. Interpretation of the fs_locations Attribute

The fs_locations attribute is structured in the following way:

   struct fs_location {
           utf8string      server<>;
           pathname4       rootpath;
   };

   struct fs_locations {
           pathname4       fs_root;
           fs_location     locations<>;
   };

The fs_location struct is used to represent the location of a filesystem by providing a server name and the path to the root of the filesystem. For a multi-homed server or a set of servers that use the same rootpath, an array of server names may be provided. An entry in the server array is a UTF8 string and represents one of a traditional DNS host name, IPv4 address, or IPv6 address. It is not a requirement that all servers that share the same rootpath be listed in one fs_location struct. The array of server names is provided for convenience. Servers that share the same rootpath may also be listed in separate fs_location entries in the fs_locations attribute.

The fs_locations struct and attribute then contains an array of locations. Since the name space of each server may be constructed differently, the "fs_root" field is provided.
The path represented by fs_root represents the location of the filesystem in the server's name space. Therefore, the fs_root path is only associated with the server from which the fs_locations attribute was obtained. The fs_root path is meant to aid the client in locating the filesystem at the various servers listed.

As an example, there is a replicated filesystem located at two servers (servA and servB). At servA the filesystem is located at path "/a/b/c". At servB the filesystem is located at path "/x/y/z".
In this example the client accesses the filesystem first at servA with a multi-component lookup path of "/a/b/c/d". Since the client used a multi-component lookup to obtain the filehandle at "/a/b/c/d", it is unaware that the filesystem's root is located in servA's name space at "/a/b/c". When the client switches to servB, it will need to determine that the directory it first referenced at servA is now represented by the path "/x/y/z/d" on servB. To facilitate this, the fs_locations attribute provided by servA would have a fs_root value of "/a/b/c" and two entries in fs_location. One entry in fs_location will be for itself (servA) and the other will be for servB with a path of "/x/y/z". With this information, the client is able to substitute "/x/y/z" for the "/a/b/c" at the beginning of its access path and construct "/x/y/z/d" to use for the new server.

See the section "Security Considerations" for a discussion on the recommendations for the security flavor to be used by any GETATTR operation that requests the "fs_locations" attribute.

6.4. Filehandle Recovery for Migration or Replication

Filehandles for filesystems that are replicated or migrated generally have the same semantics as for filesystems that are not replicated or migrated.
For example, if a filesystem has persistent filehandles and it is migrated to another server, the filehandle values for the filesystem will be valid at the new server.

For volatile filehandles, the servers involved likely do not have a mechanism to transfer filehandle format and content between themselves. Therefore, a server may have difficulty in determining if a volatile filehandle from an old server should return an error of NFS4ERR_FHEXPIRED. Therefore, the client is informed, with the use of the fh_expire_type attribute, whether volatile filehandles will expire at the migration or replication event. If the bit FH4_VOL_MIGRATION is set in the fh_expire_type attribute, the client must treat the volatile filehandle as if the server had returned the NFS4ERR_FHEXPIRED error. At the migration or replication event in the presence of the FH4_VOL_MIGRATION bit, the client will not present the original or old volatile filehandle to the new server. The client will start its communication with the new server by recovering its filehandles using the saved file names.

7. NFS Server Name Space

7.1. Server Exports

On a UNIX server the name space describes all the files reachable by pathnames under the root directory or "/". On a Windows NT server the name space constitutes all the files on disks named by mapped disk letters. NFS server administrators rarely make the entire server's filesystem name space available to NFS clients. More often portions of the name space are made available via an "export" feature.
In previous versions of the NFS protocol, the root filehandle for each export is obtained through the MOUNT protocol; the client sends a string that identifies the export of name space and the server returns the root filehandle for it. The MOUNT protocol supports an EXPORTS procedure that will enumerate the server's exports.

7.2. Browsing Exports

The NFS version 4 protocol provides a root filehandle that clients can use to obtain filehandles for these exports via a multi-component LOOKUP. A common user experience is to use a graphical user interface (perhaps a file "Open" dialog window) to find a file via progressive browsing through a directory tree. The client must be able to move from one export to another export via single-component, progressive LOOKUP operations.

This style of browsing is not well supported by the NFS version 2 and 3 protocols. The client expects all LOOKUP operations to remain within a single server filesystem. For example, the device attribute will not change. This prevents a client from taking name space paths that span exports.
An automounter on the client can obtain a snapshot of the server's name space using the EXPORTS procedure of the MOUNT protocol. If it understands the server's pathname syntax, it can create an image of the server's name space on the client. The parts of the name space that are not exported by the server are filled in with a "pseudo filesystem" that allows the user to browse from one mounted filesystem to another. There is a drawback to this representation of the server's name space on the client: it is static. If the server administrator adds a new export the client will be unaware of it.

7.3. Server Pseudo Filesystem

NFS version 4 servers avoid this name space inconsistency by presenting all the exports within the framework of a single server name space.
An NFS version 4 client uses LOOKUP and READDIR operations to browse seamlessly from one export to another. Portions of the server name space that are not exported are bridged via a "pseudo filesystem" that provides a view of exported directories only. A pseudo filesystem has a unique fsid and behaves like a normal, read only filesystem.

Based on the construction of the server's name space, it is possible that multiple pseudo filesystems may exist. For example,

   /a              pseudo filesystem
   /a/b            real filesystem
   /a/b/c          pseudo filesystem
   /a/b/c/d        real filesystem

Each of the pseudo filesystems is considered a separate entity and therefore will have a unique fsid.

7.4. Multiple Roots

The DOS and Windows operating environments are sometimes described as having "multiple roots". Filesystems are commonly represented as disk letters. MacOS represents filesystems as top level names. NFS version 4 servers for these platforms can construct a pseudo filesystem above these root names so that disk letters or volume names are simply directory names in the pseudo root.

7.5. Filehandle Volatility

The nature of the server's pseudo filesystem is that it is a logical representation of filesystem(s) available from the server.
Therefore, the pseudo filesystem is most likely constructed dynamically when the server is first instantiated. It is expected that the pseudo filesystem may not have an on-disk counterpart from which persistent filehandles could be constructed. Even though it is preferable that the server provide persistent filehandles for the pseudo filesystem, the NFS client should expect that pseudo filesystem filehandles are volatile. This can be confirmed by checking the associated "fh_expire_type" attribute for those filehandles in question. If the filehandles are volatile, the NFS client must be prepared to recover a filehandle value (e.g. with a multi-component LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED.

7.6. Exported Root

If the server's root filesystem is exported, one might conclude that a pseudo-filesystem is not needed. This would be wrong. Assume the following filesystems on a server:

   /       disk1  (exported)
   /a      disk2  (not exported)
   /a/b    disk3  (exported)

Because disk2 is not exported, disk3 cannot be reached with simple LOOKUPs. The server must bridge the gap with a pseudo-filesystem.

7.7. Mount Point Crossing

The server filesystem environment may be constructed in such a way that one filesystem contains a directory which is 'covered' or mounted upon by a second filesystem.
For example:

   /a/b            (filesystem 1)
   /a/b/c/d        (filesystem 2)

The pseudo filesystem for this server may be constructed to look like:

   /               (place holder/not exported)
   /a/b            (filesystem 1)
   /a/b/c/d        (filesystem 2)

It is the server's responsibility to present the pseudo filesystem that is complete to the client. If the client sends a lookup request for the path "/a/b/c/d", the server's response is the filehandle of the filesystem "/a/b/c/d". In previous versions of the NFS protocol, the server would respond with the filehandle of directory "/a/b/c/d" within the filesystem "/a/b".

The NFS client will be able to determine if it crosses a server mount point by a change in the value of the "fsid" attribute.

7.8. Security Policy and Name Space Presentation

The application of the server's security policy needs to be carefully considered by the implementor. One may choose to limit the viewability of portions of the pseudo filesystem based on the server's perception of the client's ability to authenticate itself properly.
However, with the support of multiple security mechanisms and the ability to negotiate the appropriate use of these mechanisms, the server is unable to properly determine if a client will be able to authenticate itself. If, based on its policies, the server chooses to limit the contents of the pseudo filesystem, the server may effectively hide filesystems from a client that may otherwise have legitimate access.

As suggested practice, the server should apply the security policy of a shared resource in the server's namespace to the ancestor components of the namespace. For example:

   /
   /a/b
   /a/b/c

The /a/b/c directory is a real filesystem and is the shared resource. The security policy for /a/b/c is Kerberos with integrity. The server should apply the same security policy to /, /a, and /a/b. This allows for the extension of the protection of the server's namespace to the ancestors of the real shared resource.

For the case of the use of multiple, disjoint security mechanisms in the server's resources, the security for a particular object in the server's namespace should be the union of all security mechanisms of all direct descendants.

8. File Locking and Share Reservations

Integrating locking into the NFS protocol necessarily causes it to be stateful. With the inclusion of share reservations the protocol becomes substantially more dependent on state than the traditional combination of NFS and NLM [XNFS]. There are three components to making this state manageable:

o  Clear division between client and server

o  Ability to reliably detect inconsistency in state between client and server

o  Simple and robust recovery mechanisms

In this model, the server owns the state information. The client communicates its view of this state to the server as needed. The client is also able to detect inconsistent state before modifying a file.

To support Win32 share reservations it is necessary to atomically OPEN or CREATE files. Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API. In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced.
The NFS version 4 protocol has an OPEN operation that subsumes the NFS version 3 methodology of LOOKUP, CREATE, and ACCESS. However, because many operations require a filehandle, the traditional LOOKUP is preserved to map a file name to filehandle without establishing state on the server. The policy of granting access or modifying files is managed by the server based on the client's state. These mechanisms can implement policy ranging from advisory only locking to full mandatory locking.

8.1. Locking

It is assumed that manipulating a lock is rare when compared to READ and WRITE operations. It is also assumed that crashes and network partitions are relatively rare. Therefore it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock. A lock request contains the heavyweight information required to establish a lock and uniquely define the lock owner.

The following sections describe the transition from the heavyweight information to the eventual stateid used for most client and server locking and lease interactions.

8.1.1. Client ID

For each LOCK request, the client must identify itself to the server. This is done in such a way as to allow for correct lock identification and crash recovery.
A sequence of a SETCLIENTID operation followed by a SETCLIENTID_CONFIRM operation is required to establish thepartidentification onto the server. Establishment of identification by a new incarnation of the clientor server. This should not happen. One mechanismalso has the effect of immediately breaking any leased state thatmay be useda previous incarnation of the client might have had on the server, as opposed tosatisfy these requirements isforcing the new client incarnation to wait for theserverleases todivide stateids into three fields: o Aexpire. Breaking the lease state amounts to the serververifier which uniquely designates a particularremoving all lock, share reservation, and, where the serverinstantiation. o An index into a table of locking-state structures. o A sequence value which is incremented for each stateid thatis not supporting the CLAIM_DELEGATE_PREV claim type, all delegation state associated withthesameindex into the locking-state table. By matching the incoming stateid and its field valuesclient with the same identity. For discussion of delegation stateheld atrecovery, see theserver,section "Delegation Recovery". Client identification is encapsulated in theserverfollowing structure: struct nfs_client_id4 { verifier4 verifier; opaque id<NFS4_OPAQUE_LIMIT>; }; The first field, verifier isable to easily determine ifastateidclient incarnation verifier that isvalid for its current instantiation and state. Ifused to detect client reboots. Only if thestateidverifier isnot valid,different from that theappropriate error can be supplied toserver has previously recorded theclient. 8.1.4. Use ofclient (as identified by thestateid All READ and WRITE operations contain a stateid. Ifsecond field f thenfs_lockowner performs a READ or WRITE on a range of bytes within a locked range,structure, id) does thestateid (previously returned byserver start theserver) must be used to indicate thatprocess of cancelling theappropriate lock (record or share) is held. 
If no stateclient's leased state. The second field, id isestablished by the client, either record lock or share lock,astateid of all bits 0 is used. If no conflicting locksvariable length string that uniquely defines the client. There areheld onseveral considerations for how thefile,client generates theserver may serviceid string: o The string should be unique so that multiple clients do not present theREAD or WRITE operation. If a conflict with an explicit lock occurs,same string. The consequences of two clients presenting the same string range from one client getting an erroris returned for the operation (NFS4ERR_LOCKED). This allows "mandatory locking"to one client having its leased state abruptly and unexpectedly cancelled. o The string should beimplemented. A stateidselected so the subsequent incarnations (e.g. reboots) ofall bits 1 (one) allows READ operations to bypass record locking checks attheserver. However, WRITE operations with stateid with bits all 1 (one) do not bypass record locking checks. File locking checks are handled bysame client cause theOPEN operation (seeclient to present thesection "OPEN/CLOSE Operations"). An explicit lock may notsame string. The implementor is cautioned from an approach that requires the string to begranted whilerecorded in aREAD or WRITE operation with conflicting implicit locking is being performed. 8.1.5. Sequencinglocal file because this precludes the use ofLock Requests Lockingthe implementation in an environment where there is no local disk and all file access is from an NFS version 4 server. o The string should be different for each server network address that the client accesses, rather thanmost NFS operations as it requires "at- most-one" semanticscommon to all server network addresses. The reason is thatareit may notprovided by ONCRPC. ONCRPC over abe possible for the client to tell if same server is listening on multiple network addresses. 
      If the client issues SETCLIENTID with the same id string to each network address of such a server, the server will think it is the same client, and each successive SETCLIENTID will cause the server to begin the process of removing the client's previous leased state.

   o  The algorithm for generating the string should not assume that the client's network address will not change. This includes changes between client incarnations and even changes while the client is still running in its current incarnation. This means that if the client includes just the client's and server's network address in the id string, there is a real risk, after the client gives up the network address, that another client, using a similar algorithm to generate the id string, will generate a conflicting id string.

   Given the above considerations, an example of a well generated id string is one that includes:

   o  The server's network address.

   o  The client's network address.

   o  For a user level NFS version 4 client, it should contain additional information to distinguish the client from other user level clients running on the same host, such as a process id or other unique sequence.

   o  Additional information that tends to be unique, such as one or more of:

      -  The client machine's serial number (for privacy reasons, it is best to perform some one way function on the serial number).

      -  A MAC address.

      -  The timestamp of when the NFS version 4 software was first installed on the client (though this is subject to the previously mentioned caution about using information that is stored in a file, because the file might only be accessible over NFS version 4).

      -  A true random number. However, since this number ought to be the same between client incarnations, this shares the same problem as that of using the timestamp of the software installation.

   As a security measure, the server MUST NOT cancel a client's leased state if the principal that established the state for a given id string is not the same as the principal issuing the SETCLIENTID.

   Note that SETCLIENTID and SETCLIENTID_CONFIRM have a secondary purpose of establishing the information the server needs to make callbacks to the client for the purpose of supporting delegations. It is permitted to change this information via SETCLIENTID and SETCLIENTID_CONFIRM within the same incarnation of the client without removing the client's leased state.

   Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully completed, the client uses the shorthand client identifier, of type clientid4, instead of the longer and less compact nfs_client_id4 structure. This shorthand client identifier (a clientid) is assigned by the server and should be chosen so that it will not conflict with a clientid previously assigned by the server. This applies across server restarts or reboots. When a clientid is presented to a server and that clientid is not recognized, as would happen after a server reboot, the server will reject the request with the error NFS4ERR_STALE_CLIENTID. When this happens, the client must obtain a new clientid by the use of the SETCLIENTID operation and then proceed to any other necessary recovery for the server reboot case (see the section "Server Failure and Recovery").

   The client must also employ the SETCLIENTID operation when it receives a NFS4ERR_STALE_STATEID error using a stateid derived from its current clientid, since this also indicates a server reboot which has invalidated the existing clientid (see the section "lock_owner and stateid Definition" for details).

   See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM for a complete specification of the operations.
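   The id string considerations listed for the Client ID can be illustrated with a short sketch. It is not part of the protocol: the function name make_client_id_string, its parameters, the "/" separator, and the choice of SHA-256 as the one way function are all assumptions made for illustration, and the value 1024 for NFS4_OPAQUE_LIMIT is likewise assumed here.

```python
import hashlib

NFS4_OPAQUE_LIMIT = 1024  # assumed size limit for the opaque id field


def make_client_id_string(server_addr, client_addr, machine_serial, pid=None):
    """Build an id string from inputs that stay the same across client reboots.

    All names and the layout are hypothetical. The serial number is passed
    through a one way function (SHA-256 here) for privacy reasons.
    """
    hashed_serial = hashlib.sha256(machine_serial.encode("utf-8")).hexdigest()
    # Including the server address makes the string differ per server address.
    parts = [server_addr, client_addr, hashed_serial]
    if pid is not None:
        # A user level client adds something that distinguishes it from
        # other user level clients on the same host.
        parts.append("pid=%d" % pid)
    id_string = "/".join(parts).encode("utf-8")
    if len(id_string) > NFS4_OPAQUE_LIMIT:
        raise ValueError("id string exceeds NFS4_OPAQUE_LIMIT")
    return id_string
```

   Because nothing in the string is read from a local file or generated randomly at start-up, a rebooted client would present the same string, while the hashed serial number reduces the risk of a collision if a network address is later reused by another client.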
8.1.2. Server Release of Clientid

   If the server determines that the client holds no associated state for its clientid, the server may choose to release the clientid. The server may make this choice for an inactive client so that resources are not consumed by those intermittently active clients. If the client contacts the server after this release, the server must ensure the client receives the appropriate error so that it will use the SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. It should be clear that the server must be very hesitant to release a clientid since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically a server would not release a clientid unless there had been no activity from that client for many minutes.

   Note that if the id string in a SETCLIENTID request is properly constructed, and if the client takes care to use the same principal for each successive use of SETCLIENTID, then, barring an active denial of service attack, NFS4ERR_CLID_INUSE should never be returned.

   However, client bugs, server bugs, or perhaps a deliberate change of the principal owner of the id string (such as the case of a client that changes security flavors, and under the new flavor, there is no mapping to the previous owner) will in rare cases result in NFS4ERR_CLID_INUSE.

   In that event, when the server gets a SETCLIENTID for a client id that currently has no state, or it has state, but the lease has expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST allow the SETCLIENTID, and confirm the new clientid if followed by the appropriate SETCLIENTID_CONFIRM.

8.1.3. lock_owner and stateid Definition

   When requesting a lock, the client must present to the server the clientid and an identifier for the owner of the requested lock. These two fields are referred to as the lock_owner and the definitions of those fields are:

   o  A clientid returned by the server as part of the client's use of the SETCLIENTID operation.

   o  A variable length opaque array used to uniquely define the owner of a lock managed by the client.

      This may be a thread id, process id, or other unique value.

   When the server grants the lock, it responds with a unique stateid. The stateid is used as a shorthand reference to the lock_owner, since the server will be maintaining the correspondence between them.

   The server is free to form the stateid in any manner that it chooses as long as it is able to recognize invalid and out-of-date stateids. This requirement includes those stateids generated by earlier instances of the server. From this, the client can be properly notified of a server restart. This notification will occur when the client presents a stateid to the server from a previous instantiation.

   The server must be able to distinguish the following situations and return the error as specified:

   o  The stateid was generated by an earlier server instance (i.e. before a server reboot). The error NFS4ERR_STALE_STATEID should be returned.

   o  The stateid was generated by the current server instance but the stateid no longer designates the current locking state for the lockowner-file pair in question (i.e. one or more locking operations have occurred). The error NFS4ERR_OLD_STATEID should be returned.

      This error condition will only occur when the client issues a locking request which changes a stateid while an I/O request that uses that stateid is outstanding.

   o  The stateid was generated by the current server instance but the stateid does not designate a locking state for any active lockowner-file pair. The error NFS4ERR_BAD_STATEID should be returned.

      This error condition will occur when there has been a logic error on the part of the client or server. This should not happen.

   One mechanism that may be used to satisfy these requirements is for the server to:

   o  divide the "other" field of each stateid into two fields:

      -  A server verifier which uniquely designates a particular server instantiation.

      -  An index into a table of locking-state structures.

   o  utilize the "seqid" field of each stateid, such that seqid is monotonically incremented for each stateid that is associated with the same index into the locking-state table.

   By matching the incoming stateid and its field values with the state held at the server, the server is able to easily determine if a stateid is valid for its current instantiation and state. If the stateid is not valid, the appropriate error can be supplied to the client.

8.1.4. Use of the stateid and Locking

   All READ, WRITE and SETATTR operations contain a stateid. For the purposes of this section, SETATTR operations which change the size attribute of a file are treated as if they are writing the area between the old and new size (i.e. the range truncated or added to the file by means of the SETATTR), even where SETATTR is not explicitly mentioned in the text.
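   The stateid mechanism described in the section "lock_owner and stateid Definition" (a server verifier and a table index carried in the "other" field, plus a per-index seqid) can be sketched as follows. This is one possible reading under stated assumptions, not a prescribed implementation: the names Stateid and StateTable, the field layout, and the starting seqid of 1 are hypothetical, and seqid wraparound is ignored.

```python
from collections import namedtuple

# Hypothetical layout: "verifier" and "index" together stand in for the
# stateid's "other" field; "seqid" is the stateid's sequence value.
Stateid = namedtuple("Stateid", ["seqid", "verifier", "index"])


class StateTable:
    """Per-server-instance table of locking state, keyed by index."""

    def __init__(self, boot_verifier):
        self.boot_verifier = boot_verifier  # unique per server instantiation
        self.current_seqid = {}             # index -> current seqid
        self.next_index = 0

    def new_state(self):
        index = self.next_index
        self.next_index += 1
        self.current_seqid[index] = 1
        return Stateid(1, self.boot_verifier, index)

    def update_state(self, index):
        # A locking operation changed the state: bump seqid for this index.
        self.current_seqid[index] += 1
        return Stateid(self.current_seqid[index], self.boot_verifier, index)

    def check(self, stateid):
        if stateid.verifier != self.boot_verifier:
            return "NFS4ERR_STALE_STATEID"   # from an earlier server instance
        current = self.current_seqid.get(stateid.index)
        if current is None:
            return "NFS4ERR_BAD_STATEID"     # no such active locking state
        if stateid.seqid < current:
            return "NFS4ERR_OLD_STATEID"     # superseded by a later operation
        if stateid.seqid > current:
            return "NFS4ERR_BAD_STATEID"     # never issued by this server
        return "NFS4_OK"
```

   The order of the checks matters: the verifier is compared first so that a stateid issued before a reboot is reported as stale even if its index happens to match an entry in the new table.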
   If the lock_owner performs a READ or WRITE in a situation in which it has established a lock or share reservation on the server (any OPEN constitutes a share reservation), the stateid (previously returned by the server) must be used to indicate what locks, including both record locks and share reservations, are held by the lockowner. If no state is established by the client, either record lock or share reservation, a stateid of all bits 0 is used. Regardless of whether a stateid of all bits 0 or a stateid returned by the server is used, if there is a conflicting share reservation or mandatory record lock held on the file, the server MUST refuse to service the READ or WRITE operation.

   Share reservations are established by OPEN operations and by their nature are mandatory in that when the OPEN denies READ or WRITE operations, that denial results in such operations being rejected with error NFS4ERR_LOCKED. Record locks may be implemented by the server as either mandatory or advisory, or the choice of mandatory or advisory behavior may be determined by the server on the basis of the file being accessed (for example, some UNIX-based servers support a "mandatory lock bit" on the mode attribute such that if set, record locks are required on the file before I/O is possible). When record locks are advisory, they only prevent the granting of conflicting lock requests and have no effect on READs or WRITEs. Mandatory record locks, however, prevent conflicting I/O operations. When such operations are attempted, they are rejected with NFS4ERR_LOCKED. Assuming an operating environment like UNIX that requires it, when the client gets NFS4ERR_LOCKED on a file it knows it has the proper share reservation for, it will need to issue a LOCK request on the region of the file that includes the region the I/O was to be performed on, with an appropriate locktype (i.e. READ*_LT for a READ operation, WRITE*_LT for a WRITE operation).

   With NFS version 3, there was no notion of a stateid so there was no way to tell if the application process of the client sending the READ or WRITE operation had also acquired the appropriate record lock on the file. Thus there was no way to implement mandatory locking. With the stateid construct, this barrier has been removed.

   Note that for UNIX environments that support mandatory file locking, the distinction between advisory and mandatory locking is subtle. In fact, advisory and mandatory record locks are exactly the same insofar as the APIs and requirements on implementation are concerned. If the mandatory lock attribute is set on the file, the server checks to see if the lockowner has an appropriate shared (read) or exclusive (write) record lock on the region it wishes to read or write to. If there is no appropriate lock, the server checks if there is a conflicting lock (which can be done by attempting to acquire the conflicting lock on the behalf of the lockowner, and if successful, releasing the lock after the READ or WRITE is done), and if there is, the server returns NFS4ERR_LOCKED.

   For Windows environments, there are no advisory record locks, so the server always checks for record locks during I/O requests.
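   The record lock check that a server applies to READ and WRITE, as described above, can be sketched as follows. This is a simplified illustration under stated assumptions: the RecordLock tuple, the check_io function, and the half-open byte ranges are hypothetical, and share reservations and the special stateids are not modelled.

```python
from collections import namedtuple

# Hypothetical record lock: byte range [start, end), shared or exclusive.
RecordLock = namedtuple("RecordLock", ["owner", "start", "end", "exclusive"])


def overlaps(lock, start, end):
    return lock.start < end and start < lock.end


def covers(lock, start, end):
    return lock.start <= start and end <= lock.end


def check_io(locks, owner, start, end, is_write, mandatory):
    """Return the status a server might give a READ or WRITE on [start, end)."""
    if not mandatory:
        # Advisory record locks have no effect on READ or WRITE.
        return "NFS4_OK"
    # Step 1: does the lockowner already hold an appropriate lock?
    for lock in locks:
        appropriate = lock.exclusive or not is_write
        if lock.owner == owner and covers(lock, start, end) and appropriate:
            return "NFS4_OK"
    # Step 2: otherwise, would acquiring the needed lock conflict with a
    # lock held by another owner? A WRITE conflicts with any overlapping
    # lock, a READ only with an overlapping exclusive lock.
    for lock in locks:
        if lock.owner != owner and overlaps(lock, start, end):
            if is_write or lock.exclusive:
                return "NFS4ERR_LOCKED"
    return "NFS4_OK"
```

   In a Windows-style environment the same function would simply always be called with mandatory set to true, which matches the observation that the distinction is made in READ and WRITE processing rather than in the lock request itself.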
   Thus, the NFS version 4 LOCK operation does not need to distinguish between advisory and mandatory record locks. It is the NFS version 4 server's processing of the READ and WRITE operations that introduces the distinction.

   Every stateid other than the special stateid values noted in this section, whether returned by an OPEN-type operation (i.e. OPEN, OPEN_DOWNGRADE), or by a LOCK-type operation (i.e. LOCK or LOCKU), defines an access mode for the file (i.e. READ, WRITE, or READ-WRITE) as established by the original OPEN which began the stateid sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs within that stateid sequence. When a READ, WRITE, or SETATTR which specifies the size attribute is done, the operation is subject to checking against the access mode to verify that the operation is appropriate given the OPEN with which the operation is associated.

   In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which set size), the server must verify that the access mode allows writing and return an NFS4ERR_OPENMODE error if it does not. In the case of READ, the server may perform the corresponding check on the access mode, or it may choose to allow READ on opens for WRITE only, to accommodate clients whose write implementation may unavoidably do reads (e.g. due to buffer cache constraints). However, even if READs are allowed in these circumstances, the server MUST still check for locks that conflict with the READ (e.g. another open specifying denial of READs). Note that a server which does enforce the access mode check on READs need not explicitly check for conflicting share reservations since the existence of OPEN for read access guarantees that no conflicting share reservation can exist.

   A stateid of all bits 1 (one) MAY allow READ operations to bypass locking checks at the server. However, WRITE operations with a stateid with bits all 1 (one) MUST NOT bypass locking checks and are treated exactly the same as if a stateid of all bits 0 were used.

   A lock may not be granted while a READ or WRITE operation using one of the special stateids is being performed and the range of the lock request conflicts with the range of the READ or WRITE operation. For the purposes of this paragraph, a conflict occurs when a shared lock is requested and a WRITE operation is being performed, or an exclusive lock is requested and either a READ or a WRITE operation is being performed. A SETATTR that sets size is treated similarly to a WRITE as discussed above.

8.1.5. Sequencing of Lock Requests

   Locking is different than most NFS operations as it requires "at-most-one" semantics that are not provided by ONCRPC. ONCRPC over a reliable transport is not sufficient because a sequence of locking requests may span multiple TCP connections. In the face of retransmission or reordering, lock or unlock requests must have a well defined and consistent behavior. To accomplish this, each lock request contains a sequence number that is a consecutively increasing integer. Different lock_owners have different sequences. The server maintains the last sequence number (L) received and the response that was returned. The first request issued for any given lock_owner is issued with a sequence number of zero.
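   The sequencing rule introduced above, together with the replay and out-of-order cases that the remainder of this section spells out, can be sketched as a small per-lock_owner state machine. This is an illustration only: the class LockOwnerSequence and its handle method are hypothetical names, the first sequence number is taken to be zero as stated above, and the error cases in which the sequence number is not advanced are not modelled.

```python
SEQID_MODULUS = 2 ** 32  # sequence numbers are unsigned 32-bit values


class LockOwnerSequence:
    """Tracks the last sequence number (L) and response for one lock_owner."""

    def __init__(self):
        self.last_seqid = None      # no request seen yet for this lock_owner
        self.last_response = None

    def handle(self, seqid, execute):
        """Process a request carrying `seqid`; `execute` performs the operation."""
        if self.last_seqid is None:
            expected = 0            # first request for a lock_owner
        else:
            if seqid == self.last_seqid:
                # Duplicate of the last request: replay the stored response.
                return self.last_response
            expected = (self.last_seqid + 1) % SEQID_MODULUS
        if seqid != expected:
            # Older than L, or beyond the next sequence number.
            return "NFS4ERR_BAD_SEQID"
        response = execute()
        self.last_seqid = seqid
        self.last_response = response
        return response
```

   Keeping the last response alongside the last sequence number is what allows a retransmitted request to be answered without executing the locking operation a second time.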
Expires:May 2002February 2003 [Page64]71] Draft Specification NFS version 4 ProtocolNovember 2001 the client will suitably notify the applicationAugust 2002 Note thatheld the lock. Asfor requests that contain acourtesy to the client or as an optimization, the server may continue to hold locks on behalf of a clientsequence number, forwhich recent communication has extended beyond the lease period.each lock_owner, there should be no more than one outstanding request. Ifthe server receivesalock or I/Orequestthat conflicts(r) withonea previous sequence number (r < L) is received, it is rejected with the return ofthese courtesy locks,error NFS4ERR_BAD_SEQID. Given a properly-functioning client, theserverresponse to (r) mustfree the courtesy lock and granthave been received before thenew request.last request (L) was sent. If a duplicate of last request (r == L) is received, theserver continues to hold locksstored response is returned. If a request beyond theexpiration of a client's lease,next sequence (r == L + 2) is received, it is rejected with theserver MUST employ a methodreturn ofrecording this fact in its stable storage. Conflicting locks requests from anothererror NFS4ERR_BAD_SEQID. Sequence history is reinitialized whenever the SETCLIENTID/SETCLIENTID_CONFIRM sequence changes the clientmay be serviced afterverifier. Since thelease expiration. There are various scenarios involving server failure after suchsequence number is represented with anevent that requireunsigned 32-bit integer, thestorage of these lease expirations or network partitions. One scenarioarithmetic involved with the sequence number isas follows: A client holds a lock atmod 2^32. It is critical the serverand encounters a network partition and is unablemaintain the last response sent torenewtheassociated lease. A secondclientobtainsto provide aconflicting lock and then freesmore reliable cache of duplicate non-idempotent requests than that of thelock. 
Aftertraditional cache described in [Juszczak]. The traditional duplicate request cache uses a least recently used algorithm for removing unneeded requests. However, theunlocklast lock requestbyand response on a given lock_owner must be cached as long as thesecond client,lock state exists on theserver reboots or reinitializes. Onceserver. The client MUST monotonically increment theserver recovers,sequence number for thenetwork partition healsCLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE operations. This is true even in theoriginal client attempts to reclaimevent that theoriginal lock. Inprevious operation that used the sequence number received an error. The only exception to thisscenario and without any state information,rule is if theserver will allowprevious operation received one of thereclaim andfollowing errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID. 8.1.6. Recovery from Replayed Requests As described above, theclient will be in an inconsistent state becausesequence number is per lock_owner. As long as the serverormaintains theclient haslast sequence number received and follows the methods described above, there are noknowledgerisks ofthe conflicting lock.a Byzantine router re-sending old requests. The servermay choose to store this lease expiration or network partitioning state in a way that willneed onlyidentifymaintain theclient(lock_owner, sequence number) state asa whole. Note that this may potentially lead to lock reclaims being denied unnecessarily because of a mix of conflictinglong as there are open files or closed files with locks outstanding. LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, andnon- conflicting locks. The server may also choose to store information aboutCLOSE eachlock that has an expired lease with an associated conflicting lock. 
contain a sequence number and therefore the risk of the replay of these operations resulting in undesired effects is non-existent while the server maintains the lock_owner state.

8.1.7. Releasing lock_owner State

When a particular lock_owner no longer holds open or file locking state at the server, the server may choose to release the sequence number state associated with the lock_owner. The server may make this choice based on lease expiration, for the reclamation of server memory, or other implementation specific details. In any event, the server is able to do this safely only when the lock_owner no longer is being utilized by the client. The server may choose to hold the lock_owner state in the event that retransmitted requests are received. However, the period to hold this state is implementation specific.

In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is
retransmitted after the server has previously released the lock_owner state, the server will find that the lock_owner has no files open and an error will be returned to the client. If the lock_owner does have a file open, the stateid will not match and again an error is returned to the client.

8.1.8. Use of Open Confirmation

In the case that an OPEN is retransmitted and the lock_owner is being used for the first time or the lock_owner state has been previously released by the server, the use of the OPEN_CONFIRM operation will prevent incorrect behavior. When the server observes the use of the lock_owner for the first time, it will direct the client to perform the OPEN_CONFIRM for the corresponding OPEN. This sequence establishes the use of a lock_owner and associated sequence number. Since the OPEN_CONFIRM sequence connects a new open_owner on the server
with an existing open_owner on a client, the sequence number may have any value. The OPEN_CONFIRM step assures the server that the value received is the correct one. See the section "OPEN_CONFIRM - Confirm Open" for further details.

There are a number of situations in which the requirement to confirm an OPEN would pose difficulties for the client and server, in that they would be prevented from acting in a timely fashion on information received, because that information would be provisional, subject to deletion upon non-confirmation. Fortunately, these are situations in which the server can avoid the need for confirmation when responding to open requests. The two constraints are:

o The server must not bestow a delegation for any open which would require confirmation.

o The server MUST NOT require confirmation on a reclaim-type open (i.e. one specifying claim type CLAIM_PREVIOUS or CLAIM_DELEGATE_PREV).
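As an illustration only, and not as part of the protocol definition, the sequence number rules and the two confirmation constraints above might be combined on the server roughly as in the following Python sketch. The names classify_seqid, decide_open, OwnerState and OpenDecision are hypothetical, the string status values stand in for real protocol status codes, and storing and replaying the cached response is left out.

```python
from dataclasses import dataclass
from typing import Dict

SEQID_MODULUS = 2 ** 32  # sequence numbers are unsigned 32-bit values
RECLAIM_CLAIMS = frozenset({"CLAIM_PREVIOUS", "CLAIM_DELEGATE_PREV"})


def classify_seqid(last: int, received: int) -> str:
    """Compare a received seqid (r) with the last one accepted (L), mod 2^32."""
    if received == last:
        return "replay"             # r == L: return the stored response
    if received == (last + 1) % SEQID_MODULUS:
        return "next"               # the only acceptable new request
    return "NFS4ERR_BAD_SEQID"      # older than L, or beyond the next value


@dataclass
class OwnerState:
    """Hypothetical per-lock_owner record kept by the server."""
    last_seqid: int
    confirmed: bool


@dataclass
class OpenDecision:
    status: str          # "ok", "replay", or "NFS4ERR_BAD_SEQID"
    needs_confirm: bool  # server will ask for OPEN_CONFIRM
    may_delegate: bool   # a delegation is not ruled out by the constraints


def decide_open(owners: Dict[bytes, OwnerState], owner: bytes,
                seqid: int, claim: str) -> OpenDecision:
    state = owners.get(owner)
    if state is None:
        # First use of this owner: any seqid is accepted as the starting
        # point. Reclaim-type opens are never made subject to confirmation.
        reclaim = claim in RECLAIM_CLAIMS
        owners[owner] = OwnerState(last_seqid=seqid, confirmed=reclaim)
        return OpenDecision("ok", needs_confirm=not reclaim,
                            may_delegate=reclaim)
    verdict = classify_seqid(state.last_seqid, seqid)
    if verdict != "next":
        return OpenDecision(verdict, needs_confirm=False, may_delegate=False)
    state.last_seqid = seqid
    needs_confirm = not state.confirmed
    # No delegation may accompany an open that still requires confirmation.
    return OpenDecision("ok", needs_confirm, may_delegate=not needs_confirm)
```

In this sketch may_delegate only records that the first constraint does not forbid a delegation; whether one is actually granted depends on the claim type and on server policy.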
These constraints are related in that reclaim-type opens are the only ones in which the server may be required to send a delegation. For CLAIM_NULL, sending the delegation is optional while for CLAIM_DELEGATE_CUR, no delegation is sent.

Delegations being sent with an open requiring confirmation are troublesome because recovering from non-confirmation adds undue complexity to the protocol while requiring confirmation on reclaim-type opens poses difficulties in that the inability to resolve the status of the reclaim until lease expiration may make it difficult to have timely determination of the set of locks being reclaimed (since the grace period may expire).

Requiring open confirmation on reclaim-type opens is avoidable because of the nature of the environments in which such opens are done. For CLAIM_PREVIOUS opens, this is immediately after server reboot, so there should be no time for lockowners to be created, found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we are dealing with a client reboot situation.
A server which supports delegation can be sure that no lockowners for that client have been recycled since client initialization and thus can ensure that confirmation will not be required.

8.2. Lock Ranges

The protocol allows a lock owner to request a lock with a byte range and then either upgrade or unlock a sub-range of the initial lock. It is expected that this will be an uncommon type of request. In any case, servers or server filesystems may not be able to support sub-range lock semantics. In the event that a server receives a locking request that represents a sub-range of current locking state for the lock owner, the server is allowed to return the error NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock operations. Therefore, the client should be prepared to receive this error and, if appropriate, report the error to the requesting application.

The client is discouraged from combining multiple independent locking ranges that happen to be adjacent into a single request since the server may not support sub-range requests and for reasons related to the recovery of file locking state in the event of server failure. As discussed in the section "Server Failure and Recovery" below, the server may employ certain optimizations during recovery that work effectively only when the client's behavior during lock recovery is similar to the client's locking behavior prior to
server failure.

8.3. Upgrading and Downgrading Locks

If a client has a write lock on a record, it can request an atomic downgrade of the lock to a read lock via the LOCK request, by setting the type to READ_LT. If the server supports atomic downgrade, the request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this error, and if appropriate, report the error to the requesting application.

If a client has a read lock on a record, it can request an atomic upgrade of the lock to a write lock via the LOCK request by setting the type to WRITE_LT or WRITEW_LT. If the server does not support atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade can be achieved without an existing conflict, the request will succeed.
Otherwise, the server will return either NFS4ERR_DENIED or NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the client issued the LOCK request with the type set to WRITEW_LT and the server has detected a deadlock. The client should be prepared to receive such errors and if appropriate, report the error to the requesting application.

8.4. Blocking Locks

Some clients require the support of blocking locks. The NFS version 4 protocol must not rely on a callback mechanism and therefore is unable to notify a client when a previously denied lock has been granted. Clients have no choice but to continually poll for the lock. This presents a fairness problem. Two new lock types are added, READW and WRITEW, and are used to indicate to the server that the client is requesting a blocking lock. The server should maintain an ordered list of pending blocking locks.
When the conflicting lock is released, the server may wait the lease period for the first waiting client to re-request the lock. After the lease period expires the next waiting client request is allowed the lock. Clients are required to poll at an interval sufficiently small that it is likely to acquire the lock in a timely manner. The server is not required to maintain a list of pending blocked locks as it is used to increase fairness and not correct operation. Because of the unordered nature of crash recovery, storing of lock state to stable storage would be required to guarantee ordered granting of blocking locks.

Servers may also note the lock types and delay returning denial of the request to allow extra time for a conflicting lock to be released, allowing a successful return. In this way, clients can avoid the burden of needlessly frequent polling for blocking locks. The server should take care in the length of delay in the event the client retransmits the request.

8.5. Lease Renewal

The purpose of a lease is to allow a server to
remove stale locks that are held by a client that has crashed or is otherwise unreachable. It is not a mechanism for cache consistency and lease renewals may not be denied if the lease interval has not expired.

The following events cause implicit renewal of all of the leases for a given client (i.e. all those sharing a given clientid). Each of these is a positive indication that the client is still active and that the associated state held at the server, for the client, is still valid.

o An OPEN with a valid clientid.

o Any operation made with a valid stateid (CLOSE, DELEGPURGE, DELEGRETURN, LOCK, LOCKU, OPEN, OPEN_CONFIRM, OPEN_DOWNGRADE, READ, RENEW, SETATTR, WRITE). This does not include the special stateids of all bits 0 or all bits 1.

Note that if the client had restarted or rebooted, the client would not be making these requests without issuing the SETCLIENTID/SETCLIENTID_CONFIRM sequence. The use of the SETCLIENTID/SETCLIENTID_CONFIRM sequence (one that changes the client verifier) notifies the server to drop the locking state associated with the client. SETCLIENTID/SETCLIENTID_CONFIRM never renews a lease.

If the server has rebooted, the stateids (NFS4ERR_STALE_STATEID error) or the clientid (NFS4ERR_STALE_CLIENTID error) will not be valid hence preventing spurious renewals.

This approach allows for low overhead lease
renewal which scales well. In the typical case no extra RPC calls are required for lease renewal and in the worst case one RPC is required every lease period (i.e. a RENEW operation). The number of locks held by the client is not a factor since all state for the client is involved with the lease renewal action.

Since all operations that create a new lease also renew existing leases, the server must maintain a common lease expiration time for all valid leases for a given client. This lease time can then be easily updated upon implicit lease renewal actions.

8.6. Crash Recovery

The important requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts or reboots. All READ and
WRITE operations that may have been queued within the client or network buffers must wait until the client has successfully recovered the locks protecting the READ and WRITE operations.

8.6.1. Client Failure and Recovery

In the event that a client fails, the server may recover the client's locks when the associated leases have expired. Conflicting locks from another client may only be granted after this lease expiration. If the client is able to restart or reinitialize within the lease period the client may be forced to wait the remainder of the lease period before obtaining new locks.

To minimize client delay upon restart, lock requests are associated with an instance of the client by a client supplied verifier. This verifier is part of the initial SETCLIENTID call made by the client. The server returns a clientid as a result of the SETCLIENTID operation.
The client then confirms the use of the clientid with SETCLIENTID_CONFIRM. The clientid in combination with an opaque owner field is then used by the client to identify the lock owner for OPEN. This chain of associations is then used to identify all locks for a particular client.

Since the verifier will be changed by the client upon each initialization, the server can compare a new verifier to the verifier associated with currently held locks and determine that they do not match. This signifies the client's new instantiation and subsequent loss of locking state. As a result, the server is free to release all locks held which are associated with the old clientid which was derived from the old verifier.

Note that the verifier must have the same uniqueness properties of the verifier for the COMMIT operation.

8.6.2. Server Failure and Recovery

If the server loses locking state (usually as a result of
a restart or reboot), it must allow clients time to discover this fact and re-establish the lost locking state. The client must be able to re-establish the locking state without having the server deny valid requests because the server has granted conflicting access to another client. Likewise, if there is the possibility that clients have not yet re-established their locking state for a file, the server must disallow READ and WRITE operations for that file. The duration of this recovery period is equal to the duration of the lease period.

A client can determine that server failure (and thus loss of locking state) has occurred, when it receives one of two errors. The NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a
clientid invalidated by reboot or restart. When either of these are received, the client must establish a new clientid (See the section "Client ID") and re-establish the locking state as discussed below.

The period of special handling of locking and READs and WRITEs, equal in duration to the lease period, is referred to as the "grace period". During the grace period, clients recover locks and the associated state by reclaim-type locking requests (i.e. LOCK requests with reclaim set to true and OPEN operations with a claim type of CLAIM_PREVIOUS). During the grace period, the server must reject READ and WRITE operations and non-reclaim locking requests (i.e. other LOCK and OPEN operations) with an error of NFS4ERR_GRACE.

If the server can reliably determine that granting a
non-reclaim request will not conflict with reclamation of locks by other clients, the NFS4ERR_GRACE error does not have to be returned and the non-reclaim client request can be serviced. For the server to be able to service READ and WRITE operations during the grace period, it must again be able to guarantee that no possible conflict could arise between an impending reclaim locking request and the READ or WRITE operation. If the server is unable to offer that guarantee, the NFS4ERR_GRACE error must be returned to the client.

For a server to provide simple, valid handling during the grace period, the easiest method is to simply reject all non-reclaim locking requests and
READ and WRITE operations by returning the NFS4ERR_GRACE error. However, a server may keep information about granted locks in stable storage. With this information, the server could determine if a regular lock or READ or WRITE operation can be safely processed.

For example, if a count of locks on a given file is available in stable storage, the server can track reclaimed locks for the file and when all reclaims have been processed, non-reclaim locking requests may be processed. This way the server can ensure that
non-reclaim locking requests will not conflict with potential reclaim requests. With respect to I/O requests, if the server is able to determine that there are no outstanding reclaim requests for a file by information from stable storage or another similar mechanism, the processing of I/O requests could proceed normally for the file.

To reiterate, for a server that allows non-reclaim lock and I/O requests to be processed during the grace period, it MUST determine that no lock subsequently reclaimed will be rejected and that no lock subsequently reclaimed would have prevented any I/O operation processed during the grace period.

Clients should be prepared for the return of NFS4ERR_GRACE errors for non-reclaim lock and I/O requests. In this case the client should employ a retry mechanism for the request. A delay (on the order of several seconds) between retries should be used to avoid overwhelming the server. Further discussion of the general issue is included in [Floyd].
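The client retry behavior described above could be sketched as follows. This is a minimal Python illustration under stated assumptions, not a prescribed implementation: the helper name retry_during_grace, the five second default delay, and the attempt limit are all hypothetical choices, and send_request stands in for whatever routine issues the non-reclaim lock or I/O request and returns its status. The numeric value used for NFS4ERR_GRACE is the one assigned in the protocol's status code list.

```python
import time
from typing import Callable

NFS4_OK = 0
NFS4ERR_GRACE = 10013  # status code assigned by the protocol definition


def retry_during_grace(send_request: Callable[[], int],
                       delay_seconds: float = 5.0,   # assumed default
                       max_attempts: int = 20,       # assumed limit
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Re-send a non-reclaim request while the server reports its grace period.

    A pause of several seconds between attempts avoids overwhelming a
    server that has just restarted. The last status received is returned,
    whether that is success, another error, or NFS4ERR_GRACE once the
    attempt limit has been reached.
    """
    status = send_request()
    attempts = 1
    while status == NFS4ERR_GRACE and attempts < max_attempts:
        sleep(delay_seconds)
        status = send_request()
        attempts += 1
    return status
```

Passing the sleep function as a parameter is only a convenience for testing; a real client would more likely tie the delay to the lease period it has fetched from the server.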
The client must account for the server that is able to perform I/O
and non-reclaim locking requests within the grace period as well as
those that cannot do so.

A reclaim-type locking request outside the server's grace period can
only succeed if the server can guarantee that no conflicting lock or
I/O request has been granted since reboot or restart.

A server may, upon restart, establish a new value for the lease
period. Therefore, clients should, once a new clientid is
established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server. However,
the server must establish, for this restart event, a grace period at
least as long as the lease period for the previous server
instantiation. This allows the client state obtained during the
previous server instance to be reliably re-established.

8.6.3.  Network Partitions and Recovery

If the duration of a network partition is greater than the lease
period provided by the
server, the server will not have received a lease renewal from the
client. If this occurs, the server may free all locks held for the
client. As a result, all stateids held by the client will become
invalid or stale. Once the client is able to reach the server after
such a network partition, all I/O submitted by the client with the
now invalid stateids will fail with the server returning the error
NFS4ERR_EXPIRED. Once this error is received, the client will
suitably notify the application that held the lock.

As a courtesy to the client or as an optimization, the server may
continue to hold locks on behalf of a client for which recent
communication has extended beyond the lease period. If the server
receives a lock or I/O request that conflicts with one of
these courtesy locks, the server must free the courtesy lock and
grant the new request.

If the server continues to hold locks beyond the expiration of a
client's lease, the server MUST employ a method of recording this
fact in its stable storage. Conflicting lock requests from another
client may be serviced after the lease expiration. There are various
scenarios involving server failure after such an event that require
the storage of these lease expirations or network partitions. One
scenario is as follows:

   A client holds a lock at the server and encounters a network
   partition and is unable to renew the associated lease. A second
   client obtains a conflicting lock and then frees the lock. After
   the unlock request by the second client, the server reboots or
   reinitializes. Once the server recovers, the network partition
   heals and the original client attempts to reclaim the original
   lock.

In this scenario and without any state information, the server will
allow the reclaim and the client will be in an inconsistent state
because the server or the client has no knowledge of the conflicting
lock.
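The scenario above can be walked through with a small model. This is only a sketch under stated assumptions: the class and method names are hypothetical, stable storage is modelled as a Python set that outlives the server object, and the server records only the clientid whose courtesy lock was given away, which is one possible granularity among several.

```python
class LockServer:
    """Toy server that records revoked courtesy locks in stable storage."""

    def __init__(self, stable_storage):
        self.stable = stable_storage  # survives reboot: revoked clientids
        self.locks = {}               # volatile: filename -> clientid
        self.expired = set()          # volatile: clientids with expired leases

    def expire_lease(self, clientid):
        # The lease ran out; existing locks are kept as courtesy locks.
        self.expired.add(clientid)

    def lock(self, clientid, filename):
        holder = self.locks.get(filename)
        if holder is not None and holder != clientid:
            if holder not in self.expired:
                return False          # conflict with a lock under a live lease
            self.stable.add(holder)   # record the fact before granting
        self.locks[filename] = clientid
        return True

    def unlock(self, clientid, filename):
        if self.locks.get(filename) == clientid:
            del self.locks[filename]

    def reclaim(self, clientid, filename):
        # After reboot, only the stable record shows the lock was lost.
        if clientid in self.stable:
            return False
        self.locks[filename] = clientid
        return True
```

Creating a second LockServer with the same set stands in for a reboot: the volatile tables are empty, but the reclaim by the original client is still refused. With an empty set the same reclaim would be granted, which is the inconsistency the text describes.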
The server may choose to store this lease expiration or network
partitioning state in a way that will only identify the client as a
whole. Note that this may potentially lead to lock reclaims being
denied unnecessarily because of a mix of conflicting and
non-conflicting locks. The server may also choose to store
information about each lock that has an expired lease with an
associated conflicting lock. The choice of the amount and type of
state information that is stored is left to the implementor. In any
case, the server must have enough state information to enable correct
recovery from multiple partitions and multiple server failures.

For further discussion of revocation of locks see the section "Server
Revocation of Locks".

8.7.  Recovery from a Lock Request Timeout or Abort

In the event a lock request times out, a client may decide to not
retry the request.
The client may also abort the request when the process for which it
was issued is terminated (e.g. in UNIX due to a signal). It is
possible though that the server received the request and acted upon
it. This would change the state on the server without the client
being aware of the change. It is paramount that the client
re-synchronize state with the server before it attempts any other
operation that takes a seqid and/or a stateid with the same
lock_owner. This is straightforward to do without a special
re-synchronize operation.

Since the server maintains the last lock request and response
received on the lock_owner, for each lock_owner, the client should
cache the last lock request it sent such that the lock request did
not receive a response. From this, the next time the client does a
lock operation for the lock_owner, it can send the cached request, if
there is one, and if the request was one that established state (e.g.
a LOCK or OPEN
operation), the server will return the cached result or, if it never
saw the request, perform it. The client can follow up with a request
to remove the state (e.g. a LOCKU or CLOSE operation). With this
approach, the sequencing and stateid information on the client and
server for the given lock_owner will re-synchronize and in turn the
lock state will re-synchronize.

8.8.  Server Revocation of Locks

At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and
the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held.

The first instance of lock revocation is upon server reboot or
re-initialization. In this instance the client will receive
an error (NFS4ERR_STALE_STATEID or NFS4ERR_STALE_CLIENTID) and the
client will proceed with normal crash recovery as described in the
previous section.

The second lock revocation event is the inability to renew the lease
before expiration. While this is considered a rare or unusual event,
the client must be prepared to recover. Both the server and client
will be able to detect the failure to renew the lease and are capable
of recovering without data corruption. For the server, it tracks the
last renewal event serviced for the client and knows when the lease
will expire. Similarly, the client must track operations which will
renew the lease period. Using the time that each such request was
sent and the time that the corresponding reply was received, the
client should bound the time that the corresponding renewal could
have occurred on the server and thus determine if it is possible that
a lease period expiration could have occurred.

The third lock revocation event can occur as a result of
administrative intervention within the lease period. While this is
considered a rare event, it is possible that the server's
administrator has decided to release or revoke a particular lock held
by the client.
As a result of revocation, the client will receive an error of
NFS4ERR_EXPIRED and the error is received within the lease period for
the lock. In this instance the client may assume that only the
lock_owner's locks have been lost. The client notifies the lock
holder appropriately. The client may not assume the lease period has
been renewed as a result of a failed operation.

When the client determines the lease period may have expired, the
client must mark all locks held for the associated lease as
"unvalidated". This means the client has been unable to re-establish
or confirm the appropriate lock state with the server. As described
in the previous section on crash recovery, there are scenarios in
which the server may grant conflicting locks after the lease period
has expired for a client. When it is possible that the lease period
has expired, the client must validate each lock currently held to
ensure that a conflicting lock has not been granted. The client may
accomplish this task by issuing an I/O request, either a pending I/O
or a zero-length read, specifying the stateid associated with the
lock in question.
If the response to the request is success, the client has validated
all of the locks governed by that stateid and re-established the
appropriate state between itself and the server. If the I/O request
is not successful, then one or more of the locks associated with the
stateid was revoked by the server and the client must notify the
owner.

8.9.  Share Reservations

A share reservation is a mechanism to control access to a file. It is
a separate and independent mechanism from record locking. When a
client opens a file, it issues an OPEN operation to the server
specifying the type of access required (READ, WRITE, or BOTH) and the
type of access to deny others (deny NONE, READ, WRITE, or BOTH). If
the OPEN fails the client will fail the application's open request.
Pseudo-code definition of the semantics:

   if ((request.access & file_state.deny) ||
       (request.deny & file_state.access))
           return (NFS4ERR_DENIED)

This checking of share reservations on OPEN is done with no exception
for an existing OPEN for the same open_owner.

The constants used for the OPEN and OPEN_DOWNGRADE operations for the
access and deny fields are as follows:

   const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
   const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
   const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;

   const OPEN4_SHARE_DENY_NONE     = 0x00000000;
   const OPEN4_SHARE_DENY_READ     = 0x00000001;
   const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
   const OPEN4_SHARE_DENY_BOTH     = 0x00000003;

8.10.  OPEN/CLOSE Operations

To provide correct share semantics, a client MUST use the OPEN
operation to obtain the initial filehandle and indicate the desired
access and what if any access to deny. Even if the client intends to
use a stateid of all 0's or all 1's, it must still obtain the
filehandle for the regular file with the OPEN operation so the
appropriate share semantics can be applied. For clients that do not
have a deny mode built into their open programming interfaces, deny
equal to NONE should be used.
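The share reservation check given in pseudo-code for OPEN can be expressed as a small runnable function. This is a sketch for illustration: the function name and its four-argument form are assumptions, while the constant values are those listed for the access and deny fields.

```python
OPEN4_SHARE_ACCESS_READ = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH = 0x00000003

OPEN4_SHARE_DENY_NONE = 0x00000000
OPEN4_SHARE_DENY_READ = 0x00000001
OPEN4_SHARE_DENY_WRITE = 0x00000002
OPEN4_SHARE_DENY_BOTH = 0x00000003


def share_conflict(req_access, req_deny, file_access, file_deny):
    """Return True when an OPEN must fail with NFS4ERR_DENIED.

    The request conflicts if it asks for access that existing opens deny,
    or if it denies access that existing opens already hold.
    """
    return bool((req_access & file_deny) or (req_deny & file_access))
```

For example, if a file is already open for READ with deny WRITE, a new OPEN for WRITE conflicts, while a new OPEN for READ with deny NONE does not.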
The OPEN operation with the CREATE flag also subsumes the CREATE
operation for regular files as used in previous versions of the NFS
protocol. This allows a create with a share to be done atomically.

The CLOSE operation removes all share reservations held by the
lock_owner on that file. If record locks are held, the client SHOULD
release all locks before issuing a CLOSE. The server MAY free all
outstanding locks on CLOSE but some servers may not support the CLOSE
of a file that still has record locks held. The server MUST return
failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the
CLOSE.

The LOOKUP operation will return a filehandle without establishing
any lock state on the server. Without a valid stateid, the server
will assume the client has the least access. For example, a file
opened with deny READ/WRITE cannot be accessed using a filehandle
obtained through LOOKUP because it would not have a valid stateid
(i.e. using a stateid of all bits 0 or all bits 1).

8.10.1.  Close and Retention of State Information

Since a CLOSE operation requests deallocation of a stateid, dealing
with retransmission of the CLOSE may pose special difficulties, since
the state information, which normally would be used to determine the
state of the open file being designated, might be deallocated,
resulting in an NFS4ERR_BAD_STATEID error.

Servers may deal with this problem in a number of ways. To provide
the greatest degree of assurance that the protocol is being used
properly, a server should, rather than deallocate the stateid, mark
it as close-pending, and retain the stateid with this status, until
later deallocation. In this way, a retransmitted CLOSE can be
recognized since the stateid points to state information with this
distinctive status, so that it can be handled without error.

When adopting this strategy, a server should retain the state
information until the earliest of:

o  Another validly sequenced request for the same lockowner, that is
   not a retransmission.

o  The time that a lockowner is freed by the server due to a period
   with no activity.

o  All locks for the client are freed as a result of a SETCLIENTID.
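The close-pending strategy can be sketched as follows. This is an illustrative model only: the class and method names are hypothetical, status codes are modelled as strings, and the release() method stands in for whichever of the three conditions above occurs first.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_BAD_STATEID = "NFS4ERR_BAD_STATEID"


class OpenStateTable:
    """Toy table of open stateids that retains closed entries for a while."""

    def __init__(self):
        self._status = {}  # stateid -> "open" or "close-pending"

    def open(self, stateid):
        self._status[stateid] = "open"

    def close(self, stateid):
        status = self._status.get(stateid)
        if status is None:
            return NFS4ERR_BAD_STATEID
        # First CLOSE marks the entry; a retransmitted CLOSE finds the
        # close-pending entry and is answered without error.
        self._status[stateid] = "close-pending"
        return NFS4_OK

    def release(self, stateid):
        # Called once the retained state no longer needs to be kept.
        self._status.pop(stateid, None)
```

A CLOSE that arrives after release() is answered with NFS4ERR_BAD_STATEID in this model, which is the stricter of the two server behaviors.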
Servers may avoid this complexity, at the cost of less complete
protocol error checking, by simply responding NFS4_OK in the event of
a CLOSE for a deallocated stateid, on the assumption that this case
must be caused by a retransmitted close. When adopting this approach,
it is desirable to at least log an error when returning a no-error
indication in this situation. If the server maintains a reply-cache
mechanism, it can verify the CLOSE is indeed a retransmission and
avoid error logging in most cases.

8.11.  Open Upgrade and Downgrade

When an OPEN is done for a file and the lockowner for which the open
is being done already has the file open, the result is to upgrade the
open file status maintained on the server to include the access and
deny bits specified by the new OPEN as well as those for the existing
OPEN. The result is that there is one open file, as far as the
protocol is concerned, and it includes the union of the access and
deny bits for all of the OPEN requests completed.
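The union rule for open upgrade amounts to a bitwise OR over the access and deny fields. The sketch below is illustrative; the function name and the representation of an open as an (access, deny) tuple are assumptions, and the bit values follow the OPEN4_SHARE_* constants.

```python
def merged_open_state(opens):
    """Combine (access, deny) pairs from every completed OPEN.

    Returns the single (access, deny) pair the server would maintain
    for one lockowner and one filehandle.
    """
    access = 0
    deny = 0
    for open_access, open_deny in opens:
        access |= open_access
        deny |= open_deny
    return access, deny
```

For instance, an OPEN for READ with deny NONE (1, 0) followed by an OPEN for WRITE with deny WRITE (2, 2) yields access BOTH and deny WRITE, that is (3, 2).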
Only a single CLOSE will be done to reset the effects of both OPEN's.
Note that the client, when issuing the OPEN, may not know that the
same file is in fact being opened. The above only applies if both
OPEN's result in the OPEN'ed object being designated by the same
filehandle.

When the server chooses to export multiple filehandles corresponding
to the same file object and returns different filehandles on two
different OPEN's of the same file object, the server MUST NOT "OR"
together the access and deny bits and coalesce the two open files.
Instead the server must maintain separate OPEN's with separate
stateid's and will require separate CLOSE's to free them.

When multiple open files on the client are merged into a single open
file object on the server, the close of one of the open files (on the
client) may necessitate change of the access and deny status of the
open file on the server. This is because the union of the access and
deny bits for the remaining open's may be smaller (i.e. a proper
subset) than previously. The OPEN_DOWNGRADE operation is used to make
the necessary change and the client should use it to update the
server so that share reservation requests by other clients are
handled properly.

8.12.  Short and Long Leases

When determining the time period for the server lease, the usual
lease tradeoffs apply. Short leases are good for fast server recovery
at a cost of increased RENEW or READ (with zero length) requests.
Longer leases are certainly kinder and gentler to servers trying to
handle very large numbers of clients. The number of RENEW requests
drops in inverse proportion to the lease time. The disadvantages of
long leases are slower recovery after server failure (the server must
wait for leases to expire and for the grace period before granting
new lock requests) and increased file contention (if a client fails
to transmit an unlock request then the server must wait for lease
expiration before granting new locks).

Long leases are usable if the server is able to store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with
its clients and therefore long leases would not be an issue.

8.13.  Clocks, Propagation Delay, and Calculating Lease Expiration

To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a
requirement that the client and server clocks do not drift
excessively over the duration of the lock. There is also the issue of
propagation delay across the network which could easily be several
hundred milliseconds as well as the possibility that requests will be
lost and need to be retransmitted.

To take propagation delay into account, the client should subtract it
from lease times (e.g. if the client estimates the one-way
propagation delay as 200 msec, then it can assume that the lease is
already 200 msec old when it gets it). In addition, it will take
another 200 msec to get a response back to the server. So the client
must send a lock renewal or write data back to the server 400 msec
before the lease would expire.

The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into
account the network propagation delays and other network delay
factors for the
client population. Since the protocol does not allow for an automatic
method to determine an appropriate lease period, the server's
administrator may have to tune the lease period.

8.14.  Migration, Replication and State

When responsibility for handling a given filesystem is transferred to
a new server (migration) or the client chooses to use an alternate
server (e.g. in response to server unresponsiveness) in the context
of filesystem replication, the appropriate handling of state shared
between the client and server (i.e. locks, leases, stateid's, and
clientid's) is as described below. The handling differs between
migration and replication. For related discussion of file server
state and recovery of such see the sections under "File Locking and
Share Reservations".

If a server replica or a server immigrating a filesystem agrees to,
or is expected to, accept opaque values from the client that
originated from another server, then it is a wise implementation
practice for the servers to encode the "opaque" values in network
byte order. This way, servers acting as replicas or immigrating
filesystems will be able to parse values like stateids, directory
cookies, filehandles, etc. even if their native byte order is
different from other servers cooperating in the replication and
migration of the filesystem.

8.14.1.  Migration and State

In the case of migration, the servers involved in the migration of a
If clients were simply to assume thatfilesystem SHOULD transfer alldistinct filehandles denote distinct objects and proceed to do data caching on this basis, caching inconsistencies would arise betweenserver state from thedistinct client side objects which mappedoriginal to thesame server side object. By providingnew server. This must be done in amethodway that is transparent todifferentiate filehandles,theNFS version 4 protocol alleviatesclient. This state transfer will ease the client's transition when apotential functional regression in comparison withfilesystem migration occurs. If theNFS version 3 protocol. Without this method, caching inconsistencies withinservers are successful in transferring all state, thesameclientcould occur and this has not been present in previous versions of the NFS protocol. Note that it is possiblewill continue tohave such inconsistencies with applications executing on multiple clients but that is not the issue being addressed here. For the purposes of data caching,use stateid's assigned by thefollowing steps allow an NFS version 4 client to determine whether two distinct filehandles denoteoriginal server. Therefore thesamenew serverside object: o If GETATTR directed to two filehandles have different values of the fsid attribute, thenmust recognize these stateid's as valid. This holds true for thefilehandles represent distinct objects. o If GETATTRclientid as well. Since responsibility forany file withanfsidentire filesystem is transferred with a migration event, there is no possibility thatmatches the fsid ofconflicts will arise on thetwo filehandles in question returns a unique_handles attribute withnew server as avalueresult ofTRUE, thenthetwo objects aretransfer of Expires:May 2002February 2003 [Page79]85] Draft Specification NFS version 4 ProtocolNovember 2001 distinct. o If GETATTR directed to the two filehandles does not return the fileid attribute for one or bothAugust 2002 locks. 
As part of thehandles, then the it cannot be determined whether the two objects are the same. Therefore, operations which depend on that knowledge (e.g. client side data caching) cannottransfer of information between servers, leases would bedone reliably. o If GETATTR directedtransferred as well. The leases being transferred to thetwo filehandles returnsnew server will typically have a differentvaluesexpiration time from those for thefileid attribute, then they are distinct objects. o Otherwise they are thesameobject. 9.4. Open Delegation When a file is being OPENed, the server may delegate further handling of opens and closes for that file toclient, previously on theopening client. Any such delegation is recallable, sinceold server. To maintain thecircumstancesproperty thatallowedall leases on a given server for a given client expire at thedelegation are subject to change. In particular,same time, the servermay receive a conflicting OPEN from another client,should advance theserver must recallexpiration time to thedelegation before deciding whetherlater of theOPEN fromleases being transferred or the leases already present. This allows theotherclient to maintain lease renewal of both classes without special effort. The servers maybe granted. Making a delegation is upchoose not to transfer theserver and clients should not assume that any particular OPEN either will or will not result in an open delegation. The followingstate information upon migration. However, this choice isa typical set of conditions that servers might use in deciding whether OPEN should be delegated: o Thediscouraged. In this case, when the client presents state information from the original server, the client must beable to respondprepared to receive either NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from theserver's callback requests.new server. 
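The lease arithmetic in the -02 text can be sketched in a few lines: the client sends its renewal two one-way delays before the nominal expiry, and a server receiving migrated leases keeps the later of the two expiration times. This is only an illustrative sketch. The names Lease, renew_deadline_ms and merged_expiry_ms are hypothetical and do not appear in the draft, and the 90 second lease period used in the usage note is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical client-side record of a lease (not a protocol structure)."""
    received_at_ms: int  # client clock reading when the lease reply arrived
    duration_ms: int     # lease period granted by the server, as a time delta


def renew_deadline_ms(lease: Lease, one_way_delay_ms: int) -> int:
    """Latest client-clock time at which a renewal should be sent.

    The lease is assumed to be one_way_delay_ms old when it arrives, and the
    renewal needs another one_way_delay_ms to reach the server, so the
    client gives up twice the one-way delay.
    """
    return lease.received_at_ms + lease.duration_ms - 2 * one_way_delay_ms


def merged_expiry_ms(transferred_expiry_ms: int, existing_expiry_ms: int) -> int:
    """Expiration the new server applies to all of a client's leases after
    a migration: the later of the transferred and the already present one."""
    return max(transferred_expiry_ms, existing_expiry_ms)
```

With the draft's 200 msec example and an assumed 90 second lease received at time 0, renew_deadline_ms(Lease(0, 90_000), 200) gives 89_600, that is, 400 msec before the lease would expire.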
draft-ietf-nfsv4-rfc3010bis-00.txt:

The server will use the CB_NULL procedure for a test of callback ability.

o The client must have responded properly to previous recalls.

o There must be no current open conflicting with the requested delegation.

o There should be no current delegation that conflicts with the delegation being requested.

o The probability of future conflicting open requests should be low based on the recent history of the file.

o The existence of any server-specific semantics of OPEN/CLOSE that would make the required handling incompatible with the prescribed handling that the delegated client would apply (see below).

There are two types of open delegations, read and write. A read open delegation allows a client to handle, on its own, requests to open a file for reading that do not deny read access to others. Multiple read open delegations may be outstanding simultaneously and do not conflict. A write open delegation allows the client to handle, on its own, all opens. Only one write open delegation may exist for a given file at a given time and it is inconsistent with any read open delegations.

When a client has a read open delegation, it may not make any changes to the contents or attributes of the file but it is assured that no other client may do so. When a client has a write open delegation, it may modify the file data since no other client will be accessing the file's data. The client holding a write delegation may only affect file attributes which are intimately connected with the file data: object_size, time_modify, change.

When a client has an open delegation, it does not send OPENs or CLOSEs to the server but updates the appropriate status internally. For a read open delegation, opens that cannot be handled locally (opens for write or that deny read access) must be sent to the server.

When an open delegation is made, the response to the OPEN contains an open delegation structure which specifies the following:

o the type of delegation (read or write)

o space limitation information to control flushing of data on close (write open delegation only, see the section "Open Delegation and Data Caching")

o an nfsace4 specifying read and write permissions

o a stateid to represent the delegation for READ and WRITE

The stateid is separate and distinct from the stateid for the OPEN proper. The standard stateid, unlike the delegation stateid, is associated with a particular nfs_lockowner and will continue to be valid after the delegation is recalled and the file remains open.

When a request internal to the client is made to open a file and open delegation is in effect, it will be accepted or rejected solely on the basis of the following conditions. Any requirement for other checks

draft-ietf-nfsv4-rfc3010bis-02.txt:

The client should then recover its state information as it normally would in response to a server failure. The new server must take care to allow for the recovery of state information as it would in the event of server restart.

8.14.2. Replication and State

Since client switch-over in the case of replication is not under server control, the handling of state is different. In this case, leases, stateid's and clientid's do not have validity across a transition from one server to another. The client must re-establish its locks on the new server. This can be compared to the re-establishment of locks by means of reclaim-type requests after a server reboot. The difference is that the server has no provision to distinguish requests reclaiming locks from those obtaining new locks or to defer the latter. Thus, a client re-establishing a lock on the new server (by means of a LOCK or OPEN request), may have the requests denied due to a conflicting lock. Since replication is intended for read-only use of filesystems, such denial of locks should not pose large difficulties in practice. When an attempt to re-establish a lock on a new server is denied, the client should treat the situation as if his original lock had been revoked.

8.14.3. Notification of Migrated Lease

In the case of lease renewal, the client may not be submitting requests for a filesystem that has been migrated to another server. This can occur because of the implicit lease renewal mechanism. The client renews leases for all filesystems when submitting a request to any one filesystem at the server.

In order for the client to schedule renewal of leases that may have been relocated to the new server, the client must find out about lease relocation before those leases expire. To accomplish this, all operations which implicitly renew leases for a client (i.e. OPEN, CLOSE, READ, WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be renewed has been transferred to a new server. This condition will continue until the client receives an NFS4ERR_MOVED error and the server receives the subsequent GETATTR(fs_locations) for