NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: November 13, 2008                                       Editors
                                                            May 12, 2008


                     NFS Version 4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-23.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 13, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Abstract

   This Internet-Draft describes NFS version 4 minor version one,
   including features retained from the base protocol and protocol
   extensions made subsequently.  Major extensions introduced in NFS
   version 4 minor version one include: Sessions, Directory Delegations,
   and parallel NFS (pNFS).

Shepler, et al.         Expires November 13, 2008               [Page 1]
Internet-Draft                   NFSv4.1                        May 2008

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction
     1.1.  The NFS Version 4 Minor Version 1 Protocol
     1.2.  Scope of this Document
     1.3.  NFSv4 Goals
     1.4.  NFSv4.1 Goals
     1.5.  General Definitions
     1.6.  Overview of NFSv4.1 Features
       1.6.1.  RPC and Security
       1.6.2.  Protocol Structure
       1.6.3.  File System Model
       1.6.4.  Locking Facilities
     1.7.  Differences from NFSv4.0
   2.  Core Infrastructure
     2.1.  Introduction
     2.2.  RPC and XDR
       2.2.1.  RPC-based Security
     2.3.  COMPOUND and CB_COMPOUND
     2.4.  Client Identifiers and Client Owners
       2.4.1.  Upgrade from NFSv4.0 to NFSv4.1
       2.4.2.  Server Release of Client ID
       2.4.3.  Resolving Client Owner Conflicts
     2.5.  Server Owners
     2.6.  Security Service Negotiation
       2.6.1.  NFSv4.1 Security Tuples
       2.6.2.  SECINFO and SECINFO_NO_NAME
       2.6.3.  Security Error
     2.7.  Minor Versioning
     2.8.  Non-RPC-based Security Services
       2.8.1.  Authorization
       2.8.2.  Auditing
       2.8.3.  Intrusion Detection
     2.9.  Transport Layers
       2.9.1.  REQUIRED and RECOMMENDED Properties of Transports
       2.9.2.  Client and Server Transport Behavior
       2.9.3.  Ports
     2.10.  Session
       2.10.1.  Motivation and Overview
       2.10.2.  NFSv4 Integration
       2.10.3.  Channels
       2.10.4.  Trunking
       2.10.5.  Exactly Once Semantics
       2.10.6.  RDMA Considerations
       2.10.7.  Sessions Security
       2.10.8.  The SSV GSS Mechanism
       2.10.9.  Session Mechanics - Steady State
       2.10.10.  Session Inactivity Timer
       2.10.11.  Session Mechanics - Recovery
       2.10.12.  Parallel NFS and Sessions
   3.  Protocol Constants and Data Types
     3.1.  Basic Constants
     3.2.  Basic Data Types
     3.3.  Structured Data Types
   4.  Filehandles
     4.1.  Obtaining the First Filehandle
       4.1.1.  Root Filehandle
       4.1.2.  Public Filehandle
     4.2.  Filehandle Types
       4.2.1.  General Properties of a Filehandle
       4.2.2.  Persistent Filehandle
       4.2.3.  Volatile Filehandle
     4.3.  One Method of Constructing a Volatile Filehandle
     4.4.  Client Recovery from Filehandle Expiration
   5.  File Attributes
     5.1.  REQUIRED Attributes
     5.2.  RECOMMENDED Attributes
     5.3.  Named Attributes
     5.4.  Classification of Attributes
     5.5.  Set-Only and Get-Only Attributes
     5.6.  REQUIRED Attributes - List and Definition References
     5.7.  RECOMMENDED Attributes - List and Definition References
     5.8.  Attribute Definitions
       5.8.1.  Definitions of REQUIRED Attributes
       5.8.2.  Definitions of Uncategorized RECOMMENDED Attributes
     5.9.  Interpreting owner and owner_group
     5.10.  Character Case Attributes
     5.11.  Directory Notification Attributes
     5.12.  pNFS Attribute Definitions
     5.13.  Retention Attributes
   6.  Access Control Attributes
     6.1.  Goals
     6.2.  File Attributes Discussion
       6.2.1.  Attribute 12: acl
       6.2.2.  Attribute 58: dacl
       6.2.3.  Attribute 59: sacl
       6.2.4.  Attribute 33: mode
       6.2.5.  Attribute 74: mode_set_masked
     6.3.  Common Methods
       6.3.1.  Interpreting an ACL
       6.3.2.  Computing a Mode Attribute from an ACL
     6.4.  Requirements
       6.4.1.  Setting the mode and/or ACL Attributes
       6.4.2.  Retrieving the mode and/or ACL Attributes
       6.4.3.  Creating New Objects
   7.  Single-server Namespace
     7.1.  Server Exports
     7.2.  Browsing Exports
     7.3.  Server Pseudo File System
     7.4.  Multiple Roots
     7.5.  Filehandle Volatility
     7.6.  Exported Root
     7.7.  Mount Point Crossing
     7.8.  Security Policy and Namespace Presentation
   8.  State Management
     8.1.  Client and Session ID
     8.2.  Stateid Definition
       8.2.1.  Stateid Types
       8.2.2.  Stateid Structure
       8.2.3.  Special Stateids
       8.2.4.  Stateid Lifetime and Validation
       8.2.5.  Stateid Use for I/O Operations
       8.2.6.  Stateid Use for SETATTR Operations
     8.3.  Lease Renewal
     8.4.  Crash Recovery
       8.4.1.  Client Failure and Recovery
       8.4.2.  Server Failure and Recovery
       8.4.3.  Network Partitions and Recovery
     8.5.  Server Revocation of Locks
     8.6.  Short and Long Leases
     8.7.  Clocks, Propagation Delay, and Calculating Lease Expiration
     8.8.  Obsolete Locking Infrastructure From NFSv4.0
   9.  File Locking and Share Reservations
     9.1.  Opens and Byte-Range Locks
       9.1.1.  State-owner Definition
       9.1.2.  Use of the Stateid and Locking
     9.2.  Lock Ranges
     9.3.  Upgrading and Downgrading Locks
     9.4.  Stateid Seqid Values and Byte-Range Locks
     9.5.  Issues with Multiple Open-Owners
     9.6.  Blocking Locks
     9.7.  Share Reservations
     9.8.  OPEN/CLOSE Operations
     9.9.  Open Upgrade and Downgrade
     9.10.  Parallel OPENs
     9.11.  Reclaim of Open and Byte-Range Locks
   10.  Client-Side Caching
     10.1.  Performance Challenges for Client-Side Caching
     10.2.  Delegation and Callbacks
       10.2.1.  Delegation Recovery
     10.3.  Data Caching
       10.3.1.  Data Caching and OPENs
       10.3.2.  Data Caching and File Locking
       10.3.3.  Data Caching and Mandatory File Locking
       10.3.4.  Data Caching and File Identity
     10.4.  Open Delegation
       10.4.1.  Open Delegation and Data Caching
       10.4.2.  Open Delegation and File Locks
       10.4.3.  Handling of CB_GETATTR
       10.4.4.  Recall of Open Delegation
       10.4.5.  Clients that Fail to Honor Delegation Recalls
       10.4.6.  Delegation Revocation
       10.4.7.  Delegations via WANT_DELEGATION
     10.5.  Data Caching and Revocation
       10.5.1.  Revocation Recovery for Write Open Delegation
     10.6.  Attribute Caching
     10.7.  Data and Metadata Caching and Memory Mapped Files
     10.8.  Name and Directory Caching without Directory Delegations
       10.8.1.  Name Caching
       10.8.2.  Directory Caching
     10.9.  Directory Delegations
       10.9.1.  Introduction to Directory Delegations
       10.9.2.  Directory Delegation Design
       10.9.3.  Attributes in Support of Directory Notifications
       10.9.4.  Directory Delegation Recall
       10.9.5.  Directory Delegation Recovery
   11.  Multi-Server Namespace
     11.1.  Location Attributes
     11.2.  File System Presence or Absence
     11.3.  Getting Attributes for an Absent File System
       11.3.1.  GETATTR Within an Absent File System
       11.3.2.  READDIR and Absent File Systems
     11.4.  Uses of Location Information
       11.4.1.  File System Replication
       11.4.2.  File System Migration
       11.4.3.  Referrals
     11.5.  Location Entries and Server Identity
     11.6.  Additional Client-side Considerations
     11.7.  Effecting File System Transitions
       11.7.1.  File System Transitions and Simultaneous Access
       11.7.2.  Simultaneous Use and Transparent Transitions
       11.7.3.  Filehandles and File System Transitions
       11.7.4.  Fileids and File System Transitions
       11.7.5.  Fsids and File System Transitions
       11.7.6.  The Change Attribute and File System Transitions
       11.7.7.  Lock State and File System Transitions
       11.7.8.  Write Verifiers and File System Transitions
       11.7.9.  Readdir Cookies and Verifiers and File System Transitions
       11.7.10.  File System Data and File System Transitions
     11.8.  Effecting File System Referrals
       11.8.1.  Referral Example (LOOKUP)
       11.8.2.  Referral Example (READDIR)
     11.9.  The Attribute fs_locations
     11.10.  The Attribute fs_locations_info
       11.10.1.  The fs_locations_server4 Structure
       11.10.2.  The fs_locations_info4 Structure
       11.10.3.  The fs_locations_item4 Structure
     11.11.  The Attribute fs_status
   12.  Parallel NFS (pNFS)
     12.1.  Introduction
     12.2.  pNFS Definitions
       12.2.1.  Metadata
       12.2.2.  Metadata Server
       12.2.3.  pNFS Client
       12.2.4.  Storage Device
       12.2.5.  Storage Protocol
       12.2.6.  Control Protocol
       12.2.7.  Layout Types
       12.2.8.  Layout
       12.2.9.  Layout Iomode
       12.2.10.  Device IDs
     12.3.  pNFS Operations
     12.4.  pNFS Attributes
     12.5.  Layout Semantics
       12.5.1.  Guarantees Provided by Layouts
       12.5.2.  Getting a Layout
       12.5.3.  Layout Stateid
       12.5.4.  Committing a Layout
       12.5.5.  Recalling a Layout
       12.5.6.  Revoking Layouts
       12.5.7.  Metadata Server Write Propagation
     12.6.  pNFS Mechanics
     12.7.  Recovery
       12.7.1.  Recovery from Client Restart
       12.7.2.  Dealing with Lease Expiration on the Client
       12.7.3.  Dealing with Loss of Layout State on the Metadata Server
       12.7.4.  Recovery from Metadata Server Restart
       12.7.5.  Operations During Metadata Server Grace Period
       12.7.6.  Storage Device Recovery
     12.8.  Metadata and Storage Device Roles
     12.9.  Security Considerations for pNFS
   13.  PNFS: NFSv4.1 File Layout Type
     13.1.  Client ID and Session Considerations
       13.1.1.  Sessions Considerations for Data Servers
     13.2.  File Layout Definitions
     13.3.  File Layout Data Types
     13.4.  Interpreting the File Layout
       13.4.1.  Determining the Stripe Unit Number
       13.4.2.  Interpreting the File Layout Using Sparse Packing
       13.4.3.  Interpreting the File Layout Using Dense Packing
       13.4.4.  Sparse and Dense Stripe Unit Packing
     13.5.  Data Server Multipathing
     13.6.  Operations Sent to NFSv4.1 Data Servers
     13.7.  COMMIT Through Metadata Server
     13.8.  The Layout Iomode
     13.9.  Metadata and Data Server State Coordination
       13.9.1.  Global Stateid Requirements
       13.9.2.  Data Server State Propagation
     13.10.  Data Server Component File Size
     13.11.  Layout Revocation and Fencing
     13.12.  Security Considerations for the File Layout Type
   14.  Internationalization
     14.1.  Stringprep profile for the utf8str_cs type
     14.2.  Stringprep profile for the utf8str_cis type
     14.3.  Stringprep profile for the utf8str_mixed type
     14.4.  UTF-8 Capabilities
     14.5.  UTF-8 Related Errors
   15.  Error Values
     15.1.  Error Definitions
       15.1.1.  General Errors
       15.1.2.  Filehandle Errors
       15.1.3.  Compound Structure Errors
       15.1.4.  File System Errors
       15.1.5.  State Management Errors
       15.1.6.  Security Errors
       15.1.7.  Name Errors
       15.1.8.  Locking Errors
       15.1.9.  Reclaim Errors
       15.1.10.  pNFS Errors
       15.1.11.  Session Use Errors
       15.1.12.  Session Management Errors
       15.1.13.  Client Management Errors
       15.1.14.  Delegation Errors
       15.1.15.  Attribute Handling Errors
       15.1.16.  Obsoleted Errors
     15.2.  Operations and their valid errors
     15.3.  Callback operations and their valid errors
     15.4.  Errors and the operations that use them
   16.  NFSv4.1 Procedures
     16.1.  Procedure 0: NULL - No Operation
     16.2.  Procedure 1: COMPOUND - Compound Operations
   17.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL
   18.  NFSv4.1 Operations
     18.1.  Operation 3: ACCESS - Check Access Rights
     18.2.  Operation 4: CLOSE - Close File
     18.3.  Operation 5: COMMIT - Commit Cached Data
     18.4.  Operation 6: CREATE - Create a Non-Regular File Object
     18.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery
     18.6.  Operation 8: DELEGRETURN - Return Delegation
     18.7.  Operation 9: GETATTR - Get Attributes
     18.8.  Operation 10: GETFH - Get Current Filehandle
     18.9.  Operation 11: LINK - Create Link to a File
     18.10.  Operation 12: LOCK - Create Lock
     18.11.  Operation 13: LOCKT - Test For Lock
     18.12.  Operation 14: LOCKU - Unlock File
     18.13.  Operation 15: LOOKUP - Lookup Filename
     18.14.  Operation 16: LOOKUPP - Lookup Parent Directory
     18.15.  Operation 17: NVERIFY - Verify Difference in Attributes
     18.16.  Operation 18: OPEN - Open a Regular File
     18.17.  Operation 19: OPENATTR - Open Named Attribute Directory
     18.18.  Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
     18.19.  Operation 22: PUTFH - Set Current Filehandle
     18.20.  Operation 23: PUTPUBFH - Set Public Filehandle
     18.21.  Operation 24: PUTROOTFH - Set Root Filehandle
     18.22.  Operation 25: READ - Read from File
     18.23.  Operation 26: READDIR - Read Directory
     18.24.  Operation 27: READLINK - Read Symbolic Link
     18.25.  Operation 28: REMOVE - Remove File System Object
     18.26.  Operation 29: RENAME - Rename Directory Entry
     18.27.  Operation 31: RESTOREFH - Restore Saved Filehandle
     18.28.  Operation 32: SAVEFH - Save Current Filehandle
     18.29.  Operation 33: SECINFO - Obtain Available Security
     18.30.  Operation 34: SETATTR - Set Attributes
     18.31.  Operation 37: VERIFY - Verify Same Attributes
     18.32.  Operation 38: WRITE - Write to File
     18.33.  Operation 40: BACKCHANNEL_CTL - Backchannel control
     18.34.  Operation 41: BIND_CONN_TO_SESSION
     18.35.  Operation 42: EXCHANGE_ID - Instantiate Client ID
     18.36.  Operation 43: CREATE_SESSION - Create New Session and
             Confirm Client ID
     18.37.  Operation 44: DESTROY_SESSION - Destroy existing session
     18.38.  Operation 45: FREE_STATEID - Free stateid with no locks
     18.39.  Operation 46: GET_DIR_DELEGATION - Get a directory
             delegation
     18.40.  Operation 47: GETDEVICEINFO - Get Device Information
     18.41.  Operation 48: GETDEVICELIST - Get All Device Mappings for
             a File System
     18.42.  Operation 49: LAYOUTCOMMIT - Commit writes made using a
             layout
     18.43.  Operation 50: LAYOUTGET - Get Layout Information
     18.44.  Operation 51: LAYOUTRETURN - Release Layout Information
     18.45.  Operation 52: SECINFO_NO_NAME - Get Security on Unnamed
             Object
     18.46.  Operation 53: SEQUENCE - Supply per-procedure sequencing
             and control
     18.47.  Operation 54: SET_SSV - Update SSV for a Client ID
     18.48.  Operation 55: TEST_STATEID - Test stateids for validity
     18.49.  Operation 56: WANT_DELEGATION - Request Delegation
     18.50.  Operation 57: DESTROY_CLIENTID - Destroy existing client ID
     18.51.  Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
             Finished
     18.52.  Operation 10044: ILLEGAL - Illegal operation
   19.  NFSv4.1 Callback Procedures
     19.1.  Procedure 0: CB_NULL - No Operation
     19.2.  Procedure 1: CB_COMPOUND - Compound Operations
   20.  NFSv4.1 Callback Operations
     20.1.  Operation 3: CB_GETATTR - Get Attributes
     20.2.  Operation 4: CB_RECALL - Recall a Delegation
     20.3.  Operation 5: CB_LAYOUTRECALL - Recall Layout from Client
     20.4.  Operation 6: CB_NOTIFY - Notify directory changes
     20.5.  Operation 7: CB_PUSH_DELEG - Offer Delegation to Client
     20.6.  Operation 8: CB_RECALL_ANY - Keep any N recallable objects
     20.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources for
            Recallable Objects
     20.8.  Operation 10: CB_RECALL_SLOT - change flow control limits
     20.9.  Operation 11: CB_SEQUENCE - Supply backchannel sequencing
            and control
     20.10.  Operation 12: CB_WANTS_CANCELLED - Cancel Pending
             Delegation Wants
     20.11.  Operation 13: CB_NOTIFY_LOCK - Notify of possible lock
             availability
     20.12.  Operation 14: CB_NOTIFY_DEVICEID - Notify device ID changes
     20.13.  Operation 10044: CB_ILLEGAL - Illegal Callback Operation
   21.  Security Considerations
   22.  IANA Considerations
     22.1.  Named Attribute Definitions
     22.2.  ONC RPC Network Identifiers (netids)
     22.3.  Defining New Notifications
     22.4.  Defining New Layout Types
     22.5.  Path Variable Definitions
       22.5.1.  Path Variable Values
       22.5.2.  Path Variable Names
   23.  References
     23.1.  Normative References
     23.2.  Informative References
   Appendix A.  Acknowledgments
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

1.1.  The NFS Version 4 Minor Version 1 Protocol

   The NFS version 4 minor version 1 (NFSv4.1) protocol is the second
   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
   version, NFSv4.0, is described in [21].  NFSv4.1 generally follows
   the guidelines for the minor versioning model listed in Section 10 of
   RFC 3530.  However, it diverges from guidelines 11 ("a client and
   server that supports minor version X must support minor versions 0
   through X-1"), and 12 ("no features may be introduced as mandatory in
   a minor version").  These divergences are due to the introduction of
   the sessions model for managing non-idempotent operations and the
   RECLAIM_COMPLETE operation.  These two new features are
   infrastructural in nature and simplify implementation of existing and
   other new features.  Making them anything but REQUIRED would add
   undue complexity to protocol definition and implementation.  NFSv4.1
   accordingly updates the Minor Versioning guidelines (Section 2.7).

   As a minor version, NFSv4.1 is consistent with the overall goals for
   NFSv4, but extends the protocol so as to better meet those goals,
   based on experience with NFSv4.0.  In addition, NFSv4.1 has adopted
   some additional goals, which motivate some of the major extensions in
   NFSv4.1.

1.2.  Scope of this Document

   This document describes the NFSv4.1 protocol.  With respect to
   NFSv4.0, this document does not:

   o  describe the NFSv4.0 protocol, except where needed to contrast
      with NFSv4.1.

   o  modify the specification of the NFSv4.0 protocol.

   o  clarify the NFSv4.0 protocol.

1.3.  NFSv4 Goals

   The NFSv4 protocol is a further revision of the NFS protocol already
   defined by NFSv3 [22].
   It retains the essential characteristics of previous versions: easy
   recovery; independence of transport protocols, operating systems and
   file systems; simplicity; and good performance.  NFSv4 has the
   following goals:

   o  Improved access and good performance on the Internet.

      The protocol is designed to transit firewalls easily, perform well
      where latency is high and bandwidth is low, and scale to very
      large numbers of clients per server.

   o  Strong security with negotiation built into the protocol.

      The protocol builds on the work of the ONCRPC working group in
      supporting the RPCSEC_GSS protocol.  Additionally, the NFSv4.1
      protocol provides a mechanism that allows clients and servers to
      negotiate security, and requires clients and servers to support a
      minimal set of security schemes.

   o  Good cross-platform interoperability.

      The protocol features a file system model that provides a useful,
      common set of features that does not unduly favor one file system
      or operating system over another.

   o  Designed for protocol extensions.

      The protocol is designed to accept standard extensions within a
      framework that enables and encourages backward compatibility.

1.4.  NFSv4.1 Goals

   NFSv4.1 has the following goals, within the framework established by
   the overall NFSv4 goals.

   o  To correct significant structural weaknesses and oversights
      discovered in the base protocol.

   o  To add clarity and specificity to areas left unaddressed or not
      addressed in sufficient detail in the base protocol.  However, as
      stated in Section 1.2, it is not a goal to clarify the NFSv4.0
      protocol in the NFSv4.1 specification.

   o  To add specific features based on experience with the existing
      protocol and recent industry developments.

   o  To provide protocol support to take advantage of clustered server
      deployments including the ability to provide scalable parallel
      access to files distributed among multiple servers.

1.5.  General Definitions

   The following definitions are provided for the purpose of providing
   an appropriate context for the reader.

   Byte  This document defines a byte as an octet, i.e., a datum exactly
      8 bits in length.

   Client  The "client" is the entity that accesses the NFS server's
      resources.  The client may be an application that contains the
      logic to access the NFS server directly.  The client may also be
      the traditional operating system client that provides remote file
      system services for a set of applications.

      A client is uniquely identified by a Client Owner.

      With reference to file locking, the client is also the entity that
      maintains a set of locks on behalf of one or more applications.
      This client is responsible for crash or failure recovery for those
      locks it manages.

      Note that multiple clients may share the same transport and
      connection and multiple clients may exist on the same network
      node.

   Client ID  A 64-bit quantity used as a unique, short-hand reference
      to a client-supplied verifier and client owner.  The server is
      responsible for supplying the client ID.

   Client Owner  The client owner is a unique string, opaque to the
      server, which identifies a client.  Multiple network connections
      and source network addresses originating from those connections
      may share a client owner.  The server is expected to treat
      requests from connections with the same client owner as coming
      from the same client.

   File System  The collection of objects on a server (as identified by
      the major identifier of a Server Owner, which is defined later in
      this section), that share the same fsid attribute (see
      Section 5.8.1.9).

   Lease  An interval of time defined by the server for which the client
      is irrevocably granted a lock.  At the end of a lease period the
      lock may be revoked if the lease has not been extended.  The lock
      must be revoked if a conflicting lock has been granted after the
      lease interval.
      All leases granted by a server have the same fixed interval. Note that the fixed interval was chosen to alleviate the expense a server would have in maintaining state about variable length leases across server failures.

   Lock
      The term "lock" is used to refer to byte-range (in UNIX environments, also known as record) locks, share reservations, delegations, or layouts unless specifically stated otherwise.

   Server
      The "Server" is the entity responsible for coordinating client access to a set of file systems and is identified by a Server Owner. A server can span multiple network addresses.

   Server Owner
      The "Server Owner" identifies the server to the client. The server owner consists of a major and minor identifier. When the client has two connections each to a peer with the same major identifier, the client assumes both peers are the same server (the server namespace is the same via each connection), and assumes that lock state is sharable across both connections. When each peer has both the same major and minor identifier, the client assumes each connection might be associatable with the same session.

   Stable Storage
      NFSv4.1 servers must be able to recover without data loss from multiple power failures (including cascading power failures, that is, several power failures in quick succession), operating system failures, and hardware failure of components other than the storage medium itself (for example, disk, nonvolatile RAM).

      Some examples of stable storage that are allowable for an NFS server include:

      1.  Media commit of data, that is, the modified data has been successfully written to the disk media, for example, the disk platter.

      2.  An immediate reply disk drive with battery-backed on-drive intermediate storage or uninterruptible power system (UPS).

      3.  Server commit of data with battery-backed intermediate storage and recovery software.

      4.  Cache commit with uninterruptible power system (UPS) and recovery software.

   Stateid
      A 128-bit quantity returned by a server that uniquely defines the open and locking state provided by the server for a specific open-owner or lock-owner/open-owner pair for a specific file and type of lock.

   Verifier
      A 64-bit quantity generated by the client that the server can use to determine if the client has restarted and lost all previous lock state.

1.6.  Overview of NFSv4.1 Features

   To provide a reasonable context for the reader, the major features of the NFSv4.1 protocol will be reviewed in brief. This will be done to provide an appropriate context for both the reader who is familiar with the previous versions of the NFS protocol and the reader who is new to the NFS protocols. For the reader new to the NFS protocols, there is still a set of fundamental knowledge that is expected. The reader should be familiar with the XDR and RPC protocols as described in [2] and [3]. A basic knowledge of file systems and distributed file systems is expected as well.

   In general, this specification of NFSv4.1 will not distinguish those features added in minor version one from those present in the base protocol but will treat NFSv4.1 as a unified whole. See Section 1.7 for a summary of the differences between NFSv4.0 and NFSv4.1.

1.6.1.  RPC and Security

   As with previous versions of NFS, the External Data Representation (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1 protocol are those defined in [2] and [3]. To meet end-to-end security requirements, the RPCSEC_GSS framework [4] will be used to extend the basic RPC security. With the use of RPCSEC_GSS, various mechanisms can be provided to offer authentication, integrity, and privacy to the NFSv4 protocol. Kerberos V5 will be used as described in [5] to provide one security framework.
   The LIPKEY and SPKM-3 GSS-API mechanisms described in [6] will be used to provide for the use of user password and client/server public key certificates by the NFSv4 protocol. With the use of RPCSEC_GSS, other mechanisms may also be specified and used for NFSv4.1 security.

   To enable in-band security negotiation, the NFSv4.1 protocol has operations which provide the client a method of querying the server about its policies regarding which security mechanisms must be used for access to the server's file system resources. With this, the client can securely match the security mechanism that meets the policies specified at both the client and server.

1.6.2.  Protocol Structure

1.6.2.1.  Core Protocol

   Unlike NFSv3, which used a series of ancillary protocols (e.g., NLM, NSM, MOUNT), within all minor versions of NFSv4 a single RPC protocol is used to make requests to the server. Facilities that had been separate protocols, such as locking, are now integrated within a single unified protocol.

1.6.2.2.  Parallel Access

   Minor version one supports high-performance data access to a clustered server implementation by enabling a separation of metadata access and data access, with the latter done to multiple servers in parallel.

   Such parallel data access is controlled by recallable objects known as "layouts", which are integrated into the protocol locking model. Clients direct requests for data access to a set of data servers specified by the layout via a data storage protocol which may be NFSv4.1 or may be another protocol.

1.6.3.  File System Model

   The general file system model used for the NFSv4.1 protocol is the same as previous versions. The server file system is hierarchical with the regular files contained within being treated as opaque byte streams. In a slight departure, file and directory names are encoded with UTF-8 to deal with the basics of internationalization.
   The NFSv4.1 protocol does not require a separate protocol to provide for the initial mapping between path name and filehandle. All file systems exported by a server are presented as a tree so that all file systems are reachable from a special per-server global root filehandle. This allows LOOKUP operations to be used to perform functions previously provided by the MOUNT protocol. The server provides any necessary pseudo file systems to bridge any gaps that arise due to unexported gaps between exported file systems.

1.6.3.1.  Filehandles

   As in previous versions of the NFS protocol, opaque filehandles are used to identify individual files and directories. Lookup-type and create operations translate file and directory names to filehandles which are then used to identify objects in subsequent operations.

   The NFSv4.1 protocol provides support for persistent filehandles, guaranteed to be valid for the lifetime of the file system object designated. In addition it provides support to servers to provide filehandles with more limited validity guarantees, called volatile filehandles.

1.6.3.2.  File Attributes

   The NFSv4.1 protocol has a rich and extensible attribute structure, which is divided into REQUIRED, RECOMMENDED, and named attributes.

   The acl, sacl, and dacl attributes compose a set of RECOMMENDED file attributes that make up the Access Control List (ACL) of a file (Section 6). These attributes provide for directory and file access control beyond the model used in NFSv3. The ACL definition allows for specification of specific sets of permissions for individual users and groups. In addition, ACL inheritance allows propagation of access permissions and restriction down a directory tree as file system objects are created.

   A named attribute is an opaque byte stream that is associated with a directory or file and referred to by a string name.
   Named attributes are meant to be used by client applications as a method to associate application-specific data with a regular file or directory. NFSv4.1 modifies named attributes relative to NFSv4.0 by tightening the allowed operations in order to prevent the development of non-interoperable implementations. See Section 5.3 for details.

1.6.3.3.  Multi-server Namespace

   NFSv4.1 contains a number of features to allow implementation of namespaces that cross server boundaries and that allow and facilitate a non-disruptive transfer of support for individual file systems between servers. They are all based upon attributes that allow one file system to specify alternate or new locations for that file system.

   These attributes may be used together with the concept of absent file systems, which provide specifications for additional locations but no actual file system content. This allows a number of important facilities:

   o  Location attributes may be used with absent file systems to implement referrals whereby one server may direct the client to a file system provided by another server. This allows extensive multi-server namespaces to be constructed.

   o  Location attributes may be provided for present file systems to provide the locations of alternate file system instances or replicas to be used in the event that the current file system instance becomes unavailable.

   o  Location attributes may be provided when a previously present file system becomes absent. This allows non-disruptive migration of file systems to alternate servers.

1.6.4.  Locking Facilities

   As mentioned previously, NFSv4.1 is a single protocol which includes locking facilities. These locking facilities include support for many types of locks including a number of sorts of recallable locks. Recallable locks such as delegations allow the client to be assured that certain events will not occur so long as that lock is held.
   When circumstances change, the lock is recalled via a callback request. The assurances provided by delegations allow more extensive caching to be done safely when circumstances allow it.

   The types of locks are:

   o  Share reservations as established by OPEN operations.

   o  Byte-range locks.

   o  File delegations, which are recallable locks that assure the holder that inconsistent opens and file changes cannot occur so long as the delegation is held.

   o  Directory delegations, which are recallable locks that assure the holder that inconsistent directory modifications cannot occur so long as the delegation is held.

   o  Layouts, which are recallable objects that assure the holder that access to the file data may be performed directly by the client and that no change to the data's location inconsistent with that access may be made so long as the layout is held.

   All locks for a given client are tied together under a single client-wide lease. All requests made on sessions associated with the client renew that lease. When leases are not promptly renewed, locks are subject to revocation. In the event of server restart, clients have the opportunity to safely reclaim their locks within a special grace period.

1.7.  Differences from NFSv4.0

   The following summarizes the major differences between minor version one and the base protocol:

   o  Implementation of the sessions model (Section 2.10).

   o  Parallel access to data (Section 12).

   o  Addition of the RECLAIM_COMPLETE operation to better structure the lock reclamation process (Section 18.51).

   o  Enhanced delegation support as follows.

      *  Delegations on directories and other file types in addition to regular files (Section 18.39, Section 18.49).

      *  Operations to optimize acquisition of recalled or denied delegations (Section 18.49, Section 20.5, Section 20.7).

      *  Notifications of changes to files and directories (Section 18.39, Section 20.4).
      *  A method to allow a server to indicate it is recalling one or more delegations for resource management reasons, and thus a method to allow the client to pick which delegations to return (Section 20.6).

   o  Attributes can be set atomically during exclusive file create via the OPEN operation (see the new EXCLUSIVE4_1 creation method in Section 18.16).

   o  Open files can be preserved if removed and the hard link count goes to zero, thus obviating the need for clients to rename deleted files to partially hidden names -- colloquially called "silly rename" (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in Section 18.16).

   o  Improved compatibility with Microsoft Windows for Access Control Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2).

   o  Data retention (Section 5.13).

   o  Identification of the implementation of the NFS client and server (Section 18.35).

   o  Support for notification of the availability of byte-range locks (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in Section 18.16 and see Section 20.11).

2.  Core Infrastructure

2.1.  Introduction

   NFSv4.1 relies on core infrastructure common to nearly every operation. This core infrastructure is described in the remainder of this section.

2.2.  RPC and XDR

   The NFSv4.1 protocol is a Remote Procedure Call (RPC) application that uses RPC version 2 and the corresponding eXternal Data Representation (XDR) as defined in [3] and [2].

2.2.1.  RPC-based Security

   Previous NFS versions have been thought of as having a host-based authentication model, where the NFS server authenticates the NFS client, and trusts the client to authenticate all users. Actually, NFS has always depended on RPC for authentication. One of the first forms of RPC authentication, AUTH_SYS, had no strong authentication, and required a host-based authentication approach.
   NFSv4.1 also depends on RPC for basic security services, and mandates RPC support for a user-based authentication model. The user-based authentication model has user principals authenticated by a server, and in turn the server authenticated by user principals. RPC provides some basic security services which are used by NFSv4.1.

2.2.1.1.  RPC Security Flavors

   As described in section 7.2 "Authentication" of [3], RPC security is encapsulated in the RPC header, via a security or authentication flavor, and information specific to the specified security flavor. Every RPC header conveys information used to identify and authenticate a client and server. As discussed in Section 2.2.1.1.1, some security flavors provide additional security services.

   NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This requirement to implement is not a requirement to use.) Other flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well.

2.2.1.1.1.  RPCSEC_GSS and Security Services

   RPCSEC_GSS ([4]) uses the functionality of GSS-API [7]. This allows for the use of various security mechanisms by the RPC layer without the additional implementation overhead of adding RPC security flavors.

2.2.1.1.1.1.  Identification, Authentication, Integrity, Privacy

   Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate users on clients to servers, and servers to users. It can also perform integrity checking on the entire RPC message, including the RPC header, and the arguments or results. Finally, privacy, usually via encryption, is a service available with RPCSEC_GSS. Privacy is performed on the arguments and results. Note that if privacy is selected, integrity, authentication, and identification are enabled. If privacy is not selected, but integrity is selected, authentication and identification are enabled.
   If integrity and privacy are not selected, but authentication is enabled, identification is enabled. RPCSEC_GSS does not provide identification as a separate service.

   Although GSS-API has an authentication service distinct from its privacy and integrity services, GSS-API's authentication service is not used for RPCSEC_GSS's authentication service. Instead, each RPC request and response header is integrity protected with the GSS-API integrity service, and this allows RPCSEC_GSS to offer per-RPC authentication and identity. See [4] for more information.

   NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's privacy service.

2.2.1.1.1.2.  Security mechanisms for NFSv4.1

   RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide security services. Therefore NFSv4.1 clients and servers MUST support three security mechanisms: Kerberos V5, SPKM-3, and LIPKEY.

   The use of RPCSEC_GSS requires selection of: mechanism, quality of protection (QOP), and service (authentication, integrity, privacy). For the mandated security mechanisms, NFSv4.1 specifies that a QOP of zero (0) is used, leaving it up to the mechanism or the mechanism's configuration to use an appropriate level of protection that QOP zero maps to. Each mandated mechanism specifies a minimum set of cryptographic algorithms for implementing integrity and privacy. NFSv4.1 clients and servers MUST be implemented on operating environments that comply with the REQUIRED cryptographic algorithms of each REQUIRED mechanism.

2.2.1.1.1.2.1.  Kerberos V5

   The Kerberos V5 GSS-API mechanism as described in [5] MUST be implemented with the RPCSEC_GSS services as specified in the following table:

      column descriptions:

      1 == number of pseudo flavor
      2 == name of pseudo flavor
      3 == mechanism's OID
      4 == RPCSEC_GSS service
      5 == NFSv4.1 clients MUST support
      6 == NFSv4.1 servers MUST support

      1      2        3                    4                     5   6
      ------------------------------------------------------------------
      390003 krb5     1.2.840.113554.1.2.2 rpc_gss_svc_none      yes yes
      390004 krb5i    1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes
      390005 krb5p    1.2.840.113554.1.2.2 rpc_gss_svc_privacy   no  yes

   Note that the number and name of the pseudo flavor is presented here as a mapping aid to the implementor. Because the NFSv4.1 protocol includes a method to negotiate security and it understands the GSS-API mechanism, the pseudo flavor is not needed. The pseudo flavor is needed for NFSv3 since the security negotiation is done via the MOUNT protocol as described in [23].

2.2.1.1.1.2.2.  LIPKEY

   The LIPKEY GSS-API mechanism as described in [6] MUST be implemented with the RPCSEC_GSS services as specified in the following table:

      1      2        3                    4                     5   6
      ------------------------------------------------------------------
      390006 lipkey   1.3.6.1.5.5.9        rpc_gss_svc_none      yes yes
      390007 lipkey-i 1.3.6.1.5.5.9        rpc_gss_svc_integrity yes yes
      390008 lipkey-p 1.3.6.1.5.5.9        rpc_gss_svc_privacy   no  yes

2.2.1.1.1.2.3.  SPKM-3 as a security triple

   The SPKM-3 GSS-API mechanism as described in [6] MUST be implemented with the RPCSEC_GSS services as specified in the following table:

      1      2        3                    4                     5   6
      ------------------------------------------------------------------
      390009 spkm3    1.3.6.1.5.5.1.3      rpc_gss_svc_none      yes yes
      390010 spkm3i   1.3.6.1.5.5.1.3      rpc_gss_svc_integrity yes yes
      390011 spkm3p   1.3.6.1.5.5.1.3      rpc_gss_svc_privacy   no  yes

2.2.1.1.1.3.  GSS Server Principal

   Regardless of what security mechanism under RPCSEC_GSS is being used, the NFS server MUST identify itself in GSS-API via a GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE names are of the form:

      service@hostname

   For NFS, the "service" element is

      nfs

   Implementations of security mechanisms will convert nfs@hostname to various different forms. For Kerberos V5, LIPKEY, and SPKM-3, the following form is RECOMMENDED:

      nfs/hostname

2.3.  COMPOUND and CB_COMPOUND

   A significant departure from the versions of the NFS protocol before NFSv4 is the introduction of the COMPOUND procedure. For the NFSv4 protocol, in all minor versions, there are exactly two RPC procedures, NULL and COMPOUND. The COMPOUND procedure is defined as a series of individual operations and these operations perform the sorts of functions performed by traditional NFS procedures.

   The operations combined within a COMPOUND request are evaluated in order by the server, without any atomicity guarantees. A limited set of facilities exist to pass results from one operation to another. Once an operation returns a failing result, the evaluation ends and the results of all evaluated operations are returned to the client.

   With the use of the COMPOUND procedure, the client is able to build simple or complex requests. These COMPOUND requests allow for a reduction in the number of RPCs needed for logical file system operations. For example, multi-component lookup requests can be constructed by combining multiple LOOKUP operations. Those can be further combined with operations such as GETATTR, READDIR, or OPEN plus READ to do more complicated sets of operations without incurring additional latency.

   NFSv4.1 also contains a considerable set of callback operations in which the server makes an RPC directed at the client. Callback RPCs have a similar structure to that of the normal server requests.
   In all minor versions of the NFSv4 protocol there are two callback RPC procedures, CB_NULL and CB_COMPOUND. The CB_COMPOUND procedure is defined in an analogous fashion to that of COMPOUND with its own set of callback operations.

   The addition of new server and callback operations within the COMPOUND and CB_COMPOUND request framework provides a means of extending the protocol in subsequent minor versions.

   Except for a small number of operations needed for session creation, server requests and callback requests are performed within the context of a session. Sessions provide a client context for every request and support robust reply protection for non-idempotent requests.

2.4.  Client Identifiers and Client Owners

   For each operation that obtains or depends on locking state, the specific client must be identifiable by the server. Each distinct client instance is represented by a client ID. A client ID is a 64-bit identifier representing a specific client at a given time. The client ID is changed whenever the client re-initializes, and may change when the server re-initializes. Client IDs are used to support lock identification and crash recovery.

   During steady state operation, the client ID associated with each operation is derived from the session (see Section 2.10) on which the operation is sent. A session is associated with a client ID when the session is created.

   Unlike NFSv4.0, the only NFSv4.1 operations possible before a client ID is established are those needed to establish the client ID.

   A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION operation using that client ID (eir_clientid as returned from EXCHANGE_ID) is required to establish and confirm the client ID on the server.
   Establishment of identification by a new incarnation of the client also has the effect of immediately releasing any locking state that a previous incarnation of that same client might have had on the server. Such released state would include all lock, share reservation, layout state, and where the server is not supporting the CLAIM_DELEGATE_PREV claim type, all delegation state associated with the same client with the same identity. For discussion of delegation state recovery, see Section 10.2.1. For discussion of layout state recovery see Section 12.7.1.

   Releasing such state requires that the server be able to determine that one client instance is the successor of another. Where this cannot be done, for any of a number of reasons, the locking state will remain for a time subject to lease expiration (see Section 8.3) and the new client will need to wait for such state to be removed, if it makes conflicting lock requests.

   Client identification is encapsulated in the following Client Owner data type:

      struct client_owner4 {
              verifier4       co_verifier;
              opaque          co_ownerid;
      };

   The first field, co_verifier, is a client incarnation verifier. The server will start the process of canceling the client's leased state if co_verifier is different than what the server has previously recorded for the identified client (as specified in the co_ownerid field).

   The second field, co_ownerid, is a variable length string that uniquely defines the client so that subsequent instances of the same client bear the same co_ownerid with a different verifier.

   There are several considerations for how the client generates the co_ownerid string:

   o  The string should be unique so that multiple clients do not present the same string. The consequences of two clients presenting the same string range from one client getting an error to one client having its leased state abruptly and unexpectedly canceled.
   o  The string should be selected so that subsequent incarnations (e.g., restarts) of the same client cause the client to present the same string. The implementor is cautioned against an approach that requires the string to be recorded in a local file because this precludes the use of the implementation in an environment where there is no local disk and all file access is from an NFSv4.1 server.

   o  The string should be the same for each server network address that the client accesses. This way, if a server has multiple interfaces, the client can trunk traffic over multiple network paths as described in Section 2.10.4. (Note: the precise opposite was advised in the NFSv4.0 specification [21].)

   o  The algorithm for generating the string should not assume that the client's network address will not change, unless the client implementation knows it is using statically assigned network addresses. This includes changes between client incarnations and even changes while the client is still running in its current incarnation. Thus with dynamic address assignment, if the client includes just the client's network address in the co_ownerid string, there is a real risk that after the client gives up the network address, another client, using a similar algorithm for generating the co_ownerid string, would generate a conflicting co_ownerid string.

   Given the above considerations, an example of a well generated co_ownerid string is one that includes:

   o  If applicable, the client's statically assigned network address.

   o  Additional information that tends to be unique, such as one or more of:

      *  The client machine's serial number (for privacy reasons, it is best to perform some one-way function on the serial number).

      *  A MAC address (again, a one-way function should be performed).
      *  The timestamp of when the NFSv4.1 software was first installed on the client (though this is subject to the previously mentioned caution about using information that is stored in a file, because the file might only be accessible over NFSv4.1).

      *  A true random number. However, since this number ought to be the same between client incarnations, this shares the same problem as that of using the timestamp of the software installation.

   o  For a user-level NFSv4.1 client, it should contain additional information to distinguish the client from other user-level clients running on the same host, such as a process identifier or other unique sequence.

   The client ID is assigned by the server (the eir_clientid result from EXCHANGE_ID) and should be chosen so that it will not conflict with a client ID previously assigned by the server. This applies across server restarts.

   In the event of a server restart, a client may find out that its current client ID is no longer valid when it receives an NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on the characteristics of the sessions involved, specifically whether the session is persistent (see Section 2.10.5.5), but in each case the client will receive this error when it attempts to establish a new session with the existing client ID, indicating that a new client ID must be obtained via EXCHANGE_ID and the new session established with that client ID.

   When a session is not persistent, the client will find out that it needs to create a new session as a result of getting an NFS4ERR_BADSESSION, since the session in question was lost as part of a server restart. When the existing client ID is presented to a server as part of creating a session and that client ID is not recognized, as would happen after a server restart, the server will reject the request with the error NFS4ERR_STALE_CLIENTID.
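
   The non-persistent case just described can be sketched as follows. This is a minimal illustration, not a definitive implementation: FakeServer and Client are hypothetical in-memory stand-ins invented for this example, status codes are modeled as plain strings, and only the order of operations (NFS4ERR_BADSESSION, then CREATE_SESSION, then NFS4ERR_STALE_CLIENTID, then EXCHANGE_ID and CREATE_SESSION again) is taken from the text. Any further restart recovery, such as lock reclaim, is omitted.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_BADSESSION = "NFS4ERR_BADSESSION"
NFS4ERR_STALE_CLIENTID = "NFS4ERR_STALE_CLIENTID"


class FakeServer:
    """Hypothetical stand-in for a server; not a real NFS API."""

    def __init__(self):
        # The counter is never reset, so a client ID assigned after a
        # restart cannot collide with one assigned before it.
        self._next_id = 1
        self.clientids = set()
        self.sessions = {}  # session ID -> client ID

    def _new_id(self):
        value = self._next_id
        self._next_id += 1
        return value

    def exchange_id(self, co_ownerid, co_verifier):
        clientid = self._new_id()
        self.clientids.add(clientid)
        return NFS4_OK, clientid

    def create_session(self, clientid):
        if clientid not in self.clientids:
            return NFS4ERR_STALE_CLIENTID, None
        sessionid = self._new_id()
        self.sessions[sessionid] = clientid
        return NFS4_OK, sessionid

    def request(self, sessionid):
        return NFS4_OK if sessionid in self.sessions else NFS4ERR_BADSESSION

    def restart(self):
        # Assumption: a restart forgets client IDs and non-persistent sessions.
        self.clientids.clear()
        self.sessions.clear()


class Client:
    """Hypothetical client that recovers from a server restart."""

    def __init__(self, server, co_ownerid, co_verifier):
        self.server = server
        self.co_ownerid = co_ownerid
        self.co_verifier = co_verifier  # unchanged: the client did not restart
        self.trace = []
        self.clientid = None
        self.sessionid = None
        self._new_clientid()
        self._new_session()

    def _new_clientid(self):
        status, self.clientid = self.server.exchange_id(
            self.co_ownerid, self.co_verifier)
        self.trace.append(("EXCHANGE_ID", status))

    def _new_session(self):
        status, sessionid = self.server.create_session(self.clientid)
        self.trace.append(("CREATE_SESSION", status))
        if status == NFS4_OK:
            self.sessionid = sessionid
        return status

    def request(self):
        status = self.server.request(self.sessionid)
        if status == NFS4ERR_BADSESSION:
            # Session was lost: first try a new session on the old client ID.
            if self._new_session() == NFS4ERR_STALE_CLIENTID:
                # Client ID is gone too: obtain a new one, then a new session.
                self._new_clientid()
                self._new_session()
            status = self.server.request(self.sessionid)
        return status


server = FakeServer()
client = Client(server, b"example-owner", 0x1234)
first_clientid = client.clientid
server.restart()
status_after_restart = client.request()
```

   In this sketch the retried request succeeds only after the client has obtained a different client ID, which mirrors the rule that a stale client ID cannot be used as the basis of a new session.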
   In the case of the session being persistent, the client will re-establish communication using the existing session after the restart. This session will be associated with the existing client ID but may only be used to retransmit operations that the client previously transmitted and did not see replies to. Replies to operations that the server previously performed will come from the reply cache; otherwise, NFS4ERR_DEADSESSION will be returned. Hence, such a session is referred to as "dead". In this situation, in order to perform new operations, the client must establish a new session. If an attempt is made to establish this new session with the existing client ID, the server will reject the request with NFS4ERR_STALE_CLIENTID.

   When NFS4ERR_STALE_CLIENTID is received in either of these situations, the client must obtain a new client ID by use of the EXCHANGE_ID operation, then use that client ID as the basis of a new session, and then proceed to any other necessary recovery for the server restart case (see Section 8.4.2).

   See the descriptions of EXCHANGE_ID (Section 18.35) and CREATE_SESSION (Section 18.36) for a complete specification of these operations.

2.4.1.  Upgrade from NFSv4.0 to NFSv4.1

   To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a client_owner4 in an EXCHANGE_ID with an nfs_client_id4 established using the SETCLIENTID operation of NFSv4.0. A server that does so will allow an upgraded client to avoid waiting until the lease (i.e., the lease established by the NFSv4.0 instance client) expires. This requires the client_owner4 be constructed the same way as the nfs_client_id4. If the latter's contents included the server's network address (per the recommendations of the NFSv4.0 specification [21]), and the NFSv4.1 client does not wish to use a client ID that prevents trunking, it should send two EXCHANGE_ID operations.
   The first EXCHANGE_ID will have a client_owner4 equal to the nfs_client_id4. This will clear the state created by the NFSv4.0 client. The second EXCHANGE_ID will not have the server's network address. The state created for the second EXCHANGE_ID will not have to wait for lease expiration, because there will be no state to expire.

2.4.2.  Server Release of Client ID

   NFSv4.1 introduces a new operation called DESTROY_CLIENTID (Section 18.50) which the client SHOULD use to destroy a client ID it no longer needs. This permits graceful, bilateral release of a client ID. The operation cannot be used if there are sessions associated with the client ID, or state with an unexpired lease.

   If the server determines that the client holds no associated state for its client ID (including sessions, opens, locks, delegations, layouts, and wants), the server may choose to unilaterally release the client ID in order to conserve resources. If the client contacts the server after this release, the server must ensure the client receives the appropriate error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence to establish a new client ID. The server ought to be very hesitant to release a client ID since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically a server would not release a client ID unless there had been no activity from that client for many minutes. As long as there are sessions, opens, locks, delegations, layouts, or wants, the server MUST NOT release the client ID. See Section 2.10.11.1.4 for discussion on releasing inactive sessions.

2.4.3.  Resolving Client Owner Conflicts

   When the server gets an EXCHANGE_ID for a client owner that currently has no state, or that has state, but the lease has expired, the server MUST allow the EXCHANGE_ID, and confirm the new client ID if followed by the appropriate CREATE_SESSION.
When the server gets an EXCHANGE_ID for a new incarnation of a client owner that currently has an old incarnation with state and an unexpired lease, the server is allowed to dispose of the state of the previous incarnation of the client owner if one of the following is true:

o The principal that created the client ID for the client owner is the same as the principal that is issuing the EXCHANGE_ID. Note that if the client ID was created with SP4_MACH_CRED state protection (Section 18.35), the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal must be used as that used when the client ID was created.

o The client ID was established with SP4_SSV protection (Section 18.35, Section 2.10.7.3) and the client sends the EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the GSS SSV mechanism (Section 2.10.8).

o The client ID was established with SP4_SSV protection, and under the conditions described herein, the EXCHANGE_ID was sent with SP4_MACH_CRED state protection. Because the SSV might not persist across client and server restart, and because the first time a client sends EXCHANGE_ID to a server it does not have an SSV, the client MAY send the subsequent EXCHANGE_ID without an SSV RPCSEC_GSS handle. Instead, as with SP4_MACH_CRED protection, the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal MUST be used as that used when the client ID was created.

If none of the above situations apply, the server MUST return NFS4ERR_CLID_INUSE.
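
As a rough illustration only, the decision described in this section can be sketched in Python. The record layout, the names ClientRecord, Request, and resolve_owner_conflict, and the boolean inputs are assumptions made for this example and are not protocol elements; the RPCSEC_GSS checks are reduced to flags that a real server would derive from the RPC credentials.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"                      # EXCHANGE_ID may proceed
    CLID_INUSE = "NFS4ERR_CLID_INUSE"    # request is rejected


@dataclass
class ClientRecord:
    """Hypothetical server-side record for an existing client owner."""
    has_state: bool
    lease_expired: bool
    protection: str           # "SP4_NONE", "SP4_MACH_CRED", or "SP4_SSV"
    creator_principal: str    # principal that created the client ID


@dataclass
class Request:
    """Hypothetical summary of the credentials on an incoming EXCHANGE_ID."""
    principal: str
    uses_ssv_mechanism: bool = False        # RPCSEC_GSS with the GSS SSV mechanism
    gss_integrity_or_privacy: bool = False  # RPCSEC_GSS service is integrity/privacy
    same_gss_mechanism: bool = False        # same mechanism as at client ID creation


def resolve_owner_conflict(rec: ClientRecord, req: Request) -> Decision:
    # No state, or state whose lease has expired: the request is allowed.
    if not rec.has_state or rec.lease_expired:
        return Decision.ALLOW

    same_principal = req.principal == rec.creator_principal
    gss_match = (same_principal
                 and req.gss_integrity_or_privacy
                 and req.same_gss_mechanism)

    # First bullet: same principal, with the extra RPCSEC_GSS requirements
    # when the client ID was created with SP4_MACH_CRED.
    if rec.protection == "SP4_MACH_CRED":
        first = gss_match
    else:
        first = same_principal
    # Second bullet: SP4_SSV client ID, request sent with the SSV mechanism.
    second = rec.protection == "SP4_SSV" and req.uses_ssv_mechanism
    # Third bullet: SP4_SSV client ID, request sent under the machine
    # credential conditions instead of an SSV handle.
    third = rec.protection == "SP4_SSV" and gss_match

    if first or second or third:
        return Decision.ALLOW
    return Decision.CLID_INUSE
```

The sketch only answers whether the server is allowed to proceed; whether state is actually deleted depends on the co_verifier comparison and the confirming CREATE_SESSION.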
If the server accepts the principal and co_ownerid as matching that which created the client ID, and the co_verifier in the EXCHANGE_ID differs from the co_verifier used when the client ID was created, then after the server receives a CREATE_SESSION that confirms the client ID, the server deletes state. If the co_verifier values are the same (e.g., the client is either updating properties of the client ID (Section 18.35) or attempting trunking (Section 2.10.4)), the server MUST NOT delete state.

2.5. Server Owners

The Server Owner is similar to a Client Owner (Section 2.4), but unlike the Client Owner, there is no shorthand server ID. The Server Owner is defined in the following data type:

   struct server_owner4 {
           uint64_t        so_minor_id;
           opaque          so_major_id;
   };

The Server Owner is returned from EXCHANGE_ID. When the so_major_id fields are the same in two EXCHANGE_ID results, the connections that each EXCHANGE_ID was sent over can be assumed to address the same Server (as defined in Section 1.5). If the so_minor_id fields are also the same, then not only do both connections connect to the same server, but the session can be shared across both connections. The reader is cautioned that multiple servers may deliberately or accidentally claim to have the same so_major_id or so_major_id/so_minor_id; the reader should examine Section 2.10.4 and Section 18.35 in order to avoid acting on falsely matching Server Owner values.

The considerations for generating a so_major_id are similar to those for generating a co_ownerid string (see Section 2.4). The consequences of two servers generating conflicting so_major_id values are less dire than they are for co_ownerid conflicts because the client can use RPCSEC_GSS to compare the authenticity of each server (see Section 2.10.4).

2.6. Security Service Negotiation

With the NFSv4.1 server potentially offering multiple security mechanisms, the client needs a method to determine or negotiate which mechanism is to be used for its communication with the server. The NFS server may have multiple points within its file system namespace that are available for use by NFS clients. These points can be considered security policy boundaries, and in some NFS implementations are tied to NFS export points. In turn the NFS server may be configured such that each of these security policy boundaries may have different or multiple security mechanisms in use.

The security negotiation between client and server must be done with a secure channel to eliminate the possibility of a third party intercepting the negotiation sequence and forcing the client and server to choose a lower level of security than required or desired. See Section 21 for further discussion.

2.6.1. NFSv4.1 Security Tuples

An NFS server can assign one or more "security tuples" to each security policy boundary in its namespace. Each security tuple consists of a security flavor (see Section 2.2.1.1), and if the flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of protection, and an RPCSEC_GSS service.

2.6.2. SECINFO and SECINFO_NO_NAME

The SECINFO and SECINFO_NO_NAME operations allow the client to determine, on a per filehandle basis, what security tuple is to be used for server access. In general, the client will not have to use either operation except during initial communication with the server or when the client crosses security policy boundaries at the server. However, the server's policies may also change at any time and force the client to negotiate a new security tuple.
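
The shape of a security tuple (Section 2.6.1) can be pictured with a small Python sketch. The class and function names are assumptions made for this illustration, the RPCSEC_GSS service is represented as a plain string, and choosing the first server-listed tuple that the client supports is only one possible client policy, not something this specification mandates.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

# ONC RPC authentication flavor numbers.
AUTH_NONE, AUTH_SYS, RPCSEC_GSS = 0, 1, 6


@dataclass(frozen=True)
class SecurityTuple:
    """A flavor, plus OID, QOP, and service when the flavor is RPCSEC_GSS."""
    flavor: int
    oid: Optional[str] = None       # GSS-API mechanism OID
    qop: Optional[int] = None       # GSS-API quality of protection
    service: Optional[str] = None   # "none", "integrity", or "privacy"

    def __post_init__(self):
        gss_fields = (self.oid, self.qop, self.service)
        if self.flavor == RPCSEC_GSS:
            if any(f is None for f in gss_fields):
                raise ValueError("RPCSEC_GSS tuple needs OID, QOP, and service")
        elif any(f is not None for f in gss_fields):
            raise ValueError("GSS fields are only meaningful for RPCSEC_GSS")


def first_usable(server_tuples: Iterable[SecurityTuple],
                 client_supported: Set[SecurityTuple]) -> Optional[SecurityTuple]:
    """Return the first tuple, in the server's order, that the client supports."""
    for candidate in server_tuples:
        if candidate in client_supported:
            return candidate
    return None
```

For example, given a server list of a Kerberos V5 integrity tuple followed by an AUTH_SYS tuple, a client that supports only AUTH_SYS would end up with the second entry.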
Where the use of different security tuples would affect the type of access that would be allowed if a request was sent over the same connection used for the SECINFO or SECINFO_NO_NAME operation (e.g., read-only vs. read-write access), security tuples that allow greater access should be presented first. Where the general level of access is the same and different security flavors limit the range of principals whose privileges are recognized (e.g., allowing or disallowing root access), flavors supporting the greatest range of principals should be listed first.

2.6.3. Security Error

Based on the assumption that each NFSv4.1 client and server must support a minimum set of security (i.e., LIPKEY, SPKM-3, and Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file access to the server with one of the minimal security tuples. During communication with the server, the client may receive an NFS error of NFS4ERR_WRONGSEC. This error allows the server to notify the client that the security tuple currently being used contravenes the server's security policy. The client is then responsible for determining (see Section 2.6.3.1) what security tuples are available at the server and choosing one which is appropriate for the client.

2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME

This section explains the mechanics of NFSv4.1 security negotiation.

2.6.3.1.1. Put Filehandle Operations

The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH. Each of the subsections herein describes how the server handles a subseries of operations that starts with a put filehandle operation.

2.6.3.1.1.1. Put Filehandle Operation + SAVEFH

The client is saving a filehandle for a future RESTOREFH, LINK, or RENAME. SAVEFH MUST NOT return NFS4ERR_WRONGSEC.
To determine whether the put filehandle operation returns NFS4ERR_WRONGSEC or not, the server implementation pretends SAVEFH is not in the series of operations and examines which of the situations described in the other subsections of Section 2.6.3.1.1 apply.

2.6.3.1.1.2. Two or More Put Filehandle Operations

For a series of N put filehandle operations, the server MUST NOT return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations. The N'th put filehandle operation is handled as if it is the first in a subseries of operations. For example, if the server received PUTFH, PUTROOTFH, LOOKUP, then the PUTFH is ignored for NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is processed according to Section 2.6.3.1.1.3.

2.6.3.1.1.3. Put Filehandle Operation + LOOKUP (or OPEN of an Existing Name)

This situation also applies to a put filehandle operation followed by a LOOKUP or an OPEN operation that specifies an existing component name.

In this situation, the client is potentially crossing a security policy boundary, and the set of security tuples the parent directory supports may differ from those of the child. The server implementation may decide whether to impose any restrictions on security policy administration. There are at least three approaches (sec_policy_child is the tuple set of the child export, sec_policy_parent is that of the parent).

a) sec_policy_child <= sec_policy_parent (<= for subset). This means that the set of security tuples specified on the security policy of a child directory is always a subset of that of its parent directory.

b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, {} for the empty set). This means that the set of security tuples specified on the security policy of a child directory always has a non-empty intersection with that of the parent.

c) sec_policy_child ^ sec_policy_parent == {}.
This means that the set of tuples specified on the security policy of a child directory may not intersect with that of the parent. In other words, there are no restrictions on how the system administrator may set up these tuples.

In order for a server to support approaches (b) (for the case when a client chooses a flavor that is not a member of sec_policy_parent) and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC when there is a security tuple mismatch. Instead, it should be returned from the LOOKUP (or OPEN by existing component name) that follows.

Since the above guideline does not contradict approach (a), it should be followed in general. Even if approach (a) is implemented, it is possible for the security tuple used to be acceptable for the target of LOOKUP but not for the filehandles used in the put filehandle operation. The put filehandle operation could be a PUTROOTFH or PUTPUBFH, where the client cannot know the security tuples for the root or public filehandle. Or the security policy for the filehandle used by the put filehandle operation could have changed since the time the filehandle was obtained.

Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to the put filehandle operation if the operation is immediately followed by a LOOKUP or an OPEN by component name.

2.6.3.1.1.4. Put Filehandle Operation + LOOKUPP

Since SECINFO only works its way down, there is no way LOOKUPP can return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME solves this issue via style SECINFO_STYLE4_PARENT, which works in the opposite direction to SECINFO. As with Section 2.6.3.1.1.3, a put filehandle operation that is followed by a LOOKUPP MUST NOT return NFS4ERR_WRONGSEC.
If the server does not support SECINFO_NO_NAME, the client's only recourse is to send the put filehandle operation, LOOKUPP, GETFH sequence of operations with every security tuple it supports.

Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle operation if the operation is immediately followed by a LOOKUPP.

2.6.3.1.1.5. Put Filehandle Operation + SECINFO/SECINFO_NO_NAME

A security-sensitive client is allowed to choose a strong security tuple when querying a server to determine a file object's permitted security tuples. The security tuple chosen by the client does not have to be included in the tuple list of the security policy of either the parent directory indicated in the put filehandle operation, or the child file object indicated in SECINFO (or any parent directory indicated in SECINFO_NO_NAME). Of course the server has to be configured for whatever security tuple the client selects; otherwise, the request will fail at the RPC layer with an appropriate authentication error.

In theory, there is no connection between the security flavor used by SECINFO or SECINFO_NO_NAME and those supported by the security policy. But in practice, the client may start looking for strong flavors from those supported by the security policy, followed by those in the REQUIRED set.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put filehandle operation that is immediately followed by SECINFO or SECINFO_NO_NAME. The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC from SECINFO or SECINFO_NO_NAME.

2.6.3.1.1.6. Put Filehandle Operation + Nothing

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.1.7. Put Filehandle Operation + Anything Else

"Anything Else" includes OPEN by filehandle.

The security policy enforcement applies to the filehandle specified in the put filehandle operation.
Therefore the put filehandle operation must return NFS4ERR_WRONGSEC when there is a security tuple mismatch. This avoids the complexity of adding NFS4ERR_WRONGSEC as an allowable error to every other operation.

A COMPOUND containing the series put filehandle operation + SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way for the client to recover from NFS4ERR_WRONGSEC.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by component name).

2.6.3.1.1.8. Operations after SECINFO and SECINFO_NO_NAME

Suppose a client sends a COMPOUND procedure containing the series SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security tuple used does not match that required for the target file. By rule (see Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can return NFS4ERR_WRONGSEC. By rule (see Section 2.6.3.1.1.7), READ cannot return NFS4ERR_WRONGSEC. The issue is resolved by the fact that SECINFO and SECINFO_NO_NAME consume the current filehandle (note that this is a change from NFSv4.0). This leaves no current filehandle for READ to use, and READ returns NFS4ERR_NOFILEHANDLE.

2.6.3.1.2. LINK and RENAME

The LINK and RENAME operations use both the current and saved filehandles. When the current filehandle is injected into a series of operations via a put filehandle operation, the server MUST return NFS4ERR_WRONGSEC, per Section 2.6.3.1.1. LINK and RENAME MAY return NFS4ERR_WRONGSEC if the security policy of the saved filehandle rejects the security flavor used in the COMPOUND request's credentials. If the server does so, then if there is no intersection between the security policies of saved and current filehandles, this means it will be impossible for the client to perform the intended LINK or RENAME operation.
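
The look-ahead rules of Section 2.6.3.1.1, which decide whether a put filehandle operation itself reports a security tuple mismatch, can be summarized in a short Python sketch. This is an illustration under stated assumptions: operations are represented as plain strings, the labels "OPEN_BY_NAME" and "OPEN_BY_FH" are invented here to distinguish the two forms of OPEN, and the function name is hypothetical.

```python
PUT_FH_OPS = {"PUTROOTFH", "PUTPUBFH", "PUTFH", "RESTOREFH"}

# Followers for which the put filehandle operation MUST NOT return
# NFS4ERR_WRONGSEC (Sections 2.6.3.1.1.3 through 2.6.3.1.1.5).
DEFERRING_FOLLOWERS = {"LOOKUP", "OPEN_BY_NAME", "LOOKUPP",
                       "SECINFO", "SECINFO_NO_NAME"}


def put_fh_reports_mismatch(ops, index):
    """Return True if a tuple mismatch is reported by the put filehandle
    operation at ops[index], and False if that operation must not
    return NFS4ERR_WRONGSEC."""
    if ops[index] not in PUT_FH_OPS:
        raise ValueError("ops[index] is not a put filehandle operation")

    # Pretend that any SAVEFH directly after the operation is absent.
    nxt = index + 1
    while nxt < len(ops) and ops[nxt] == "SAVEFH":
        nxt += 1

    if nxt == len(ops):
        return False            # put filehandle operation + nothing
    follower = ops[nxt]
    if follower in PUT_FH_OPS:
        return False            # only the last of a run is examined
    if follower in DEFERRING_FOLLOWERS:
        return False            # the follower, if anything, reports it
    return True                 # put filehandle operation + anything else
```

For the series SEQUENCE, PUTFH, SAVEFH, PUTFH, RENAME, the sketch reports a mismatch only at the second PUTFH, since the first is followed (ignoring SAVEFH) by another put filehandle operation.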
For example, suppose the client sends this COMPOUND request: SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where filehandles bFH and aFH refer to different directories. Suppose no common security tuple exists between the security policies of aFH and bFH. If the client sends the request using credentials acceptable to bFH's security policy but not aFH's policy, then the PUTFH aFH operation will fail with NFS4ERR_WRONGSEC. After a SECINFO_NO_NAME request, the client sends SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", using credentials acceptable to aFH's security policy, but not bFH's policy. The server returns NFS4ERR_WRONGSEC on the RENAME operation.

To prevent a client from entering an endless sequence of a request containing LINK or RENAME, followed by a request containing SECINFO_NO_NAME, the server MUST detect when the security policies of the current and saved filehandles have no mutually acceptable security tuple, and MUST NOT return NFS4ERR_WRONGSEC in that situation. Instead the server MUST return NFS4ERR_XDEV.

Thus while a server MAY return NFS4ERR_WRONGSEC from LINK and RENAME, the server implementor may reasonably decide the consequences are not worth the security benefits, and so allow the security policy of the current filehandle to override that of the saved filehandle.

2.7. Minor Versioning

To address the requirement of an NFS protocol that can evolve as the need arises, the NFSv4.1 protocol contains the rules and framework to allow for future minor changes or versioning.

The base assumption with respect to minor versioning is that any future accepted minor version must follow the IETF process and be documented in a standards track RFC. Therefore, each minor version number will correspond to an RFC. Minor version zero of the NFSv4 protocol is represented by [21], and minor version one is represented by this document [[Comment.1: RFC Editor: change "document" to "RFC" when we publish]].
The COMPOUND and CB_COMPOUND procedures support the encoding of the minor version being requested by the client.

The following items represent the basic rules for the development of minor versions. Note that a future minor version may decide to modify or add to the following rules as part of the minor version definition.

1. Procedures are not added or deleted. To maintain the general RPC model, NFSv4 minor versions will not add to or delete procedures from the NFS program.

2. Minor versions may add operations to the COMPOUND and CB_COMPOUND procedures. The addition of operations to the COMPOUND and CB_COMPOUND procedures does not affect the RPC model.

   * Minor versions may append attributes to the bitmap4 that represents sets of attributes and the fattr4 that represents sets of attribute values. This allows for the expansion of the attribute model to allow for future growth or adaptation.

   * Minor version X must append any new attributes after the last documented attribute. Since attribute results are specified as an opaque array of per-attribute XDR encoded results, the complexity of adding new attributes in the midst of the current definitions would be too burdensome.

3. Minor versions must not modify the structure of an existing operation's arguments or results. Again the complexity of handling multiple structure definitions for a single operation is too burdensome. New operations should be added instead of modifying existing structures for a minor version. This rule does not preclude the following adaptations in a minor version.

   * adding bits to flag fields such as new attributes to GETATTR's bitmap4 data type and providing corresponding variants of opaque arrays, such as a notify4 used together with such bitmaps.

   * adding bits to existing attributes like ACLs that have flag words

   * extending enumerated types (including NFS4ERR_*) with new values

   * adding cases to a switched union

4. Minor versions may not modify the structure of existing attributes.

5. Minor versions may not delete operations. This prevents the potential reuse of a particular operation "slot" in a future minor version.

6. Minor versions may not delete attributes.

7. Minor versions may not delete flag bits or enumeration values.

8. Minor versions may declare an operation MUST NOT be implemented. Specifying an operation MUST NOT be implemented is equivalent to obsoleting an operation. For the client, it means that the operation should not be sent to the server. For the server, an NFS error can be returned as opposed to "dropping" the request as an XDR decode error. This approach allows for the obsolescence of an operation while maintaining its structure so that a future minor version can reintroduce the operation.

   1. Minor versions may declare an attribute MUST NOT be implemented.

   2. Minor versions may declare a flag bit or enumeration value MUST NOT be implemented.

9. Minor versions may downgrade features from REQUIRED to RECOMMENDED, or RECOMMENDED to OPTIONAL.

10. Minor versions may upgrade features from OPTIONAL to RECOMMENDED or RECOMMENDED to REQUIRED.

11. A client and server that support minor version X should support minor versions 0 (zero) through X-1 as well.

12. Except for infrastructural changes, no new features may be introduced as REQUIRED in a minor version. This rule allows for the introduction of new functionality and forces the use of implementation experience before designating a feature as REQUIRED. On the other hand, some classes of features are infrastructural and have broad effects. Allowing such features to not be REQUIRED complicates implementation of the minor version.

13. A client MUST NOT attempt to use a stateid, filehandle, or similar returned object from the COMPOUND procedure with minor version X for another COMPOUND procedure with minor version Y, where X != Y.

2.8. Non-RPC-based Security Services

As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for identification, authentication, integrity, and privacy. NFSv4.1 itself provides or enables additional security services as described in the next several subsections.

2.8.1. Authorization

Authorization to access a file object via an NFSv4.1 operation is ultimately determined by the NFSv4.1 server. A client can predetermine its access to a file object via the OPEN (Section 18.16) and the ACCESS (Section 18.1) operations.

Principals with appropriate access rights can modify the authorization on a file object via the SETATTR (Section 18.30) operation. Attributes that affect access rights include: mode, owner, owner_group, acl, dacl, and sacl. See Section 5.

2.8.2. Auditing

NFSv4.1 provides auditing on a per file object basis, via the acl and sacl attributes as described in Section 6. It is outside the scope of this specification to specify audit log formats or management policies.

2.8.3. Intrusion Detection

NFSv4.1 provides alarm control on a per file object basis, via the acl and sacl attributes as described in Section 6. Alarms may serve as the basis for intrusion detection. It is outside the scope of this specification to specify heuristics for detecting intrusion via alarms.

2.9. Transport Layers

2.9.1. REQUIRED and RECOMMENDED Properties of Transports

NFSv4.1 works over RDMA and non-RDMA-based transports with the following attributes:

o The transport supports reliable delivery of data, which NFSv4.1 requires but neither NFSv4.1 nor RPC has facilities for ensuring [24].

o The transport delivers data in the order it was sent.
Ordered delivery simplifies detection of transmit errors, and simplifies the sending of arbitrary sized requests and responses, via the record marking protocol [3].

Where an NFSv4.1 implementation supports operation over the IP network protocol, any transport used between NFS and IP MUST be among the IETF-approved congestion control transport protocols. At the time this document was written, the only two transports that had the above attributes were TCP and SCTP. To enhance the possibilities for interoperability, an NFSv4.1 implementation MUST support operation over the TCP transport protocol.

Even if NFSv4.1 is used over a non-IP network protocol, it is RECOMMENDED that the transport support congestion control.

It is permissible for a connectionless transport to be used under NFSv4.1; however, reliable and in-order delivery of data by the connectionless transport is REQUIRED. NFSv4.1 assumes that a client transport address and server transport address used to send data over a transport together constitute a connection, even if the underlying transport eschews the concept of a connection.

2.9.2. Client and Server Transport Behavior

If a connection-oriented transport (e.g., TCP) is used, the client and server SHOULD use long lived connections for at least three reasons:

1. This will prevent the weakening of the transport's congestion control mechanisms via short lived connections.

2. This will improve performance for the WAN environment by eliminating the need for connection setup handshakes.

3. The NFSv4.1 callback model differs from NFSv4.0, and requires the client and server to maintain a client-created backchannel (see Section 2.10.3.1) for the server to use.

In order to reduce congestion, if a connection-oriented transport is used, and the request is not the NULL procedure,

o A requester MUST NOT retry a request unless the connection the request was sent over was lost before the reply was received.

o A replier MUST NOT silently drop a request, even if the request is a retry. (The silent drop behavior of RPCSEC_GSS [4] does not apply because this behavior happens at the RPCSEC_GSS layer, a lower layer in the request processing). Instead, the replier SHOULD return an appropriate error (see Section 2.10.5.1) or it MAY disconnect the connection.

When sending a reply, the replier MUST send the reply to the same full network address (e.g., if using an IP-based transport, the source port of the requester is part of the full network address) that the requester sent the request from. If using a connection-oriented transport, replies MUST be sent on the same connection the request was received from.

If a connection is dropped after the replier receives the request but before the replier sends the reply, the replier might have a pending reply. If a connection is established with the same source and destination full network address as the dropped connection, then the replier MUST NOT send the reply until the client retries the request. The reason for this prohibition is that the client MAY retry a request over a different connection than is associated with the session.

When using RDMA transports there are other reasons for not tolerating retries over the same connection:

o RDMA transports use "credits" to enforce flow control, where a credit is a right to a peer to transmit a message. If one peer were to retransmit a request (or reply), it would consume an additional credit. If the replier retransmitted a reply, it would certainly result in an RDMA connection loss, since the requester would typically only post a single receive buffer for each request. If the requester retransmitted a request, the additional credit consumed on the server might lead to RDMA connection failure unless the client accounted for it and decreased its available credit, leading to wasted resources.
o RDMA credits present a new issue to the reply cache in NFSv4.1. The reply cache may be used when a connection within a session is lost, such as after the client reconnects. Credit information is a dynamic property of the RDMA connection, and stale values must not be replayed from the cache. This implies that the reply cache contents must not be blindly used when replies are sent from it, and credit information appropriate to the channel must be refreshed by the RPC layer.

In addition, as described in Section 2.10.5.2, while a session is active, the NFSv4.1 requester MUST NOT stop waiting for a reply.

2.9.3. Ports

Historically, NFSv3 servers have listened over TCP port 2049. The registered port 2049 [25] for the NFS protocol should be the default configuration. NFSv4.1 clients SHOULD NOT use the RPC binding protocols as described in [26].

2.10. Session

2.10.1. Motivation and Overview

Previous versions and minor versions of NFS have suffered from the following:

o Lack of support for Exactly Once Semantics (EOS). This includes lack of support for EOS through server failure and recovery.

o Limited callback support, including no support for sending callbacks through firewalls, and races between replies to normal requests and callbacks.

o Limited trunking over multiple network paths.

o Requiring machine credentials for fully secure operation.

Through the introduction of a session, NFSv4.1 addresses the above shortfalls with practical solutions:

o EOS is enabled by a reply cache with a bounded size, making it feasible to keep the cache in persistent storage and enable EOS through server failure and recovery. One reason that previous revisions of NFS did not support EOS was that some EOS approaches often limited parallelism. As will be explained in Section 2.10.5, NFSv4.1 supports both EOS and unlimited parallelism.
o The NFSv4.1 client (defined in Section 1.5, Paragraph 2) creates transport connections and provides them to the server to use for sending callback requests, thus solving the firewall issue (Section 18.34). Races between responses from client requests and callbacks caused by the requests are detected via the session's sequencing properties, which are a consequence of EOS (Section 2.10.5.3).

o The NFSv4.1 client can add an arbitrary number of connections to the session, and thus provide trunking (Section 2.10.4).

o The NFSv4.1 client and server produce a session key independent of client and server machine credentials which can be used to compute a digest for protecting critical session management operations (Section 2.10.7.3).

o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for use by the session's backchannel that do not require the server to authenticate to a client machine principal (Section 2.10.7.2).

A session is a dynamically created, long-lived server object created by a client, used over time from one or more transport connections. Its function is to maintain the server's state relative to the connection(s) belonging to a client instance. This state is entirely independent of the connection itself, and indeed the state exists whether the connection exists or not. A client may have one or more sessions associated with it so that client-associated state may be accessed using any of the sessions associated with that client's client ID, when connections are associated with those sessions. When no connections are associated with any of a client ID's sessions for an extended time, such objects as locks, opens, delegations, layouts, etc. are subject to expiration. The session serves as an object representing a means of access by a client to the associated client state on the server, independent of the physical means of access to that state.
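
The relationship between a client ID, its sessions, and the connections described above can be pictured with a toy Python model. All names here (ClientId, Session, lookup_state, no_connections) are assumptions made for illustration; lease timing and the actual expiration policy are omitted.

```python
class ClientId:
    """Toy stand-in for the server's per-client record."""

    def __init__(self, clientid):
        self.clientid = clientid
        self.state = {}      # state identifier -> description; tied to the client ID
        self.sessions = {}   # session identifier -> Session


class Session:
    """A session belongs to exactly one client ID."""

    def __init__(self, sessionid, client):
        self.sessionid = sessionid
        self.client = client
        self.connections = set()   # may be empty; the session still exists
        client.sessions[sessionid] = self

    def lookup_state(self, key):
        # Client state is reachable through any session of the client ID.
        return self.client.state.get(key)


def no_connections(client):
    """True when none of the client ID's sessions has a connection, the
    condition under which locks, opens, and similar state become subject
    to expiration if it persists for an extended time."""
    return not any(s.connections for s in client.sessions.values())
```

In this model, state recorded while using one session is visible through another session of the same client ID, and dropping every connection leaves both the sessions and the state in place.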

A single client may create multiple sessions. A single session MUST NOT serve multiple clients.

2.10.2. NFSv4 Integration

Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major infrastructure change such as sessions would require a new major version number to an ONC RPC program like NFS. However, because NFSv4 encapsulates its functionality in a single procedure, COMPOUND, and because COMPOUND can support an arbitrary number of operations, sessions have been added to NFSv4.1 with little difficulty. COMPOUND includes a minor version number field, and for NFSv4.1 this minor version is set to 1. When the NFSv4 server processes a COMPOUND with the minor version set to 1, it expects a different set of operations than it does for NFSv4.0. NFSv4.1 defines the SEQUENCE operation, which is required for every COMPOUND that operates over an established session, with the exception of some session administration operations, such as DESTROY_SESSION (Section 18.37).

2.10.2.1. SEQUENCE and CB_SEQUENCE

In NFSv4.1, when the SEQUENCE operation is present, it MUST be the first operation in the COMPOUND procedure. The primary purpose of SEQUENCE is to carry the session identifier. The session identifier associates all other operations in the COMPOUND procedure with a particular session. SEQUENCE also contains required information for maintaining EOS (see Section 2.10.5). Session-enabled NFSv4.1 COMPOUND requests thus have the form:

   +-----+--------------+-----------+------------+-----------+----
   | tag | minorversion | numops    |SEQUENCE op | op + args | ...
   |     |   (== 1)     | (limited) |  + args    |           |
   +-----+--------------+-----------+------------+-----------+----

and the reply's structure is:

   +------------+-----+--------+-------------------------------+--//
   |last status | tag | numres |status + SEQUENCE op + results |  //
   +------------+-----+--------+-------------------------------+--//
           //-----------------------+----
           // status + op + results | ...
           //-----------------------+----

A CB_COMPOUND procedure request and reply have a similar form to COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE operation. CB_COMPOUND also has an additional field called "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored by the client. CB_SEQUENCE has the same information as SEQUENCE, and also includes other information needed to resolve callback races (Section 2.10.5.3).

2.10.2.2. Client ID and Session Association

Each client ID (Section 2.4) can have zero or more active sessions. A client ID and associated session are required to perform file access in NFSv4.1. Each time a session is used (whether by a client sending a request to the server, or the client replying to a callback request from the server), the state leased to its associated client ID is automatically renewed.

State such as share reservations, locks, delegations, and layouts (Section 1.6.4) is tied to the client ID. Client state is not tied to any individual session. Successive state changing operations from a given state owner MAY go over different sessions, provided the session is associated with the same client ID. A callback MAY arrive over a different session than the session that originally acquired the state pertaining to the callback. For example, if session A is used to acquire a delegation, a request to recall the delegation MAY arrive over session B if both sessions are associated with the same client ID. Section 2.10.7.1 and Section 2.10.7.2 discuss the security considerations around callbacks.

2.10.3. Channels

A channel is not a connection. A channel represents the direction ONC RPC requests are sent.

Each session has one or two channels: the fore channel and the backchannel. Because there are at most two channels per session, and because each channel has a distinct purpose, channels are not assigned identifiers.

The fore channel is used for ordinary requests from the client to the server, and carries COMPOUND requests and responses. A session always has a fore channel.

The backchannel is used for callback requests from server to client, and carries CB_COMPOUND requests and responses. Whether there is a backchannel is a decision made by the client; however, many features of NFSv4.1 require a backchannel. NFSv4.1 servers MUST support backchannels.

Each session has resources for each channel, including separate reply caches (see Section 2.10.5.1). Note that even the backchannel requires a reply cache because some callback operations are nonidempotent.

2.10.3.1. Association of Connections, Channels, and Sessions

Each channel is associated with zero or more transport connections. A connection can be associated with one channel or both channels of a session; the client and server negotiate whether a connection will carry traffic for one channel or both channels via the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION (Section 18.34) operations. When a session is created via CREATE_SESSION, the connection that transported the CREATE_SESSION request is automatically associated with the fore channel, and optionally the backchannel. If the client specifies no state protection (Section 18.35) when the session is created, then when SEQUENCE is transmitted on a different connection, the connection is automatically associated with the fore channel of the session specified in the SEQUENCE operation.

A connection's association with a session is not exclusive.
A connection associated with the channel(s) of one session may be simultaneously associated with the channel(s) of other sessions, including sessions associated with other client IDs.

It is permissible for connections of multiple transport types to be associated with the same channel. For example, both a TCP and an RDMA connection can be associated with the fore channel. In the event an RDMA and non-RDMA connection are associated with the same channel, the maximum number of slots SHOULD be at least one more than the total number of RDMA credits (Section 2.10.5.1). This way, if all RDMA credits are used, the non-RDMA connection can have at least one outstanding request. If a server supports multiple transport types, it MUST allow a client to associate connections from each transport to a channel.

It is permissible for a connection of one type of transport to be associated with the fore channel, and a connection of a different type to be associated with the backchannel.

2.10.4. Trunking

Trunking is the use of multiple connections between a client and server in order to increase the speed of data transfer. NFSv4.1 supports two types of trunking: session trunking and client ID trunking. NFSv4.1 servers MUST support trunking.

Session trunking is essentially the association of multiple connections, each with potentially different target and/or source network addresses, to the same session.

Client ID trunking is the association of multiple sessions to the same client ID, major server owner ID (Section 2.5), and server scope (Section 11.7.7). When two servers return the same major server owner and server scope, it means the two servers are cooperating on locking state management, which is a prerequisite for client ID trunking.

Understanding and distinguishing session and client ID trunking requires understanding how the results of the EXCHANGE_ID (Section 18.35) operation identify a server.
Suppose a client sends EXCHANGE_ID over two different connections, each with a possibly different target network address, but each EXCHANGE_ID with the same value in the eia_clientowner field. If the same NFSv4.1 server is listening over each connection, then each EXCHANGE_ID result MUST return the same values of eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. The client can then treat each connection as referring to the same server (subject to verification; see Paragraph 5 later in this section), and it can use each connection to trunk requests and replies. The question is whether session trunking and/or client ID trunking applies.

Session Trunking

If the eia_clientowner argument is the same in two different EXCHANGE_ID requests, and the eir_clientid, eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and eir_server_scope results match in both EXCHANGE_ID results, then the client is permitted to perform session trunking. If the client has no session mapping to the tuple of eir_clientid, eir_server_owner.so_major_id, eir_server_scope, eir_server_owner.so_minor_id, then it creates the session via a CREATE_SESSION operation over one of the connections, which associates the connection to the session. If there is a session for the tuple, the client can send BIND_CONN_TO_SESSION to associate the connection to the session. (Of course, if the client does not want to use session trunking, it can invoke CREATE_SESSION on the connection. This will result in client ID trunking as described below.)

Client ID Trunking

If the eia_clientowner argument is the same in two different EXCHANGE_ID requests, and the eir_clientid, eir_server_owner.so_major_id, and eir_server_scope results match in both EXCHANGE_ID results, but the eir_server_owner.so_minor_id results do not match, then the client is permitted to perform client ID trunking.
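
The two matching rules above can be sketched as a small classification function. This is a minimal illustration, not part of the protocol: the ExchangeIdResult class, the Trunking enumeration, and the classify_trunking function are hypothetical names, the eia_clientowner value is simplified to an opaque byte string, and the sketch covers only the comparison of EXCHANGE_ID results, not the verification of the servers' claims.

```python
from dataclasses import dataclass
from enum import Enum


class Trunking(Enum):
    NONE = "none"
    CLIENT_ID = "client ID trunking"
    SESSION = "session trunking"


@dataclass(frozen=True)
class ExchangeIdResult:
    # Hypothetical container; field names mirror the EXCHANGE_ID
    # results named in the text.
    eir_clientid: int
    so_major_id: bytes
    so_minor_id: int
    eir_server_scope: bytes


def classify_trunking(owner_a: bytes, res_a: ExchangeIdResult,
                      owner_b: bytes, res_b: ExchangeIdResult) -> Trunking:
    """Return the kind of trunking two EXCHANGE_ID exchanges permit."""
    # Both requests must carry the same eia_clientowner value.
    if owner_a != owner_b:
        return Trunking.NONE
    # Client ID, major server owner ID, and server scope must all match.
    same_server = (
        res_a.eir_clientid == res_b.eir_clientid
        and res_a.so_major_id == res_b.so_major_id
        and res_a.eir_server_scope == res_b.eir_server_scope
    )
    if not same_server:
        return Trunking.NONE
    # A matching minor ID additionally permits session trunking.
    if res_a.so_minor_id == res_b.so_minor_id:
        return Trunking.SESSION
    return Trunking.CLIENT_ID
```

A result of Trunking.SESSION means session trunking is permitted; as the parenthetical note under Session Trunking says, the client may still choose to invoke CREATE_SESSION on the connection instead.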
The client can associate each connection with different sessions, where each session is associated with the same server. Of course, even if the eir_server_owner.so_minor_id fields do match, the client is free to employ client ID trunking instead of session trunking.

The client completes the act of client ID trunking by invoking CREATE_SESSION on each connection, using the same client ID that was returned in eir_clientid. These invocations create two sessions and also associate each connection with its respective session.

When doing client ID trunking, locking state is shared across sessions associated with the same client ID. This requires the server to coordinate state across sessions.

When two servers over two connections claim matching or partially matching eir_server_owner, eir_server_scope, and eir_clientid values, the client does not have to trust the servers' claims. The client may verify these claims before trunking traffic in the following ways:

o For session trunking, clients SHOULD reliably verify whether connections between different network paths are in fact associated with the same NFSv4.1 server and usable on the same session, and servers MUST allow clients to perform reliable verification. When a client ID is created, the client SHOULD specify that BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or SP4_MACH_CRED (Section 18.35) state protection options. For SP4_SSV, reliable verification depends on a shared secret (the SSV) that is established via the SET_SSV (Section 18.47) operation. When a new connection is associated with the session (via the BIND_CONN_TO_SESSION operation, see Section 18.34), if the client specified SP4_SSV state protection for the BIND_CONN_TO_SESSION operation, the client MUST send the BIND_CONN_TO_SESSION with RPCSEC_GSS protection, using integrity or privacy, and an RPCSEC_GSS handle created with the GSS SSV mechanism (Section 2.10.8).
If the client mistakenly tries to associate a connection to a session of a wrong server, the server will either reject the attempt because it is not aware of the session identifier of the BIND_CONN_TO_SESSION arguments, or it will reject the attempt because the RPCSEC_GSS authentication fails. Even if the server mistakenly or maliciously accepts the connection association attempt, the RPCSEC_GSS verifier it computes in the response will not pass verification by the client, so the client will know it cannot use the connection for trunking the specified session.

If the client specified SP4_MACH_CRED state protection, the BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or privacy, using the same credential that was used when the client ID was created. Mutual authentication via RPCSEC_GSS assures the client that the connection is associated with the correct session of the correct server.

o For client ID trunking, the client has at least two options for verifying that the same client ID obtained from two different EXCHANGE_ID operations came from the same server. The first option is to use RPCSEC_GSS authentication when issuing each EXCHANGE_ID. Each time an EXCHANGE_ID is sent with RPCSEC_GSS authentication, the client notes the principal name of the GSS target. If the EXCHANGE_ID results indicate client ID trunking is possible, and the GSS targets' principal names are the same, the servers are the same and client ID trunking is allowed.

The second option for verification is to use SP4_SSV protection. When the client sends EXCHANGE_ID, it specifies SP4_SSV protection. The first EXCHANGE_ID the client sends always has to be confirmed by a CREATE_SESSION call. The client then sends SET_SSV. Later, the client sends EXCHANGE_ID to a second destination network address, different from the one to which the first EXCHANGE_ID was sent.
The client checks that each EXCHANGE_ID reply has the same eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. If so, the client verifies the claim by issuing a CREATE_SESSION to the second destination address, protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle returned by the second EXCHANGE_ID. If the server accepts the CREATE_SESSION request, and if the client verifies the RPCSEC_GSS verifier and integrity codes, then the client has proof the second server knows the SSV, and thus the two servers are the same for the purposes of client ID trunking.

2.10.5. Exactly Once Semantics

Via the session, NFSv4.1 offers Exactly Once Semantics (EOS) for requests sent over a channel. EOS is supported on both the fore channel and the backchannel.

Each COMPOUND or CB_COMPOUND request that is sent with a leading SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver exactly once. This requirement holds regardless of whether the request is sent with reply caching specified (see Section 2.10.5.1.3). The requirement holds even if the requester is issuing the request over a session created between a pNFS data client and pNFS data server. To understand the rationale for this requirement, divide the requests into three classifications:

o Nonidempotent requests.

o Idempotent modifying requests.

o Idempotent non-modifying requests.

An example of a nonidempotent request is RENAME. It is obvious that if a replier executes the same RENAME request twice, and the first execution succeeds, the re-execution will fail. If the replier returns the result from the re-execution, this result is incorrect. Therefore, EOS is required for nonidempotent requests.

An example of an idempotent modifying request is a COMPOUND request containing a WRITE operation. Repeated execution of the same WRITE has the same effect as execution of that WRITE a single time. Nevertheless, enforcing EOS for WRITEs and other idempotent modifying requests is necessary to avoid data corruption.

Suppose a client sends WRITE A to a noncompliant server that does not enforce EOS, and receives no response, perhaps due to a network partition. The client reconnects to the server and re-sends WRITE A. Now the server has two outstanding instances of A. The server can be in a situation in which it executes and replies to the retry of A, while the first A is still waiting in the server's internal I/O system for some resource. Upon receiving the reply to the second attempt of WRITE A, the client believes its write is done, so it is free to send WRITE B, which overlaps the range of A. When the original A is dispatched from the server's I/O system and executed (thus the second time A will have been written), then what has been written by B can be overwritten and thus corrupted.

An example of an idempotent non-modifying request is a COMPOUND containing SEQUENCE, PUTFH, READLINK, and nothing else. The re-execution of such a request will not cause data corruption or produce an incorrect result. Nonetheless, to keep the implementation simple, the replier MUST enforce EOS for all requests, whether idempotent and non-modifying or not.

Note that true and complete EOS is not possible unless the server persists the reply cache in stable storage or is somehow implemented to never require a restart (indeed, if such a server exists, the distinction between a reply cache kept in stable storage versus one that is not is one without meaning). See Section 2.10.5.5 for a discussion of persistence in the reply cache. Regardless, even if the server does not persist the reply cache, EOS improves robustness and correctness over previous versions of NFS because the legacy duplicate request/reply caches were based on the ONC RPC transaction identifier (XID).
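
The WRITE A / WRITE B hazard described earlier in this section can be replayed on an in-memory buffer. The following is an illustrative sketch only: apply_write and replay_without_eos are hypothetical helpers, the eight-byte buffer and the offsets are arbitrary, and the "stalled" list merely stands in for a request waiting in a server's internal I/O system.

```python
def apply_write(storage: bytearray, offset: int, data: bytes) -> None:
    """Overwrite storage[offset:offset + len(data)] in place."""
    storage[offset:offset + len(data)] = data


def replay_without_eos():
    """Replay the scenario on a server that does not enforce EOS.

    Returns (contents the client expects, contents actually stored).
    """
    storage = bytearray(b"........")
    write_a = (0, b"AAAA")
    write_b = (2, b"BBBB")  # overlaps the byte range of A

    # The original WRITE A is stalled waiting for some resource.
    stalled = [write_a]

    # The retransmitted WRITE A is executed and answered immediately.
    apply_write(storage, *write_a)

    # The client now believes A is done and sends the overlapping B.
    apply_write(storage, *write_b)
    expected = bytes(storage)

    # The stalled original A is finally dispatched and clobbers part of B.
    for offset, data in stalled:
        apply_write(storage, offset, data)

    return expected, bytes(storage)
```

In this replay the client expects the buffer to hold the result of A followed by B, but the late execution of the original A overwrites the first two bytes written by B.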
Section 2.10.5.1 explains the shortcomings of the XID as a basis for a reply cache and describes how NFSv4.1 sessions improve upon the XID.

2.10.5.1. Slot Identifiers and Reply Cache

The RPC layer provides a transaction ID (XID), which, while required to be unique, is not convenient for tracking requests for two reasons. First, the XID is only meaningful to the requester; it cannot be interpreted by the replier except to test for equality with previously sent requests. When consulting an RPC-based duplicate request cache, the opaqueness of the XID requires a computationally expensive lookup (often via a hash that includes XID and source address). NFSv4.1 requests use a non-opaque slot ID, which is an index into a slot table, which is far more efficient. Second, because RPC requests can be executed by the replier in any order, there is no bound on the number of requests that may be outstanding at any time. To achieve perfect EOS using ONC RPC would require storing all replies in the reply cache. XIDs are 32 bits; storing over four billion (2^32) replies in the reply cache is not practical. In practice, previous versions of NFS have chosen to store a fixed number of replies in the cache, and use a least recently used (LRU) approach to replacing cache entries with new entries when the cache is full. In NFSv4.1, the number of outstanding requests is bounded by the size of the slot table, and a sequence ID per slot is used to tell the replier when it is safe to delete a cached reply.

In the NFSv4.1 reply cache, when the requester sends a new request, it selects a slot ID in the range 0..N, where N is the replier's current maximum slot ID granted to the requester on the session over which the request is to be sent. The value of N starts out as equal to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the response to SEQUENCE or CB_SEQUENCE as described later in this section.
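
A requester's view of the slot ID range can be sketched as follows. This is a minimal illustration under stated assumptions: RequesterSlotTable and its methods are hypothetical names, picking the lowest free slot is an arbitrary policy chosen for the sketch, and adjust() stands in for whatever change to N a SEQUENCE or CB_SEQUENCE response conveys.

```python
class RequesterSlotTable:
    """Requester-side bookkeeping for the slot IDs 0..N of one channel."""

    def __init__(self, ca_maxrequests: int) -> None:
        if ca_maxrequests < 1:
            raise ValueError("ca_maxrequests must be at least 1")
        # N starts out as ca_maxrequests - 1.
        self.highest_slot_id = ca_maxrequests - 1
        # Slot IDs that currently have a request outstanding.
        self.in_use = set()

    def acquire(self):
        """Return the lowest slot ID in 0..N with no outstanding request,
        or None if every slot is busy (assumed policy, see lead-in)."""
        for slot_id in range(self.highest_slot_id + 1):
            if slot_id not in self.in_use:
                self.in_use.add(slot_id)
                return slot_id
        return None

    def release(self, slot_id: int) -> None:
        """Mark a slot as free once its reply has been received."""
        self.in_use.discard(slot_id)

    def adjust(self, new_highest_slot_id: int) -> None:
        """Record a new value of N (hypothetical hook for the adjustment
        carried in SEQUENCE or CB_SEQUENCE responses)."""
        self.highest_slot_id = new_highest_slot_id
```

Because the table is bounded by N, the number of requests the requester can have outstanding is bounded as well, which is the property the XID-based caches lacked.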
The slot ID must be unused by any of the requests that the requester already has active on the session. "Unused" here means the requester has no outstanding request for that slot ID.

A slot contains a sequence ID and the cached reply corresponding to the request sent with that sequence ID. The sequence ID is a 32-bit unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - 1). The first time a slot is used, the requester MUST specify a sequence ID of one (1) (Section 18.36). Each time a slot is reused, the request MUST specify a sequence ID that is one greater than that of the previous request on the slot. If th