Working Group                                                   A. Barth
Internet-Draft                                             U.C. Berkeley
Expires: December 2, 2009                                     I. Hickson
                                                            Google, Inc.
                                                            May 31, 2009

                     Content-Type Processing Model
                       draft-abarth-mime-sniff-01

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 2, 2009.

Copyright Notice

Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Barth & Hickson         Expires December 2, 2009                [Page 1]
Internet-Draft        Content-Type Processing Model             May 2009

Abstract

Many web servers supply incorrect Content-Type headers with their HTTP responses. In order to be compatible with these servers, user agents must consider the content of HTTP responses as well as the Content-Type header when determining the effective media type of the response.
This document describes an algorithm for determining the effective media type of HTTP responses that balances security and compatibility considerations.

Table of Contents

1.  Introduction
2.  Metadata
3.  Web Pages
4.  Text or Binary
5.  Unknown Type
6.  Image
7.  Feed or HTML
Authors' Addresses

1.  Introduction

The HTTP Content-Type header indicates the media type of an HTTP response. However, many HTTP servers supply a Content-Type that does not match the actual contents of the response. Historically, web browsers have tolerated these servers by examining the content of HTTP responses in addition to the Content-Type header to determine the effective media type of the response.

Without a clear specification of how to "sniff" the media type, each user agent implementor was forced to reverse engineer the behavior of the other user agents and to develop their own algorithm. These divergent algorithms have led to a lack of interoperability between user agents and to security issues when the server intends an HTTP response to be interpreted as one media type but some user agents interpret the response as another media type.

These security issues are most severe when an "honest" server lets potentially malicious users upload files and then serves the contents of those files with a low-privilege media type (such as text/plain or image/jpeg). (Malicious servers, of course, can specify an arbitrary media type in the Content-Type header.) In the absence of MIME sniffing, this user-generated content would not be interpreted as a high-privilege media type, such as text/html. However, if a user agent does interpret a low-privilege media type, such as image/gif, as a high-privilege media type, such as text/html, the user agent has created a privilege escalation vulnerability in the server. For example, a malicious user might be able to leverage content sniffing to mount a cross-site scripting attack by including JavaScript code in the uploaded file that a user agent treats as text/html.

This document describes a content sniffing algorithm that carefully balances the compatibility needs of user agent implementors with the security constraints. The algorithm has been constructed with reference to content sniffing algorithms present in popular user agents, an extensive database of existing web content, and metrics collected from implementations deployed to a sizable number of users.

WARNING! Whenever possible, user agents should avoid employing a content sniffing algorithm. However, if the user agent does employ a content sniffing algorithm, it is imperative that the algorithm in this document be followed exactly. When a user agent uses different heuristics for media type detection than the server expects, security problems can occur.
For example, if a server believes that the client will treat a contributed file as an image (and thus treat it as benign), but a user agent believes the content to be HTML (and thus privileged to execute any scripts contained therein), an attacker might be able to steal the user's authentication credentials and mount other cross-site scripting attacks.

2.  Metadata

What explicit Content-Type metadata is associated with the resource (the resource's type information) depends on the protocol that was used to fetch the resource.

For HTTP resources, only the last Content-Type HTTP header, if any, contributes any type information; the official type of the resource is then the value of that header, interpreted as described by the HTTP specifications. If the Content-Type HTTP header is present but the value of the last such header cannot be interpreted as described by the HTTP specifications (e.g. because its value doesn't contain a U+002F SOLIDUS ('/') character), then the resource has no type information (even if there are multiple Content-Type HTTP headers and one of the other ones is syntactically correct).

For resources fetched from the file system, user agents should use platform-specific conventions, e.g. operating system file extension/type mappings.

File extensions MUST NOT be used for determining resource types for resources fetched over HTTP.

For resources fetched over most other protocols, e.g. FTP, there is no type information.

The algorithm for extracting an encoding from a Content-Type, given a string s, is as follows. It either returns an encoding or nothing.

1.  Find the first seven characters in s that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing.

2.  Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the word 'charset' (there might not be any).

3.  If the next character is not a U+003D EQUALS SIGN ('='), return nothing.

4.  Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the equals sign (there might not be any).

5.  Process the next character as follows:

    *  If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022 QUOTATION MARK ('"') in s, or

    *  If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE ("'") in s

       Return the string between this character and the next earliest occurrence of this character.

    *  If it is an unmatched U+0022 QUOTATION MARK ('"'),

    *  If it is an unmatched U+0027 APOSTROPHE ("'"), or

    *  If there is no next character

       Return nothing.

    *  Otherwise

       Return the string from this character to the first U+0009, U+000A, U+000C, U+000D, U+0020, or U+003B character or the end of s, whichever comes first.

Note: The above algorithm is a willful violation of the HTTP specification. [RFC2616]

3.  Web Pages

The /sniffed type/ of a resource MUST be found as follows:

1.  Let /official type/ be the type given by the Content-Type metadata for the resource, ignoring parameters. Comparisons with this type, as defined by MIME specifications, are done in an ASCII case-insensitive manner. [RFC2046]

2.  If the user agent is configured to strictly obey Content-Type headers for this resource, then jump to the last step in this set of steps.

3.  If the resource was fetched over an HTTP protocol and there is an HTTP Content-Type header and the value of the last such header has bytes that exactly match one of the following lines:

    +-------------------------------+--------------------------------+
    | Bytes in Hexadecimal          | Textual Representation         |
    +-------------------------------+--------------------------------+
    | 74 65 78 74 2f 70 6c 61 69 6e | text/plain                     |
    +-------------------------------+--------------------------------+
    | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=ISO-8859-1 |
    | 3b 20 63 68 61 72 73 65 74 3d |                                |
    | 49 53 4f 2d 38 38 35 39 2d 31 |                                |
    +-------------------------------+--------------------------------+
    | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=iso-8859-1 |
    | 3b 20 63 68 61 72 73 65 74 3d |                                |
    | 69 73 6f 2d 38 38 35 39 2d 31 |                                |
    +-------------------------------+--------------------------------+
    | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=UTF-8      |
    | 3b 20 63 68 61 72 73 65 74 3d |                                |
    | 55 54 46 2d 38                |                                |
    +-------------------------------+--------------------------------+

    ...then jump to the "text or binary" section below.

4.  If there is no /official type/, jump to the unknown type step below.

5.  If /official type/ is "unknown/unknown", "application/unknown", or "*/*", jump to the unknown type step below.

6.  If /official type/ ends in "+xml", or if it is either "text/xml" or "application/xml", then the /sniffed type/ of the resource is /official type/; return that and abort these steps.

7.  If /official type/ is an image type supported by the user agent (e.g. "image/png", "image/gif", "image/jpeg", etc), then jump to the "images" section below, passing it the /official type/.

8.  If /official type/ is "text/html", then jump to the feed or HTML section below.

9.  The /sniffed type/ of the resource is /official type/.

4.  Text or Binary

1.  The user agent MAY wait for 512 or more bytes of the resource to be available.

2.  Let n be the smaller of either 512 or the number of bytes already available.

3.  If n is greater than or equal to 3, and the first 2 or 3 bytes of the resource match one of the following byte sequences:

    +----------------------+--------------+
    | Bytes in Hexadecimal | Description  |
    +----------------------+--------------+
    | FE FF                | UTF-16BE BOM |
    | FF FE                | UTF-16LE BOM |
    | EF BB BF             | UTF-8 BOM    |
    +----------------------+--------------+

    ...then the /sniffed type/ of the resource is "text/plain". Abort these steps.

4.  If none of the first n bytes of the resource are binary data bytes then the /sniffed type/ of the resource is "text/plain". Abort these steps.

    +-------------------------+
    | Binary Data Byte Ranges |
    +-------------------------+
    | 0x00 -- 0x08            |
    | 0x0B                    |
    | 0x0E -- 0x1A            |
    | 0x1C -- 0x1F            |
    +-------------------------+

5.  If the first bytes of the resource match one of the byte sequences in the "pattern" column of the table in the unknown type section below, ignoring any rows whose cell in the "security" column says "scriptable" (or "n/a"), then the /sniffed type/ of the resource is the type given in the corresponding cell in the "sniffed type" column on that row; abort these steps.

    WARNING! It is critical that this step not ever return a scriptable type (e.g. text/html), as otherwise that would allow a privilege escalation attack.

6.  Otherwise, the /sniffed type/ of the resource is "application/octet-stream".

5.  Unknown Type

1.  The user agent MAY wait for 512 or more bytes of the resource to be available.

2.  Let /stream length/ be the smaller of either 512 or the number of bytes already available.

3.  For each row in the table below:

    *  If the row has no "WS" bytes:

       1.  Let /pattern length/ be the length of the pattern (number of bytes described by the cell in the second column of the row).

       2.  If /stream length/ is smaller than /pattern length/ then skip this row.

       3.  Apply the "and" operator to the first /pattern length/ bytes of the resource and the given mask (the bytes in the cell of the first column of that row), and let the result be the data.

       4.  If the bytes of the data match the given pattern bytes exactly, then the /sniffed type/ of the resource is the type given in the cell of the third column in that row; abort these steps.

    *  If the row has a "WS" byte:

       1.  Let /index pattern/ be an index into the mask and pattern byte strings of the row.

       2.  Let /index stream/ be an index into the byte stream being examined.

       3.  Loop: If /index stream/ points beyond the end of the byte stream, then this row doesn't match, skip this row.

       4.  Examine the /index stream/th byte of the byte stream as follows:

           -  If the /index pattern/th byte of the pattern is a normal hexadecimal byte and not a "WS" byte:

              If the "and" operator, applied to the /index stream/th byte of the stream and the /index pattern/th byte of the mask, yields a value different from the /index pattern/th byte of the pattern, then skip this row.

              Otherwise, increment /index pattern/ to the next byte in the mask and pattern and /index stream/ to the next byte in the byte stream.

           -  Otherwise, if the /index pattern/th byte of the pattern is a "WS" byte:

              "WS" means "whitespace", and allows insignificant whitespace to be skipped when sniffing for a type signature.

              If the /index stream/th byte of the stream is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space), then increment only the /index stream/ to the next byte in the byte stream.

              Otherwise, increment only the /index pattern/ to the next byte in the mask and pattern.

       5.  If /index pattern/ does not point beyond the end of the mask and pattern byte strings, then jump back to the loop step in this algorithm.

       6.  Otherwise, the /sniffed type/ of the resource is the type given in the cell of the third column in that row; abort these steps.

4.  If none of the first n bytes of the resource are binary data bytes then the sniffed type of the resource is "text/plain". Abort these steps.

5.  Otherwise, the sniffed type of the resource is "application/octet-stream".
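
As a rough illustration of the row-matching steps above, the following Python sketch applies the mask-and-pattern comparison, including the "WS" handling, and then falls back to the binary data byte check. It is a sketch under stated assumptions, not a conforming implementation: the names WS, ROWS, row_matches, and sniff_unknown are hypothetical, a "WS" entry is represented as None, only four illustrative rows of the table are included, and the function simply looks at whatever bytes it is given (up to 512) rather than waiting for more data to arrive.

```python
from typing import List, Optional, Tuple

WS = None  # hypothetical stand-in for a "WS" entry in a pattern
WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}
# Binary data byte ranges from the "text or binary" section.
BINARY = (set(range(0x00, 0x09)) | {0x0B}
          | set(range(0x0E, 0x1B)) | set(range(0x1C, 0x20)))

Row = Tuple[bytes, List[Optional[int]], str]

# Illustrative subset of the table only: (mask, pattern, sniffed type).
ROWS: List[Row] = [
    (bytes.fromhex("FFFFDFDFDFDF"),
     [WS, 0x3C, 0x48, 0x54, 0x4D, 0x4C], "text/html"),        # "<HTML"
    (bytes.fromhex("FFFFFFFFFF"),
     [0x25, 0x50, 0x44, 0x46, 0x2D], "application/pdf"),       # "%PDF-"
    (bytes.fromhex("FFFFFFFFFFFF"),
     [0x47, 0x49, 0x46, 0x38, 0x39, 0x61], "image/gif"),       # "GIF89a"
    (bytes.fromhex("FFFFFFFFFFFFFFFF"),
     [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A], "image/png"),
]


def row_matches(data: bytes, mask: bytes, pattern: List[Optional[int]]) -> bool:
    """Return True if data matches one table row."""
    if WS not in pattern:
        # Rows without "WS": mask the first len(pattern) bytes and compare.
        if len(data) < len(pattern):
            return False
        return all((data[i] & mask[i]) == pattern[i]
                   for i in range(len(pattern)))
    # Rows with "WS": walk the stream and the pattern with separate indexes.
    index_pattern = 0
    index_stream = 0
    while index_pattern < len(pattern):
        if index_stream >= len(data):
            return False  # ran out of stream before the pattern ended
        if pattern[index_pattern] is WS:
            if data[index_stream] in WHITESPACE:
                index_stream += 1   # skip insignificant whitespace
            else:
                index_pattern += 1  # whitespace run is over
        elif (data[index_stream] & mask[index_pattern]) == pattern[index_pattern]:
            index_pattern += 1
            index_stream += 1
        else:
            return False
    return True


def sniff_unknown(resource: bytes) -> str:
    """Sketch of the unknown type steps over the first 512 bytes."""
    data = resource[:512]
    for mask, pattern, sniffed_type in ROWS:
        if row_matches(data, mask, pattern):
            return sniffed_type
    if not any(byte in BINARY for byte in data):
        return "text/plain"
    return "application/octet-stream"
```

For instance, sniff_unknown(b"\n  <HtMl><body>") skips the leading whitespace, matches the "<HTML" row case-insensitively through the 0xDF mask, and returns "text/html", while a resource with no matching signature is classified by whether any of its leading bytes fall in the binary data byte ranges.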

The table used by the above algorithm is:

+-------------------+-------------------+-----------------+------------+
| Mask in Hex       | Pattern in Hex    | Sniffed Type    | Security   |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 21 44 4F 43 | text/html       | Scriptable |
| DF DF DF FF DF DF | 54 59 50 45 20 48 |                 |            |
| DF DF             | 54 4D 4C          |                 |            |
| Comment: "<!DOCTYPE HTML", case-insensitive, with leading spaces.    |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 48 54 4D 4C | text/html       | Scriptable |
| Comment: "<HTML", case-insensitive, with leading spaces.             |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 48 45 41 44 | text/html       | Scriptable |
| Comment: "<HEAD", case-insensitive, with leading spaces.             |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 53 43 52 49 | text/html       | Scriptable |
| DF DF             | 50 54             |                 |            |
| Comment: "<SCRIPT", case-insensitive, with leading spaces.           |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 49 46 52 41 | text/html       | Scriptable |
| DF DF             | 4D 45             |                 |            |
| Comment: "<IFRAME", case-insensitive, with leading spaces.           |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF       | WS 3C 48 31       | text/html       | Scriptable |
| Comment: "<H1", case-insensitive, with leading spaces.               |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF    | WS 3C 44 49 56    | text/html       | Scriptable |
| Comment: "<DIV", case-insensitive, with leading spaces.              |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 46 4F 4E 54 | text/html       | Scriptable |
| Comment: "<FONT", case-insensitive, with leading spaces.             |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 54 41 42 4C | text/html       | Scriptable |
| DF                | 45                |                 |            |
| Comment: "<TABLE", case-insensitive, with leading spaces.            |
+-------------------+-------------------+-----------------+------------+
| FF FF DF          | WS 3C 41          | text/html       | Scriptable |
| Comment: "<A", case-insensitive, with leading spaces.                |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 53 54 59 4C | text/html       | Scriptable |
| DF                | 45                |                 |            |
| Comment: "<STYLE", case-insensitive, with leading spaces.            |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 54 49 54 4C | text/html       | Scriptable |
| DF                | 45                |                 |            |
| Comment: "<TITLE", case-insensitive, with leading spaces.            |
+-------------------+-------------------+-----------------+------------+
| FF FF DF          | WS 3C 42          | text/html       | Scriptable |
| Comment: "<B", case-insensitive, with leading spaces.                |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF DF DF | WS 3C 42 4F 44 59 | text/html       | Scriptable |
| Comment: "<BODY", case-insensitive, with leading spaces.             |
+-------------------+-------------------+-----------------+------------+
| FF FF DF DF       | WS 3C 42 52       | text/html       | Scriptable |
| Comment: "<BR", case-insensitive, with leading spaces.               |
+-------------------+-------------------+-----------------+------------+
| FF FF DF          | WS 3C 50          | text/html       | Scriptable |
| Comment: "<P", case-insensitive, with leading spaces.                |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF    | WS 3C 21 2D 2D    | text/html       | Scriptable |
| Comment: The string "<!--", an HTML comment, with leading spaces.    |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | WS 3C 3F 78 6D 6C | text/xml        | Scriptable |
| Comment: The string "<?xml", case-sensitive, with leading spaces.    |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF    | 25 50 44 46 2D    | application/pdf | Scriptable |
| Comment: The string "%PDF-", the PDF signature.                      |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 25 21 50 53 2D 41 | application/    | Safe       |
| FF FF FF FF FF    | 64 6F 62 65 2D    | postscript      |            |
| Comment: The string "%!PS-Adobe-", the PostScript signature.         |
+-------------------+-------------------+-----------------+------------+
| FF FF 00 00       | FE FF 00 00       | text/plain      | n/a        |
| Comment: UTF-16BE BOM                                                |
+-------------------+-------------------+-----------------+------------+
| FF FF 00 00       | FF FE 00 00       | text/plain      | n/a        |
| Comment: UTF-16LE BOM                                                |
+-------------------+-------------------+-----------------+------------+
| FF FF FF 00       | EF BB BF 00       | text/plain      | n/a        |
| Comment: UTF-8 BOM                                                   |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 47 49 46 38 37 61 | image/gif       | Safe       |
| Comment: The string "GIF87a", a GIF signature.                       |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 47 49 46 38 39 61 | image/gif       | Safe       |
| Comment: The string "GIF89a", a GIF signature.                       |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 89 50 4E 47 0D 0A | image/png       | Safe       |
| FF FF             | 1A 0A             |                 |            |
| Comment: The PNG signature.                                          |
+-------------------+-------------------+-----------------+------------+
| FF FF FF          | FF D8 FF          | image/jpeg      | Safe       |
| Comment: A JPEG SOI marker followed by a byte of another marker.     |
+-------------------+-------------------+-----------------+------------+
| FF FF             | 42 4D             | image/bmp       | Safe       |
| Comment: The string "BM", a BMP signature.                           |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF       | 00 00 01 00       | image/vnd.      | Safe       |
|                   |                   | microsoft.icon  |            |
| Comment: A Windows Icon signature.                                   |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF FF FF | 52 61 72 20 1A 07 | application/    | Safe       |
| FF                | 00                | x-rar-compressed|            |
| Comment: A RAR archive.                                              |
+-------------------+-------------------+-----------------+------------+
| FF FF FF FF       | 50 4B 03 04       | application/zip | Safe       |
| Comment: A ZIP archive.                                              |
+-------------------+-------------------+-----------------+------------+
| FF FF FF          | 1F 8B 08          | application/    | Safe       |
|                   |                   | x-gzip          |            |
| Comment: A GZIP archive.                                             |
+-------------------+-------------------+-----------------+------------+
| FF FF FF          | 49 44 33          | audio/mpeg      | Safe       |
| Comment: The string "ID3", the MP3 signature.                        |
+-------------------+-------------------+-----------------+------------+

User agents may support additional types if desired, by implicitly adding to the above table.
However, user agents should not use any other patterns for types already mentioned in the table above because this could then be used for privilege escalation (where, e.g., a server uses the above table to determine that content is not HTML and thus safe from cross-site scripting attacks, but then a user agent detects it as HTML anyway and allows script to execute).

The column marked "security" is used by the algorithm in the "text or binary" section, to avoid sniffing text/plain content as a type that can be used for a privilege escalation attack.

6.  Image

If the resource's /official type/ is "image/svg+xml", then the /sniffed type/ of the resource is its /official type/ (an XML type).

Otherwise, if the first bytes of the resource match one of the byte sequences in the first column of the following table, then the /sniffed type/ of the resource is the type given in the corresponding cell in the second column on the same row:

+-------------------------+--------------------------+----------+
| Bytes in Hexadecimal    | Sniffed Type             | Comment  |
+-------------------------+--------------------------+----------+
| 47 49 46 38 37 61       | image/gif                | "GIF87a" |
| 47 49 46 38 39 61       | image/gif                | "GIF89a" |
| 89 50 4E 47 0D 0A 1A 0A | image/png                |          |
| FF D8 FF                | image/jpeg               |          |
| 42 4D                   | image/bmp                | "BM"     |
| 00 00 01 00             | image/vnd.microsoft.icon |          |
+-------------------------+--------------------------+----------+

Otherwise, the /sniffed type/ of the resource is the same as its /official type/.

7.  Feed or HTML

1.  The user agent MAY wait for 512 or more bytes of the resource to be available.

2.  Let s be the stream of bytes, and let s[i] represent the byte in s with position i, treating s as zero-indexed (so the first byte is at i=0).

3.  If at any point this algorithm requires the user agent to determine the value of a byte in s which is not yet available, or which is past the first 512 bytes of the resource, or which is beyond the end of the resource, the user agent MUST stop this algorithm, and assume that the /sniffed type/ of the resource is "text/html".

    Note: User agents are allowed, by the first step of this algorithm, to wait until the first 512 bytes of the resource are available.

4.  Initialize pos to 0.

5.  If s[0] equals 0xEF, s[1] equals 0xBB, and s[2] equals 0xBF, then set pos to 3. (This skips over a leading UTF-8 BOM, if any.)

6.  Loop start: Examine s[pos].

    *  If it equals 0x09 (ASCII tab), 0x20 (ASCII space), 0x0A (ASCII LF), or 0x0D (ASCII CR)

       Increase pos by 1 and repeat this step.

    *  If it equals 0x3C (ASCII "<")

       Increase pos by 1 and go to the next step.

    *  If it is anything else

       The sniffed type of the resource is "text/html". Abort these steps.

7.  If the bytes with positions pos to pos+2 in s are exactly equal to 0x21, 0x2D, 0x2D respectively (ASCII for "!--"), then:

    1.  Increase pos by 3.

    2.  If the bytes with positions pos to pos+2 in s are exactly equal to 0x2D, 0x2D, 0x3E respectively (ASCII for "-->"), then increase pos by 3 and jump back to the previous step (the step labeled loop start) in the overall algorithm in this section.

    3.  Otherwise, increase pos by 1.

    4.  Return to step 2 in these substeps.

8.  If s[pos] equals 0x21 (ASCII "!"):

    1.  Increase pos by 1.

    2.  If s[pos] equals 0x3E, then increase pos by 1 and jump back to the step labeled loop start in the overall algorithm in this section.

    3.  Otherwise, return to step 1 in these substeps.

9.  If s[pos] equals 0x3F (ASCII "?"):

    1.  Increase pos by 1.

    2.  If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively, then increase pos by 1 and jump back to the step labeled loop start in the overall algorithm in this section.

    3.  Otherwise, return to step 1 in these substeps.

10. Otherwise, if the bytes in s starting at pos match any of the sequences of bytes in the first column of the following table, then the user agent must follow the steps given in the corresponding cell in the second column of the same row.

    +----------------------+------------------------------------+---------+
    | Bytes in Hexadecimal | Requirement                        | Comment |
    +----------------------+------------------------------------+---------+
    | 72 73 73             | The /sniffed type/ of the resource | rss     |
    |                      | is "application/rss+xml"; abort    |         |
    |                      | these steps.                       |         |
    +----------------------+------------------------------------+---------+
    | 66 65 65 64          | The /sniffed type/ of the resource | feed    |
    |                      | is "application/atom+xml"; abort   |         |
    |                      | these steps.                       |         |
    +----------------------+------------------------------------+---------+
    | 72 64 66 3A 52 44 46 | Continue to the next step in this  | rdf:RDF |
    |                      | algorithm.                         |         |
    +----------------------+------------------------------------+---------+

    If none of the byte sequences above match the bytes in s starting at pos, then the /sniffed type/ of the resource is "text/html". Abort these steps.

11. ???? If, before the next ">", you find two xmlns* attributes with http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/rss/1.0/ as the namespaces, then the sniffed type of the resource is "application/rss+xml", abort these steps. (maybe we only need to check for http://purl.org/rss/1.0/ actually) ????

12. Otherwise, the /sniffed type/ of the resource is "text/html".

For efficiency reasons, implementations may wish to implement this algorithm and the algorithm for detecting the character encoding of HTML documents in parallel.

Authors' Addresses

Adam Barth
University of California, Berkeley
Email: abarth@eecs.berkeley.edu
URI: http://www.adambarth.com/

Ian Hickson
Google, Inc.
Email: ian@hixie.ch
URI: http://ln.hixie.ch/