Trying out the DOM with HTML that is not well-formed

Copy-pasted logs of what I tried in Firebug.

First, a check with well-formed input.

>>> var html = '<html><head><title>foo</title></head><body><img src="bar.png"/></body></html>'
>>> var doc = (new DOMParser).parseFromString(html, 'text/xml')
>>> doc.getElementsByTagName('title')
[title]
>>> doc.evaluate('//title', doc, null, 6, null).snapshotLength
1
>>> doc.evaluate('//title', doc, null, 6, null).snapshotItem(0).firstChild
"foo"
>>> (new XMLSerializer).serializeToString(doc)
"<html><head><title>foo</title></head><body><img src="bar.png"/></body></html>"

DOMParser parses it correctly, so no problem here.
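
A side note on the magic number: the 6 passed to evaluate above is the numeric value of XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE. Below is a minimal sketch of a wrapper that hides it and returns the matches as an array. The name xpathAll and its argument order are my own invention, not part of any API; it only assumes that doc has an evaluate() method, as Document does.

```javascript
// Numeric value of XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE in the DOM XPath API.
var UNORDERED_NODE_SNAPSHOT_TYPE = 6;

// Hypothetical helper: evaluate `expr` against `context` (defaults to `doc`)
// and return the matching nodes as a plain array.
// `doc` must provide evaluate(), as Document objects do.
function xpathAll(doc, expr, context) {
  var result = doc.evaluate(expr, context || doc, null,
                            UNORDERED_NODE_SNAPSHOT_TYPE, null);
  var nodes = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
```

In the console this would be called as xpathAll(doc, '//title'), with an optional third argument for the context node.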

Next, input that is not well-formed. I removed the / from the img tag.

>>> var html2 = '<html><head><title>foo</title></head><body><img src="bar.png"></body></html>'
>>> var doc2 = (new DOMParser).parseFromString(html2, 'text/xml')
>>> doc2.getElementsByTagName('title')
[]
>>> doc2.firstChild
<parsererror>
>>> doc2.firstChild.textContent
"XML パースエラー: タグの対応が間違っています。終了タグが必要です: </img>
URL: about:blank
行番号: 1, 列番号: 65:<html><head><title>foo</title></head><body><img src="bar.png"></body></html>
----------------------------------------------------------------^"

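When parsing fails, DOMParser apparently does not throw; by design it returns a Document with the cause of the error in it... (The message above comes from a Japanese-locale Firefox. It says the tags are mismatched and a closing </img> was expected, at line 1, column 65.)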
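
Since the failure comes back as a document rather than as an exception, the caller has to check for it. Here is a minimal sketch of such a check, going only by the behavior seen in the log above. The names hasParserError and parseXmlStrict are hypothetical, and parser is assumed to be a DOMParser instance (anything with parseFromString() will do).

```javascript
// Hypothetical helper: true if `doc` looks like the error document that
// DOMParser returned above, i.e. it contains a <parsererror> element.
function hasParserError(doc) {
  return doc.getElementsByTagName('parsererror').length > 0;
}

// Hypothetical wrapper: parse `source` as XML and throw on failure
// instead of silently returning an error document.
// `parser` must provide parseFromString(), as DOMParser instances do.
function parseXmlStrict(parser, source) {
  var doc = parser.parseFromString(source, 'text/xml');
  if (hasParserError(doc)) {
    throw new Error('XML parse error: ' + doc.documentElement.textContent);
  }
  return doc;
}
```

One caveat with this sketch: a valid document that happens to contain an element named parsererror would be reported as a failure too.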

So, on to the approach I learned about the other day: assigning the string to innerHTML.

>>> var div = document.createElement('div')
>>> div.innerHTML = html2
"<html><head><title>foo</title></head><body><img src="bar.png"></body></html>"
>>> div.getElementsByTagName('title')
[title]
>>> document.evaluate('//title', div, null, 6, null).snapshotLength
0
>>> (new XMLSerializer).serializeToString(div)
"<DIV><TITLE>foo</TITLE><IMG src="bar.png"/></DIV>"
>>> document.evaluate('//TITLE', div, null, 6, null).snapshotLength
0

I can't get at it with XPath! On top of that, the head tag and friends have gone missing, the tag names are uppercase, and the output has become well-formed.

Addendum, 2007/09/28
Actually, it can be done. Sorry. The path just has to be relative to div.

>>> document.evaluate('.//title', div, null, 6, null).snapshotLength
1
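
My guess at the reason: a path that begins with / is resolved from the root, not from the context node passed as the second argument, and div was never inserted into document. A leading . anchors the path at the context node instead. Below is a rough sketch of a helper that makes that rewrite; the name relativeXPath is hypothetical, and it only looks at the first character, so unions such as '//a | //b' or absolute paths inside predicates are not handled.

```javascript
// Hypothetical helper: make an XPath expression that starts with "/"
// relative to the context node by prefixing ".".
// Only the first character is inspected; expressions that are already
// relative are returned unchanged.
function relativeXPath(expr) {
  return expr.charAt(0) === '/' ? '.' + expr : expr;
}
```

With this, document.evaluate(relativeXPath('//title'), div, null, 6, null) evaluates the same './/title' expression as in the log above.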

The rest is no longer needed, but I'll leave it here for the record.

I tried lowercasing the tag names and parsing the result.

>>> var html3 = (new XMLSerializer).serializeToString(div)
>>> var html4 = html3.replace(/(<\/?)([A-Z]+)/g, function(s, p1, p2) { return p1 + p2.toLowerCase(); })
>>> html4
"<div><title>foo</title><img src="bar.png"/></div>"
>>> var doc4 = (new DOMParser).parseFromString(html4, 'text/xml')
>>> doc4.evaluate('//title', doc4, null, 6, null).snapshotLength
1
>>> doc4.evaluate('//title', doc4, null, 6, null).snapshotItem(0).textContent
"foo"

It worked. But I don't really like it...
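
For reference, here is the same replace wrapped up as a function. This is only a sketch: the name lowercaseTagNames is made up, and it assumes the input is XMLSerializer output like html3 above, where a literal < appears only at the start of a tag. Attribute names and text are left alone.

```javascript
// Hypothetical helper: lowercase the element names in serialized markup.
// Matches "<" or "</" followed by a run of uppercase letters and lowercases
// only that run; attributes and text content are not touched.
function lowercaseTagNames(markup) {
  return markup.replace(/(<\/?)([A-Z]+)/g, function (match, open, name) {
    return open + name.toLowerCase();
  });
}
```

lowercaseTagNames(html3) should give the same string as html4 in the log above.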

Next I'll try this.