Web::Scraper Watch - へたっぴ日記

From 0.16 to 0.19.

0.19 Thu Sep 20 22:42:30 PDT 2007
　　　　- Try to get HTML encoding from META tags as well, when there's
　　　　　no charset value in HTTP response header.
0.18 Thu Sep 20 19:49:11 PDT 2007
　　　　- Fixed a bug where URI is not absolutized when scraper is nested
　　　　- Use as_XML not as_HTML in 'RAW'
0.17 Wed Sep 19 19:12:25 PDT 2007
　　　　- Reverted Term::Encoding support since it causes segfaults
　　　　　(double utf-8 encoding) in some environment
0.16 Tue Sep 18 04:48:47 PDT 2007
　　　　- Support 'RAW' and 'TEXT' for TextNode object
　　　　- Call Term::Encoding from scraper shell if installed
http://search.cpan.org/src/MIYAGAWA/Web-Scraper-0.19/Changes

Most of these look like internal improvements, so I'll skip them.

　　　　- Support 'RAW' and 'TEXT' for TextNode object

It now works even when the XPath expression selects text nodes.

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use YAML;

print Dump scraper {
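  # Select text nodes whose value matches the ISBN-10 shape. The =~ regex
  # match is an extension of the XPath engine, not standard XPath.
  process '//text()[.=~/\d{9}[\dX]/]', 'isbns[]' => 'text';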
}->scrape(<<HTML
<div>477413192X</div>
<ul>
  <li>4873113377</li>
  <li>4063726266</li>
</ul>
HTML
);

[hetappi@lily work]# perl ./isbn.pl
---
isbns:
  - 477413192X
  - 4873113377
  - 4063726266
[hetappi@lily work]#

In this case, though, the same result was already possible without selecting text nodes, using

  process '//*[text()=~/\d{9}[\dX]/]', 'isbns[]' => 'text';

so this isn't a very good example.
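
A case where selecting the text nodes themselves should matter is mixed content, where several pieces of text sit side by side in one element and an element-level XPath can only return them glued together. The following is a minimal sketch under that assumption. The HTML fragment and the `lines` key are made up for illustration, and I have not checked it against every Web::Scraper version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use YAML;

# Hypothetical mixed-content fragment: the three strings are sibling
# text nodes of a single <p>, separated by <br> elements.
my $html = '<p>foo<br>bar<br>baz</p>';

# Select the text nodes themselves; 'TEXT' asks for each node's text value.
my $lines = scraper {
  process '//p/text()', 'lines[]' => 'TEXT';
};

my $result = $lines->scrape($html);
print Dump $result;
```

With `//p` instead of `//p/text()`, the whole paragraph would come back as a single value, so each piece can only be picked up separately by selecting the text nodes.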

It's a shame about Term::Encoding, though...