Web::Scraper Watch - へたっぴ日記

I was so happy about env_proxy that I missed the other updates, but let's keep that between us.
I wrote a little about it earlier:

　- Call env_proxy in scraper CLI

D:\>set HTTP_PROXY=http://userid:passwd@proxy.example.com:8080
D:\>scraper "http://quote.yahoo.co.jp/q?s=9684.t&d=t"
scraper>

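With something like this you can now configure a proxy for the scraper CLI. I imagine plenty of people at companies and universities are happy about that.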
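For reference, the same idea in a plain script: LWP::UserAgent has an env_proxy method that picks up proxy settings from the *_proxy environment variables. This is only a minimal sketch of that call, not the scraper CLI's actual code, and the proxy URL is a placeholder.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder proxy URL (an assumption for this sketch);
# normally the variable is already set in your shell.
$ENV{http_proxy} = 'http://proxy.example.com:8080';

my $ua = LWP::UserAgent->new;
$ua->env_proxy;    # load proxy settings from the *_proxy environment variables

# Called with only a scheme, proxy() returns the current setting.
print $ua->proxy('http'), "\n";
```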

　- Added $Web::Scraper::UserAgent and $scraper->user_agent accessor to deal
　　with UserAgent object

You can now set and read the LWP::UserAgent object.
Until now,

my $scraper = scraper {
...
};
$scraper->__ua->proxy(http => 'http://userid:passwd@proxy.example.com:8080');
my $data = $scraper->scrape(...);

you could sort of get at it like this, but now

$Web::Scraper::UserAgent = LWP::UserAgent->new(keep_alive => 1);
my $foo = scraper {
...
}->scrape(...);

my $scraper = scraper {
...
};
my $foo = $scraper->scrape(...);

$scraper->user_agent->cookie_jar({});
my $bar = $scraper->scrape(...);

things like this work. You can also write your own LWP::UserAgent subclass and do whatever you like in it. I see.
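As a rough sketch of the subclass idea: the package name My::LoggingUA and the logging itself are made up for illustration, and I am assuming (from the changelog entry above) that Web::Scraper uses whatever object is stored in $Web::Scraper::UserAgent for its fetches.

```perl
use strict;
use warnings;

# Hypothetical subclass, for illustration only.
package My::LoggingUA;
use parent 'LWP::UserAgent';

# Log each outgoing request to STDERR, then hand it to LWP::UserAgent.
sub request {
    my ($self, $req, @rest) = @_;
    warn $req->method, ' ', $req->uri, "\n";
    return $self->SUPER::request($req, @rest);
}

package main;
use Web::Scraper;

# Assumption: scrapers created after this point fetch pages through this object.
$Web::Scraper::UserAgent = My::LoggingUA->new(keep_alive => 1);
```

Nothing is fetched here; the object only comes into play once a scraper is asked to scrape a URI.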

　- Don't escape non-ASCII characters into &#xXXXX; in scraper shell 's' and WARN

In the example from "Playing with the scraper CLI, part 2" (へたっぴ日記), this output:

scraper> s
<html>
  <head>
    <title> Yahoo!&#x30D5;&#x30A1;&#x30A4;&#x30CA;&#x30F3;&#x30B9; - 9684.t </title>
...
  </body>
</html>
scraper>

now comes out as:

scraper> s
<html>
  <head>
    <title> Yahoo!繝輔ぃ繧､繝翫Φ繧ｹ - 9684.t </title>
...
...
  </body>
</html>
scraper> binmode STDERR, ':encoding(sjis)'
scraper> s
<html>
  <head>
    <title> Yahoo!ファイナンス - 9684.t </title>
...
  </body>
</html>
scraper>

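Without the binmode call the raw characters come out garbled on a Shift_JIS console; with it they are readable.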

scraper> process '//table[@border="1"]/tr[2]/td[1]', WARN;
<td colspan="2" nowrap>&#x53D6;&#x5F15;&#x5024;<br />9/3 <b>3,570</b></td>

now comes out as:

scraper> process '//table[@border="1"]/tr[2]/td[1]', WARN;
<td colspan="2" nowrap>蜿門ｼ募&#128;､<br />9/14 <b>3,780</b></td>
scraper> binmode STDERR, ':encoding(sjis)'
scraper> process '//table[@border="1"]/tr[2]/td[1]', WARN;
<td colspan="2" nowrap>取引値<br />9/14 <b>3,780</b></td>
scraper>

So much easier to read now.
Unrelated, but Square Enix is doing well. I shouldn't have sold...

Addendum, 2007/09/18
I felt I had been slightly off the mark, so I revised this post.