Friday 21 August 2015

Fun with Mojolicious UserAgent and DOM


The best solution I have found for scraping webpages is, without a doubt, Mojolicious. http://mojolicio.us/
Mojolicious is a Perl framework that includes, among many other things, a fully functioning user agent. Mojo::UserAgent lets you easily fetch pages and feed the results to Mojolicious's DOM parser, Mojo::DOM.

The problem I was trying to solve was to log into a website, pull details from my account, parse the results, and e-mail myself the details once a month.

Mojolicious comes with everything you need in one CPAN bundle; see http://mojolicio.us/ for installation instructions. The only extras I needed for HTTPS support were OpenSSL, libio-socket-ssl-perl, libssl-dev, and Net::SSLeay.
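On a Debian/Ubuntu machine the installation might look like the following (package names taken from the list above; adjust for your distribution):

```shell
# System libraries for TLS support (Debian/Ubuntu package names)
sudo apt-get install openssl libssl-dev libio-socket-ssl-perl

# Mojolicious and Net::SSLeay from CPAN
cpan Mojolicious Net::SSLeay
```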


The way I always start scraping a page is to use Chrome or Firefox to find the headers I am interested in.

For Google Chrome this is - Tools-->More Tools-->Developer Tools.
Browse to the webpage you want to scrape, select Network from the developer tools panel and then refresh the page you want to scrape. Then view the headers for the transaction you are interested in.
The page I am interested in is: https://www.zurichlife.ie/bgsi/log_on/login.jsp

From Google Chrome Developer Tools, we can see that the interesting bits are:
Request URL: https://www.zurichlife.ie/bgsi/servlet/com.eaglestar.servlets.LoginServlet
Request Method: POST
Content-Type: application/x-www-form-urlencoded
Referer: https://www.zurichlife.ie/bgsi/log_on/login.jsp
Form Data:
userName: xxxxxxxxx
password: xxxxxx
pin: xxxxxx


Now for the fun with Mojo::UserAgent.


use strict;
use warnings;
use Mojo::UserAgent;

my %params = (
    userName => 'xxxxxxxx',
    password => 'xxxxxx',
    pin      => 'xxxx',
);
my $ua = Mojo::UserAgent->new;
$ua->transactor->name('Mozilla/5.0');
$ua->max_redirects(5);

my $tx = $ua->build_tx(POST => 'https://www.zurichlife.ie/bgsi/servlet/com.eaglestar.servlets.LoginServlet', form => \%params);

$tx->req->headers->referrer('https://www.zurichlife.ie/bgsi/log_on/login.jsp');
$tx = $ua->start($tx);

 
my $filtered_dom;
if (my $res = $tx->success) {
    # print $res->body;
    my $dom = $res->dom;
    $filtered_dom = $dom
        ->find('p')
        ->grep(qr/Current Transfer/)
        ->join("\n");
    # print $filtered_dom;
}
else {
    # Since Mojolicious 5.0, error returns a hash reference
    my $err = $tx->error;
    print $err->{code}
        ? "$err->{code} response: $err->{message}\n"
        : "Connection error: $err->{message}\n";
}

$res contains the response, and $res->dom parses the raw HTML page with Mojo::DOM, which uses CSS selectors. I expect there is a much nicer way to use the DOM and CSS to filter out the details I am looking for.
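For example, if the account page wraps the figures in identifiable markup, a targeted CSS selector can pull the value directly instead of grepping every paragraph. This is a sketch against hypothetical markup (the table id and class names below are invented, not the real Zurich page structure):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical account-page markup, standing in for the real response body
my $html = <<'HTML';
<table id="funds">
  <tr><td class="label">Current Transfer</td><td class="value">123.45</td></tr>
  <tr><td class="label">Fund Value</td><td class="value">678.90</td></tr>
</table>
HTML

my $dom = Mojo::DOM->new($html);

# Select the cell directly with a CSS selector instead of grepping all tags
my $value = $dom->find('#funds td.value')->first->text;
print "$value\n";    # 123.45
```

With real pages the selectors come straight from inspecting the markup in Developer Tools, the same way as the headers above.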

I then use cron and Perl to e-mail the results to myself monthly. Mojolicious keeps on surprising me :-)
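The monthly schedule itself is just a crontab entry; the script path and address below are placeholders:

```shell
# m h dom mon dow  command
# Run at 08:00 on the 1st of every month (path and address are examples)
0 8 1 * * /usr/bin/perl /home/user/scrape_zurich.pl | mail -s "Zurich monthly summary" me@example.com
```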

There is a great tutorial here from brian d foy:
http://perltricks.com/article/143/2015/1/8/Extracting-from-HTML-with-Mojo--DOM
