
You'll need a system with perl (obviously

When I started working on the second script, I just went and told wget to grab all 1576 strips. After testing the output (who the blazes is "Neko Piro"?) I realized I had stupidly forgotten that all the strips are numbered, including the Dead Piro Days and such. Hence the first script, which takes the results of a search-ninja search for the Story comics (leave the search box blank

I put a copy of this text file on Google Drive if you don't want to bother with the scripts themselves. It has some drawbacks: names like "fanboy" get reused a lot and don't necessarily refer to the same guy, so their "popularity" is a bit inflated. And I just now noticed a character named "Piro: I don't like it", which, if you look at the transcript, does indeed have that chunk of dialogue in the "Characters shown:" bit (instead of "Piro: I don't like it" and "Also shown: Miho"). Not sure if this was deliberate, but I don't have the strength to try to handle it atm.
But since it's a tab-separated file you can load it straight into MS Excel if you want, and I hope somebody might find it useful. For example, apropos of the Technetium (99m?) comment in the 1577 thread, did you know Megumi is the 14th most frequently appearing character, just behind Yutaka and ahead of Ed (my most and least favorite characters respectively
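
If you'd rather not open Excel, here is a rough sketch of how a ranking like that could be pulled out of the tab-separated output. It assumes each row is a strip number followed by one cell per character, holding the character's name when they appear and nothing otherwise. The sub name `rank_characters` and the sample rows are made up for illustration and are not real data from the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many rows (strips) each character name appears in.
# Assumed row layout: strip number, then one cell per character, which
# holds the character's name if they appear and is empty otherwise.
sub rank_characters {
    my @rows = @_;
    my %count;
    for my $row (@rows) {
        chomp(my $line = $row);
        # Limit of -1 keeps trailing empty cells instead of dropping them.
        my ($strip, @cells) = split /\t/, $line, -1;
        $count{$_}++ for grep { length } @cells;
    }
    # Most appearances first, ties broken alphabetically.
    return map { [$_, $count{$_}] }
           sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
}

# Made-up sample rows, NOT taken from the real file.
my @sample = (
    "1\tPiro\tMiho\t\n",
    "2\tPiro\t\t\n",
    "3\tPiro\tMiho\tEd\n",
);

my @ranking = rank_characters(@sample);
my $place = 1;
printf "%d\t%s\t%d\n", $place++, @$_ for @ranking;
```

To try it on the real thing, feed it the lines of the downloaded file instead of `@sample`. The inflated counts for reused names like "fanboy" will still be there, since this only counts cells.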


Anyway, Share and Enjoy. Restricting the behavior to a subset of strips instead of all twelve hundred or so is left as an exercise for the even nerdier than me.
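
For anyone taking up that exercise, one possible starting point is to filter the list of strip numbers before anything gets downloaded or parsed. This is only a sketch: `strips_in_range` is a hypothetical helper, not part of the scripts below, and the numbers are stand-ins for the output of get_story_strips.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only the strip numbers between $first and $last, inclusive.
sub strips_in_range {
    my ($first, $last, @strips) = @_;
    return grep { $_ >= $first && $_ <= $last } @strips;
}

# Made-up numbers standing in for the output of get_story_strips.pl.
my @all_strips = (1, 2, 5, 9, 10, 14);
my @subset = strips_in_range(5, 10, @all_strips);
print join("\n", @subset), "\n";
```

In the second script this could be applied to `@strips` right after it is read in, with the two bounds taken from the command line or hard-coded.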


get_story_strips.pl:
#!/usr/bin/perl -w
use strict;

# Slurp mode: easier to parse a block of html if we don't care where
# the arbitrary newlines have been inserted.
$/ = undef;

# Assumes you already did a wget for this one.
my $strip_search = 'search.php@&q=&x=18&y=19&meta%5B%5D=Story';
open my $fh, '<', $strip_search
    or die "Could not open $strip_search for read, aborting: $!";
my $ss_html = <$fh>;
close $fh;

# Grab the results list, then every strip number linked inside it.
my ($results) = $ss_html =~ /<ol class="results">(.*)<\/ol>/s
    or die "No results list found in $strip_search, aborting";
my @strips = $results =~ /<a href="strip\/(\d+)">/g;

# Print one strip number per line, reversing the order of the search page.
print join("\n", reverse @strips), "\n";
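
The second script:
#!/usr/bin/perl -w
use strict;

# One strip number per line, as printed by the first script.
my @strips = (map {chomp; $_} `./get_story_strips.pl`);

# Fetch any transcript we don't already have. wget saves each one
# under the strip number, which is why the -e test below works.
for my $strip (@strips) {
    unless (-e $strip) {
        if (system('wget', 'https://megatokyo.com/transcript/' . $strip)) {
            warn "Could not pull strip $strip";
        }
    }
}

$/ = undef; # Easier to parse a block of html if we don't care where the arbitrary newlines have been inserted

# %personae maps character name => { strip number => 1 }.
my %personae;
for my $strip (@strips) {
    open my $fh, '<', $strip
        or do {
            warn "Could not open $strip for read, skipping";
            next;
        };
    my $strip_html = <$fh>;
    close $fh;

    # Skip the strip if there is no transcript list, rather than
    # carrying on with an undefined value.
    my ($transcript) = $strip_html =~ /<ol class="transcript">(.*)<\/ol>/s;
    unless (defined $transcript) {
        warn "No transcript found in $strip, skipping";
        next;
    }

    # Flat list of (speaker, text) pairs.
    my @line_list = $transcript =~ /<dt>(.*?):<\/dt><dd>(.*?)<\/dd>/sg;
    while (@line_list) {
        my $who = shift @line_list;
        my $what = shift @line_list;
        my @shown = ($who);
        # Labels ending in "shown" carry a comma separated list of names
        # in the text instead of dialogue.
        if ($who =~ /shown$/) {
            @shown = split(/,\s*/, $what);
        }
        for my $name (@shown) {
            $personae{$name}{$strip} = 1;
        }
    }
}

# Characters ordered by number of strips (most first), then by name.
my @personae_list = map {$_->[0]} sort {
    $b->[1] <=> $a->[1] || $a->[0] cmp $b->[0]
} map {[$_, scalar(keys %{$personae{$_}})]} keys %personae;

# One tab separated row per strip: the strip number, then one cell per
# character, holding the name if they appear and empty otherwise.
for my $strip (@strips) {
    my @cells;
    push @cells, $strip;
    for my $who (@personae_list) {
        push @cells, exists($personae{$who}{$strip}) ? $who : '';
    }
    print join("\t", @cells), "\n";
}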