You'll need a system with Perl (obviously) and a tool called wget, which is kind of like a command-line web browser, except instead of rendering a nice pretty web page, it dumps the raw HTML in a file. (Google claims it is available for MS Windows, but I've only used it on Linux.)
When I started working on the second script, I just went and told wget to grab all 1576 strips. After testing the output (who the blazes is "Neko Piro"?) I realized I had stupidly forgotten that all the strips are numbered, including the Dead Piro Days and such. Hence the first script, which takes the results of a search-ninja search for the Story comics (leave the search box blank) and prints the strip numbers it finds. The second script uses wget to pull those strips (unless you've pulled them already), goes through them and collates the characters by which strips they appear in, then dumps the results to a tab-separated text file. The columns (one per character) are ordered by the number of strips the character appears in, so Piro and Largo are at the left, with John Romero in amidst a slew of other only-appears-once folks way over on the right (255 columns!).
I put a copy of this text file on Google Drive if you don't want to bother with the scripts themselves. It has some drawbacks: names like "fanboy" get reused a lot and don't necessarily refer to the same guy, so their "popularity" is a bit inflated. And I just now noticed a character named "Piro: I don't like it", which, if you look at the transcript, does indeed have that chunk of dialogue in the "Characters shown:" bit (instead of "Piro: I don't like it" and "Also shown: Miho"). Not sure if this was deliberate, but I don't have the strength to try to handle it atm.
But since it's a tab-separated file you can load it straight into MS Excel if you want, and I hope somebody might find it useful. For example, apropos of the Technetium (99m?) comment in the 1577 thread, did you know Megumi is the 14th most frequently appearing character, just behind Yutaka and ahead of Ed (my most and least favorite characters respectively)? paarfi, did you know Kenji's name is misspelled as "Kenhi" in the 1252 transcript?
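If Excel isn't handy, the same kind of ranking can be pulled out with a few lines of Perl. This is only a minimal sketch, assuming the layout described above: strip number in the first column, then one column per character, with the character's name in the cell when they appear and an empty cell otherwise. The sample rows are made up for illustration, and the sub name count_appearances is my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many rows (strips) each character name appears in.
# Takes a list of tab-separated lines, returns a hash ref of name => count.
sub count_appearances {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        chomp $line;
        # The -1 keeps trailing empty cells; the first cell is the strip number
        my ($strip, @cells) = split /\t/, $line, -1;
        $count{$_}++ for grep { length } @cells;
    }
    return \%count;
}

# Made-up sample rows, NOT real data from the transcripts
my @sample = (
    "1\tPiro\tLargo\t",
    "2\tPiro\t\t",
    "3\tPiro\tLargo\tErika",
);

my $count = count_appearances(@sample);

# Most frequent first, ties broken alphabetically
for my $name (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$name\t$count->{$name}\n";
}
```

To try it on the real file, swap @sample for the lines of the file, for instance by calling count_appearances(<STDIN>) and piping the file in.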
Anyway, Share and Enjoy. Restricting the behavior to a subset of strips instead of all twelve hundred or so is left as an exercise for the even nerdier than me. Oh, and remember, kids: we are NOT supposed to parse HTML by hand like this; we are supposed to use HTML::Parser and other nice libraries. Don't be a lazy twerp like me.
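For anyone taking up that exercise, one way to go about it would be to filter the strip list by number before the download loop. A rough sketch, assuming the range comes in as two command-line arguments; filter_strips and the argument handling are my own invention here, not part of the scripts below, and the strip list is a stand-in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only strip numbers within [$first, $last].
# Either bound may be undef, meaning "no limit" on that side.
sub filter_strips {
    my ($first, $last, @strips) = @_;
    my @kept;
    for my $n (@strips) {
        next if defined $first && $n < $first;
        next if defined $last  && $n > $last;
        push @kept, $n;
    }
    return @kept;
}

# Hypothetical usage: ./this_script.pl 100 200
my ($first, $last) = @ARGV;

# Stand-in for the real list of strip numbers
my @all = (1, 5, 99, 100, 150, 200, 201);

my @subset = filter_strips($first, $last, @all);
print join("\n", @subset), "\n";
```

In the second script the same filter could be applied to @strips right after it is read, so that both the wget loop and the collation only see the chosen range.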
get_story_strips.pl:
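#!/usr/bin/perl -w
use strict;

# Slurp whole files: easier to parse a block of html if we don't care where the arbitrary newlines have been inserted
$/ = undef;

# Saved search-ninja results page; assumes you already did a wget for this one
my $strip_search = 'search.php@&q=&x=18&y=19&meta%5B%5D=Story';
open my $fh, '<', $strip_search
    or die "Could not open $strip_search for read, aborting: $!";
my $ss_html = <$fh>;
close $fh;

# Grab the results list, then every strip number linked inside it
my ($results) = $ss_html =~ /<ol class="results">(.*)<\/ol>/s
    or die "Could not find the results list in $strip_search, aborting";
my @strips = $results =~ /<a href="strip\/(\d+)">/g;

# One strip number per line, in reverse of the order the search page lists them
print join("\n", reverse @strips), "\n";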
Second script (pulls the transcripts and builds the table):
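#!/usr/bin/perl -w
use strict;

# Strip numbers, one per line, from the first script
chomp(my @strips = `./get_story_strips.pl`);

# Pull any transcript we don't already have; wget saves each one under its strip number
for my $strip (@strips) {
    unless (-e $strip) {
        if (system('wget', 'https://megatokyo.com/transcript/' . $strip)) {
            warn "Could not pull strip $strip";
        }
    }
}

$/ = undef; # Easier to parse a block of html if we don't care where the arbitrary newlines have been inserted

# $personae{character}{strip} = 1 for every strip the character appears in
my %personae;
for my $strip (@strips) {
    open my $fh, '<', $strip
        or do {
            warn "Could not open $strip for read, skipping";
            next;
        };
    my $strip_html = <$fh>;
    close $fh;

    # Skip the strip rather than reuse a stale capture if the page has no transcript
    my ($transcript) = $strip_html =~ /<ol class="transcript">(.*)<\/ol>/s
        or do {
            warn "No transcript found in $strip, skipping";
            next;
        };

    # Flat list of (speaker, text) pairs
    my @line_list = $transcript =~ /<dt>(.*?):<\/dt><dd>(.*?)<\/dd>/sg;
    while (@line_list) {
        my $who = shift @line_list;
        my $what = shift @line_list;
        my @shown = ($who);
        # "Characters shown" / "Also shown" entries hold a comma-separated list of names
        if ($who =~ /shown$/) {
            @shown = split(/,\s*/, $what);
        }
        for my $name (@shown) {
            $personae{$name}{$strip} = 1;
        }
    }
}

# Characters ordered by number of strips (most first), ties broken alphabetically
my @personae_list = map {$_->[0]} sort {
    $b->[1] <=> $a->[1] || $a->[0] cmp $b->[0]
} map {[$_, scalar(keys %{$personae{$_}})]} keys %personae;

# One row per strip: strip number, then the character's name in their column if they appear
for my $strip (@strips) {
    my @cells;
    push @cells, $strip;
    for my $who (@personae_list) {
        push @cells, exists($personae{$who}{$strip}) ? $who : '';
    }
    print join("\t", @cells), "\n";
}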