Scripts to list characters by strip (instead of search-ninja's all strips a character appears in)

darrin
Posts: 628
Joined: Sun Jun 04, 2017 7:19 pm

Post by darrin » Sun May 24, 2020 1:35 am

Over in Story Discussion louisxiv asked if the search ninja could "list who appears in a comic page, rather than which pages a character appears in". It was pointed out that that functionality is already provided by the transcripts (megatokyo.com/transcript rather than megatokyo.com/strip for those unaware). But then speculation arose about whether such a tool might be developed to automate the task. paarfi suggested MS Excel, but I am more of a unix guy. Hence the following two perl scripts. 8-)

You'll need a system with perl (obviously :D) and a tool called wget, which is kind of like a command-line web browser, except instead of rendering a nice pretty web page, it dumps the raw html in a file. (Google claims it is available for MS Windows, but I've only used it on linux.)

When I started working on the second script, I just went and told wget to grab all 1576 strips. After testing the output (who the blazes is "Neko Piro"?) I realized I had stupidly forgotten that all the strips are numbered, including the Dead Piro Days and such. Hence the first script, which takes the results of a search-ninja search for the Story comics (leave the search box blank ;)) and then uses wget to pull them (unless you've pulled them already). The second script goes through those strips and collates the characters by which strips they appear in, then dumps the results to a tab-separated text file. The columns (one per character) are ordered by the number of strips the character appears in; so Piro and Largo are at the left, with John Romero in amidst a slew of other only-appears-once folks way over on the right (255 columns!).

I put a copy of this text file on google drive if you don't want to bother with the scripts themselves. It has some drawbacks: names like "fanboy" get reused a lot, and are not necessarily referring to the same guy, so their "popularity" is a bit inflated. And I just now noticed a character named "Piro: I don't like it", which if you look at the transcript, does indeed have that chunk of dialogue in the "Characters shown:" bit (instead of "Piro: I don't like it" and "Also shown: Miho"). Not sure if this was deliberate but I don't have the strength to try to handle it atm.
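For what it's worth, here is a minimal sketch of one way the stray-dialogue case could be handled. It is not part of the scripts in this post: `shown_names` is a hypothetical helper, and it assumes no real character name contains a colon, so everything from the first colon onward can be thrown away before splitting on commas.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original scripts): turn the text of a
# "Characters shown" / "Also shown" entry into a list of names.
# Assumption: no character name contains a colon, so anything from the
# first colon onward is dialogue that ended up in the wrong place.
sub shown_names {
  my ($what) = @_;
  $what =~ s/:.*//s;                 # drop the stray dialogue, if any
  my @parts = split /,/, $what;
  my @names;
  for my $name (@parts) {
    $name =~ s/^\s+|\s+$//g;         # trim surrounding whitespace
    push @names, $name if length $name;
  }
  return @names;
}

print join('|', shown_names("Piro: I don't like it")), "\n";
print join('|', shown_names("Piro, Kimiko")), "\n";
```

The catch is that anybody listed after the dialogue is lost, so this only papers over the problem; fixing the transcript itself is the cleaner option.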

But since it's a tab separated file you can load it straight into MS Excel if you want, and I hope somebody might find it useful. For example, apropos of the Technetium (99m?) comment in the 1577 thread, did you know Megumi is the 14th most frequently appearing character, just behind Yutaka and ahead of Ed (my most and least favorite characters respectively :lol:)? paarfi, did you know Kenji's name is misspelled as "Kenhi" in the 1252 transcript? ;)

Anyway, Share and Enjoy. Restricting the behavior to a subset of strips instead of all twelve hundred or so is left as an exercise for the even nerdier than me. :lol: Oh and remember kids, we are NOT supposed to parse HTML by hand like this, we are supposed to use HTML::Parser and other nice libraries. Don't be a lazy twerp like me. :x
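As a starting point for that exercise, a rough sketch of filtering the strip list down to a numeric range. `strips_in_range` is a made-up name, and the numbers in the demo are sample values only, not a real list of Story strips.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the strips numbered $first..$last (inclusive).
sub strips_in_range {
  my ($first, $last, @strips) = @_;
  return grep { $_ >= $first && $_ <= $last } @strips;
}

# Sample strip numbers for illustration only.
my @all = (1, 2, 5, 1250, 1252, 1300);
print join(' ', strips_in_range(1250, 1299, @all)), "\n";
```

In get_personae.pl this would presumably wrap the line that fills @strips, with the two bounds taken from @ARGV or hard-coded.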

get_story_strips.pl:

#!/usr/bin/perl -w
use strict;
$/ = undef;  # Easier to parse a block of html if we don't care where the arbitrary newlines have been inserted

my $strip_search = 'search.php@&q=&x=18&y=19&meta%5B%5D=Story';  # Assumes you already did a wget for this one

open my $fh, '<', $strip_search
  or die "Could not open $strip_search for read ($!), aborting";
my $ss_html = <$fh>;
close $fh;
# Capture via list assignment so a failed match dies instead of leaving $1 undefined
my ($results) = $ss_html =~ /<ol class="results">(.*)<\/ol>/s
  or die "No results list found in $strip_search, aborting";
my @strips = $results =~ /<a href="strip\/(\d+)">/g;

print join("\n", reverse @strips), "\n";

get_personae.pl:

#!/usr/bin/perl -w
use strict;

my @strips = (map {chomp; $_} `./get_story_strips.pl`);
for my $strip (@strips) {
  unless (-e $strip) {
    if (system('wget', 'https://megatokyo.com/transcript/' . $strip)) {
      warn "Could not pull strip $strip";
    }
  }
}


$/ = undef;   # Easier to parse a block of html if we don't care where the arbitrary newlines have been inserted
my %personae;

for my $strip (@strips) {
  open my $fh, '<', $strip
    or do {
      warn "Could not open $strip for read, skipping";
      next;
    };
  my $strip_html = <$fh>;
  my ($transcript) = $strip_html =~ /<ol class="transcript">(.*)<\/ol>/s
    or do { warn "No transcript found in $strip, skipping"; next };  # don't reuse a stale $1
  my @line_list = $transcript =~ /<dt>(.*?):<\/dt><dd>(.*?)<\/dd>/sg;
  while (@line_list) {
    my $who = shift @line_list;
    my $what = shift @line_list;
    my @shown = ($who);
    if ($who =~ /shown$/) {
      @shown = split(/,\s*/, $what);
    }

    for my $who (@shown) {
      $personae{$who}{$strip} = 1;
    }
  }
}

my @personae_list = map {$_->[0]} sort {
  $b->[1] <=> $a->[1] || $a->[0] cmp $b->[0]
} map {[$_, scalar(keys %{$personae{$_}})]} keys %personae;

for my $strip (@strips) {
  my @cells;
  push @cells, $strip;
  for my $who (@personae_list) {
    push @cells, exists($personae{$who}{$strip}) ? $who : '';
  }
  print join("\t", @cells), "\n";
}
Avatar by Broken, I changed the book
My rescripts, now with little bits of commentary for each one
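For the original question (who appears in a given strip), the tab-separated output can be read back one row at a time. A minimal sketch, assuming the layout printed by get_personae.pl above: strip number in the first column, then one column per character holding either the name or nothing. `characters_in_row` is a hypothetical helper and the two rows in the demo are made-up sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one row of the tab-separated file into the
# strip number and the list of characters that appear in that strip.
sub characters_in_row {
  my ($row) = @_;
  chomp $row;
  my ($strip, @cells) = split /\t/, $row, -1;   # -1 keeps trailing empty cells
  return ($strip, grep { length } @cells);
}

# Made-up sample rows; in practice each row would be a line read from
# megatokyo_characters_per_strip.txt.
for my $row ("1\tPiro\tLargo\t\t\n", "2\tPiro\t\tKimiko\t\n") {
  my ($strip, @who) = characters_in_row($row);
  print "$strip: ", join(', ', @who), "\n";
}
```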

louisxiv
Posts: 39
Joined: Sun Jun 04, 2017 7:19 pm

Post by louisxiv » Sun May 24, 2020 4:09 am

Thanks again, for the code and the text file. Works better than what I was starting to tinker with... I'd entirely forgotten about wget, despite having used it to do something or other with a comic site a few years back.

darrin
Posts: 628
Joined: Sun Jun 04, 2017 7:19 pm

Post by darrin » Sun May 24, 2020 3:22 pm

Other possible name typos (in addition to Kenji -> "Kenhi" in 1252):

Largo's translation device is variously given as "tr4nzl33t" (1142, 1144, 1148, 1153, 1232, 1234), "Transl33t" (1167), "Transl33tor" (1167), "Tranzl33t" (1163), and "tr4nsl33t" (1148, 1162).
Probably more a transliteration choice than typo, but Eimi (1042, 1045, 1049, 1050, 1054) appears in one speech bubble as "Emi" (1042) but another as "Eimi" (1049).
Komugiko is "Komugko" in 1447 panel 5, and 1452 panel 2.
Horde is "Hord" in 1386 panel 3.
Junpei is "Junpe" in 1449 panel 5.
Junpei is "Jupei" in 1413 panel 4.
Kimiko's yawn is "mostrous" in 1284 panel 6.
Moeko is "Meoko" in 1560 panel 8 (and also in the dialogue of 1497 panel 4).
Miho is "Miiho" in 1518 panel 5.
Ping is "Pring" in 1414 panel 2.
There is a "newpane" listed in 1228 panel "6", along with "Yuki, Yuki"; but it looks like those should have been in a separate Panel 7 (Yuki's dialogue for that panel is missing), and the stated "Panel 7" should actually have been Panel 8.
Ping is "pING" in 1141 panel 1.
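One way typos like these might be surfaced automatically (a sketch only, not necessarily how the list above was put together): flag any name that appears in exactly one strip and sits within one edit, ignoring case, of a name that appears more often. `edit_distance` and `likely_typos` are hypothetical helpers, and apart from Kenji's 20 the counts in the demo are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Plain Levenshtein distance between two strings, computed row by row.
sub edit_distance {
  my ($s, $t) = @_;
  my @prev = (0 .. length $t);
  for my $i (1 .. length $s) {
    my @cur = ($i);
    for my $j (1 .. length $t) {
      my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
      push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
    }
    @prev = @cur;
  }
  return $prev[-1];
}

# Hypothetical helper: takes name => number-of-strips pairs and returns
# "rare -> common" strings for names seen once that are at most one edit
# (case-insensitively) away from a name seen more than once.
sub likely_typos {
  my (%count) = @_;
  my @rare   = grep { $count{$_} == 1 } sort keys %count;
  my @common = grep { $count{$_} > 1 }  sort keys %count;
  my @found;
  for my $rare (@rare) {
    for my $common (@common) {
      push @found, "$rare -> $common"
        if edit_distance(lc $rare, lc $common) <= 1;
    }
  }
  return @found;
}

# Sample counts: only Kenji's 20 is a real figure, the rest are made up.
print "$_\n" for likely_typos(Kenji => 20, Kenhi => 1, Ping => 50, Pring => 1, pING => 1);
```

Short names will produce false positives at larger thresholds, so anything this flags still needs a human look at the transcript.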


These next may have been deliberate, like I said above; but these are cases where a chunk of dialogue is included in an "Also shown:" or "Characters shown:"
1454: "Boo: Squeek!"
1386: "Komugiko: You want a demonstration?"
1365: "Piro: I don't like it."


Ooh, and this one is my bad:
"Piro, Kimiko" in 632; I think that's the only case of multiple people sharing a bubble, but I didn't account for it in my script, only for the "Characters shown" and "Also shown".

paarfi
Super Mod
Posts: 826
Joined: Sat Jun 03, 2017 5:32 pm
Location: south-central Pennsylvania

Post by paarfi » Sun May 24, 2020 5:02 pm

I fixed most of them. I didn't find these below. Can you please double-check?
"Emi" (1042) but another as "Eimi" (1049)
1386: "Komugiko: You want a demonstration?"
1454: "Boo: Squeek!"
Proud owner of kendermouse's 500th post.
Lean and slippered forum loon

darrin
Posts: 628
Joined: Sun Jun 04, 2017 7:19 pm

Post by darrin » Sun May 24, 2020 5:33 pm

paarfi wrote:
Sun May 24, 2020 5:02 pm
I fixed most of them. I didn't find these below. Can you please double-check?
"Emi" (1042) but another as "Eimi" (1049)
1386: "Komugiko: You want a demonstration?"
1454: "Boo: Squeek!"
The "Emi/Eimi" ones above are referring to the actual speech bubbles in the strips; what I meant was that it's not clear to me what should be in the transcripts, given this discrepancy in the actual comics.

D'oh, the second should be 1366 (swore I fixed that before posting :oops:). And the third should have been 1464, I seem to be having trouble with sixes today, sorry. :P

This change to the code fixes the "Piro, Kimiko" issue (EDIT: or it would, but I see paarfi has already made the change to the transcript :D); I verified that the original and new megatokyo_characters_per_strip.txt files were identical except for the removal of the 192nd column (the one with "Piro, Kimiko" in strip 632), so I didn't bother uploading a new version to google drive.

--- get_personae.pl~    2020-05-23 22:37:54.061711400 -0500
+++ get_personae.pl     2020-05-24 16:10:52.534145400 -0500
@@ -27,10 +27,7 @@
   while (@line_list) {
     my $who = shift @line_list;
     my $what = shift @line_list;
-    my @shown = ($who);
-    if ($who =~ /shown$/) {
-      @shown = split(/,\s*/, $what);
-    }
+    my @shown = split(/,\s*/, $who =~ /shown$/ ? $what : $who);

     for my $who (@shown) {
       $personae{$who}{$strip} = 1;
EDIT:
Heh, it just occurred to me that I used to do rescripting, and now I am just doing scripting. You'd think it would be the other way around. :lol:

EDIT2:
I just realized the description of the two scripts up in my first post is bad. The first just gets the list of story comics, it's the second that takes that list and "uses wget to pull them (unless you've pulled them already)".
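Going back to the patch earlier in this post: a small standalone sketch of what the new split line does. The wrapper `names_for_line` is hypothetical (in the script it is a single inline statement), and the dialogue strings are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same expression as the patched line, wrapped in a hypothetical sub so it
# can be tried on its own: if the label ends in "shown", the names come from
# the entry text; otherwise they come from the speaker label itself. Either
# way the result is split on commas, which covers a shared speech bubble.
sub names_for_line {
  my ($who, $what) = @_;
  return split(/,\s*/, $who =~ /shown$/ ? $what : $who);
}

print join('|', names_for_line('Piro, Kimiko', '(placeholder dialogue)')), "\n";
print join('|', names_for_line('Also shown', 'Miho, Ping')), "\n";
print join('|', names_for_line('Largo', '(placeholder dialogue)')), "\n";
```

With the original code, the first call would have yielded the single name "Piro, Kimiko", which is where that extra column came from.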

paarfi
Super Mod
Posts: 826
Joined: Sat Jun 03, 2017 5:32 pm
Location: south-central Pennsylvania

Post by paarfi » Sun May 24, 2020 5:50 pm

Fred corrected Emi to Eimi in the books, so the transcripts are technically right for those two.
The other two are fixed now.

Thanks for pointing those out.

darrin
Posts: 628
Joined: Sun Jun 04, 2017 7:19 pm

Post by darrin » Sun May 24, 2020 6:36 pm

paarfi wrote:
Sun May 24, 2020 5:50 pm
Fred corrected Emi to Eimi in the books, so the transcripts are technically right for those two.
Ah, excellent, that makes perfect sense, thanks.

A couple more that I missed during the first round above, sorry:
The tr4nzl33t is "Tranzl33t" in 1163 panel 4.
1575: "Komugiko: <***>"
Moeko is "Meoko" in 1560 panel 2 (and dialogue of 1497 panel 4).

Thanks for helping with this. :D

paarfi
Super Mod
Posts: 826
Joined: Sat Jun 03, 2017 5:32 pm
Location: south-central Pennsylvania

Post by paarfi » Sun May 24, 2020 6:46 pm

Fixed. Thanks.

darrin
Posts: 628
Joined: Sun Jun 04, 2017 7:19 pm

Post by darrin » Sun May 24, 2020 8:48 pm

Woot, I reran the downloader to take into account all the changes paarfi made (thanks paarfi!!!), and replaced the google drive file I posted with the new one.

Share and Enjoy once more. B)

Oh hey, who likes power laws? Long tail and all that? I ran this:

darrin@cornelia ~/mt> for person in (for i in (seq 235); cat megatokyo_characters_per_strip.txt | cut -f2- | cut -f{$i} | grep -v '^$' | sort | uniq; end); echo $person; grep -c -P "\t\Q$person\E(\t|\Z)" megatokyo_characters_per_strip.txt; end > megatokyo_characters_power_law.txt
and got this. (That's a shell called "fish" that I started using at work a few months back; for zsh and such you'd need to use backticks instead of parens, plus a few other tweaks, but you get the idea.)
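For anyone without fish handy, roughly the same tally can be sketched in perl, in keeping with the rest of the thread. This is an assumption-laden equivalent rather than the command actually run: `appearance_counts` is a hypothetical helper, it counts non-empty cells per name instead of grepping column by column, and the rows in the demo are invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given rows of the tab-separated file, count how many
# strips each character appears in (one non-empty cell = one appearance).
sub appearance_counts {
  my (@rows) = @_;
  my %count;
  for my $row (@rows) {
    chomp $row;
    my (undef, @cells) = split /\t/, $row, -1;   # first column is the strip number
    $count{$_}++ for grep { length } @cells;
  }
  return %count;
}

# Made-up sample rows; the real input would be the lines of
# megatokyo_characters_per_strip.txt.
my %count = appearance_counts("1\tPiro\tLargo\t\n", "2\tPiro\t\tKimiko\n", "3\tPiro\tLargo\t\n");
for my $who (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
  print "$who\t$count{$who}\n";
}
```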

Kinda fun to look at the data this way. You see surprises like Ashe being tied with Kenji at 20 strips; Ashe "feels" like a very new character, and Kenji "feels" like he's been around forever. But Kenji's appearances were mostly pretty isolated (a few strips at a time, at fairly rare intervals) whereas Ashe was in nearly every strip for quite a few of this chapter's scenes.

Anyway, huge thanks again to paarfi for making this all possible (and for putting up with my annoying transcript questions :lol:).

louisxiv
Posts: 39
Joined: Sun Jun 04, 2017 7:19 pm

Post by louisxiv » Mon Jul 06, 2020 7:07 am

On the back of the heavy lifting done by paarfi and darrin, I've made a Megatokyo index page.

The search ninjas are likely more convenient for character and text searches, but I'm adding location (Hospital, Foxhole, etc) and significant object (Wand, MagicalGirlDetector) tags. You can click on a Piro tag and a Foxhole tag to list only pages featuring both, for example.

All chapters are character tagged from darrin and paarfi's work. Chapter 12 is up to date with the other tags (E&OE). Supplemental tagging is ongoing, as soon as I decide which chapter to start next (likely 0 or 11). Tagging input welcome...
