Reverse engineering Wargroove

So far, I think I have spent a total of 8 hours working on a reverse engineering project to mod Wargroove. I thought it would be trivial, but it has been a little more convoluted than I expected due to my inexperience in reverse engineering. The end goal is to make an Advance Wars total conversion mod.

Wargroove is based on the Halley engine. It is a well-designed C++14 game engine – arguably one of the best-designed game engines I have seen to date (and that’s saying something). The CMake build process is straightforward, and most libraries can be statically linked without hassle. The engine also encourages scripting using Lua, which means that far more of the game may be moddable than I originally anticipated (I expected only image resources).

I was able to hack up Halley’s asset pack inspector to try to extract Wargroove’s assets, and lo and behold, it listed out all of the files in each pack. But there was one problem: the files it extracts were all gibberish. Naturally, all of the asset packs are encrypted using AES. That means I have to find the IV and a decryption key, which is just a string.
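A quick sanity check for the "gibberish" is to measure byte entropy. The sketch below is my own helper, not part of Halley, and `shannon_entropy` is a name I made up for illustration. AES output should score close to 8 bits per byte, while text and most structured data score noticeably lower. Compressed data also scores high, so a high value is only consistent with encryption and does not prove it.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)  # how often each byte value occurs
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Running this over the first few kilobytes of an extracted file and of a known plain-text file gives a rough comparison; it says nothing about which cipher or mode was used.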

The IV is easy to find: it’s in the header of the asset pack. However, the key is not so easy to find: you have to take a deep dive into the executable to find it.

I’ve done reverse engineering before as part of a university course, and of course a little bit in my spare time using Radare. I thought this would be relatively easy: find the code that corresponds to the decrypt function, then find cross-references to it, tracing upward until I can find a reference to some kind of string. But while I have come close, I have not been able to find the string through static analysis.

All right, so let’s try dynamic analysis. The problem with dynamic analysis, however, is that it’s difficult to hook into the executable right at game startup due to its dependency on Steam, so I can’t put a breakpoint on the decrypt function. Instead, I intentionally caused an error at load time to trigger an error message, which gives me a chance to attach a debugger and inspect the memory.

My first attempt to cause an error was by modifying what I thought was a decryption key and then watching the game try to decrypt without success. I’m basically looking for a string that is at least 16 characters long, and I found some good candidates that were close to some game initialization-related strings. However, none of the keys that I modified caused the game to crash.
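The candidate search can be sketched as a small `strings`-style scan. This is only an illustration of the idea, with `find_ascii_strings` being a hypothetical helper name. It assumes the key is stored as plain ASCII, which may not hold: it could be UTF-16, split up, or built at runtime.

```python
import re


def find_ascii_strings(blob: bytes, min_len: int = 16):
    """Yield (offset, text) for each run of printable ASCII at least `min_len` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(blob):
        yield match.start(), match.group().decode("ascii")


# Usage sketch (the path is a placeholder, not the real executable name):
# from pathlib import Path
# for offset, text in find_ascii_strings(Path("game.exe").read_bytes()):
#     print(hex(offset), text)
```

The offsets are file offsets, so they still need to be mapped to virtual addresses before they can be matched against cross-references in a disassembler.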

My second, fail-safe attempt was to simply rename one of the .dat files, which surely caused the game to fail to run.

Yet still no dice. None of the references in the stack pointed to anything that looked like a string, except for the error message in the dialog box. It was almost as if the error message that was produced overwrote the decryption key that I needed, which doesn’t seem like it’s supposed to happen.

After a while, I considered inspecting the network to see if the decryption key was being received from a server. But the game makes zero network communication except in the multiplayer and user-generated content (UGC) modes, and besides, getting a decryption key from a server would imply always-online DRM.

While IDA has given many clues, it does not provide any useful cross-references for the supposed decryption keys that I found. I don’t believe there is anything devious going on here – the developers would not take so much time to obfuscate a single string – so I’m going to give those strings a hard pass and keep looking.

I think I will focus my efforts on finding a way to hook into the executable right at start time, so that I can lay a breakpoint right at the decrypt function, which will give me the most accurate stack trace. I think this is possible if I hook into the Steam client and then watch for child process creation.

None of this is easy, though, when you are using Steam on Wine and launching Wargroove with an old version of Proton, in OpenGL mode, with the --no-intro flag, and then simultaneously launching x64dbg from a terminal with the same Wine prefix as Wargroove, with WINEESYNC set to 1. Yikes! I should probably do the debugging on my Windows machine.

At times, I feel like hitting my head against the wall, but I have confidence that I will find the key eventually so that I can move forward.

Unfortunately, after another six hours sunk into the project, I haven’t been able to make much headway. The Steam library is interfering with the debug hook on my Windows machine. Using the Image File Execution Options in the Windows registry, I can immediately hook a debugger on process startup; however, the interruption from hooking x96dbg causes the Steam API to fail initialization with a cryptic error code, and therefore halts the entire game from starting up.
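For reference, the Image File Execution Options hook is a single `Debugger` string value under a per-executable registry key. Below is a minimal sketch that only generates the text of a .reg file rather than touching the registry. The function name, the executable name, and the debugger path are placeholders of mine, and it assumes a debugger path without spaces.

```python
IFEO_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT"
    r"\CurrentVersion\Image File Execution Options"
)


def ifeo_reg_file(exe_name: str, debugger_path: str) -> str:
    """Build .reg file text that sets the IFEO `Debugger` value for `exe_name`."""
    # .reg string values need backslashes and double quotes escaped.
    escaped = debugger_path.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{IFEO_KEY}\\{exe_name}]",
        f'"Debugger"="{escaped}"',
        "",
    ]
    return "\r\n".join(lines)


# Usage sketch (both arguments are placeholders):
# print(ifeo_reg_file("game.exe", r"C:\tools\debugger.exe"))
```

Importing such a file requires administrator rights, and the value has to be removed afterwards, or every launch of that executable keeps going through the debugger.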

Thinking that maybe the interference was solely an x96dbg issue, I tried Cheat Engine instead. I was originally hesitant about using Cheat Engine due to my unfamiliarity with it, and in any case it does not seem to support just-in-time debugging, since elevation is required. Moreover, when I let it detect the process automatically and paused the game manually on startup, the game crashed as soon as any debugger was attached.

Heck, when the engine fails to load an asset pack, it even prints a stack trace in the message box! The problem is that the stack trace string seems to overwrite the decryption key that I need.

I think what I need to do is patch a breakpoint into the executable by way of an INT 3.
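The patch itself is tiny: INT 3 is the single byte 0xCC on x86 and x86-64. Here is a minimal sketch that works on an in-memory copy of the executable. `patch_int3` and `restore_byte` are hypothetical helpers, and the offset passed in is a file offset, so a virtual address from the disassembler has to be translated through the section headers first.

```python
INT3 = 0xCC  # one-byte breakpoint opcode on x86 and x86-64


def patch_int3(image: bytearray, file_offset: int) -> int:
    """Overwrite one byte at `file_offset` with INT 3 and return the original byte."""
    if not 0 <= file_offset < len(image):
        raise ValueError("offset is outside the image")
    original = image[file_offset]
    image[file_offset] = INT3
    return original


def restore_byte(image: bytearray, file_offset: int, original: int) -> None:
    """Undo a patch by writing the saved byte back."""
    image[file_offset] = original
```

The saved byte matters because the debugger has to put the original instruction back before resuming. Without a debugger ready to catch the breakpoint exception, the patched process simply terminates.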

Okay, well, that still didn’t work. I’ll have to try again later.

Coping with loneliness

This break, I’ve been coping with loneliness, and things have actually gone better than I expected. Fortunately, the break does not need to drag on any further – I return to college this week.

However, yesterday I was feeling particularly lonely, so I did an exercise that I like doing when it seems like there is nobody to talk to: I list everyone that I know (who is of my age), and I categorize them by connection.

2018: a retrospective

A week ago, I wrote a retrospective in my private writings, but yesterday after rereading it, I found it profound enough to publish in this blog. Since the retrospective includes personal details, I had to omit them, but given the result ended up looking like a Mad Lib, I decided to reword the retrospective to skirt around such details. The reworded narrative is a bit nebulous and abstract, and I considered even giving up publishing it entirely, but perhaps there is some value in the end.


On the versioning of large files

I’m such an idealist. It’s become clear that I can’t settle for anything other than the most perfect, elegant, long-term solution for software.

That reality has become painfully apparent trying to find a good way to track the assets on Attorney Online in a consistent, efficient manner. I just want to track its history for myself, not to put the content itself online or anything like that (since putting it online might place me further at risk for a copyright strike).

Well, that eliminates Git LFS as a choice. Git LFS eliminates the ability to make truly local repositories, as it requires the LFS server to be in a remote location (over SSH or HTTPS). It’s not like Git LFS is even very robust, anyway – the only reason people use it is because GitHub (now the second center of the Internet, apparently) developed it and marketed it so hard that they were basically able to buy (or should I say, sell?) their way to get it into the Git source tree.

Git-annex, on the other hand, seems promising, if I could figure out how to delete the remote that I accidentally botched. There’s not a whole lot of documentation on it, save the manpages and the forums, most posts of which are entirely unanswered. What’s more, GitLab dropped support of git-annex a year ago, citing lack of use. Oh well, it lets me do what I wanted to do: store the large files wherever I want.

I could also sidestep these issues by using Mercurial. But that would be almost as bad as using bare Git – the only difference would be that Mercurial tries to diff binary files, but I’d still probably have to download the entire repository all in one go.

I was also investigating some experimental designs, such as IPFS. IPFS is interesting because it’s very much a viable successor to BitTorrent, and it’s conservative enough to use a DHT instead of the Ethereum blockchain. The blockchain is seen as some kind of holy grail for computer science, but it’s still bound by the CAP theorem. It just so happens to sidestep the issues stipulated by the CAP theorem in convenient ways. Now, don’t get me wrong, my personal reservation about Ethereum is that I didn’t invest in it last year (before I went to Japan, I told myself, “Invest in Ethereum!!”, and guess what, I didn’t), and it seems that its advocates are people who did invest in it and consequently became filthy rich from it, so they come off as a little pretentious to me. But that’s enough ranting.

IPFS supports versioning, but there is no native interface for it. I think it would be a very interesting research subject to investigate a massively distributed versioning file system. Imagine a Git that supports large files, but there’s one remote – IPFS – and all the objects are scattered throughout the network…
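The core idea behind such a system is content addressing: each chunk of a file is stored under the hash of its own bytes, and a version is just a list of hashes. The sketch below is a toy in-memory model and not the actual object format of Git or IPFS. The function names, the plain `dict` store, and the fixed chunk size are all assumptions for illustration.

```python
import hashlib


def put_chunks(store: dict, data: bytes, chunk_size: int = 256 * 1024) -> list:
    """Split `data` into chunks, store each under its SHA-256 digest, return the manifest."""
    manifest = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # identical chunks are stored only once
        manifest.append(digest)
    return manifest


def get_file(store: dict, manifest: list) -> bytes:
    """Reassemble a file from its manifest, verifying every chunk against its digest."""
    parts = []
    for digest in manifest:
        chunk = store[digest]
        if hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError("chunk does not match its digest: " + digest)
        parts.append(chunk)
    return b"".join(parts)
```

Two versions of a large asset that differ in one chunk share every other chunk, and since a chunk is verified by its hash, it does not matter which peer on the network supplied it.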

Well, in the meantime, I’ll try out git-annex, well aware that it is not a perfect solution.


This morning, I received a “boil water” notice from the university. I immediately searched the news to investigate the exact reason – is the water contaminated, and what is it contaminated with?

However, all that I could find were two vague reports from city officials about how the treatment plants were overloaded from silt due to flooding, and that Lake Travis was only four feet away from spilling over the dam. Pressed to maintain a water pressure adequate enough for fire hoses to remain usable, the city decided to “reduce” the treatment of the water to allow enough water to be supplied, such that it is no longer at the “high standards” that the city provides for potable water.

But water treatment systems are not a black box; they are a multi-stage process! What stage of the treatment was hastened; or are stages being bypassed entirely? Surely, the filtration for particulate matter is being reduced, but the chlorine process should still be keeping the water sterile. However, none of these questions can be answered due to the vagueness of the report.

Affected treatment plants? Undisclosed. Particulate matter and bacteria reports? Nonexistent, assuming the Austin website actually works right now, which it does not.

Here is the main contradiction in their statement:

WHY IS THE BOIL WATER NOTICE IMPORTANT Inadequately treated water may contain harmful bacteria, viruses, and parasites which can cause symptoms such as diarrhea, cramps, nausea, headaches, or other symptoms.

But earlier in their statement, they stated the following:

It’s important to note that there have been no positive tests for bacterial infiltration of the system at this time.

So what bacteria am I going to kill from boiling water?

All that I can conclude is that the city of Austin is spreading fear, uncertainty, and doubt about the water quality simply to reduce stress on the system, without presenting hard evidence that the water is indeed unsafe to drink. Boiling water will not eliminate particulate matter, and from the aforementioned press release, “city officials” (whoever those are) have explicitly stated that bacteria have not yet contaminated treatment plants, so there are no bacteria to kill by boiling water.

One benefit to treatment plant operators from this warning, however, is that they now have free rein over which stages they wish to reduce or bypass, including the disinfection stage. However, due to the lack of transparency, there is no information to ascertain which stages are being bypassed – the water can really be of any quality right now, and it could even be still perfectly fine.

My questioning of this warning stems from a fundamental distrust in government decisions and communication to its citizens. People simply echo the same message, without seeming to place much thought into it: “Boil water. Boil water. Boil water.” And on the other hand, city officials might state that the treated water is completely safe to drink, despite findings of statistically significant lead concentration in some schools!

I’ll comply out of an abundance of caution (and because noncompliance has social implications), but mindless compliance and echoing of vague mass messages should not be the goal of the government. Individuals should be able to obtain enough information to make an informed decision and understand the rationale of the government in its own decisions.

It is now the next day since the announcement of the restrictions, and the technical details surrounding the problem remain vague. It seems that the restriction has indeed granted free license for treatment plant operators to modify treatment controls as they see fit, without necessarily needing to meet criteria for potable water. Moreover, it appears that the utility has known about this problem for quite some time now, and only now have they decided to take drastic action to prevent a water shortage.

I would not trust this water until the utility produces details of actions being taken in these treatment plants to fix this mess up.

On Windows, part 2

Here I am on my cozy Arch Linux machine, enjoying the good life of customizability and modularity of, well, literally every component of the machine.

I look up the equivalent of DMG on Windows – apparently, DMG files also have built-in code-signing and checksum capabilities. The best part about a DMG file is that it is a multipurpose format: it can be mounted like a drive as a method of isolation, or it can be used to package a full software installation.

On Windows-land, there are only ZIP files, MSI installers, and whatever other breed of self-extracting archives and installers have been devised over the decades.

At this point, I realize that Windows is fundamentally outdated. Unable to keep up with the breakneck development of Mac OS X/macOS, Microsoft will be hard-pressed to sweep out deprecated APIs one by one.

The success of Windows is attributable to the fact that it has worked on every IBM-compatible PC since the late 1980s and has maintained a stellar record in software compatibility, a coveted characteristic of computer systems for enterprises looking to minimize software development costs. By comparison, the Macintosh has experienced various leaps in architecture, notwithstanding the high cost of the machine.

I think that the market is in need of a well-designed, uncomplicated Linux distribution that is accessible and familiar to consumers, all the while being enticing for OEMs to deploy. Such a distro would not be another Ubuntu – although it could well be Ubuntu, since Canonical has cemented its position in the open-source world. The problem with Ubuntu, however, is that it has a reputation for advice that involves the command line. A distro that is consumer-oriented keeps the intimidating terminal away!

It would fill the niche market that Chrome OS dominated: lightweight, locked-down devices mostly for browsing the Web. The part where Chrome OS failed, however, was when companies wished to port native software that a web browser lacks the performance or capability to drive, such as anything involving hardware peripherals. With a Linux base, hardware interfacing need not be sacrificed.

Would such an operating system run into legal trouble if it came with Wine or an ability to install Wine when the first Windows program is installed? What if it could run Office seamlessly?

What if it began to make some revolutionary design decisions of its own?

Honestly, I don’t know where I’m going with this anymore. Back to work.

Does pyqtdeploy even work?

I know nobody is going to read this terrible blog to find this, but still, I’m moderately frustrated in trying to find a decent workflow to deploy a small, single-executable, Python-based Qt application.

Even on Windows using C++, it was not so easy to build statically until I found the Qt static libraries on the MinGW/MSYS2 repository – then building statically became a magical experience.

So far, the only deployment tools that promise to deploy a Python Qt program as a single executable are PyInstaller and pyqtdeploy.

PyInstaller works by freezing everything, creating an archive inside the executable with the minimum number of modules necessary to run, invoking UPX on these modules, and then when the program is run, it extracts everything to a temporary folder and runs the actual program from there. As such, startup times seem to be around 3-5 seconds, and the size of the executable is about 30 MB.

pyqtdeploy works by freezing your code, turning it into a Qt project with some pyqtdeploy-specific code, and then compiling that code as if it were a C++-style project, so you could compile a static version of Qt against this generated code.

But in order to use pyqtdeploy, you need to have the libraries at hand for linking:

LIBS += -lQtCore
LIBS += -lQtGui
LIBS += -lpython36
LIBS += -lsip

There’s no way around it – you must build Python and the other dependencies from scratch, and this could take a long time.

I have also encountered strange errors such as SOLE_AUTHENTICATION_SERVICE being undefined in the Windows API headers.

I mean, I suppose pyqtdeploy works, but is this even a path worth pursuing? What would be the pre-UPX size of such an executable – 25 MB, perhaps? That would put it on par with the AO executable.

I might as well write the launcher in C++, or switch to Tkinter.

A humanitarian mission for mesh networking

After Hurricane Maria, I was invited to a Slack group in Puerto Rico to offer my programming expertise for anyone who needed it. After beginning to comprehend the magnitude of the communications problem, I scoured for ways to set up long-distance mesh networking – no, not mobile apps like FireChat that rely on short-distance Wi-Fi or Bluetooth to establish limited local communications – rather, ways to post and find information across the entire island, with relays that could connect through the limited submarine cables to the outside Internet as a gateway for government agencies and worried relatives.

During the three weeks that I was interested in this project (but powerless to do anything, as I was taking classes), I investigated existing technologies (such as 802.11s), as well as the capabilities of router firmware, the theoretical ranges of high-gain antennas, and other existing projects.

I saw Project Loon, but never expected much of it. The project must have taken a great deal of effort to take off, but unfortunately, it seemed to have a high cost with little return. Essentially, balloons were sent from some point on Earth and then led by high-altitude winds to cross Puerto Rico for a few hours, eventually to land at some location in the United States. Despite this effort, I found very few reports of actual reception from a Project Loon balloon.

Meanwhile, someone in the mesh networking Slack channel informed me that they were working with a professor at A&M to implement a mesh network from a paper that was already written. I ultimately never saw this mesh network implemented. I felt put down by my own naivete, but, accepting that my plans were undeveloped and unexecutable, I moved on with the semester. Surely, mobile carriers must have had all hands on deck to reestablish cell phone coverage as quickly as possible, which is certainly the best long-term solution to the issue.

However, many places other than Puerto Rico remain in dire need of communications infrastructure, in towns and villages that for-profit carriers have no interest in placing coverage in. Moreover, there are islands at risk of becoming incommunicable in case of a hurricane.

I am looking to start a humanitarian mission to set up a mesh network. I find that there are three major characteristics to a theoretical successful mesh network: resilience, reach, and time to deploy.

A mesh network that is not resilient is flimsy: one failed node, perhaps due to bad weather or even vandalism, should not render all of the other nodes useless. Rather, the network should continue operating internally until connection can be reestablished with other nodes, or the situation can be avoided entirely by providing redundant connections to other nodes, or even by wormholing across the mesh network via cellular data.

A mesh network that does not reach does not have any users to bear load from, and thus becomes a functionally useless work of modern art. No, your users will not install an app from the app store – besides, with what Internet? – or buy a $50 pen-sized repeater from you. They want to get close to a hotspot – perhaps a few blocks away in Ponce – and let relatives all the way in Rio Piedras know that they are safe. And to maximize reach, of course, you need high-gain antennas to make 10-to-15-mile hops between backbone nodes that carry most of the traffic, which then distribute the traffic to subsidiary nodes down near town centers using omnidirectional antennas.
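Whether a 10-to-15-mile hop is feasible can be estimated with the standard free-space path loss formula, FSPL(dB) = 20·log10(d in km) + 20·log10(f in MHz) + 32.44. The sketch below is a back-of-the-envelope link budget only. The function names are mine, any transmit power and antenna gains plugged in are assumptions, and it ignores line of sight, Fresnel zone clearance, cable loss, and rain fade.

```python
import math

KM_PER_MILE = 1.609344


def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss in dB for a distance in km and a frequency in MHz."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44


def received_dbm(tx_dbm: float, tx_gain_dbi: float, rx_gain_dbi: float,
                 distance_km: float, freq_mhz: float) -> float:
    """Estimated received power: transmit power plus both antenna gains minus path loss."""
    return tx_dbm + tx_gain_dbi + rx_gain_dbi - fspl_db(distance_km, freq_mhz)


# Usage sketch with assumed values: 20 dBm radio, 24 dBi dishes, 15 miles at 5.8 GHz.
# print(received_dbm(20, 24, 24, 15 * KM_PER_MILE, 5800))
```

The result has to sit comfortably above the receiver sensitivity of the chosen radio at the desired data rate, with margin to spare, before a hop is worth attempting.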

A mesh network that takes too long to deploy will not find much use in times of disaster. Cellular companies work quickly to restore coverage – a mesh network simply cannot beat cell coverage once it has been reestablished. First responders will bring satellite phones, and chances of switching to an entirely new communication system will dwindle as the weeks pass and volunteer workflows solidify.

How do I wish to achieve these mesh networking goals?

  • Resilience – use Elixir and Erlang/OTP to build fault-tolerant systems and web servers that can shape traffic to accommodate both real-time and non-real-time demands. For instance, there could be both voice and text coming through a narrow link, which could be as low as 20 Mbps. There may also be an indirect route to the Internet, but there may not be enough bandwidth to allow all users to be routed to the Internet. Moreover, decentralized data structures exist that can be split and merged, in case new nodes are added or nodes become split in an emergency, with possible delayed communication between nodes due to an unreliable data link.
  • Reach – allow users to reach the access point via conventional Wi-Fi or cellular radio, and connect via web browser. Nodes use omnidirectional antennas for distribution and high-gain antennas to form a backbone that can span dozens of miles.
  • Time to deploy – use off-the-shelf consumer hardware and allow flexibility in choice of hardware. Make the specs open for anyone to build a node if desired. Pipeline the production of such nodes with a price tag of less than $400 per node.
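The split-and-merge data structures mentioned under Resilience can be illustrated with a last-writer-wins map, one of the simplest conflict-free replicated data types. This is a sketch under my own assumptions: each entry is a `(timestamp, node_id, status)` tuple, node clocks are roughly synchronized, and the names are hypothetical. A real deployment would more likely use an existing CRDT library.

```python
def merge(a: dict, b: dict) -> dict:
    """Merge two replicas of a last-writer-wins map.

    Each value is a (timestamp, node_id, status) tuple. For every key the entry
    with the greater (timestamp, node_id) pair wins, so the result does not
    depend on the order in which replicas are merged.
    """
    merged = dict(a)
    for key, entry in b.items():
        if key not in merged or entry[:2] > merged[key][:2]:
            merged[key] = entry
    return merged


# Usage sketch: two nodes that were cut off from each other reconnect.
# ponce = {"ana": (100, "ponce", "needs water")}
# rio = {"ana": (160, "riopiedras", "safe")}
# merge(ponce, rio)  # keeps the newer report from "riopiedras"
```

Because the merge is commutative, associative, and idempotent, nodes can exchange state over an unreliable link in any order, and as often as they like, and still converge on the same map.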

I imagine that the mesh network will predominantly serve a disaster-oriented social network with various key features:

  • Safety check – when and where did this person report that they were okay or needed assistance?
  • Posts – both public and private
  • Maps – locations that are open for business, distress calls, closed roads, etc.
  • Real-time chat (text and voice)
  • Full interaction with the outside world via Internet relays
  • Limited routing to specific websites on the open Internet, if available (e.g. Wikipedia)

One issue with this idea, I suppose, is the prerequisite of having a fully decentralized social network, which has yet to be developed. But we cannot wait until the next big disaster to begin creating resilient mesh networks. We must begin experimenting very soon.

Threading in AC

Last time I read about threading, I read that “even experts have issues with threading.” Either that’s not very encouraging, or I’m an expert for even trying.

There are a bunch of threads and event loops in AC, and the problem of how to deal with them is inevitable. Here is an executive summary of the primary threads:

  • UI thread (managed by Qt)
    • Uses asyncio event loop, but some documentation encourages me to wrap it with QEventLoop for some unspecified reason. So far, it’s working well without using QEventLoop.
    • Core runs on the same thread using a QPygletWidget, which I assume separates resources from the main UI thread since it is OpenGL.
      • Uses QTimer for calling draw and update timers
      • Uses Pyglet’s own event loop for coordinating events within the core
  • Network thread (QThread)
    • Uses asyncio event loop, but it uses asyncio futures and ad-hoc Qt signals to communicate with the UI thread.
    • Main client handler is written using asyncio.Protocol with an async/await reactor pattern, but I want to see if I can import a Node-style event emitter library, since I was going that route anyway with the code I have written.
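The network-thread arrangement above can be sketched with the standard library alone. This is not the AC code: it swaps the QThread for a plain `threading.Thread` and the Qt signals for a `queue.Queue`, and `NetworkThread` is a name I made up. The part that carries over is the pattern of one private asyncio loop per thread, fed through `asyncio.run_coroutine_threadsafe`.

```python
import asyncio
import queue
import threading


class NetworkThread(threading.Thread):
    """Runs a private asyncio event loop; finished futures go back through a queue."""

    def __init__(self, results: queue.Queue):
        super().__init__(daemon=True)
        self.loop = asyncio.new_event_loop()
        self.results = results

    def run(self):
        # This thread owns the loop and is the only one that runs it.
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()
        self.loop.close()

    def submit(self, coro):
        """Schedule `coro` on the network loop from any other thread."""
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        # The callback fires in the network thread; the queue is the thread-safe handoff.
        future.add_done_callback(self.results.put)
        return future

    def stop(self):
        """Ask the loop to stop from outside, then wait for the thread to exit."""
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.join()
```

The UI side would drain the queue from a timer and call `.result()` on each finished future. In a Qt application, emitting a signal from the done callback would play the role of the queue.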

My fear is that the network threads will all get splintered into one thread per character session, and that Pyglet instances on the UI thread will clash, resulting in me splintering all of the Pyglet instances into their own threads. If left unchecked, I could end up with a dozen threads and a dozen event loops.

Then, we have the possibility of asset worker threads for downloading. The issue with this is possible clashing when updating the SQLite local asset repository.
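One way to avoid that clash without any explicit lock is to give a single thread ownership of the SQLite connection and have download workers hand their results over through a queue. The sketch below uses a hypothetical `assets` table and placeholder digests, and none of the names come from the actual AC code.

```python
import queue
import sqlite3
import threading

STOP = object()  # sentinel telling the writer to finish


def writer_loop(db_path: str, jobs: queue.Queue) -> None:
    """Own the only write connection and apply queued (path, sha256) rows one at a time."""
    conn = sqlite3.connect(db_path)  # created and used in this thread only
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS assets (path TEXT PRIMARY KEY, sha256 TEXT)"
        )
    while True:
        job = jobs.get()
        if job is STOP:
            break
        with conn:  # one transaction per job, committed on success
            conn.execute("INSERT OR REPLACE INTO assets VALUES (?, ?)", job)
    conn.close()


def download_worker(names, jobs: queue.Queue) -> None:
    """Stand-in for a download thread: it only reports what it 'downloaded'."""
    for name in names:
        # A real worker would fetch the file and hash it here.
        jobs.put((name, "placeholder-digest"))
```

Any number of workers can run at once because `queue.Queue` serializes the handoff; the writer is started first and stopped by putting `STOP` on the queue after the workers have been joined.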

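The only way to properly manage all of these threads is to take my time writing clean code. I cannot rush to write code that “works” because of the risk of dozens of race conditions bubbling up, not to mention the technical debt that I would incur. Still, if I design this correctly, I should rarely need an explicit lock: the GIL keeps individual operations on built-in objects atomic, although it does not protect multi-step read-modify-write sequences, so shared state should cross threads only through queues, signals, or futures.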