Kubeflow Jupyter Configurations

For those of you familiar with Kubeflow you’ve probably worked with the form below. This is the screen you get when creating a custom Jupyter notebook server. Here you can dictate all the pieces of information required to deploy the jupyter notebook via Kubeflow.

In the documentation for Kubeflow notebooks step 12 discusses setting up customer configurations which would allow you to add volumes or Kubernetes secrets (DB creds, etc.) to your notebook before deploying it. This is great if you have credentials you’d like to keep safe while also allowing to access to DBs containing the information you need to perform your analysis.

The instructions discuss creating a PodDefault (similar to a Kubernetes PodPreset) which will allow for Kubeflow to inject the configuration variables into the Pod when it deploys. However I believe this explanation is a little light on the important details and I would like to elaborate a little more on how to get this setup correctly.

First thing to know is that these secrets / volumes and the PodDefault must be created in the namespace which will run the notebook server. For Kubeflow v1.0 you can isolate users into their own namespaces and this is the namespace which you must deploy these resources in order for the Jupyter web app to recognize them.

Image you have a Kubernetes secret like the one below:

apiVersion: v1
kind: Secret
metadata:
  name: kf-mysql-secret
  namespace: <user's namespace>
type: Opaque
stringData:
  mysql_user: "xxxx"
  mysql_password: "xxxx"
  mysql_host: "xxxx"
  mysql_port: "xxxx"

Now we need to create a PodDefault to go along with this Kubrenetes secret:

apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: add-mysql-secret
  namespace: <user's namespace>
spec:
  selector:
    matchLabels:
      add-mysql-secret: "true"
  desc: "Adds creds for MySQL DB"
  env:
  - name: DB_USER
    valueFrom:
      secretKeyRef:
        name: kf-mysql-secret
        key: mysql_user
  - name: DB_PWD
    valueFrom:
      secretKeyRef:
        name: kf-mysql-secret
        key: mysql_password
  - name: DB_HOST
    valueFrom:
      secretKeyRef:
        name: kf-mysql-secret
        key: mysql_host
  - name: DB_PORT
    valueFrom:
      secretKeyRef:
        name: kf-mysql-secret
        key: mysql_port

The “matchLabels” above is how Kubeflow selects which Pods it will inject the DB creds into.

Now that we have a Secret and a PodDefault both created in the user’s namespace we next have to update the configmap for the Jupyter web app. We can do that using the command below:

kubectl edit configmap jupyter-web-app-config -n kubeflow

For linking up the PodDefaults you only have to modify the piece I have highlighted below. However there are a ton of other useful configurations here you might want to play around with as well.

Once you have updated the configmap be sure to restart the deployment with the following command:

kubectl rollout restart deployment jupyter-web-app-deployment -n kubeflow

And there you go! The configuration should now be there for you to select.

Constructing the Twitch Bot

I want to give the IRC bot a post of its own as not to distract too much from the machine learning post. This bot will be used to log IRC chat messages and poll for viewer counts on Twitch. There are a ton of different ways to implement this so feel free to take what you see here and build a bot of your own. Here I’ll be using Python to construct the bot. All our data will be stored in a SQLite database, and we will spin off three threads for each Twitch streamer we plan to track. One thread will keep track of whether the streamer is online or not, another thread will log all the IRC chat activity, and the finial thread will log the viewer count. Using SQLite is great because later we’ll be able to easily access our data via Pandas.

So first let’s define the main loop that will kick everything off:

import sys;
import time;
import socket;
import sqlite3;
import requests;
from threading import Thread, Lock;
from time import gmtime, strftime, sleep;

if __name__ == "__main__":

	password  = "oauth:YOUR_CODE_HERE";
	nickname  = "YOUR_USERNAME";
	client_id = "YOUR_CLIENT_ID";

	study_channels = ["drdisrespectlive", "lirik", "manvsgame", "cohhcarnage"]

	TwitchBot.db_con = sqlite3.connect("twitch.db", check_same_thread=False);
	TwitchBot.db_con.text_factory = str;

	bots = []
	for i in range(0, len(study_channels)):
		bots.append(TwitchBot(study_channels[i], nickname, password, client_id))

	running = True;

	while running:
		command = raw_input("");
		if command == "exit":
			running = False;
			for i in range(0, len(study_channels)):
				bots[i].running = False

	TwitchBot.db_con.close();

Let’s break this down a little:

  1. if __name__ == “__main__”: checks to make sure we are running the script directly (as opposed to being imported via someone else).
  2. Next we define a few variables which we will need in order to connect to the IRC chat room and poll for viewer counts. password is our OAuth password we will use when connecting to the chat rooms. If you don’t already have one grab one from herenickname is the username that you use when logging into Twitch (therefore you’ll need to sign up for Twitch in order to log into the IRC rooms). And finally we have client_id which is our ID which Twitch gives us when we register our application (see here). Once you have this information filled out you’ll be good to go!
  3. Now comes the fun part, which streamers are you going to follow? study_channels is a list of the channels we are going to log. When picking steamers you might want to consider when they stream (i.e. if you want the bot running 24/7 find 2/3 night steamers and 2/3 daytime streamers) and how many threads you want to use for the bot (3 threads per streamer plus one main thread).
  4. TwitchBot.db_con is a static variable for the SQLite database connection. (We will write the TwitchBot class itself soon!) TwitchBot.db_con.text_factory = str just ensures that every time we attempt to save a string to the SQLite database is in the standard string format we are expecting (some emotes are sent in non-ascii format. stuff like “¯_(ツ)_/¯”, which we will ignore). I should also point out that check_same_thread=False prevents SQLite from throwing errors when accessing a single database via multiple threads. In any other scenario this would be a problem but we will be using our own mutex to ensure that only one thread writes to the database at a time.
  5. Next we create an instance of the TwitchBot class for each streamer we plan to follow and then we “listen” for when the user types “exit” and then proceed to shut down all the bots we created and close the database connection.

Next let’s define the constructor for our TwitchBot class. Here I will check to see whether our SQLite database already has the tables we need and create them if they don’t already exist. We will also spin off the three threads we’ll need for each channel we plan to log information for.

class TwitchBot:

	mutex  = Lock();
	db_con = None;

	def __init__(self, channel, nickname, password, client_id):

		self.net_data       = None;
		self.net_sock       = socket.socket();
		self.channel        = channel;
		self.nickname       = nickname;
		self.password       = password;
		self.client_id      = client_id;
		self.running        = True;
		self.is_online      = False;
		self.cur_stream_ind = None;

		TwitchBot.mutex.acquire();

		self.cur = TwitchBot.db_con.cursor();
		self.cur.execute("CREATE TABLE IF NOT EXISTS messages (id integer PRIMARY KEY AUTOINCREMENT, username text NOT NULL, message text NOT NULL, channel text NOT NULL, datetime_recv datetime NOT NULL);")
		self.cur.execute("CREATE TABLE IF NOT EXISTS viewers (id integer PRIMARY KEY AUTOINCREMENT, num_viewers integer NOT NULL, channel text NOT NULL, datetime_recv datetime NOT NULL);")
		self.cur.execute("CREATE TABLE IF NOT EXISTS streams (id integer PRIMARY KEY AUTOINCREMENT, start_time datetime NOT NULL, end_time datetime, channel text NOT NULL);")
		TwitchBot.db_con.commit();

		TwitchBot.mutex.release();

		thread_func = self.check_online_thread;
		thread = Thread(target=thread_func, args=(self,));
		thread.start();

		sleep(1)

		thread_func = self.irc_thread;
		thread = Thread(target=thread_func, args=(self,));
		thread.start();

		sleep(1)

		thread_func = self.viewer_thread;
		thread = Thread(target=thread_func, args=(self,));
		thread.start();

		sleep(1)

The first thing we do here is acquire the mutex and check whether three tables exist within our SQLite database (by the way, you can hover your mouse over the code above and use the slider to see the rest of the SQL code). The first table we need is called “messages”:

In this table we have the IRC username for who said the chat message, the message itself, the channel in which the message was said, and the date + time for when we received the message. We also need a table for our viewer counts, let’s call this table “viewers”:

In this table we have the number of viewer, the channel, and the date + time for which we polled for the viewer count. Next we need a table that will allow us to differentiate the different streams given by the same streamer (this also allows us to stop polling the viewer count when they go offline). Let’s call this table “streams”:

Here we have the start and end time for each stream and the channel that went live. Setting things up this way will allow us to easily grab all the messages and viewer counts for each of the individual streams. If you would like to view your SQLite database similar to the images above you should take a look at DB Browser for SQLite. Next we spin off our three threads, the first of which checks to see whether the streamer is online:

def check_online_thread(self, data):
	while self.running:
		try:
			resp = requests.get(
				"https://api.twitch.tv/helix/streams?user_login="+self.channel,
				headers={"Client-ID": self.client_id}
			);

			resp = resp.json();

			if "data" in resp:
				if not self.is_online:
					print("33[93mChecking to see if "+self.channel+" is online33[0m.");

				if not resp["data"]:
					if self.is_online:

						TwitchBot.mutex.acquire();
						self.cur.execute("UPDATE streams SET end_time = ? WHERE id = ?;",
							(strftime("%Y-%m-%d %H:%M:%S", gmtime()), self.cur_stream_ind)
						);
						TwitchBot.db_con.commit();
						TwitchBot.mutex.release();

						print("33[93m""+self.channel+"" is offline33[0m.");

						self.is_online = False;
					else:
						print("33[93m""+self.channel+"" is offline33[0m.");

				else:
					if not self.is_online:

						TwitchBot.mutex.acquire();
						self.cur.execute("INSERT INTO streams (start_time, end_time, channel) VALUES (?, NULL, ?);",
							(strftime("%Y-%m-%d %H:%M:%S", gmtime()), self.channel)
						);
						TwitchBot.db_con.commit();
						self.cur_stream_ind = self.cur.lastrowid;
						TwitchBot.mutex.release();

						print("33[93m""+self.channel+"" is online33[0m.");

						self.is_online = True;

			sleep(60);
		except:
			continue

Here we are simply sending a get request to Twitch using our application ID and Twitch sends us back a JSON string which we can decode and check whether there is an data available for our streamer. If the “data” vector in our get response is empty then the streamer is offline. If they were previously online but are now offline we update the end time in the database and inform all the other threads. On the other hand, if the stream was previously offline but is not online we insert a new row into the “streams” table and set the start time for the stream. The end time is set to NULL until the stream ends. This allows us to ignore streams for which we do not have complete data on as they are still in progress. Next let’s take a look at the thread which interfaces with the IRC chat room:

#--------------------------------------------------------------------------
def parse(self, line):
	prefix = "";
	trailing = [];
	if line[0] == ":":
		prefix, line = line[1:].split(" ", 1);
	if line.find(" :") != -1:
		line, trailing = line.split(" :", 1);
		args = line.split();
		args.append(trailing);
	else:
		args = line.split();
	command = args.pop(0);	
	return prefix, command, args

#--------------------------------------------------------------------------
def process(self, prefix, command, args):
	if command == "PING":
		self.net_sock.send(self.net_data.replace("PING", "PONG"));
	elif command == "376":
		self.net_sock.send("JOIN #"+self.channel+"rn");
	elif command == "PRIVMSG":
		if self.is_online:
			user_name = prefix.split("!")[0];
			user_message = args[1];
			print("33[91m"+user_name+"33[0m: "+args[1]);
			self.save_msg(user_name, user_message);

#--------------------------------------------------------------------------
def save_msg(self, user_name, user_message):
	TwitchBot.mutex.acquire();

	self.cur.execute("INSERT INTO messages (username, message, channel, datetime_recv) VALUES (?, ?, ?, ?);",
		(user_name, user_message, self.channel, strftime("%Y-%m-%d %H:%M:%S", gmtime()))
	);
	TwitchBot.db_con.commit();

	TwitchBot.mutex.release();

#--------------------------------------------------------------------------
def irc_thread(self, data):
	self.net_sock.connect(("irc.twitch.tv", 6667));

	self.net_sock.send("PASS "+self.password+"rn");
	self.net_sock.send("NICK "+self.nickname+"rn");

	while self.running:
		try:
			self.net_data = self.net_sock.recv(1024);
			if not self.net_data: break;

			lines = self.net_data.split("rn");
			lines.remove("");

			for line in lines:
				prefix, command, args = self.parse(line);
				self.process(prefix, command, args);
		except:
			continue;

	self.net_sock.close();

This thread is a bit more complicated but we first open a network socket to connect to “irc.twitch.tv” via port 6667. This is where all the IRC chat rooms are hosted. We send our password and nickname in order to log into the chat server. Then we continuously check for data coming from the Twitch IRC server. When we do receive data we pass it to self.parse(line) which grabs the prefix, command, and arguments for the response we got from the server. We only care to process a subset of all the possible responses we might receive from the IRC server and def process(self, prefix, command, args) handles this for us. Command “376” informs us that the login handshake has been finalized and we may now join an IRC channel. Command “PING” is to check whether we are still listening to the server. We have to reply with a “PONG” or the server will close our socket connection because it assumes we disconnected or timed out. The “PRIVMSG” command tells us we got a chat message and we should save it to the SQLite database. Saving occurs in the def save_msg(self, user_name, user_message) method and is relatively straight forward. Next we’ll need a thread for polling the viewer count. Polling the viewer count is very similar to checking whether the streamer is online:

#--------------------------------------------------------------------------
def grab_num_viewers(self):
	resp = requests.get(
		"https://api.twitch.tv/helix/streams?user_login="+self.channel,
		headers={"Client-ID": self.client_id}
	);

	try:
		resp = resp.json();
	except:
		return -1;

	if "data" in resp:
		if not resp["data"]:
			return -1;
		return resp["data"][0]["viewer_count"];
	else:
		return -1;

#--------------------------------------------------------------------------
def save_viewers(self, num_viewers):
	TwitchBot.mutex.acquire();

	self.cur.execute("INSERT INTO viewers (num_viewers, channel, datetime_recv) VALUES (?, ?, ?);",
		(num_viewers, self.channel, strftime("%Y-%m-%d %H:%M:%S", gmtime()))
	);
	TwitchBot.db_con.commit();
	
	TwitchBot.mutex.release();

#--------------------------------------------------------------------------
def viewer_thread(self, data):
	while self.running:
		try:
			if self.is_online:
				num_viewers = self.grab_num_viewers();
				if num_viewers != -1:
					print("33[94mNumber of viewers33[0m 33[93m(""+self.channel+"")33[0m: "+str(num_viewers));
					self.save_viewers(num_viewers);
				sleep(10);
			else:
				sleep(3);
		except:
			continue;

Here we again send a get request using the Twitch API, decode the JSON string, and check for the data list – however this time we are actually making use of the information within the data list!

Using the data collected by this bot we can construct plots like the one below!

Do you see that big spike between the 10:00 and 12:00 marks? Well if you check the VOD Bikeman (another streamer) actually raids ManVsGame 6 hours and 48 minutes into the stream and we were able to capture it using our bot!

You should make your own bot and give this a try! Let it run for a few days and collect plenty of data because I am planning to make a follow-up post were we will apply a statistical classifier to the IRC messages and see how well we can predict which stream they came from!

Knight Online Terrain (GTD format for 1299)

I want to talk a little bit about reading from binary files using C/C++ and the file structure for the Knight Online terrain files (i.e. files with the extension “.gtd”). To follow along you need two things:

  1. A C/C++ compiler, here I’ll be using Visual C++ (Community 2015)
  2. A 1299 terrain file, I’m using “karus_start.gtd” but any .gtd will work as long as it’s from the 1299 version of the game – however I’d recommend that you also use “karus_start.gtd” so that our number match up.

The basic concepts I’ll discuss here are valid for reading any of the Knight Online content files. Also, if a particular map spans multiple versions of the game (with each version varying in the file structure) you can use the data you know is required for the 1299 version (discussed in this post) to figure out what was added/removed in other versions. Therefore, even though we’ll be discussing the 1299 file format this information is relevant for cracking all versions of the Knight Online terrain files. Here I’ll be assuming you have little to no C/C++ knowledge. However, I am assuming you know general programming concepts and the difference between memory on the stack vs. memory on the heap.

First let’s read in the .gtd file.

/*
*/

#include "stdio.h"

//-----------------------------------------------------------------------------
int main(int argc, char** argv) {

	char filename[] = "karus_start.gtd";

	FILE* fp = fopen(filename, "rb");
	if(fp == NULL) {
		printf("ERROR: Unable to open file "%s"n", filename);
		return -1;
	}

	fclose(fp);

	return 0;
}

The function int main(int argc, char** argv) is the main entry point for the program. #include “stdio.h” includes the declarations for all standard input and output operations. char filename[] = “karus_start.gtd”; defines a variable name “filename” which is a pointer to an array of “char”s allocated on the stack. FILE* fp = fopen(filename, “rb”); opens the file “karus_start.gtd” in the “read binary” mode. This means we will only be reading from (as opposed to writing to) the file and we wish to treat the contents of the file as if it were all binary data. The rest of the code here checks to make sure we successfully opened the file (printing an error if we were unable to open it) and immediately closing the file and exiting the program.

So far this program don’t do much. We are just opening the file (if it exists) and closing it. What we want to do is start reading in the contents of the file. We can’t read in the contents unless we know a little bit about how the terrain information is structured. So before we start read in all the binary information let’s take a moment to discuss the .gtd file at a higher level.

Having played the game you have probably noticed how the maps are broken up into these little squares.

Each of these rectangles is referred to as a “tile” and each tile is a 4×4 rectangle. The map data for a tile consists of four points in 3D space. In this coordinate system X and Z run along the floor and Y points up into the sky. Therefore, since we know that tiles are 4×4 rectangles, the X and Z coordinates for all the tiles on the map can be set just by knowing how big the map is (this will be important later because only the Y coordinate for these four points is stored in the .gtd file). The idea of a tile isn’t enough though, we also need the concept of a “patch.” A patch consists of a rectangle of 8×8 tiles. Patches carry absolutely no information about how the map should be rendered. Patches carry information for how the map should be updated and are useful for increasing computational efficiency. We eventually plan to have trees blowing in the wind or NPCs performing particular animations, etc. – it would be a waste of CPU power to have trees located on the opposite side of the map blowing in the wind when the player isn’t anywhere near them. Patches are useful for only updating objects which are close to the player.

Interestingly, the 1299 .gtd files start with an 4-byte integer which we do not know the function of (I’d guess it’s likely a version number).

int iIdk;
fread(&iIdk, sizeof(int), 1, fp);

An int in Visual C++ is 4 bytes. int iIdk; allocates the 4 bytes on the stack which we will use to store the first 4 bytes of the .gtd file. fread(&iIdk, sizeof(int), 1, fp); actually reads 4 bytes from the file and copies them into the 4 bytes we have allocated. Next we read in the name of the map.

int iNL;
fread(&iNL, sizeof(int), 1, fp);

char m_szName[0xFF] = "";
if(iNL > 0) fread(m_szName, sizeof(char), iNL, fp);

Here we are reading in another 4-byte integer, although this time it is the length of the map name. This integer tells us how many characters there are in the map’s name. Now we’ll read in the size of the map, set the number of patches based on the map size, and grab all the map data.

int m_ti_MapSize = 0;
fread(&m_ti_MapSize, sizeof(int), 1, fp);
int m_pat_MapSize = (m_ti_MapSize - 1) / 8;

_N3MapData* m_pMapData = new _N3MapData[m_ti_MapSize*m_ti_MapSize];
fread(m_pMapData, sizeof(_N3MapData), m_ti_MapSize*m_ti_MapSize, fp);

Based on the code we have looked at up to this point it should be relatively straight forward to understand what’s going on here. However, we haven’t talked at all about _N3MapData , let’s take a look.

struct _N3MapData {
	float fHeight;
	unsigned int bIsTileFull : 1;
	unsigned int Tex1Dir : 5;
	unsigned int Tex2Dir : 5;
	unsigned int Tex1Idx : 10;
	unsigned int Tex2Idx : 10;

	_N3MapData(void) {
		bIsTileFull = true;
		fHeight = FLT_MIN;
		Tex1Idx = 1023;
		Tex1Dir = 0;
		Tex2Idx = 1023;
		Tex2Dir = 0;
	}
};

Each of the four points making up a tile has one of these map data structs. fHeight is the Y coordinate for a point (remember we can get the X and Z coordinate based on how big the map is). bIsTileFull is less straight forward but I believe it gets used when setting which textures to use for a particular tile (here I’m taking “full” to mean “stuff is covering this tile”). I would be interested to hear possible alternative theories about this. The rest of the variables are used for deciding which texture to blend together (Tex1Idx and Tex2Idx) and how they should be oriented (Tex1Dir and Tex2Dir).

Next we read in the patch information.

float** m_ppPatchMiddleY = new float*[m_pat_MapSize];
float** m_ppPatchRadius = new float*[m_pat_MapSize];

for (int x = 0; x<m_pat_MapSize; x++) {
	m_ppPatchMiddleY[x] = new float[m_pat_MapSize];
	m_ppPatchRadius[x] = new float[m_pat_MapSize];

	for (int z = 0; z<m_pat_MapSize; z++) {
		fread(&(m_ppPatchMiddleY[x][z]), sizeof(float), 1, fpMap);
		fread(&(m_ppPatchRadius[x][z]), sizeof(float), 1, fpMap);
	}
}

This simply gives us how high each 8×8 patch is and the radius of its bounding sphere. Next we’ll read in the grass information.

unsigned char* m_pGrassAttr = new unsigned char[m_ti_MapSize*m_ti_MapSize];
fread(m_pGrassAttr, sizeof(unsigned char), m_ti_MapSize*m_ti_MapSize, fp);

char* m_pGrassFileName = new char[260];
fread(m_pGrassFileName, sizeof(char), 260, fp);

This information is used for rendering the grass which pops up out of the tile. Then we read in information about the textures which we will use for each of the tiles on the map.

int m_NumTileTex;
fread(&m_NumTileTex, sizeof(int), 1, fp);

int NumTileTexSrc;
fread(&NumTileTexSrc, sizeof(int), 1, fp);

char** SrcName = new char*[NumTileTexSrc];
for (int i = 0; i<NumTileTexSrc; i++) {
	SrcName[i] = new char[260];
	fread(SrcName[i], 260, 1, fp);
}

short SrcIdx, TileIdx;
for (int i = 0; i<m_NumTileTex; i++) {
	fread(&SrcIdx, sizeof(short), 1, fp);
	fread(&TileIdx, sizeof(short), 1, fp);
}

m_NumTileTex is the number of textures which get used for this particular map. NumTileTexSrc is the number of files from which all these textures will be loaded from. SrcIdx is the index into the source name array for which this texture is located and TileIdx is the index of this texture within the file. Basically the DTEX folder has a bunch of files in it and each of these files contains a bunch of textures put together one by one. In order to get one of these textures you first must know which file to look into and then how far into that file to look in order  to get your texture.

Here are the last few variables we need.

int NumLightMap;
fread(&NumLightMap, sizeof(int), 1, fp);

int m_iRiverCount;
fread(&m_iRiverCount, sizeof(int), 1, fp);

int m_iPondMeshNum;
fread(&m_iPondMeshNum, sizeof(int), 1, fp);

If you have a few rivers or ponds on the map then each files have to be loaded and looked through.

This is the basic structure of the 1299 .gtd file. If you would like all the code along with a simplistic OpenGL implementation of the rendering go here.