• Ultra-Fast Trajectory Generation (Motion Profiling)

Alternative Title: Jaci you're not allowed to play with Assembly anymore.

Motion profiling has been making the rounds lately, with more and more applications for it, including Automation, Self-Driving Vehicles, and pretty much anything that moves with a computer. A while ago I wrote the Pathfinder library, which was designed for 2D S-Curve trajectory generation, in short, a path planner. Today I'm going to combine that with a post I wrote a few weeks ago on making Vision Processing more efficient.

You can see the project on GitHub here

Results

Before I get into how this was done, I'm going to start with why you should care. Here's the results of my testing (all tests done on a stock-standard FRC RoboRIO)

TRAPEZOID (LINEAR)
SIMD (1.0kHz x100):     23ms AVG: 0.23ms
C    (1.0kHz x100):    126ms AVG: 1.26ms

SCURVE (LINEAR)
C    (1.0kHz x100):   4792ms AVG: 47.92ms

TRAPEZOID (TRAJECTORY)
SIMD (1.0kHz 10KSamples x100):   1352ms AVG: 13.52ms
C    (1.0kHz 10KSamples x100):   2930ms AVG: 29.30ms

SCURVE (TRAJECTORY)
SIMD (1.0kHz 10KSamples x100):   5328ms AVG: 53.28ms
C    (1.0kHz 10KSamples x100):   6786ms AVG: 67.86ms

(there is no SCURVE-LINEAR-SIMD, this is explained later)

TRAPEZOID is a standard Trapezoidal motion profile. SCURVE is a standard SCurve Motion Profile.

LINEAR denotes a 1D plane of movement, that is, start moving, and stop moving x distance away. TRAJECTORY is 2D, allowing you to plot a path on a 2D plane (like an FRC field floor). Linear is generally more useful for subsystems, while Trajectory is more applicable for navigating the field during autonomous. Please note that part of the Trajectory benchmark in timing is due to the generation of the first, linear profile (e.g. 4792ms of the SCurve Trajectory is due to the SCurve linear generation).

SIMD is the optimizations we'll be putting into place today, while C is the speed of a standard C implementation (identical to the Pathfinder library).

Each profile is generated at a 1000th of a second (1kHz) timescale, meaning your following mechanism (e.g. PID) would update at 1000 times a second. To give an idea, 100Hz is much more widely used, but this is a benchmark after all.

Each profile is generated 100 times over, with the average for a single run being displayed beside the total runtime. The AVG: is the speed at which the algorithm will run for a single generation.

Trajectories are put through 10,000 samples for the splines, resulting in an extremely smooth path. This is fairly typical.

You can see the specifics of the generation here, if you want an idea of what parameters are being passed to the profiles.

Below is an image of the S-Curve Trajectory that was generated to give you an idea. The left set (blue) is in relation to the X-Axis, and the right set (red) is in relation to Time. From top to bottom: Y-Axis, Velocity, Acceleration, Distance.

If you want to view the Trajectories for yourself, you can download them from my dropbox, or run the program yourself. You can view them in your favorite CSV viewer (I like Tableau). Microsoft Excel is also a good option.

Getting into it

Let's get started. First of all, I highly recommend you see my first post on ARM Neon, as I won't be going into nearly as much detail here. If you want specific implementation details, see the source code on the GitHub project.

First of all, let's start with linear trajectories. Linear Trajectories are commonly used to follow a straight line, but have some control over how it moves in that straight line (i.e. it's motion profile). This can include such things as how fast to speed up, "easing" into and out of acceleration, etc.

I'll start by assuming you're somewhat familiar with the difference between Trapezoidal, Triangular and S-Curve Motion Profiles. If you're not, I highly recommend this seminar by Team 254 & 971.

Since we're generating profiles, we need a way to store them. In our case, we're going to store them as 'segments', with each segment having the following properties:
- How far it has travelled (the position)
- How fast it's going at that moment (the velocity)
- How fast the velocity is changing (the acceleration)

Each segment is generated at a certain time scale. This will match up with how fast you're checking and updating your control loops (usually, anywhere from 50 to 200 times a second, with 100 seeming the most common). So if we were to configure the timescale at 100Hz, there would be a 10ms gap between when one segment ends and the next begins. These segments are stored in an Array.

Trapezoidal Profiles

Trapezoidal Profiles are fairly basic. They have a 'ramp up' potion, a 'hold steady' portion, and a 'ramp down' portion. Below is an example, with distance over time being on top, velocity over time in the middle, and acceleration over time at the bottom.

The great thing about Trapezoidal Profiles is that they're incredibly easy to generate, as they are just 2 triangles, a rectangle and a few motion equations (v = at, s = ut + 0.5at^2, you get the idea). This also means that you can find the Acceleration, Velocity and Position without looking at the entry before it. This is great because we can use our friend ARM Neon to vectorize and parallelize the generation of Trapezoidal Profiles. In our case, we can generate 4 points at once, meaning that it only takes a quarter of the time it usually would to generate our profile.

In short, we use the ARM Neon registers and instruction set to generate points 0, 1, 2 and 3 at the same time, then move on to 4, 5, 6, 7, etc. This is much faster than the standard alternative of going 0, then 1, then 2, then 3, etc. This is what gains us a net speed increase. This sounds all well and good, but there is a few challenges that come with this. You may have already thought about it, in fact.

The issue is, that we need to go 0, 1, 2, 3, and then 4, 5, 6, 7. If we're doing everything parallel, this is a big tough to do, as we don't really have an iterative index to work on. There is no for (i = 0; i < len; i++), as we're doing it 4 at a time.

The way to overcome this is through the use of the .rodata section of the final binary. The .rodata section stands for Read-Only Data, and is used to store data that is immutable at runtime, and immediately available in memory (it's the equivilent of const in C). At the top of the ASM file we have the following:

.section .rodata
f32_step_1:
.float 0
.float 1
f32_step_2:
.float 2
.float 3

And later in the file, we have:

// Write 'segment step' into q5
ldr r6, =f32_step_1
vld1.32 d10, [r6]
ldr r6, =f32_step_2
vld1.32 d11, [r6]

What we're doing is loading the float values [0, 1, 2, 3] into the Neon registers d10 and d11 (which correspond to q5). The quadword registers can hold 4 floats each, so we've loaded one of them with those four, sequential floats. We're going to call this the 'segment step'.

Furthermore, we've also zero'd register q4, which we will be using as the 'segment index offset'.

At the beginning of every loop, we run the following: asm vadd.f32 q6, q4, q5

This will add the segment step ([0, 1, 2, 3]) to the segment index offset ([0, 0, 0, 0]) and store it in q6. Now, q6 contains the index for all four of the segments we're currently working on. After the loop, we add 4 the entire q4 register, so on the next iteration, q6 will be [4 + 0, 4 + 1, 4 + 2, 4 + 3]. This is what allows us to generate what would usually be sequential in parallel.

The Assembly Code for this is available (and commented!) in this file.

S-Curve Profiles

It's time for some sad news. Unfortunately, we can't make S-Curve linear generation any faster, as they are generating using the Chief-Delphi dubbed "Copioli Method", which requires knowledge of the previous generation to influence the next one.

Trajectory Generation

Now we've covered linear profiles, for mechanisms that move in 1 dimension. But what about 2D? This is how Pathfinder's core library works, taking Waypoints in the 2D plane and creating a smooth profile for your robot to follow. For now, we're going to focus on the generation of this path, but not any modifications (i.e. splitting the path into 2 for tank drive, or 4 for swerve drive).

Trajectories have extra "extensions" on top of the standard segments.
- The X-Position
- The Y-Position

First of all, how do we generate a 2D profile? Well, first we have to have some waypoints. For example:

[
x = -4, y = -1, angle = 45 deg
x = -1, y = 2, angle = 0
x = 2, y = 4, angle = 0
]

The measurement units can be whatever you want, as to avoid an argument about Metric vs Imperial.

From these Waypoints, we create some Splines. These splines are, in essence, curves that join the waypoints. In this case, we will have 2 splines, between Waypoint 0 & 1, and between Waypoint 1 & 2.

Next, we flatten the Splines into a 1D profile. We find the length along the arc of the Splines (i.e. the distance covered when travelling from end-to-end along each spline), and pass that into our linear generation. This can be either Trapezoidal, S-Curve, or really any kind of profile you want. This mostly governs the speed and acceleration along the profile. The 'shape' of the profile (Y plotted against X) is dependent on how the Spline itself is generated (we use Cubic Hermite).

Next, we generate the headings for the Spline (where the robot will face at each point along the spline). We also fit the linear path back onto the 2D plane, taking each 1D segment and fitting it along the spline.

There are a few subroutines that go into doing this. The first is the fitting of the spline. Thankfully, this is a single iteration, so we don't need to worry about making this more efficient (you can't parallelize a single iteration).

The next is finding the length of the spline along its arc. This is by far the most processing intensive potion. This is done by using calculus. We can't do this algebraically, so we have to do it by 'sampling' the spline over and over again. The higher the samples, the lower the time delta between calculations, and therefore, more accuracy. It's sort of like how the more accurate you measure a coastline, the longer it gets. I've put a graphical representation below (credits to this page).

(I know that isn't an entirely accurate representation of how this algorithm works, but it helps to illustrate the point on how important high samples are)

It's fairly typical to use anywhere from 1,000 to 1,000,000 samples (depending on the application) to find the arc length of this spline. This makes it a perfect candidate for Neon. Let's take a look at what the algorithm looks like in C.

float pf_spline_deriv_2(float a, float b, float c, float d, float e, float k, float p) {
double x = p * k;
return (5*a*x + 4*b) * (x*x*x) + (3*c*x + 2*d) * x + e;
}

// Meanwhile, in another function
for (i = 0; i < sample_count; i = i + 1) {
// t ranges from 0.0 -> 1.0 (a percentage)
t = i / sample_count;
// a, b, c, d, e, knot are all properties of the spline
dydt = pf_spline_deriv_2(a, b, c, d, e, knot, t);
integrand = sqrt(1 + dydt*dydt) / sample_count;
arc_length += (integrand + last_integrand) / 2;
last_integrand = integrand;
}
double total_arc_length = knot * arc_length;

This seems fairly tame, and those calculations can be easily be very processing-intensive when the sample_count is in the thousands. There is one problem, though, and that is arc_length += (integrand + last_integrand) / 2. What's happening is that we're relying on the previous result of the integrand before adding to the arc length. However, this is only a small part of the equation, with the bulk of the processing power happening in the lines above it.

There's a few ways we can fix this:

• Store integrands in an array, and add them later (ouch, an array of size 1,000,000? Yea no)
• Find something else to parallelize (I'm too stubborn for that, no)
• Half and Half.

Here's the plan. We'll put the calculation of t, dydt and integrand all into Neon. Since we're cutting the loop size into quarters (we can process 4 at a time), we only really have to use 4 integrands at once. If you've ever used a shift register, you'll see where I'm going with this. After the calculation of t, dydt and integrand, we'll do the following:

• Store a value (last_integrand) in a single-float register (in our case, s4)
• Store the integrand of our 4 parallel calculations in a q register (must be at or under q7, you'll see why)
• Add last_integrand and the first entry of q7 (q7[0]), divide by 2 and add to arc length
• Add q7[0] and q7[1], divide by 2 and add to arc length
• Add q7[1] and q7[2], divide by 2 and add to arc length
• Add q7[2] and q7[3], divide by 2 and add to arc length
• Make q7[3] the new last_integrand (store it in s4), ready for the next iteration.

Essentially what this does is makes the bulk of calculation happen in parallel, and then does the adding to arc length sequentially. It's a wonderful little tradeoff that works lovely.

"But Jaci, why can't we use a register higher than Q7 for the integrand?".
Well, I'm glad you asked. There's 2 ways to pull a single value out of a double or quad word register. You can either:
- a) Push it out to memory (or the stack), then read it back again (very inefficient)
- b) Grab it from a smaller register.

ARM Neon registers are interesting in how they are formatted. The s registers are 'single word' (a 'word' is 4 bytes = 32 bits, the same size as a float/integer). The d registers are 'double word', and the q registers are 'quad word'. They are quite cleverly arranged so that d0 is made up of s0 and s1, and q0 is made up of d0 and d1 (s0-s3). You can see a graphical view of this below (credits to the ARM NEON Developer Guide):

The thing is, that this only goes up to q7. After that, all the registers are quadword and are not made up of smaller registers. For us to access a single register (i.e. a single float), we have to access an s register. This is why it must be under q7. All other values, however, are more than happy being stored in any Neon register.

The Assembly Code for this is available (and commented!) in this file.

Conclusion

Using some handy ARM Neon, we're able to dramatically increase the generation speed of Linear and Trajectory profiles. This can be incredibly useful for systems with low processing power, or where you want to generate profiles on the fly.

You can see the project on GitHub here

• RoboRIO Vision Tracking at 30fps without a Coprocessor

TL;DR: We can run 30fps Vision Tracking on the RoboRIO at 7-8ms per frame, equating to only about 23% CPU Time. For scale, the FRC Network Comms program uses about 20% CPU constantly

• Writing your own mDNS implementation to hack the FRC Driver Station

When I wrote Toast for Java, I had a master plan for the simulation environment, and it had to fit the following agenda:

• It has to work with the official FRC Driver Station
• It has to work when the Driver Station is on a different computer
• It has to be easy to use
• It can't have external dependencies outside of the bundled software.

WELP. In Toast Java, the Driver Station communications didn't follow those last two points. It required you to do all this. Painful, right?

Well, for Toast C++ I decided to try and fix this. I played around with a few ideas:

• Use the bonjour SDK (eww, it requires me to link a system library, it's not cross platform and generally just not a fun time)
• Exec a command to dns-sd (still requires bonjour, doesn't work on linux, and is basically identical to the Toast Java implementation)
• RollYourOwnâ„˘ mDNS implementation (well, I guess).

So, here we are. TL;DR, here's the code. Toast C++ will now broadcast its own mDNS service that the Driver Station can connect to immediately.

How does FRC use mDNS?

FRC uses mDNS to provide a path from the Driver Station to the RoboRIO. When you enter your team number into the Driver Station, it tries to connect to a lot of different hosts, including:

• roborio-####-frc.local
• roborio-####.local
• roborio-####.lan
• 10.##.##.2
• 172.22.11.2

This is how the Driver Station communicates with your robot, through UDP to be exact, with a separate TCP channel being used to pass log messages, fault status and joystick descriptors (actually joystick values are passed through UDP).

mDNS uses services in order to broadcast information about different devices. Each service has a name, a type, a domain and a port. Let's look at our friend: The RoboRIO.

For our RoboRIO, roborio-####-frc is the name of the service. The type is _ni._tcp, with _ni belonging to National Instruments, the manufacturers of the RoboRIO. The domain is local, which is fairly standard in the world of mDNS. Finally, the port we're going to use is 3580.

To register this on the command line of a bonjour-enabled windows computer, we can use dns-sd -R roborio-####-frc _ni._tcp local 3580, which will allow the computer's bonjour server to listen for discovery requests to this service and reply with the current computer. Neato.

But, we don't want to use dns-sd. I want the hard way, dammit.

The Hard Way

First, I had to figure out how the mDNS packet format worked. Thankfully, I knew already that mDNS works on a request-response type system. When your computer wants to know who owns an mDNS service, it sends out a request over multicast (multicast is the 'm' in 'mDNS'), and if a computer responds to that, it sends a response back. Because it's multicast, it is sent over UDP, and therefore doesn't require a handshake, which means I don't have to kill the running bonjour service on my computer in order to run my own responder. Neato.

I also found out, to my delight, that the requests/responses don't have a sequence number, which means I can send out responses on a timer instead of waiting for a request, I opted for a 5 second loop.

Now, figuring out the response packet protocol itself. I searched an searched, but the first result that came up was literally that wiki entry I wrote for Toast Java, so I could tell I wasn't going to get far. All the information I found was incredibly vague and didn't really help much, at least not in this purpose.

So, off to wireshark we go.

I needed a test-case to base my observations on. I don't have a RoboRIO with me currently, so I just observed UDP port 5353 (the mDNS port) for a while on my network to see if there were any devices on my network broadcasting. There was my chromecast, but that didn't prove to be much help. Interestingly enough, my printer showed up. I remembered that my printer runs a HTTP server to check supply levels and queue jobs and whatnot, as its got a print-server built in. This is my new test subject.

Cutting In

I connected to the HTTP Interface on my printer and took a look at the wireshark capture. I hadn't accessed the printer's Web UI in a long time, so I knew it wasn't in my computer's domain name cache, so theoretically my computer should send out a request, and the printer should send back a response. Sure enough,

Here we have something I can work with. The first 42 bytes are the UDP and Internet Protocols themselves, so we can disregard those. The actual mDNS packet starts at 0x2A. Interestingly enough, wireshark has an inbuilt inspector for mDNS packets, allowing me to skip a lot of the dirty work and figure out what everything is called very easily. A+.

The first 2 bytes are the Transaction ID. These bytes are always 0 in an mDNS packet. The next 2 bytes are the flags, for us, its 0x80 or 0x84.

The next few bytes describe how many questions we're asking (0, as we're a response), how many answers we're giving (in our case, 3), authority RRs (0), and additional RRs (1 for us, the A record).

Now comes the fun stuff.

The PTR Record

Our first record is the humble PTR record. PTR is short for domain name PoinTeR. In mDNS, the PTR record is used to lookup instances of a service type. In the case of my printer, it links the _http._tcp.local service type to Samsung CLP-310 Series (CLP-315w)._http._tcp.local (a mouthful, right?). The PTR record is really just a descriptor for the type of a registered service.

The SRV Record (and fun encoding stuff)

The SRV record is used to link a server instance to a service, or, in this case, a service to a target name (more on this later). In the case of my printer, it links Samsung CLP-310 Series (CLP-315w)._http._tcp.local to SEC001599377D00.local, on port 80. This links the service with a port and hostname that the client can connect to, therefore registering an 'instance' of the service.

This is really cool and all, but here comes the fun part of the protocol. Occasionally, instead of a string of bytes representing a name, it may give you 0xc0 <some byte>. 0xc0 is a mask of bits that tells the protocol to look back to the <some byte> index, and read the name at that position. In our case, instead of repeating the name of the service (Samsung CLP-310 Series (CLP-315w)._http._tcp.local), it instead uses 0xc0 0x28, to look at hexadecimal index 0x28, which is, you guessed it, the service name.

The TXT Record

Our next, and final 'Answer' record is the TXT record. TXT records are used to carry custom key-value data pairs. In our case, we don't have any of these pairs, as we don't really need them, but alas, we shall keep the record anyway.

But wait, first some implementation information.

So far I've talked about my humble old printer, let's put this into some context.

PTR

For us, the PTR record links the roborio-####-frc._ni._tcp.local service name to the _ni._tcp service type and local domain.

SVR

The SVR record links the roborio-####-frc._ni._tcp.local service name, on port 3580 to some host name. For us, we're going to use some-host-name.local. Usually this would be your <computer hostname>.local, but it can really be whatever you want. In Toast C++, we use toast-mdns-resolve.local by default.

Okay, now you can have Additional Records!

Our last record is the A record. If you've done any work with DNS before, whether it be hosting a website or whatnot, this will seem familiar.

The A record takes some hostname, and links it to an IP address that our computers can understand. In our case, it will match some-host-name.local to the local loopback, 127.0.0.1 (localhost).

Putting this all together, we get a working mDNS responder that can tell the Driver Station that you are in fact the RoboRIO, allowing you to 'simulate' the RoboRIO, while still using the official Driver Station. Cool stuff.

Why localhost?

Really, you can use whatever IP address you want. The reason we use localhost is simple: the simulation is running on the same computer as the driver station... most of the time. In Toast C++, you can change this localhost to any IP you want. If you want other computers on your network to be able to run their driver station and connect to your running simulation, you can set this IP address to the IP address of your computer as seen by your network, e.g. 10.0.0.x. This may be useful if your development computer is different from your driver station, such as if you use a Mac for development, but the Driver Station only runs on windows.

Another useful quirk of using localhost is that other computers on your network that try to connect to roborio-####-frc.local will resolve the IP address to 127.0.0.1, and connect to themselves. This is particularly useful if you work in a team environment where multiple programmers are testing at once, which can stop conflicts for which driver station gets which simulation.

Conclusion.

I hope you enjoyed this little write up, and helps you understand a little bit more about mDNS in the FRC sense. Feel free to look through the Driver Station Communications code for C++ here. If you have any questions, feel free to email me at jaci.brunning@gmail.com

Soon.