-
Ultra-Fast Trajectory Generation (Motion Profiling)
Alternative Title: Jaci you're not allowed to play with Assembly anymore.
Motion profiling has been making the rounds lately, with more and more applications for it, including Automation, Self-Driving Vehicles, and pretty much anything that moves with a computer. A while ago I wrote the Pathfinder library, which was designed for 2D S-Curve trajectory generation, in short, a path planner. Today I'm going to combine that with a post I wrote a few weeks ago on making Vision Processing more efficient.
You can see the project on GitHub here
Results
Before I get into how this was done, I'm going to start with why you should care. Here's the results of my testing (all tests done on a stock-standard FRC RoboRIO)
TRAPEZOID (LINEAR) SIMD (1.0kHz x100): 23ms AVG: 0.23ms C (1.0kHz x100): 126ms AVG: 1.26ms SCURVE (LINEAR) C (1.0kHz x100): 4792ms AVG: 47.92ms TRAPEZOID (TRAJECTORY) SIMD (1.0kHz 10KSamples x100): 1352ms AVG: 13.52ms C (1.0kHz 10KSamples x100): 2930ms AVG: 29.30ms SCURVE (TRAJECTORY) SIMD (1.0kHz 10KSamples x100): 5328ms AVG: 53.28ms C (1.0kHz 10KSamples x100): 6786ms AVG: 67.86ms(there is no SCURVE-LINEAR-SIMD, this is explained later)
TRAPEZOIDis a standard Trapezoidal motion profile.SCURVEis a standard SCurve Motion Profile.LINEARdenotes a 1D plane of movement, that is, start moving, and stop moving x distance away.TRAJECTORYis 2D, allowing you to plot a path on a 2D plane (like an FRC field floor). Linear is generally more useful for subsystems, while Trajectory is more applicable for navigating the field during autonomous. Please note that part of the Trajectory benchmark in timing is due to the generation of the first, linear profile (e.g. 4792ms of the SCurve Trajectory is due to the SCurve linear generation).SIMDis the optimizations we'll be putting into place today, whileCis the speed of a standard C implementation (identical to the Pathfinder library).Each profile is generated at a 1000th of a second (1kHz) timescale, meaning your following mechanism (e.g. PID) would update at 1000 times a second. To give an idea, 100Hz is much more widely used, but this is a benchmark after all.
Each profile is generated 100 times over, with the average for a single run being displayed beside the total runtime. The
AVG:is the speed at which the algorithm will run for a single generation.Trajectories are put through 10,000 samples for the splines, resulting in an extremely smooth path. This is fairly typical.
You can see the specifics of the generation here, if you want an idea of what parameters are being passed to the profiles.
Below is an image of the S-Curve Trajectory that was generated to give you an idea. The left set (blue) is in relation to the X-Axis, and the right set (red) is in relation to Time. From top to bottom: Y-Axis, Velocity, Acceleration, Distance.

If you want to view the Trajectories for yourself, you can download them from my dropbox, or run the program yourself. You can view them in your favorite CSV viewer (I like Tableau). Microsoft Excel is also a good option.
Getting into it
Let's get started. First of all, I highly recommend you see my first post on ARM Neon, as I won't be going into nearly as much detail here. If you want specific implementation details, see the source code on the GitHub project.
First of all, let's start with linear trajectories. Linear Trajectories are commonly used to follow a straight line, but have some control over how it moves in that straight line (i.e. it's motion profile). This can include such things as how fast to speed up, "easing" into and out of acceleration, etc.
I'll start by assuming you're somewhat familiar with the difference between Trapezoidal, Triangular and S-Curve Motion Profiles. If you're not, I highly recommend this seminar by Team 254 & 971.
Since we're generating profiles, we need a way to store them. In our case, we're going to store them as 'segments', with each segment having the following properties:
- How far it has travelled (the position)
- How fast it's going at that moment (the velocity)
- How fast the velocity is changing (the acceleration)Each segment is generated at a certain time scale. This will match up with how fast you're checking and updating your control loops (usually, anywhere from 50 to 200 times a second, with 100 seeming the most common). So if we were to configure the timescale at 100Hz, there would be a 10ms gap between when one segment ends and the next begins. These segments are stored in an Array.
Trapezoidal Profiles
Trapezoidal Profiles are fairly basic. They have a 'ramp up' potion, a 'hold steady' portion, and a 'ramp down' portion. Below is an example, with distance over time being on top, velocity over time in the middle, and acceleration over time at the bottom.

The great thing about Trapezoidal Profiles is that they're incredibly easy to generate, as they are just 2 triangles, a rectangle and a few motion equations (
v = at,s = ut + 0.5at^2, you get the idea). This also means that you can find the Acceleration, Velocity and Position without looking at the entry before it. This is great because we can use our friend ARM Neon to vectorize and parallelize the generation of Trapezoidal Profiles. In our case, we can generate 4 points at once, meaning that it only takes a quarter of the time it usually would to generate our profile.In short, we use the ARM Neon registers and instruction set to generate points 0, 1, 2 and 3 at the same time, then move on to 4, 5, 6, 7, etc. This is much faster than the standard alternative of going 0, then 1, then 2, then 3, etc. This is what gains us a net speed increase. This sounds all well and good, but there is a few challenges that come with this. You may have already thought about it, in fact.
The issue is, that we need to go 0, 1, 2, 3, and then 4, 5, 6, 7. If we're doing everything parallel, this is a big tough to do, as we don't really have an iterative index to work on. There is no
for (i = 0; i < len; i++), as we're doing it 4 at a time.The way to overcome this is through the use of the
.rodatasection of the final binary. The.rodatasection stands forRead-Only Data, and is used to store data that is immutable at runtime, and immediately available in memory (it's the equivilent ofconstin C). At the top of the ASM file we have the following:.section .rodata f32_step_1: .float 0 .float 1 f32_step_2: .float 2 .float 3And later in the file, we have:
// Write 'segment step' into q5 ldr r6, =f32_step_1 vld1.32 d10, [r6] ldr r6, =f32_step_2 vld1.32 d11, [r6]What we're doing is loading the float values
[0, 1, 2, 3]into the Neon registersd10andd11(which correspond toq5). The quadword registers can hold 4 floats each, so we've loaded one of them with those four, sequential floats. We're going to call this the 'segment step'.Furthermore, we've also zero'd register
q4, which we will be using as the 'segment index offset'.At the beginning of every loop, we run the following:
asm vadd.f32 q6, q4, q5This will add the segment step (
[0, 1, 2, 3]) to the segment index offset ([0, 0, 0, 0]) and store it inq6. Now,q6contains the index for all four of the segments we're currently working on. After the loop, we add 4 the entireq4register, so on the next iteration,q6will be[4 + 0, 4 + 1, 4 + 2, 4 + 3]. This is what allows us to generate what would usually be sequential in parallel.The Assembly Code for this is available (and commented!) in this file.
S-Curve Profiles
It's time for some sad news. Unfortunately, we can't make S-Curve linear generation any faster, as they are generating using the Chief-Delphi dubbed "Copioli Method", which requires knowledge of the previous generation to influence the next one.
Trajectory Generation
Now we've covered linear profiles, for mechanisms that move in 1 dimension. But what about 2D? This is how Pathfinder's core library works, taking Waypoints in the 2D plane and creating a smooth profile for your robot to follow. For now, we're going to focus on the generation of this path, but not any modifications (i.e. splitting the path into 2 for tank drive, or 4 for swerve drive).
Trajectories have extra "extensions" on top of the standard segments.
- The X-Position
- The Y-Position
- The Heading (bearing)First of all, how do we generate a 2D profile? Well, first we have to have some waypoints. For example:
[ x = -4, y = -1, angle = 45 deg x = -1, y = 2, angle = 0 x = 2, y = 4, angle = 0 ]The measurement units can be whatever you want, as to avoid an argument about Metric vs Imperial.
From these Waypoints, we create some Splines. These splines are, in essence, curves that join the waypoints. In this case, we will have 2 splines, between Waypoint 0 & 1, and between Waypoint 1 & 2.
Next, we flatten the Splines into a 1D profile. We find the length along the arc of the Splines (i.e. the distance covered when travelling from end-to-end along each spline), and pass that into our linear generation. This can be either Trapezoidal, S-Curve, or really any kind of profile you want. This mostly governs the speed and acceleration along the profile. The 'shape' of the profile (Y plotted against X) is dependent on how the Spline itself is generated (we use Cubic Hermite).
Next, we generate the headings for the Spline (where the robot will face at each point along the spline). We also fit the linear path back onto the 2D plane, taking each 1D segment and fitting it along the spline.
There are a few subroutines that go into doing this. The first is the fitting of the spline. Thankfully, this is a single iteration, so we don't need to worry about making this more efficient (you can't parallelize a single iteration).
The next is finding the length of the spline along its arc. This is by far the most processing intensive potion. This is done by using calculus. We can't do this algebraically, so we have to do it by 'sampling' the spline over and over again. The higher the samples, the lower the time delta between calculations, and therefore, more accuracy. It's sort of like how the more accurate you measure a coastline, the longer it gets. I've put a graphical representation below (credits to this page).

(I know that isn't an entirely accurate representation of how this algorithm works, but it helps to illustrate the point on how important high samples are)
It's fairly typical to use anywhere from 1,000 to 1,000,000 samples (depending on the application) to find the arc length of this spline. This makes it a perfect candidate for Neon. Let's take a look at what the algorithm looks like in C.
float pf_spline_deriv_2(float a, float b, float c, float d, float e, float k, float p) { double x = p * k; return (5*a*x + 4*b) * (x*x*x) + (3*c*x + 2*d) * x + e; } // Meanwhile, in another function for (i = 0; i < sample_count; i = i + 1) { // t ranges from 0.0 -> 1.0 (a percentage) t = i / sample_count; // a, b, c, d, e, knot are all properties of the spline dydt = pf_spline_deriv_2(a, b, c, d, e, knot, t); integrand = sqrt(1 + dydt*dydt) / sample_count; arc_length += (integrand + last_integrand) / 2; last_integrand = integrand; } double total_arc_length = knot * arc_length;This seems fairly tame, and those calculations can be easily be very processing-intensive when the
sample_countis in the thousands. There is one problem, though, and that isarc_length += (integrand + last_integrand) / 2. What's happening is that we're relying on the previous result of the integrand before adding to the arc length. However, this is only a small part of the equation, with the bulk of the processing power happening in the lines above it.There's a few ways we can fix this:
- Store integrands in an array, and add them later (ouch, an array of size 1,000,000? Yea no)
- Find something else to parallelize (I'm too stubborn for that, no)
- Half and Half.
Here's the plan. We'll put the calculation of
t,dydtandintegrandall into Neon. Since we're cutting the loop size into quarters (we can process 4 at a time), we only really have to use 4 integrands at once. If you've ever used a shift register, you'll see where I'm going with this. After the calculation oft,dydtandintegrand, we'll do the following:- Store a value (last_integrand) in a single-float register (in our case,
s4) - Store the integrand of our 4 parallel calculations in a q register (must be at or under
q7, you'll see why) - Add last_integrand and the first entry of
q7(q7[0]), divide by 2 and add to arc length - Add
q7[0]andq7[1], divide by 2 and add to arc length - Add
q7[1]andq7[2], divide by 2 and add to arc length - Add
q7[2]andq7[3], divide by 2 and add to arc length - Make
q7[3]the new last_integrand (store it ins4), ready for the next iteration.
Essentially what this does is makes the bulk of calculation happen in parallel, and then does the adding to arc length sequentially. It's a wonderful little tradeoff that works lovely.
"But Jaci, why can't we use a register higher than Q7 for the integrand?".
Well, I'm glad you asked. There's 2 ways to pull a single value out of a double or quad word register. You can either:
- a) Push it out to memory (or the stack), then read it back again (very inefficient)
- b) Grab it from a smaller register.ARM Neon registers are interesting in how they are formatted. The
sregisters are 'single word' (a 'word' is 4 bytes = 32 bits, the same size as a float/integer). Thedregisters are 'double word', and theqregisters are 'quad word'. They are quite cleverly arranged so thatd0is made up ofs0ands1, andq0is made up ofd0andd1(s0-s3). You can see a graphical view of this below (credits to the ARM NEON Developer Guide):
The thing is, that this only goes up to
q7. After that, all the registers are quadword and are not made up of smaller registers. For us to access a single register (i.e. a single float), we have to access ansregister. This is why it must be underq7. All other values, however, are more than happy being stored in any Neon register.The Assembly Code for this is available (and commented!) in this file.
Conclusion
Using some handy ARM Neon, we're able to dramatically increase the generation speed of Linear and Trajectory profiles. This can be incredibly useful for systems with low processing power, or where you want to generate profiles on the fly.
You can see the project on GitHub here
Read More - Store integrands in an array, and add them later (ouch, an array of size 1,000,000? Yea no)
-
RoboRIO Vision Tracking at 30fps without a Coprocessor
TL;DR: We can run 30fps Vision Tracking on the RoboRIO at 7-8ms per frame, equating to only about 23% CPU Time. For scale, the FRC Network Comms program uses about 20% CPU constantly
Read More -
Writing your own mDNS implementation to hack the FRC Driver Station
When I wrote Toast for Java, I had a master plan for the simulation environment, and it had to fit the following agenda:
- It has to work with the official FRC Driver Station
- It has to work when the Driver Station is on a different computer
- It has to be easy to use
- It can't have external dependencies outside of the bundled software.
WELP. In Toast Java, the Driver Station communications didn't follow those last two points. It required you to do all this. Painful, right?
Well, for Toast C++ I decided to try and fix this. I played around with a few ideas:
- Use the bonjour SDK (eww, it requires me to link a system library, it's not cross platform and generally just not a fun time)
- Exec a command to
dns-sd(still requires bonjour, doesn't work on linux, and is basically identical to the Toast Java implementation) - RollYourOwn™ mDNS implementation (well, I guess).
So, here we are. TL;DR, here's the code. Toast C++ will now broadcast its own mDNS service that the Driver Station can connect to immediately.
How does FRC use mDNS?
FRC uses mDNS to provide a path from the Driver Station to the RoboRIO. When you enter your team number into the Driver Station, it tries to connect to a lot of different hosts, including:
roborio-####-frc.localroborio-####.localroborio-####.lan10.##.##.2172.22.11.2
This is how the Driver Station communicates with your robot, through UDP to be exact, with a separate TCP channel being used to pass log messages, fault status and joystick descriptors (actually joystick values are passed through UDP).
mDNS uses services in order to broadcast information about different devices. Each service has a name, a type, a domain and a port. Let's look at our friend: The RoboRIO.
For our RoboRIO,
roborio-####-frcis the name of the service. The type is_ni._tcp, with_nibelonging to National Instruments, the manufacturers of the RoboRIO. The domain islocal, which is fairly standard in the world of mDNS. Finally, the port we're going to use is3580.To register this on the command line of a bonjour-enabled windows computer, we can use
dns-sd -R roborio-####-frc _ni._tcp local 3580, which will allow the computer's bonjour server to listen for discovery requests to this service and reply with the current computer. Neato.But, we don't want to use
dns-sd. I want the hard way, dammit.The Hard Way
First, I had to figure out how the mDNS packet format worked. Thankfully, I knew already that mDNS works on a request-response type system. When your computer wants to know who owns an mDNS service, it sends out a request over multicast (multicast is the 'm' in 'mDNS'), and if a computer responds to that, it sends a response back. Because it's multicast, it is sent over UDP, and therefore doesn't require a handshake, which means I don't have to kill the running bonjour service on my computer in order to run my own responder. Neato.
I also found out, to my delight, that the requests/responses don't have a sequence number, which means I can send out responses on a timer instead of waiting for a request, I opted for a 5 second loop.
Now, figuring out the response packet protocol itself. I searched an searched, but the first result that came up was literally that wiki entry I wrote for Toast Java, so I could tell I wasn't going to get far. All the information I found was incredibly vague and didn't really help much, at least not in this purpose.
So, off to wireshark we go.
I needed a test-case to base my observations on. I don't have a RoboRIO with me currently, so I just observed UDP port 5353 (the mDNS port) for a while on my network to see if there were any devices on my network broadcasting. There was my chromecast, but that didn't prove to be much help. Interestingly enough, my printer showed up. I remembered that my printer runs a HTTP server to check supply levels and queue jobs and whatnot, as its got a print-server built in. This is my new test subject.
Cutting In
I connected to the HTTP Interface on my printer and took a look at the wireshark capture. I hadn't accessed the printer's Web UI in a long time, so I knew it wasn't in my computer's domain name cache, so theoretically my computer should send out a request, and the printer should send back a response. Sure enough,

Here we have something I can work with. The first 42 bytes are the UDP and Internet Protocols themselves, so we can disregard those. The actual mDNS packet starts at
0x2A. Interestingly enough, wireshark has an inbuilt inspector for mDNS packets, allowing me to skip a lot of the dirty work and figure out what everything is called very easily. A+.The first 2 bytes are the Transaction ID. These bytes are always
0in an mDNS packet. The next 2 bytes are the flags, for us, its0x80or0x84.The next few bytes describe how many questions we're asking (0, as we're a response), how many answers we're giving (in our case, 3), authority RRs (0), and additional RRs (1 for us, the A record).
Now comes the fun stuff.
The PTR Record
Our first record is the humble PTR record. PTR is short for domain name PoinTeR. In mDNS, the PTR record is used to lookup instances of a service type. In the case of my printer, it links the
_http._tcp.localservice type toSamsung CLP-310 Series (CLP-315w)._http._tcp.local(a mouthful, right?). The PTR record is really just a descriptor for the type of a registered service.The SRV Record (and fun encoding stuff)
The SRV record is used to link a server instance to a service, or, in this case, a service to a target name (more on this later). In the case of my printer, it links
Samsung CLP-310 Series (CLP-315w)._http._tcp.localtoSEC001599377D00.local, on port80. This links the service with a port and hostname that the client can connect to, therefore registering an 'instance' of the service.This is really cool and all, but here comes the fun part of the protocol. Occasionally, instead of a string of bytes representing a name, it may give you
0xc0 <some byte>.0xc0is a mask of bits that tells the protocol to look back to the<some byte>index, and read the name at that position. In our case, instead of repeating the name of the service (Samsung CLP-310 Series (CLP-315w)._http._tcp.local), it instead uses0xc0 0x28, to look at hexadecimal index0x28, which is, you guessed it, the service name.The TXT Record
Our next, and final 'Answer' record is the TXT record. TXT records are used to carry custom key-value data pairs. In our case, we don't have any of these pairs, as we don't really need them, but alas, we shall keep the record anyway.
Additional Records..?
But wait, first some implementation information.
So far I've talked about my humble old printer, let's put this into some context.
PTR
For us, the PTR record links the
roborio-####-frc._ni._tcp.localservice name to the_ni._tcpservice type andlocaldomain.SVR
The SVR record links the
roborio-####-frc._ni._tcp.localservice name, on port3580to some host name. For us, we're going to usesome-host-name.local. Usually this would be your<computer hostname>.local, but it can really be whatever you want. In Toast C++, we usetoast-mdns-resolve.localby default.Okay, now you can have Additional Records!
Our last record is the
Arecord. If you've done any work with DNS before, whether it be hosting a website or whatnot, this will seem familiar.The A record takes some hostname, and links it to an IP address that our computers can understand. In our case, it will match
some-host-name.localto the local loopback,127.0.0.1(localhost).Putting this all together, we get a working mDNS responder that can tell the Driver Station that you are in fact the RoboRIO, allowing you to 'simulate' the RoboRIO, while still using the official Driver Station. Cool stuff.

Why localhost?
Really, you can use whatever IP address you want. The reason we use localhost is simple: the simulation is running on the same computer as the driver station... most of the time. In Toast C++, you can change this localhost to any IP you want. If you want other computers on your network to be able to run their driver station and connect to your running simulation, you can set this IP address to the IP address of your computer as seen by your network, e.g.
10.0.0.x. This may be useful if your development computer is different from your driver station, such as if you use a Mac for development, but the Driver Station only runs on windows.Another useful quirk of using localhost is that other computers on your network that try to connect to
roborio-####-frc.localwill resolve the IP address to127.0.0.1, and connect to themselves. This is particularly useful if you work in a team environment where multiple programmers are testing at once, which can stop conflicts for which driver station gets which simulation.Conclusion.
I hope you enjoyed this little write up, and helps you understand a little bit more about mDNS in the FRC sense. Feel free to look through the Driver Station Communications code for C++ here. If you have any questions, feel free to email me at jaci.brunning@gmail.com
Read More -
It's Coming
Soon.
Read More -
Starting Over
I've decided to start over on this blog. All the old, irrelevant posts have been archived and removed for the sake of being able to take this website in a new direction.
I'm hoping to post a little bit more about my programming work, whether it be any discoveries that I make, demonstrations or showcases. Hopefully this will help my site fit better into the purpose I initially intended for it to fulfill.
Ciao for now,
~JaciRead More