A while back I wrote up a short introductory overview of Genetic Algorithms. Just for the shear, absolute fun of it, I thought I’d implement a basic genetic algorithm within SQL Server and use it to solve a form of the knapsack problem.
Now first a few comments on this. As the title states, this is not really an appropriate use of SQL Server. Genetic algorithms are generally implemented in languages like Java, C++, C#, etc; languages that are good at complex mathematics, string manipulation and have complex data types. I’m not necessarily using efficient, well-performing methods here, UDFa abound. This is not an article on best practices and well-performing code. I’m also doing no error handling, which I would if this were a real system (in a more suitable language)
Still, doing just for the sake of seeing if it’s possible is all sorts of fun. So, without further ado, the knapsack problem, an approximate solution with genetic algorithms in SQL Server. (For the record, this is a multi-constrained, bounded knapsack problem)
There’s a bag that has a maximum volume that it can hold and a maximum mass that it can hold (and we assume that we can pack perfectly with no wasted space). There are eight items, all with different masses, different volumes and different values. The goal here is to maximise the total value of all the items contained within the bag.
CREATE TABLE ObjectStatistics ( ObjectNumber TINYINT NOT NULL, Mass NUMERIC(4,2) NOT NULL, Volume NUMERIC(4,2) NOT NULL, Value NUMERIC(4,2) NOT NULL, NumberAvailable TINYINT NOT NULL, CONSTRAINT pk_ObjectStats PRIMARY KEY CLUSTERED (ObjectNumber) ); CREATE TABLE BagStatistics ( MaxMass NUMERIC(5,2), MaxVolume NUMERIC(5,2) ); INSERT INTO dbo.ObjectStatistics (ObjectNumber, Mass, Volume, Value, NumberAvailable) VALUES (1,0.5,0.45,2,50), (2,1.5,0.25,20,10), (3,0.1,1.25,8,150), (4,0.25,0.1,1,250), (5,0.67,0.3,6,100), (6,0.34,0.75,18,5), (7,1.25,0.25,5,40), (8,1.1,0.8,10,25); INSERT INTO dbo.BagStatistics (MaxMass, MaxVolume) VALUES (100, 75);
Those two tables set up the constraints for the scenario, the maximum mass and volume for the bag and the mass, volume, value and maximum number available for each of the items.
Next we need a way to store the potential solutions (referred to as ‘individuals’). I decided on two tables (pretending to be doing proper DB design here). Keeping everything in one table would also have worked, it would just have required a different implementation later on.
CREATE TABLE Individuals ( Generation SMALLINT, IndividualID SMALLINT, CONSTRAINT pk_Individuals PRIMARY KEY CLUSTERED (Generation, IndividualID) ); CREATE TABLE IndividualDetails ( Generation SMALLINT, IndividualID SMALLINT, ObjectNumber TINYINT, Quantity TINYINT, CONSTRAINT pk_IndividualDetails PRIMARY KEY CLUSTERED (Generation, IndividualID, ObjectNumber), CONSTRAINT fk_IndividualDetails_Individuals FOREIGN KEY (Generation, IndividualID) REFERENCES dbo.Individuals (Generation, IndividualID) );
I’m going to add two more tables, one to make the processing easier (it’ll become clear later where this is used) and one to store some results. Also a view to get around the ‘cannot use RAND inside a UDF’ limitation.
CREATE TABLE Numbers ( Value TINYINT PRIMARY KEY, BinaryRepresentation CHAR(8) UNIQUE ); CREATE TABLE GenerationStatistics ( Generation SMALLINT, MaxFitness NUMERIC(8,2), AvgFitness NUMERIC(8,2) ); GO CREATE VIEW RandomNumber AS SELECT RAND(CHECKSUM(NEWID())) AS Random; GO
That’s the basics out of the way. There are two important things I need to decide on straightaway.
The first is an encoding of the individual’s properties (in this case object quantities) so that the genetic algorithm can manipulate them. I’m going to use a binary representation – convert the quantity for each object into binary and concatenate the strings together. 8 bits per object.
The second is a fitness function, a way to tell which individuals are the best. In this case it’ll be a simple sum of the values of all the items, taking into account the max mass, max volume and max quantity for each.
I decided to implement both of these as scalar UDFs (yes, shock and horror, I’m using data-accessing scalar UDFs) and make them calculated columns of the Individuals table.
The integer-to-binary uses a slightly modified version of a function written by Mitch Wheat, available on his blog – http://mitch-wheat.blogspot.com/2006/10/t-sql-way-converting-integers-to.html. Thanks Mitch
-- Originally from http://mitch-wheat.blogspot.com/2006/10/t-sql-way-converting-integers-to.html CREATE FUNCTION IntegerToBinary(@intval int) RETURNS char(8) AS BEGIN DECLARE @bincode char(64); SET @bincode = '0000000100100011010001010110011110001001101010111100110111101111'; RETURN ( SUBSTRING( @bincode, ((@IntVal / 16) & 15) * 4 + 1 , 4 ) + SUBSTRING( @bincode, (@IntVal & 15) * 4 + 1 , 4 ) ); END GO CREATE FUNCTION dbo.ComputeBinaryString(@Generation SMALLINT, @Individual SMALLINT) RETURNS CHAR(64) AS BEGIN DECLARE @BinaryString VARCHAR(64) = ''; SELECT @BinaryString = @BinaryString + dbo.IntegerToBinary(Quantity) FROM dbo.IndividualDetails WHERE Generation = @Generation AND IndividualID = @Individual ORDER BY ObjectNumber; RETURN @BinaryString; END; GO CREATE FUNCTION dbo.Fitness(@Generation SMALLINT, @Individual SMALLINT) RETURNS NUMERIC(6,2) AS BEGIN DECLARE @MaxMass NUMERIC(5,2), @MaxVolume NUMERIC(5,2); SELECT @MaxMass = MaxMass, @MaxVolume = MaxVolume FROM dbo.BagStatistics; DECLARE @TotalValue NUMERIC(6,2) = 0; SELECT @TotalValue = SUM(id.Quantity*os.Value) FROM dbo.IndividualDetails id INNER JOIN dbo.ObjectStatistics os ON id.ObjectNumber = os.ObjectNumber WHERE Generation = @Generation AND IndividualID = @Individual HAVING SUM(id.Quantity*os.Mass) <= @MaxMass AND SUM(id.Quantity*os.Volume) <= @MaxVolume AND SUM(CASE WHEN id.Quantity>os.NumberAvailable THEN 1 ELSE 0 END) = 0; RETURN @TotalValue; END; GO ALTER TABLE dbo.Individuals ADD BinaryString AS dbo.ComputeBinaryString (Generation, IndividualID), Fitness AS dbo.Fitness(Generation, IndividualID); GO
At this point I have a way to store the individual in each generation, a way to represent them in order for the genetic algorithm to process them and a way to calculate the fitness. Good to go.
I now need to implement the genetic algorithm itself. The main components that I need to implement are the crossover and the mutation and then an evolution procedure to generate new individuals from the current ones.
I’m going to again use functions for the crossover and mutation, for convenience and easy coding. A multi-statement table-valued UDF for the crossover and a scalar UDF for the mutation. (Performance? What performance?)
The crossover’s fairly simple. Take 2 binary strings as parameters. Pick a random point along the string and cut the two strings at that point. Create two new strings from the first part of the first input string concatenated with the second part of the second input string and from the first part of the second input string concatenated with the second part of the first input string.
If I started with ‘00000000’ and ‘11111111’ and picked 5 as the crossover point, afterwards I’d get ‘00000111’ and ‘11111000’
CREATE FUNCTION dbo.Crossover(@BinaryString1 CHAR(64), @BinaryString2 CHAR(64)) RETURNS @NewIndividuals TABLE (CrossoverPoint TINYINT, BinaryString CHAR(64)) AS BEGIN -- where in the string is the crossover DECLARE @CrossoverPoint TINYINT; SELECT @CrossoverPoint = FLOOR(Random*64)+1 FROM RandomNumber; INSERT INTO @NewIndividuals SELECT @CrossoverPoint, SUBSTRING(@BinaryString1,0,@CrossoverPoint) + SUBSTRING(@BinaryString2,@CrossoverPoint, 64-@CrossoverPoint+1) UNION ALL SELECT @CrossoverPoint, SUBSTRING(@BinaryString2,0,@CrossoverPoint) + SUBSTRING(@BinaryString1,@CrossoverPoint, 64-@CrossoverPoint+1); RETURN; END GO
The mutation’s not much more complex. Take a single binary string as input. Pick a random position along the string. Flip that bit (0->1; 1->0). Return the modified string
CREATE FUNCTION dbo.ApplyMutation(@BinaryString CHAR(64)) RETURNS CHAR(64) AS BEGIN DECLARE @MutationLocation TINYINT, @SwappedBit CHAR(1); -- Which bit are we toggling? SELECT @MutationLocation = FLOOR(Random*64)+1 FROM RandomNumber; -- simple toggle for the chosen bit SET @SwappedBit = 1 - CAST(SUBSTRING(@BinaryString, @MutationLocation,1) AS BIT); -- and stuff it back in SET @BinaryString = STUFF(@BinaryString,@MutationLocation,1,@SwappedBit); RETURN @BinaryString; END GO
One thing to note at this point is that neither the crossover nor the mutation has any dependency on the details of the problem being solved. This is intentional, and is generally how genetic algorithms work. With the details encoded properly (in this case into a binary string), the crossover and mutation should be able to work just on the encoded string without needing any knowledge of the actual problem.
Last thing needed here is something that will take the existing individuals, apply the crossover and mutation and return new individuals. This time I think a stored procedure would be best.
CREATE PROCEDURE Evolve (@CurrentGeneration INT) AS BEGIN CREATE TABLE #SelectedIndividuals ( BinaryString CHAR(64), CrossoverGroup TINYINT ); -- use the fittest half of the current population to evolve the next generation -- ensures that the low fitness individuals are eliminated (survival of the fittest FTW) INSERT INTO #SelectedIndividuals (BinaryString, CrossoverGroup) SELECT BinaryString, NTILE(2) OVER (ORDER BY (SELECT NEWID())) AS CrossoverGroup FROM ( SELECT TOP (50) PERCENT BinaryString FROM Individuals ORDER BY Fitness DESC ) sub; -- Join so that I have the two binary strings I'm crossing over in the same row. -- The function then does the cross over and the mutation applies to the result SELECT @CurrentGeneration AS Generation, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS IndividualID, dbo.ApplyMutation(co.BinaryString) AS NewIndividual FROM ( SELECT BinaryString, ROW_NUMBER() OVER (ORDER BY (SELECT NEWID())) AS PartnerNo FROM #SelectedIndividuals WHERE CrossoverGroup = 1 ) AS Group1 INNER JOIN ( SELECT BinaryString, ROW_NUMBER() OVER (ORDER BY (SELECT NEWID())) AS PartnerNo FROM #SelectedIndividuals WHERE CrossoverGroup = 2 ) AS Group2 ON Group1.PartnerNo = Group2.PartnerNo CROSS APPLY dbo.CrossOver(Group1.BinaryString, Group2.BinaryString) co; END; GO
That’s all that’s needed for the GA framework.
What’s needed for the solution is a loop to run through the generations one by one, a way to decode the individual returned by the Evolve procedure and some housekeeping.
First off, I need to fill the Numbers table.
INSERT INTO dbo.Numbers (Value, BinaryRepresentation) SELECT Value, dbo.IntegerToBinary(Value) FROM ( SELECT TOP (256) Row_Number() OVER (ORDER BY column_id)-1 AS Value FROM sys.columns ) sub; GO
The setup for the execution loop involves declaring a few variable, a temp table and setting up the initial generation (randomly).
SET NOCOUNT ON DECLARE @CurrentGeneration SMALLINT = 0; DECLARE @TotalGenerations SMALLINT = 50; DECLARE @GenerationSize SMALLINT = 64; -- keep to powers of 2, and everything will work... -- Create initial generation -- First the individuals INSERT INTO Individuals (Generation, IndividualID) SELECT TOP(@GenerationSize) @CurrentGeneration AS Generation, ROW_NUMBER() OVER (ORDER BY column_id) AS IndividualID FROM sys.columns; -- Now the details INSERT INTO IndividualDetails (Generation, IndividualID, ObjectNumber, Quantity) SELECT Generation, IndividualID, ObjectNumber, CEILING(RAND(CHECKSUM(NEWID()))*NumberAvailable/4) FROM Individuals CROSS JOIN ObjectStatistics; -- Initial generation statistics INSERT INTO GenerationStatistics (Generation, MaxFitness, AvgFitness) SELECT @CurrentGeneration, MAX(Fitness), AVG(Fitness) FROM Individuals; -- placeholder table for new generations CREATE TABLE #NewIndividuals ( Generation SMALLINT, IndividualID SmallINT, BinaryString CHAR(64) );
The processing loop itself is nothing complex. Insert the results of the Evolve procedure into a temp table. From the temp table insert into the Individuals, decode the binary string and insert into IndividualDetails, then delete the lower-fitness individuals so that the population remains at a constant size.
In this case I’ve decided just to run for a fixed number of iterations. Often genetic algorithms would terminate after a certain number of iterations or if the max fitness didn’t improve by more than a certain threshold. For simplicity I’ll leave that out. It’s trivial to implement if anyone wants to experiment further.
WHILE (@CurrentGeneration<@TotalGenerations) BEGIN SET @CurrentGeneration+=1; -- Generate the new generation INSERT INTO #NewIndividuals (Generation, IndividualID, BinaryString) EXEC Evolve @CurrentGeneration; -- store the new individuals INSERT INTO Individuals (Generation, IndividualID) SELECT Generation, IndividualID FROM #NewIndividuals; -- chop binary string, insert into details table. INSERT INTO IndividualDetails (Generation, IndividualID, ObjectNumber, Quantity) SELECT Generation, IndividualID, Allele, Value FROM ( SELECT Generation, IndividualID, Value+1 AS Allele, SUBSTRING(BinaryString, 1 + (Value * 8), 8 ) AS BinaryString FROM #NewIndividuals CROSS JOIN Numbers WHERE Value < 8 ) sub INNER JOIN Numbers ON sub.BinaryString = Numbers.BinaryRepresentation; -- delete all but the top (@GenerationSize) fittest individuals. Keep fixed population size. DELETE FROM IndividualDetails FROM IndividualDetails LEFT OUTER JOIN ( SELECT TOP (@GenerationSize) Generation, IndividualID FROM Individuals ORDER BY fitness DESC ) Chosen ON IndividualDetails.Generation = Chosen.Generation AND IndividualDetails.IndividualID = Chosen.IndividualID WHERE Chosen.IndividualID IS NULL; DELETE FROM Individuals WHERE BinaryString = ''; -- only possible when there are no details -- Compute current max fitness and current avg fitness INSERT INTO GenerationStatistics (Generation, MaxFitness, AvgFitness) SELECT @CurrentGeneration, MAX(Fitness), AVG(Fitness) FROM Individuals; -- cleanup TRUNCATE TABLE #NewIndividuals; END -- while DROP TABLE #NewIndividuals
And that’s pretty much all there is to it. All that’s left is to see how well the genetic algorithm did and how good the best solution it found really is.
Please note that this entire solution is driven by random numbers, if you run it you will get different results to what I did.
SELECT TOP (1) i.Generation, i.IndividualID, TotalMass, TotalVolume, TotalValue FROM dbo.Individuals AS i INNER JOIN ( SELECT Generation, IndividualID, SUM(Quantity*Value) AS TotalValue, SUM(Quantity*Mass) AS TotalMass, SUM(Quantity*Volume) AS TotalVolume FROM dbo.IndividualDetails d INNER JOIN dbo.ObjectStatistics AS os ON d.ObjectNumber = os.ObjectNumber GROUP BY Generation, IndividualID ) id ON i.Generation = id.Generation AND i.IndividualID = id.IndividualID ORDER BY Fitness DESC;
Not half bad.Is it the absolute best? Probably not. Genetic algorithms do not guarantee that they will find the best solution, just that they will find good ones
If we look at how the fitness (total value) improved over the generations it shows a massive initial increase in fitness then slow refinements. I could have run only 10 generations and still had a reasonably good result.
I think that’s enough playing with AI for now. While I doubt there’s any real practical use, I hope it was at least entertaining.
Full code: GeneticAlgorithms