Recently I needed to process a bunch of text to find about 1300 tokens(needles) amongst over a million text strings(haystack). One of the searches involved checking for token instances at the start of each text string. So I wrote a script that cycled through each needle and at the core used a simple function that returned values whenever the token started the text String. Knowing my needles were always going to be at the start of each string, I thought I’d give POSITION(substr IN str) a spin.
select id, textString, needle from HayStack where POSITION(needle IN textString) = 1;
This took over 43 minutes to process the whole dataset! It didn’t need to be time critical, but 43 minutes was a PITA when trying to refine the results through trial and error filters. So I tried the same thing using SUBSTRING().
select id, textString, needle from HayStack where SUBSTRING(textString,1,CHAR_LENGTH(needle)) = needle;
This time the same dataset was now processed in just over 17 minutes! That’s an impressive 60% drop in execution time. I hard coded a value in needle, just to see how a single cycle would run: POSITION()= 3639 ms & SUBSTRING() = 1864 ms. That’s still about 50% improvement.
The SUBSTRING() query might look less efficient on the surface – but the results make sense when you consider what’s happening. SUBSTRING() only checks the start of each textString and moves on, whilst POSITION() actually looks over the whole of every textString (for any instance) but the query ultimately only returns those where the token is at the start (position() = 1).
I’ve still got another task looking for tokens anywhere in the textString, for which I think POSITION(or LOCATE) is going to be unavoidable. But, if you know where the token should be located – SUBSTRING() is the way to go.