I've been trying to speed up a set of queries in SQL Server 2005 to no avail.
The script I've written combines the use of both static tables which contain many rows of relational data and temporary tables which are created/destroyed within a single execution of the script.
For statements that only deal with temporary tables the advisor says "Statement does not reference any tables" (which makes sense I suppose). The advisor completes with "success" on all steps but unfortunately reports an "estimated improvement" of 0%.
I tried passing the script as a workload, I tried creating a trace file, both yield the same result.
It may be doing its job the way it is supposed to; that is, there may not be any way to help the performance.
However, if there is anything to suggest I try, please let me know.
On the other hand...
If the dta is in fact doing its job right, then perhaps the statement which I believe gives me the biggest hit in performance can be optimized:
(The static portion of the database is pretty simple; it contains relationships between Words, Titles, Articles, and Categories. The ArtCat table stores the relationship between Article IDs and Category IDs, TitWord table stores the relationship between Title IDs and Word IDs, etc... The only temporary table in use is #QueryWords which contains WIDs of interest.)
WHILE @i > 1
BEGIN
INSERT INTO #Quads
SELECT DISTINCT #QueryWords.WID, TACs.TID, TACs.AID, TACs.CID FROM
(SELECT TAs.TID, TAs.AID, ArtCat.CID FROM
(SELECT Ts.TID, TA.AID FROM
(SELECT DISTINCT TitWord.TID FROM TitWord INNER JOIN #QueryWords ON TitWord.WID = #QueryWords.WID) AS Ts
INNER JOIN TitArt AS TA ON Ts.TID = TA.TID
WHERE
(
(SELECT COUNT(*) FROM TitWord WHERE TitWord.TID = Ts.TID AND TitWord.WID IN (SELECT WID FROM #QueryWords)) +
(SELECT COUNT(*) FROM ArtWord WHERE ArtWord.AID = TA.AID AND ArtWord.WID IN (SELECT WID FROM #QueryWords)
AND ArtWord.WID NOT IN
(SELECT DISTINCT TW.WID FROM TitWord AS TW INNER JOIN #QueryWords AS QW ON TW.WID = QW.WID WHERE TW.TID = Ts.TID)
)
) = @i
) AS TAs
INNER JOIN ArtCat ON TAs.AID = ArtCat.AID
) AS TACs
CROSS JOIN #QueryWords;
IF @@ROWCOUNT > 0
BREAK
SET @i = @i - 1;
END
Any help is much appreciated.
Posts
Someone after me might reply with a superstisious premature optimization about changing count(*) to count(id) or something. The trick to performance optimization is finding out the single change that will change your query by the greatest order of magnitude, not shaving .0001ms off a 10 second query.
Things that pop out at me are: That loop, how big is @i? That cross join, you do understand what a cross join is, right? I could explain why that is most likely bad, but you would be better off googling it and finding out yourself.
Edit: The more I look at this, the more curious I am at what it is trying to accomplish. It is very verbose for such a simple relationship, but I don't know what to think without actually knowing what this procedure is trying to accomplish. Can you tell me what the goal of this code is? There may a much simpler solution.
@i is typically between 1 and 5... sometimes as large as 7; not very large.
The cross join is a necessary evil. For each intermediate T-A-C triple/row, a row exists in the final result with the triple + a query word for all query words (best explanation of my understanding of the definition of a cross join).
Also, I began to wonder if a UNION would suffice; instead of counting those in TitWord then counting those in ArtWord that aren't in TitWord, count those in the union of the two. Thinking about it now, the amount of processing might be the similar. I don't know how to trust the optimizer yet. Either way, I'll try that too.
Edit: the goal of the code is to get T-A pairs which encompass @i distinct WIDs in #QueryWords (through the relationships TitWord and ArtWord), create T-A-C triples through the ArtCat relationship, then end with W-T-A-C quads by simply cross-joining.
It is quite verbose. I like explicit joins over the implicit joins (especially for queries this large) in order to spare people in the future (or now) from trying to decipher it. I inherited code that contained implicit joins and that changed my whole perspective. Indenting also helps.
SELECT COUNT(*) FROM ArtWord WHERE ArtWord.AID = TA.AID AND ArtWord.WID IN (SELECT WID FROM #QueryWords)
AND ArtWord.WID NOT IN
(SELECT DISTINCT TW.WID FROM TitWord AS TW INNER JOIN #QueryWords AS QW ON TW.WID = QW.WID WHERE TW.TID = Ts.TID)
as a LEFT JOIN WHERE ____ IS NULL? Try that, union to the other Count(), and see what kind of difference that makes.
Edit: Also, I don't know if Sql Server makes this optimization itself in the back-end, but are all of the inner DISTINCTs really necessary?
Edit 2: Is there some way to bring everything in the WHERE (_____) = @i up from a WHERE clause into a join, and carry it up through the nested levels and don't filter @i = until the top level? Depending on how many Titles and TitArt you have, that may help the indexes do their job better.
Unfortunately, I don't know how a left-join would work in this case.
I think I understand where you're coming from though: I've been joining on #QueryWords up until the count statement, where I for some reason say "WHERE WID IN (SELECT WID FROM #QueryWords" -- this accomplishes the same task. This might be because the TitWord and ArtWord tables are 100 000s of rows (maybe even a million).
@ROWCOUNT > 0 BREAK.
If there were some way to compare the unique set of WIDs from both TitWord and ArtWord (where the TID = Ts.TID and the AID = TA.AID, remember this is an insert) in less than 6 selects + 1 join, I'm sure there would be a nice improvement.
A union would get rid of the INNER JOIN and I think a single SELECT statement, but the syntax is unacceptable to accomplish this:
is incorrect syntax.
Feel free to let me know what you think about my above rambling.
Edit: I need the result of the SELECT to be distinct. I'm under the impression, in SQL Server, if there are two rows with the same A column, a "SELECT A FROM..." will produce duplicates (let me know if I'm wrong on this). Whether it is true or not, I'm explicit on my need for it to be distinct.
select distinct A from ....
http://www.sql-tutorial.com/sql-distinct-sql-tutorial/
I'm pretty sure the results would be the same if you left the distinct out until the top level. WHERE 3 NOT IN(1,2,3) is the same as WHERE 3 NOT IN(1,2,3,3). I seem to recall distinct forcing additional processing time, but in this case it may or may not matter. Everything really just relies on what your data is like.