Select qryA.columnA, qryA.columnB
From table as qryA
LEFT JOIN
(Select columnA
From table
where columnB = 'sec999') as qryB
ON qryA.columnA = qryB.columnA
Where qryB.columnA is null
Is how I would do it.
Or try the MINUS keyword (EXCEPT in most other dialects) instead of "left join <tableB>... where <tableB>.<key> is null" if your version of SQL supports it.
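Both approaches can be sanity-checked side by side. Here's a quick sketch using Python's sqlite3 (SQLite spells MINUS as EXCEPT; the table name `t` and the sample rows are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (columnA TEXT, columnB TEXT);
INSERT INTO t VALUES ('a1', 'sec001'), ('a1', 'sec999'),
                     ('a2', 'sec002'), ('a3', 'sec999');
""")

# Anti-join: keep rows whose columnA never appears alongside 'sec999'.
left_join = con.execute("""
    SELECT qryA.columnA, qryA.columnB
    FROM t AS qryA
    LEFT JOIN (SELECT columnA FROM t WHERE columnB = 'sec999') AS qryB
      ON qryA.columnA = qryB.columnA
    WHERE qryB.columnA IS NULL
""").fetchall()

# Set-difference version (EXCEPT here, MINUS on Oracle).
except_q = con.execute("""
    SELECT columnA FROM t
    EXCEPT
    SELECT columnA FROM t WHERE columnB = 'sec999'
""").fetchall()

print(left_join)  # [('a2', 'sec002')]
print(except_q)   # [('a2',)]
```

Note that EXCEPT/MINUS operates on whole-row sets and deduplicates, so it returns the distinct columnA values rather than the original rows.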
I'd personally do it like so (obviously syntax varies a little between products):
select columnA, columnB from table
where columnA not in (select distinct columnA from table where columnB = 'sec999')
That's pretty much equivalent and much easier to read.
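Against the same made-up sample table as above (four rows, two columnA values tainted by 'sec999'), the NOT IN version gives the same answer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (columnA TEXT, columnB TEXT);
INSERT INTO t VALUES ('a1', 'sec001'), ('a1', 'sec999'),
                     ('a2', 'sec002'), ('a3', 'sec999');
""")

# Exclude every row whose columnA ever appears with 'sec999'.
rows = con.execute("""
    SELECT columnA, columnB FROM t
    WHERE columnA NOT IN
      (SELECT DISTINCT columnA FROM t WHERE columnB = 'sec999')
""").fetchall()

print(rows)  # [('a2', 'sec002')]
```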
I have a habit of avoiding the IN syntax if the subquery could possibly return a large result. I found that in MSSQL 2000 it performed very slowly, but they did something in later versions so that there's very little difference in performance. (This is just an experience thing; I haven't actually read anything that says to use one over the other.)
NOT IN will work, but a correlated subquery using NOT EXISTS will often give you much better performance. Like so:
SELECT ColumnA FROM table t WHERE NOT EXISTS
(SELECT 1 FROM table WHERE ColumnA = t.ColumnA AND ColumnB = 'sec999');
Especially if you're working with a large data set, using a non-correlated subquery and a NOT IN clause often forces your database to test each value in the outer query against every value in the subquery's result set before it can positively say "No, this value is not in that set". Say your table had a million rows, and a quarter-million of them have "sec999" in ColumnB. With the NOT IN query, for every row in your table the query compares the ColumnA value to the set of 250,000 values returned by the subquery. When ColumnA does exist in the subquery set, it will stop the comparisons once it's found a match, but in many cases it will have to make all 250,000 comparisons before it can affirm that ColumnA doesn't exist in the subquery set. Worst case scenario in this example would be upwards of 200 billion compare operations, I think.
With the correlated subquery, your RDBMS can pull out that arbitrary "1" value quickly, especially if you're working with indexed columns. So using the million-row table example above, instead of having to make up to 250,000 comparisons for each row in the table, the RDBMS just has to run the correlated subquery once per row and see if the result set is empty or not. Exactly how big your table needs to be before you'll really notice the performance improvement is going to depend on your DB server's I/O, CPU and memory speed, and probably your flavor of RDBMS. On the 9i, 10g or 11g Oracle systems I work with, using relatively recent hardware, I find on any table with more than a couple thousand rows a NOT EXISTS query will be noticeably faster than a NOT IN equivalent.
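The NOT EXISTS form can be checked against the same made-up sample data used above (whether it actually beats NOT IN will depend on your optimizer; on tiny in-memory SQLite data the point is just that the results agree):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (columnA TEXT, columnB TEXT);
INSERT INTO t VALUES ('a1', 'sec001'), ('a1', 'sec999'),
                     ('a2', 'sec002'), ('a3', 'sec999');
""")

# Correlated subquery: for each outer row, probe for a 'sec999'
# row sharing its columnA; keep the row only if none exists.
not_exists = con.execute("""
    SELECT o.columnA, o.columnB FROM t AS o
    WHERE NOT EXISTS
      (SELECT 1 FROM t
       WHERE t.columnA = o.columnA AND t.columnB = 'sec999')
""").fetchall()

print(not_exists)  # [('a2', 'sec002')]
```

With an index on (columnA, columnB), each probe becomes an index seek, which is where the per-row cost savings described above come from.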
I thank you for this information and will try to remember it when working on my database.
Select columnA, columnB
from table
where columnB != 'sec999'
? I thought != exists.
e: oooh I get what you're saying now.