I have an SQL database with MS SQL Server 2005. One of the tables has a field that is ntext, and contains a block of text (typically thousands of words, with some words occurring several times).
I would like to run a query to return the rows of the table where the ntext text contains at least one occurrence of a specific word. Is that possible? If so, how?
I think what you're looking for is some implementation of regular expressions in MS SQL Server 2005. I think in other versions of SQL it's given for free with the keyword "like", but it looks like in this version you have to implement it yourself.
If you're having to regex your data, and performance is an issue, you have to think about whether your schema is correct at all. Searching large text fields usually indicates you should have broken up that information into a separate column, but it depends on what you're trying to achieve.
Well my DB will work in two stages. First I have to build the DB from the original data, then once it's built users can run queries.
Right now I am splitting the data into separate columns and rows, and using two extra tables. The problem is that splitting up a block of text thousands of words long and running individual processing for each word is extremely slow. It can take me five to ten minutes to process one entry in my original data, and I have millions of entries like that. Clearly this is unacceptable.
So I'm looking to speed up the DB creation stage. Searching the text fields for individual words is something that will only occur in the second stage, during the user queries. And it will typically only be done for a handful of words, not for thousands. So I'm ok with slowing down the user queries a little if I can get a major performance gain in the DB creation stage.
And like I said, I need every bit I can get in the DB creation stage to create it in a reasonable time.
MS SQL 2005 has the LIKE operator, you could just use that.
Assuming there's a common ID and these blocks of text are related...(Your page or program puts the rows together, but in the background the data is chunked) you could use the LIKE operator to search the text and a DISTINCT on their common ID to make sure it only grabs one hit per "record". That would be how you'd handle it if it was done as lewisham's suggesting. This has its ups and downs though, because if your text is split and you're looking for "apple butt" and apple is in one row and butt is in another, you wouldn't get a hit.
If each row represents a unique text whatever, however... then no matter how many times your keyword appears in your single field, it will only show up once, since the LIKE operator just runs through that field/row. Unless there was a common ID involved in the case where you've split your text fields.
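A rough sketch of that pattern (the table and column names here are just placeholders, not an actual schema from this thread):

-- One hit per document, no matter how many times the keyword appears or how
-- many chunks the text is split across. Note that the leading % means LIKE
-- can't use a normal index, so this still scans the text.
SELECT DISTINCT DocumentID
FROM DocumentChunks
WHERE ChunkText LIKE '%keyword%';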
edit: Wow I shouldn't game while posting here. What exactly are you trying to store in your database? What is the format now?
tastydonuts on
Oh, I didn't know the DISTINCT one. Thanks, that will come in handy. I'm basically learning database commands on the fly here.
So to answer your question, my DB (or at least the table I'm talking about here) is being built from a few million text documents of a few hundred to a few thousand words each. Each row will be one document. There will be a column for a unique ID number (integer), one for the title of the document (ntext), one for the text of the document (that's the ntext field I'm talking about in my OP) and one for a list of related documents' ID numbers (ntext, I'll be storing them as a long text string with spaces in-between).
That's three ntext fields per entry for millions of entries in this table, plus those in other tables, so if there's some improvement to be had by using varchar(max) instead... even a few fractions of a second for each entry I add to the DB, multiplied by millions, will be significant.
edit: Wow I shouldn't game while posting here. What exactly are you trying to store in your database? What is the format now?
This.
What is "individual processing" for each word? What are you doing?
Without all the background, any recommendations will be complete shots in the dark with a database as large as yours.
Alright, what I mean by "individual processing". Right now when a document is added into the DB, I split it into individual words. For each word, I check if it is in the Words table, and if it's not I add it. Then I check if it is the first occurrence of that word in the document, and if so I increment a counter of documents featuring this word. Then I add the complete text into an ntext field in the Documents table, so that at query time I can compute the number of occurrences of a queried word in the document.
The idea is that a word's importance in a document can be computed in terms of (1) how many documents in the collection it appears in, and (2) how frequently it appears in that specific document. That's the tf-idf metric, for those who are interested in learning more. I use this to rank the documents and discover the most relevant document given the words in a user's query.
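For reference, one standard formulation of that weight (the exact variant used in this project may differ):

\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log \frac{N}{\mathrm{df}(w)}

where tf(w, d) is the number of occurrences of word w in document d, df(w) is the number of documents containing w, and N is the total number of documents in the collection.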
Anyway, I need to get those two statistics, and right now I'm doing it word by word when I build the DB, which slows down DB creation considerably. But using the LIKE and DISTINCT commands, I could just store the text in an ntext field and then compute the statistics at query time for the words the user cares about.
Hm. I think it'd be a good idea to design this table to incorporate more criteria to search across than just the actual body of the document for keywords, and have those as constraints as well, because there's not much that can be done to optimize what you're wanting otherwise. Assuming one million records with an average of 1,000 words each, the query would have to go through a billion words per run, every run.
Creation date/time, author, etc. would all be useful in making searching easier and more relevant for your users. If your documents span text files created in '95 and somebody wants something that was done last week, at the end-user level they'd have to sift through results for that whole range, and your query would have looked through all of it for nothing. The cost is going to be excessively high for every query run, and if you have multiple users then it'll only get uglier.
Edit: Must post faster.
Actually, it could be a lot easier for you to grab an existing natural language search software... just visualizing the complexity of the query, along with the maintenance and space for upkeep, this doesn't seem like a viable solution. Is this something for work?
tastydonuts on
May I ask who this is for; and how mission-critical this is? If it's a university library or something where it is reasonably important, you need more brains to sit down and work this out. Certainly, if you've not come across the "distinct" keyword, you're out of your depth (I'm not saying you are incompetent, just that you really need some more knowledge before you can begin to wrestle with a problem like this).
My assumption is that your database is built to be queried, so you have to think of record creation as cheap, and searching expensive. Don't worry about how long it takes to enter a record (within reason). EDIT: Cut out some nonsense because I was reading things wrong. You're right to be trying to pre-process as much as possible.
In addition, the DISTINCT keyword is something that will get you out of a jam; it shouldn't be relied on for huge queries where response time is important. I've used DISTINCT in overnight processing jobs, but I wouldn't use it here. DISTINCT will just get the entire dataset that the WHERE matched, and then match every record with everything else. It's an expensive operation, and most DBAs consider its use either a failure of the programmer to formulate a decent query, or a failure of the schema if its use is unavoidable.
Lewisham on
Smasher
Instead of storing the related documents' IDs in an ntext field you should create a separate table with the IDs of two related documents being the foreign and primary keys of that table. Also, how are you defining related documents?
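A minimal sketch of that link table (the Documents and ID names are assumptions, since the actual schema isn't shown at this point):

-- One row per (document, related document) pair instead of a space-separated ID string.
CREATE TABLE RelatedDocuments (
    DocumentID        int NOT NULL,
    RelatedDocumentID int NOT NULL,
    PRIMARY KEY (DocumentID, RelatedDocumentID),
    FOREIGN KEY (DocumentID) REFERENCES Documents (ID),
    FOREIGN KEY (RelatedDocumentID) REFERENCES Documents (ID)
);

Pulling everything related to a given document then becomes an indexed join instead of string parsing.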
Something's very wrong if it's taking five to ten minutes to process one document, assuming that's what you meant by 'entry'. What are you indexing your words table by? Typically you want the primary key to be an integer, but in this case I think it might be faster to index it by the word itself. I suggest you also explicitly exclude common words such as 'a', 'the', 'and', and the like; those could slow you down considerably while adding no real value to your database.
MS SQL 2005 has a thing called full-text indexing. You have to enable it and set a column to use it, but once you do, SQL Server will index the words in the said columns and make searching these columns much faster. I would definitely check your local SQL Server help (or Google) for instructions on how to do this... it may speed things up quite a bit.
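As a sketch, once the index is in place a keyword search can go through CONTAINS instead of LIKE (table and column names are placeholders):

-- CONTAINS uses the full-text index, so it avoids the row-by-row scan
-- that a leading-wildcard LIKE forces.
SELECT ID
FROM Documents
WHERE CONTAINS(DocumentText, 'keyword');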
MurphysParadox on
For each word, I check if it is in the Words table, and if it's not I add it. Then I check if it is the first occurrence of that word in the document, and if so I increment a counter of documents featuring this word.
This seems redundant to me. If the word is not yet in the Words table and you add it, it can be assumed that it's the first occurrence of that word in this document. If it wasn't the first occurrence, it would already have been added to the Words table previously.
Anyway, I need to get those two statistics, and right now I'm doing it word by word when I build the DB, which slows down DB creation considerably. But using the LIKE and DISTINCT commands, I could just store the text in an ntext field and then compute the statistics at query time for the words the user cares about.
This was already mentioned, but is adding records to the database going to be more frequent or is searching going to be more frequent? I assume searching, in which case you care less about how long it takes to add a record. I think searching the text at query time to create these metrics is going to be extremely slow if the user searches for a relatively common word, because you have to first obtain all records with that word, and then search through the full text of each returned record to get a count of the word.
incorporate more criteria to search across than just the actual body of the document for keywords
Unfortunately the search is only keyword-based. Other information, like document author and date, are not relevant in my problem.
Actually, it could be a lot easier for you to grab an existing natural language search software... just visualizing the complexity of the query, along with the maintenance and space for upkeep, this doesn't seem like a viable solution.
Do you mean one that goes on top of SQL server, or...? Can you suggest one?
May I ask who this is for; and how mission-critical this is?
It's for a research project, and it's important.
if you've not come across the "distinct" keyword, you're out of your depth
You mean, if the user is querying for a keyword that doesn't occur anywhere in the DB? Then the keyword is simply ignored. There's not much else I can do there.
Don't worry about how long it takes to enter a record (within reason). You're right to be trying to pre-process as much as possible.
Well that's my problem. Right now it takes an unreasonable time to enter a record into the DB, and I need to cut down on it, which is why I'm looking to remove as much of the preprocessing as possible.
Also, how are you defining related documents?
Some documents refer to other documents in the DB. For each document I have a list of such related documents. I add it to the document's entry as a list of referred documents' ID numbers.
What are you indexing your words table by? Typically you want the primary key to be an integer, but in this case I think it might be faster to index it by the word itself.
I have both right now, actually. Each entry in each table has an integer ID number, and in the Words table there is an ntext field for the word.
I suggest you also explicitly exclude common words such as 'a', 'the', 'and', and the like; those could slow you down considerably while adding no real value to your database.
Indeed, I remove them. I also remove all punctuation marks, and cut more complex words down to their stems. This has all already been done before the DB creation stage, so it doesn't count in the creation time or in the preprocessing I'm trying to reduce.
MS SQL 2005 has a thing called full-text indexing. You have to enable it and set a column to use it, but once you do, SQL Server will index the words in the said columns and make searching these columns much faster. I would definitely check your local SQL Server help (or Google) for instructions on how to do this... it may speed things up quite a bit.
Interesting. I'll look into that this afternoon. Thanks.
This was already mentioned, but is adding records to the database going to be more frequent or is searching going to be more frequent? I assume searching, in which case you care less about how long it takes to add a record. I think searching the text at query time to create these metrics is going to be extremely slow if the user searches for a relatively common word, because you have to first obtain all records with that word, and then search through the full text of each returned record to get a count of the word.
Unfortunately, how long it takes to enter a record is a major problem here. Let's round it to 2,000,000 records. If it takes me one minute per record (and right now it takes me a lot longer than that) then it'll take me over three years to build the DB.
If I can cut it down to one second per entry creation (take one month to create the DB) and the flip side is slightly slower querying, I can live with that. At least I'll have a DB to query through.
I wonder if it wouldn't be faster to write something in C++ or C# to do the word parsing and have that run database calls to do the inserts/updates. C# especially has a good bit of word processing/regex methods already in the libraries and I can't imagine it would be slower than SQL (which is not known for its speed).
Also, use varchar(max). Text and NText have been deprecated and should no longer be used according to Microsoft.
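If the tables already exist, the columns can be converted in place; a sketch with placeholder names (the conversion rewrites the stored data, so expect it to take a while on a big table):

-- Replace a deprecated ntext column with nvarchar(max).
ALTER TABLE Documents ALTER COLUMN DocumentText nvarchar(max) NULL;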
MurphysParadox on
Unfortunately, how long it takes to enter a record is a major problem here. Let's round it to 2,000,000 records. If it takes me one minute per record (and right now it takes me a lot longer than that) then it'll take me over three years to build the DB.
If I can cut it down to one second per entry creation (take one month to create the DB) and the flip side is slightly slower querying, I can live with that. At least I'll have a DB to query through.
Well, I know for a situation where we wanted to have some of our users use natural language searching we picked up ISYS Search Software. It has the ability to query SQL databases, but we don't use it for that (not a SQL db, text is stored line per record - not my design, it's old). So ours is set up to work off an extraction of relevant data.
It did indexing much faster than what you've got going now... I don't recall the time on the initial builds but I'm thinking it was less than an hour on my work PC, though my data was broken down into chunks (by year), and was slightly smaller (less than a million files). You wouldn't have to have a DB for your solution... it creates an index for it. But I'm assuming that your research doesn't require this information to be converted and stored into a DB. However, it could be pricey, as it's an enterprise solution... there could be some smaller ones out there but I couldn't say offhand.
That said... just looking at your project's data entry portion, what is the mechanism by which you're importing this data? Have you written a script or program to do this?
tastydonuts on
Something else of note: using indexes. Any field used in a WHERE clause should get some kind of index. The table with the word count should be indexed so the word whose value needs to be incremented can be located in much, much less than O(n) time (which gets worse as the word count increases, of course). This will make inserts go faster as you put more entries in.
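For example, on the word table that would look something like the sketch below (this assumes the word column is a bounded nvarchar, since a regular index can't be built on an ntext column):

-- Lets the per-word lookup and UPDATE seek straight to the row instead of scanning the table.
CREATE INDEX IX_wordTable_word ON wordTable (word);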
Another option would be to hash each word (there are hash commands in SQL) so that the 'value' used is actually a number. Use that as the primary key of the word table and the locating would definitely go faster. Unfortunately I don't know if it would speed up the inserts as much as it would speed up the queries.
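SQL Server 2005 does ship hash functions; a quick sketch (hashes can collide, so the word itself should still be compared whenever a hash matches):

-- CHECKSUM returns an int, HASHBYTES a varbinary (16 bytes for MD5).
SELECT CHECKSUM(N'example')         AS WordChecksum,
       HASHBYTES('MD5', N'example') AS WordMD5;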
Now, as for your actual problem, it seems you've got a hell of a complexity level. Let me see if I understand:
For each word in document
if word exists in wordTable == false
insert word into wordTable //count defaulting to 0
end if
if document.FirstIndexOf(word) == currentIndex //first occurrence of the word
update wordtable set count += 1 where wordtable.word = word
end if
end for
insert document into documentTable
Ok, this has some nasty problems. Your first loop is going to execute n times (n = number of words). The if statement will, through SQL behind-the-scenes work, execute a/m times (m = number of rows in wordTable, a <= m; a is the position in the table for the word). The second if will execute some b/n times (b <= n; b is the position in the document for the word) depending on where the word is located. Lastly, you get another a/m for the update because it has to go find the row again.
So your total execution loop is something like n * (a/m)^2 * (b/n), which should give you a hell of a lot of execution since there is no shortcutting here. If wordTable.word is not the index, SQL will look at every single row, an arbitrary number of steps, until it finds the word. If it is indexed, then you'd at least ensure the search is based off some binary search tree or the like that will greatly increase the search speed. This is where DISTINCT comes into play. It ensures a/m instead of m because it knows to stop at the first find... this doesn't work so well for the update as it will go running down the table thinking that there may be more than one entry UNLESS you make the word column a primary key (no duplicates allowed) and indexed so it knows to never bother trying to find a second entry (this also means DISTINCT is not necessary). If no DISTINCT clauses, primary key restrictions, or indexing are used here, a = m for each search because it will check every row to see if it needs to work on it (since it won't know that the field would be unique).
Ok, so here's my suggestion for reducing time spent in the execution loop:
declare #local table(wordField varchar(max))
for each word in document
if not exists(select * from #local where wordField = word) //true only the first time this word appears in the document
insert into #local values(word)
update wordTable
set count = count + 1
where wordTable.Word = word
if @@rowcount == 0 //contains number of rows modified by the last instruction. 0 means it isn't present.
insert into wordTable values(word) //count should be defaulted to 1 so inserts already have 1.
end if
end if
end for
insert into documentTable values(document) //and other values as necessary of course.
You are now looking at n * (a/m) * (b/q), which is less than the original n * (a/m)^2 * (b/n). Do also keep in mind what will happen as the project runs. We know N to be relatively constant and Q to be the relative variety of words in a document (also assumed to be relatively constant). As each document is processed, Q will increase (the derivative will approach zero as fewer new words are found in each document). M will also increase due to it growing in size, but depending on the variety of your words across all documents, M's growth should drop down to zero well before you run out of documents. We are also guaranteed that (b/q) < ((a/m) * (b/n)) since q <= n due to it being a subset and m will quickly grow to be > n. Because q <= n you can also be sure that the first b < second b since the top <= bottom of both statements... blah blah blah, just trust me on this, heh.
If you index wordTable, you will reduce b/m down to very near 1. Unsurprisingly, that will ensure the second method's complexity to simply be n * (a/q) or, in other words, the size of the document * the variety in the document. Even if you don't, you're still talking n * (a/q) * m vs n * m^2 * (b/n)... which is still better.
A small modification that may speed it up is as follows:
declare #local table(wordField varchar(max))
declare @word varchar(max)
for each word in document
if not exists(select * from #local where wordField = word)
insert into #local values(word)
end if
end for
DECLARE cr CURSOR FAST_FORWARD FOR select wordField from #local
OPEN cr
FETCH NEXT FROM cr INTO @word
WHILE @@FETCH_STATUS = 0
BEGIN
update wordTable
set count = count + 1
where wordTable.word = @word
if @@rowcount == 0 //the update touched no rows, so the word isn't in wordTable yet
insert into wordTable values(@word) //count should have a default 1 so new records start with count 1.
end if
FETCH NEXT FROM cr INTO @word
END
CLOSE cr
insert into documentTable values(document)
This will load each distinct word into #local and then, using a cursor, iterate through #local's word list. The time cost here comes from SQL managing the cursor... which isn't known for blazing speed. I would give both a shot and see how it goes.
As for query time, it really depends on what you want to do. If you make a lookup table of WordID, DocumentID, then you could make the query Very Fast. Just find the WordID and return every DocumentID (one-up number describing each document) with the given WordID (another identity column one-up number deal). Join this to the document table to pull the document text out. It would be easy to do, just use the update ... output into command to get the WordID of existing words and scope_identity() for new ones. For example (using the first of my examples to make life easier, though it is functionally equivalent to the second example):
declare #local table(wordField varchar(max))
declare @wordID int
declare @docID int
insert into documentTable values(document)
set @docID = scope_identity() //gets the identity column value of the last insert in the scope.
for each word in document
if not exists(select * from #local where wordField = word)
insert into #local values(word)
update wordTable
set count = count + 1
output INSERTED.WordID into
@wordID
where wordTable.word = word
if @@rowcount == 0
insert into wordTable values(word)
set @wordID = scope_identity()
end if
insert into wordDocLookupTable values(@wordID, @docID)
end if
end for
My one uncertainty is if you have to output into a table variable or not. If so, make another temporary local table with one column and add 'SELECT @wordID = wordID from #localWordTable' before the insert into the lookup table.
This way your query to find all documents with word X and return the text looks like this:
select docTable.document
from wordTable
inner join wordDocLookupTable on
wordTable.wordID = wordDocLookupTable.wordID
inner join docTable on
wordDocLookupTable.docID = docTable.docID
where wordTable.word = @word
No more likes or such. Anyway, I hope this helps you in some way or another. At the least, I enjoyed writing it, heh.
MurphysParadox on
Just a thought: are you allowing searches for arbitrary strings, or are you just doing a keywording type thing?
It might be completely out of scope for what you're doing, but you might want to think about using some sort of Bayes / neural net hybrid to reduce all of these database hits. Run a quick Bayes filter over the text to get 10 keywords, and then just store those.
Keep a persistent hash in memory to play the middleman between your database and your frontend.
Also a thought: if you're allowing arbitrary strings, then use some sort of neural net, or even a thesaurus, to see which keywords are 'close to' the arbitrary word, and then just select on those keywords for a search.
Should cut down your total searching time logarithmically.
Sorry if something like that's not an option for you.
That said... just looking at your project's data entry portion, what is the mechanism by which you're importing this data? Have you written a script or program to do this?
The data is in a text file. I've already pre-processed it to structure it, so I know the first line is the document's title, then there's a number of lines for the referenced documents, then the document text in one line. All the lines are labelled, so I know which is which.
I've built a C# program to read the file line by line. It stores the title, list of referenced documents, and the text, and when it has a complete entry it adds it into the database.
Richy on
Smasher
Is it possible for you to post the code you're currently using? I can't think of any reason it should be taking as long as it currently is per entry, which leads me to think there's some sort of bug or redundancy in your code.
Given that it is C#, you should just iterate through every word and shove each into a locally defined Dictionary<string, int>, using if (!dictionary.ContainsKey(word)) dictionary.Add(word, 1); else dictionary[word]++;. This should be relatively fast, as would dumping this dictionary to a database table. Now, the one major slowdown here would be when the program's memory footprint surpasses the RAM... then life sucks really really hard as page files are employed and everything becomes bound by hard drive I/O.
You may want to watch your RAM usage while running the program and see if it ever peaks. You'll know it has broken the limit when CPU drops from 99% to 3% because the page file I/O is orders of magnitude slower than anything your CPU can do. The solution would be better memory management... which is hard in C# due to its automatic garbage collection (but not impossible).
Another question is whether these files are divided in such a way as to allow multiple versions of the program to run against, say, different folders. Shove it on a few machines and run them in concurrence. SQL Server will take care of most necessary deconflictions... just make sure your insert command can also handle the case when it finds the word already there (this would be done via an insert trigger).
MurphysParadox on
Well it's a lot of code in a lot of functions. I'll try to explain it, and hopefully you guys can make sense of it.
This is the main function to read through the text file, get the titles, related documents (which are called "categories" here) and the document text. The line marked with a comment below is the one where I insert the text in the DB, and that's the one that slows down everything.
using (System.IO.StreamReader srWikiReader = System.IO.File.OpenText(sArticleFile))
{
string sLine = "";
string sTitle = "";
string sCategories = "";
string sContent = "";
int iCounter = 0;
//Read the file line by line
while ((sLine = srWikiReader.ReadLine()) != null)
{
sLine = sLine.Trim();
if (sLine.Length > 9)
{
if (sLine.Substring(0, 7) == "<title>")
{
sTitle = sLine.Substring(7, sLine.Length - 15);
sCategories = "";
sContent = "";
iCounter++;
}
else if (sLine.Substring(0, 9) == "Category:")
{ sCategories += (sLine.Substring(9, sLine.Length - 9) + "|"); }
else
{ sContent = sLine; }
}
else
{ sContent = sLine; }
if ((sTitle != "") && (sCategories != "") && (sContent != ""))
{
if (iCounter > iPosition)
{
//Add the title into the database (returns the title's ID number)
int iArticleID = RichyDB.AddTitleToDB(sTitle);
//Add the article into the database and link to title (returns the article's unique name)
string sArticleTitle = RichyDB.AddArticleToDB(sTitle);
//Link each category to the article (creates the category if it doesn't exist, and also adds the article to the vocabulary)
foreach (string sCat in sCategories.Split('|'))
{
if (sCat != "")
{ RichyDB.AddCategoryToArticle(sCat, sArticleTitle, sTitle); }
}
//Add the text to the article
RichyDB.AddTextToArticle(sContent, sArticleTitle); // <-- marked line: this call is the bottleneck
//Update resume file
File.WriteAllText(sResumePosition, "articles\n" + iCounter.ToString());
}
//Clear the variables
sTitle = "";
sCategories = "";
sContent = "";
}//Add to DB
}//Read the file line by line
}//Using srWikiReader
//To insure we go to the next block
sFile = "redirects";
iPosition = 0;
}//If
So here is the AddTextToArticle() function. As you can see, it checks each word of the document one by one, and calls the AddWordToDB function for each (in the marked section).
public static void AddTextToArticle(string sArticleText, string sArticleTitle)
{
int iWordCount = 0;
//Split string into words
char[] delimit = new char[] { ' ' };
foreach (string sWord in sArticleText.Split(delimit)) // marked section: several DB calls per word
{
//Add each word to the DB and keep track of the number
AddWordToDB(sWord);
iWordCount++;
//Increment the Temp counter in preparation for the Aw (number of articles with that word)
using (SqlCommand DBCmd = new SqlCommand("UPDATE WikiWords SET Temp = Temp + 1 WHERE Word = '" + sWord + "'", accessConnection))
{ DBCmd.ExecuteNonQuery(); }
}
//Increment the Aw (number of articles with that word)
using (SqlCommand DBCmd = new SqlCommand("UPDATE WikiWords SET Aw = Aw + 1 WHERE Temp > 0", accessConnection))
{ DBCmd.ExecuteNonQuery(); }
//Reset the Temp variable
using (SqlCommand DBCmd = new SqlCommand("UPDATE WikiWords SET Temp = 0 WHERE Temp <> 0", accessConnection))
{ DBCmd.ExecuteNonQuery(); }
//Update the word count of the article
using (SqlCommand DBCmd = new SqlCommand("UPDATE Articles SET La = " + iWordCount.ToString() + " WHERE ArticleName = '" + sArticleTitle + "'", accessConnection))
{ DBCmd.ExecuteNonQuery(); }
//Add the text to the article
using (SqlCommand DBCmd = new SqlCommand("UPDATE Articles SET ArticleText = '" + sArticleText + "' WHERE ArticleName = '" + sArticleTitle + "'", accessConnection))
{ DBCmd.ExecuteNonQuery(); }
}
The function called is AddWordToDB, which is here. It doesn't seem so bad. I mean, yeah, there's a while in it (the marked loop), but it's over the number of times a word is in the DB, which can only be 0 or 1, so it's no big deal.
//Add a word to the DB
public static void AddWordToDB(string sWord)
{
//Check if the word is already there
int iWordID = 0;
int iCounter = 0;
using (SqlCommand DBCmd = new SqlCommand("SELECT ID FROM WikiWords WHERE Word = '" + sWord + "'", accessConnection))
{
using (SqlDataReader reader = DBCmd.ExecuteReader())
{
while (reader.Read()) // marked loop
{
iCounter++;
iWordID = Convert.ToInt32(reader.GetValue(0));
}
}
}
//If the word isn't there, add it
if (iCounter == 0)
{
//The next available word ID number
using (SqlCommand DBCmd = new SqlCommand("SELECT CWord FROM Stats WHERE ID = 1", accessConnection))
{ iWordID = Convert.ToInt32(DBCmd.ExecuteScalar()); }
iWordID++;
//Add the word
using (SqlCommand DBCmd = new SqlCommand("INSERT INTO WikiWords values (" + iWordID.ToString() + ",'" + sWord + "',0,0,0,'',0)", accessConnection))
{ DBCmd.ExecuteNonQuery(); }
//Update the word ID numbers
using (SqlCommand DBCmd = new SqlCommand("UPDATE Stats SET CWord = CWord + 1 WHERE ID = 1", accessConnection))
{ DBCmd.ExecuteNonQuery(); }
}
}
So I believe the problem is the foreach over each word in the document, which is the marked part in the second function. That's the only one I can see that would slow things down.
Well I've removed the marked loop from my code, as well as the code that needed it. I've also changed all the ntext fields to nvarchar(max). And I've abandoned the Words table entirely. DB creation is now a lot faster (like, less than a second per entry).
I make queries using the LIKE command and it's OK; a query only takes maybe 5 seconds. Plus, I save some statistics that are computed in the DB, so that when I make another query that uses the same information it will be faster.
So far, so good. Thanks everyone for all your help so far. I'm still open to more ideas to accelerate the program (either in creation or query time, provided it doesn't slow down creation time).
Well, obviously you got rid of the major bottleneck. I think you could probably still make use of the words table if you needed to, you would just have to reorganize the way you are performing the actions. One would be to use a local List<string> to hold all of the unique words in the article and then send that to a stored procedure on the database to do all of the word table updates using the SQL I provided in my earlier post (the 'update ... set ... output into ...' stuff).
However, if what you have right now is fast enough for your interests, then you really don't need to do too much more work. Did you get a chance to put a full-text index on your article column? That may slow down inserts but it will greatly speed up queries.
MurphysParadox on
One thing to also consider is that this project can be easily parallelised, assuming the SQL server is set up to allow connections from people other than localhost. If you split up the text file and spool it out to different machines, you can do the CPU processing on each document on a separate machine; the SQL calls should be relatively cheap.
I'm not sure what speed-up you'll see doing this, but it's worth a try across two computers to see what happens.
Damnation. I just let my program run all weekend long, hoping it would insert all the documents in the database.
It did not. It crashed after 98,072 articles. It was running a query and timed out because the operation was taking too long or the server wasn't responding (which isn't the problem, because my SQL Server is on the same computer). The query was "SELECT Categories FROM Articles WHERE ArticleName = '(whatever the name of the article was)'".
So it seems my database construction is slowing down considerably as the database gets bigger. That's a problem.
1) If I ran a query using the Articles table's primary key (which is an integer) rather than ArticleName (which is a varchar(max)) would it speed things up?
2) I'll have to look for ways to minimise the number of DB accesses I do. As it stands, I can cut a few, I think, but not a lot. I guess I could use the dictionary structure like Murphy proposed, but I'm confident that there's so much data it would surpass the RAM.
3) Other suggestions?
----
And look, more problems!
I just ran a query to see if everything still works, and it doesn't. I get the same problem: timed out because the operation was taking too long or the server wasn't responding. Right at the first DB call too, which is "SELECT id FROM Articles WHERE ArticleText LIKE '% (my query) %'". ArticleText is a long string of words (varchar(max)) and I'm looking for the ID numbers (primary key, integer) of the documents in which that word occurs.
I'll remind you at this point that this is happening with 98,000 entries in the DB, and I need it to go up to 2,000,000.
Did you put full text indexing on the ArticleName and ArticleText fields? And yes, to query against a PK (or any indexed field, as Primary Key fields are inherently indexed) resolves into a very fast query. I assume that, if you haven't set the timeout to anything particular, then it is at its default of 30 seconds (or is it 10... hmm... I forget now).
One option is to do a hash on ArticleName and store that as an integer in the PK column. This will allow you to pull back any individual article based on an integer query. Both .NET and SQL Server have a hash method.
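One way to sketch that without changing the primary key (assuming @articleName holds the title being looked up; CHECKSUM can collide, so the query still compares the full name):

-- Persisted, indexed hash of the article name on the Articles table.
ALTER TABLE Articles ADD ArticleNameHash AS CHECKSUM(ArticleName) PERSISTED;
CREATE INDEX IX_Articles_ArticleNameHash ON Articles (ArticleNameHash);

-- Seek on the integer hash first, then confirm the exact name.
SELECT Categories
FROM Articles
WHERE ArticleNameHash = CHECKSUM(@articleName)
  AND ArticleName = @articleName;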
If full-text indexing still doesn't help you out for question 4, you will need to bring back the word table and the word/document lookup table I espoused in my final example... that would make document lookup on each word trivial, though I suspect there would be some small cost-per-insert. You would really have to implement it and see what happens.
MurphysParadox on
How can I change the timeout value? I don't know if it's 10 or 30 seconds, but either way it's too much. Adding an article to the DB has got to take less than 1 second, and there are several DB operations required, so each one has to be less than one second.
What box are you doing this on? Is there a possibility you can get access to an enterprise level DB like Oracle on a research machine somewhere?
I mean, I'm going to assume your code is OK, and that the machine the database is on can't handle it. It sounds like you're trying to run it off your own machine.
I'm using MS SQL Server 2005. And yes, I'm running the program and the database on the same machine. It's my laptop, and it's an Intel Duo @ 2GHz with 1Gb of RAM, running Windows XP professional SP2.
We just got a new computer for the lab, with a faster processor and 4Gb of RAM. Once it's completely set up and has Visual Studio and SQL Server installed I can move my stuff there and run the program there. Of course it will be a little faster, but I'd rather optimise the code than rely on sheer hardware power.
Did you put full text indexing on the ArticleName and ArticleText fields? And yes, to query against a PK (or any indexed field, as Primary Key fields are inherently indexed) resolves into a very fast query.
The documents have unique names, which I store in nvarchar(max) fields. If I make that field full-text indexed and define it as the primary key, that would be faster than just searching against that field. Is that redundant (i.e. do I just need the PK or the FTI)? And is it better (i.e. faster) for my PK to be an int rather than an nvarchar(max), or does it make no difference once they're indexed?
Also, I can't seem to be able to enable the full-text index. Both the SQL Server help and the online tutorials I found say that I should have the option when I right-click on the table.
But I don't. There's no "Full Text Index" option. It goes from "View Dependencies" to "Rename" without anything in-between.
Are you not able to ask someone about this at your university?
I kinda think we're reaching the limits of what is possible with the unavoidable vagaries of forum posting.
Yeah, I was just thinking of talking to this girl I know who's worked with databases. Maybe she can help, or recommend someone who can.
The problem with getting help from my teammates is that none of them can do it. We're a team of four, two of which have not used SQL Server before, and the third has never done something on this scale. Plus, they're all preparing their comp exams or thesis defense later this term and writing papers, so they wouldn't have time to help me out even if they could.
A university is a much less collaborative environment than you might expect, unfortunately
Any recommendations on SQL resources (for newbs)? I am very curious about it, having taught myself a fair amount of Microsoft Access. I downloaded the trial of SQL 2005 and have no idea where to even begin. It feels like I just need a good tech manual to read.
Hmm, don't know why you can't see the Full Text option. I would recommend finding an SQL 2005 specific forum (for example, http://msdn.microsoft.com/en-us/sqlserver/bb671079.aspx ) and asking specifically about the Full Text option. It may be due to certain settings on the database itself or on the version of 2005 (student, professional, etc) that you are using. This should be a straightforward and easy answer for a forum like that.
Now, you don't have to make the column the Primary Key. There's a subtle difference here, true, but overall you just need to index it one way or another.
As for the timeout, this is the time the client will wait for a query to finish before it kills it... kind of like the number of rings a phone will allow before the answering machine picks up. In your C# code that's the CommandTimeout property on the SqlCommand, and it defaults to 30 seconds. Anything that takes more than even 10 seconds to complete is generally a bad deal, especially given that you were seeing timeouts at only 5% of your desired capacity.
Lastly, your computer is woefully underpowered for this operation.
As for SQL Resources, the main MSDN SQL page is very useful. It contains all sorts of information from "WTF?" to DBA level complexities.
MurphysParadox on
You were getting faster SELECT queries with the Words table idea, but the act of creating all the rows was very slow, right?
If that was the case, perhaps you could do what you switched to (that is, not immediately using the Words table), but once a document is added you could have it process in the background or in a separate process. So you wouldn't see fast SELECT results immediately from new documents, but over time they would begin to become faster.
Let's say that you install whatever you have at some office. They add a bunch of docs through the week, and during the weekend they're gone... but this separate process is still churning, creating the Word table. By the time they come back on Monday, their queries for the documents added the week before are fast again and they didn't experience too much slowdown before it got to that point.
It's a zany idea and I don't know if it's a great one, but it's an option available to you if you saw a significant SELECT speed increase with your individual word table concept.
Lastly, your computer is woefully underpowered for this operation.
Well, what kind of computer would you recommend?
A lot more RAM... like 4-16 GB. The DB tries to live in the RAM, which is hard if the actual table itself breaks the available RAM, which it probably does with 98,000 records each with several thousand words. However, you will make do with what you have.
So, here's what happens when you say "SELECT Categories FROM Articles WHERE ArticleName = '(whatever the name of the article was)'".
The database grabs the Articles table. It then takes your search term and walks, row by row, through the entire table looking for an exact string match. An exact string match basically compares the two values character by character and stops at the first difference. Thus, the more similar the name, the more characters are checked. It may perform some shortcuts (like comparing lengths for a quick decision), but basically it will do a character comparison for each row.
Furthermore, because it is not the primary key (and thus not enforced to be unique), this query will check every row EVEN when a match is made, because there is no limitation on returned values. I know TOP (1) will not help because that gets the entire result set and then just hands back the first entry. I don't know exactly how DISTINCT works, but that may also not work since I imagine it will still go back through the list (but maybe not... try putting DISTINCT in after the SELECT).
So... yeah... you need to do a full text index on that column. In SQL server, everything represented by the GUI is an actual SQL command, so try actually executing the following command against your database.
CREATE UNIQUE INDEX ui_index_name ON table_name(id_column_name);
CREATE FULLTEXT CATALOG ft AS DEFAULT;
CREATE FULLTEXT INDEX ON table_name(varchar_column_name) KEY INDEX ui_index_name;
You may not have to do the first line - especially if the id_column is already set as the PK. If this is the case, for ui_index_name in line 3, just use the name of the primary key index (viewable under the list of indexes when you expand out the table in the GUI, it'll have a golden key next to the name). See what happens when you run this.
Right now I am splitting the data in separate columns, and rows, and using two extra tables. The problem is that splitting up a block of text thousands of words long and running individual processing for each word is extremely slow. It can take me five to ten minutes to process one entry in my original data, and I have millions of entries like that. Clearly this is unacceptable.
So I'm looking to speed up the DB creation stage. Searching the text fields for individual words is something that will only occur in the second stage, during the user queries. And it will typically only be done for a handful of words, not for thousands. So I'm ok with slowing down the user queries a little if I can get a major performance gain in the DB creation stage.
And like I said, I need every bit I can get in the DB creation stage to create it in a reasonable time.
Assuming there's a common ID and these blocks of text are related...(Your page or program puts the rows together, but in the background the data is chunked) you could use the LIKE operator to search the text and a DISTINCT on their common ID to make sure it only grabs one hit per "record". That would be how you'd handle it if it was done as lewisham's suggesting. This has its ups and downs though, because if your text is split and you're looking for "apple butt" and apple is in one row and butt is in another, you wouldn't get a hit.
If each row represents a unique text whatever, however... then no matter how many times your keyword appears in your single field, it will only show up once, since the LIKE operator just runs through that field/row. Unless there was a common ID involved in the case where you've split your text fields.
edit: Wow I shouldn't game while posting here. What exactly are you trying to store in your database? What is the format now?
This.
What is "individual processing" for each word? What are you doing?
Without all the background, any recommendations will be complete shots in the dark with a database as large as yours.
So to answer your question, my DB (or at least the table I'm talking about here) is being built from a few million text documents of a few hundred to a few thousand words each. Each row will be one document. There will be a column for a unique ID number (integer), one for the title of the document (ntext), one for the text of the document (that's the ntext field I'm talking about in my OP) and one for a list of related documents' ID numbers (ntext, I'll be storing them as a long text string with spaces in-between).
That's three ntext field per entry for million of entires in this table, plus those in other tables, so if there's some improvement to be had by using varchar(max) instead... even a few fractions of a second for each entry I add in the DB, multiplied by millions, will be significant.
The idea is that a word's importance in a document can be computer in terms of (1) how many documents in the collection does it appear in, and (2) how frequently does it appear in that specific document. That's the tf-idf metric, for those who are interested in learning more. I use this to rank the documents and discover the most relevant document given the words in a user's query.
Anyway, I need to get those two statistics, and right now I'm doing it word by word when I build the DB, which slows down DB creation considerably. But using the LIKE and DISTINCT commands, I could just store the text in an nfield and then get the statistics and query time for the words the user cares about.
Creation Date/Time, Author, etc would all be useful not only in making searching easier and more relevant for your users (if your documents span text files created in 95' and somebody wants something that was done last week, at the end-user level they'd have to sift through results for that range, and your query would have looked through all that for nothing... the cost is going to be excessively high for every query run, and if you have multiple users then it'll only get uglier.
Edit: Must post faster.
Actually, it could be a lot easier for you to grab an existing natural language search software... just visualizing complexity the query, along with maintenance and space for upkeep doesn't make this seem like a viable solution. Is this something for work?
May I ask who this is for; and how mission-critical this is? If it's a university library or something where it is reasonably important, you need more brains to sit down and work this out. Certainly, if you've not come across the "distinct" keyword, you're out of your depth (I'm not saying you are incompetent, just that you really need some more knowledge before you can begin to wrestle with a problem like this).
My assumption is that your database is built to be queried, so you have to think of record creation as cheap, and searching expensive. Don't worry about how long it takes to enter a record (within reason). EDIT: Cut out some nonsense because I was reading things wrong. You're right to be trying to pre-process as much as possible.
In addition, the DISTINCT keyword is something that will get you out of a jam; it shouldn't be relied on for huge queries where response time is important. I've used DISTINCT in overnight processing jobs, but I wouldn't use it here. DISTINCT will just get the entire dataset that the WHERE matched, and then match every record with everything else. It's an expensive operation, and most DBAs consider its use either a failure of the programmer to formulate a decent query, or a failure of the schema if its use is unavoidable.
Something's very wrong if it's taking five to ten minutes to process one document, assuming that's what you meant by 'entry'. What are you indexing your words table by? Typically you want the primary key to be an integer, but in this case I think it might be faster to index it by the word itself. I suggest you also explicitly exclude common words such as 'a', 'the', 'and', and the like; those could slow you down considerably while adding no real value to your database.
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.
This seems redundant to me. If the word is not yet in the Words table and you add it, it can be assumed that it's the first occurrence of that word in this document. If it wasn't the first occurrence, it would already have been added to the Words table previously.
This was already mentioned, but is adding records to the database going to be more frequent or is searching going to be more frequent? I assume searching, in which case you care less about how long it takes to add a record. I think searching the text at query time to create these metrics is going to be extremely slow if the user searches for a relatively common word, because you have to first obtain all records with that word, and then search through the full text of each returned record to get a count of the word.
Unfortunately the search is only keyword-based. Other information, like document author and date, are not relevant in my problem.
Do you mean one that goes on top of SQL server, or...? Can you suggest one?
It's for a research project, and it's important.
You mean, if the user is querying for a keyword that doesn't occur anywhere in the DB? Then the keyword is simply ignored. There's not much else I can do there.
Well that's my problem. Right now it takes an unreasonable time to enter a record into the DB, and I need to cut down on it, which is why I'm looking to remove as much of the preprocessing as possible.
Some documents refer to other documents in the DB. For each document I have a list of such related documents. I add it to the document's entry as a list of referred documents' ID numbers.
I have both right now, actually. Each entry in each table has an integer ID number, and in the Words table there is a ntext field for the word.
Indeed, I remove them. I also remove all punctuation marks, and cut more complex words down to their stems. This has all already been done before the DB creation stage, so it doesn't count in the creation time or in the preprocessing I'm trying to reduce.
Interesting. I'll look into that this afternoon. Thanks.
Unfortunately, how long it takes to enter a record is a major problem here. Let's round it to 2,000,000 records. If it takes me one minute per record (and right now it takes me a lot longer than that) then it'll take me over three years to build the DB.
If I can cut it down to one second per entry creation (take one month to create the DB) and the flip side is slightly slower querying, I can live with that. At least I'll have a DB to query through.
Also, use varchar(max). Text and NText have been deprecated and should no longer be used according to Microsoft.
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.
It did indexing much faster than what you've got going now... I don't recall the time on the initial builds but I'm thinking it was less than an hour on my work PC, though my data was broken down into chunks (by year), and was slightly smaller (less than a million files). You wouldn't have to have a DB for your solution... it creates an index for it. but I'm assuming that your research doesn't require this information to be converted and stored into a DB. However, it could be pricey, as it's an enterprise solution... there could be some smaller ones out there but I couldn't say offhand.
That said... just looking at your project's data entry portion, what is the mechanism by which you're importing this data? Have you written a script or program to do this?
Another option would be to hash each word (there are hash commands in SQL) so that the 'value' used is actually a number. Use that as the primary key of the word table and the locating would definitely go faster. Unfortunately I don't know if it would speed up the inserts as much as it would speed up the queries.
Now, as for your actual problem, it seems you've got a hell of a complexity level. Let me see if I understand:
Ok, this has some nasty problems. Your first loop is going to execute n times (n = number of words). The if statement will, through SQL behind-the-scenes work, execute a/m times (m = number of rows in wordTable, a <= m; a is the position in the table for the word). The second if will execute some b/n times (b <= n; b is the position in the document for the word) depending on where the word is located. Lastly, you get another a/m for the update because it has to go find the row again.
So your total execution loop is something like n * (a/m)^2 * (b/n), which should give you a hell of a lot of execution, since there is no short-cutting here. If wordTable.word is not indexed, SQL will look at every single row, an arbitrary number of steps, until it finds the word. If it is indexed, then you'd at least ensure the search is based off some binary search tree or the like, which will greatly increase the search speed.
This is where DISTINCT comes into play. It ensures a/m instead of m because it knows to stop at the first find... but that doesn't work so well for the update, which will go running down the table thinking that there may be more than one entry, UNLESS you make the word column a primary key (no duplicates allowed) and indexed, so it knows never to bother looking for a second entry (this also means DISTINCT is not necessary). If no DISTINCT, primary key restriction, or indexing is used here, a = m for each search, because it will check every row to see if it needs to work on it (since it won't know that the field is unique).
Ok, so here's my suggestion for reducing time spent in the execution loop:
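Something along these lines (wordTable(word, wordCount) is an assumed schema, and @word comes from your per-word loop):

```sql
-- @word is supplied by the caller for each word in the document.
-- Try the update first; if the word isn't there yet, insert it.
UPDATE wordTable
SET wordCount = wordCount + 1
WHERE word = @word;

IF @@ROWCOUNT = 0
    INSERT INTO wordTable (word, wordCount)
    VALUES (@word, 1);
```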
@@ROWCOUNT does the existence check for you.
You are now looking at n * (a/m) * (b/q), which is less than the original n * (a/m)^2 * (b/n). Do also keep in mind what will happen as the project runs. We know n to be relatively constant and q to be the relative variety of words in a document (also assumed to be relatively constant). As each document is processed, q will grow (its derivative approaching zero as fewer new words turn up per document). m will also increase as the table grows, but depending on the variety of your words across all documents, m's growth should drop to zero well before you run out of documents. We are also guaranteed that (b/q) < ((a/m) * (b/n)), since q <= n due to it being a subset and m will quickly grow to be > n. Because q <= n you can also be sure that the first b < the second b, since the top <= bottom of both statements... blah blah blah, just trust me on this, heh.
If you index wordTable, you will reduce a/m down to very near 1. Unsurprisingly, that brings the second method's complexity down to simply n * (b/q), or, in other words, it's driven by the size of the document and the variety of words in it. Even if you don't index, you're still talking n * (a/m) * (b/q) vs. n * (a/m)^2 * (b/n)... which is still better.
A small modification that may speed it up is as follows:
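Roughly like this (again just a sketch; it assumes the current document's words have already been dumped into a staging table #docWords):

```sql
-- Collapse the document down to its distinct words first.
-- (If you need per-occurrence counts instead, use GROUP BY word with COUNT(*).)
SELECT DISTINCT word
INTO #local
FROM #docWords;

DECLARE @word varchar(100);   -- size is a guess; match your word column

DECLARE wordCursor CURSOR FAST_FORWARD FOR
    SELECT word FROM #local;

OPEN wordCursor;
FETCH NEXT FROM wordCursor INTO @word;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Same update-then-insert as above, but each distinct word is hit once.
    UPDATE wordTable SET wordCount = wordCount + 1 WHERE word = @word;
    IF @@ROWCOUNT = 0
        INSERT INTO wordTable (word, wordCount) VALUES (@word, 1);

    FETCH NEXT FROM wordCursor INTO @word;
END

CLOSE wordCursor;
DEALLOCATE wordCursor;
DROP TABLE #local;
```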
This will load each distinct word into #local and then, using a cursor, iterate through #local's word list. The time cost here comes from SQL managing the cursor... which isn't known for blazing speed. I would give both a shot and see how it goes.
As for query time, it really depends on what you want to do. If you make a lookup table of (WordID, DocumentID), then you could make the query very fast. Just find the WordID and return every DocumentID (a one-up number describing each document) for the given WordID (another identity-column one-up number). Join this to the document table to pull the document text out. It would be easy to do: just use the UPDATE ... OUTPUT ... INTO command to get the WordID of existing words, and SCOPE_IDENTITY() for new ones. For example (using the first of my examples to make life easier, though it is functionally equivalent to the second example):
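A sketch of that, with assumed schemas wordTable(wordID identity, word, wordCount) and wordDocLookup(wordID, documentID); @word and @documentID come from the calling loop:

```sql
DECLARE @ids TABLE (wordID int);

-- Existing word: bump its count and capture its wordID via OUTPUT.
UPDATE wordTable
SET wordCount = wordCount + 1
OUTPUT inserted.wordID INTO @ids (wordID)
WHERE word = @word;

IF @@ROWCOUNT = 0
BEGIN
    -- New word: insert it and capture the identity value instead.
    INSERT INTO wordTable (word, wordCount) VALUES (@word, 1);
    INSERT INTO @ids (wordID) VALUES (SCOPE_IDENTITY());
END

-- Record that this word appears in this document.
INSERT INTO wordDocLookup (wordID, documentID)
SELECT wordID, @documentID FROM @ids;
```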
My one uncertainty is if you have to output into a table variable or not. If so, make another temporary local table with one column and add 'SELECT @wordID = wordID from #localWordTable' before the insert into the lookup table.
This way your query to find all documents with word X and return the text looks like this:
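Something like this, using the same assumed names as above (documentTable holds the documents; @searchWord is the user's keyword):

```sql
-- All documents containing @searchWord, found via the lookup table.
SELECT d.documentID, d.documentText
FROM wordTable AS w
JOIN wordDocLookup AS l ON l.wordID = w.wordID
JOIN documentTable AS d ON d.documentID = l.documentID
WHERE w.word = @searchWord;
```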
No more LIKEs or anything of the sort. Anyway, I hope this helps you in some way or another. At the least, I enjoyed writing it, heh.
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.
It might be completely out of scope for what you're doing, but you might want to think about using some sort of Bayes / neural net hybrid to reduce all of these database hits. Run a quick Bayes filter over the text to get 10 keywords, and then just store those.
Keep a persistent hash in memory to play the middleman between your database and your frontend.
Also a thought: if you're allowing arbitrary strings, then use some sort of neural net, or even a thesaurus, to see which keywords are 'close to' the arbitrary word, and then just select on those keywords for a search.
Should cut down your total searching time logarithmically.
Sorry if something like that's not an option for you.
The data is in a text file. I've already pre-processed it to structure it, so I know the first line is the document's title, then there's a number of lines for the referenced documents, then the document text in one line. All the lines are labelled, so I know which is which.
I've built a C# program to read the file line by line. It stores the title, list of referenced documents, and the text, and when it has a complete entry it adds it into the database.
You may want to watch your RAM usage while running the program and see if it ever peaks. You'll know it has broken the limit when CPU drops from 99% to 3% because the page file I/O is orders of magnitude slower than anything your CPU can do. The solution would be better memory management... which is hard in C# due to its automatic garbage collection (but not impossible).
Another question is whether these files are divided in such a way as to allow multiple instances of the program to run against, say, different folders. Shove it on a few machines and run them concurrently. SQL Server will take care of most of the necessary deconfliction... just make sure your insert command can also handle the case where it finds the word already there (this could be done via an insert trigger, sketched below).
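A rough sketch of such a trigger, using the assumed wordTable(word, wordCount) schema from earlier (heavy concurrency might still want locking hints, which I'm glossing over):

```sql
-- Instead of inserting blindly, bump counts for words that already
-- exist and only insert the genuinely new ones.
CREATE TRIGGER trg_wordTable_insert
ON wordTable
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Words already in the table: add the incoming counts to them.
    UPDATE w
    SET w.wordCount = w.wordCount + i.wordCount
    FROM wordTable AS w
    JOIN inserted AS i ON i.word = w.word;

    -- Words not seen before: insert them as-is.
    INSERT INTO wordTable (word, wordCount)
    SELECT i.word, i.wordCount
    FROM inserted AS i
    WHERE NOT EXISTS (SELECT 1 FROM wordTable AS w WHERE w.word = i.word);
END
```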
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.
This is the main function to read through the text file, get the titles, related documents (which are called "categories" here) and the document text. The line I marked in red is the one where I insert the text in the DB, and that's the one that slows down everything.
So here is the AddTextToArticle() function. As you can see, it checks each word of the document one by one, and calls the AddWordToDB function for each (in the section in green).
The function called is AddWordToDB, which is here. It doesn't seem so bad. I mean, yeah, there's a while in it (green), but it's over the number of times a word is in the DB, which can only be 0 or 1, so it's no big deal.
So I believe the problem is the foreach word in the document, which is the part in green in the second function. That's the only one I can see that would slow things down.
I make queries using the LIKE operator and it's OK; each query only takes maybe 5 seconds. Plus, I save some statistics that are computed in the DB, so that when I make another query that uses the same information it will be faster.
So far, so good. Thanks everyone for all your help so far. I'm still open to more ideas to accelerate the program (either in creation or query time, provided it doesn't slow down creation time).
However, if what you have right now is fast enough for your purposes, then you really don't need to do too much more work. Did you get a chance to put a full-text index on your article column? That may slow down inserts, but it will greatly speed up queries.
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.
I'm not sure what speed-up you'll see doing this, but it's worth a try across two computers to see what happens.
It did not. It crashed after 98,072 articles. It was running a query and timed out because the operation was taking too long or the server wasn't responding (which isn't the problem, because my SQL Server is on the same computer). The query was "SELECT Categories FROM Articles WHERE ArticleName = '(whatever the name of the article was)'".
So it seems my database construction is slowing down considerably as the database gets bigger. That's a problem.
1) If I ran a query using the Articles table's primary key (which is an integer) rather than ArticleName (which is a varchar(max)) would it speed things up?
2) I'll have to look for ways to minimise the number of DB accesses I do. As it stands, I think I can cut a few, but not a lot. I guess I could use the dictionary structure Murphy proposed, but I'm confident there's so much data it would exceed the available RAM.
3) Other suggestions?
----
And look, more problems!
I just ran a query to see if everything still works, and it doesn't. I get the same problem, timed-out because the operation was taking too long or the server wasn't responding. Right at the first DB call too, which is "SELECT id FROM Articles WHERE ArticleText LIKE '% (my query) %'". ArticleText is a long string of words (varchar(max)) and I'm looking for the ID numbers (primary key, integer) of the documents in which that word occurs.
I'll remind you at this point that this is happening with 98,000 entries in the DB, and I need it to go up to 2,000,000.
4) What can I do to speed things up?
One option is to do a hash on ArticleName and store that as an integer in the PK column. This will allow you to pull back any individual article based on an integer query. Both .NET and SQL Server have a hash method.
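A sketch of the SQL Server side of that (names are placeholders; CHECKSUM is the built-in that makes this easy, and @articleName would be your search parameter):

```sql
-- Persisted computed column holding an int hash of the name,
-- plus an index on it, so name lookups become integer seeks.
ALTER TABLE Articles
ADD ArticleNameHash AS CHECKSUM(ArticleName) PERSISTED;

CREATE NONCLUSTERED INDEX IX_Articles_ArticleNameHash
    ON Articles (ArticleNameHash);

-- Query by hash first, then re-check the name to weed out collisions.
SELECT Categories
FROM Articles
WHERE ArticleNameHash = CHECKSUM(@articleName)
  AND ArticleName = @articleName;
```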
If full-text indexing still doesn't help you out with question 4, you will need to bring back the word table and the word/document lookup table I espoused in my final example... that would make document lookup for each word trivial, though I suspect there would be some small cost per insert. You would really have to implement it and see what happens.
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.
What box are you doing this on? Is there a possibility you can get access to an enterprise level DB like Oracle on a research machine somewhere?
I mean, I'm going to assume your code is OK, and that the machine the database is on can't handle it. It sounds like you're trying to run it off your own machine.
We just got a new computer for the lab, with a faster processor and 4 GB of RAM. Once it's completely set up and has Visual Studio and SQL Server installed, I can move my stuff there and run the program there. Of course it will be a little faster, but I'd rather optimise the code than rely on sheer hardware power.
But I don't. There's no "Full Text Index" option. It goes from "View Dependencies" to "Rename" without anything in-between.
I kinda think we're reaching the limits of what is possible with the unavoidable vagaries of forum posting.
The problem with getting help from my teammates is that none of them can do it. We're a team of four, two of whom have not used SQL Server before, and the third has never done anything on this scale. Plus, they're all preparing their comp exams or thesis defenses later this term and writing papers, so they wouldn't have time to help me out even if they could.
A university is a much less collaborative environment than you might expect, unfortunately.
Any recommendations on SQL resources (for newbs)? I am very curious about it, having taught myself a fair amount of Microsoft Access. I downloaded the trial of SQL Server 2005 and have no idea where to even begin. It feels like I just need a good tech manual to read.
Now, you don't have to make the column the Primary Key. There's a subtle difference here, true, but overall you just need to index it one way or another.
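For example (Articles/ArticleName are assumed names; note the caveat in the comments):

```sql
-- Works only if ArticleName is a bounded type (e.g. varchar(450));
-- varchar(max) columns can't be index key columns, which is one reason
-- the hash idea mentioned earlier in the thread can be attractive.
CREATE NONCLUSTERED INDEX IX_Articles_ArticleName
    ON Articles (ArticleName);
```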
As for the timeout, this is how long SQL will wait for a query to finish before it kills it... kind of like the number of rings a phone will allow before the answering machine picks up. Anything that takes more than even 10 seconds to complete is generally a bad deal, especially given that you were seeing timeouts at only 5% of your desired capacity.
Lastly, your computer is woefully underpowered for this operation.
As for SQL Resources, the main MSDN SQL page is very useful. It contains all sorts of information from "WTF?" to DBA level complexities.
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.
You were getting faster SELECT queries with the Words table idea, but the act of creating all the rows was very slow, right?
If that was the case, perhaps you could do what you switched to (that is, not immediately using the Words table), but once a document is added you could have it process in the background or in a separate process. So you wouldn't see fast SELECT results immediately from new documents, but over time they would begin to become faster.
Let's say that you install whatever you have at some office. They add a bunch of docs through the week, and during the weekend they're gone... but this separate process is still churning, creating the Word table. By the time they come back on Monday, their queries for the documents added the week before are fast again, and they didn't experience too much slowdown before it got to that point.
It's a zany idea and I don't know if it's a great one, but it's an option available to you if you saw a significant SELECT speed increase with your individual word table concept.
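One way that could be wired up, sketched with made-up names: the loader only writes the raw document plus a queue row, and a separate job drains the queue into the word tables.

```sql
-- Assumed queue of documents whose words have not been extracted yet.
CREATE TABLE DocumentQueue
(
    documentID int NOT NULL PRIMARY KEY
);

-- The fast insert path stores the raw document row, then just enqueues it:
--   INSERT INTO DocumentQueue (documentID) VALUES (@newDocumentID);

-- A background job (SQL Agent, or a separate program) drains the queue,
-- running the per-word indexing sketched earlier for each document.
DECLARE @nextID int;
SELECT TOP (1) @nextID = documentID FROM DocumentQueue;

WHILE @nextID IS NOT NULL
BEGIN
    -- ... build wordTable / wordDocLookup rows for document @nextID here ...

    DELETE FROM DocumentQueue WHERE documentID = @nextID;

    SET @nextID = NULL;
    SELECT TOP (1) @nextID = documentID FROM DocumentQueue;
END
```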
A lot more RAM... like 4-16 GB. The DB tries to live in the RAM, which is hard if the actual table itself breaks the available RAM, which it probably does with 98,000 records each with several thousand words. However, you will make do with what you have.
So, here's what happens when you say "SELECT Categories FROM Articles WHERE ArticleName = '(whatever the name of the article was)'".
The database grabs the Articles table. It then takes your search term and walks, row by row, through the entire table looking for an exact string match. Now, exact string matches work basically as follows: the two strings are compared character by character, left to right, until a character differs or one of the strings runs out.
Thus, the more similar the name, the more characters have to be checked. It may perform some shortcuts (like comparing lengths for a quick decision), but basically it will do a character comparison for every row.
Furthermore, because it is not the primary key (and thus not enforced to be unique), this query will check every row EVEN when a match is made, because there is no limit on the number of returned values. I know TOP (1) will not help, because that gets the entire result set and then just hands back the first entry. I don't know exactly how DISTINCT works, but that may also not work, since I imagine it will still go back through the list (but maybe not... try putting DISTINCT in right after the SELECT).
So... yeah... you need to put a full-text index on that column. In SQL Server, everything represented by the GUI is an actual SQL command, so try actually executing the following commands against your database.
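Something along these lines (ui_index_name, ft_catalog, and the Articles/ArticleText/id_column names are placeholders in the style of the MSDN article linked below; swap in your own):

```sql
CREATE UNIQUE INDEX ui_index_name ON Articles (id_column);                              -- unique key index for full-text
CREATE FULLTEXT CATALOG ft_catalog AS DEFAULT;                                          -- catalog to hold the index
CREATE FULLTEXT INDEX ON Articles (ArticleText) KEY INDEX ui_index_name ON ft_catalog;  -- the full-text index itself
```

Once the index is built, queries would use CONTAINS(ArticleText, 'yourword') instead of LIKE to actually take advantage of it.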
You may not have to do the first line - especially if the id_column is already set as the PK. If this is the case, for ui_index_name in line 3, just use the name of the primary key index (viewable under the list of indexes when you expand out the table in the GUI, it'll have a golden key next to the name). See what happens when you run this.
Here is a great explanation from Microsoft on this entire idea: http://msdn.microsoft.com/en-us/library/ms142545.aspx
Murphy's Paradox: The more you plan, the more that can go wrong. The less you plan, the less likely your plan will succeed.