So, I was given an interesting challenge here. There are 5,000 five-page handwritten forms that we want to scan and run through OCR (optical character recognition) so that we can study them. Right now, I'm looking for any and all ideas on how to do that, aside from doing it manually myself.
Some details:
1. There are 5000 documents, each has 5 pages, and each is stapled. If staples are removed in the process, we would like them restored after to maintain the integrity of the documents.
2. The forms are hand-written, with varying levels of handwriting quality. The OCR will need to recognize it with a minimum of errors regardless.
3. Some people wrote on the back side of the pages. The text on the back side needs to be digitized and OCR'd as well. There would not be an issue scanning all 5000 documents as two-sided if that's the easiest way to make sure we don't lose a page of text.
4. Some people drew pictures in their documents. When that happened, I would like a note in the text and metadata like <picture here> or some such.
5. Forms have an ID line at the top, and the last page is a survey. We would like all that information extracted and included in metadata for easy sorting and retrieval. We would keep the scanned images, so we can go back and look at the pictures if we want to. The pictures often have speech bubbles and captions, though, so it'd be nice if those were extracted, OCR'd, and included in the text.
6. Since the text was handwritten by kids, there are numerous spelling and grammar mistakes as well as made-up words. We would like two copies of the OCR results: one 'original' with all mistakes, and one 'clean' with mistakes corrected.
7. Budget is of course tight (when isn't it?), but let's leave that as a secondary concern once we have a good list of options to pick from. I think we have enough technical problems to deal with, without throwing budget into the mix from the start.
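To make the target output a bit more concrete, here is a rough sketch of what one finished record could look like. Everything in it is an assumption for illustration: the class name `ScannedForm`, the field names, the file naming, and the JSON layout are made up, not something we have settled on.

```python
import json
from dataclasses import dataclass, field, asdict

# Marker left in the text wherever a kid drew a picture (wording not final).
PICTURE_MARKER = "<picture here>"


@dataclass
class ScannedForm:
    """One stapled 5-page form. All field names are illustrative assumptions."""
    form_id: str                                      # from the ID line at the top
    survey: dict = field(default_factory=dict)        # answers from the last page
    image_files: list = field(default_factory=list)   # kept scans, front and back
    original_text: str = ""                           # OCR text, mistakes intact
    clean_text: str = ""                              # same text, mistakes corrected

    def picture_count(self) -> int:
        # Count drawings by counting markers in the original transcription.
        return self.original_text.count(PICTURE_MARKER)

    def to_json(self) -> str:
        # Flatten the record and add the derived picture count as metadata.
        record = asdict(self)
        record["picture_count"] = self.picture_count()
        return json.dumps(record, indent=2)


example = ScannedForm(
    form_id="0001",
    survey={"q1": "yes"},
    image_files=["0001_p1_front.png", "0001_p1_back.png"],
    original_text="My dog is my best freind. <picture here>",
    clean_text="My dog is my best friend. <picture here>",
)
print(example.to_json())
```

Keeping both text versions and the image file names in the same record would let us sort on the ID and survey fields and still get back to the scans.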
So, how would you go about doing this? Do you know of professional services that do this? Or useful software?
Thanks all.
Posts
There are a ton of temp agencies you could contact that should have some data entry folks that'll do it.
There ARE ways to do it, but you'd need quite a bit of coding knowledge to do so.
Off the top of my head, the way I would do it is:
Scan them all in front and back.
Remove all the empty pages (Acrobat can do this automatically).
Make a copy of the file and have Acrobat crop to the top of the pages, for however many pages you need (this is for the ID line). You can use the Action Wizard in Acrobat to do this.
Run OCR on the cropped files.
That'll give you a PDF file with the text you need. You should then be able to extract the text to use as a note.
Do the same for the rest of the info you need and then recombine them all.
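If you'd rather script the "drop the empty pages" and "crop the top" steps than do them in Acrobat, the logic is roughly the following. This is only a sketch and not what Acrobat does internally. It assumes each scanned page has already been loaded as a grid of grayscale pixel values (0 = black, 255 = white), and the function names and thresholds (`dark_below`, `max_ink`, `fraction`) are made up and would need tuning on real scans.

```python
from typing import List

# A page is a grid of grayscale pixels: 0 = black ink, 255 = white paper.
Page = List[List[int]]


def ink_fraction(page: Page, dark_below: int = 128) -> float:
    """Share of pixels darker than the threshold."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0
    dark = sum(1 for row in page for px in row if px < dark_below)
    return dark / total


def is_blank(page: Page, max_ink: float = 0.005) -> bool:
    """Treat a page as empty if almost none of it is dark (assumed threshold)."""
    return ink_fraction(page) <= max_ink


def crop_top(page: Page, fraction: float = 0.12) -> Page:
    """Keep only the top strip of the page, where the ID line is assumed to be."""
    rows = max(1, round(len(page) * fraction))
    return [row[:] for row in page[:rows]]


# Tiny synthetic example: one empty back side, one page with "writing" near the top.
blank_page = [[255] * 100 for _ in range(100)]
written_page = [[255] * 100 for _ in range(100)]
for y in range(2, 6):
    for x in range(10, 60):
        written_page[y][x] = 0

kept = [p for p in (blank_page, written_page) if not is_blank(p)]
id_strip = crop_top(kept[0])
print(len(kept), "page kept; ID strip is", len(id_strip), "rows tall")
```

The cropped strip is what you would then hand to the OCR step. Scanner speckle on real paper will probably force a higher `max_ink` than the one used here.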
You could pull the info from the PDFs as images and then recreate the PDFs from all the pulled info with a PHP PDF library.
But like I said it's going to be a job, no matter how you go about it.
OCR is a thing, but it's not going to correct mistakes or guess at intended spellings and meanings.
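So the 'clean' copy would have to come from a separate pass over the OCR output. A very rough sketch of that idea, using Python's standard `difflib` to snap near-misses onto a word list and flag everything else for a human, might look like this. The word list, the `cutoff` value, and the function names are all assumptions for illustration, and it does nothing about grammar or meaning.

```python
import difflib
import re

# Tiny stand-in word list; a real run would load a full dictionary (assumption).
KNOWN_WORDS = ["my", "dog", "is", "best", "friend", "because", "he", "plays"]


def clean_word(word, known=KNOWN_WORDS, cutoff=0.8):
    """Return (replacement, needs_review) for one word.

    Replacements come back lowercase; capitalization is not preserved here.
    """
    lower = word.lower()
    if lower in known:
        return word, False
    # Closest dictionary word, if any is similar enough to trust.
    matches = difflib.get_close_matches(lower, known, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    # Nothing close: likely a made-up word, leave it alone and flag it.
    return word, True


def make_clean_copy(original):
    """Build the 'clean' text and a list of words left for manual review."""
    flagged = []

    def fix(match):
        fixed, needs_review = clean_word(match.group(0))
        if needs_review:
            flagged.append(match.group(0))
        return fixed

    clean = re.sub(r"[A-Za-z']+", fix, original)
    return clean, flagged


original = "my dog is my best freind becuz he plays zorbleball"
clean, flagged = make_clean_copy(original)
print(clean)
print("needs a human:", flagged)
```

With this small word list, "freind" gets corrected, while the phonetic spelling "becuz" and the made-up "zorbleball" are only flagged. That is about what to expect from this kind of matching on kids' writing.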
Something like this is probably a few tens of thousands of dollars just to get started.
You can probably get rudimentary OCR with some online tools and libraries, or even some APIs/SDKs if you're a programmer.