
MS Office OCR List Problem Thingie

Dr Snofeld Registered User regular
edited April 2009 in Help / Advice Forum
The situation is thus.

I have a 90-odd page list of book distributors, for work. My boss wanted me to scan them all in. I did that. He has now asked if it's possible to A: remove the entries without email addresses, and B: send a stock email to all of the distributors WITH email addresses.

The problem is that since the pages are formatted as four columns of pretty small and slightly faded text, if I OCR the lists I'll more than likely get a lot of gibberish in there that I'd have to manually check ANYWAY. PLUS I'd end up with a mish-mash of stuff in MS Word which I couldn't really do anything with. From my very faint memories of working with stuff in Office I could do the whole stock email thing in...Works, maybe? But even there I'd have to manually enter or copy-paste into a database the names and email addresses of several hundred distributors, and THEN I'm not certain how I'd send said emails out.

I say I, but this really isn't my job, it's just been handed on to me just now because the person who was meant to do it has quit working for the guy. MY job is to "computerise" (his words) his typewriter-based manuscripts. I initially thought my boss just wanted the list on my PC as a backup, so I scanned it all in and he's gonna pay me for the time. But this stuff, I'm guessing, would take much longer than I have time for, given that I have exams to study for, and he's been understanding about that sort of thing. So I think I'm just gonna have to say that I cannot get this stuff done soon enough for him, given my other commitments, and that he may have to find someone else to do this list business.

That being said, if I don't do this, I think he'd like to know how it should be done, to pass instructions on to whoever has to do it.

So, am I on the money with my assessment of how this task would be done, or is there a more efficient way?


Posts

  • LaOs Saskatoon Registered User regular
    edited April 2009
    If you can photocopy or just scan images of the pages, scan the faded text darker (with the light/darkness and contrast settings). Try scanning a page of that through the OCR to see if it picks it up. Have you OCR scanned one of the original pages to determine what results you get with the faded text?

    If that works, OCR the pages into a Word document or whatever. I would then try pasting the text from the Word document into an Excel spreadsheet... it should make new columns where there are new columns. At the very least, using concatenate and other formulas, you should be able to make the text read like it should. Then, I would sort or filter based on the presence or absence of an email address... and then cut out the rest and you've got a ready-made database for mail merges, or you can just copy all the email addresses at once and paste them into the BCC of the email you want to send out and email them all at once.
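
    A minimal sketch of the filtering step described above, done in Python instead of Excel. It assumes the OCR output has been saved as plain text with a blank line between distributor entries (the real scans may not come out that cleanly), and the names split_entries, entries_with_email and bcc_line are made up for illustration. The pattern only finds strings that look like email addresses; it cannot tell whether the OCR read them correctly.

```python
import re

# Loose pattern for something that looks like an email address.
# Deliberately simple; it is not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def split_entries(ocr_text):
    """Split OCR output into entries, ASSUMING a blank line between entries."""
    blocks = re.split(r"\n\s*\n", ocr_text)
    return [block.strip() for block in blocks if block.strip()]


def entries_with_email(entries):
    """Return (entry, email) pairs for entries containing an email-like string."""
    kept = []
    for entry in entries:
        match = EMAIL_RE.search(entry)
        if match:
            kept.append((entry, match.group(0)))
    return kept


def bcc_line(pairs):
    """Join the unique addresses into one string to paste into a BCC field."""
    seen = []
    for _, email in pairs:
        if email.lower() not in (e.lower() for e in seen):
            seen.append(email)
    return "; ".join(seen)


# Made-up sample data standing in for one OCR'd column.
sample = """Example Books Ltd
12 High Street
sales@example.com

No Email Distributors
PO Box 99

Another Distributor
orders@example.org
"""

pairs = entries_with_email(split_entries(sample))
print(bcc_line(pairs))  # prints: sales@example.com; orders@example.org
```

    The semicolon separator is an assumption that suits Outlook; other mail programs may expect commas.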

    I'm not sure if this will actually solve your problem though, so good luck!

  • Dr Snofeld Registered User regular
    edited April 2009
    The columns aren't like Name, Address, Email or that stuff, it's more like how a newspaper is laid out. I'll try OCRing it in a bit.

    EDIT: Scanned a page, but the email addresses had mistakes in them - m as nn, e as a, stuff that's par for the course when the text is any smaller than, say, size 12. That means going through every single address to correct the little errors, so I might as well enter them manually.
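
    For whoever does end up checking the addresses, here is a rough sketch of one way to decide which ones to look at first. It is only a heuristic: it flags addresses containing letter sequences that OCR tends to confuse on small print, and it cannot catch every error (an "e" read as "a" still looks like a perfectly normal address). Only "nn" for "m" comes from the scan described above; the other sequences, and the names CONFUSABLE, review_reasons and sort_for_review, are assumptions made up for illustration.

```python
# Sequences that OCR commonly confuses on small or faded print.
# "nn" (read in place of "m") is from the post; the rest are assumed additions.
CONFUSABLE = ("nn", "rn", "vv", "cl", "1", "0")


def review_reasons(address):
    """Return the confusable sequences found in an address (empty list if none)."""
    lowered = address.lower()
    return [seq for seq in CONFUSABLE if seq in lowered]


def sort_for_review(addresses):
    """Put the addresses with the most confusable sequences first."""
    return sorted(addresses, key=lambda a: len(review_reasons(a)), reverse=True)


# Made-up addresses for illustration only.
candidates = ["sales@example.org", "johnny@exarnple.com"]
for address in sort_for_review(candidates):
    print(address, review_reasons(address))
```

    Every address would still need a human check against the scanned page; this only orders the pile.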

    As for removing those entries, I'm just going to remove the entries without email addresses straight from the scanned image using GIMP 2 to "white out" the sections.

    EDIT EDIT: Spoke to my boss, he's going to find someone else to do the email thing. I'm gonna remove the email-less entries and put a .zip of the whole 92-or-so pages up on RapidShare or somesuch when he finds someone to do the work. Luckily for them I'm also removing from the list the entries that would definitely not be interested in the religious and, ugh, Intelligent Design-based books. (Working on these books, I feel like a traitor to the scientific community, but at least I didn't write them, my name's not on them, and they're not for schools. Plus I need the money and the work is easy as hell. At least his other books on WWII and the like are okay.) So the list shrinks considerably once I remove all the art shops and purveyors of Chinese literature or Elvis merch.

    All the same, I'd appreciate any help that I can pass on to whoever ends up with this task.
