
Where to begin

Datarape Registered User regular
edited February 2007 in Help / Advice Forum
Greetz!

I'm looking to begin a new project: a program that can read and save information from webpages (mainly Wikipedia).

The programs I've written in the past have never done anything of this sort. My goal is to start small: a program that can simply connect to various pages.

But eventually I want the program to pull information out of those pages. Obviously it will have to work through whatever controls the website provides, such as buttons and search boxes.

I have disposable machines for the task, just no guidance. Can any programmers out there offer some experience? Thanks for the aid.

Datarape on

Posts

  • Nerissa Registered User regular
    edited February 2007
    So what you want is a kind of automated browser?

    What platform(s) and language(s) do you have available? Are you willing to learn a new language, and if not, what languages do you already know? What kinds of projects have you done before?

    I'd start with building your own browser -- you should be able to find tutorials in a variety of places, depending on the language.
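
    For instance, in Python (just as an illustration -- the URL and the crawler name here are placeholders I made up), fetching a page takes only a few lines of the standard library:

        import urllib.request

        def fetch(url):
            # Some sites reject requests with no user agent, so send a simple one.
            # "MyLittleCrawler/0.1" is just a placeholder name.
            req = urllib.request.Request(url, headers={"User-Agent": "MyLittleCrawler/0.1"})
            with urllib.request.urlopen(req) as resp:
                return resp.read().decode("utf-8", errors="replace")

        page = fetch("https://en.wikipedia.org/wiki/Web_crawler")
        print(page[:300])  # first 300 characters of the raw HTML

    Once you can do that reliably, everything else is built on top of it.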

    In order to interact with the elements of the page, though, you would need to parse the HTML, find the buttons and so on, and send the same requests they do. If you're only looking at one specific set of pages whose format never changes (only the content), you might be able to build those assumptions into your code. But unless you have control of the page (in which case, I'm not sure what your purpose is), you can't count on the format staying the same.
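
    As a sketch of what "parse the HTML and find the buttons" might look like -- again assuming Python, using only its standard-library HTMLParser -- this collects each form's action URL and the names of its input fields, which is what you'd need to imitate a search box:

        from html.parser import HTMLParser

        class FormFinder(HTMLParser):
            # Collects each <form>'s action URL and the names of its <input> fields.
            def __init__(self):
                super().__init__()
                self.forms = []

            def handle_starttag(self, tag, attrs):
                attrs = dict(attrs)
                if tag == "form":
                    self.forms.append({"action": attrs.get("action"), "inputs": []})
                elif tag == "input" and self.forms:
                    self.forms[-1]["inputs"].append(attrs.get("name"))

        finder = FormFinder()
        finder.feed(page)   # 'page' as fetched in the earlier snippet
        for form in finder.forms:
            print(form["action"], form["inputs"])

    With the action URL and the input names in hand, "pressing the button" is just sending a request to that URL with those parameters filled in.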

    Nerissa on
  • Obs __BANNED USERS regular
    edited February 2007
    I think what you are looking to build first is a web crawler.

    Typically these are programs that follow hyperlinks from page to page, fetching and indexing whatever they find. What they can't find is the so-called deep web: pages that aren't linked from anywhere, which is why no search engine ever indexes them. The deep web is a very large place.

    But you can also use web crawlers to read specific known sites and index information from them. Just do a search for web crawlers.
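
    For a rough idea, here's a bare-bones crawler sketch in Python (the starting URL and the 10-page limit are arbitrary choices, and LinkFinder is just a name I picked): it keeps a queue of URLs, fetches each one, pulls out the links, and follows them, staying on the starting site:

        import urllib.request
        from html.parser import HTMLParser
        from urllib.parse import urljoin, urlparse

        class LinkFinder(HTMLParser):
            # Collects the href of every <a> tag on a page.
            def __init__(self):
                super().__init__()
                self.links = []

            def handle_starttag(self, tag, attrs):
                href = dict(attrs).get("href")
                if tag == "a" and href:
                    self.links.append(href)

        def crawl(start_url, limit=10):
            # Breadth-first crawl that never leaves the starting host.
            host = urlparse(start_url).netloc
            seen, queue = set(), [start_url]
            while queue and len(seen) < limit:
                url = queue.pop(0)
                if url in seen or urlparse(url).netloc != host:
                    continue
                seen.add(url)
                try:
                    req = urllib.request.Request(url, headers={"User-Agent": "MyLittleCrawler/0.1"})
                    page = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")
                except Exception:
                    continue  # skip pages that fail to load or decode
                print(url)
                finder = LinkFinder()
                finder.feed(page)
                queue.extend(urljoin(url, link) for link in finder.links)

        crawl("https://en.wikipedia.org/wiki/Web_crawler")

    In practice you'd also want to respect the site's robots.txt and pause between requests so you don't hammer their servers.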

    Obs on