
Author Topic: Farming Webpages .... any suggestions for data acquisition  (Read 3164 times)


DaveLembke

    Topic Starter


    Sage
  • Thanked: 662
  • Experience: Expert
  • OS: Windows 10
Farming Webpages .... any suggestions for data acquisition
« on: June 05, 2009, 10:12:34 AM »
I have been farming web sites for information through automated macros that dump the copied and pasted data into a database.

I am checking to see if anyone knows of an easier way to farm web sites for information than having to create a separate mouse/keyboard macro for each web site I want to pull data from for analysis.

Maybe a piece of software that interfaces with the HTML source of the web page itself and copies information to the clipboard after a specified flag. For example, if a web site states that it is 83 degrees out, its source will contain a field showing 83 degrees, and the tool would copy that value directly from the HTML source.
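To illustrate the sort of thing I'm after, here's a rough sketch of the idea in PHP (the URL and the pattern are invented just for the example):

Code:
<?php
// Fetch the raw HTML source of a hypothetical weather page
$html = file_get_contents('http://www.example.com/weather');

// The source might contain "83 degrees" somewhere; capture the number
if (preg_match('/(\d+) degrees/', $html, $match))
{
  echo $match[1]; // prints 83
}
?>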

Just want to also mention that I have NO NEED to capture keystrokes or screenshots, which could be used maliciously. I simply want an easier way to interface with static information these sites provide openly to the public, and have no need for dynamic data such as data entered by a user, which could be used for the wrong means.

Any suggestions greatly appreciated.

Rob Pomeroy



    Prodigy

  • Systems Architect
  • Thanked: 124
  • Experience: Expert
  • OS: Other
Re: Farming Webpages .... any suggestions for data acquisition
« Reply #1 on: June 06, 2009, 04:06:45 AM »
Hi Dave.  I do this kind of stuff from time to time.  I tend to use PHP, partly because I'm very familiar with it and partly because it's very good at processing text.  Here's the general process:

  • if the web site requires authentication, programmatically send credentials by whatever means it normally requires (HTTP authentication, POST variables, etc.) and allow cookies to be set
  • load page into variable
  • parse page using regular expression pattern matching
  • dump extracted data into database, text file, whatever

And here's an example:

Code:
<?php
/****************************************
 * BEGIN: Configure cURL                *
 ****************************************/
$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__).'/cookie.txt');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
/****************************************
 * END: Configure cURL                  *
 ****************************************/

// Log in by POSTing credentials; the session cookie lands in cookie.txt
$post = 'username=username&password=password';
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/login');
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);

if ($page = curl_exec($ch))
{
  // That was the login; now to retrieve the pages:
  $regexp = '|insert regexp here with (brackets around text we want to save)|isU';

  // Page to parse
  $url = "http://www.example.com/start";

  // Switch back to GET, otherwise cURL re-POSTs the credentials
  curl_setopt($ch, CURLOPT_HTTPGET, 1);

  // Load the page
  curl_setopt($ch, CURLOPT_URL, $url);
  $page = curl_exec($ch);

  // Find the desired text
  if (preg_match_all($regexp, $page, $result))
  {
    // do something with the matches
  }

  // Save the page
  if (file_put_contents("/somewhere/file.txt", $page))
  {
    echo "succeeded<br/>";
  } else {
    echo "FAILED<br/>";
  }

} else {
  echo "Login failed";
}
curl_close($ch);
?>
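By the way, the "do something with the matches" part would typically look like this; preg_match_all puts the full matches in $result[0] and the text captured by the first bracketed group in $result[1]:

Code:
foreach ($result[1] as $match)
{
  // e.g. insert into your database; here we just print each captured value
  echo $match . "<br/>";
}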
Only able to visit the forums sporadically, sorry.

Geek & Dummy - honest news, reviews and howtos

DaveLembke

    Topic Starter


    Sage
  • Thanked: 662
  • Experience: Expert
  • OS: Windows 10
Re: Farming Webpages .... any suggestions for data acquisition
« Reply #2 on: June 07, 2009, 11:56:01 PM »
Thanks Rob!

I am going to try this out.

Rob Pomeroy



    Prodigy

  • Systems Architect
  • Thanked: 124
  • Experience: Expert
  • OS: Other
Re: Farming Webpages .... any suggestions for data acquisition
« Reply #3 on: June 08, 2009, 06:10:32 AM »
Cool.  The hardest part is getting the regular expression right.  Let me know if you need any more help.

Out of interest, this technique is usually called "web scraping" or sometimes (inaccurately) "screen scraping".
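For example, if the page source contained <span id="temp">83 degrees</span>, a pattern along these lines would capture the number (that markup is invented, so adapt it to whatever the real page actually contains):

Code:
$regexp = '|<span id="temp">(\d+) degrees</span>|isU';
if (preg_match($regexp, $page, $match))
{
  echo $match[1]; // prints 83
}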