  • Free Script: Keyword and Web Page Data Scraper!

    OK, new script for everybody. You load it with a list of URLs, and it scrapes the content off them, minus the HTML formatting (in other words, prepped for use).

    For every URL entered, three things happen:

    1. The keywords are scraped off the site, using whatever you enter in the “keyword” form field as the word to look for, and appended to ./cache/category.keywords.txt
    2. A copy of the site is saved to ./cache/category/HASH.RawData.txt with everything stripped except paragraphs and line breaks (both stored as <br> tags). HASH is the MD5 of the URL.
    3. A copy of the site with no HTML tags, split up into lines of random length (in other words, prepped for a scraper/cloaker site), is saved to ./cache/category/HASH.scraperData.txt
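
To make the cleanup and the file naming concrete, here is a minimal sketch. The helper names cleanMarkup() and outputPaths() and the sample input are made up for illustration and are not part of basicScraper.php; the regexes, strip_tags() call, and md5()-based paths follow what the script does.

```php
<?php
// Hypothetical helper (not in basicScraper.php): mirrors the tag cleanup
// applied before the raw-data copy is saved.
function cleanMarkup(string $html): string
{
    // Drop <script> and <style> blocks together with their contents.
    $html = preg_replace(
        ['@<script[^>]*?>.*?</script>@si', '@<style[^>]*?>.*?</style>@si'],
        '',
        $html
    );
    // Keep only paragraph and line-break tags.
    $html = strip_tags($html, '<br><p>');
    // Normalise every kept tag to a plain lowercase <br>.
    return preg_replace('@</?p[^>]*>|<br\s*/?>@i', '<br>', $html);
}

// Hypothetical helper: builds the three output paths used for one URL.
function outputPaths(string $category, string $url): array
{
    $hash = md5($url);
    return [
        'keywords' => "./cache/$category.keywords.txt",
        'raw'      => "./cache/$category/$hash.RawData.txt",
        'scraper'  => "./cache/$category/$hash.scraperData.txt",
    ];
}

// Made-up sample input, purely for demonstration.
$sample = '<div><p>Cheap <b>blue</b> widgets</p><script>alert(1);</script>Buy now<br /></div>';
echo cleanMarkup($sample), "\n";
print_r(outputPaths('widgets', 'http://example.com/'));
```

Because the file name is a hash of the URL, scraping the same URL twice appends to the same files instead of creating new ones.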

    No question, it’s a quick and dirty script, and it’s not meant to be a standalone tool. It’s meant to be modified if you intend to use it for scraping. If you only want it for keywords, it’ll work fine as is. Just make sure to ONLY USE ONE KEYWORD IN THE KEYWORD FORM.
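
Why only one keyword? The script splits each line on spaces and compares single words against the keyword, so a keyword containing a space can never match. A simplified sketch of that matching step follows; pairsForKeyword() and the sample line are made up for illustration, and the real script also applies a minimum word length.

```php
<?php
// Hypothetical, simplified version of the matching step: split a line on
// spaces and pair each occurrence of the keyword with the word before it.
function pairsForKeyword(string $line, string $keyword): array
{
    $words = explode(' ', $line);
    $pairs = [];
    foreach ($words as $i => $word) {
        if ($i > 0 && strtolower($word) === strtolower($keyword)) {
            $pairs[] = $words[$i - 1] . ' ' . $word;
        }
    }
    return $pairs;
}

$line = 'Buy cheap widgets and blue widgets online';
print_r(pairsForKeyword($line, 'widgets'));      // finds two pairs
print_r(pairsForKeyword($line, 'blue widgets')); // finds nothing: no single word contains a space
```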

    Just put it on a server that allows outbound connections and has a decent max_execution_time. The more sites you put in, the longer it takes, so start out with just 3-4 and see what your server can take.
    Then load up basicScraper.php in your web browser, and life is good!
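
If you are not sure whether your server qualifies, a quick pre-flight check along these lines may help. This is only a sketch: scraperEnvironment() is a made-up helper, the 300-second limit is an arbitrary assumption, and some hosts ignore set_time_limit() entirely.

```php
<?php
// Hypothetical pre-flight check, not part of basicScraper.php.
function scraperEnvironment(): array
{
    return [
        // file_get_contents() can only fetch URLs when this setting is on.
        'allow_url_fopen'    => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // Seconds a script may run; 0 means no limit.
        'max_execution_time' => (int) ini_get('max_execution_time'),
    ];
}

$env = scraperEnvironment();
if (!$env['allow_url_fopen']) {
    echo "allow_url_fopen is off, so remote pages cannot be fetched.\n";
}
// Assumption: five minutes is enough for a handful of URLs.
set_time_limit(300);
print_r($env);
```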

    Here’s the script:
    <center><h3><b><a href="http://www.slightlyshadyseo.com"><font face="arial">The Quick And Dirty Keyword+Content Scraper</font></a></b></h3></center>
    <?php
    srand(make_seed());
    if(!isset($_POST['sites']) || !isset($_POST['category']) || !isset($_POST['keyword']))
    {
    ?>
    <form action="<?php echo htmlspecialchars($_SERVER["REQUEST_URI"]); ?>" method="POST">
    <table border="2">
    <tr><td colspan="2"><b><font face="arial">SlightlyShady KeyWord+Data Scraper</font></b></td></tr>
    <tr><td><b>Keyword</b></td><td><input type="text" name="keyword"></td></tr>
    <tr><td><b>Category</b></td><td><input type="text" name="category"></td></tr>
    <tr><td><b>URLs</b></td><td><textarea name="sites">Put your URLs Here</textarea></td></tr>
    <tr><td colspan="2"><input type="Submit" value="Scrape"></td></tr>
    </table>
    </form>
    <?php
    }
    else
    {
        $category=basename($_POST['category']);//basename() keeps the category from pointing outside ./cache/
        $keyword=$_POST['keyword'];
        if(stristr($keyword, " ")!==FALSE)
        {
            die("<font color=\"red\">Only Put ONE word in the keyword form</font>");
        }
        $urls=$_POST['sites'];
        if(!file_exists("./cache/"))
        {
            mkdir("./cache/");
        }
        if(!file_exists("./cache/$category/"))
        {
            mkdir("./cache/$category/");
        }
        $spl=explode("\n", $urls);
        echo sizeof($spl)." URLs loaded<br>";
        for($i=0; $i<sizeof($spl); $i++)
        {
            $spl[$i]=trim($spl[$i]);//goddamn \r
            if($spl[$i]=="")
            {
                continue;//skip blank lines in the URL box
            }
            $data=file_get_contents($spl[$i]);
            if($data===false)
            {
                echo "Could not fetch ".htmlspecialchars($spl[$i])."<br>";
                continue;
            }
            //kill scripts and stylesheets not killed by strip_tags (no U modifier here: it would make .*? greedy)
            $search = array('@<script[^>]*?>.*?</script>@si','@<style[^>]*?>.*?</style>@si');
            $data = preg_replace($search, '', $data);
            $data=strip_tags($data,"<br><p>");//keep the paragraphing and line breaks
            $data=str_ireplace("<p>","<br>",$data);//switch to <br> to make sorting data easier
            $data=str_ireplace("</p>","<br>",$data);//switch to <br> to make sorting data easier
            $data=str_ireplace("<br />","<br>",$data);//switch to <br> to make sorting data easier
            $data=str_ireplace("<br>","<br>",$data);//lowercase any <BR>, since explode() is case sensitive

            $handle=fopen("./cache/$category/".md5($spl[$i]).".RawData.txt","a");
            fwrite($handle,$data);
            fclose($handle);

            $data=str_replace("\r","",$data);//clear the extra Windows carriage returns
            $data=str_replace("\n","<br>",$data);//replace line breaks with html new line to break up the data
            $data=strip_tags($data,"<br>");//strip everything but the <br> tags, including any <p> tags with attributes that survived above
            writeData($data,"./cache/$category/".md5($spl[$i]).".scraperData.txt", $keyword);
            getKeywords($data,$keyword,$category,$spl[$i]);
            echo "Wrote raw data to ./cache/$category/".md5($spl[$i]).".RawData.txt<br>";
        }
    }
    function getKeywords($data,$key, $category,$url)
    {
        $spl=explode("<br>", $data);
        $keyword=array();
        for($i=0; $i<sizeof($spl); $i++)
        {
            if(strlen($spl[$i])>3)
            {
                $spl2=explode(" ",$spl[$i]);
                $tmpKey=getKeywordsFromLine($spl2,$key);
                for($j=0; $j<sizeof($tmpKey); $j++)
                {
                    if(!contains($keyword,$tmpKey[$j]))//skip duplicates
                    {
                        $keyword[sizeof($keyword)]=$tmpKey[$j];
                    }
                }
            }
        }
        if(sizeof($keyword)>0)//only write anything if at least one keyword was found
        {
            echo "<table border=2>";
            echo "<tr><td><b>Keywords from ".$url."</b></td></tr>";
            $handle=fopen("./cache/$category.keywords.txt","a");
            fwrite($handle,"Keywords From ".$url."\r\n");
            for($i=0; $i<sizeof($keyword); $i++)
            {
                fwrite($handle,$keyword[$i]."\r\n");
                echo "<tr><td>$keyword[$i]</td></tr>";
            }
            fwrite($handle,"\r\n\r\n");
            fclose($handle);
            echo "</table>";
        }
    }
    function contains($array, $key)
    {
        for($i=0; $i<sizeof($array); $i++)
        {
            if(strtolower(trim($array[$i]))==strtolower(trim($key)))
            {
                return(true);
            }
        }
        return(false);
    }
    function getKeywordsFromLine($spl,$key)
    {
        $keywords=array();
        for($i=0; $i<sizeof($spl); $i++)
        {
            if(strtolower($spl[$i])==strtolower($key))
            {
                if($i!=0)
                {
                    //keyword is not the first word: pair it with the word before it
                    if(strlen($spl[$i-1])>2)
                    {
                        //echo "Keyword:".$spl[$i-1]." ".$spl[$i]."<br>";
                        $keywords[sizeof($keywords)]=$spl[$i-1]." ".$spl[$i];
                    }
                }
                else if($i!=sizeof($spl)-1)
                {
                    //keyword is the first word: pair it with the word after it, keeping the original word order
                    if(strlen($spl[$i+1])>2)
                    {
                        //echo "Keyword:".$spl[$i]." ".$spl[$i+1]."<br>";
                        $keywords[sizeof($keywords)]=$spl[$i]." ".$spl[$i+1];
                    }
                }
            }
        }
        return($keywords);
    }
    function writeData($data, $path, $keyword)
    {
        if(strlen($data)<=5)//we don't want no 0kb files
        {
            return;
        }
        echo "Writing Scraper Data to ".$path."<br>";
        $handle=fopen($path,"a+");
        $spl=explode("<br>",$data);
        for($i=0; $i<sizeof($spl); $i++)
        {
            if(strlen($spl[$i])>5)
            {
                $spl2=explode(" ",$spl[$i]);
                //echo sizeof($spl2)." words in spl2<br>";
                $sinceBreak=0;
                $curLine="";
                for($j=0; $j<sizeof($spl2); $j++)
                {
                    //make sure we have at least three words, then randomly decide when to break the data (a 1-in-11 chance per word)
                    if($sinceBreak>2 && rand(0,10)==3 && strlen(trim($curLine))>4)
                    {
                        $curLine=$curLine." ".trim($spl2[$j]);
                        fwrite($handle, trim($curLine)."\r\n");//write curLine to the text file
                        $sinceBreak=0;//reset our word count
                        $curLine="";
                    }
                    else
                    {
                        $curLine=$curLine." ".trim($spl2[$j]);
                        $sinceBreak++;
                    }
                }
                if(strlen(trim($curLine))>5)//if there's substantial data in curLine, write it before continuing.
                {
                    fwrite($handle, trim($curLine)."\r\n");
                    $curLine="";
                }
            }
        }
        fclose($handle);
    }

    function make_seed()
    {
        list($usec, $sec) = explode(' ', microtime());
        //srand() expects an integer, so cast the combined value
        return (int) ((float) $sec + ((float) $usec * 100000));
    }
    ?>

    Download it directly, with formatting, from here: Web Site Scraper (change the file extension to .php)

    © Slightly Shady SEO, All Rights Reserved. Scrape me, and I will eat your soul.