The Most Convenient Way to Extract URLs from a Webpage Using PHP


What is URL Extraction?

The most user-friendly definition of URL extraction is the process of fetching all the links from a web page. Technically, it means extracting the href attribute from anchor tags or the src attribute from img tags.
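As a minimal sketch of that definition, the snippet below reads the href of each anchor tag and the src of each img tag. The markup in $html is a hypothetical sample used in place of a live page, and the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical sample markup, used here instead of fetching a live page.
$html = '<html><body>'
      . '<a href="http://example.com/about">About</a>'
      . '<img src="http://example.com/logo.png" alt="Logo">'
      . '</body></html>';

$dom = new DOMDocument();
// @ suppresses the warnings libxml can raise on imperfect HTML.
@$dom->loadHTML($html);

// Collect the href attribute of every anchor tag.
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

// Collect the src attribute of every img tag.
$srcs = [];
foreach ($dom->getElementsByTagName('img') as $image) {
    $srcs[] = $image->getAttribute('src');
}

print_r($hrefs);
print_r($srcs);
```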

URL extraction is used in many cases, for example to generate a Sitemap.xml file or to display all the images of a web page. The PHP code for URL extraction runs on the server, and the response it generates is displayed in the browser.
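To make the sitemap use case concrete, here is a rough sketch that turns a list of URLs into sitemap XML with PHP's XMLWriter extension. The $urls array is a hypothetical stand-in for URLs that were already extracted, and only the basic urlset, url, and loc elements of the sitemap format are written.

```php
<?php
// Hypothetical list of already-extracted URLs.
$urls = ['http://example.com/', 'http://example.com/about?x=1&y=2'];

$writer = new XMLWriter();
$writer->openMemory();
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');

// Root element of the sitemap, with the sitemap namespace.
$writer->startElement('urlset');
$writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($urls as $url) {
    $writer->startElement('url');
    // writeElement() escapes characters such as & in the URL.
    $writer->writeElement('loc', $url);
    $writer->endElement();
}

$writer->endElement();
$writer->endDocument();

$xml = $writer->outputMemory();
echo $xml;
```

The string in $xml could then be saved as Sitemap.xml, for example with file_put_contents().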

Personally, I do not recommend the procedure below for extracting links (URLs), because it uses a for loop, which is generally considered slower than a foreach loop. A foreach loop iterates directly over the collection of elements, without an index lookup on every pass, and so tends to finish in less time.

The usual way of extracting URLs from a web page (not recommended)


<?php
// store URL in php variable
$fetchurl = "http://example.com";
// get URL content using file_get_contents function and store content in php variable
$urlContent = file_get_contents($fetchurl);

$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

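// start with an empty string so that .= does not append to an undefined variable
$urlsList = '';
for ($i = 0; $i < $hrefs->length; $i++) {
  $href = $hrefs->item($i);
  $url = $href->getAttribute('href');
  // sanitize the extracted url inside loop
  $url = filter_var($url, FILTER_SANITIZE_URL);
  // anchor text, available if you want to display it instead of the URL
  $link_title = $href->nodeValue;
  // keep only absolute, well-formed URLs
  if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    // escape the URL before placing it in HTML output
    $safeUrl = htmlspecialchars($url, ENT_QUOTES);
    $urlsList .= '<li><a href="'.$safeUrl.'" target="_blank">'.$safeUrl.'</a></li>';
  }
}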
?>


Below is an improved version of the code above. It is a faster and more convenient way. The extracted URLs can even be stored in a PDF, CSV, or TXT file. Let us know in the comment section, and we will publish a new article on how to save extracted URLs to any file format.

The most convenient example of extracting links (URLs) from any web page (recommended by infoconic)


<?php
// store URL in php variable
$fetchurl = "http://example.com";
// get URL content using file_get_contents function and store content in php variable
$urlContent = file_get_contents($fetchurl);

$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

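// start with an empty string so that .= does not append to an undefined variable
$urlsList = '';
// Iterate over the extracted links and collect their URLs
foreach ($hrefs as $link) {
  $url = $link->getAttribute('href');
  $url = filter_var($url, FILTER_SANITIZE_URL);
  // anchor text, available if you want to display it instead of the URL
  $link_title = $link->nodeValue;
  // keep only absolute, well-formed URLs
  if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    // escape the URL before placing it in HTML output
    $safeUrl = htmlspecialchars($url, ENT_QUOTES);
    $urlsList .= '<li><a href="'.$safeUrl.'" target="_blank">'.$safeUrl.'</a></li>';
  }
}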
?>
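As a small illustration of storing the extracted URLs in a TXT file, the sketch below writes one URL per line. The $urls array and the output path are hypothetical. In practice the array would be filled inside the foreach loop instead of building an HTML list.

```php
<?php
// Hypothetical list of validated URLs, standing in for the loop's results.
$urls = ['http://example.com/', 'http://example.com/contact'];

// Hypothetical output file in the system temp directory.
$txtPath = sys_get_temp_dir() . '/extracted-urls.txt';

// Write one URL per line; file_put_contents() returns false on failure.
$bytes = file_put_contents($txtPath, implode(PHP_EOL, $urls) . PHP_EOL);

if ($bytes === false) {
    echo 'Could not write ' . $txtPath . PHP_EOL;
} else {
    echo 'Saved ' . count($urls) . ' URLs to ' . $txtPath . PHP_EOL;
}
```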


Another way to extract links from a web page is the getElementsByTagName() method of DOMDocument. The getElementsByTagName() method is great, but on its own it simply returns every tag with the given name in the document; it offers no expression syntax for selecting only the tags inside a specific HTML element. With DOMXPath and its evaluate() method, tags can be extracted from inside any specific HTML element of a web page.

<?php
// store URL in php variable
$fetchurl = "http://example.com";
// get URL content using file_get_contents function and store content in php variable
$urlContent = file_get_contents($fetchurl);

$dom = new DOMDocument();
@$dom->loadHTML($urlContent);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$hrefs = $dom->getElementsByTagName('a');

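// start with an empty string so that .= does not append to an undefined variable
$urlsList = '';
// Iterate over the extracted links and collect their URLs
foreach ($hrefs as $link) {
  $url = $link->getAttribute('href');
  $url = filter_var($url, FILTER_SANITIZE_URL);
  // anchor text, available if you want to display it instead of the URL
  $link_title = $link->nodeValue;
  // keep only absolute, well-formed URLs
  if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    // escape the URL before placing it in HTML output
    $safeUrl = htmlspecialchars($url, ENT_QUOTES);
    $urlsList .= '<li><a href="'.$safeUrl.'" target="_blank">'.$safeUrl.'</a></li>';
  }
}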
?>
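The difference between the two approaches can be sketched on a small piece of markup. The HTML string and the id "menu" below are hypothetical. The XPath expression is only one possible way to limit the match to a single element.

```php
<?php
// Hypothetical sample markup with one link inside div#menu and one outside it.
$html = '<html><body>'
      . '<div id="menu"><a href="http://example.com/home">Home</a></div>'
      . '<p><a href="http://example.com/post">Post</a></p>'
      . '</body></html>';

$dom = new DOMDocument();
// @ suppresses the warnings libxml can raise on imperfect HTML.
@$dom->loadHTML($html);

// getElementsByTagName() on the document returns every anchor tag.
$allLinks = $dom->getElementsByTagName('a');

// An XPath expression can restrict the match to anchors inside div#menu.
$xpath = new DOMXPath($dom);
$menuLinks = $xpath->evaluate('//div[@id="menu"]//a');

echo 'All links: ' . $allLinks->length . PHP_EOL;
echo 'Menu links: ' . $menuLinks->length . PHP_EOL;
```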
