How to clean a URL with PHP for use as a canonical URL?

Member

by aubrey , in category: PHP General , a year ago

2 answers

Member

by gilbert , a year ago

@aubrey 

To clean a URL for canonicalization using PHP, you can follow these steps:

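  1. Get the current URL. Note that $_SERVER['REQUEST_URI'] holds only the path and query string, so the scheme and host have to be prepended if you need an absolute URL.
  2. Remove the query string from the URL, locating it with the parse_url function.
  3. Remove any trailing slashes from the URL using the rtrim function.
  4. Convert the URL to lowercase using the strtolower function. Only do this if your paths are case-insensitive; on many servers /Page and /page are different resources.
  5. Normalize the scheme so that every variant of a page maps to one URL. The example replaces https:// with http://; if your site is served over HTTPS, normalize in the opposite direction.
  6. Apply any additional URL cleaning steps based on your specific requirements.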


Here's an example implementation:

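function cleanURL($url) {
  // Remove the query string (the "?" and everything parse_url reports as the query)
  $parsedURL = parse_url($url);
  $query = isset($parsedURL['query']) ? $parsedURL['query'] : '';
  $url = str_replace('?' . $query, '', $url);

  // Remove trailing slashes, but keep a bare "/" for the site root
  $url = rtrim($url, '/');
  if ($url === '') {
    $url = '/';
  }

  // Convert to lowercase (only safe if your paths are case-insensitive)
  $url = strtolower($url);

  // Replace https:// with http://; swap the two if your site is served over HTTPS
  if (strpos($url, 'https://') === 0) {
    $url = 'http://' . substr($url, 8);
  }

  return $url;
}

// Example usage: REQUEST_URI holds only the path and query string,
// so prepend the scheme and host to get an absolute URL
$scheme = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
$currentURL = $scheme . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
$cleanedURL = cleanURL($currentURL);
echo htmlspecialchars($cleanedURL);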


This example produces a cleaned URL that you can use as the canonical URL for the page, for instance in a <link rel="canonical"> tag in the page head or in a Link HTTP response header.
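
As a rough sketch of how a cleaned URL could be emitted in a tag, something like the following might work. The helper name canonicalLinkTag and the example.com URL are hypothetical and not part of the answer above; the helper only escapes the URL and wraps it in the tag.

```php
<?php
// Hypothetical helper: wraps an already-cleaned URL in a canonical <link> tag.
function canonicalLinkTag(string $url): string {
  // Escape the URL so quotes or ampersands cannot break out of the attribute
  $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
  return '<link rel="canonical" href="' . $safe . '">';
}

// Inside the <head> of the page:
echo canonicalLinkTag('http://example.com/blog/post'), "\n";

// Alternative for non-HTML responses (must be sent before any output):
// header('Link: <http://example.com/blog/post>; rel="canonical"');
```

Escaping matters here because the host and path of the current request come from the client and should not be echoed into HTML unmodified.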

Member

by domenico , a year ago

@aubrey 

Please note that the example provided in the previous answer is incomplete and insufficient for proper URL canonicalization. URL canonicalization involves more steps than simply removing query string parameters, removing trailing slashes, converting to lowercase, and replacing "https://" with "http://". Here is an updated and more comprehensive approach:

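function cleanURL($url) {
  // Keep only the scheme, host, and path (this drops the query string and fragment)
  $parsedURL = parse_url($url);
  if (!isset($parsedURL['scheme'], $parsedURL['host'])) {
    return null; // an absolute URL is required
  }
  $path = isset($parsedURL['path']) ? $parsedURL['path'] : '';
  $url = $parsedURL['scheme'] . '://' . $parsedURL['host'] . $path;

  // Remove default index page (e.g., "index.php" or "index.html")
  $url = preg_replace('#/index\.[a-zA-Z]+$#', '/', $url);

  // Remove trailing slashes (done after the index page so both forms end up identical)
  $url = rtrim($url, '/');

  // Convert to lowercase (only safe if your paths are case-insensitive)
  $url = strtolower($url);

  return $url;
}

// Example usage: REQUEST_URI holds only the path and query string,
// so prepend the scheme and host to get an absolute URL
$scheme = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
$currentURL = $scheme . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
$cleanedURL = cleanURL($currentURL);
echo htmlspecialchars((string) $cleanedURL);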


This updated implementation rebuilds the URL from its scheme, host, and path, which drops the query string, and it also removes a default index page (e.g., "index.php" or "index.html") using a regular expression (preg_replace). Because it needs the scheme and host, pass it an absolute URL rather than a bare request path. Additional customizations might still be necessary depending on your requirements and the URLs you're working with.
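
One customization that often comes up is keeping a small set of query parameters (a pagination parameter, say) while dropping tracking ones. The following is a minimal sketch under that assumption; the function name canonicalURL, the $keepParams allow-list, and the example.com URL are all hypothetical and not part of either answer above. Unlike the examples above, it lowercases only the scheme and host and leaves the path's case alone.

```php
<?php
// Hypothetical variant: canonicalizes an absolute URL but keeps an allow-list
// of query parameters. Port, user info, and fragment are ignored in this sketch.
function canonicalURL(string $url, array $keepParams = []): ?string {
  $parts = parse_url($url);
  if (!isset($parts['scheme'], $parts['host'])) {
    return null; // not an absolute URL
  }

  // Scheme and host are case-insensitive, so lowercase only those
  $base = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);

  // Drop a default index page, then any trailing slashes
  $path = isset($parts['path']) ? $parts['path'] : '';
  $path = preg_replace('#/index\.[a-zA-Z]+$#', '/', $path);
  $path = rtrim($path, '/');

  // Keep only allow-listed parameters, sorted so their order does not matter
  $query = [];
  if (isset($parts['query'])) {
    parse_str($parts['query'], $query);
  }
  $query = array_intersect_key($query, array_flip($keepParams));
  ksort($query);

  $result = $base . $path;
  if ($query) {
    $result .= '?' . http_build_query($query);
  }
  return $result;
}

// prints https://example.com/Shop?page=2
echo canonicalURL('HTTPS://Example.com/Shop/index.php?utm_source=x&page=2', ['page']), "\n";
```

Sorting the kept parameters means that ?a=1&b=2 and ?b=2&a=1 map to the same canonical URL, which is usually what you want when the goal is to avoid duplicate-content variants.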