robots.txt is a common file found webservers. This file is meant to instruct web crawlers such as
# robots.txt for http://www.wikipedia.org/ and friends # # Please note: There are a lot of pages on this site, and there are # some misbehaved spiders out there that go _way_ too fast. If you're # irresponsible, your access to the site may be blocked. # # Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN # and ignoring 429 ratelimit responses, claims to respect robots: # http://mj12bot.com/ User-agent: MJ12bot Disallow: / # advertising-related bots: User-agent: Mediapartners-Google* Disallow: /
Using View Page Source in the browser or
curl from the command line can reveal
html comments in the webpage which may provide useful information
╰─⠠⠵ curl http://127.0.0.1/inde.php <html> <head> <title>Test Site</title> </head> <body> <!-- Do not forget --> <!-- password: qwertyuiop --> </body> </html
Firefox & Chrome you can pre-pend the URL with
Mirroring a website¶
Sometimes it can be useful to mirror a website locally, this can be done using
╰─○ wget -m -k --no-parent https://example.com/ --2021-03-07 01:34:44-- https://example.com/ Resolving example.com (example.com)... 18.104.22.168, 2606:2800:220:1:248:1893:25c8:1946 Connecting to example.com (example.com)|22.214.171.124|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 1256 (1.2K) [text/html] Saving to: ‘example.com/index.html’ example.com/index.html 100%[=======================================================================================>] 1.23K --.-KB/s in 0s 2021-03-07 01:34:45 (22.4 MB/s) - ‘example.com/index.html’ saved [1256/1256] FINISHED --2021-03-07 01:34:45-- Total wall clock time: 0.5s Downloaded: 1 files, 1.2K in 0s (22.4 MB/s) Converting links in example.com/index.html... nothing to do. Converted links in 1 files in 0 seconds.
The above flags I use by default basically create a local version of the website.
-m / --mirror Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing. --no-parent Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. -k / --convert-links After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.