Linklint Documentation - Known Bugs
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Bug Reports · Memory · Parsing · Won't Check

Linklint has been used on hundreds (maybe thousands) of sites around the world for over four years. When it was initially released (in 1997) many bugs were reported and fixed. After that, there were only a small handful of bug reports (all of which were fixed in release 2.3).

If you read all of the documentation and then start using Linklint there is a very high probability that it will work correctly. In the spirit of Perl, Linklint has been designed to "do the right thing".

Bug Reports
 (top)  (command index)  (topic index)
If you think you have found a bug, please let us know at bugs@linklint.org.

It is often helpful if you include:

Requests for information and comments are also welcomed. You can send these to info@linklint.org.

Requests for information that is already covered in the documentation, particularly requests that start out with "I don't have time to read the documentation", may not get a response.

Memory Problems
 (top)  (command index)  (topic index)
No doubt about it, Linklint can be a big memory hog (for large sites). We have received several reports of Linklint being unable (or unwilling) to write the HTML output files after doing a site check. These have all been traced back to memory problems.

We optimized Linklint for speed, and it uses a lot of memory in order to be fast. We have had one report of a huge difference in memory consumption depending on the operating system Linklint is running under: a Sun system reportedly used roughly one-seventh of the memory that Linux did when checking the same site (68M on the Sun system versus about 464M on Linux).

There is a clear need for:

(a) a re-write in Object Oriented Perl 5
This could significantly reduce memory usage with little (or no) sacrifice in speed.

(b) hooks to a database back-end
This would virtually eliminate the memory problems but there could be a significant reduction in speed.

In the meantime, if you are running out of memory there are several hints that will help ease this problem.

Parsing Problems
 (top)  (command index)  (topic index)
Linklint parses HTML files as fast as possible. It was designed to be a fast link checker, not an HTML validator. Originally Linklint was written "to spec", but after a flurry of bug reports (and fixes) it now does a very good job of emulating a Netscape browser, including many of that browser's idiosyncrasies.

One trick that lets Linklint parse HTML quickly is to use the "<" character as the effective newline, so HTML files are split on "<" as they are read (a minimal sketch of this approach follows the list below). This is very fast but has two downsides.

  1. (minor) If an HTML tag contains a bare "<", as in <img src=back.gif alt="<<<">, then Linklint will need to do some backtracking (which will slow it down a bit). Solution: always use &lt; instead of < inside of HTML tags. Since a bare "<" inside of HTML tags is uncommon, Linklint remains very fast "on average".

  2. (occasional problem) If an HTML tag is missing its closing ">", or if you have a bare "<" character in the text of your page, Linklint will read in everything in the file up to the next ">" (or the end of file) as a single tag. This can cause a memory problem (see above). These problems are rare and can usually be tracked down quickly.
    Solutions:

    1. Use Weblint to check the HTML in your pages before link checking.
    2. Use the log.txt Linklint output file to track down the offending page (it is usually the last page checked).
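Here is a minimal sketch, in Perl, of the split-on-"<" technique described above. This is an illustration only, not Linklint's actual code:

use strict;
use warnings;

# Parse HTML by splitting the file on "<" instead of on newlines.
local $/;                                  # slurp the whole file at once
my $html = <>;
my ($text, @chunks) = split /</, $html;    # text before the first "<" is not a tag
for my $chunk (@chunks) {
    # Everything up to the first ">" is the inside of one tag.  If the ">"
    # is missing, this pattern swallows everything up to the next ">" (or
    # fails at end of file) -- exactly the failure mode described above.
    next unless $chunk =~ /^(.*?)>/s;
    print "TAG: <$1>\n";
}

Run it as perl parse_sketch.pl page.html to list the tags found in page.html.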
What Linklint Won't Check
 (top)  (command index)  (topic index)
Linklint will not be able to track down links that require specific visitor form input. For example, a search engine site uses form input from visitors to generate new HTML pages. These pages would not be checked by Linklint. Likewise, if you have written your own CGI program that uses the ISMAP attribute to create new links depending on where a visitor clicks on an image, Linklint will not be able to find these links.

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Bug Reports · Memory · Parsing · Won't Check

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - index
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Command index · Topic index

The command index links every input command to the section in the documentation where that command can be found.

The topic index gives a quick overview of all of the documentation and links to all of the sections.

Command Index
 (top)  (command index)  (topic index)
-cache directory -case -checksum -concise_url
-db1..9 -delay d -doc -docbase base -dont_output xxxx
-error
-flush -forward
-help -help_all -host hostname:port -host hostname -htmlonly -http -http_header "Name:value"
-ignore ignoreset -index file
-language zz -limit n -list -local linkset
-map /a=[/b]
-netmod -netset -no_anchors -no_query_string -no_warn_index
-orphan -out file -output_frames -output_index filename
-password realm user:password -proxy hostname[:port]
-quiet
-redirect -retry
-silent -skip skipset
-textonly -timeout t
-url_doc_prefix url/
-version
-warn
-xref
Topic Index
 (top)  (command index)  (topic index)

Inputs

Input Files: Command Files · Reading Commands from STDIN · Files of Local Pages · Files of Remote Links
Which Files to Check: Linksets defined · Other File Selection Options
Local Site Checking: Other Local Site Options
HTTP Site Checking: HTTP Site Check Options
Remote URL Checking: Which URLs to check · Other Remote URL Options · Status Cache Options
Output Options: Multi File Output · Single File Output
Debug and other Flags: Debug Flags · Other Flags


Outputs

-Doc Directory
Site-Check Output Files: Site-Check Summary Files · Site-Check Data Files
Url-Check Output Files: Url-Check Summary Files · Url-Check Data Files


Hints

Create a Command File
Resolving Memory Problems: Check Your Site in Sections · Run Linklint Twice · Use the -no_anchors option
Add Passwords
Add Server-Side Image Maps
Tracking Down Errors
Server Redirection


How it works

Creating Seeds for a Site Check
Site Check Recursion
Parsing HTML Files
Resolving Links
Default Index Files
Server-side Image Maps
How the Status Cache Works


Bugs

Bug Reports
Memory Problems
Parsing Problems
What Linklint Won't Check

Other

GNU General Public License
Common Country/Language Codes

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Command index · Topic index

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - hints
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Cmd files · Memory · Passwords · Image maps · Errors · Redirection

This page contains detailed hints for configuring Linklint for web sites that need to make use of most of the features of Linklint. Some sections here may not be applicable to your site. These hints are intended as suggestions to help you quickly get started checking links. Your mileage may vary.
  • Use the /@ linkset to check your entire site.
  • Use -limit NNN to check more than 500 HTML files.
  • Always use the -doc dir option.
  • Consider using the -docbase option if you are doing a local site check.
  • Read all the documentation.
Create a Command File
 (top)  (command index)  (topic index)
Before checking your site, take the time to put some information about your site in a command file. This will avoid a lot of retyping (and possible typos) later on. It often makes sense to name the command file after your host name. The command file should look like:
# general command file for hostname

-host www.hostname.com
-root /absolute/path/to/your/htmlrootdirectory
-doc linkdoc
-http
-limit 1000
Resist the temptation to include any linksets in this command file. The reason will become clear when you start tracking down broken links. If you need to use a large list of linksets, another option is to include these in their own separate command file.

You can check your home page with linklint @hostname. You can check your entire site with linklint @hostname /@.

Often the easiest way to understand some of the many features that Linklint has to offer is to try them out. Linklint is very fast and it is easy to play around with it on just a few pages. Start with a simple command file like the example above and then add features and options as needed. In the spirit of Perl, Linklint has been designed to "do the right thing".

Resolving Memory Problems
 (top)  (command index)  (topic index)
Here are some things you can do to reduce the amount of memory that Linklint uses.

Check Your Site in Sections

If you have a very large site (thousands of pages), it might make sense to split your site up into several sections and check each section separately. One way to do this is to check all the files in the root directory first and then check the files in each subdirectory.

Note: Linklint is designed so that all links between the sections will be checked correctly. Currently, the output files for each section will not be merged.

Create a command file named root for checking the root directory:

# root directory command file
@hostname
-doc rootdoc
/#
For each subdirectory (or group of subdirectories) create a command file named subdir:
# command file for subdir
@hostname
-doc subdirdoc
/subdir/@
Now you can check just your root directory with linklint @root and each subdirectory with linklint @subdir and the results will be kept in separate output directories.

Run Linklint Twice

If you have a large site, don't use the -net option when you are checking your site. Instead, after you check your site (without -net), run Linklint again as:

linklint -doc doc_dir @@

You will end up with the same results as with a single pass of Linklint but the memory requirements will have been eased.

Use the -no_anchors option

Since there are often many named anchors on a single page, the list of named anchors that Linklint generates and checks can be larger than the list of HTML pages. You can use the -no_anchors option to tell Linklint to ignore named anchors which should reduce memory consumption.

Add Passwords
 (top)  (command index)  (topic index)
If you get warning messages that say need password for "realm", you will have to provide Linklint with a username and password for each password protected realm. Add these lines to your hostname file:
-password "realm1" username1:password1
-password "realm2" username2:password2
The realms are double quoted in the warning messages. You will have to use double quotes in the command file if the realm contains any space characters. You can also use the realm "DEFAULT" to provide a default username and password. The default will be tried only if a password for the specific realm was not given. Once you have made these changes to your command file, check the site again to make sure that you entered all the information correctly. You will get warning messages for invalid username/password combinations.

Note: The HTTP protocol uses a named realm to identify a set of pages that share a common set of username/password combinations. This system was created so that visitors only need to be prompted for their username and password once (per session) in order to browse any number of pages in a given realm. Realms are often used to protect all the files under a particular subdirectory, but they can be used in other configurations.

Add Server-Side Image Maps
 (top)  (command index)  (topic index)
If your site makes use of server-side image maps, you may have to add a -map option to your command file so Linklint knows how to find your .map files. See Server-Side Image Maps for a detailed explanation. You may have to add one of the following lines to your hostname file:
-map /cgi-bin/imagemap
-map /cgi-bin/imagemap.exe
-map /cgi-bin/htimage
You will also need to have the -root directory specified so Linklint knows where to look for map files locally on your machine.
Tracking Down Errors
 (top)  (command index)  (topic index)
Sometimes the error messages generated by Linklint do not provide sufficient information for figuring out why an error was reported. In these cases it can be useful to look at the HTML tags that caused the errors. One way to see these tags is to use the -db3 flag. This flag causes all HTML tags that contain links to be printed out followed by the fully expanded links.

Here is one strategy for tracking down errors:

  1. Look in the errorF.txt or errorX.txt file to find the file that caused the error. Let's call its full (URL) path: /some/file.html
  2. Run linklint @hostname /some/file.html -db3 -doc dbdoc
    This will cause all the tags containing links in /some/file.html to be printed out in dbdoc/log.txt.
  3. Examine the dbdoc/log.txt file to see the HTML tags found by Linklint and the links that were extracted from these tags.
If you use this technique frequently, you can avoid repeated typing by making a debug command file:
# debug command file
@hostname
-db3
-doc dbdoc
You can use this file to debug an HTML page with the command:
linklint @debug /some/file.html.
Server Redirection
 (top)  (command index)  (topic index)
One of the worst causes of confusion in debugging broken links is server redirection. Some http servers are programmed to deliver a different page than the one a visitor asks to see.

The most benign form of redirection is when the server program sends back a moved status code (301 or 302), telling the browser that the page requested has moved, along with a new url. Linklint follows these links and reports all moved urls in the file mapped.txt.

Sometimes a server is programmed to serve up the contents of a page that is different from the page requested without giving any hints to the browser (or to Linklint) that a switch has been made. Take a simple example where fileA.html is mapped to fileB.html. Linklint will tell you fileA.html is missing whenever fileB.html is missing even if fileA.html exists!

Since the server is not providing any clues that this switch has been made, there is nothing Linklint can do to alleviate the situation. I can only suggest that you minimize your use of this type of server redirection and familiarize yourself with which links on your site have been mapped this way.

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Cmd files · Memory · Passwords · Image maps · Errors · Redirection

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - how it works
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Seeds · Recursion · Parsing · Resolving links · Default Files · Image Maps · Status Cache

This page explains in more detail how Linklint performs site checks.
Creating Seeds for a Site Check
 (top)  (command index)  (topic index)
A linkset (entered on the command line) specifies a set of links to check. For each linkset a seed is created for starting Linklint's search of your site. If the linkset contains no wildcard characters (@ and #), it must be a single link and the complete linkset becomes a seed file. If the linkset contains wildcard characters, the seed is the longest string of non-wildcard characters starting with the leading "/" and ending with the last "/" before a wildcard. For example, if you specify /@ to check your entire site, Linklint will start with one seed file "/" which is the default file for your root directory (sometimes called your home page).
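The seed rule can be illustrated with a short Perl sketch. This is hypothetical code written for this explanation, not Linklint's own:

use strict;
use warnings;

# A linkset with no wildcards is itself the seed; otherwise the seed is the
# leading "/..." up to the last "/" before the first wildcard character.
sub seed_for {
    my ($linkset) = @_;
    return $linkset if $linkset !~ /[@#]/;          # no wildcards: the linkset is the seed
    my ($seed) = $linkset =~ m{^(/(?:[^@#/]+/)*)};  # keep whole segments before a wildcard
    return $seed;
}

print "$_ => ", seed_for($_), "\n" for qw( /@ /sub/@ /a/b/# /file1 );
# prints: /@ => /    /sub/@ => /sub/    /a/b/# => /a/b/    /file1 => /file1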

Linklint does not have (or need) a -seed option. A linkset without wildcard characters is the same thing as a seed file. In fact, if you have a list of specific HTML pages to check, just put the paths (one per line) in a file and tell Linklint that this is a command file (single leading @ sign before the filename). Make sure that you list only the paths (no http://, and no hostname); otherwise Linklint will do a remote URL check on your pages (it will see if the pages exist but it won't check the links on your pages).

Site Check Recursion
 (top)  (command index)  (topic index)
Linklint tries to find all of the pages and files in a site using recursion. Each seed is checked, and if it is an HTML file it is parsed, creating a new list of files to check. These files are checked, creating new lists of files to check, and so on. This process continues until one of the mechanisms that stop recursion kicks in.

The primary method used to stop recursion is to only check local links. A link is considered local if either: it resolves to a file reference without a scheme or host (i.e. /something), or it resolves to http://hostname/. . . and -host hostname was specified.

The second method for halting recursion is the use of specific linksets. Only HTML pages that match one or more of the linksets you specify will be checked for more links. HTML pages which don't match any of the linksets will be skipped, which means they are checked to see if they exist but none of the links inside the file are added to the list of files to check. You can also specifically -skip sets of HTML files or -limit the total number of HTML files checked.

Parsing HTML Files
 (top)  (command index)  (topic index)
These are the rules Linklint uses to extract links from HTML files.

Any tags enclosed inside of comment tags: <!-- . . . -->
or script tags: <script> . . . </script> are ignored.

The <base href=URL> tag will cause Linklint to set the base scheme, host, path, and file to the appropriate parts of URL for the remainder of the file. I've tried to emulate the behavior of the Netscape Navigator 3.0 browser. In general, missing elements at the front of a URL are filled in from the base specification.

Links are extracted from the following tags:

<a href=LINK name=NAME>
<applet code=LINK codebase=BASE>
<area href=LINK>
<bgsound src=LINK>
<body background=LINK>
<embed src=LINK>
<form action=LINK>
<frame src=LINK>
<img src=LINK lowsrc=LINK dynsrc=LINK usemap=NAME>
<input src=LINK>
<map name=NAME>
<meta http-equiv=refresh content="... href=LINK">
<script src=LINK>

Tag and attribute names are case insensitive. A LINK can be bare or enclosed in single or double quotes. The characters < and > are allowed inside of a tag only if they are enclosed in single or double quotes. Arbitrary whitespace is allowed around the = sign and between a tag's name and its attributes.
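For example (these tags are illustrative, not taken from a real page), the first three tags below are all parsed the same way:

<A Href = "page.html">    checked: tag/attribute case and extra whitespace are ignored
<a href='page.html'>      checked: single quotes are allowed
<a href=page.html>        checked: a bare link is allowed
<a title="no link here">  ignored: contains no link attribute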

Tags and/or attributes that do not match any of the above criteria are ignored.

All the links found on an HTML page are checked. Non-HTML links are checked only for existence. If a link is to an HTML file, it will also get parsed subject to the rules of recursion.

Resolving Links
 (top)  (command index)  (topic index)
In order to be able to follow links properly and to ensure that links get checked only once, all links are made absolute before they are checked. I have tried to use the same rules as a browser for making links absolute. You can use the -db3 flag to see how links get resolved. This flag causes every tag from an HTML file that contains a link to get printed out in the log file followed by the fully expanded link.

If a -host is specified, links starting with "http://host" have this text removed, creating a local link. Thus all local links will start with "/" followed by a full path from the server root to the file to be checked.
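For example (hypothetical paths), a link href="../pics/logo.gif" on the page /docs/intro.html resolves to /pics/logo.gif. Likewise, if -host www.site.com was specified, the link http://www.site.com/about.html is reduced to the local link /about.html.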

Default Index Files
 (top)  (command index)  (topic index)
HTTP servers treat a link to a directory followed by a "/" as a request for a default file. The server will look for a (server-specific) default file in the directory and serve that up if it exists. Otherwise the server will generate a listing of all of the files and subdirectories in the directory.

Linklint emulates this behavior in local site checks by searching for its own list of default files: home.html, index.html, index.shtml, index.htm, index.cgi, wwwhome.html, and welcome.html. If none of these are found, all the files and subdirectories in the directory are checked. You can change the set of default files Linklint looks for with the -index filename option which will replace the built-in set with the file(s) you specify. On the command line each default file must be preceded with the -index flag. If all of the default files are in lowercase, the search is case insensitive. If any of the files has an uppercase letter, the search is case sensitive.
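For example (the filenames here are hypothetical), to have Linklint look only for default.htm or Default.html as the default index file:

linklint -root /www -index default.htm -index Default.html /@

Since Default.html contains an uppercase letter, this particular search would be case sensitive.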

Server-side Image Maps
 (top)  (command index)  (topic index)
Linklint can check all links that are used in both client-side and server-side image maps. Client-side image maps are handled automatically since Linklint parses the <area href=LINK> tag in HTML files.

Server-side image maps are a little bit tricky. Some servers have the imagemap CGI software built-in so links ending in .map are treated as map files and automatically sent to the image map program for processing. Linklint mimics this behavior. Any link ending in .map is parsed as if it were a map file. In addition, all .map links are checked locally even if the -http flag is used since map files are generally not accessible directly via http.

Some servers require server-side image map links to contain the path of the CGI image map program followed by the path to the map file, as in:

<a href=/cgi-bin/imagemap/dir/info.map>.

Here /cgi-bin/imagemap is the location of the image map CGI program and /dir/info.map is the location of the map file. Linklint can resolve these links and read the map file (locally only, even if -http is used). However, you must provide the path from your server root directory to your image map program using the -map option. Three common image map specifications are:

  • -map /cgi-bin/imagemap
  • -map /cgi-bin/imagemap.exe
  • -map /cgi-bin/htimage
For example, if you set "-map /cgi-bin/imagemap", the link /cgi-bin/imagemap/dir/info.map will be transformed to /dir/info.map which will be read in locally and parsed as a map file. You need to be sure to set -root properly for Linklint to be able to find the map file.
How the Status Cache Works
 (top)  (command index)  (topic index)
Linklint uses a combination of three different methods to keep track of remote URL modification times:
Last-Modified date
Many web servers let Linklint know the last date a file was modified. If this date is available for a page, then Linklint uses it to keep track of changes.
If-Modified-Since requests
If the Last-Modified date is not available, then Linklint tries an If-Modified-Since request: Linklint asks whether the page has been modified since the last time (according to Linklint) it was checked.
Checksum of the remote file
If neither method above is available on a remote server, then Linklint reads in the entire remote file, makes a checksum of its contents, and uses this checksum to keep track of changes.
These methods are totally transparent to the Linklint user (you). For each URL the most efficient method is tried first, and the checksum is only used as a last resort.
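For illustration, an If-Modified-Since exchange looks roughly like this (the URL and date are hypothetical, and the exact headers vary from server to server):

GET /page.html HTTP/1.0
If-Modified-Since: Mon, 13 Aug 2001 00:00:00 GMT

HTTP/1.0 304 Not Modified

A "304 Not Modified" response tells Linklint that the page is unchanged without transferring its contents.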

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Seeds · Recursion · Parsing · Resolving links · Default Files · Image Maps · Status Cache

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index

Inputs describes input files and command line parameters.
Outputs describes output files created by Linklint.
Hints gives hints on how to use Linklint.
How it works explains the basic operation of Linklint.
Index contains indexes of all input parameters and topics.

Linklint is an Open Source Perl program that checks local and remote HTML links. Example:

linklint -http -host my.host.com -limit 1000 -doc dir /@

-http check site via HTTP requests (HTTP site check)
-host my.host.com check the my.host.com site
-limit 1000 bump up the file limit from 500 to 1000
-doc dir put all output files in the dir/ subdirectory
 /@ check entire site

But checking links is often more complicated than this. Even if you don't end up using Linklint, you will get a better understanding and appreciation of link checking if you read this documentation. Please read all of the documentation.

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - inputs
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Input files · Linksets · Local-site · HTTP-site · Remote-URL · Outputs · Debug

This page shows examples of all of the command line (and command file) inputs to Linklint. There are three main modes of operation:
Local Site Checking
Checks pages and links on your site locally, looking for files on the local file system. This is convenient for small sites that you build "at home" which will later be uploaded to an HTTP server. It can also be used for very simple sites that have little CGI.

HTTP Site Checking
Checks pages and links on your site by requesting pages via HTTP, just like a browser would. This mode is less efficient than reading directly from the file system because Linklint must make a socket connection for each page/file and your web server must respond to each request.

Remote URL Checking
The two site checking options only check pages from a single host computer. Remote URL checking can check all of the links on your site that go to other sites, or a list of specific URLs.
Input Files
 (top)  (command index)  (topic index)
There are two types of input files that can be specified on the command line: @command_files, which contain command line options, and @@http_files, which are parsed to find http:// URLs. Command files are indicated with a single @ sign before the file name; http files are indicated with two @ signs before the file name. [This means that the actual file name of a command file cannot start with an @ sign. Oh well.]

Command Files

linklint @command_file
Reads in command line arguments from command_file. Command files can be nested. Each command file is interpreted line by line. Empty lines and lines beginning with # are ignored. Lines that start with -anything can only contain command line arguments. You can have multiple arguments on one line, and arguments can take repeated parameters in command files only. Example:

# This is a sample command file
#
-host  www.linklint.org
-root  /www/
-doc   linkdoc
-index index.html index.cgi

Reading Commands from STDIN

linklint @ < command_file
linklint @STDIN < command_file
A plain @ sign or @STDIN will cause Linklint to read STDIN and treat it as a command file. This is useful if you want to run Linklint as a configurable CGI program. If no STDIN is available then Linklint will hang waiting for an end-of-file from STDIN. You can also use this mode to "interactively" feed commands to linklint. On Unix, terminate your input with ^D.

Files of Local Pages

If you only want to check the links on one or two pages then just use the path to those pages (starting each path with "/") on the command line instead of /@:

linklint /first/page.html /second/page.html

If you have a long list of pages (on your site) that you want to have link checked (not just the existence of each page, but all of the links on each page), then put the path to each page in a command file and pass that command file (with a leading @ sign):

linklint @local_pages

# local_pages
#
/first/page.html
/second/page.html
/third/page.html
# etc.
If the list of pages you want to check contains full URLs, it is very easy to write a little Perl program to strip off the scheme and host:
perl -ne "s{http://[^/]+}{} and print" full_links.in > rel_links.out

Files of Remote Links

linklint @@http_file
Checks the status of all http:// references that are found in http_file. Linklint is very forgiving when looking for links in this file. If the file looks like a remoteX.txt file generated by Linklint, then failed URLs will be cross referenced.

linklint -doc linkdoc @@
When you specify @@ with no filename, Linklint will check all the http links found in the file linkdoc/remoteX.txt. You must specify a -doc directory. This is an easy way to recheck all of the remote links on your site.

Which Files to Check
 (top)  (command index)  (topic index)

Linksets defined

Whether you are doing a local site check or an HTTP site check, you specify which directories (presumably containing HTML files) to check with one or more linksets. A linkset uses two wildcard characters @ and #. Each linkset specifies one or more directories much like the standard * and ? wildcard characters are used to specify the characters in the names of files in one directory.

The @ character matches any string of characters (it acts somewhat like "*"), and the # character (somewhat like "?") matches any string of characters except "/". The best way to understand how @ and # work is to look at a few examples:

the entire site /@
the homepage only (default) /
files in the root directory only /#
. . . and one directory down /#/#
files in the sub directory only /sub/#
files in the sub directory and below /sub/@
specific files /file1 /file2 ...
specific subdirectories /sub1/@ /sub2/@ ...

If you specify more than one linkset, files matching any of the linksets will be checked. HTML files that don't match any of the linksets will be skipped. Linklint will see if they exist but won't check any of their links.

Other File Selection Options

-skip skipset
Skips HTML files that match skipset. Linklint will make sure these files exist but won't add any of their links to the list of files to check. Multiple skipsets are allowed, but each must be preceded with -skip on the command line. Skipsets use the same wildcard characters as linksets.

-ignore ignoreset
Ignores files matching ignoreset. Linklint doesn't even check to see if these files exist. Multiple ignoresets are allowed, but each must be preceded with -ignore on the command line. Ignoresets use the same wildcard characters as linksets.

-limit n
Limits checking to n HTML files (default 500). All HTML files after the first n are skipped.
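For example (the paths here are hypothetical), the following command checks an entire site, skips the HTML files under /archive/, ignores everything under /images/, and raises the file limit to 2000:

linklint -root /www -limit 2000 -skip /archive/@ -ignore /images/@ /@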

Local Site Checking
 (top)  (command index)  (topic index)
If you are developing HTML pages on a computer that does not have an http server, or if you are developing a simple site that does not use Server Redirection or extensive CGI, you should use local site checking.

linklint /@
Checks all HTML files in the current directory and below. Assumes that the current directory is the server root directory so links starting with "/" default to this directory. You must specify /@ to check the entire site. See Which Files to Check for details.

linklint -root dir /@
Checks all HTML files in dir and below. This is useful if you want to check several sites on the same machine or if you don't want to run Linklint in your public HTML directory.

Other Local Site Options

-host hostname
By default Linklint assumes all links on your site that start with http:// are remote links to other sites. If you have absolute links to your own site, give Linklint your hostname and links starting with http://hostname will be treated as local files. If you specify -host hostname:port, only http links to this hostname and port will be treated as local files.

-case
Makes sure that the filename (upper/lower) case used in links inside of HTML tags matches the case used by the file system. This is for Windows only and is very handy if you are porting a site to a Unix host.

-orphan
Checks all directories that contain files used on the site for unused (orphan) files.

-index file
Uses file as the default index file instead of the default list used by Linklint. You can specify more than one file but each one must be preceded by -index on the command line. If a default index file is not found, Linklint uses a listing of the entire directory. See the Default File section for details.

-map /a=[/b]
Substitutes leading /a with /b. For server-side image maps or to simulate Server Redirection.

-no_warn_index Turns off the "index file not found" warning. Applies to local site checking only.

-no_anchors Tells Linklint to ignore named anchors. This could ease memory problems for people with large sites who are primarily interested in missing pages and not missing named anchors. This option works for both HTTP and local site checks.

HTTP Site Checking
 (top)  (command index)  (topic index)
If you have a complicated site that uses lots of CGI or Server Redirection, you should use HTTP site checking. Even though an HTTP site check reads pages via your HTTP server, you will get the best performance if you do your checking on a machine that has a high speed connection to your server.

linklint -http -host www.site.com /@
The -http flag tells Linklint to check HTML files on the site www.site.com via a remote http connection. You must specify a -host whenever you do an HTTP site check (otherwise Linklint won't know where to get your pages). You can specify /@ to check the entire site. See Which Files to Check for details.

HTTP Site Check Options

-http
This flag tells Linklint to perform an HTTP site check instead of a local site check. All files (except server side image maps) will be read via the HTTP protocol from your web server.

-host hostname:port
If you include :port at the end of your hostname, Linklint uses this port for the HTTP site check.

-password realm user:password
Uses user and password as authorization to enter password protected realm. Realms are named areas of a site that share a common set of usernames and passwords. If passwords are needed to check your site, Linklint will tell you which realms need passwords in warning messages. Enclose the realm in double quotes if it contains spaces. If no password is given for a specific realm, Linklint will try using the password for the "DEFAULT" realm if it was provided.

-timeout t
Times out after t seconds (default 15) when getting files via http. Once data is received, an additional t seconds is allowed. The timeout is disabled on Windows machines since the Windows port of Perl does not support the alarm() function.

-delay d
Delays d seconds between requests (default 0). If you want to remote check in the background you can set delay to a large number, and Linklint will spend most of its time sleeping.

-local linkset
Gets files that match linkset locally. The default -local linkset is @.map (which matches any link ending in .map). This allows Linklint to follow links through server-side image maps. The default is ignored if you specify your own -local expressions. You need to specify the -root directory for this option to work properly.

-map /a=[/b]
Substitutes leading /a with /b. For server-side image maps or to simulate Server Redirection.

-no_query_string
Up until version 2.3.4, Linklint did not use query strings while doing HTTP site checks. Query strings were removed before making HTTP requests. As of 2.3.4 query strings in links are used in the requests. Use the -no_query_string flag to get back the "old" behavior.

-http_header "Name:value"
Adds the HTTP header "Name: value" to all HTTP requests generated by Linklint. You will need to use quotation marks to hide spaces in the header line from the command line interpreter. Linklint will automatically add a space after the first colon if there is not one there already. Multiple (unique) header lines are allowed.
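For example (the header below is hypothetical), to send a custom User-Agent line with every request:

linklint -http -host www.site.com -http_header "User-Agent: MyChecker/1.0" /@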

-language zz
This option is only useful if you are checking a site that uses content negotiation to present the same URL in different languages. Creates an HTTP Request header of the form "Accept-Language: zz" that is included as part of all HTTP requests generated by Linklint. Multiple -language specifications are allowed. This will result in a single Accept-Language: header that lists all of the languages you have specified in alphabetical order. Some web sites can use this information to return pages to you in a specific language.

If you need to get more complicated than this, use the more general purpose -http_header to create your own header. There is a partial list of language abbreviations (taken from Debian) included as part of the Linklint documentation.
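For example (hypothetical language choices), specifying both -language fr and -language de results in a single request header of the form:

Accept-Language: de, fr

with the languages listed in alphabetical order.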

Remote URL Checking
 (top)  (command index)  (topic index)
A remote URL check is used to see if a remote URL exists (or has been recently modified). Links in the remote pages are not checked nor does Linklint look for named anchors in remote URLs.

Which URLs to check

Remote URL checking can be used to check all of the "remote" links on your site (those that link to pages on other sites) or it can check a list of URLs. There are several ways to specify which remote URLs to check:

linklint http://somehost/file.html
Checks to see if /file.html exists on somehost. Multiple URLs can be entered on the command line, in an @commandfile, or in an @@httpfile. Every URL to be checked must begin with http://. This will disable site checking.

linklint @@httpfile
Checks all the remote http URLs found in httpfile. Anything in the file starting with http:// is considered to be a URL. If the file looks like a remoteX.txt file generated by Linklint then all failed URLs will be cross referenced.

linklint @@ -doc linkdoc
Assuming you have already done a site check and used "-doc linkdoc" to put all of your output files in the linkdoc directory, Linklint will check all the remote links that were found on your site and cross reference all failed URLs without doing a site check. You can use the -netmod or -netset flags to enable the status-cache.

linklint -net [site check options]
The -net flag tells Linklint to check all remote links after doing either a local or HTTP site check. If you are having memory problems, don't use the -net option; instead use one of the @@ options above.

Other Remote URL Options

-timeout t
Times out after t seconds (default 15) when getting files via http. Once data is received, an additional t seconds is allowed. The timeout is disabled on Windows machines since the Windows port of Perl does not support the alarm() function.

-delay d
Delays d seconds between requests to the same host (default 0). This is a friendly thing to do especially if you are checking many links on the same host.

-redirect
Checks for <meta> redirects in the headers of remote URLs that are html files. If a redirect is found it is followed. This feature is disabled if the status cache is used.

-proxy hostname[:port]
Sends all remote HTTP requests through the proxy server hostname and the optional port. This allows you to check remote URLs or (new with version 2.3.1) your entire site from within a firewall that has an http proxy server. Some error messages (relating to host errors) may not be available through a proxy server.

-concise_url
Turns off printing successful URLs to STDOUT during remote link checking.

Status Cache Options

The Status Cache is a very powerful feature. It allows you to keep track of recent changes in all of the remote (off-site) pages you link to. You can then use the Linklint output files to quickly check changed pages to see if they still meet your needs.

The flags below make use of the status cache file linklint.url (kept in your HOME or LINKLINT directory). This file keeps track of the modification dates of all the remote URLs that you check.

-netmod
Operates just like -net but makes use of the status cache. Newly checked URLs will be entered in the cache. Linklint will tell you which (previously cached) URLs have been modified since the last -netset.

-netset
Like -netmod but also resets the last modified status in the cache for all URLs that checked ok. If you always use -netset, modified URLs will be reported just once.

-retry
Only checks URLs that have a host fail status in the cache. Sometimes a URL fails because its host is temporarily down. This flag enables you to recheck just those links. An easy way to recheck all the cached URLs with host failures is linklint @@ -retry. Use linklint @@linkdoc/remoteX.txt -retry if you want failed URLs to be cross referenced.

-flush
Removes all URLs from the cache that are not currently being checked. The -retry flag has no effect on which URLs are flushed.

-checksum
Ensures that every URL that has been modified is reported as such. This flag can make the remote checking take longer. Many of the pages that require a checksum are dynamically generated and will always be reported as modified.

-cache directory
Reads and writes the linklint.url cache file in this directory. The default directory is set by your LINKLINT or HOME environment variables.

Output Options
 (top)  (command index)  (topic index)
No output files are generated by default, only progress and a brief summary of the results are printed to the screen. You can produce complete documentation (split up into separate files) in a -doc directory or put selected output in a single -out file or by redirecting the standard output to a file. See the Output File Specification section for a detailed description of all output files.

Multi File Output

linklint -doc linkdoc
Sends all output to the linkdoc directory. The output is divided into separate .txt and .html files. Complete documentation is always produced regardless of the single file flags.

The file index.txt contains an index to all the other files; index.html is an HTML version of the index. The index files for remote URL checking are url_index.txt and url_index.html.

-textonly
Prevents any HTML files from being created in the -doc directory.

-htmlonly
Erases redundant text files in the -doc directory after they have been used to create the HTML output files. The files remote.txt and remoteX.txt are not erased since they can be used by Linklint to recheck remote URLs.

-docbase base
Overrides the default base expression used for directing a browser to the resources listed in the output HTML files. The base is prepended to local links in the output HTML files. This only affects the links in the HTML output files; it has no effect on what is displayed in these files. Ordinarily this flag would only be used during a local site check to set the base to http://host.
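For example (hypothetical hostname), after a local site check you could make the links in the HTML output files point at your live server with:

linklint -root /www -doc linkdoc -docbase http://www.site.com /@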

-output_frames
All HTML output data files are linked to from index.html. If you use this flag, then the data files will be opened up in a new frame (window), which can be handy in some cases since it always leaves the index.html file open in its own window.

-output_index filename
The output index files were previously named linklint.txt and linklint.html. These have now been changed to index.txt and index.html. You can use the -output_index option to change this name back to linklint or to something else.

-url_doc_prefix url/
By default, the output files associated with remote URL checking all start with "url". You can change this with the -url_doc_prefix option. If the url_doc_prefix contains a "/" character, then the appropriate directory will be created (as a subdirectory of the -doc directory).

-dont_output xxxx
Don't create output files whose names contain "xxxx". Can be repeated. Example: -dont_output "X$" will suppress the output of all cross reference files.

Single File Output

linklint -error > linklint.out
Lists all errors to linklint.out. Progress and summary information will not be included. You can get cross referenced lists with the -xref flag or lists sorted by the files containing errors with the -forward flag.

linklint -error -out linklint.out
Lists all errors and a brief summary to linklint.out. You can get cross referenced lists, etc., as in the example above.

-out file sends list output and summary information to file
-list lists all found files, links, directories etc.
-error lists missing files and other errors
-warn lists all warnings
-xref adds cross references to the lists
-forward sorts lists by referring file

Debug and other Flags
 (top)  (command index)  (topic index)

Debug Flags

-db1 debugs command line input and linkset expressions
-db2 prints the name of every file that gets checked (not just HTML files)
-db3 debugs HTML parser, prints out tags and resulting links
-db4 debugs socket connection (kind of)
-db5 not used
-db6 details last-modified status for remote URLs (requires -netset or -netmod)
-db7 prints brief debug information while checking remote URLs
-db8 prints all http headers while checking remote URLs
-db9 generates random http errors

Other Flags

Use linklint with no command line arguments to get simple usage.

-version Gives version information.
-help Lists a few simple examples of how to use Linklint.
-help_all Lists all help (contained in program) including every input option.
-quiet disables printing progress to the screen
-silent disables printing summaries to the screen

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Input files · Linksets · Local-site · HTTP-site · Remote-URL · Outputs · Debug

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - Country/Language Codes
Version 2.3.5 August 13, 2001

Here is a partial list of language codes (some with ISO 3166 country variants) that are used in HTTP requests to specify which language or languages to use when returning pages that are available in multiple languages.

This list was lifted off of the Debian home page.

ca català
da dansk
de Deutsch
en English
es Español
eo Esperanto
hr hrvatski
it Italiano
hu magyar
nl Nederlands
no norsk
pl polski
pt Português
ro română
fi suomi
sv svenska
tr Türkçe
zh-cn 中文 (China)
zh-hk 中文 (Hong Kong)
zh-tw 中文 (Taiwan)

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
GNU General Public License
Version 2, June 1991

      Copyright (C) 1989, 1991 
      Free Software Foundation, Inc.
      59 Temple Place, Suite 330, 
      Boston, MA  02111-1307  USA

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
 (top)  (command index)  (topic index)
The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.

TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
 (top)  (command index)  (topic index)
0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.

3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:

a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.

If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS
 (top)  (command index)  (topic index)
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
Linklint Documentation - What's New
Version 2.3.4 August 8, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Index

Few changes. Almost 100% backward compatible (output index files are now index.* by default instead of linklint.*). Existing command files should still work.

Query string support for HTTP site checking has changed as of 2.3.4. This may change how Linklint checks your HTTP site if you have dynamic content. It now does "the right thing" by default. Use the -no_query_string flag to get back the old behavior.

  • Linklint is free again
  • Better query string support (as of 2.3.4)
  • Proxy support for site checking
  • Output files are more configurable
  • Input from STDIN
  • Beta SSL support now available
  • Looks for Java Archive .jar files

GNU General Public License

Linklint started out free. It became shareware when the flood of comments, suggestions and questions started turning it into a "full-time" job. Thanks to all who contributed! With the rise (and love) of Linux, we wanted to make it Open Source in a way that would not make previous contributors feel ripped off.

There was a gradual (unadvertised) change to Open Source. First, it was made free for use on Open Source operating systems. Then a GPL Open Source version was distributed as a part of the Meta Web Language system and shareware checks were no longer cashed. With the new Linklint web site, it is only being distributed as Open Source via the GNU General Public License.

Better Query String Support

Query strings used to be suppressed during site checks. As of 2.3.4, query strings are used in HTTP site checks. To get the old behavior use the new -no_query_string flag.

If you have an older version of Linklint, you can get the new query string behavior by commenting out the one line that includes the comment "strip query string". This also applies to the 2.4.beta version.
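For example (www.example.com is a placeholder for your own host), these two commands check the same site with and without query strings:

linklint -http -host www.example.com /@
linklint -http -host www.example.com -no_query_string /@

The first form requests links like /search.cgi?page=2 as-is; the second strips the query string before making each request.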

Proxy Support for Site Checking

Previously, proxy support was only for remote-URL checking. Now proxy support has been extended to handle site checking as well. The conflict between proxies and virtual hosts has been resolved.

Output Options

-output_index xxxx The index output files were previously named linklint.* in order to prevent overwriting existing index.html files if the -doc directory collided with an existing HTML directory. By popular demand, these file names have been changed to index.*. This is more convenient for most, and a little more dangerous for a few. You can change back to linklint (or whatever) with the -output_index option.
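For example, if you prefer the old file names, something like this should bring them back (linkdoc is just an example output directory):

linklint -doc linkdoc -output_index linklint /@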

-output_frames Uses <base> tags in the output files so that new browser windows will open when following links in the HTML output files. This prevents having to reload large output HTML files.

-url_doc_prefix some_prefix Gives control over the prefix of all output files associated with Remote-Url checking.

-dont_output xxx Suppress output of files that match /xxx/. In the past, I have told people to comment out lines in the program to suppress the generation of certain output files.

-no_warn_index Turns off the "index file not found" warning. Applies to local site checking only.

-concise_url Turns off printing successful URLs to STDOUT during remote link checking.

Read inputs from STDIN (keyboard)

@ or @STDIN will cause Linklint to read from STDIN (keyboard) as if it were an @command file. Great when you are using linklint from the shell and run out of space on the command line. Might make running Linklint as a CGI program a little easier.
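A minimal sketch, assuming a Unix shell and a placeholder host name, that feeds the flags through STDIN and puts the linkset on the command line:

echo "-host www.example.com -http" | linklint @ /@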

Other New Flags

-help_all -version -license are all pretty obvious.
-no_query_string Don't use query strings in HTTP site checks
-http_header Xxx:Yyy Add header lines to HTTP requests.
-language zz Add Accept-Language header line to HTTP requests.
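For example, the last two flags might be combined like this (the header line and language code are illustrations only, not required values):

linklint -http -host www.example.com -http_header "Cache-Control:no-cache" -language de /@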

Beta Version with SSL Support

Now that someone else (Sampo Kellomäki) has done the heavy lifting of providing a low-level Perl interface to the OpenSSL package, we wrote a simple wrapper module around his Net::SSLeay module to provide SSL support in Linklint.

Be warned, this beta version requires the OpenSSL package and the Net::SSLeay module (and the Net::SSLeay::Handle wrapper) in order to run. You will need to install these first which is fairly easy (if you have root) on Linux. Your Win32 mileage may vary.

Once installed, you use -http to check HTTP sites and -https to check HTTPS sites.
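For example (secure.example.com is a placeholder host):

linklint -https -host secure.example.com /@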

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - outputs
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
-Doc directory · Site-check files · Url-check files

This page provides a detailed description of every output file created by Linklint.
-Doc Directory
 (top)  (command index)  (topic index)
If a -doc directory is specified, all output files are written in that directory. The directory will be created if it does not already exist. There are two sets of files that are created, one is for site checks and the other is for remote url checks.

Even though new files are written only as needed, all previously written files from each set are erased before new files from the set are written. Each set is independent of the other. The site-check files will get erased only if you do another site check. The url-check files will be erased only if you check remote URLs again.

Many of the output files now come in HTML and text versions. The HTML versions look like the text versions but contain hyper-links to the resources they refer to. Files that come in both versions are listed without an extension.

Some of the output files are part of a family. Members of the family are related using the simple mnemonic of a trailing X meaning cross referenced and a trailing F meaning forward referenced. For example: error lists all missing files, errorX lists the missing files and all HTML files that reference them, and errorF lists all HTML files that contain links to missing files along with the names of the missing files.

Site-Check Output Files
 (top)  (command index)  (topic index)
During local site checking, local links ending in "/" are kept as-is to maintain consistency with HTTP site results. However, if a default index file was found, it is listed below the link in square brackets. If a directory listing is used because no default file was found, this information is listed in the brackets.

Any link that Linklint knows has been redirected (either by the http server or with the -map option), gets an extra line in the output files showing the original link it was mapped from in parentheses. See the Server Redirection section for more details.

Site-Check Summary Files

index.html
a hyperlinked index to all site-check files created. On Windows machines this file will be named index.htm. This list differs from the summary list in two ways. First, it contains entries for every file written, including cross referenced lists and forward referenced lists that are missing in the summary. Second, it lacks the detailed divisions by file type and external link scheme that exist in the summary.

index.txt
text version of the index file.

summary.txt
summary of site-check results, similar to what gets printed to the screen. Every non-empty data family (listed below) will cause a line to be printed in the summary. In addition, external links are listed by the following schemes: http, https, ftp, javascript, mailto, gopher, file, news, view-source, about, unknown. Found and missing files are listed by file type:

cgi starting with "/cgi-bin" or containing "?" or .cgi or .pl
default index ending with "/"
HTML ending with .htm .html or .shtml
map ending with .map
image ending with .gif .jpg .jpeg .tif .tiff .pic .pict .hdf .ras .xbm
text ending with .txt
audio ending with .au .snd .wav .aif .aiff .midi .mid
video ending with .mpg .mpeg .avi .qt .mov
shockwave ending with .dcr
applet ending with .class
other all other files

These distinctions sort files (mostly) by extension; they are not true indications of the tag the link was found inside, nor of the MIME type returned by your server.

log.txt
log of site-check progress, similar to what gets printed to the screen but also includes extra output created when -db flags are used.

Site-Check Data Files

action, actionX
a list of all ignored actions. Any link found inside a <form action=LINK> tag is listed here, but it is not checked since normally extra input from the form is required.

anchor, anchorX
a list of all named anchors found. The anchorX file lists all files that use each named anchor. If a named anchor is not referenced by any of the files checked by Linklint, it will have no cross references. Named maps are also included in this list.

case, caseX, caseF
If -case was used during a local site check on a Windows machine, all files that have references that do not match the (upper/lower) case of the file are listed here. This is handy if you are developing a site on a Windows machine that you are planning to port to a Unix server.

dir.txt
a list of all the directories that contain files used by the site. Only created during local site checking.

error, errorX, errorF
a list of all missing files. A file is listed as missing if:

  1. it is referenced by a file checked by Linklint, and
  2. it is a local file, and
  3. it does not match any -ignore expression, and
  4. Linklint could not find the file locally, or an error occurred when Linklint tried to get the file via http.
See the Parsing Html section for details.

errorA, errorAX
a list of missing named anchors. A named anchor is missing if:

  1. it is referenced by a file checked by Linklint, and
  2. it is located inside of a local file, and
  3. the file it is in is not -ignored or -skipped, and
  4. Linklint could not find the file, or the file was found but the named anchor does not exist.

errorM, errorMX
a list of missing named client-side image maps. The rules for inclusion in this list are the same as those for missing named anchors.

file, fileX, fileF
a list of all files found on the site. file is a list of all files found sorted by file type; fileX is a cross referenced version, showing a sublist of all the HTML files that reference each file in the list; fileF lists each HTML file and a sublist of all of the links it references. These lists are meant to show file dependencies so multiple links to the same file result in a single listing. Likewise, a named anchor causes the file containing the anchor to be listed, but the actual named anchors are listed separately.

httpfail
a list of all the http errors that occurred while trying to remote check a site.

httpok
a list of all the files that were obtained without error while remote checking a site. If the -db6 flag is used and the status-cache is enabled, these entries are expanded to include the following extra information:

  • ok (200)
  • ok parsed HTML
  • ok skipped

ignore, ignoreX
a list of all ignored files.

imgmap, imgmapX
a list of named image maps for client-side image maps, taken from tags <img usemap=NAME> and <map name=NAME>. The format of this list is the same as the one used for named anchors.

mapped
a list of all the redirected files that were found while checking a site either locally or remotely.

Some servers automatically change the name of a link. For example, a link to http://host/subdir will get automatically mapped to http://host/subdir/ if subdir is a directory. Many other mappings are possible. Linklint will follow these mappings and treat the resulting file as the actual link (just like a browser would).

This is a potential cause for confusion. If your index.html file has a link to A.html and this gets mapped to B.html, Linklint will tell you that index.html has a link to A.html and that it has a link to B.html, but only B.html will be listed as a found file.

See the Server Redirection section for more details.

orphan
If the -orphan flag was used during a local site check, all of the unused files and subdirectories in each directory that contains files used by the site are listed here, sorted by directory. If an orphan HTML file contains a meta refresh tag redirecting the visitor to a different file, this new file is listed under its parent preceded by " =>". This method of redirection is often used to steer visitors to the current version of a file without requiring them to change their bookmarks.

remote, remoteX
a list of all references that are not to local files. The remoteX file lists which HTML files link to these resources.

skipped, skipX
a list of all files that were skipped by Linklint. These are generally HTML files which were found to exist but were not checked further because:

  1. they did not match any of the linksets specified, or
  2. they did match one of the -skip expressions, or
  3. more than -limit files had already been checked.

warn, warnX, warnF
a list of all warnings that were generated during the site check. Warnings include: unexpected I/O errors, HTML errors such as unterminated comments, missing index files (during local site checks), space characters inside of links, the use of "\" inside of links, files that are not world readable, mappings that cause infinite loops, and meta refresh tags that redirect to relative URLs.

For HTTP site checking warnings are also generated for: files disallowed by robots.txt, files mapped to a different server, files that require a username and password (and none was provided), an invalid username and password, and files mapped to non-http schemes.

Url-Check Output Files
 (top)  (command index)  (topic index)
You may notice that all of the files in this section start with the prefix url. You can change this prefix with the -url_doc_prefix option. The default value is still url for backward compatibility, but I now prefer either url/ or url_.
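For example, to recheck the remote links found during a previous site check and write the output files with a url_ prefix (linkdoc stands in for whatever -doc directory you used):

linklint -doc linkdoc -url_doc_prefix url_ @@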

Url-Check Summary Files

urlindex.html
a hyperlinked index to all url-check files created. On Windows machines this file will be named urlindex.htm. This list differs from the summary list in two ways. First, it contains entries for every file written including: host failures, cross referenced lists, and forward referenced lists. Second, it lacks the detailed divisions by failure type that exist in the summary.

urlindex.txt
text version of the index file.

urlsum.txt
summary of url-check results, similar to what gets printed to the screen. The summary list includes an entry for every type of warning or failure.

urllog.txt
log of url-check progress, similar to what gets printed to the screen but also includes extra output created when -db flags are used.

Url-Check Data Files

urlfail, urlfailX, urlfailF
a list of URLs that failed due to one of the following errors:

  • could not find ip address
  • could not connect to host
  • all timeout errors
  • had no content (204)
  • bad request (400)
  • access forbidden (403)
  • not found (404)
  • internal server error (500)
  • service not implemented on server (501)
  • server temporarily overloaded (502)
  • gateway timeout (503)

urlhost.txt
a list of all hosts that had failures during the url check. This includes:

  • could not find ip address
  • could not connect to host
  • could not open socket
  • malformed status line
  • timeout errors
  • server overloaded (502)
  • gateway timeout (503)
URLs that fail due to host failures can be retried with the -retry flag if the status-cache was enabled.

urlmod
a list of all URLs that have changed since the last time they were checked with the -netset flag. This list is only generated if -netmod or -netset are specified.

urlmoved
a list of URLs that were reported as redirected by their server. Often this redirection involves nothing more than adding a trailing "/" to a directory name. Sometimes it can be a precursor to a site changing location permanently. Some servers report that the url has been moved temporarily, others will say that the url has been moved permanently. As far as I can tell there is no real distinction between "temporary" and "permanent", they seem to be used interchangeably.

urlok
a list of all URLs that were found with no errors. If the -db6 flag is used and the status-cache is enabled, these entries are expanded to include the following extra information:

  • ok (200)
  • ok not modified (304)
  • ok last-modified date unchanged
  • ok did not compute checksum
  • ok checksum matched

urlskip
a list of URLs that were not checked. This is most often caused by the lack of password authorization but could also be due to an exceptional condition such as an infinite redirect loop or an unknown internal Linklint error.

urlwarn, urlwarnX, urlwarnF
a list of warning messages generated while Linklint was doing a url-check. The most common warning tells you the name of a realm which requires a username and password.


Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - index
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Command index · Topic index

The command index links every input command to the section in the documentation where that command can be found.

The topic index gives a quick overview of all of the documentation and links to all of the sections.

Command Index
 (top)  (command index)  (topic index)
-cache directory -case -checksum -concise_url
-db1..9 -delay d -doc -docbase base -dont_output xxxx
-error
-flush -forward
-help -help_all -host hostname:port -host hostname -htmlonly -http -http_header "Name:value"
-ignore ignoreset -index file
-language zz -limit n -list -local linkset
-map /a=[/b]
-netmod -netset -no_anchors -no_query_string -no_warn_index
-orphan -out file -output_frames -output_index filename
-password realm user:password -proxy hostname[:port]
-quiet
-redirect -retry
-silent -skip skipset
-textonly -timeout t
-url_doc_prefix url/
-version
-warn
-xref
Topic Index
 (top)  (command index)  (topic index)

Inputs

Input FilesCommand Files · Reading Commands from STDIN · Files of Local Pages · Files of Remote Links
Which Files to CheckLinksets defined · Other File Selection Options
Local Site CheckingOther Local Site Options
HTTP Site CheckingHTTP Site Check Options
Remote URL CheckingWhich URLs to check · Other Remote URL Options · Status Cache Options
Output OptionsMulti File Output · Single File Output
Debug and other FlagsDebug Flags · Other Flags


Outputs

-Doc Directory
Site-Check Output FilesSite-Check Summary Files · Site-Check Data Files
Url-Check Output FilesUrl-Check Summary Files · Url-Check Data Files


Hints

Create a Command File
Resolving Memory ProblemsCheck Your Site in Sections · Run Linklint Twice · Use the -no_anchor option
Add Passwords
Add Server-Side Image Maps
Tracking Down Errors
Server Redirection


How it works

Creating Seeds for a Site Check
Site Check Recursion
Parsing HTML Files
Resolving Links
Default Index Files
Server-side Image Maps
How the Status Cache Works


Bugs

Bug Reports
Memory Problems
Parsing Problems
What Linklint Won't Check

Other

GNU General Public License
Common Country/Language Codes


Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - hints
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Cmd files · Memory · Passwords · Image maps · Errors · Redirection

This page contains detailed hints for configuring Linklint for web sites that need to make use of most of Linklint's features. Some sections here may not be applicable to your site. These hints are intended as suggestions to help you quickly get started checking links. Your mileage may vary.
  • Use the /@ linkset to check your entire site.
  • Use -limit NNN to check more than 500 HTML files.
  • Always use the -doc dir option.
  • Consider using the -docbase option if you are doing a local site check.
  • Read all the documentation.
Create a Command File
 (top)  (command index)  (topic index)
Before checking your site, take the time to put some information about your site in a command file. This will avoid a lot of retyping (and possible typos) later on. It often makes sense to name the command file after your host name. The command file should look like:
# general command file for hostname

-host www.hostname.com
-root /absolute/path/to/your/htmlrootdirectory
-doc linkdoc
-http
-limit 1000
Resist the temptation to include any linksets in this command file. The reason will become clear when you start tracking down broken links. If you need to use a large list of linksets, another option is to include these in their own separate command file.

You can check your home page with linklint @hostname. You can check your entire site with linklint @hostname /@.

Often the easiest way to understand some of the many features that Linklint has to offer is to try them out. Linklint is very fast and it is easy to play around with it on just a few pages. Start with a simple command file like the example above and then add features and options as needed. In the spirit of Perl, Linklint has been designed to "do the right thing".

Resolving Memory Problems
 (top)  (command index)  (topic index)
Here are some things you can do to reduce the amount of memory that Linklint uses.

Check Your Site in Sections

If you have a very large site (thousands of pages), it might make sense to break your site up into several sections and check each section separately. One way to do this is to check all the files in the root directory and then check the files in each subdirectory.

Note: Linklint is designed so that all links between the sections will be checked correctly. Currently, the output files for each section will not be merged.

Create a command file named root for checking the root directory:

# root directory command file
@hostname
-doc rootdoc
/#
For each subdirectory (or group of subdirectories) create a command file named subdir:
# command file for subdir
@hostname
-doc subdirdoc
/subdir/@
Now you can check just your root directory with linklint @root and each subdirectory with linklint @subdir and the results will be kept in separate output directories.

Run Linklint Twice

If you have a large site, don't use the -net flag when you are checking your site. Instead, after you check your site (without the -net flag), run Linklint again as:

linklint -doc doc_dir @@

You will end up with the same results as with a single pass of Linklint but the memory requirements will have been eased.

Use the -no_anchors option

Since there are often many named anchors on a single page, the list of named anchors that Linklint generates and checks can be larger than the list of HTML pages. You can use the -no_anchors option to tell Linklint to ignore named anchors which should reduce memory consumption.
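For example (linkdoc is a placeholder output directory):

linklint @hostname -no_anchors -doc linkdoc /@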

Add Passwords
 (top)  (command index)  (topic index)
If you get warning messages that say need password for "realm", you will have to provide Linklint with a username and password for each password protected realm. Add these lines to your hostname file:
-password "realm1" username1:password1
-password "realm2" username2:password2
The realms are double quoted in the warning messages. You will have to use double quotes in the command file if the realm contains any space characters. You can also use the realm "DEFAULT" to provide a default username and password. The default will be tried only if a password for the specific realm was not given. Once you have made these changes to your command file, check the site again to make sure that you entered all the information correctly. You will get warning messages for invalid username/password combinations.

Note: The HTTP protocol uses a named realm to identify a set of pages that share a common set of username/password combinations. This system was created so that visitors only need to be prompted for their username and password once (per session) in order to browse any number of pages in a given realm. Realms are often used to protect all the files under a particular subdirectory, but they can be used in other configurations.

Add Server-Side Image Maps
 (top)  (command index)  (topic index)
If your site makes use of server-side images maps, you may have to add a -map option to your command file so Linklint knows how to find your .map files. See Server-Side Image Maps for a detailed explanation. You may have to add one of the following lines to your hostname file:
-map /cgi-bin/imagemap
-map /cgi-bin/imagemap.exe
-map /cgi-bin/htimage
You will also need to have the -root directory specified so Linklint knows where to look for map files locally on your machine.
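Putting it together, a hostname command file for a site with server-side image maps might gain two lines like these (both paths are examples only):

-root /www
-map /cgi-bin/imagemap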
Tracking Down Errors
 (top)  (command index)  (topic index)
Sometimes the error messages generated by Linklint do not provide sufficient information for figuring out why an error was reported. In these cases it can be useful to look at the HTML tags that caused the errors. One way to see these tags is to use the -db3 flag. This flag causes all HTML tags that contain links to be printed out followed by the fully expanded links.

Here is one strategy for tracking down errors:

  1. Look in the errorF.txt or errorX.txt file to find the file that caused the error. Let's call its full (URL) path: /some/file.html
  2. Run linklint @hostname /some/file.html -db3 -doc dbdoc
    This will cause all the tags containing links in /some/file.html to be printed out in dbdoc/log.txt.
  3. Examine the dbdoc/log.txt file to see the HTML tags found by Linklint and the links that were extracted from these tags.
If you use this technique frequently, you can avoid repeated typing by making a debug command file:
# debug command file
@hostname
-db3
-doc dbdoc
You can use this file to debug an HTML page with the command:
linklint @debug /some/file.html.
Server Redirection
 (top)  (command index)  (topic index)
One of the worst causes of confusion in debugging broken links is server redirection. Some http servers are programmed to deliver a different page than the one a visitor asks to see.

The most benign form of redirection is when the server program sends back a moved status code (301 or 302), telling the browser that the page requested has moved, along with a new url. Linklint follows these links and reports all moved urls in the file mapped.txt.

Sometimes a server is programmed to serve up the contents of a page that is different from the page requested without giving any hints to the browser (or to Linklint) that a switch has been made. Take a simple example where fileA.html is mapped to fileB.html. Linklint will tell you fileA.html is missing whenever fileB.html is missing even if fileA.html exists!

Since the server is not providing any clues that this switch has been made, there is nothing Linklint can do to alleviate the situation. I can only suggest that you minimize your use of this type of server redirection and familiarize yourself with which links on your site have been mapped this way.


Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - how it works
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Seeds · Recursion · Parsing · Resolving links · Default Files · Image Maps · Status Cache

This page explains in more detail how Linklint performs site checks.
Creating Seeds for a Site Check
 (top)  (command index)  (topic index)
A linkset (entered on the command line) specifies a set of links to check. For each linkset a seed is created for starting Linklint's search of your site. If the linkset contains no wildcard characters (@ and #), it must be a single link and the complete linkset becomes a seed file. If the linkset contains wildcard characters, the seed is the longest string of non-wildcard characters starting with the leading "/" and ending with the last "/" before a wildcard. For example, if you specify /@ to check your entire site, Linklint will start with one seed file "/" which is the default file for your root directory (sometimes called your home page).

Linklint does not have (or need) a -seed option. A linkset without wildcard characters is the same thing as a seed file. In fact, if you have a list of specific HTML pages to check, just put the paths, (one per line) in a file and tell Linklint that this is a command file (single leading @ sign before the filename). Make sure that you list only the paths (no http://, and no hostname) otherwise Linklint will do a remote URL check on your pages (it will see if the pages exist but it won't check the links on your pages).

Site Check Recursion
 (top)  (command index)  (topic index)
Linklint tries to find all of the pages and files in a site using recursion. Each seed is checked and if it is an HTML file it is parsed creating a new list of files to check. These files are checked creating new lists of files to check and so on. This process continues until one of the mechanisms to stop recursion kicks in.

The primary method used to stop recursion is to only check local links. A link is considered local if either: it resolves to a file reference without a scheme or host (i.e. /something), or it resolves to http://hostname/. . . and -host hostname was specified.

The second method for halting recursion is the use of specific linksets. Only HTML pages that match one or more of the linksets you specify will be checked for more links. HTML pages which don't match any of the linksets will be skipped, which means they are checked to see if they exist but none of the links inside the file are added to the list of files to check. You can also specifically -skip sets of HTML files or -limit the total number of HTML files checked.

Parsing HTML Files
 (top)  (command index)  (topic index)
These are the rules Linklint uses to extract links from HTML files.

Any tags enclosed inside of comment tags: <!-- . . . -->
or script tags: <script> . . . </script> are ignored.

The <base href=URL> tag will cause Linklint to set the base scheme, host, path, and file to the appropriate parts of URL for the remainder of the file. I've tried to emulate the behavior of the Netscape Navigator 3.0 browser. In general missing elements from the front part of a url are filled in from the base specification.

Links are extracted from the following tags:

<a href=LINK name=NAME>
<applet code=LINK codebase=BASE>
<area href=LINK>
<bgsound src=LINK>
<body background=LINK>
<embed src=LINK>
<form action=LINK>
<frame src=LINK>
<img src=LINK lowsrc=LINK dynsrc=LINK usemap=NAME>
<input src=LINK>
<map name=NAME>
<meta http-equiv=refresh content="... href=LINK">
<script src=LINK>

Tag and attribute names are case insensitive. A LINK can be bare or enclosed in single or double quotes. The characters < and > are allowed inside of a tag only if they are enclosed in single or double quotes. Arbitrary whitespace is allowed around the = sign and between a tag's name and its attributes.

Tags and/or attributes that do not match any of the above criteria are ignored.

All the links found on an HTML page are checked. Non-HTML links are checked only for existence. If a link is to an HTML file, it will also get parsed subject to the rules of recursion.

Resolving Links
 (top)  (command index)  (topic index)
In order to be able to follow links properly and to ensure that links get checked only once, all links are made absolute before they are checked. I have tried to use the same rules as a browser for making links absolute. You can use the -db3 flag to see how links get resolved. This flag causes every tag from an HTML file that contains a link to get printed out in the log file followed by the fully expanded link.

If a -host is specified, links starting with "http://host" have this text removed, creating a local link. Thus all local links will start with "/" followed by a full path from the server root to the file to be checked.

Default Index Files
 (top)  (command index)  (topic index)
Http servers treat a link to a directory followed by a "/" as a request for a default file. The server will look for a (server specific) default file in the directory and serve that up if it exists. Otherwise the server will generate a listing of all of the files and subdirectories in the directory.

Linklint emulates this behavior in local site checks by searching for its own list of default files: home.html, index.html, index.shtml, index.htm, index.cgi, wwwhome.html, and welcome.html. If none of these are found, all the files and subdirectories in the directory are checked. You can change the set of default files Linklint looks for with the -index filename option which will replace the built-in set with the file(s) you specify. On the command line each default file must be preceded with the -index flag. If all of the default files are in lowercase, the search is case insensitive. If any of the files has an uppercase letter, the search is case sensitive.
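For example, to replace the built-in list with just two default files (note that each file needs its own -index flag on the command line):

linklint -index default.html -index default.htm /@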

Server-side Image Maps
 (top)  (command index)  (topic index)
Linklint can check all links that are used in both client-side and server-side image maps. Client-side image maps are handled automatically since Linklint parses the <area href=LINK> tag in HTML files.

Server-side image maps are a little bit tricky. Some servers have the imagemap CGI software built-in so links ending in .map are treated as map files and automatically sent to the image map program for processing. Linklint mimics this behavior. Any link ending in .map is parsed as if it were a map file. In addition, all .map links are checked locally even if the -http flag is used since map files are generally not accessible directly via http.

Some servers require server-side image map links to contain the path of the CGI image map program followed by the path to the map file as in:

<a href=/cgi-bin/imagemap/dir/info.map>.

Here /cgi-bin/imagemap is the location of the image map CGI program and /dir/info.map is the location of the map file. Linklint can resolve these links and read the map file (locally only, even if -http is used). However, you must provide the path from your server root directory to your image map program using the -map option. Three common image map specifications are:

  • -map /cgi-bin/imagemap
  • -map /cgi-bin/imagemap.exe
  • -map /cgi-bin/htimage
For example, if you set "-map /cgi-bin/imagemap", the link /cgi-bin/imagemap/dir/info.map will be transformed to /dir/info.map which will be read in locally and parsed as a map file. You need to be sure to set -root properly for Linklint to be able to find the map file.
How the Status Cache Works
 (top)  (command index)  (topic index)
Linklint uses a combination of three different methods to keep track of remote URL modification times:
Last-Modified date
Many web servers let Linklint know the last date a file was modified. If this date is available for a page then Linklint uses it for keeping track of changes.
If-Modified-Since requests
If the Last-Modified is not available then Linklint tries an If-Modified-Since request. Linklint asks if the page has been modified since the last time (according to Linklint) it was checked.
Checksum of the remote file
If neither method above is available on a remote server then Linklint reads in the entire remote file, makes a checksum of its contents and uses this checksum to keep track of changes.
These methods are totally transparent to the Linklint user (you). For each URL the most efficient method is tried first, and the checksum is only used as a last resort.


Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index

Inputs describes input files and command line parameters.
Outputs describes output files created by Linklint.
Hints gives hints on how to use Linklint.
How it works explains the basic operation of Linklint.
Index contains indexes of all input parameters and topics.

Linklint is an Open Source Perl program that checks local and remote HTML links. Example:

linklint -http -host my.host.com -limit 1000 -doc dir /@

-http check site via HTTP requests (HTTP site check)
-host my.host.com check the my.host.com site
-limit 1000 bump up the file limit from 500 to 1000
-doc dir put all output files in the dir/ subdirectory
 /@ check entire site

But checking links is often more complicated than this. Even if you don't end up using Linklint, you will get a better understanding and appreciation of link checking if you read this documentation. Please read all of the documentation.

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
Linklint Documentation - inputs
Version 2.3.5 August 13, 2001

documentation
Introduction What's New Inputs Outputs Hints How it works Bugs Index
on this page
Input files · Linksets · Local-site · HTTP-site · Remote-URL · Outputs · Debug

This page shows examples of all of the command line (and command file) inputs to Linklint. There are three main modes of operation:
Local Site Checking
Checks pages and links on your site locally, looking for files on the local file system. This is convenient for small sites that you build "at home" which will later be uploaded to an HTTP server. It can also be used for very simple sites that have little CGI.

HTTP Site Checking
Checks pages and links on your site by requesting pages via HTTP, just like a browser would. This mode is less efficient than just reading directly from the file system because Linklint must make a socket connection for each page/file and your web server must respond to each request.

Remote URL Checking
The two site checking options only check pages from a single host computer. Remote URL checking can check all of the links on your site that go to other sites, or a list of specific URLs.
Input Files
 (top)  (command index)  (topic index)
There are two types of input files that can be specified on the command line: @command_files which contain command line options and @@http_files which are parsed to find http:// URLs. Command files are indicated with a single @ sign before the file name. Http files are indicated with two @ signs before the file name. [This means that the actual file name of command files cannot start with an @ sign. Oh well.]

Command Files

linklint @command_file
Reads in command line arguments from command_file. Command files can be nested. Each command file is interpreted line by line. Empty lines and lines beginning with # are ignored. Lines that start with -anything can only contain command line arguments. You can have multiple arguments on one line, and arguments can take repeated parameters in command files only. Example:

# This is a sample command file
#
-host  www.linklint.org
-root  /www/
-doc   linkdoc
-index index.html index.cgi

Reading Commands from STDIN

linklint @ < command_file
linklint @STDIN < command_file
A plain @ sign or @STDIN will cause Linklint to read STDIN and treat it as a command file. This is useful if you want to run Linklint as a configurable CGI program. If no STDIN is available then Linklint will hang waiting for an end-of-file from STDIN. You can also use this mode to "interactively" feed commands to linklint. On Unix, terminate your input with ^D.

Files of Local Pages

If you only want to check the links on one or two pages then just use the path to those pages (starting each path with "/") on the command line instead of /@:

linklint /first/page.html /second/page.html

If you have a long list of pages (on your site) that you want to have link checked (not just the existence of each page, but all of the links on each page), then put the path to each page in a command file and send that command file (with a leading @ sign):

linklint @local_pages

# local_pages
#
/first/page.html
/second/page.html
/third/page.html
# etc.
If the list of pages you want to check contains full URLs, it is very easy to write a little Perl program to strip off the scheme and host:
perl -ne "s{http://[^/]+}{} and print" full_links.in > rel_links.out

Files of Remote Links

linklint @@http_file
Check the status of all http:// references that are found in http_file. Very forgiving in looking for links. If the file looks like a remoteX.txt file generated by Linklint then failed URLs will be cross referenced.

linklint -doc linkdoc @@
When you specify @@ with no filename, Linklint will check all the http links found in the file linkdoc/remoteX.txt. You must specify a -doc directory. This is an easy way to recheck all of the remote links on your site.

Which Files to Check
 (top)  (command index)  (topic index)

Linksets defined

Whether you are doing a local site check or an HTTP site check, you specify which directories (presumably containing HTML files) to check with one or more linksets. A linkset uses two wildcard characters @ and #. Each linkset specifies one or more directories much like the standard * and ? wildcard characters are used to specify the characters in the names of files in one directory.

The @ character matches any string of characters (this kind of acts like "*"), and the # character (which is kind of like "?") matches any string of characters except "/" . The best way to understand how @ and # work is to look at a few examples:

the entire site /@
the homepage only (default) /
files in the root directory only /#
. . . and one directory down /#/#
files in the sub directory only /sub/#
files in the sub directory and below /sub/@
specific files /file1 /file2 ...
specific subdirectories /sub1/@ /sub2/@ ...

If you specify more than one linkset, files matching any of the linksets will be checked. HTML files that don't match any of the linksets will be skipped. Linklint will see if they exist but won't check any of their links.

Other File Selection Options

-skip skipset
Skips HTML files that match skipset. Linklint will make sure these files exist but won't add any of their links to the list of files to check. Multiple skipsets are allowed, but each must be preceded with -skip on the command line. Skipsets use the same wildcard characters as linksets.

-ignore ignoreset
Ignores files matching ignoreset. Linklint doesn't even check to see if these files exist. Multiple ignoresets are allowed, but each must be preceded with -ignore on the command line. Ignoresets use the same wildcard characters as linksets.

-limit n
Limits checking to n HTML files (default 500). All HTML files after the first n are skipped.
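These three options can be combined. A sketch with placeholder paths:

linklint /@ -skip /archive/@ -ignore /images/# -limit 1000

This checks the entire site, verifies that pages under /archive/ exist without following their links, never looks for files directly in the /images/ directory, and skips HTML files after the first 1000.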

Local Site Checking
 (top)  (command index)  (topic index)
If you are developing HTML pages on a computer that does not have an http server, or if you are developing a simple site that does not use Server Redirection or extensive CGI, you should use local site checking.

linklint /@
Checks all HTML files in the current directory and below. Assumes that the current directory is the server root directory so links starting with "/" default to this directory. You must specify /@ to check the entire site. See Which Files to Check for details.

linklint -root dir /@
Checks all HTML files in dir and below. This is useful if you want to check several sites on the same machine or if you don't want to run Linklint in your public HTML directory.

Other Local Site Options

-host hostname
By default Linklint assumes all links on your site that start with http:// are remote links to other sites. If you have absolute links to your own site, give Linklint your hostname and links starting with http://hostname will be treated as local files. If you specify -host hostname:port, only http links to this hostname and port will be treated as local files.

-case
Makes sure that the filename (upper/lower) case used in links inside of html tags matches the case used by the file system. This is for Windows only and is very handy if you are porting a site to a Unix host.

-orphan
Checks all directories that contain files used on the site for unused (orphan) files.

-index file
Uses file as the default index file instead of the default list used by Linklint. You can specify more than one file but each one must be preceded by -index on the command line. If a default index file is not found, Linklint uses a listing of the entire directory. See the Default File section for details.

-map /a=[/b]
Substitutes leading /a with /b. For server-side image maps or to simulate Server Redirection.
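
For example, if your server redirects /old to /new (both paths hypothetical), the following lets a local site check follow such links:

    linklint -map /old=/new /@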

-no_warn_index Turns off the "index file not found" warning. Applies to local site checking only.

-no_anchors Tells Linklint to ignore named anchors. This could ease memory problems for people with large sites who are primarily interested in missing pages and not missing named anchors. This option works for both HTTP and local site checks.
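
Putting these options together, a typical local site check might look like this (the path and hostname are hypothetical):

    linklint -root /home/user/public_html -host www.mysite.com -orphan -doc linkdoc /@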

HTTP Site Checking
If you have a complicated site that uses lots of CGI or Server Redirection, you should use HTTP site checking. Even though an HTTP site check reads pages via your HTTP server, you will get the best performance if you do your checking on a machine that has a high speed connection to your server.

linklint -http -host www.site.com /@
The -http flag tells Linklint to check HTML files on the site www.site.com via a remote http connection. You must specify a -host whenever you do an HTTP site check (otherwise Linklint won't know where to get your pages). You can specify /@ to check the entire site. See Which Files to Check for details.

HTTP Site Check Options

-http
This flag tells Linklint to perform an HTTP site check instead of a local site check. All files (except server side image maps) will be read via the HTTP protocol from your web server.

-host hostname:port
If you include :port at the end of your hostname, Linklint uses this port for the HTTP site check.

-password realm user:password
Uses user and password as authorization to enter password protected realm. Realms are named areas of a site that share a common set of usernames and passwords. If passwords are needed to check your site, Linklint will tell you which realms need passwords in warning messages. Enclose the realm in double quotes if it contains spaces. If no password is given for a specific realm, Linklint will try using the password for the "DEFAULT" realm if it was provided.
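
For example, assuming a hypothetical realm named "Members Only" with username guest and password secret:

    linklint -http -host www.site.com -password "Members Only" guest:secret /@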

-timeout t
Times out after t seconds (default 15) when getting files via http. Once data is received, an additional t seconds is allowed. The timeout is disabled on Windows machines since the Windows port of Perl does not support the alarm() function.

-delay d
Delays d seconds between requests (default 0). If you want to remote check in the background you can set delay to a large number, and Linklint will spend most of its time sleeping.

-local linkset
Gets files that match linkset locally. The default -local linkset is @.map (which matches any link ending in .map). This allows Linklint to follow links through server-side image maps. The default is ignored if you specify your own -local expressions. You need to specify the -root directory for this option to work properly.

-map /a=[/b]
Substitutes leading /a with /b. For server-side image maps or to simulate Server Redirection.

-no_query_string
Up until version 2.3.4, Linklint did not use query strings while doing HTTP site checks. Query strings were removed before making HTTP requests. As of 2.3.4 query strings in links are used in the requests. Use the -no_query_string flag to get back the "old" behavior.

-http_header "Name:value"
Adds the HTTP header "Name: value" to all HTTP requests generated by Linklint. You will need to use quotation marks to hide spaces in the header line from the command line interpreter. Linklint will automatically add a space after the first colon if there is not one there already. Multiple (unique) header lines are allowed.
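
For example (the header contents are hypothetical):

    linklint -http -host www.site.com -http_header "From: webmaster@site.com" /@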

-language zz
This option is only useful if you are checking a site that uses content negotiation to present the same URL in different languages. Creates an HTTP Request header of the form "Accept-Language: zz" that is included as part of all HTTP requests generated by Linklint. Multiple -language specifications are allowed. This will result in a single Accept-Language: header that lists all of the languages you have specified in alphabetical order. Some web sites can use this information to return pages to you in a specific language.

If you need to get more complicated than this, use the more general purpose -http_header to create your own header. There is a partial list of language abbreviations (taken from Debian) included as part of the Linklint documentation.
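
For example, to request German and French versions of pages with the -language option:

    linklint -http -host www.site.com -language de -language fr /@

This adds a single Accept-Language header listing both languages to every request.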

Remote URL Checking
A remote URL check is used to see if a remote URL exists (or has been recently modified). Links in the remote pages are not checked nor does Linklint look for named anchors in remote URLs.

Which URLs to check

Remote URL checking can be used to check all of the "remote" links on your site (those that link to pages on other sites) or it can check a list of URLs. There are several ways to specify which remote URLs to check:

linklint http://somehost/file.html
Checks to see if /file.html exists on somehost. Multiple URLs can be entered on the command line, in an @commandfile, or in an @@httpfile. Every URL to be checked must begin with http://. This will disable site checking.

linklint @@httpfile
Checks all the remote http URLs found in httpfile. Anything in the file starting with http:// is considered to be a URL. If the file looks like a remoteX.txt file generated by Linklint then all failed URLs will be cross referenced.

linklint @@ -doc linkdoc
Assuming you have already done a site check and used "-doc linkdoc" to put all of your output files in the linkdoc directory, Linklint will check all the remote links that were found on your site and cross reference all failed URLs without doing a site check. You can use the -netmod or -netset flags to enable the status-cache.

linklint -net [site check options]
The -net flag tells Linklint to check all remote links after doing either a local or HTTP site check. If you are having memory problems, don't use the -net option; instead use one of the @@ options above.
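
For example (the linkdoc directory name is hypothetical):

    linklint -net -doc linkdoc /@

This does a local site check, then checks every remote link that was found, and writes all results to the linkdoc directory.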

Other Remote URL Options

-timeout t
Times out after t seconds (default 15) when getting files via http. Once data is received, an additional t seconds is allowed. The timeout is disabled on Windows machines since the Windows port of Perl does not support the alarm() function.

-delay d
Delays d seconds between requests to the same host (default 0). This is a friendly thing to do especially if you are checking many links on the same host.

-redirect
Checks for <meta> redirects in the headers of remote URLs that are html files. If a redirect is found it is followed. This feature is disabled if the status cache is used.

-proxy hostname[:port]
Sends all remote HTTP requests through the proxy server hostname and the optional port. This allows you to check remote URLs or (new with version 2.3.1) your entire site from within a firewall that has an http proxy server. Some error messages (relating to host errors) may not be available through a proxy server.

-concise_url
Turns off printing successful URLs to STDOUT during remote link checking.

Status Cache Options

The Status Cache is a very powerful feature. It allows you to keep track of recent changes in all of the remote (off-site) pages you link to. You can then use the Linklint output files to quickly check changed pages to see if they still meet your needs.

The flags below make use of the status cache file linklint.url (kept in your HOME or LINKLINT directory). This file keeps track of the modification dates of all the remote URLs that you check.

-netmod
Operates just like -net but makes use of the status cache. Newly checked URLs will be entered in the cache. Linklint will tell you which (previously cached) URLs have been modified since the last -netset.

-netset
Like -netmod but also resets the last modified status in the cache for all URLs that checked ok. If you always use -netset, modified URLs will be reported just once.

-retry
Only checks URLs that have a host fail status in the cache. Sometimes a URL fails because its host is temporarily down. This flag enables you to recheck just those links. An easy way to recheck all the cached URLs with host failures is linklint @@ -retry. Use linklint @@linkdoc/remoteX.txt -retry if you want failed URLs to be cross referenced.

-flush
Removes all URLs from the cache that are not currently being checked. The -retry flag has no effect on which URLs are flushed.

-checksum
Ensures that every URL that has been modified is reported as such. This flag can make the remote checking take longer. Many of the pages that require a checksum are dynamically generated and will always be reported as modified.

-cache directory
Reads and writes the linklint.url cache file in this directory. The default directory is set by your LINKLINT or HOME environment variables.
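
A sketch of a typical status-cache workflow (the directory name is hypothetical):

    linklint -netset -doc linkdoc /@
    linklint @@ -doc linkdoc -netset

The first command checks the site and its remote links and primes the cache; rerunning the second command later rechecks the remote links and reports only the URLs modified since the previous -netset.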

Output Options
No output files are generated by default; only progress and a brief summary of the results are printed to the screen. You can produce complete documentation (split up into separate files) in a -doc directory, put selected output in a single -out file, or redirect the standard output to a file. See the Output File Specification section for a detailed description of all output files.

Multi File Output

linklint -doc linkdoc
Sends all output to the linkdoc directory. The output is divided into separate .txt and .html files. Complete documentation is always produced regardless of the single file flags.

The file index.txt contains an index to all the other files; index.html is an HTML version of the index. The index files for remote URL checking are urlindex.txt and urlindex.html.

-textonly
Prevents any HTML files from being created in the -doc directory.

-htmlonly
Erases redundant text files in the -doc directory after they have been used to create the HTML output files. The files remote.txt and remoteX.txt are not erased since they can be used by Linklint to recheck remote URLs.

-docbase base
Overrides the default base expression used for directing a browser to the resources listed in the output HTML files. The base is prepended to local links in the output HTML files. This only affects the links in HTML output files, it has no effect on what is displayed in these files. Ordinarily this flag would only be used during a local site check to set the base to http://host.
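
For example, to do a local site check but have the links in the HTML output point at your live server (hostname hypothetical):

    linklint -doc linkdoc -docbase http://www.mysite.com /@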

-output_frames
All HTML output data files are linked to from index.html. If you use this flag then the data files will be opened in a new frame (window), which can be handy in some cases since it always leaves the index.html file open in its own window.

-output_index filename
The output index files were previously named linklint.txt and linklint.html. These have now been changed to index.txt and index.html. You can use the -output_index option to change this name back to linklint or to something else.

-url_doc_prefix url/
By default, the output files associated with remote URL checking all start with "url". You can change this with the -url_doc_prefix option. If the url_doc_prefix contains a "/" character then the appropriate directory will be created (as a subdirectory of the -doc directory).

-dont_output xxxx
Don't create output files whose names contain "xxxx". Can be repeated. Example: -dont_output "X$" will suppress the output of all cross reference files.

Single File Output

linklint -error > linklint.out
Lists all errors to linklint.out. Progress and summary information will not be included. You can get cross referenced lists with the -xref flag or lists sorted by the files containing errors with the -forward flag.

linklint -error -out linklint.out
Lists all errors and a brief summary to linklint.out. You can get cross referenced lists, etc., as in the example above.

-out file sends list output and summary information to file
-list lists all found files, links, directories etc.
-error lists missing files and other errors
-warn lists all warnings
-xref adds cross references to the lists
-forward sorts lists by referring file
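
For example, to save a cross-referenced list of errors and warnings for the whole site in one file (the filename is hypothetical):

    linklint -error -warn -xref -out check.txt /@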

Debug and other Flags

Debug Flags

-db1 debugs command line input and linkset expressions
-db2 prints the name of every file that gets checked (not just HTML files)
-db3 debugs HTML parser, prints out tags and resulting links
-db4 debugs socket connection (kind of)
-db5 not used
-db6 details last-modified status for remote URLs (requires -netset or -netmod)
-db7 prints brief debug information while checking remote URLs
-db8 prints all http headers while checking remote URLs
-db9 generates random http errors

Other Flags

Use linklint with no command line arguments to get simple usage.

-version Gives version information.
-help Lists a few simple examples of how to use Linklint.
-help_all Lists all help (contained in program) including every input option.
-quiet disables printing progress to the screen
-silent disables printing summaries to the screen


Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
linklint-2.3.5.orig/doc/small/language.html0100664000175000017500000001242607341361022020436 0ustar barbierbarbier Linklint Documentation - Country/Language Codes
Version 2.3.5 August 13, 2001

Here is a partial list of ISO 639 language codes (some with an ISO 3166 country code suffix, as in zh-cn) that are used in HTTP requests to specify which language or languages to use when returning pages that are available in multiple languages.

This list was lifted off of the Debian home page.

ca català
da dansk
de Deutsch
en English
es Español
eo Esperanto
hr hrvatski
it Italiano
hu magyar
nl Nederlands
no norsk
pl polski
pt Português
ro română
fi suomi
sv svenska
tr Türkçe
zh-cn 中文 (China)
zh-hk 中文 (Hong Kong)
zh-tw 中文 (Taiwan)

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
linklint-2.3.5.orig/doc/small/license.html0100664000175000017500000004311107341361022020270 0ustar barbierbarbier GNU General Public License
Version 2, June 1991

      Copyright (C) 1989, 1991 
      Free Software Foundation, Inc.
      59 Temple Place, Suite 330, 
      Boston, MA  02111-1307  USA

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.

TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.

3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:

a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.

If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS
Copyright (C) 1989, 1991 Free Software Foundation, Inc.
linklint-2.3.5.orig/doc/small/linklint.gif0100664000175000017500000000227607341361022020302 0ustar barbierbarbier [binary GIF image data]
linklint-2.3.5.orig/doc/small/new.html0100664000175000017500000002350207341361022017441 0ustar barbierbarbier Linklint Documentation - What's New
Version 2.3.4 August 8, 2001


Few changes. Almost 100% backward compatible (output index files are now index.* by default instead of linklint.*). Existing command files should still work.

Query string support for HTTP site checking has changed as of 2.3.4. This may change how Linklint checks your HTTP site if you have dynamic content. It now does "the right thing" by default. Use the -no_query_string flag to get back the old behavior.

  • Linklint is free again
  • Better query string support (as of 2.3.4)
  • Proxy support for site checking
  • Output files are more configurable
  • Input from STDIN
  • Beta SSL support now available
  • Looks for Java Archive .jar files

GNU General Public License

Linklint started out free. It became shareware when the flood of comments, suggestions and questions started turning it into a "full-time" job. Thanks to all who contributed! With the rise (and love) of Linux, we wanted to make it Open Source in a way that would not make previous contributors feel ripped off.

There was a gradual (unadvertised) change to Open Source. First, it was made free for use on Open Source operating systems. Then a GPL Open Source version was distributed as a part of the Meta Web Language system and shareware checks were no longer cashed. With the new Linklint web site, it is only being distributed as Open Source via the GNU General Public License.

Better Query String Support

Query strings were suppressed during site checks. This can be changed by commenting out one line in older versions. As of 2.3.4, query strings are used in HTTP site checks. To get the old behavior use the new -no_query_string flag.

If you have an older version of Linklint, you can get the new query string behavior by commenting out the one line that includes the comment "strip query string". This also applies to the 2.4.beta version.

Proxy Support for Site Checking

Previously, proxy support was only for remote-URL checking. Now proxy support has been extended to handle site checking as well. The conflict between proxies and virtual hosts has been resolved.

Output Options

-output_index xxxx The index output files were previously named linklint.* in order to prevent overwriting existing index.html files if the -doc directory collided with an existing HTML directory. By popular demand, these file names have been changed to index.*. This is more convenient for most, and a little more dangerous for a few. You can change back to linklint (or whatever) with the -output_index option.

-output_frames Uses <base> tags in the output files so that new browser windows will open when following links in the HTML output files. This prevents having to reload large output HTML files.

-url_doc_prefix some_prefix Gives control over the prefix of all output files associated with Remote-Url checking.

-dont_output xxx Suppress output of files that match /xxx/. In the past, I have told people to comment out lines in the program to suppress the generation of certain output files.

-no_warn_index Turns off the "index file not found" warning. Applies to local site checking only.

-concise_url Turns off printing successful URLs to STDOUT during remote link checking.

Read inputs from STDIN (keyboard)

@ or @STDIN will cause Linklint to read from STDIN (keyboard) as if it were an @command file. Great when you are using Linklint from the shell and run out of space on the command line. Might make running Linklint as a CGI program a little easier.
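
A minimal sketch (the hostname and linkset are hypothetical):

    echo "-http -host www.site.com /@" | linklint @

The piped text is processed exactly as if it were the contents of an @command file.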

Other New Flags

-help_all -version -license are all pretty obvious.
-no_query_string Don't use query strings in HTTP site checks
-http_header Xxx:Yyy Add header lines to HTTP requests.
-language zz Add Accept-Language header line to HTTP requests.

Beta Version with SSL Support

Now that someone else (Sampo Kellomäki) has done the heavy lifting of providing a low-level Perl interface to the OpenSSL package, we wrote a simple wrapper module around his Net::SSLeay module to provide SSL support in Linklint.

Be warned, this beta version requires the OpenSSL package and the Net::SSLeay module (and the Net::SSLeay::Handle wrapper) in order to run. You will need to install these first which is fairly easy (if you have root) on Linux. Your Win32 mileage may vary.

Once installed, you use -http to check HTTP sites and -https to check HTTPS sites.
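
For example (hostname hypothetical):

    linklint -https -host secure.site.com /@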

Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
linklint-2.3.5.orig/doc/small/outputs.html0100664000175000017500000006241707341361022020403 0ustar barbierbarbier Linklint Documentation - outputs
Version 2.3.5 August 13, 2001


This page provides a detailed description of every output file created by Linklint.
-Doc Directory
If a -doc directory is specified, all output files are written in that directory. The directory will be created if it does not already exist. There are two sets of files that are created, one is for site checks and the other is for remote url checks.

Even though new files are written only as needed, all previously written files from each set are erased before new files from the set are written. Each set is independent of the other. The site-check files will get erased only if you do another site check. The url-check files will be erased only if you check remote URLs again.

Many of the output files now come in HTML and text versions. The HTML versions look like the text versions but contain hyper-links to the resources they refer to. Files that come in both versions are listed without an extension.

Some of the output files are part of a family. Members of the family are related using the simple mnemonic of a trailing X meaning cross referenced and a trailing F meaning forward referenced. For example: error lists all missing files, errorX lists the missing files and all HTML files that reference them, and errorF lists all HTML files that contain links to missing files along with the names of the missing files.

Site-Check Output Files
During local site checking, local links ending in "/" are kept as-is to maintain consistency with HTTP site results. However, if a default index file was found, it is listed below the link in square brackets. If a directory listing is used because no default file was found, this information is listed in the brackets.

Any link that Linklint knows has been redirected (either by the http server or with the -map option), gets an extra line in the output files showing the original link it was mapped from in parentheses. See the Server Redirection section for more details.

Site-Check Summary Files

index.html
a hyperlinked index to all site-check files created. On Windows machines this file will be named linklint.htm. This list differs from the summary list in two ways. First, it contains entries for every file written including cross referenced lists and forward referenced lists that are missing in the summary. Second, it lacks the detailed divisions by file type and external link schema that exist in the summary.

index.txt
text version of the index file.

summary.txt
summary of site-check results, similar to what gets printed to the screen. Every non-empty data family (listed below) will cause a line to be printed in the summary. In addition, external links are listed by the following schema: http, https, ftp, javascript, mailto, gopher, file, news, view-source, about, unknown. Found and missing files are listed by file type:

cgi starting with "/cgi-bin" or containing "?" or .cgi or .pl
default index ending with "/"
HTML ending with .htm .html or .shtml
map ending with .map
image ending with .gif .jpg .jpeg .tif .tiff .pic .pict .hdf .ras .xbm
text ending with .txt
audio ending with .au .snd .wav .aif .aiff .midi .mid
video ending with .mpg .mpeg .avi .qt .mov
shockwave ending with .dcr
applet ending with .class
other all other files

These distinctions sort files (mostly) by extension; they are not true indications of the tag the link was found inside, nor of the MIME format returned by your server.

log.txt
log of site-check progress, similar to what gets printed to the screen but also includes extra output created when -db flags are used.

Site-Check Data Files

action, actionX
a list of all ignored actions. Any link found inside a <form action=LINK> tag is listed here, but it is not checked since normally extra input from the form is required.

anchor, anchorX
a list of all named anchors found. The anchorX file lists all files that use each named anchor. If a named anchor is not referenced by any of the files checked by Linklint, it will have no cross references. Named maps are also included in this list.

case, caseX, caseF
If -case was used during a local site check on a Windows machine, all files that have references that do not match the (upper/lower) case of the file are listed here. This is handy if you are developing a site on a Windows machine that you are planning to port to a Unix server.

dir.txt
a list of all the directories that contain files used by the site. Only created during local site checking.

error, errorX, errorF
a list of all missing files. A file is listed as missing if:

  1. it is referenced by a file checked by Linklint, and
  2. it is a local file, and
  3. it does not match any -ignore expression, and
  4. Linklint could not find the file locally, or an error occurred when Linklint tried to get the file via http.
See the Parsing Html section for details.

errorA, errorAX
a list of missing named anchors. A named anchor is missing if:

  1. it is referenced by a file checked by Linklint, and
  2. it is located inside of a local file, and
  3. the file it is in is not -ignored or -skipped, and
  4. Linklint could not find the file, or the file was found but the named anchor does not exist.

errorM, errorMX
a list of missing named client-side image maps. The rules for inclusion in this list are the same as those for missing named anchors.

file, fileX, fileF
a list of all files found on the site. file is a list of all files found sorted by file type; fileX is a cross referenced version, showing a sublist of all the HTML files that reference each file in the list; fileF lists each HTML file and a sublist of all of the links it references. These lists are meant to show file dependencies so multiple links to the same file result in a single listing. Likewise, a named anchor causes the file containing the anchor to be listed, but the actual named anchors are listed separately.

httpfail
a list of all the http errors that occurred while trying to remote check a site.

httpok
a list of all the files that were obtained without error while remote checking a site. If the -db6 flag is used and the status-cache is enabled, these entries are expanded to include the following extra information:

  • ok (200)
  • ok parsed HTML
  • ok skipped

ignore, ignoreX
a list of all ignored files.

imgmap, imgmapX
a list of named image maps for client-side image maps, taken from tags <img usemap=NAME> and <map name=NAME>. The format of this list is the same as the one used for named anchors.

mapped
a list of all the redirected files that were found while checking a site either locally or remotely.

Some servers automatically change the name of a link. For example a link to http://host/subdir will get automatically mapped to http://host/subdir/ if subdir is a directory. Many other mappings are possible. Linklint will follow these mappings and treat the resulting file as the actual link (just like a browser would).

This is a potential cause for confusion. If your index.html file has a link to A.html and this gets mapped to B.html, Linklint will tell you that index.html has a link to A.html and that it has a link to B.html, but only B.html will be listed as a found file.

See the Server Redirection section for more details.

orphan
If the -orphan flag was used during a local site check, all of the unused files and subdirectories in each directory that contain files used by the site, are listed here sorted by directory. If an orphan HTML file contains a meta refresh tag redirecting the visitor to a different file, this new file is listed under its parent preceded by " =>". This method of redirection is often used to steer visitors to the current version of a file without requiring them to change their bookmarks.

remote, remoteX
a list of all references that are not to local files. The remoteX file lists which HTML files link to these resources.

skipped, skipX
a list of all files that were skipped by Linklint. These are generally HTML files which were found to exist but were not checked further because:

  1. they did not match any of the linksets specified, or
  2. they did match one of the -skip expressions, or
  3. more than -limit files had already been checked.

warn, warnX, warnF
a list of all warnings that were generated during the site check. Warnings include: unexpected I/O errors, HTML errors such as unterminated comments, missing index files (during local site checks), space characters inside of links, the use of "\" inside of links, files that are not world readable, mappings that cause infinite loops, and meta refresh tags that redirect to relative URLs.

For HTTP site checking warnings are also generated for: files disallowed by robots.txt, files mapped to a different server, files that require a username and password (and none was provided), an invalid username and password, and files mapped to non-http schemes.

Url-Check Output Files
You may notice that all of the files in this section start with the prefix url. You can change this prefix with the -url_doc_prefix option. The default value is still url for backward compatibility, but I now prefer either url/ or url_.

Url-Check Summary Files

urlindex.html
a hyperlinked index to all url-check files created. On Windows machines this file will be named urlindex.htm. This list differs from the summary list in two ways. First, it contains entries for every file written including: host failures, cross referenced lists, and forward referenced lists. Second, it lacks the detailed divisions by failure type that exist in the summary.

urlindex.txt
text version of the index file.

urlsum.txt
summary of url-check results, similar to what gets printed to the screen. The summary list includes an entry for every type of warning or failure.

urllog.txt
log of url-check progress, similar to what gets printed to the screen but also includes extra output created when -db flags are used.

Url-Check Data Files

urlfail, urlfailX, urlfailF
a list of URLs that failed due to one of the following errors:

  • could not find ip address
  • could not connect to host
  • all timeout errors
  • had no content (204)
  • bad request (400)
  • access forbidden (403)
  • not found (404)
  • internal server error (500)
  • service not implemented on server (501)
  • server temporarily overloaded (502)
  • gateway timeout (503)

urlhost.txt
a list of all hosts that had failures during the url check. This includes:

  • could not find ip address
  • could not connect to host
  • could not open socket
  • malformed status line
  • timeout errors
  • server overloaded (502)
  • gateway timeout (503)
URLs that fail due to host failures can be retried with the -retry flag if the status-cache was enabled.

urlmod
a list of all URLs that have changed since the last time they were checked with the -netset flag. This list is only generated if -netmod or -netset are specified.

urlmoved
a list of URLs that were reported as redirected by their server. Often this redirection involves nothing more than adding a trailing "/" to a directory name. Sometimes it can be a precursor to a site changing location permanently. Some servers report that the url has been moved temporarily, others will say that the url has been moved permanently. As far as I can tell there is no real distinction between "temporary" and "permanent"; they seem to be used interchangeably.

urlok
a list of all URLs that were found with no errors. If the -db6 flag is used and the status-cache is enabled, these entries are expanded to include the following extra information:

  • ok (200)
  • ok not modified (304)
  • ok last-modified date unchanged
  • ok did not compute checksum
  • ok checksum matched

urlskip
a list of URLs that were not checked. This is most often caused by the lack of password authorization but could also be due to an exceptional condition such as an infinite redirect loop or an unknown internal Linklint error.

urlwarn, urlwarnX, urlwarnF
a list of warning messages generated while Linklint was doing a url-check. The most common warning tells you the name of a realm which requires a username and password.


Checked by Linklint © Copyright 1997 - 2001 James B. Bowlin
linklint-2.3.5.orig/CHANGES.txt0100664000175000017500000001520207341361011015712 0ustar barbierbarbier
Linklint Change History
=======================

Version 2.3.5 August 13, 2001
-----------------------------
o added -no_anchors tag (for larger sites)
o fixed bug that prevented site checks of some non port 80 sites.
  (Thanks Rick Perry).

Version 2.3.4 August 8, 2001
----------------------------
o s!//+!/!g inside of UniqueUrl()
o added -http_header and -language options
o added .php and .jar files to file type list
o Look for applet .jar files (archive=*.jar) files
o Default is to use query strings in HTTP site check
o -no_query_string changes back to "old" behavior (no query strings
  are used in HTTP site checks).

Version 2.3.3 July 6, 2001
--------------------------
o added 2nd arg to mkdir() for url doc directory

Version 2.3.2 June 22, 2001
---------------------------
o -no_warn_index for missing index file warnings
o -concise_url suppress STDOUT output of valid remote links

Version 2.3.1 June 21, 2001
---------------------------
o site check proxy support: removed conflict with -proxy, moved proxy
  support to Request() so it works w/ site checks as well as remote URL's.

Version 2.3.0 June 3, 2001
--------------------------
o moved home site and email address
o added -help_all -version -license "@"
o updated to GPL

Version 2.2 January 6, 2000
---------------------------
o GNU GPL
o -output_frames
o -help_all
o @ and @STDIN
o -version
o -url_doc_prefix
o -dont_output xxx
o linklint.txt, linklint.html => index.txt, index.html

Version 2.1 July 24, 1997
-------------------------
o Added html output files. New flags: -docbase, -textonly, -htmlonly
o Changed format of orphans.txt to allow hyperlinks to the orphans
o removed "text after space" warning.
o added -cache flag to control directory of status cache file

Version 2.0.22 July 22, 1997
----------------------------
o bind() local host to socket before connect() fixed a bug on some
  systems that were getting a Bus Error on the next remote url after
  timing out connecting to host
o fixed -out file when only remote urls are checked
o @@ without a filename defaults to linkdoc/remoteX.txt where
  "linkdoc" is the -doc directory.
o added -redirect flag to check for redirected remote urls
  (This is disabled when using the modified cache.)
o now cross reference url failures when @@remoteX.txt is input
o removed urltry.txt files, all failed urls are in urlfail.txt files

Version 2.0.21 July 21, 1997
----------------------------
o fixed a bug that caused erroneous "file not found" errors on some
  files on some Apache Servers running HTTP/1.1
o @@files can now read in HtTp:// (scheme is now case insensitive)
o ignore comments when parsing for redirects
o now allow "--!>" as well as standard "-->" for closing comment.

Version 2.0.20 July 20, 1997
----------------------------
o now convert all linkspecs to lowercase on Windows (unless -case)
o allow linksets to .htm, .html without leading /.
o expanded html documentation
o (-timeout t) allow an additional t seconds to read data
o added -proxy flag for remote url proxy server (beta)
o only print out tags with links in -db3 mode
o added -redirect for checking meta redirects in remote urls (beta)

Version 2.0.19 July 17, 1997
----------------------------
o relaxed requirements for well formed http status line
o now print out malformed status lines in -db8 mode

Version 2.0.18 July 16, 1997
----------------------------
o added named maps to named anchor list
o fixed problem reporting errors for links containing \n
o added logo gif to distribution
o now ignore anchors outside of parse routine

Version 2.0.17, July 14, 1997
-----------------------------
o fixed typos and errors in the documentation. No changes to the code.

Version 2.0.16, July 14, 1997
-----------------------------
o fixed problem of reporting all subdirs as orphans (Ed Greenberg).
o changed format of -password to -password realm user:password.
o allow a DEFAULT password realm.
o beefed up warning messages for passwords.
o full timeout logic implemented for url checking on Unix.
o files excluded by robots.txt are now skipped instead of missing.
o process named anchors to mapped files correctly.

Version 2.0.15, July 4, 1997
----------------------------
o added -password realm:user:password for authorization
o allow optional :port at end of -host hostname
o redid tag logic to emulate (weird) Netscape 3.0 behavior.
o urlwarn.txt now contains warnings like warn.txt file.
o changed many output file names: file_x.txt -> fileX.txt etc.

Version 2.0B14, July 3, 1997
----------------------------
o fixed bug introduced in 2.0B11 of adding basepath to external links.
o now resolve named anchors globally to ignore ones in skipped files.

Version 2.0B12, July 2, 1997
----------------------------
o added a warning for spaces in urls
o changed warning print out: now have warn_x.txt and warn_f.txt
o improved printout of failed urls: added urlfailF.txt

Version 2.0B11, July 2, 1997
----------------------------
o ignore text between which changed some of the parser logic and made
  it a tad slower.
o stop recursion into default directories in local site check
o print out relative directories instead of absolute
o save url progress to urllog.txt
o changed some of the -db flags, improved -db3 printout
o renamed linklint.doc to linklint.txt

Version 2.0B10, June 30, 1997
-----------------------------
o allow named anchors to links that contain queries
o truncate progress on screen to 79 chars/line
o save full progress to dir/log.txt if -doc dir is specified
o changed -log to -out
o default -limit is now 500
o use -host as default basehost
o added schema: 'javascript', 'view-source', 'about'
o check for cgi files first in printing out lost+found lists
o fixed a glitch that did not add basehost to http:file

Version 2.0B9, June 29, 1997
----------------------------
o strip off query before getting basepath/basefile. (Ed Greenberg)
o print out anchor.txt, anchor_x.txt

Version 2.0B8, June 29, 1997
----------------------------
o fixed another Unix/Windows anomaly. Unix would recognize -d "sub/"
  as a directory. Now I only append a "/" if the link does not end in "/".
o only print extra map information if a link has actually been mapped.

Version 2.0B7, June 27, 1997
----------------------------
o fixed serious Unix problem of not trying -d on files if -f succeeds.
  This prevented linklint from recursing into directory links that did
  not end in "/".
o changed printout of local directories to full path w/o trailing /
o changed -unused to -orphan
o changed -ashttp to -http
o changed -aslocal to -local
o added help.html documentation

linklint-2.3.5.orig/INSTALL.unix0100664000175000017500000000317607341361011016123 0ustar barbierbarbier
Unix Installation
=================

Quick Install
-------------
    tar zxvf linklint_X.X.tgz
    cd linklint-X.X.X
    cp linklint-X.X.X /usr/local/bin/linklint

    -- or --

    cp linklint-X.X.X /usr/local/bin/
    cd /usr/local/bin; ln -s linklint-X.X.X linklint

Detailed Installation Instructions
----------------------------------

1. Choose a directory in which to install Linklint. If you have root
   privileges /usr/local/src/ would be a reasonable place:

       # cd /usr/local/src/

   otherwise, you might want to install Linklint in your home directory:

       $ cd

2. Download linklint-X.X.X.tar.gz. Say ok to warnings from the browser
   and choose the "Save File" option.

3. Unzip and untar the distribution:

       $ tar -zxvf linklint-X.X.X.tar.gz

4. Change into the directory just created:

       $ cd linklint-X.X.X/

5. Copy linklint to a directory on your path. If you have root
   privileges then /usr/local/bin is a good place:

       # cp linklint-X.X.X /usr/local/bin/linklint

   otherwise you might try putting linklint in your ~/bin/ directory:

       $ cp linklint ~/bin/

   You can instead use a symbolic link as in:

       # cp linklint-X.X.X /usr/local/bin
       # cd /usr/local/bin
       # ln -s linklint-X.X.X linklint

If you have problems putting linklint into a directory on your path,
you can always run linklint as a Perl script:

    $ perl linklint [options]

If your Perl program is not located at /usr/bin/perl and you want to
run linklint as a command then you will have to change the first line
of linklint to point to your Perl program. Use the command "which perl"
to find out where your Perl program resides.

Enjoy!

linklint-2.3.5.orig/INSTALL.windows0100664000175000017500000000225007341361011016624 0ustar barbierbarbier
Windows Installation
====================

1. Create a directory for the linklint distribution files.

2. Download the linklint.zip file. Say ok to warnings from the browser
   and choose the "Save File" option.

3. Unzip the distribution:

       pkunzip linklint.zip

Optional:

4. Copy the linklint file and batch file to a place on your path.

5. Edit the batch file changing "\bin" to the directory containing
   linklint.

6. Set environment variable LINKLINT or HOME to the directory where you
   want linklint to save the modified cache file "linklint.url".

You need to run linklint in a DOS window. You can always run linklint
as a Perl script:

    > perl linklint [options]

You can run it as a command using the batch file.

NOTE: All local links on Windows systems are converted (internally) to
lowercase to ensure that links are listed uniquely. You can use the
-case flag to prevent this from happening in which case linklint will
make sure that your usage of (upper and lower) case matches the default
names in the file system. This is useful for porting a Windows site to
a Unix machine.

Enjoy!

linklint-2.3.5.orig/LICENSE.txt0100664000175000017500000003543407341361011015735 0ustar barbierbarbier

                    GNU GENERAL PUBLIC LICENSE
                       Version 2, June 1991

 Copyright (C) 1989, 1991 Free Software Foundation, Inc.
 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). 
Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. 
You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. 
If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH
YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
NECESSARY SERVICING, REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR
DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL
DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM
(INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED
INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF
THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR
OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

                     END OF TERMS AND CONDITIONS
linklint-2.3.5.orig/READ_ME.txt0100664000175000017500000000335107341361011015740 0ustar barbierbarbier
Linklint - a fast link checker and web site maintenance tool
-------------------------------------------------------------
Version 2.3.5 August 13, 2001

Copyright (C) 1997 -- 2001 James B. Bowlin. All rights reserved.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
(LICENSE.txt) along with this program; if not, write to

    Free Software Foundation, Inc.
    59 Temple Place - Suite 330
    Boston, MA 02111-1307, USA

CONTENTS
--------

    CHANGES.txt      History of changes
    INSTALL.unix     Unix installation
    INSTALL.windows  Windows installation
    LICENSE.txt      GNU General Public License
    READ_ME.txt      this file
    linklint.bat     batch file for Windows
    doc/             HTML documentation
    doc/small/       HTML documentation (with smaller fonts)
    linklint-X.X.X   The Linklint program

INSTALLATION
------------

Linklint is a plain old Perl program, so you can always run it (from
the command line) as:

    perl linklint-X.X.X [parameters]

You need to have Perl installed on your system (version 5.004 or
greater). For convenience (on Unix), you may want to rename it to
"linklint" and move it to a directory that is on your path. See the
file INSTALL.unix or INSTALL.windows for more details.

Enjoy.

-- Jim Bowlin
linklint-2.3.5.orig/linklint-2.3.50100775000175000017500000040035507341361012016326 0ustar barbierbarbier
#!/usr/bin/perl -- # -*- perl -*-
#==========================================================================
# linklint - a fast link checker and web site maintenance tool.
# Copyright (C) 1997 -- 2001 James B. Bowlin. All rights reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to # # Free Software Foundation, Inc. # 59 Temple Place - Suite 330 # Boston, MA 02111-1307, USA # # Notice, that ``free software'' addresses the fact that this program # is __distributed__ under the term of the GNU General Public License # and because of this, it can be redistributed and modified under the # conditions of this license, but the software remains __copyrghted__ # by the author. Don't intermix this with the general meaning of # Public Domain software or such a derivated distribution label. # # Linklint is a total rewrite of Rick Jansen's 4/15/96 version of webxref. # # Thanks to Celeste Stokely, Scott Perry, Patrick Meyer, Brian Kaminer # David Hull, Stephan Petersen, Todd O'Boyle, Vittal Aithal (and # many others!) for many excellent suggestions. # # Bugs, comments, suggestions welcome: bugs@linklint.org # Updates available at http://www.linklint.org # # RECENT CHANGES (see CHANGES.txt for full list): # # Version 2.3.5 August 13, 2001 # ----------------------------- # o added -no_anchors tag (for larger sites) # o fixed bug that prevented site checks of # some non port 80 sites. (Thanks Rick Perry). # # Version 2.3.4 August 8, 2001 # ---------------------------- # o keep query string for http site checks # o added no_query_string option to disable above # o added "php" and "jar" file types # o look for .jar files from applet tags # o s!//+!/!g inside of UniqueUrl() # o -http_header and -language options # # Version 2.3.3 July 6, 2001 # --------------------------- # o added 2nd argument to mkdir() on line 921 # o for creating url doc directory # # Version 2.3.2 June 22, 2001 # --------------------------- # o -no_warn_index for missing index file warnings # o -concise_url flag to suppress output of valid remote links # on STDOUT # # Version 2.3.1 June 21, 2001 # --------------------------- # o unified -proxy support (no conflict w/ virtual hosts now) # and moved it to Request() so we should support proxies # for site checking # # Version 2.3.0 June 3, 2001 # -------------------------- # o moved home site and email address # o added -help_all -version -license "@" # o updated to GPL # #======================================================================== $version = "2.3.5"; $date = "August 13, 2001"; $prog = "linklint"; $Usage1 = < redirects in remote urls. -proxy host[:port] Send remote http requests through a proxy server. -password realm user:password Authorize access to "realm". -concise_url Suppress STDOUT output of valid remote links. Remote Status Cache: -netmod Put urls in cache and report modified status. -netset ... and update last modified status in the cache. -retry Only check urls that had host failures. -flush Remove urls from cache that aren't currently being checked. -checksum Exaustive check of modified status. -cache dir Read/save "linklint.url" cache file in this directory. Output: -quiet Don't print progress on screen. -silent Don't print summary on screen. -docbase exp Overrides defaults for linking html output back to site. -textonly Only write .txt files in -doc directory. -htmlonly Erase .txt files after .html output files are written. 
-output_frames Have index.html open a new window when viewing data files -url_doc_prefix p Prefix Remote URL output files with "p" instead of "url". -output_index xxx Output index files will be "xxx.txt" and "xxx.html". -dont_output xxx Don't create output files that contain xxx. Debug Flags: -db1 Debug input, linkset expressions. -db2 Show every file that gets checked (not just html). -db3 Debug parser. Print tags and links found. -db4 Debug socket connections -db5 not used -db6 Detail last-modified status for remote urls. -db7 Print brief debug information checking remote urls. -db8 Print headers while checking remote urls. -db9 Generate random http errors. @cmndfile Read command line options from "cmndfile". @@file Check status of remote http links in "file". HELP2 $Examples = << 'EXAMPLES'; Examples: 1) linklint -doc linkdoc -root dir Checks home page. Output files go in "linkdoc" directory. 2) linklint -doc linkdoc -root dir /@ Checks all files under "dir". Output files go in "linkdoc" directory. 3) linklint -doc linkdoc -root dir /@ -net Checks site as (2). Then checks all http links found in site. 4) linklint -doc linkdoc -root dir /# Like (2) but only checks files in the root directory. 5) linklint -doc linkdoc -host host /@ -http Same as (2) but checks site using http instead of file system. 6) linklint -doc linkdoc @@linkdoc/remote.txt Check remote link status without rechecking entire site. EXAMPLES $ErrUsage = < LINK $scheme:$_\n\n"; return "$scheme:$_"; }; }; s#^//([^/]*)## && ($host = $1); # specified host ($scheme && $host) || ($scheme = $BaseScheme) && do { $host || $BasePath || m#^/# || ($host = $_, $_ = ''); $host || ($host = $BaseHost) && m#^/# || ($_ = $BasePath . $_); }; (($host && $host ne $ServerHost) || ($scheme && $scheme !~ /http/i )) || do { m#^/# || ($_ = $CurPath . $_); s/\#.*$//; # strip local anchor $local++; $Use_QS or s#\?.*$##; # strip query string }; m/&/ && do { s/&/&/g; # expand & etc s/<//g; s/"/"/g; s/ / /g; s/&#(\d\d?\d?);/pack("c",$1)/ge; }; # s/%([0-9a-zA-Z]{2})/pack("c",hex($1))/ge; s#\\#/#g && &Warn("\\ converted to / in $_[0]", $link); s#//+#/#g && &Warn("// converted to / in $_[0]", $link); #----- make path unique by expanding /.. and /. m#/\.# && do { while (s#/\./#/#) {;} # /./ -> / while (s#/[^/]*/\.\./#/#) {;} # /dir/../ -> / s#/.$##; # trailing /. s#/[^/]*/\.\.$#/#; # trailing /dir/../ -> / }; $local || do { $scheme = $scheme || 'http'; $host = $host ? "//$host" : ''; $DbP && print "==> LINK $scheme:$host$_\n\n"; return "$scheme:$host$_"; }; $IgnoreCase && tr/A-Z/a-z/; $DbP && $ErrTag && print "==> LOCAL LINK $_\n\n"; $_ || '/'; } #-------------------------------------------------------------------------- # WasCached($link, $referer) # # Does some quick checks on $link. Return 1 if we are done with it. # Return 0 if it should be checked further. Also bails out early # If we know we will not have to process any further. 
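#
# Example (illustrative, restating the checks below): a link such as
# "mailto:info@linklint.org" matches m#^(\w+):# and is filed under
# %ExtLink; a link matching the -ignore pattern is filed under
# %Ignored; links already present in %FileList or %LostFile simply
# accumulate the new $referer. Only a genuinely new local link
# returns '' so that LinkLint() checks it in full.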
#-------------------------------------------------------------------------- sub WasCached { local($link, $referer) = @_; $referer ne "\n" && &AppendList(*Forward, $referer, $link); $LostFile{$link} && (($LostFile{$link} .= "\n$referer"), return '1'); $FileList{$link} && do { $Skipped{$link} && ($Skipped{$link} .= "\n$referer"); $FileList{$link} .= "\n$referer"; return '1'; }; $Action{$link} && return '1'; $link =~ m#^(\w+):# && do { &AppendList(*ExtLink, $link, $referer); return '1'; }; $Ignore && $link =~ m/$Ignore/o && do { &AppendList(*Ignored, $link, $referer); return '1'; }; ''; } #-------------------------------------------------------------------------- # LinkLint($level, $link, $referer) # # $level keeps track of depth of recursion. # $link is the URL or file to check # $referer is the file that referenced $link. # Recursively get all referenced files from a file. # NOTE: $link is assumed to be anchored at the server root. #-------------------------------------------------------------------------- sub LinkLint { local($level, $link, $referer) = @_; local(%newlinks); $DbLink && &Progress("getting $link"); ($ServerMap || $ServerTilde) && do { ($link = &MapLink($link, $referer)) || return ''; }; $link = $Http && $link !~ m/$Local/o ? &LinkRemote($link, $referer) : &LinkLocal($link, $referer); $link || return; $Forward{$link} = "\n"; # this primes forward #----- recurse into all links found in this file foreach $new (keys %newlinks) { &WasCached($new, $link) || &LinkLint($level+1, $new, $link); } } #-------------------------------------------------------------------------- # LinkLocal($link, $referer) # # Does the local equivalent of what the server does. #-------------------------------------------------------------------------- sub LinkLocal { local($link, $referer) = @_; $link =~ s/\?.*$//; # strip local queries local($lastdir); # for directory listings -d "$ServerRoot$link" && $link !~ m#/$# && ( $link .= '/' ); local($path) = $link; # for index files if ( $link =~ m#/$#) { if (&LookupDir($link) ) { $path = &LookupDir($link); $PrintAddenda{$link} = "[file: $path]"; } else { $LASTDIR || do { &AppendList(*LostFile, $link, $referer); return ''; }; $lastdir = '1'; $Arg{no_warn_index} or &Warn("index file not found", $link); $PrintAddenda{$link} = "[directory listing]"; } } elsif ( -f _ ) { ((stat(_))[2] & 4 == 0) && &Warn("not world readable", $link); } else { &AppendList(*LostFile, $link, $referer); $ServerTilde || $link =~ m#(^/~[^/]*)# && $Hints{qq~use -http to resolve "$1" links.~}++; $ServerMap || $link =~ /^($StandardMaps)/o && $Hints{qq~use "-map $1" to resolve imagemaps.~}++; return ''; } &AppendList(*FileList, $link, $referer); &CacheDir($link); $lastdir && do { &StopRecursion($link, $referer) && return ''; &Progress("checking $link"); %newlinks = %LASTDIR; return $link; }; $path =~ /\.($HtmlExts|map)$/io || return ''; # only parse html & .map &StopRecursion($link, $referer) && return ''; &Progress("checking $link"); open($path, "$ServerRoot$path") || do { &Warn(qq~could not open file: "$ServerRoot$path"\n~, 'sys'); return ''; }; $path =~ /\.map$/i ? &ParseMap($path, *newlinks) : &ParseHtml($path, $link, *newlinks); close($path); $link; } #-------------------------------------------------------------------------- # MapLink($link, $referer) # # Resolves my server maps for $link. Returns new $link or '' if # the new link was already cached. 
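#
# Example (illustrative): with a -map rule loaded into %NewMap, a link
# like "/cgi-bin/foo" is rewritten through $ServerMap; with
# $ServerTilde set, "/~user/x.html" expands through the tilde rule.
# %checked catches cycles (a -> b -> a) and reports an
# "infinite mapping loop" instead of spinning forever.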
#-------------------------------------------------------------------------- sub MapLink { local($link, $referer) = @_; local(%checked, $old); $old = $link; while ( ($ServerMap && $link =~ s#^($ServerMap)#$NewMap{$1}#o) || ($ServerTilde && $link =~ s#^/~([^/]*)#$ServerTilde#oee) ) { $checked{$link}++ || next; &Warn("infinite mapping loop", $link); return $link; } ($old eq $link || "$old/" eq $link) && return $link; $PrintAddenda{$link} = "($old)"; $Mapped{$old} = $link; $DbLink && &Progress("mapped $old\n => $link"); &WasCached($link, $referer) && return ''; $link; } #-------------------------------------------------------------------------- # StopRecursion($link, $referer) # # Stops recursion as needed. Also records skipped files. #-------------------------------------------------------------------------- sub StopRecursion { local($link, $referer) = @_; $Parsed{$link} && return '1'; $Abort || $link !~ /$LinkSet/o || ($Skip && $link =~ /$Skip/o) || ++$Parsed > $Limit || do { $Parsed{$link}++; return ''; }; &AppendList(*Skipped, $link, $referer); push(@Skipped, $link); &Progress("----- $link"); return '1'; } #-------------------------------------------------------------------------- # LinkRemote($link, $referer) # # Checks $link via http. If it is an html file it is parsed and # the results go into local lists maintained by LinkLink(). #-------------------------------------------------------------------------- sub LinkRemote { local($oldlink, $referer) = @_; #---- check url and parse into local arrays in LinkLint(). $Fetched{$oldlink}++ && return ''; ($flag, $link) = &Http'Parse($ServerHost, $oldlink, $referer, *newlinks); $flag == -5000 && return ''; # user interrupt $flag == -4000 && do { # moved to different host &AppendList(*ExtLink, $link, $referer); return ''; }; $link ne $oldlink && $link ne "$oldlink/" && do { $PrintAddenda{$link} = "($oldlink)"; $Mapped{$oldlink} = $link; }; $flag || return ''; # new url was already cached local($msg) = &Http'ErrorMsg($flag); &Http'FlagWarn($flag) && do { &AppendList(*Ignored, $link, $referer); &AppendList(*HttpFail, $link, $msg); &Warn($msg, $link); return ''; }; &Http'FlagOk($flag) || do { &AppendList(*LostFile, $link, $referer); &AppendList(*HttpFail, $link, $msg); return ''; }; &AppendList(*HttpOk, $link, $msg); $flag == -2000 || return ''; $link; } #-------------------------------------------------------------------------- # CacheDir($link) # # Save a list of directories for orphan and case checking. #-------------------------------------------------------------------------- sub CacheDir { local($dir) = @_; $dir =~ s#/[^/]*$##; local($absdir) = $ServerRoot . $dir; $dir = $dir || '(root)'; ($DirList{$dir} || $LostDir{$dir}) && return; &AppendList( -d $absdir ? *DirList : *LostDir, $dir, "\n"); } #-------------------------------------------------------------------------- # ParseHtml(*HANDLE, $link, *list) # # Extracts all (?) links from the file by setting %list{link} = "1". # Links are expanded to full unique URL's or paths. 
# %Anchor named anchors found # %ImgMap named image maps found # %WantAnch named anchors to find # %WantMap named image maps to find #-------------------------------------------------------------------------- sub ParseHtml { local(*HANDLE, $link, *list) = @_; local($code, $tag, $temp, $url, $att, $term, $anch); &SetBase($link); $DbP && print "\n" , '=' x 60, "\nFILE $link\n\n"; $/ = "<"; # use "<" as newline separator TAG: while () { /^\!\-\-/ && do { while ($_ !~ /\-\-\!?>/ ) { # ignore tags inside comments defined($_ = ) && next; &Warn(q~missing end comment "-->"~, $link); last TAG; } next TAG; }; m/^(\w+)(\s*("[^"]*"|'[^']*'|[^>"'])*)(>?)/ || next; $tag = $1; $att = $2; $term = $4; while ( ! $term ) { $att .= $_; ($term, $att) = &FixTag(*HANDLE, $att); $term eq 'next' && next TAG; $term eq 'last' && last TAG; }; ($_ = $tag) =~ tr/A-Z/a-z/; # convert tag to lower case if ( /^script$/ ) { $att =~ /\ssrc${ATTRIB}/io && $list{&UniqueUrl($+)}++; while ( $_ = ) { m#^/script#i && next TAG; } &Warn("missing ", $link); last TAG; } $att || next TAG; $DbP && $att && do { ($ErrTag = "<$tag$att>") =~ s/\s+/ /; $ErrTag .= "\n"; }; if ( /^a$/ ) { $att =~ s/\sname${ATTRIB}//io && ! $Arg{no_anchors} && ($Anchor{"$CurFile#$+"} = "\n"); $att =~ /\shref${ATTRIB}/io || next; $temp = $+; $temp =~ /^#/ && ($temp = "$BaseFile$temp"); $anch = $temp =~ /(#.*)$/ ? $1 : ''; $url = &UniqueUrl($temp); $list{$url}++; %ProtoJS && $url =~ /^javascript:/ && &ProtoJS($url, $referer); $anch && $url =~ m#^/# && ! $Arg{no_anchors} && &AppendList(*WantAnch, "$url$anch", $link); } elsif ( /^base$/ ) { $att =~ /\shref${ATTRIB}/io || next; $DbP && print $ErrTag; &BaseTag($+); } elsif ( /^bgsound$|^frame$|^input$|^embed$/ ) { $att =~ /\ssrc${ATTRIB}/io && $list{&UniqueUrl($+)}++; } elsif ( /^area$/ ) { $att =~ /\shref${ATTRIB}/io && $list{&UniqueUrl($+)}++; } elsif ( /^body$/ ) { $att =~ /\sbackground${ATTRIB}/io && $list{&UniqueUrl($+)}++; } elsif ( /^img$/ ) { $att =~ s/\ssrc${ATTRIB}//io && $list{&UniqueUrl($+)}++; $att =~ s/\slowsrc${ATTRIB}//io && $list{&UniqueUrl($+)}++; $att =~ /\sdynsrc${ATTRIB}/io && $list{&UniqueUrl($+)}++; $att =~ /\susemap${ATTRIB}/io || next; $temp = $+; $temp =~ /^#/ && ($temp = "$BaseFile$temp"); $anch = $temp =~ /(#.*)$/ ? 
$1 : ''; $url = &UniqueUrl($temp); $list{$url}++; $anch && $url =~ m#^/# && &AppendList(*WantMap, "$url$anch", $link); } elsif ( /^map$/ ) { $att =~ s/\sname${ATTRIB}//io && ($ImgMap{"$CurFile#$+"} = "\n"); } elsif ( /^form$/ ) { $att =~ /\saction${ATTRIB}/io || next; $temp = &UniqueUrl($+); &AppendList(*Action, $temp, $link); $list{$temp}++; } elsif ( /^applet$/ ) { $att =~ /\scode${ATTRIB}/io or next; $code = $+; $code =~ /\.class$/i || ( $code .= ".class" ); my ($jar, $base); $att =~ /\sarchive${ATTRIB}/io and $jar = $+; my $file = $jar || $code || next; $att =~ /\scodebase${ATTRIB}/io and do { ($base = $+) =~ s#/$##; $file = "$base/$file"; }; $list{&UniqueUrl($file)}++; } elsif ( /^meta$/ ) { $att =~ /\shttp-equiv${ATTRIB}/io || next; $+ =~ /^refresh$/i || next; $att =~ /\scontent\s*=\s*"([^"]*)"/i || next; $1 =~ /url=([^"\s]+)/i || next; $temp = $1; $temp =~ m#^\w+://# || &Warn("re-direct should be absolute", $link); $url = &UniqueUrl($temp); $list{$url}++; } } $/ = "\n"; # reset line seperator to "\n" $DbP or return; print '=' x 60 , "\n\n"; $ErrTag = ''; } #-------------------------------------------------------------------------- # SetBase($link) # # Clears globals: BaseScheme, BaseFile, BasePath, # Sets CurFile and CurPath #-------------------------------------------------------------------------- sub SetBase { ($CurFile) = @_; $BaseScheme = ''; $BaseHost = ''; $BasePath = ''; $BaseFile = $CurFile; ($CurPath = $CurFile) =~ s#(\?.*)$##; # strip query off of path $CurPath =~ s#([^/]+)$##; # strip file off of path $CurPath = $CurPath || "/"; # default to root } sub ProtoJS { local($link, $referer) = @_; local($place, %place, $cnt); $link =~ s#javascript:([^\(\)]+)\(## || return; ($place = $ProtoJS{$1}) || return; grep( $place{$_}++, split("\n", $place)); while ($link =~ s/^\s*("([^"]*)"|'([^']*)'|([^"'\)]+))\s*[,\)]//) { $place{++$cnt} && $list{&UniqueUrl($+)}++; } } sub SetProtoJS { local($proto) = @_; local(@place, $name, $cnt); $proto =~ s#^([^\(\)]+)\(## || return; $name = $1; while ($proto =~ s#^([^\),]*)[,\)]##) { $cnt++; $+ eq 'url' && push(@place, $cnt); } $ProtoJS{$name} = join("\n", @place); $DB{1} && print "javascript:$name $ProtoJS{$name}\n"; } #-------------------------------------------------------------------------- # FixTag(*HANDLE, $att) # # Reads in the remainder of a tag if there was a bare "<" inside the # the tag. Only works if "<" was inside of single or double quotes. # This is slow but we only get here on rare occasions. #-------------------------------------------------------------------------- sub FixTag { local(*HANDLE, $att) = @_; local($erratt, $temp, $quot); $DbP && print "attrb = [$att]\n"; ($erratt = substr($att, 0, 20)) =~ s/\s+/ /; # for error msg $temp = $att; $temp =~ s/"[^"]*"|'[^']*'//g; # strip leading ".." $DbP && print "tail = [$temp]\n"; $temp = m/(['"])/ || do { # should have ' or " &Warn(qq~unquoted "<" in <$tag$erratt~, $link); return 'next'; }; $quot = $1; # last ' or " $DbP && print "quot = [$quot]\n"; $_ = ''; # prime the pump do { $att .= $_; $DbP && print "append1 [$_]\n"; # add new lines ... defined($_ = ) || do { &Warn("unterminated <$tag$erratt", $link); return 'last'; }; } until (($quot eq '"' && s/^([^"]*")//) || # until we close quote ($quot eq "'" && s/^([^']*')//) ); $att .= $1; # got it $DbP && print "append2 [$1]\n"; m/^(("[^"]*"|'[^']*'|[^>])*)(>?)/ && ($att .= $1); $DbP && print "append3 [$1]\n"; $term = $3; $DbP && print "term = [$term]\n"; ! 
$term && ($att .= $_); ($term, $att); } #-------------------------------------------------------------------------- # BaseTag($url) # # Set Global $BaseHost, $BasePath and $BaseFile defined in $url # Only set BaseHost if a scheme is given! # $BasePath will always start and end with "/" #-------------------------------------------------------------------------- sub BaseTag { local($_) = @_; $BaseFile = $_; s#^([\w\-]+):## && do { ($BaseScheme = $1) =~ tr/A-Z/a-z/; $BaseFile = "$BaseScheme:$_"; $BaseHost = $1 if s#^//([^/]*)##; # can't have host without scheme }; s#([^/]+)$##; # strip file first $BasePath = $_ if m#^/#; # only if absolute # $ServerHost && $BaseFile =~ s#^http://$ServerHost##o; $DbP || return; print "\nBaseScheme $BaseScheme\n"; print "BaseHost $BaseHost\n"; print "BasePath $BasePath\n"; print "BaseFile $BaseFile\n\n"; } #-------------------------------------------------------------------------- # ParseMap(*HANDLE, *list) # Reads a map file and tries to extract all http links. #-------------------------------------------------------------------------- sub ParseMap { local(*HANDLE, *list) = @_; &SetBase("/"); while () { next unless m#(http://[^\s"]+)#i; # strip any junk around an http:// $list{&UniqueUrl($1)}++; } } #-------------------------------------------------------------------------- # $url = ParseRedirect(*HANDLE, $link) # # Reads text from FILE until end of element. Uses $link for # error messages. Returns $url with redirected $url if given otherwise # returns ''. #-------------------------------------------------------------------------- sub ParseRedirect { local(*HANDLE, $link) = @_; &SetBase($link); local($url) = ''; $/ = "<"; # use "<" as newline separator REDIR: while () { $DbP && print; /^\!\-\-/ && do { while ($_ !~ /\-\-\!?>/ ) { # ignore tags inside comments defined($_ = ) && next; &Warn(q~missing end comment "-->"~, $link); last REDIR; } next REDIR; }; last if m#^(/head|body|h\d|font)#i; s/^meta//i || next; /\shttp-equiv\s*=\s*"?refresh/i || next; /\scontent\s*=\s*"([^"]*)"/i || next; $1 =~ /url\s*=\s*([^"\s]+)/i || next; $url = &UniqueUrl($1); $url =~ m#^\w+://# || &Warn("re-direct $url should be absolute", $link); last; } $/ = "\n"; # use "\n" as newline separator return $url; # return value } #-------------------------------------------------------------------------- # $path = LookupDir($dir) # # $dir and $path are both relative to server root. # Find a default index file in $dir. Returns $path of default file # on success or return 0 on failure. Caches results in $DefDir. # Fills %LASTDIR with last successful directory listing. #-------------------------------------------------------------------------- sub LookupDir { local($absdir) = @_; $absdir =~ s#/$##; local($dir) = $absdir . '/'; defined $DefDir{$dir} && return $DefDir{$dir}; # was cached local(%file, $lc); &Progress("looking for $dir(default)"); opendir(DIR, "$ServerRoot$absdir") || return $DefDir{$dir} = $LASTDIR = ''; %LASTDIR = (); foreach (grep(!/^\./, readdir(DIR))) { # all files in directory ($lc = $_) =~ tr/A-Z/a-z/; # lower case version $IgnoreCase && ($_ = $lc); $LASTDIR{ "$dir$_" }++; $file{$DefCaseSens ? 
$_ : $lc} = $_; } closedir(DIR); $LASTDIR = '1'; foreach (@DefIndex) { next unless $file{$_} && -f "$ServerRoot$dir$file{$_}"; return $DefDir{$dir} = "$dir$file{$_}"; } return $DefDir{$dir} = ''; } #-------------------------------------------------------------------------- # CheckOrphan(*dirlist, *orphlist, *badcase, $checkorphan, $checkcase) # # Checks every directory in dirlist and creates a list of # all files that have not been checked by linklint. # # If CheckCase is set we first check the case of all found files # in that directory. #-------------------------------------------------------------------------- sub CheckOrphan { local(*dirlist, *orphlist, *badcase, $checkorphan, $checkcase) = @_; %dirlist || return; $checkorphan || $checkcase || return; local($msg) = ' for'; $msg .= " orphans" if $checkorphan; $msg .= " and" if $checkorphan && $checkcase; $msg .= " case mismatch" if $checkcase; local(@files, %files, $file, $link, $absdir, $reldir); foreach $dir (sort keys %dirlist) { &Progress("checking $dir"); &Progress($msg); $reldir = $dir eq '(root)' ? '' : $dir; $absdir = $ServerRoot . $reldir; &PushDir($absdir) || next; opendir(DIR, '.') || do { &Warn(qq~could not read directory "$absdir"~, 'sys'); next; }; @files = grep(!/^\./, readdir(DIR)); $IgnoreCase && grep( tr/A-Z/a-z/ && 0, @files); closedir(DIR); $checkcase && do { %files = (); grep($files{"$reldir/$_"}++, @files); foreach $link ( grep( m#^$reldir/[^/]+$#i, keys %FileList) ) { next if $files{$link}; ($file) = grep(/$link/i, keys %files); # get "real" filename $file = $file || $link; # just in case $badcase{$file} = $FileList{$link}; # add to list } }; next unless $checkorphan; foreach $file (@files) { $link = "$reldir/$file"; next if $FileList{$link} || $dirlist{$link} || $badcase{$link}; -d $file && ($link .= "/"); # let them know it's a dir $orphlist{$link} = "\n"; next unless $link =~ /\.($HtmlExts)$/io; #----- parse html files for possible redirects open($file, $file) || do { &Warn(qq~could not open orphan "$link"~, 'sys'); next; }; local($equiv) = &ParseRedirect($file, $link); close($file); $equiv && ($PrintAddenda{$link} = " => $equiv"); } &PopDir; } } #-------------------------------------------------------------------------- # ProcessLocal() # # Does the work needed between gathering links and printing. #-------------------------------------------------------------------------- sub ProcessLocal { local($file, $anch, $ref); &Progress("\nProcessing ..."); #---- Resolve named anchors $Arg{no_anchors} or do { &HashUnique(*WantAnch); &ResolveAnch(*WantAnch, *Anchor, *LostAnch); }; #---- resolve named image maps &HashUnique(*WantMap); &ResolveAnch(*WantMap, *ImgMap, *LostMap); &HashUnique(*FileList); # pathinfo and dirlookup can cause extras. &HashUnique(*LostFile); # pathinfo and dirlookup can cause extras. &HashUnique(*Action); &HashUnique(*Forward); &InvertKeys(*LostFile, *ErrF); &InvertKeys(*BadCase, *CaseF); &InvertKeys(*WarnList, *WarnF); $Parsed = keys %Parsed; $Parsed >= $Limit && ! $Arg{'limit'} && $Hints{"use -limit to check more than $Limit files."}++; @PRINTFORM = @SiteForm; &PrintFiles(0, 0, 0, 0); # count elements in local arrays $ErrTot = $LostFile + $LostAnch + $BadCase; $Fetched = keys %Fetched; } ########################################################################### # # PRINTING ROUTINES # # Entry through PrintOutput() function. 
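#
# Overview (for orientation): PrintOutput() first calls PrintFiles()
# with zero flags so every list gets counted, then PrintDocDir() to
# write the -doc files, then prints the summary to the log file and/or
# the screen. Individual lists are rendered by PrintList(),
# PrintLISTS() and PrintUrl().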
# ########################################################################### #-------------------------------------------------------------------------- # PrintOutput(*form, *summary, $docdir, $sumfile, $indexfile, $title) # # A wrapper for a printing routines. Computes counts of # of elements # in all hashes (need scalars). Inverts some hashes for forward # results. Prints output to files in $DocDir or to "std" output. #-------------------------------------------------------------------------- sub PrintOutput { local(*printform) = shift; @PRINTFORM = @printform; # used globally below local(*SUMMARY) = @_; # Summary routine: used and passed on. &PrintFiles(0, 0, 0, 0); # count elements in all hashes &PrintDocDir(@_); # All -doc files $LogFile && do { print "\n"; &PrintFiles(@SumPrint); # summary list of results print "\n"; &SUMMARY; # text summary at top of file zz ? print "\n"; }; &PrintFiles(0,0,0, $PrnFlag); # print as user requested return if $Silent; # -silent: no summary on screen $lastsel = select(STDERR); print "\n"; &PrintFiles(@SumPrint); # list summary print "\n"; &SUMMARY; # text summary at bottom of screen select($lastsel); } #-------------------------------------------------------------------------- # PrintFiles($DocDir, $DOCHEAD, $SUM, $flags) # # Prints all or part of the output according to the flags supplied. # If $DocDir is supplied output goes to the files specified in the # OpenDoc() calls, otherwise output goes to currently selected output. # PrintFile is called several times to print lists to STDOUT, # doc files, summary, and summary file. Passes globals to OpenDoc(), # PrintLISTS() and PrintList(). # where to print: # $DocDir send output to seperate files in $DocDir. # summarize: # $DOCHEAD add filenames to headers. # $SUM 1: summary form, 2: summarize printLISTS to one line # #-------------------------------------------------------------------------- sub PrintFiles { local($docdir, $DOCHEAD, $SUM, $flags) = @_; local($file, $data, $mask, $prog, @params); foreach (@PRINTFORM) { ($file, $data, $mask, $prog, @params) = split(/;\s*/, $_); $Dont_Output and $file =~ m/$Dont_Output/o and next; $mask = oct($mask); $flags || (eval("\$$data = \$$data || keys \%$data"), next); &OpenDoc($docdir, $file, $data) || next; next unless ($flags & $mask) == $mask; $prog == 1 && &PrintList($data, @params); $prog == 2 && &PrintLISTS($data, @params); $prog == 3 && &PrintUrl($data, @params); } } #-------------------------------------------------------------------------- # $flag = OpenDoc($docdir, $file, *list) # # A co-conspirator with PrintFiles(). # $file is the name of a file (sans extension) to be written in # $DocDir. $anydata tells us if there is any data to be written # to the file. Always erase old copies of all the files but save # a backup copy if the file might be theirs. # If $DocDir is 0 then create the filename $DOCFILE and return # $anydata. If $anydata is 0 don't create a new file and return ''. #-------------------------------------------------------------------------- sub OpenDoc { local($docdir, $name, *data) = @_; local($anydata) = $data || ($data = scalar keys %data); $DOCFILE = "$name.txt"; # global used in PrintList() return $anydata unless $docdir; # and output will to current select() local($htmlfile) = $name; $htmlfile .= $Dos ? 
'.htm' : '.html'; # $Clean && do { -e $DOCFILE && unlink($DOCFILE); -e $htmlfile && unlink($htmlfile); # }; return '' unless $anydata; # don't create new file w/ no data ## print STDERR ">>>$DOCFILE\n"; open(DOC, ">$DOCFILE") || do { &Warn(qq~could not open "$DOCFILE" for output~, 'sys'); return ''; }; $DOCFILES++; # count files created $DOCFILES{$DOCFILE} = $htmlfile; # keep a list of .txt files select(DOC); # print's will default to DOC print "file: $DOCFILE\n"; # indentify filename print &Preamble; # print a file header } sub Preamble { $TimeStr = $TimeStr || &TimeStr('(local)'); join('', $HeaderRoot ? "root: $HeaderRoot\n" : '', $ServerHost ? "host: $ServerHost\n" : '', "date: $TimeStr\n", "Linklint version: $version\n", "\n" ); } #-------------------------------------------------------------------------- # SiteSummary # # Prints the textual summary of what has happened. #-------------------------------------------------------------------------- sub SiteSummary { %Hints && print "hint: ", join("\nhint: ", keys %Hints), "\n\n"; $FileList && print "Linklint found ", &Plural($FileList, "%d file%s"), $DirList ? &Plural($DirList, " in %d director%y") : '', $Parsed ? &Plural($Parsed, " and checked %d html file%s") : '', ".\n"; $CheckCase && $BadCase == 0 && print &Plural($BadCase, "There %w %n file%s with mismatched case.\n"); $CheckOrphan && $OrphList == 0 && print &Plural($OrphList, "There %w %n director%y with orphans.\n"); print &Plural($LostFile, "There %w %n missing file%s."); print &Plural($ErrF, " %N file%s had broken links.\n"); print &Plural($ErrTot, "%N error%s, "); print &Plural($WarnList, "%n warning%s."); $Http || $Time && $Parsed && $Time > 4 && printf(" Parsed ~ %1.1f files/second.", $Parsed /$Time); print "\n"; } #-------------------------------------------------------------------------- # PrintDocDir(*SUMMARY, $docdir, $sumfile, $indexfile, $title) # # Prints complete documentation in $DocDir directory. #-------------------------------------------------------------------------- sub PrintDocDir { local(*SUMMARY, $docdir, $sumfile, $indexfile, $logfile, $title) = @_; $docdir && &PushDir($docdir) || return; local($indexhtml) = $indexfile; $indexhtml .= $Dos ? 
".htm" : ".html"; $DOCFILES = 0; %DOCFILES = (); &Progress("\nwriting files to $Arg{'doc'}"); # abbreviated $DocDir local($lastselect) = select; # save "std" output &PrintFiles($docdir, 0, 0, $DocFlag); # print almost all doc files local($dum) = '1'; # used to fool OpenDoc() &OpenDoc($docdir, $sumfile, *dum) && do { &SUMMARY; # text summary at top of file print "\n"; &PrintFiles(@SumPrint); # list summary }; &OpenDoc($docdir, $indexfile, *dum) && do { printf("%12s: %s\n", "$sumfile.txt", 'summary of results'); $Arg{'out'} || printf("%12s: %s\n", "$logfile.txt", 'log of progress'); &PrintFiles('', 1, 2, $DocFlag); # list index to all files }; close(DOC); # close the last one select($lastselect); # restore "std" output &Progress(&Plural($DOCFILES, "wrote %n txt file%s")); $Arg{'textonly'} || do { delete $DOCFILES{"$sumfile.txt"}; delete $DOCFILES{"$indexfile.txt"}; delete $DOCFILES{"dir.txt"}; foreach $txt (keys %DOCFILES) { &HtmlDoc($txt, $DOCFILES{$txt}, *DefDir, *Action, $DocBase ); $Arg{'htmlonly'} && $txt !~ /^remote/ && unlink($txt); }; &Progress(&Plural(scalar keys %DOCFILES, "wrote %n html file%s")); &HtmlSummary("${indexfile}.txt", $indexhtml, $title, *DOCFILES) && &Progress("wrote index file $indexhtml"); $Arg{'htmlonly'} && unlink("${indexfile}.txt"); }; &PopDir; } #-------------------------------------------------------------------------- # HtmlDoc($in, $out, *map, *skip, $base) # # Reads txt file $in, writes html file $out, adding anchors # to lines that start with "/" or "scheme:". Sets the base # to $base if provided, maps links found in %map, does not # add anchors to links found in *skip. #-------------------------------------------------------------------------- sub HtmlDoc { local($in, $out, *map, *skip, $base) = @_; local($title) = "Linklint - $out"; #---- Open files open(IN, $in) || do { &Warn(qq~\ncould not read back "$infile"~, 'sys'); return ''; }; ### print STDERR ">> HTML >> $out\n"; open(OUT, ">$out") || do { &Warn(qq~\ncould not open "$outfile" for output~, 'sys'); close(IN); return ''; }; my $base_tag = $Arg{output_frames} ? "" : ""; #----- print html header print OUT "\n$title\n\n"; print OUT $base_tag; print OUT "\n\n

\n";

    #---- read/print text file header

    $_ = <IN>;                    # skip "file: filename.txt"

    while (<IN>) {
        print OUT $_;
        last unless /\S/;
    }
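
    # The loop above copies the "root:"/"host:"/"date:" header lines
    # from the .txt file verbatim, stopping once it has printed the
    # first blank line; the link list proper follows.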

    #---- read/print out the list with anchor tags

    local($file);

    while (<IN>) {
        s/^(\s+)// && print OUT $1;
        s/^(=>\s*)// && print OUT $1;
        m#^(/|\w+:).*# || do { print OUT $_; next; };
        s/\s+$//;
        $file = $map{$_} || $_;
        $skip{$file} && do { print OUT "$_\n"; next; };
        $file =~ m#^/# && ($file = $base .$file);
        print OUT qq~<a href="$file">$_</a>\n~;
    }
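
    # Illustration (hypothetical data, not executed): with $base set to
    # "http://www.example.com", an input line of "/dir/page.html" is
    # looked up in %map, echoed untouched if it appears in %skip, and
    # otherwise wrapped in an anchor pointing at
    # "http://www.example.com/dir/page.html".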

    close(IN);

    print OUT @tab, "
\n"; close(OUT); return 1; } #-------------------------------------------------------------------------- # TimeStr($gmt, $time) # # Returns a string formated "weekday, month day year hh:mm:ss" #-------------------------------------------------------------------------- sub TimeStr { local($gmt, $time) = @_; defined $time || ($time = time); local($sec, $min, $hour, $mday, $mon, $year, $wday ) = ($gmt =~ m/GMT/) ? gmtime($time) : localtime($time); local(@mon) = ( 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'); local(@wday) = ( 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat' ); $year < 50 && ($year += 2000); $year < 1900 && ($year += 1900); sprintf("$wday[$wday], %02d $mon[$mon] $year %02d:%02d:%02d %s", $mday, $hour, $min, $sec, $gmt); } #-------------------------------------------------------------------------- # HtmlSummary($infile, $outfile, $title, $docfiles) # # Creates a simple html summary file by reading back the linklint.txt # file that was already created. #-------------------------------------------------------------------------- sub HtmlSummary { local($infile, $outfile, $title, *docfiles) = @_; local(%head); open(IN, $infile) || do { &Warn(qq~\ncould not read back "$infile"~, 'sys'); return ''; }; open(OUT,">$outfile") || do { &Warn(qq~\ncould not open "$outfile" for output~, 'sys'); close(IN); return ''; }; while () { last unless /\S/; /\s*(\S+):\s*(\S+.*\S+)\s*$/ && ($head{$1} = $2); } local($title2) = $title; $head{'root'} && ($title2 .= "for $head{'root'}"); my $target = $Arg{output_frames} ? "" : ""; print OUT qq~\n $title\n $target \n $title2
\n
\n $head{'date'}
Linklint version: $version
~;
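
    # (The page header printed above embeds $head{'date'} and $title2;
    # $title2 gains a "for ..." suffix when the .txt header supplied a
    # root: line.)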

    local($file);
    while (<IN>) {
        last unless /\S/;
        s|(ERROR )|$1|;
        s|^\s*(\S+):||;
        $file = $docfiles{$1} || $1;
        $file =~ s~[^/]+/~~;

        print OUT " " x (13 - length($file));
        print OUT "$file:", $_;
    }
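
    # (Each summary row has now been re-pointed at its .html
    # counterpart via %docfiles, with the filename right-aligned in a
    # 13-character column.)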

    close(IN);

    print OUT @tab, "\n\n
\n"; close(OUT); return 1; } #-------------------------------------------------------------------------- # Plural($cnt,$msg) # # Returns a pluralized version of $msg. # # %w -> was : were %d -> $cnt # %s -> : s %n -> no : $cnt # %y -> y : ies %N -> No : $cnt # %es -> : es #-------------------------------------------------------------------------- sub Plural { local($cnt,$_) = @_; $cnt == 1 ? s/\%w/was/g : s/\%w/were/g; # %w -> 'was' or 'were' $cnt == 1 ? s/\%s//g : s/\%s/s/g; # %s -> '' or 's' $cnt == 1 ? s/\%y/y/g : s/\%y/ies/g; # %y -> 'y' or 'ies' $cnt == 1 ? s/\%es//g : s/\%es/es/g; # %y -> 'y' or 'ies' s/\%n/\%d/ && ($cnt = $cnt || 'no' ); # %n -> "no" or $cnt s/\%N/\%d/ && ($cnt = $cnt || 'No' ); # %N -> "No" or $cnt s/\%d/$cnt/; # %d -> $cnt s/(\%\d+d)/sprintf($1,$cnt)/e; # %3d -> $cnt return $_; } #-------------------------------------------------------------------------- # PrintList(*list, $header, $xref, $subhead) # # Print out keys (and values if $xref) of %list. Prepend "$DOCFILE: " # to header if $DOCHEAD is set. Append "(cross referenced)" to header # if $xref == 2. Only print header if $SUM is set. #-------------------------------------------------------------------------- sub PrintList { local(*list, $header, $xref, $subhead) = @_; $subhead = $subhead || "used in %d file%s:"; return unless %list; local(@major) = sort keys %list; local($headtext) = &Plural(scalar @major, $header); $xref == 2 && ($headtext .= " (cross referenced)"); $DOCHEAD && ($headtext = sprintf("%13s: ", $DOCFILE) . $headtext); $SUM && do { print "$headtext\n"; return; }; print "$Headline# $headtext\n$Headline"; foreach (@major) { s/&cr;/\n/g; print "$_\n"; $PrintAddenda{$_} && print "$PrintAddenda{$_}\n"; $xref && &PrintSubList($list{$_}, $subhead, 4); } print "\n" unless $xref; } #-------------------------------------------------------------------------- # PrintSubList($sublist, $subhead, $indent) # # Prints out all elements of $sublist split by "\n". Prints out the # number of elements in pluralized $subhead. #-------------------------------------------------------------------------- sub PrintSubList { local($sublist, $subhead, $indent) = @_; $indent = " " x $indent; $sublist =~ s/^\n+//; $sublist || return; (local(@items) = sort split(/\n+/,$sublist)) || return; grep( s/&cr;/\n/g, @items); print $indent , &Plural(scalar @items, $subhead), "\n", $indent; print join("\n$indent", @items) , "\n\n"; } #-------------------------------------------------------------------------- # PrintLISTS(*list, *heads, $xref) # # Split %list into sublists and then prints each sublist. The # splitting is controlled by @heads. Each line of @heads must be in the # form "$heading::$regexp" All keys of %list that match %regexp are # printed out under the heading $heading. If regexp contains 'unknown' # then all remaining items in %list are printed out under its $heading. # If $SUM == 2 then a summary of all keys %list is printed using the # first heading format which is otherwise ignored. #-------------------------------------------------------------------------- sub PrintLISTS { local(*listname, *heads, $xref) = @_; local(%sublist, @items, $heading, $regexp); local(%list) = %listname; foreach ( $SUM == 2 ? @heads[0 .. 0] : @heads[1 .. $#heads] ) { ($heading,$regexp) = split(/::/, $_); (@items = $regexp =~ /unknown/ ? 
keys %list : grep(/$regexp/i, keys %list)) || next; %sublist = (); foreach ( @items ) { $sublist{$_} = $list{$_}; # transfer to %temp delete $list{$_}; # and delete from %list } &PrintList(*sublist, $heading, $xref); } } #-------------------------------------------------------------------------- # PrintUrl(*list, $header, $xref, $posthead, $subhead) # # Prints out URLs (and references if $xref) in %list. #-------------------------------------------------------------------------- sub PrintUrl { local(*list, $header, $xref, $posthead, $subhead) = @_; ($list = $list || scalar keys %list) || return; $header .= ':'; local(@items, %invert); local($headtext) = &Plural($list, $header); $DOCHEAD && ($headtext = sprintf("%13s: ", $DOCFILE) . $headtext); $headtext .= " " . &Plural($list, $posthead); $subhead = $subhead || "%d url%s:"; $xref == 2 && ($headtext .= " (cross referenced)"); $SUM == 2 && ((print "$headtext\n"), return); &InvertKeys(*list, *invert); $SUM || print "$Headline# $headtext\n$Headline"; foreach (sort keys %invert) { @items = sort split(/\n/, $invert{$_} ); print &Plural(scalar @items, $SUM ? $header : $subhead), " $_\n"; $SUM && next; &PrintSubUrl(*items, $xref); } print "\n" unless $SUM; } #-------------------------------------------------------------------------- # PrintSubUrl # # Prints out all elements of $sublist split by "\n". Prints out the # number of elements in pluralized $subhead. #-------------------------------------------------------------------------- sub PrintSubUrl { local(*items, $xref) = @_; foreach ( @items ) { print " ", $_; &PrintUrlRedir($_); print "\n"; ($xref && defined $ExtLink{$_} ) || next; &PrintSubList($ExtLink{$_}, "used in %d file%s:", 8); } print "\n" unless $xref && defined $ExtLink{$_}; } #--------------------------------------------------------------------------- # PrintUrlRedir($url) # # Prints "linked list" of where $url was moved to due to 3XX status # codes. Always returns if an infinite loop ( a -> b -> a ...) # occurs. #--------------------------------------------------------------------------- sub PrintUrlRedir { local($url) = @_; local(%checked); while ( $url = $UrlRedirect{$url} ) { $checked{$url}++ && return; print "\n => ", $url; } } #--------------------------------------------------------------------------- # UrlSummary() # # Print Textual summary of checking remote url status. #--------------------------------------------------------------------------- sub UrlSummary { $TotFail = $UrlFail; $CheckedUrls && print &Plural($CheckedUrls, "Linklink checked %d url%s:\n"), &Plural($UrlOk," %d %w ok, "), $TotFail, " failed", $UrlMoved ? &Plural($UrlMoved, ". %N url%s moved") : '', ".\n"; $HostFail && print &Plural($HostFail, " %n host%s failed:"), &Plural($UrlRetry, " %N url%s could be retried.\n"); %ExtLink && print &Plural($UrlFailedF , "%N file%s had failed urls.\n"); $ErrF && $UrlFailF && print &Plural($ErrF + $UrlFailedF, "There were %n file%s with broken links.\n"); $CacheNet && print &Plural($UrlMod," %N url%s %w modified since last reset.\n"); } #--------------------------------------------------------------------------- # Abbrev($max, $str) # #--------------------------------------------------------------------------- sub Abbrev { local($max, $str) = @_; length($str) > $max && ($str = substr($str, 0, $max - 4) . " ..."); $str; } #--------------------------------------------------------------------------- # LogFile($filename) # # Changes log file to filename for logging progress. 
#---------------------------------------------------------------------------
# LogFile($filename)
#
# Changes the log file to $filename for logging progress.
#---------------------------------------------------------------------------
sub LogFile {
    local($name) = @_;
    $name || return;
    $LogFile && $LogFile eq $name && return;   # don't reopen same file.
    select(STDERR);
    $LogFile && close($LogFile);
    $LogFile = $name;
    open($LogFile, ">$LogFile") ||
        &Error(qq~could not open file "$LogFile" for output~, 'sys');
    select($LogFile);
}

###########################################################################
#
# Utilities
#
###########################################################################

#---------------------------------------------------------------------------
# $outcnt = InvertKeys(*in, *out)
#
# Assumes %in is filled with values like "file1\nfile2\n..."
# Fills %out with keys file1, file2, ... and each key has values
# of the keys from %in that refer to it.
#---------------------------------------------------------------------------
sub InvertKeys {
    local(*in, *out) = @_;
    %in || return %out ? scalar keys %out : 0;
    local(%temp);
    foreach $in (keys %in) {
        %temp = ();
        grep( $temp{$_}++, split(/\n+/, $in{$in}));
        foreach (keys %temp) {
            /\S/ || next;
            $out{$_} ? ($out{$_} .= "\n$in") : ($out{$_} = $in);
        }
    }
    scalar keys %out;
}

#--------------------------------------------------------------------------
# ResolveAnch(*want, *found, *lost)
#
# named anchors wanted: %want{link#frag} = "ref1 \n ref2 \n ..."
# named anchors found:  %found{link#frag} = 1
# Fills *found and *lost as appropriate for found and lost named
# anchors. Works for named maps too.
#--------------------------------------------------------------------------
sub ResolveAnch {
    local(*want, *found, *lost) = @_;
    foreach ( keys %want ) {
        /^([^#]*)(#.*)/;                            # $1 will contain filename
        $file = $1;
        $anch = $2;
        $Ignore && $url =~ /$Ignore/o && next;      # ignore ignored files
        $Skipped{$file} && next;                    # skip anchors in skipped files
        $ref = $want{$_};
        $Mapped{$file} && do {
            $file = $Mapped{$file};
            $Ignore && $url =~ /$Ignore/o && next;  # ignore ignored files
            $Skipped{$file} && next;                # skip skipped files
            $_ = "$file$anch";
        };
        $found{$_} ? ($found{$_} = $ref) : ($lost{$_} = $ref);
    }
}

#--------------------------------------------------------------------------
# HashUnique(*hash)
#
# Assumes values in hash are "val1 \n val2 \n ..."
# Ensures that each val occurs at most once in each hash value.
#--------------------------------------------------------------------------
sub HashUnique {
    local(*hash) = @_;
    local(%temp, $key, $val);
    while ( ($key, $val) = each %hash ) {
        %temp = ();
        grep($temp{$_}++, split(/\n+/, $val));
        $hash{$key} = join("\n", keys %temp);
    }
}
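#--------------------------------------------------------------------------
# _ExampleInvertKeys()
#
# Illustrative sketch only (not part of linklint, never called): shows
# the "\n"-joined multimap convention used by InvertKeys() above and by
# AppendList(). The file names are made up.
#--------------------------------------------------------------------------
sub _ExampleInvertKeys {
    local(%in, %out);
    %in = (
        '/a.html', "img/x.gif\nimg/y.gif",   # /a.html refers to x and y
        '/b.html', "img/x.gif",              # /b.html refers to x only
    );
    &InvertKeys(*in, *out);
    # now $out{'img/x.gif'} holds "/a.html" and "/b.html" joined by "\n"
    # and $out{'img/y.gif'} holds "/a.html"
}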
#--------------------------------------------------------------------------
# Warn($msg, $link)
#
# Registers this warning in %WarnList. If $link is supplied we cross
# reference the warning to $link.
#--------------------------------------------------------------------------
sub Warn {
    local($msg, $link) = @_;
    (!$link || $link eq 'sys') && do {
        $WarnList{$msg} = "\n";
        &Progress("WARNING: $msg");
        $link && $! && print STDERR " System message: $!\n";
        return;
    };
    &AppendList(*WarnList, $msg, $link);
}

#--------------------------------------------------------------------------
# WantNumber($val, $flag)
#--------------------------------------------------------------------------
sub WantNumber {
    local($val, $flag) = @_;
    $val || return;
    $val =~ /^\d+$/ && return;
    &Error("$flag must be followed by an integer");
}

#--------------------------------------------------------------------------
# exit_with(@msg)
#--------------------------------------------------------------------------
sub exit_with { print @_; exit; }

#--------------------------------------------------------------------------
# Error(@msg)
# Print @msg and exit.
#--------------------------------------------------------------------------
sub Error {
    local($msg, $flag) = @_;
    print STDERR "\n$prog error: $msg\n";
    @_ > 1 && $flag eq 'sys' && $! && (print STDERR "\nSystem message: $!\n");
    exit;
}

#--------------------------------------------------------------------------
# Progress(@msg)
#--------------------------------------------------------------------------
sub Progress {
    local($msg) = @_;
    $LogProgress && print $msg, "\n";
    $Quiet && return;
    print STDERR &Abbrev(75, $msg), "\n";
}

#--------------------------------------------------------------------------
# AppendList(*list, $key, $value)
#
# Adds $value to $list{$key}; separates values with "\n" as needed.
#--------------------------------------------------------------------------
sub AppendList {
    local(*list, $key, $value) = @_;
    $list{$key} ? ( $list{$key} .= "\n$value" ) : ($list{$key} = $value);
}

#--------------------------------------------------------------------------
# PushDir($newdir)
#
# Pushes the current directory onto a stack @DIRS and then
# chdir's to $newdir. We assume that $newdir is a full path!
# Returns '' if there is an error.
#--------------------------------------------------------------------------
sub PushDir {
    local($new) = @_;
    push(@DIRS, $CWD);
    return $CWD if $new eq $CWD;
    chdir($new) || do {
        &Warn(qq~(pushdir) could not chdir to "$new"~, 'sys');
        return '';
    };
    return($CWD = $new);
}

#--------------------------------------------------------------------------
# PopDir()
#
# Pops the most recent directory off of stack @DIRS and
# changes $CWD and the current directory accordingly.
#--------------------------------------------------------------------------
sub PopDir {
    local($new) = pop(@DIRS);
    return $CWD if $new eq $CWD;
    chdir($new) || do {
        &Warn(qq~(popdir) could not chdir to "$new"~, 'sys');
        return '';
    };
    $CWD = $new;
}
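#--------------------------------------------------------------------------
# _ExamplePushDir()
#
# Illustrative sketch only (not part of linklint, never called): the
# save/restore pattern for PushDir() and PopDir() above. The directory
# name is made up; PushDir() expects a full path.
#--------------------------------------------------------------------------
sub _ExamplePushDir {
    &PushDir("/tmp") || return;   # chdir, remembering where we were
    # ... work in /tmp ...
    &PopDir();                    # chdir back to the saved directory
}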
#--------------------------------------------------------------------------
# GetCwd()
#
# Returns a string containing the current working directory.
# "\" is changed to "/" for consistency if $Dos.
# Sets $CWD to the current working directory.
#--------------------------------------------------------------------------
sub GetCwd {
    local($_) = `$pwdprog`;     # different prog's for Dos/Unix
    $Dos && do {
        s|\\|\/|g;              # replace \ with /
        s/^([a-zA-Z])://;       # remove drive:
        $DosDrive = $1;         # save drive letter for dochtml printout
    };
    s/\n$//;                    # remove trailing \n
    $CWD = $_;
}

#--------------------------------------------------------------------------
# Regular($exp, $flag)
#
# Protects regular expression special characters in $exp.
#--------------------------------------------------------------------------
sub Regular {
    local($exp) = @_;
    $exp =~ s#([^\w/])#\\$1#g;
    $exp;
}

#--------------------------------------------------------------------------
# LinkSet($flag, *in)
#
# Builds a regular expression that is the or'ing of the linkset specs
# given as keys of %in (each translated by LinkSet1) and stores it
# in $Arg{$flag}.
#--------------------------------------------------------------------------
sub LinkSet {
    local($flag, *in) = @_;
    %in = ( '/*', '1' ) if $in{'/*'};   # keep it simple
    local($out) = join( '|', grep( $_ = &LinkSet1($_, $flag), keys %in));
    $Arg{$flag} = $out;
    $out;
}

#--------------------------------------------------------------------------
# LinkSet1($exp, $flag)
#
# Converts a linkset spec into an anchored regular expression:
# special characters are escaped, "@" is converted to .* (matches any
# string) and "#" to [^/]* (matches any string without "/"); a "/"
# right behind a "#" is made optional.
#--------------------------------------------------------------------------
sub LinkSet1 {
    local($_, $flag) = @_;
    s!([^\w\@#/])!\\$1!g;   # protect special characters
    s!\@!.*!g;              # @ to .* (match any)
    s!#/!#/?!g;             # make / behind # optional
    s!#![^/]*!g;            # '#' to [^/]* (match any but /)
    $_ = "^$_\$";
}

###########################################################################
#
# User Input Routines
#
###########################################################################

#--------------------------------------------------------------------------
# @linkset = ReadFile($file)
#
# Every infile that starts with "@" is read and its contents are
# added to the end of the @infiles list. If an @file contains
# lines starting with - then the entire line is read in as commands.
#--------------------------------------------------------------------------
sub ReadFile {
    local($file) = @_;
    local(@argv, @out);
    local($lastarg, $lastflag, $arg);
    $CheckedIn{$file}++ && next;
    if ($file eq "STDIN") { $file = \*STDIN; }
    elsif ($file) {
        open($file, $file) ||
            &Error(qq~could not open file-list "$file"~, 'sys');
    }
    else { $file = \*STDIN; }
    while (<$file>) {
        s/^#! ?// && do { print STDERR $_; next; };   # print comments
        m/^#/ && next;
        s/\s+$//;                                     # trailing whitespace
        s/^\s+//;                                     # leading whitespace
        s/^@@// && ( push(@HttpFiles, $_), next);
        s/^@//  && ( push(@out, &ReadFile($_)), next);
        m/^-/   || (push(@out, split(/\s+/, $_)), next);
        while ( s/("([^"]*)"|(\S+))\s*// ) {
            $arg = $+;
            $arg =~ /^-/ && (push(@argv, $lastflag = $lastarg = $arg), next);
            $lastflag && $lastarg !~ /^-/ && $lastflag =~ /-($HashOpts)/o
                && push(@argv, $lastflag);
            push(@argv, $lastarg = $arg);
        }
    }
    close($file);
    @argv && &ReadArgs(@argv);
    @out;
}
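#--------------------------------------------------------------------------
# Example sketch (not part of linklint): how LinkSet1() above translates
# linkset wildcards into anchored regular expressions:
#
#   /sub/@    becomes  ^/sub/.*$       (@ matches any string)
#   /dir/#    becomes  ^/dir/[^/]*$    (# matches any string without "/")
#   /a.map/@  becomes  ^/a\.map/.*$    (other special characters are escaped)
#--------------------------------------------------------------------------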
#--------------------------------------------------------------------------
# ReadHttp($file)
#
# Reads $file and returns every "http://..." it can find.
# Returns the list of links in an array.
#--------------------------------------------------------------------------
sub ReadHttp {
    local($file) = @_;
    local(@out);
    $file || do {
        $Arg{'doc'} || &Error("@@ must be preceded by -doc linkdoc");
        $file = "$Arg{'doc'}/remoteX.txt";
    };
    $CheckedIn{$file}++ && return;
    open($file, $file) ||
        &Error(qq~could not open http-list "$file"~, 'sys');
    while (<$file>) {
        /^file: remoteX.txt/ && $. == 1 && return &ReadBack($file);
        while ( s#(http://[^\s"><']+)##i ) { push(@out, $1); }
    }
    close($file);
    @out;
}

#--------------------------------------------------------------------------
# ReadBack($file)
#
# Reads in $file as if it were the remoteX.txt file generated by
# Linklint. %ExtLink is filled with the cross references found
# in the file. Returns a list of all urls found.
#--------------------------------------------------------------------------
sub ReadBack {
    local($file) = @_;
    local($url, @out);
    while (<$file>) {
        /^#/ && ($url = '', next);                    # new section
        s/\s+$//;                                     # trailing \n
        /^http:/ && (push(@out, $url = $_), next);    # a url to check
        s#^ ## || next;                               # sublist indent
        m#^/# && &AppendList(*ExtLink, $url, $_);     # local link
    }
    close($file);
    @out;
}

#--------------------------------------------------------------------------
# SetHash('name', $key, $value)
#
# Sets $name{$key} = $value.
#--------------------------------------------------------------------------
sub SetHash {
    local(*hash, $key, $val) = @_;
    $hash{$key} = $val;
}
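#--------------------------------------------------------------------------
# Example sketch (not part of linklint): the remoteX.txt layout that
# ReadBack() above expects -- each url to check flush left, with the
# local files that referenced it indented below. File names made up:
#
#   http://www.example.com/page.html
#    /index.html
#    /sub/other.html
#--------------------------------------------------------------------------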
#--------------------------------------------------------------------------
# ReadArgs(@args)
#
# Reads arguments from @args (all start with "-") and returns the
# remainder of @args. We first check for full flags and options.
# If these don't match exactly we go through the argument looking
# for short flags and options 'globbed' together. Flags set
# $Arg{'X'} to 1. For short flags X is the flag, for full flags X
# is the first character. Short options set $Arg{'X'} to the next
# argument. Full options set $X to the next argument where X
# is the value from %fullopts. ZZZ Has been modified.
#--------------------------------------------------------------------------
sub ReadArgs {
    local($name, @out);
    while ( @_ && ($_ = shift) ) {
        s/^@@// && ( push(@HttpFiles, $_), next);
        s/^@//  && ( push(@out, &ReadFile($_)), next);
        (m#^/# || m#\.html?$#i || s#^http://#http://#i )
            && (push(@out, $_), next);
        s/^-// || &Error( qq~at "$_"~ .
            qq~\nexpected: "-flag" or "/linkset" or "http: ..."\n~ .
            $ErrUsage);
        if ( /^($MiscFlags)$/o ) {
            $Arg{$1}++;
        }
        elsif (/^($FullOpts)$/o) {
            (@_ < 1 || $_[0] =~ /^-/) &&
                &Error("expected parameter after -$_\n" . $ErrUsage);
            $Arg{$_} = shift;
        }
        elsif (/^($HashOpts)$/o) {
            (@_ < 1 || $_[0] =~ /^-/) &&
                &Error("expected parameter after -$_\n" . $ErrUsage);
            ($name = $_) =~ tr/a-z/A-Z/;
            &SetHash( $name, shift, 1);
        }
        elsif (/^password$/) {
            (@_ < 2 || $_[0] =~ /^-/ || $_[1] =~ /^-/ ) &&
                &Error("expected 2 parameters after -$_\n" . $ErrUsage);
            $_[1] =~ /:/ ||
                &Error(qq~expected username:password at "$_[1]"\n~ . $ErrUsage);
            $PASSWORD{$_[0]} = $_[1];
            shift; shift;
        }
        elsif (s/^db//) {
            while ($_) {
                s/^([\d])// || &Error("expected a digit after -db\n" . $ErrUsage);
                $DB{$1}++;
            }
        }
        else {
            &Error("unknown flag -$_\n" . $ErrUsage);
        }
    }
    return (@out, @_);
}

sub HttpInit {
    local($arg, $db, $pw, $headers, $agent) = @_;
    defined &Http'Init || do {
        unshift(@INC, $HOME);
        require "linkhttp.pl";
    };
    &Http'Init($arg, $db, $pw, $headers, $agent);
}

#==========================================================================
# End of linklint
#==========================================================================
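#--------------------------------------------------------------------------
# _ExampleHttpCheck()
#
# Illustrative sketch only (not part of linklint, never called): a
# minimal driving sequence for the Http package below, assuming
# Http'Init() has already been called with its usual arguments.
# The urls are made up.
#--------------------------------------------------------------------------
sub _ExampleHttpCheck {
    local(@urls) = ('http://www.example.com/', 'http://www.example.com/a.html');
    local(%status, %ok, %fail, %warn);
    %status = &Http'CheckURLS(@urls);              # urls -> status flags
    &Http'StatusMsg(*status, *ok, *fail, *warn);   # flags -> messages
    foreach (sort keys %fail) { print "$_: $fail{$_}\n"; }
}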
"\n" : pack("cc", 13, 10); for my $header_line (keys %headers) { my ($name, $value) = $header_line =~ m/^([\w\-]+):\s*(.*)/; $name && $value or do { &main'Warn("Could not parse -http_header $header_line"); next; }; $name = join "-", map ucfirst, map lc, split /-/, $name; $USER_REQ_HEAD{$name} = $value; $USER_HEADERS .= "${CRLF}$name: $value"; }; $StatType = "url last-modified cache"; $BotType = "robot exclusion cache"; $NOW = time; %FlagOk = ( 200, 'ok (200)', 201, 'ok created (201)', 202, 'ok accepted (202)', 304, 'ok not modified (304)', -2000, 'ok parsed html', -2001, 'ok skipped', -3005, 'ok last-modified date unchanged', -3006, 'ok did not compute checksum', -3007, 'ok checksum matched', ); %FlagNotMod = ( 304, 1, -3005, 1, -3006, 1, -3007, 1, ); %FlagMoved = ( 301, 'moved permanently (301)', 302, 'moved temporarily (302)', -3003, 'redirected', ); %FlagRetry = ( -1, 'could not find ip address', -2, 'could not open socket', -3, 'could not bind socket', -4, 'could not connect to host', -5, 'timed out waiting for data', -8, 'malformed status line', -12, 'timed out before anything could happen', -14, 'timed out connecting to host', -15, 'timed out waiting for data', -16, 'timed out reading status', -17, 'timed out reading header', -18, 'timed out reading data', -19, 'timed out getting host ip address', 502, 'server temporarily overloaded (502)', 503, 'gateway timeout (503)', ); %FlagWarn = ( -6, 'not an http link', -7, 'no status. Will try GET method', -10, 'Disallowed by robots.txt', -11, 'infinite redirect loop', 401, 'access not authorized (401)', -4000, 'moved to non-local server', -4010, 'invalid username/password (401)', -4020, 'unknown authorization scheme (401)', -5000, 'user interrupt', -6000, 'unknown internal error', ); %FlagFail = ( 204, 'had no content (204)', 301, 'moved permanently, no new URL (301)', 302, 'moved temporarily, no new URL (302)', 400, 'bad request (400)', 403, 'access forbidden (403)', 404, 'not found (404)', 500, 'internal server error (500)', 501, 'service not implemented on server (501)', ); %FlagBad = (%FlagRetry, %FlagWarn, %FlagFail); @FlagDebug = ( keys %FlagOk, keys %FlagMoved, keys %FlagRetry, keys %FlagFail, keys %FlagWarn, ); } #-------------------------------------------------------------------------- # OpenBot($botfile) # # Reads cache of robot exclusion info from $botfile filling %BotExclude. # Must be run after Init. 
#--------------------------------------------------------------------------
sub OpenBot {
    local($botfile) = @_;
    &ReadCache($botfile, $BotType, *BotExclude);
}

#--------------------------------------------------------------------------
# OpenCache($cachefile)
#
# Reads the status/last-modified cache from $cachefile into %StatCache.
#--------------------------------------------------------------------------
sub OpenCache {
    local($statfile) = @_;
    &ReadCache($statfile, $StatType, *StatCache);
}

#--------------------------------------------------------------------------
# FlushCache(@valid)
#
# Removes all entries in %StatCache that do not also occur in @valid.
#--------------------------------------------------------------------------
sub FlushCache {
    @_ || do { &main'Warn("Won't flush entire cache."); return; };
    local(%valid);
    grep( $valid{$_}++, @_);
    local(@delete) = grep( !$valid{$_}, keys %StatCache);
    foreach (@delete) { delete $StatCache{$_}; }
    local($cnt) = scalar @delete;
    &main'Progress("Deleted $cnt entries from cache.");
    @delete && $TaintCache{$StatType}++;
}

#--------------------------------------------------------------------------
# CheckURLS(@urls)
#
# Checks all http links in @urls.
# Returns hash of urls and status codes.
#--------------------------------------------------------------------------
sub CheckURLS {
    local($flag, %http);
    grep(s#^http://#http://#i && $http{$_}++, @_);
    local(%checked);
    &main'Progress(
        &main'Plural(scalar keys %http, "\nchecking %d url%s ...\n"));
    foreach $url (sort keys %http) {
        next unless $url =~ m#^http://#i;
        $Arg{'ignore'} && $url =~ m/$Arg{'ignore'}/o && next;
        next if $checked{$url}++;
        $flag = &CheckUrl($url, 0);
        $flag = &UrlMoved($flag, $url);
        $UrlStatus{$url} = $flag;
        if ($Arg{concise_url}) {
            $FlagBad{$flag} and do {
                &main'Progress($url);
                &main'Progress(" " . &ErrorMsg($flag));
            };
        }
        else {
            &main'Progress(" " . &ErrorMsg($flag));
        }
    }
    %UrlStatus;
}

#--------------------------------------------------------------------------
# $flag = UrlMoved($flag, $url)
#
# Processes 3XX status. Rechecks the $url given back in the Location
# field of the header. Continues until
# a) non-3XX status, b) infinite loop, c) already checked.
#--------------------------------------------------------------------------
sub UrlMoved {
    local($flag, $url) = @_;
    local(%checked, $next);
    while ( $FlagMoved{$flag} ) {
        $checked{$url}++ && return -11;                  # infinite loop
        ($next = $HEADER{'location'}) || return $flag;   # this is an error
        $UrlMoved{$url} = $FlagMoved{$flag};
        $Redirect{$url} = $next;
        $UrlStatus{$next} && return $UrlStatus{$next};
        $flag = &CheckUrl($url = $next, 1);
    }
    return $flag;
}
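#--------------------------------------------------------------------------
# Example sketch (not part of linklint): how UrlMoved() above follows a
# 3XX chain, with hypothetical urls A and B. If A is "moved permanently"
# to B and B then returns 200:
#
#   $UrlStatus{A} == 200     (the final status is reported for A)
#   $Redirect{A}  eq B       (the chain is recorded for printing)
#   $UrlMoved{A}  eq 'moved permanently (301)'
#
# A chain that revisits a url (A -> B -> A) stops with flag -11,
# "infinite redirect loop".
#--------------------------------------------------------------------------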
#--------------------------------------------------------------------------
# Parse($host, $link, $referer, *list, *anchlist, *wantanchor)
#
#--------------------------------------------------------------------------
sub Parse {
    local($thishost, $oldlink, $REFERER, @param) = @_;
    local($port) = $thishost =~ s/:(\d+)$// ? $1 : 80;
    local($host) = $thishost;
    local($scheme);
    local($LINK) = $oldlink;
    local($url) = "http://$thishost:$port$LINK";   # for warning messages
#   $HostError{$host} && return ($HostError{$host}, $LINK);
    local(%checked);
    ($flag = &Disallowed($url, 0.5)) && return ($flag, $LINK);
    $flag = &Request($host, $port, $LINK, 'GET', *GetParse, @param);
    while ( $FlagMoved{$flag} ) {
        $checked{$LINK}++ && return (-11, $LINK);        # infinite loop
        ($url = $HEADER{'location'}) || return ($flag, $LINK);
        ($scheme, $host, $port, $LINK) = &SplitUrl($url);
        ($host ne $thishost) && return (-4000, $url);    # non-local server
        &main'WasCached($LINK, $REFERER) && return (0, $LINK);
        $flag = &Request($host, $port, $LINK, 'GET', *GetParse, @param);
    }
    return($flag, $LINK);
}

#--------------------------------------------------------------------------
# $flag = CheckUrl($url, $recheck)
#
# Returns the status of a URL. Uses the robots.txt protocol.
# If $Arg{'redirect'} is set, uses GET and scans html files for
# redirects; otherwise uses a plain GET.
#--------------------------------------------------------------------------
sub CheckUrl {
    local($url, $recheck) = @_;
    local($scheme, $host, $port, $path) = &SplitUrlQ($url);
    return -6 if $scheme ne "http";
    $REQHEAD{'Host'} = $host;
    if ($Arg{concise_url}) { }
    elsif ($recheck) {
        &main'Progress(" moved");
        &main'Progress(" $host$path");
    }
    else {
        &main'Progress("$host$path");
    }
    $OpenedCache{$StatType} && ! $recheck && return &CheckModified($url);
    return $Arg{'redirect'}
        ? &Request($host, $port, $path, 'GET', "GetRedirect", $url)
        : &Request($host, $port, $path, "GET");
}

#--------------------------------------------------------------------------
# CheckModified($url)
#
# Outer wrapper to see if $url has been modified.
# Always updates status in cache. Always updates cache for new entries.
# Updates time, last-mod, checksum if new entry or reset.
#--------------------------------------------------------------------------
sub CheckModified {
    local($url) = @_;
    local($flag);
    local($_, $time, $mod, $csum) = $StatCache{$url}
        ? split(/\s+/, $StatCache{$url})
        : ('0', '0', '0', '0');
    local($nmod, $ncsum) = ($mod, $csum);
    $TIMESTR = &main'TimeStr('GMT', $time);   # for http header
    if ( $mod ne '0' ) {                      # only check last-mod date
        $flag = &Request($host, $port, $path, "GET");
        $FlagOk{$flag} && do {
            if ($HEADER{'last-modified'}) {
                ($nmod = $HEADER{'last-modified'}) =~ tr/ /_/;
                $flag = -3005 if ($nmod eq $mod);
            }
            else { $mod = $nmod = '0'; }
        };
    }
    else {
        $REQHEAD{'if-Modified-Since'} = $TIMESTR;
        $flag = &Request($host, $port, $path, 'GET', "GetModified");
    }
    $csum = -1 if $flag == 304;   # -1: server obeys "if-mod"
    $FlagNotMod{$flag} || !$FlagOk{$flag} ||
        ($UrlMod{$url} = "modified since " . &main'TimeStr('(local)', $time));
    #---- update cache if -reset or 1st time we get something new
    $FlagOk{$flag} && do {
        $time = $NOW   if $ResetCache || $time eq '0';
        $csum = $ncsum if $ResetCache || $csum eq '0';
        $mod  = $nmod  if $ResetCache || $mod  eq '0';
    };
    local($temp) = join(" ", $flag, $time, $mod, $csum);   # create new entry
    #---- update entry in cache if needed
    (!$StatCache{$url} || $StatCache{$url} ne $temp) && do {
        $StatCache{$url} = $temp;
        $TaintCache{$StatType}++;
    };
    $flag;
}
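#--------------------------------------------------------------------------
# Example sketch (not part of linklint): the layout of a %StatCache
# entry as built by CheckModified() above -- four space-separated
# fields, "flag time last-modified checksum", with made-up values:
#
#   $StatCache{'http://www.example.com/'} =
#       "200 998765432 Mon,_13_Aug_2001_12:00:00_GMT 12345";
#
# Spaces inside the last-modified date are stored as "_" so the entry
# stays one whitespace-separated line in the cache file.
#--------------------------------------------------------------------------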
#--------------------------------------------------------------------------
# GetModified($flag, *S)
#
# Passed to Request4(). We only get here if the file was modified or
# "If-Modified-Since" was ignored. Checks the last-mod date first,
# then checks the checksum only if asked to and if $time was > 0.
# If $csum == -1 then don't bother with the checksum.
#--------------------------------------------------------------------------
sub GetModified {
    local($flag, *S) = @_;
    $HEADER{'last-modified'} && do {   # use this info
        ($nmod = $HEADER{'last-modified'}) =~ tr/ /_/;
        return $flag if $mod ne '0' && ($nmod ne $mod);   # modified
        return -3005;                                     # not modified
    };
    $csum == -1 && return $flag;   # modified: obeyed "if-mod" before
    return -3006 unless $Arg{'checksum'} && $time != 0;   # don't checksum
    &main'Progress(" computing checksum");
    $ncsum = 0;
    while (<S>) { $ncsum += unpack("%32C*", $_); }        # compute csum
    return $flag if $csum ne '0' && ($ncsum != $csum);    # modified
    return -3007;                                         # not modified
}

#--------------------------------------------------------------------------
# UrlsFromCache($ok, $retry, $fail)
#
# Returns @urls with urls from the cache depending on the last
# status of the url in the cache.
#--------------------------------------------------------------------------
sub UrlsFromCache {
    local($ok, $retry, $fail) = @_;
    local(@urls);
    $OpenedCache{$StatType} || &main'Warn("No url cache to read from.");
    local($url, $cache, $flag);
    while ( ($url, $cache) = each %StatCache ) {
        ($flag) = split(" ", $cache);
        push(@urls, $url) if
            ($fail  && $FlagWarn{$flag} ) ||
            ($fail  && $FlagFail{$flag} ) ||
            ($ok    && $FlagOk{$flag}   ) ||
            ($retry && $FlagRetry{$flag});
    }
    @urls;
}

#--------------------------------------------------------------------------
# Recheck(@urls)
#
# Returns only those urls that need to be retried.
#--------------------------------------------------------------------------
sub Recheck {
    local(%retry);
    grep( $retry{$_}++, &UrlsFromCache(0, 1, 0));
    grep( $retry{$_}, @_);
}

#--------------------------------------------------------------------------
# $flag = Disallowed($url, $expire)
#
# Checks the robots.txt file for $url. Results are cached for each host.
# Returns:
#    -10  if access is excluded by robots.txt
#      0  if access is allowed
#    < 0  if a < 0 (non-http) error occurred
#--------------------------------------------------------------------------
sub Disallowed {
    local($url, $expire) = @_;
    local($scheme, $host, $port, $path) = &SplitUrl($url);
    local($flag);
    $BotExclude{$host} && do {
        local($time, $xpath) = split(/\s+/, $BotExclude{$host});
        local($secs) = 60 * 60 * 24 * ($expire || 30);
        $time + $secs >= $NOW && do {
            $xpath eq 'ok' && return 0;
            return ($path =~ m/^$xpath/) ? -10 : 0;
        };
    };
    &main'Progress(" checking robots.txt for $host");
    $flag = &Request($host, $port, "/robots.txt", "GET", "GetText", 100);
    return $flag if $FlagRetry{$flag};
    $BotExclude{$host} = time . " ok";   # default value
    $TaintCache{$BotType}++;             # need to write a new file
    return 0 unless $FlagOk{$flag};
    $_ = join("", @DATA);
    s/\r\n?/\n/g;                        # end-of-line = \r | \n | \r\n
    @DATA = split(/\n/, $_);
    local(@agents, @disallow);
    push(@DATA, " ");                    # ensure last group gets processed
    foreach (@DATA) {
        next if /^\s*#/;
        s/\s+$//;
        if ( /^$/ ) {
            if (@disallow && @agents) {
                $_ = join(" ", @disallow);   # prepare for use as regexp
                s#([^\w\s/])#\\$1#g;         # literal search (pretty)
                s/\s+/\|/g;
                $BotExclude{$host} = "$NOW $_";
                last if grep(/linklint/i, @agents);
            }
            @agents = @disallow = ();
            next;
        }
        s/\s*#.*$//;                         # trailing comments
        if ( m/^\s*(User.?Agent|Robot)s?\s*:\s+(\S+.*\S?)\s*$/i ) {
            push(@agents, $2);
        }
        elsif ( m/^\s*Disallow\s*:\s+(\S+.*\S?)\s*$/i ) {
            next unless grep(/(linklint|\*)/i, @agents);
            push(@disallow, $1);
        }
    }
    return &Disallowed(@_);   # only recurse once; relies on $BotExclude{$host}
}
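#--------------------------------------------------------------------------
# Example sketch (not part of linklint): a hypothetical robots.txt
#
#   User-Agent: *
#   Disallow: /cgi-bin
#   Disallow: /private
#
# is condensed by Disallowed() above into one cached entry
# (with $NOW = 998765432):
#
#   $BotExclude{'www.example.com'} = "998765432 /cgi\-bin|/private";
#
# and a later $path is refused with -10 when $path =~ m/^$xpath/.
#--------------------------------------------------------------------------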
#--------------------------------------------------------------------------
# GetParse($flag, *S, $link, *newlinks, *Anchor, *wantanch)
#
# Parses html via a remote http connection.
#--------------------------------------------------------------------------
sub GetParse {
    local($flag)   = shift;
    local($handle) = shift;
    &main'AppendList(*main'FileList, $LINK, $REFERER);
#   &main'CacheDir($LINK);
    $HEADER{'content-type'} =~ m#^text/html#i || return $flag;
    &main'StopRecursion($LINK, $REFERER) && return -2001;
    &main'Progress("Checking $LINK");
    &main'ParseHtml($handle, $LINK, @_);
    return -2000;   # parsed the file
}

#--------------------------------------------------------------------------
# GetRedirect($flag, 'S', $url)
#
# Passed to Request4() by CheckUrl() to parse html files for redirects.
#--------------------------------------------------------------------------
sub GetRedirect {
    local($flag) = shift;
    $HEADER{'content-type'} =~ m#^text/html#i || return $flag;
    local($redir) = &main'ParseRedirect(@_);
    $redir || return $flag;
    $HEADER{'location'} = $redir;
    return -3003;
}

#--------------------------------------------------------------------------
# GetText($flag, *S, $lines)
#
# Passed to Request4() to read $lines of text into @DATA.
# For now @DATA is a global.
#--------------------------------------------------------------------------
sub GetText {
    local($flag, *S, $lines) = @_;
    $HEADER{'content-type'} =~ /^text/ || return $flag;
    while (<S>) {                  # read $lines into @DATA
        push(@DATA, $_);
        $lines && (--$lines || last);
    }
    $flag;
}

#--------------------------------------------------------------------------
# $flag = Request($host, $port, $path, $method, *getmethod, @params)
#
# Handles host errors. Flags bad hosts. Caches host errors.
# Calls Request2(). Will retry if more than one IP address is given
# by gethostbyname() and we get %FlagRetry errors.
#
# Uses subroutine getmethod($flag, 'S', @params) if given
# to read data after the header. I don't have a great method for
# sending data back at the moment. Use globals for now.
#--------------------------------------------------------------------------
sub Request {
    my ($host, $port, $path, @other) = @_;
    my $flag;
    $REQHEAD{Host} = $host;
    if ($Proxy) {
        my $v_url = "http://$host";
        $port and $v_url .= ":$port";
        $v_url .= $path;
        $flag = Request2($Proxy, $Proxy_Port, $v_url, @other);
    }
    else {
        $flag = &Request2(@_);
    }
    %REQHEAD = ();
    return $flag;
}

#--------------------------------------------------------------------------
# Request2($host, $port, $path, ...)
#
# Wraps Request3() with HTTP Basic authentication: on a 401 response,
# looks up the realm's password and retries once with an
# Authorization header.
#--------------------------------------------------------------------------
sub Request2 {
    local($_, $flag, $realm, $pw, $auth);
    $flag = &Request3(@_);
    $flag == 401 || return $flag;
    ($auth = $HEADER{'www-authenticate'}) || do {
        &main'Warn(qq~missing authentication~, $url);
        return -4020;
    };
    $auth =~ m/\s*basic\s+realm\s*=\s*"([^"]*)"/i || do {
        &main'Warn(qq~unknown authorization scheme: $auth~, $url);
        return -4020;
    };
    $realm = $1;
    if ( $pw = $PassWord{$realm} || $PassWord{'DEFAULT'} ) {
        $REQHEAD{'Authorization'} = "Basic $pw";
        $flag = &Request3(@_);
        $flag != 401 && return $flag;
        &main'Warn(qq~invalid password for "$realm"~);
        return -4010;
    }
    else {
        &main'Warn(qq~need password for "$realm"~);
        return $flag;
    }
}

#--------------------------------------------------------------------------
# Request3($host, $port, $path, ...)
#
# Tries the request on the host's default IP address, then on any
# alternate addresses if the first attempt fails with a retryable error.
#--------------------------------------------------------------------------
sub Request3 {
    local($host) = shift;
    local($flag) = &Request4($host, @_);
    return $flag unless $FlagRetry{$flag};
    $IpAddr2{$host} || return &HostError($host, $flag);
    #----- try alternate ip addresses
    foreach (split(/\n/, $IpAddr2{$host})) {
        &main'Progress("Warning: $host $IpAddr{$host}");
        &main'Progress(" Error: $FlagRetry{$flag}");
        &main'Progress(" checking server $_");
        $flag = &Request4($_, @_);
        next if $FlagRetry{$flag};
        $IpAddr{$host} = $_;   # save good one as the default
        return $flag;
    }
    &HostError($host, $flag);  # all servers for this host are down
}

#--------------------------------------------------------------------------
# HostError($host, $flag)
#
# Fills %HostError with $flag, fills %HostFail with the error message.
# Returns $flag.
#--------------------------------------------------------------------------
sub HostError {
    local($host, $flag) = @_;
    $HostFail{$host}  = &ErrorMsg($flag);
    $HostError{$host} = $flag;
}
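#--------------------------------------------------------------------------
# Example sketch (not part of linklint): the Basic-auth handshake done
# by Request2() above, with made-up credentials. Given
#
#   -password WebAdmin user:secret
#
# on the command line, Init() stores
#
#   $PassWord{'WebAdmin'} = &Base64Encode("user:secret");  # "dXNlcjpzZWNyZXQ="
#
# and a 401 response carrying
#
#   WWW-Authenticate: Basic realm="WebAdmin"
#
# is retried once with the request header
#
#   Authorization: Basic dXNlcjpzZWNyZXQ=
#--------------------------------------------------------------------------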
#--------------------------------------------------------------------------
# $flag = Request4($host, $port, $path, $method, 'getmethod', @params)
#
# Fills %HEADER with header info.
# Uses globals set by Init().
# $flag is the error (or success) flag; see the %Flagxxx tables for
# details. Will use getmethod($flag, 'S', @params) to read data.
#--------------------------------------------------------------------------
sub Request4 {
    local($host, $port, $path, $method, $getmethod, @params) = @_;
    local($request, $ipaddr, $flag);
    $port = $port || 80;
    %HEADER = ();   # global %HEADER holds http header info.
    $DB{9} && do {
#       $HEADER{'location'} = 'http://dum/dum';
        return $flag = $FlagDebug[ int(rand(@FlagDebug)) ];
    };
    (($ipaddr, $flag) = &GetIpAddress($host));
    $flag && return $flag;
    $Delay && sleep($Delay);
    $DB{7} && print "\n$method http://$host$path\n",
        "host ip: $IpAddr{$host}\n";
    $request = "$method $path HTTP/1.0";
    $USER_HEADERS and $request .= $USER_HEADERS;
    $REQHEAD{"User-Agent"} = $UserAgent;
    for my $name (sort keys %REQHEAD) {
        $USER_REQ_HEAD{$name} and next;
        $request .= "${CRLF}$name: $REQHEAD{$name}";
    }
    $ALARMFLAG = -12;
    $@ = '';
    $TimeOut && alarm($TimeOut);
    $SOCKETOPEN = 0;
    eval {
        $flag = &FragileRequest($ipaddr, $request, $getmethod || \NUL, @params);
    };
    $TimeOut && alarm(0);
    $SOCKETOPEN && close(S);
    $SOCKETOPEN = 0;
    $@ || return $flag;
    $@ =~ /^timeout/ && return $ALARMFLAG;
    $@ =~ /^user interrupt/ || return -6000;
    print STDERR "\nUser Interrupt.\n",
        "Interrupt again to abort or wait 2 seconds for linklint to resume\n";
    sleep(2);
    return -5000;
}

#--------------------------------------------------------------------------
# FragileRequest($ipaddr, $request, *getmethod, @params)
#
# Actually sends $request, and reads status, header and data from the
# socket. Should be called from within an eval() to implement timeout.
#--------------------------------------------------------------------------
sub FragileRequest {
    local($ipaddr, $request, *getmethod, @params) = @_;
    my $iaddr = inet_aton($ipaddr);
    my $paddr = sockaddr_in($port, $iaddr);
    my $proto = getprotobyname('tcp');
    socket(S, PF_INET, SOCK_STREAM, $proto) or return -2;
    $SOCKETOPEN = 1;
    $ALARMFLAG = -14;   # "connecting to host"
    $DB{4} && &main'Progress(" Connecting");
    connect(S, $paddr) || return -4;   # -4: could not connect
    $DB{4} && &main'Progress(" Connect successful");
    $ALARMFLAG = -15;   # "waiting for data"
    local($lastsel) = select(S);
    $| = 1;
    select($lastsel);
    $DB{8} && print "\n$request\n\n";
    print S "$request$CRLF$CRLF";
#   $TimeOut && do {
#       local($rin) = '';
#       vec($rin, fileno(S), 1 ) = 1;
#       select($rin, undef, undef, $TimeOut ) || return -5;
#       $ALARMFLAG = -16;   # "reading status"
#   };
    $_ = <S>;          # read status line
    $_ || return -7;   # -7: no status (will try GET)
    $TimeOut && alarm($TimeOut);
    s/\s*$//;
    $flag = /^\s*\S+\s+(\d+)/ ? $1 : -8;   # -8: malformed status line
    ($DB{7} || $DB{8}) && print "$_\n";
    $flag == -8 && return $flag;
    $ALARMFLAG = -17;   # "reading header"
    local($name);
    while (<S>) {       # put header info into %HEADER
        s/\s*$//;
        $DB{8} && print "$_\n";
        ($DB{7} && $DB{8}) && next;
        last unless m/\w/;
        next unless m/(\S+):\s+(\S*.*\S*)\s*$/;
        ($name = $1) =~ tr/A-Z/a-z/;
        $HEADER{$name} = $2;
    }
    $flag == 200 && @_ > 2 && defined &getmethod && do {
        $ALARMFLAG = -18;   # "reading data"
        return &getmethod($flag, *S, @params);
    };
    return $flag;
}

sub AlarmHandler { die "timeout"; }
sub IntHandler   { die "user interrupt"; }
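#--------------------------------------------------------------------------
# Example sketch (not part of linklint): the raw HTTP/1.0 request that
# Request4() above assembles and FragileRequest() writes to the socket
# (host made up; every line ends in CRLF, followed by a blank line):
#
#   GET /index.html HTTP/1.0
#   Host: www.example.com
#   User-Agent: LinkLint-$agent/$Arg{'VERSION'}
#--------------------------------------------------------------------------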
#--------------------------------------------------------------------------
# GetIpAddress($host)
#
# Returns the UNpacked IP address for $host. Caches the address in
# $IpAddr{$host}, caches alternate IP addresses in $IpAddr2{$host}.
#--------------------------------------------------------------------------
sub GetIpAddress {
    local($host) = @_;
    local($_, @addrlist);
    $host =~ m/(\d+)\.(\d+)\.(\d+)\.(\d+)/ && ($IpAddr{$host} = $host);
    $IpAddr{$host} && return ($IpAddr{$host}, 0);   # use cached ip address
    $TimeOut && alarm($TimeOut);
    eval('($_,$_,$_,$_, @addrlist) = gethostbyname($host)');
    $TimeOut && alarm(0);
    $@ && $@ =~ /^timeout/ && return (0, -19);
    @addrlist || return (0, -1);                    # could not find host
    grep($_ = join(".", unpack("C4", $_) ), @addrlist);
    $IpAddr{$host}  = shift @addrlist;              # 1st one is default
    $IpAddr2{$host} = join("\n", @addrlist);        # save others just in case
    return ($IpAddr{$host}, 0);
}

#--------------------------------------------------------------------------
# SplitUrl($url)
#
# Splits the given URL into its component parts according to HTTP rules.
# Returns ($scheme, $host, $port, $path, $query, $frag).
#--------------------------------------------------------------------------
sub SplitUrl {
    local($_) = $_[0];
    local($scheme) = s#^(\w+):##          ? $1 : '';
    local($host)   = s#^//([^/]*)##       ? $1 : '';
    local($port)   = $host =~ s/:(\d*)$// ? $1 : '';
    local($query)  = s/\?(.*)$//          ? $1 : '';
    local($frag)   = s/#([^#]*)$//        ? $1 : '';
    $scheme =~ tr/A-Z/a-z/;
    $_ = $_ || '/';
    return ($scheme, $host, $port, $_, $query, $frag);
}

#--------------------------------------------------------------------------
# SplitUrlQ($url)
#
# Splits the given URL like SplitUrl() but re-attaches the query
# string to the path.
# Returns ($scheme, $host, $port, $path).
#--------------------------------------------------------------------------
sub SplitUrlQ {
    local($scheme, $host, $port, $path, $query) = &SplitUrl(@_);
    $query && ($path .= "?" . $query);
    return ($scheme, $host, $port, $path);
}

#--------------------------------------------------------------------------
# StatusMsg(*all, *ok, *fail, *warn)
#
# Fills %ok, %fail, %warn with urls and error MESSAGES from
# urls and error FLAGS in %all.
#--------------------------------------------------------------------------
sub StatusMsg {
    local(*all, *ok, *fail, *warn) = @_;
    while ( ($url, $flag) = each %all ) {
        $msg = &ErrorMsg($flag);
        $FlagOk{$flag}   && ($ok{$url}   = $msg, next);
        $FlagWarn{$flag} && ($warn{$url} = $msg, next);
        $fail{$url} = $msg;
    }
}

#--------------------------------------------------------------------------
# RetryCount(*flags)
#
# Returns the number of flags in %flags that are retryable.
#--------------------------------------------------------------------------
sub RetryCount {
    local(*flags) = @_;
    scalar grep($FlagRetry{$_}, values %flags);
}

#--------------------------------------------------------------------------
# ErrorMsg($flag)
#
# Returns the error message associated with $flag.
#--------------------------------------------------------------------------
sub ErrorMsg {
    local($flag) = @_;
    $DB{6} || ($FlagOk{$flag} && return 'ok');
    $FlagOk{$flag} || $FlagFail{$flag} || $FlagWarn{$flag} ||
        $FlagRetry{$flag} || "unknown error ($flag)";
}

#--------------------------------------------------------------------------
# OtherStatus(*urlmod, *hostfail, *urlmoved, *redirect)
#
# Fills hashes with modified, moved, and host fail messages.
# Fills *redirect with redirected urls and destinations.
#--------------------------------------------------------------------------
sub OtherStatus {
    local(*urlmod, *hostfail, *urlmoved, *redirect) = @_;
    %urlmod   = %UrlMod;
    %urlmoved = %UrlMoved;
    %hostfail = %HostFail;
    %redirect = %Redirect;
}
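#--------------------------------------------------------------------------
# Example sketch (not part of linklint): what the split routines above
# return for made-up urls:
#
#   &SplitUrl('http://www.example.com:8080/a/b.html#top')
#       -> ('http', 'www.example.com', 8080, '/a/b.html', '', 'top')
#
#   &SplitUrlQ('http://www.example.com/cgi/find?x=1')
#       -> ('http', 'www.example.com', '', '/cgi/find?x=1')
#--------------------------------------------------------------------------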
#--------------------------------------------------------------------------
# ReadCache($file, $type, *hash)
#
# Reads $file, looking for "key val1 val2 ..." on each line.
# Fills $hash{key} = "val1 val2 ...".
# Sets $OpenedCache{$type} to $file for writing later.
# Also used to only read the cache file once.
#--------------------------------------------------------------------------
sub ReadCache {
    local($file, $type, *hash) = @_;
    $OpenedCache{$type} && return;
    $OpenedCache{$type} = $file;
    open(CACHE, $file) || return;
    ($DB{7} || $DB{8}) && &main'Progress(qq~reading $type from "$file"~);
    foreach (<CACHE>) {
        /^#/ && next;
        /^(\S+)\s+(\S+)\s+(\S.*\S)\s*$/ || next;
        $hash{$1} = "$2 $3";
    }
    close(CACHE);
}

#--------------------------------------------------------------------------
# WriteCaches()
#--------------------------------------------------------------------------
sub WriteCaches {
    &WriteCache($BotType,  *BotExclude);
    &WriteCache($StatType, *StatCache);
}

#--------------------------------------------------------------------------
# WriteCache($type, *hash)
#
# Writes the keys and values of %hash to file, preceded by $header.
# Will only write if the cache was opened via ReadCache() and if the
# cache has been tainted via %TaintCache.
#--------------------------------------------------------------------------
sub WriteCache {
    local($type, *hash) = @_;
    local($file) = $OpenedCache{$type} || return;
    ($TaintCache{$type} && %hash) || return;
    open(CACHE, ">$file") || do {
        &main'Warn(qq~Could not write $type to "$file"\n~, 'sys');
        return;
    };
    local($header) = "# $type created by linklint\n" .
        "# Use -cache flag or set environment variable LINKLINT\n" .
        "# to change this file's directory.\n\n";
    print CACHE $header;
    foreach ( sort keys %hash ) { print CACHE "$_ $hash{$_}\n"; }
    close CACHE;
    &main'Progress(qq~\nWrote $type to "$file"~);
}

sub FlagOk   { $FlagOk{$_[0]}   ? 1 : 0; }
sub FlagWarn { $FlagWarn{$_[0]} ? 1 : 0; }

#--------------------------------------------------------------------------
# Base64Encode($plain)
#
# Returns the base64 encoding of $plain (used for Basic authentication).
#--------------------------------------------------------------------------
sub Base64Encode {
    local($plain) = @_;
    local($out, @out);
    local(@bits) = split(//, unpack('B*', $plain));
    while (@bits > 0) {
        $out = join('', splice(@bits, 0, 6));
        while (length($out) < 6) { $out .= '0' };
        push(@out, $Base64[hex(unpack('H*', pack('B8', "00$out")))]);
    }
    while ( @out % 4 ) { push(@out, "="); }
    join('', @out);
}

sub d_u_m_m_y { $main'FileList = 0 }

1;   # required packages must return true

#==========================================================================
# End of linkhttp.pl
#==========================================================================

linklint-2.3.5.orig/linklint.bat:

perl \bin\linklint %1 %2 %3 %4 %5 %6 %7 %8 %9