Redirect specific eprints to somewhere else

This page is based on a question posed on the EP-Tech mailing list, "Adding 301 redir_permanent for some migrated items before 404 kicks in", about ways to redirect specific EPrints to different URLs (e.g. where certain items have been migrated to other services).

Two solutions were suggested, plus an additional (untested) approach. All three are detailed below.

  • use an EPrints URL rewrite trigger
  • use the `rewrite_exceptions` configuration, in combination with an Apache `RewriteMap`
  • register a new PerlTransHandler to do rewrites

The best approach will depend on the size of the repository, the percentage of items that need to be redirected, and whether additional aspects (e.g. mapping document-level requests or export URLs) are needed.

The `EP_TRIGGER_URL_REWRITE` trigger fires on every request, so it's important to keep the code fast and to return immediately for URLs that are of no interest.

The `rewrite_exceptions` array is also searched on every request, so a very large array may slow your repository down.

EPrints URL rewrite trigger

# Save as e.g. EPRINTS_ROOT/archives/ARCHIVE_ID/cfg/cfg.d/z_url_rewrite_map.pl
# source is available from: https://gist.github.com/jesusbagpuss/2e9172a2ee94d4dcfc5f23a08486f3fe 
# Maps specific URLs to be permanently redirected

use EPrints::Const; # for trigger return values

# define specific URLs that have been moved.
# If all the new URLs are to the same base URL, you could have e.g 1234 => 5678 in the 
#    hash, and prepend a static 'https://new.repo.com/' in the 'Location' line below.
# the hash could also be defined by reading data in from a file (e.g. csv)

$c->{z_url_rewrite_map} = {
	'1234' => 'https://abc.de/1/',
	'2345' => 'https://abc.de/9/',
};

$c->add_trigger( EP_TRIGGER_URL_REWRITE, sub {
	my( %o ) = @_;

	# Available hash keys in %o
	#   request, lang (en), args ("" or "?foo=bar"), urlpath ("" or "/subdir"), cgipath ("/cgi" or "/subdir/cgi")
	#   uri (/foo/bar), secure ( boolean ), return_code (set to trigger a return - stop stack of triggers), repository

	my $redirect_map = $o{repository}->get_conf( "z_url_rewrite_map" );
	return if !defined $redirect_map;

	# Optimisation over the 404 handler regex...
	# If you have a long-standing repository, there may be EPrint URLs out in the wild that refer to repo.com/1234 (without trailing slash).
	# Currently, EPrints will redirect this to repo.com/1234/ - and then m#^/(\d+)/# (from 404 handler) would capture it.
	# This could result in multiple redirects ( /1234 --> /1234/ --> other server ) which isn't best practice for SEO.
	#
	# The first regex below will match `/1234`, `/1234/`, `/00001234`, `/00001234/` (zero-padded IDs are from old versions of EPrints - but may still be in use).
	# These are currently handled in EPrints::Apache::Rewrite with a redirect, so capturing them here saves one extra redirect.
	#
	# If your repo is _really_ old, you might also want to capture repo.com/archive/1234 or repo.com/archive/00001234...
	# ...I guess looking in the Apache logs will indicate if this is necessary!

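	# The second regex also accepts `/id/eprint/1234` with no trailing slash, which is the usual form of EPrints URIs.
	if( defined $o{uri} && ( $o{uri} =~ m#^/0{0,9}(\d+)(?>/|$)#  || $o{uri} =~ m#^/id/eprint/0{0,9}(\d+)(?>/|$)# ) )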
	{ 
		# We've got what looks like an EPrintID...
		# Is there a redirect map for it?
		if( defined $redirect_map->{$1} ){
			EPrints::Apache::AnApache::send_status_line( $o{request}, 301, "Moved Permanently" );
			EPrints::Apache::AnApache::header_out( $o{request}, "Location", "$redirect_map->{$1}" );
			EPrints::Apache::AnApache::send_http_header( $o{request} );

			${$o{return_code}} = EPrints::Const::DONE;
			return EP_TRIGGER_DONE;
		}
		# an EPrintID, but not redirected.
	}
	# not an EPrintID URL - just return (could explicitly set a trigger return value, but not necessary)
} );
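
The comments in the config above mention that the map could be read from a file. A minimal sketch of that idea follows, assuming a simple two-column `eprintid,new_url` file. The helper name `load_redirect_map` and the file layout are illustrative and are not part of EPrints. Leading zeros are stripped from the IDs because the trigger's regex captures the ID without zero-padding.

```perl
use strict;
use warnings;

# Hypothetical helper (not an EPrints function): read "eprintid,new_url"
# lines from a filehandle and return a hash reference of id => url.
sub load_redirect_map
{
	my( $fh ) = @_;

	my %map;
	while( my $line = <$fh> )
	{
		$line =~ s/^\s+|\s+$//g;                 # trim whitespace and line endings
		next if $line eq '' || $line =~ m/^#/;   # skip blank lines and comments

		# split on the first comma only, so URLs containing commas survive
		my( $id, $url ) = split /\s*,\s*/, $line, 2;
		next if !defined $url || $url eq '' || $id !~ m/^\d+$/;

		$id =~ s/^0+(?=\d)//;   # '00001234' => '1234', to match the trigger's capture
		$map{$id} = $url;
	}
	return \%map;
}

# Demonstration with an in-memory file; a real config would open a file on disk.
my $csv = <<'END';
# eprintid,new_url
1234,https://abc.de/1/
00002345 , https://abc.de/9/
not-an-id,https://abc.de/ignored/
END

open( my $fh, '<', \$csv ) or die "Cannot open in-memory file: $!";
my $map = load_redirect_map( $fh );
close( $fh );

print "$_ => $map->{$_}\n" for sort { $a <=> $b } keys %$map;
```

In a `cfg.d` file the result could be assigned with `$c->{z_url_rewrite_map} = load_redirect_map( $fh );`, so that the file should only be read when the configuration is loaded, not on every request.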

Apache RewriteMap solution

By default, EPrints processes every request, so adding extra directives to the Apache config has no effect. Luckily, EPrints provides a configuration variable listing URL stubs that it should not process.

# Save as e.g. EPRINTS_ROOT/archives/ARCHIVE_ID/cfg/cfg.d/z_rewrite_exceptions.pl - ***BUT CHECK YOUR CONFIG FOR AN EXISTING DEFINITION!!!***
$c->{rewrite_exceptions} = [ '/1234/', '/2345/' ];

URLs matching those specified will not be handled by the EPrints stack, allowing Apache to handle the requests. In the Apache config below, the `newid` part shows how to map the 'exception' IDs to one new system (with new IDs). The `newurl` part can be used to map the exception IDs to any URL.

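# In apache vhost config file
# (RewriteEngine must be on for the rules to apply)
RewriteEngine On
RewriteMap newid "txt:/etc/apache2/newid.txt"
RewriteMap newurl "txt:/etc/apache2/newurl.txt"

# R=301 sends a permanent redirect (a bare R flag defaults to 302).
# The target needs a scheme, otherwise it is treated as a path on this server.
RewriteCond ${newid:$1|Unknown} !Unknown
RewriteRule "^/([0-9]+)/?$" "https://new.server/${newid:$1|$1}" [R=301,L]

RewriteCond ${newurl:$1|Unknown} !Unknown
RewriteRule "^/([0-9]+)/?$" "${newurl:$1}" [R=301,L]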

The newid.txt file would contain something like:

 1234    1234
 2345    9876

The newurl.txt file would contain something like:

 1234    https://new.host/item1000
 2345    https://some.other.host/42/life

The Apache RewriteMap also allows other approaches - e.g. querying a database to find matches. Just make sure it's quick and stable!

PerlTransHandler (possible solution for proper Perl geeks - completely untested!)

EPrints is added to the Apache configuration with the following config:

   PerlTransHandler +EPrints::Apache::Rewrite

The `+` before the module name tells Apache to load the specified module before using it. It's equivalent to including a specific `PerlModule Apache::Foo` line.

`PerlTransHandler` is a stacked handler. This means you could write your own custom Perl module to handle the redirects:

# save as e.g. EPRINTS_ROOT/lib/plugins/Custom/Rewrites.pm
# THIS APPROACH IS UNTESTED! (feel free to test it, and correct it ;)
package Custom::Rewrites;

use strict;
use warnings;

use Apache2::RequestRec ();
use APR::Table (); # provides add() on the err_headers_out table
use Apache2::Const -compile => qw(DECLINED HTTP_MOVED_PERMANENTLY);

# Placeholder map of EPrint ID => new location (illustrative values only).
# In practice this could be loaded from a file or database when the module is loaded.
my %redirects = (
    '1234' => 'https://abc.de/1/',
    '2345' => 'https://abc.de/9/',
);

# default method name used by Apache Perl handlers
sub handler {
    my $r = shift;

    my ( $id ) = $r->uri =~ m#^/0{0,9}(\d+)(?>/|$)#; # matching id. could also capture pos, filename etc. if needed

    # Not an EPrint URL, or this ID does not need to be redirected.
    # This handler returns 'DECLINED', and the next handler is passed the request to process
    return Apache2::Const::DECLINED if !defined $id || !exists $redirects{$id};

    $r->err_headers_out->add( Location => $redirects{$id} );
    return Apache2::Const::HTTP_MOVED_PERMANENTLY;
}
1;

The above handler can be added to the `PerlTransHandler` stack before the EPrints one in the `VirtualHost` definition:

<VirtualHost *:443>
  ServerName eprints.somewhere
  ...
  ...
  # Handle redirects with custom module first
  PerlTransHandler +Custom::Rewrites
  # then fall back to 'normal' EPrints handling
  PerlTransHandler +EPrints::Apache::Rewrite
</VirtualHost>
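
Both Perl approaches rely on the same kind of pattern to pull an EPrint ID out of the request path, so it may be worth checking that pattern against a few sample URIs before deploying. The sketch below wraps it in a hypothetical `eprintid_from_uri` function. Note that, as an assumption, its `/id/eprint/` pattern accepts the ID with or without a trailing slash.

```perl
use strict;
use warnings;

# Hypothetical helper: return the EPrint ID found at the start of a URI path,
# or undef if the path does not look like an EPrint URL.
sub eprintid_from_uri
{
	my( $uri ) = @_;
	return if !defined $uri;

	# $1 holds the capture from whichever pattern matched
	if( $uri =~ m#^/0{0,9}(\d+)(?>/|$)# || $uri =~ m#^/id/eprint/0{0,9}(\d+)(?>/|$)# )
	{
		return $1;
	}
	return;
}

# Sample paths: plain, trailing slash, zero-padded, document-level, URI form,
# and two that should not match.
my @samples = qw( /1234 /1234/ /00001234/ /1234/1/paper.pdf /id/eprint/1234 /view/year/ /1234abc );

foreach my $uri ( @samples )
{
	my $id = eprintid_from_uri( $uri );
	printf "%-20s => %s\n", $uri, defined $id ? $id : '(no match)';
}
```

Because the first pattern only anchors the start of the path, document-level URLs such as `/1234/1/paper.pdf` also yield the ID, which matters if document requests should be redirected differently from abstract pages.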

See Also