updating checksum on updated indexcodes.txt file? #201
I think this should be an epadmin command line option:

/opt/eprints3/bin/epadmin redo_hashes file ... - Regenerate hashes for all the file IDs listed, using the hash type currently specified or defaulting to MD5 (maybe have a flag for the hash type).

/opt/eprints3/bin/epadmin redo_hashes document - Regenerate hashes for all the files associated with the document with docid. (Allow multiple docids to be specified.)

/opt/eprints3/bin/epadmin redo_hashes eprint - Regenerate hashes for all the files associated with all documents associated with the eprint with eprintid. (Allow multiple eprintids to be specified.)

This is a bit like the redo_mime_type epadmin option, but it updates the hash (and potentially hash_type) rather than the mime_type.
I think that would be a super useful epadmin command for digital preservation, allowing easy API access to troubleshoot an incorrect hash for a file. We can end up with an incorrect stored hash for many different reasons, including an unusual history of operations or file corruption. I realize that the example I am using is a derivative file, so arguably its checksum is less important than that of an uploaded file, but since we are storing checksums for these files, the effort to maintain a correct checksum is worthwhile. This same epadmin enhancement would also be useful for updating/fixing the stored checksum for any specific uploaded/deposited files.
I have added this function in 2c0f6b3. I have made a few tweaks to be consistent with the redo_mime_type option:
I am happy to consider opening new issues for 2 or 3 if deemed necessary. 2 would also allow redo_mime_type to be run against the eprint data objects. 3 would require a considerable amount of work to ensure this was configurable and that all cases used the configured algorithm.
Thank you! That is great, I will test it out on the file that was causing issues and let you know if it worked.
I am testing out the epadmin redo_hash command, passing a specific documentid, like this:

Line 1321 in 2c0f6b3

that looks like this:

print "redo_hash for ".$file."\n";

The output looks like this, one line only:

redo_hash for EPrints::DataObj::File=HASH(0x561c3b729fb8)

Since I only see that one file's hash was updated, that's probably the "main" file for the document, but not the derivatives? The other option would be to pass the fileid of the indexcodes.txt file, right? Something like this?

./epadmin redo_hash repoid file fileid

How would I find out what the fileid is for the indexcodes.txt file? Thank you!
indexcodes.txt will be a separate document with its own docid, distinct from the original document. Using the document dataset is really only useful if multiple files have been uploaded for a document (e.g. upload from URL for a webpage with accompanying images, CSS, etc.). Therefore it sounds like redo_hash for eprint will be necessary, so you can redo the hash for the original documents and indexcodes at the same time. I will look at updating redo_mime_type at the same time, so this can be applied to all files associated with an eprint.
Thank you for the clarification. Sorry for my confusion, it makes sense now. Yes, it would be useful to be able to run redo_hash (and redo_mime_type) on an eprint, but I also figured that I can include the fileids in the processing log in the eprints-archivematica plugin itself, so that if/when a checksum mismatch happens, I would know the fileid of any file that throws a mismatch, even if it is a derivative file. |
I added the fileid and the stored checksum to the processing log of the eprints-archivematica plugin. I also confirmed using md5sum that the stored checksum value is incorrect for the indexcodes.txt file. When I call the new epadmin redo_hash with the fileid like this:
I think I have figured out what the bigger issue is here. The indexcodes.txt file does not have an EOL (end-of-line) character, but if you edit it with vim or similar, this can add an EOL (unless you do :set binary and :set noeol). Here is an example of the hexdump for a hello world file with and without an EOL:
These have the following md5sums, respectively. If you then md5sum this file at the command line and use redo_hash, the value you get in the database is different from what you get at the command line. This is because the EPrints code will remove this EOL before it generates the MD5, but the command-line tool will not. My assumption is that whatever is checking the indexcodes.txt for Archivematica is somehow including the EOL before doing its MD5 generation. This would explain why you have the mismatch in the first place.
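The effect is easy to reproduce. A standalone Python sketch (not EPrints code) showing that a single trailing EOL byte changes the MD5 entirely:

```python
import hashlib

# The same "hello world" content with and without a trailing EOL byte.
without_eol = b"hello world"    # 11 bytes, no trailing newline
with_eol = b"hello world\n"     # 12 bytes, as vim would save it by default

md5_without = hashlib.md5(without_eol).hexdigest()
md5_with = hashlib.md5(with_eol).hexdigest()

# The single extra 0x0a byte yields a completely different digest:
#   md5_without == "5eb63bbbe01eeed093cb22bb8f5acdc3"
#   md5_with    == "6f5902ac237024bdd0c176cb93063dc4"
```

These are the same two digests you get from `printf "hello world" | md5sum` and `echo "hello world" | md5sum` at the command line.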
Thank you, @drn05r. I don't think this is caused by vim or another text editor. You had the idea to check the filesize, and I think that's it! After adding some additional tracking info (fileid, docid, filesize, hash) to the output of the process_transfers log (eprintsug/EPrintsArchivematica@d229b8a), I was able to confirm that in this case the filesize stored in EPrints for this file is incorrect: 11 bytes less than what ls -l shows on disk.
The redo_hash option will only work for local files. It is unlikely this will ever need to be used for remote files, but a change would be needed to facilitate this if it did.
Thank you. It worked! This is such a great improvement to the rehash functionality!
The scenario is that of an update to an existing document's indexcodes.txt file.
I believe that the line that makes the update is here?
eprints3.4/perl_lib/EPrints/DataObj/Document.pm
Line 1674 in d2c7fe5
The question is: would that skip the rehashing of the updated file? So you would end up with the incorrect hash for the updated indexcodes.txt file?
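To make the suspected failure mode concrete, here is a minimal Python sketch (a hypothetical record dict, not the EPrints data model) of an update path that rewrites content without refreshing the stored hash:

```python
import hashlib

def md5_of(data):
    """Hex MD5 digest of a bytes value."""
    return hashlib.md5(data).hexdigest()

# Hypothetical file record for illustration only.
record = {"content": b"old index codes", "hash": None}
record["hash"] = md5_of(record["content"])  # hash set at creation time

# An update that replaces the content but skips rehashing:
record["content"] = b"new index codes"

# The stored hash no longer matches the current content.
stale = record["hash"] != md5_of(record["content"])
```

If the update code path for indexcodes.txt behaves like this, the stored hash would describe the previous version of the file, which matches the mismatch observed above.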