updating checksum on updated indexcodes.txt file? #201

Closed
photomedia opened this issue Mar 7, 2022 · 12 comments

@photomedia

The scenario is that of an update to an existing document's indexcodes.txt file.

I believe that the line that makes the update is here?

$file->set_file( sub {

The question is: would that skip the rehashing of the updated file? So you would end up with the incorrect hash for the updated indexcodes.txt file?

@drn05r drn05r self-assigned this Mar 7, 2022
@drn05r drn05r added the enhancement New feature or request label Mar 7, 2022
@drn05r drn05r added this to the 3.4.3 milestone Mar 7, 2022
@drn05r
Contributor

drn05r commented Mar 7, 2022

I think this should be an epadmin command line option:

/opt/eprints3/bin/epadmin redo_hashes file ... - Regenerate hashes for all the file IDs listed, using the current hash type specified or defaulting to md5 (maybe have a flag for the hash type)

/opt/eprints3/bin/epadmin redo_hashes document - Regenerate hashes for all the files associated with the document with docid. (Allow multiple docids to be specified).

/opt/eprints3/bin/epadmin redo_hashes eprint - Regenerate hashes for all the files associated with all documents associated with eprint with eprintid. (Allow multiple eprintids to be specified).

This is a bit like the redo_mime_type epadmin option, but it would update the hash (and potentially hash_type) rather than the mime_type.

@photomedia
Author

photomedia commented Mar 7, 2022

I think that would be a super useful epadmin command for digital preservation, allowing for easy API access to troubleshoot an incorrect hash for a file. We can end up with an incorrect hash stored for many different possible reasons, including unusual history of operations or file corruption. I realize that the example I am using is that of a derivative file, so arguably, the checksum for that is less important than that for an uploaded file, but nevertheless, since we are storing checksums for these files, the effort to maintain a correct checksum is worthwhile. Also, this same epadmin enhancement would be useful for updating/fixing the stored checksum for any specific uploaded/deposited files as well.
The ability to do this for an entire eprint or document would be very useful as well. Our repository had hundreds (if not thousands) of deposited documents with no checksum/hash stored in the database at all. I'm not sure why/how that came about, but having a command line for this would create a "standard" way of troubleshooting that, filling in a checksum for any deposited documents that don't have one.

@drn05r
Contributor

drn05r commented Mar 9, 2022

I have added this function in 2c0f6b3

I have made a few tweaks to be consistent with the redo_mime_type option:

  1. Changed option name to redo_hash rather than redo_hashes.
  2. Only allowed redo_hash to be used against file or document data objects (not eprint)
  3. Although not originally mentioned above (only on the eprints-tech list), abandoned the idea of being able to choose the hash type. This is hardcoded to MD5, so allowing SHA as an option could lead to inconsistent hash types being used. Although SHA is less likely to produce duplicates for different files, the purpose here is checking file integrity rather than testing that a file has not been intercepted and modified, where SHA256 or a more sophisticated algorithm would be necessary to guarantee this.

I am happy to consider opening new issues for 2 or 3 if deemed necessary. 2 would also allow redo_mime_type to be run against the eprint data objects. 3 would require a considerable amount of work to ensure this was configurable and that all cases used the configured algorithm.

@photomedia
Author

Thank you! That is great, I will test it out on the file that was causing issues and let you know if it worked.
About 2, I'm not sure if I understand "2 would also allow redo_mime_type to be run against the eprint data objects"? You mean that for consistency with redo_mime_type, it can only be run on documents and files (not eprints)?
About 3, I agree with the choice of MD5. We did some testing/digital archeology of EPrints, and we found that it is using MD5 consistently throughout. The "hash_type" is there, but I don't think anyone has this set to anything other than MD5. It makes sense to keep it that way. You are right, the MD5 hash is used for preservation/integrity checking.

@photomedia
Author

I am testing out the epadmin redo_hash command, passing a specific documentid, like this:
./epadmin redo_hash repoid document docid
I was hoping/thinking that this should update the hash for all of the derivative files associated with this document.
However, it didn't work, the hash for indexcodes.txt was not updated. I know this because when I added trace here:

$file->update_md5;

that looks like this:
print "redo_hash for ".$file."\n";
The output looks like this, one line only:
redo_hash for EPrints::DataObj::File=HASH(0x561c3b729fb8)
Since I only see that 1 file's hash was updated, that's probably the "main" file for the document, but not the derivatives?
The other option would be to pass the fileid of the indexcodes.txt file, right? something like this?
./epadmin redo_hash repoid file fileid
How would I find out what the fileid is for the indexcodes.txt? Thank you!

@drn05r
Contributor

drn05r commented Mar 10, 2022

indexcodes.txt will be a separate document with its own docid, distinct from that of the original document. Using the document dataset is really only useful if multiple files have been uploaded for a document (e.g. upload from url for a webpage with accompanying images, CSS, etc.). Therefore it sounds like redo_hash for eprint will be necessary, so you can redo the hash for the original documents and indexcodes at the same time. I will look at updating redo_mime_type at the same time, so this can be applied to all files associated with an eprint.
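
To illustrate why running redo_hash against a single document misses indexcodes.txt, here is a minimal Python sketch of the eprint, document, and file hierarchy described above. It is only a toy model, not EPrints code: the class names, the `files_for_eprint` helper, the IDs, and the `paper.pdf` filename are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    fileid: int       # hypothetical ID
    filename: str


@dataclass
class Document:
    docid: int        # hypothetical ID
    files: List[File] = field(default_factory=list)


@dataclass
class EPrint:
    eprintid: int     # hypothetical ID
    documents: List[Document] = field(default_factory=list)


def files_for_eprint(eprint: EPrint) -> List[File]:
    """Collect every file from every document attached to the eprint."""
    return [f for doc in eprint.documents for f in doc.files]


# The uploaded file and the indexcodes derivative live in separate documents.
original = Document(docid=10, files=[File(100, "paper.pdf")])
indexcodes = Document(docid=11, files=[File(101, "indexcodes.txt")])
eprint = EPrint(eprintid=1, documents=[original, indexcodes])

# Walking only the original document never reaches indexcodes.txt ...
print([f.filename for f in original.files])
# ... whereas walking the whole eprint covers both documents.
print([f.filename for f in files_for_eprint(eprint)])
```

In this model, an eprint-level operation is the only one that reaches both the uploaded file and its derivative in a single call.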

@photomedia
Author

Thank you for the clarification. Sorry for my confusion, it makes sense now. Yes, it would be useful to be able to run redo_hash (and redo_mime_type) on an eprint, but I also figured that I can include the fileids in the processing log in the eprints-archivematica plugin itself, so that if/when a checksum mismatch happens, I would know the fileid of any file that throws a mismatch, even if it is a derivative file.

@photomedia
Author

photomedia commented Mar 10, 2022

I added the fileid, and the stored checksum to the processing log of the eprints-archivematica plugin. I also confirmed using md5sum that the checksum value is incorrect for the indexcodes.txt file. When I call the new epadmin redo_hash with the fileid like this:
./epadmin redo_hash REPOID file FILEID
This does not change/update the incorrect hash in the database. There is no trace/verbose output, so I don't know where it fails, but when I attempt to re-process the eprint, the same incorrect hash is traced on the file. Checksum mismatch results, because the plugin calculates the correct MD5 and throws the mismatch with what is stored in the EPrints database.
I also added the docid to the log of the plugin, so I was able to attempt the redo_hash command with the DOCID like this:
./epadmin redo_hash REPOID document DOCID
This also didn't work, the incorrect hash remains stored for this one file.

@drn05r
Contributor

drn05r commented Mar 10, 2022

I think I have figured out what the bigger issue is here. The indexcodes.txt file does not have an eol (end of line), but if you edit it with vim or similar then this can add an eol (unless you do :set binary :set noeol). Here is an example of the hexdump for a hello world file with and without an eol:

0000000 6568 6c6c 206f 6f77 6c72 0a64
000000c

0000000 6568 6c6c 206f 6f77 6c72 0064
000000b

They respectively have the md5sums 6f5902ac237024bdd0c176cb93063dc4 and 5eb63bbbe01eeed093cb22bb8f5acdc3.

If you then md5sum this at the command line and use redo_hash at the command line the value you get in the database is different from what you get at the command line. This is because the EPrints code will remove this eol before it generates the md5 but the command line tool will not. My assumption is that whatever is checking the indexcodes.txt for Archivematica is somehow including the eol before doing its md5 generation. This would explain why you have the mismatch in the first place.
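
The effect of a trailing eol on the checksum can be reproduced outside EPrints. The following is a minimal Python sketch using only the standard library's `hashlib`; it is not EPrints code, and the helper name `md5_hex` is made up for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


with_eol = b"hello world\n"   # 12 bytes, ends in 0x0a
without_eol = b"hello world"  # 11 bytes, no trailing newline

# Print the length and digest of each variant for comparison.
print(len(with_eol), md5_hex(with_eol))
print(len(without_eol), md5_hex(without_eol))
```

The two digests should match the two md5sums quoted above, in the same order. A single byte of difference produces an unrelated hash, so any tool that adds or strips the eol before hashing will report a mismatch.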

@photomedia
Author

photomedia commented Mar 10, 2022

Thank you, @drn05r. I don't think this is caused by VIM or another text editor. You had the idea to check filesize, and I think that's it! After adding some additional tracking info (fileid, docid, filesize, hash) to the output of the process_transfers log (eprintsug/EPrintsArchivematica@d229b8a), I was able to confirm that in this case, the filesize stored in EPrints for this file is incorrect: 11 bytes less than what ls -l shows on disk.
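
A check of this kind can be sketched in a few lines of Python. This is an illustrative sketch only, not the eprints-archivematica plugin's code: the function name `check_stored_metadata` and its parameters are hypothetical, and the stored size and hash are assumed to have been read from the repository database by some other means.

```python
import hashlib
import os


def check_stored_metadata(path, stored_size, stored_md5):
    """Compare a file on disk with its stored size and MD5.

    Returns a list of human-readable problems (empty if all matches).
    """
    problems = []

    # Size check first: it is cheap and catches truncated or stale records.
    actual_size = os.path.getsize(path)
    if actual_size != stored_size:
        problems.append(
            f"filesize mismatch: stored {stored_size}, on disk {actual_size}"
        )

    # Hash the raw bytes in chunks, without stripping or adding anything.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    actual_md5 = digest.hexdigest()
    if actual_md5 != stored_md5.lower():
        problems.append(
            f"md5 mismatch: stored {stored_md5}, on disk {actual_md5}"
        )

    return problems
```

Reporting the size alongside the hash helps distinguish a stale database record (size and hash both differ) from corruption that leaves the length unchanged (only the hash differs).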

drn05r added a commit that referenced this issue Mar 11, 2022
@drn05r
Contributor

drn05r commented Mar 11, 2022

The redo_hash option will only work for local files. It is unlikely this will ever need to be used for remote files, but a change would be needed to support that if it did.

@drn05r drn05r closed this as completed Mar 11, 2022
@photomedia
Author

Thank you. It worked! This is such a great improvement to the rehash functionality!
