epadmin reindex not indexing all document words or empty words #320

Closed
drn05r opened this issue Apr 5, 2023 · 1 comment

drn05r commented Apr 5, 2023

Some documents do not seem to get all their words indexed, and sometimes empty words get indexed. Sometimes there are attempts to index an empty word more than once, which fails with the error message:

DBD::mysql::st execute failed: Duplicate entry 'documents--75' for key 'PRIMARY' at /opt/eprints3/bin/../perl_lib/EPrints/Database.pm line 1289.'
@drn05r drn05r added the bug Something isn't working label Apr 5, 2023
@drn05r drn05r added this to the 3.4.5 milestone Apr 5, 2023
@drn05r drn05r self-assigned this Apr 5, 2023

drn05r commented Apr 8, 2023

Reviewing the particular characters reported by the person who led me to create this issue, I think most if not all of them come from the range 0x1d400 to 0x1d7ff (the Unicode Mathematical Alphanumeric Symbols block). This includes alphabetical characters, Greek letters and digits in certain font styles. These are probably used in formulae that appear within research publications.
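
To illustrate why this range is a problem for a 3-byte utf8 column, here is a minimal Python sketch (EPrints itself is Perl; the function names `utf8_len` and `chars_needing_utf8mb4` are hypothetical and not part of EPrints). It shows that any code point above 0xFFFF, which includes everything in 0x1d400 to 0x1d7ff, takes four bytes in UTF-8.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def chars_needing_utf8mb4(text: str) -> list[str]:
    """Return the characters in text that need 4 bytes in UTF-8.

    Code points above 0xFFFF lie outside the Basic Multilingual Plane,
    so they cannot be stored in a character set limited to 3 bytes.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    sample = "x = \U0001D400 + \u03b1"  # contains U+1D400 and a plain Greek alpha
    for ch in chars_needing_utf8mb4(sample):
        print(f"U+{ord(ch):04X} needs {utf8_len(ch)} bytes")
```

A check like this could be run over the extracted text of a document to see which words are likely to be affected, assuming the text is available as a decoded string.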

Adding extra entries to $EPrints::Index::FREETEXT_CHAR_MAPPING should solve the problem. The problem may also be solved by changing EPrints' database to use a utf8mb4 character set. However, changing from utf8 to utf8mb4 for an existing table is non-trivial.
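
As a rough sketch of what such extra mapping entries could look like, the Python below builds a table that folds each character in 0x1d400 to 0x1d7ff to its plain equivalent using Unicode NFKC normalisation. This is an assumption for illustration only: the real `$EPrints::Index::FREETEXT_CHAR_MAPPING` is a Perl structure whose layout is not shown in this issue, and `build_math_mapping` and `fold_text` are hypothetical names.

```python
import unicodedata


def build_math_mapping(start: int = 0x1D400, end: int = 0x1D7FF) -> dict[str, str]:
    """Map each styled mathematical character to its plain equivalent.

    Uses NFKC compatibility normalisation; code points that are unassigned
    or have no compatibility mapping are left out of the table.
    """
    mapping = {}
    for code_point in range(start, end + 1):
        ch = chr(code_point)
        folded = unicodedata.normalize("NFKC", ch)
        if folded != ch:
            mapping[ch] = folded
    return mapping


def fold_text(text: str, mapping: dict[str, str]) -> str:
    """Replace every mapped character in text, leaving others unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


if __name__ == "__main__":
    table = build_math_mapping()
    # U+1D41A and U+1D41B are MATHEMATICAL BOLD SMALL A and B.
    print(fold_text("\U0001D41A\U0001D41B", table))
```

Folding before indexing means the indexed words stay within the 3-byte range, at the cost of losing the font-style distinction, which is unlikely to matter for free-text search.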

It is likely that publications will continue to use more characters that fall outside the 3-byte utf8 range, so support for utf8mb4 in EPrints should be given greater importance, probably as a feature of 3.5.

@drn05r drn05r closed this as completed in 49865e5 Apr 8, 2023