Apache OpenOffice (AOO) Bugzilla – Issue 103402
need to skip diacritics in Hebrew spellchecking
Last modified: 2013-03-12 11:44:15 UTC
Hebrew is usually written without diacritics. However, sometimes the diacritics are written as special marks located within, above, or below consonants. The diacritics are represented internally as separate Unicode characters. Hebrew dictionaries check for words without diacritics and will continue to do so for the foreseeable future.

This patch filters the diacritics out of a word before spellchecking it. (Using breakiterator is not appropriate, since we don't want word-breaking at the diacritics.)

I don't know whether this functionality is needed for other languages as well, perhaps Arabic or Persian, or maybe some LTR languages. The patch is written in a generalized way, so that adding a language is fairly easy:
1) add another "case LANGUAGE_WHATEVER" to the "switch (nLanguage)" statement, and create a string with the diacritics to be skipped
2) add "|| nLanguage == LANGUAGE_WHATEVER" to the assignments of the boolean variables
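For readers without the attachment at hand, the filtering step described above can be sketched roughly as follows. This is not the code from the patch: the `Lang` enum, `diacriticsFor` and `stripDiacritics` are hypothetical names, and the Hebrew set used here (the points U+05B0 to U+05BC plus U+05C1, U+05C2 and U+05C7) is an assumption about which marks should be skipped.

```cpp
#include <string>

// Hypothetical stand-in for the nLanguage / LANGUAGE_* constants used by
// the real patch; their actual values are not reproduced here.
enum class Lang { Hebrew, Other };

// Returns the characters to skip for a language, or an empty string if
// nothing should be filtered. Adding a language means adding a case here.
// ASSUMPTION: this Hebrew list is illustrative, not taken from the patch.
std::u32string diacriticsFor(Lang lang) {
    switch (lang) {
        case Lang::Hebrew:
            return U"\u05B0\u05B1\u05B2\u05B3\u05B4\u05B5\u05B6\u05B7"
                   U"\u05B8\u05B9\u05BA\u05BB\u05BC\u05C1\u05C2\u05C7";
        default:
            return U"";
    }
}

// Copies the word, dropping every character found in the skip list.
// The word is never split, so no word-breaking happens at the diacritics.
std::u32string stripDiacritics(const std::u32string& word, Lang lang) {
    const std::u32string skip = diacriticsFor(lang);
    if (skip.empty()) {
        return word;  // nothing to filter for this language
    }
    std::u32string out;
    out.reserve(word.size());
    for (char32_t c : word) {
        if (skip.find(c) == std::u32string::npos) {
            out.push_back(c);
        }
    }
    return out;
}
```

In this sketch, only the stripped copy would be handed to the dictionary lookup; the original word in the document is left untouched.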
Created attachment 63421: proposed patch
Created attachment 63425: revised - changed a < to <=
As I understand it, Sanskrit-based languages such as Hindi also employ diacritics.
#99796 has a very similar problem; I think the two should be fixed together (probably with the same code). Note that this is not the same problem, just a similar one.
Taking ownership as well.

tl->ayaniger: If you provide patches for the linguistic, please assign them directly to me. If by bad luck I do not see them on the issues ML and nobody else assigns them to me, they will just loiter around, probably until someone makes a new comment and they appear on the ML once more.

tl->nemeth: Wouldn't it be possible to take care of this in the spell check dictionary or in hunspell itself? I'm asking because removing the diacritics in the SpellCheckerDispatcher will have the following two side effects:
a) The replacement word will probably not have diacritics either, which may look somewhat odd if all the surrounding text uses them.
b) If there were ever another spell checker implementation for Hebrew that could work properly with diacritics and provide them in replacements as well, the patch would effectively suppress that feature.

Thus I'm a little hesitant until told that this patch actually has to be the solution.
#51772 also has a very similar problem; I think the two should be fixed together (probably with the same code). Note that this is not the same problem, just a similar one.
Set target to 3.x; not relevant for the 3.4 release.
I'm adding this comment to all open issues with Issue Type == PATCH. We have 220 such issues, many of them quite old. I apologize for that. We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0. If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know. On the other hand, if the patch is no longer relevant, please let us know that as well. If you have any general questions or want to discuss this further, please send a note to our dev mailing list: dev@openoffice.apache.org Thanks! -Rob
The patch is Hebrew-specific; I think it should be more general.