Apache OpenOffice (AOO) Bugzilla – Issue 103898
Several non-breaker symbols are considered word breakers by hunspell backend
Last modified: 2013-02-24 20:43:13 UTC
Whenever two or more words included in an installed dictionary are joined by the "%" or "+25" symbols to form a compound word, the result is considered correct rather than a mistake. Examples: test%bug test+25bug As far as I know, no language in the world uses those symbols as word breakers. This seems to be an issue with the Hunspell backend, since other applications that use it (Mozilla Firefox, Thunderbird, Seamonkey, Opera beta) have the same problem. However, it surely depends on the specific Hunspell version because, for example, Opera beta considers "test%bug" correct but not "test+25bug". This issue is worth further investigation.
@ sba: Please have a look
Confirmed. Set Component to lingucomponent. But a small correction for the "+25" thingie: In Tools-Options-Language Settings-Writing Aids -> In the "Options" list, you can check "Check word with numbers". After that, "25bug" will be marked misspelled as expected (DEV300_m53, EN_US spellcheck). SBA->TL: Is this related to the "Forward dashes to spellchecker" thing you are working on? (issue 64400, issue 102815).
OK, if I tick that option +25bug is seen as an error. So the issue is the word breaker "+". From other users' tests, all these characters are considered word breakers by the spellchecker: test!bug test”bug test£bug test$bug test%bug test(bug test)bug test/bug test\bug
"/" is the only word breaker that is used in the languages I know. All the others ("£", "%", "&") are not word breakers in my opinion.
I would suggest modifying the title of the issue, since "%" is not the only symbol that causes problems and "+25" is not completely accurate: it is actually only the "+" that causes trouble.
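To make the reported behaviour concrete, here is a minimal sketch of how splitting on these symbols lets a string like test%bug pass the check. It is not Hunspell's or OpenOffice's actual code: the names BREAKERS, DICTIONARY, split_words and is_accepted are hypothetical, and the breaker set is simply the characters mentioned in this issue.

```python
# Toy model only: NOT the real Hunspell/OpenOffice implementation.
# BREAKERS is an assumed set built from the characters reported in this issue.
BREAKERS = set('!"”£$%&()/\\+')
DICTIONARY = {"test", "bug"}  # stand-in for an installed dictionary


def split_words(text, breakers=BREAKERS):
    """Split text into words at whitespace and at every breaker character."""
    words, current = [], []
    for ch in text:
        if ch in breakers or ch.isspace():
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def is_accepted(text):
    """A string passes if every piece left after splitting is in the dictionary."""
    return all(word.lower() in DICTIONARY for word in split_words(text))
```

Under these assumptions, is_accepted("test%bug") is true because the checker only ever sees "test" and "bug" separately, while is_accepted("testbug") is false because no breaker splits it.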
"/" is used in URL too, so that is a more complicated issue, I suppose.
From my point of view, only the characters '+', '&' and '%' need a closer look to decide whether they should NOT be word breakers. For example: AT&T (sometimes written as AT+T) and 5%ig. The '+' and '&' may be useful for all languages, while '%' is probably always a word breaker in English but not in German. So, everyone, please comment on what you think about those three characters. Also, is there a language where it would hurt to have '+' and '&' as part of a word? For other characters like /, $, €, (, ), @ ... it should not hurt to have them as word breakers. At least I don't know of any example where the left and right parts of such text should not each be a complete word in its own right.
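One way to picture this proposal is a per-language set of symbols that stay inside a word. The sketch below is only an illustration under assumed names (WORD_CHARS, CANDIDATE_BREAKERS and tokenize are hypothetical, and the language assignments merely mirror the suggestion above); it does not describe how the spellchecker is actually configured.

```python
# Illustration only: hypothetical names, not an existing OpenOffice/Hunspell API.
# Symbols that would normally break a word (taken from this discussion).
CANDIDATE_BREAKERS = set('!"£$%&()/\\+@€')

# Assumed per-language exceptions: symbols kept as part of a word.
WORD_CHARS = {
    "en": {"&", "+"},        # AT&T, AT+T
    "de": {"&", "+", "%"},   # also 5%ig
}


def tokenize(text, lang):
    """Split text at whitespace and at breakers not exempted for this language."""
    breakers = CANDIDATE_BREAKERS - WORD_CHARS.get(lang, set())
    tokens, current = [], ""
    for ch in text:
        if ch.isspace() or ch in breakers:
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens
```

With these assumed sets, tokenize("AT&T", "en") keeps the term whole, so the dictionary would have to contain "AT&T" itself, whereas tokenize("5%ig", "en") still splits at '%' and tokenize("5%ig", "de") does not.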
@tl: indeed, I'm not able to find an example where "€" and "$" can be *inside* a legitimate word. If there's another common use of those symbols other than in currency figures, please tell me. However, on my standard Italian QWERTY keyboard, all these symbols are dead keys that can be produced only by pressing key combinations of "Shift", "Control", and "Alt", so it isn't easy to input them by mistake. Indeed, I found this issue by deliberately typing the words quoted in my first message while filing a bug report for the Seamonkey browser, which, incredibly, has an identical issue with strings passed to search engines. Nevertheless, you should consider how easy or difficult it is to mistakenly type those symbols on other international keyboards, according to their key positions, before leaving the situation as it is now.
What if we write "read+able gives readable"? I’m not sure it would please users if they are forced to separate these signs with spaces. We should not care that "/" appears in URLs, as there is no chance we will put URLs in dictionaries. -- In French, I think % and + are word breakers. I am not sure about &, but probably it is too.
tl->luctur: I don't think how easy it is to make a mistake should be the measure of which characters are considered word breakers. Spellchecking works because of the combined effort of the spell check implementation and the dictionaries provided by many dedicated users. Thus we should not make either of those tasks overly complicated. Adding new characters to the ones allowed within a word will likely require many of the existing dictionaries to be revised in order to give the best result in combination with the spell checker. Thus we should burden the dictionary providers only with the changes that are really needed. From my point of view, that means only characters that are part of actually existing words/terms should be considered.
@tl: then, if it's a question of the expediency of the change (rather than the correctness of the compound words), all symbols listed in this issue should be considered word breakers. It's the least painful solution for dictionary makers (I know how hard that work is; I was the maintainer of the Italian dictionary). Furthermore, it's rather unlikely that this issue affects a large user base.
tl->luctur: well, compound words should of course be handled by the dictionary/spell-checker. Thus any character that is used in compound words should not be a word breaker. At least that's what I believe.