Issue 103898 - Several non-breaker symbols are considered word breakers by hunspell backend
Status: CONFIRMED
Alias: None
Product: General
Classification: Code
Component: spell checking
Version: 3.3.0 or older (OOo)
Hardware: Unknown All
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-30 10:13 UTC by luctur
Modified: 2013-02-24 20:43 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Description luctur 2009-07-30 10:13:16 UTC
Whenever two or more words that are in an installed dictionary are joined by a 
"%" or "+25" symbol to form a compound word, the result is considered correct 
rather than flagged as a mistake.

Examples:
test%bug
test+25bug

As far as I know, no language in the world utilizes those symbols as word 
breakers.

This seems to be an issue with the Hunspell backend since other applications 
that use it (Mozilla Firefox, Thunderbird, Seamonkey, Opera beta) have the same 
problem.

However, it surely depends on the specific Hunspell version because, for example, 
Opera beta considers "test%bug" correct but not "test+25bug".

This issue is worth further investigation.
Comment 1 Olaf Felka 2009-07-30 10:24:55 UTC
@ sba: Please have a look
Comment 2 stefan.baltzer 2009-07-31 17:44:55 UTC
Confirmed. Set Component to lingucomponent.

But a small correction for the "+25" thingie: In Tools-Options-Language
Settings-Writing Aids -> In the "Options" list, you can check "Check word with
numbers". After that, "25bug" will be marked misspelled as expected (DEV300_m53,
EN_US spellcheck). 

SBA->TL: Is this related to the "Forward dashes to spellchecker" thing you are
working on? (issue 64400, issue 102815).
Comment 3 luctur 2009-07-31 19:03:50 UTC
OK, if I tick that option +25bug is seen as an error. So the issue is the word 
breaker "+".

From other users' tests, all these characters are considered word breakers by 
the spellchecker:

test!bug
test”bug
test£bug
test$bug
test%bug
test(bug
test)bug
test/bug
test\bug
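
Why these strings pass can be illustrated with a minimal Python sketch. It assumes that the text is split at every word-breaker character before each piece is looked up on its own. The breaker set, the toy dictionary and the function names are hypothetical and are not taken from Hunspell or OpenOffice.

```python
# Hypothetical breaker set built from characters reported in this issue.
BREAKERS = set("!£$%()/\\+")
# Toy stand-in for an installed dictionary.
DICTIONARY = {"test", "bug"}

def split_words(text, breakers=BREAKERS):
    """Split text at whitespace and at every character in `breakers`."""
    words, current = [], []
    for ch in text:
        if ch.isspace() or ch in breakers:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

def misspelled(text, breakers=BREAKERS):
    """Return the pieces of `text` that are not in the toy dictionary."""
    return [w for w in split_words(text, breakers) if w.lower() not in DICTIONARY]

print(misspelled("test%bug"))                  # [] : both halves are known words
print(misspelled("test+25bug"))                # ['25bug']
print(misspelled("test%bug", breakers=set()))  # ['test%bug']
```

With "%" in the breaker set the checker only ever sees "test" and "bug", so nothing is flagged. Once the symbol is no longer a breaker, the whole string is looked up and rejected.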
Comment 4 tommy27 2009-07-31 20:15:30 UTC
"/" is the only word breaker that is used in the languages i know.

all the others "£, %, &" are not word breakers in my opinion.
Comment 5 tommy27 2009-07-31 22:18:50 UTC
i would suggest modifying the title of the issue, since "%" is not the only 
symbol which causes problems and "+25" is not completely accurate; actually 
it's only the "+" that causes trouble
Comment 6 luctur 2009-08-01 08:43:14 UTC
"/" is used in URL too, so that is a more complicated issue, I suppose.
Comment 7 thomas.lange 2009-08-03 09:18:02 UTC
From my point of view only the characters '+', '&' and '%' need a closer
look to decide whether they should NOT be word breakers.
For examples as in
  AT&T (sometimes written as AT+T)
  5%ig
The '+' and '&' may be useful for all languages while '%' is probably always a
word breaker in English but not in German.

Thus everyone please comment what you think about those three characters.
Also, is there a language where it will hurt to have '+' and '&' as part of the
word?


For other characters like /,$,€,(,),@ ... it should not hurt to have them as word
breakers. At least I don't know any example where the left and right part 
of such text should not be a complete word in its own right.
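
The proposal above can be sketched as a breaker set that depends on the language. This is only an illustration in Python under stated assumptions: the language codes, the set contents and the function names are hypothetical, and the real decision about '+', '&' and '%' is still open in this issue.

```python
# Characters treated as word breakers by default (illustrative, not exhaustive).
BASE_BREAKERS = set("/$€()@+&%")
# Assumption: '+' and '&' become word characters for every language.
WORD_CHARS_ALL = set("+&")
# Assumption: '%' is a word character in German only (as in "5%ig").
WORD_CHARS_BY_LANG = {"de": set("%")}

def breakers_for(lang):
    """Return the breaker set for `lang` after removing its word characters."""
    return BASE_BREAKERS - WORD_CHARS_ALL - WORD_CHARS_BY_LANG.get(lang, set())

def split_words(text, lang):
    """Split text at whitespace and at the breakers that apply to `lang`."""
    breakers = breakers_for(lang)
    words, current = [], []
    for ch in text:
        if ch.isspace() or ch in breakers:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

print(split_words("AT&T", "en"))  # ['AT&T']
print(split_words("5%ig", "de"))  # ['5%ig']
print(split_words("5%ig", "en"))  # ['5', 'ig']
```

Under this scheme a dictionary would need entries such as "AT&T" or "5%ig" only for the languages where the symbol is treated as part of the word.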

Comment 8 luctur 2009-08-03 10:28:43 UTC
@tl: indeed, I'm not able to find an example where "€" and "$" can be *inside* a
legitimate word. If there's another common employment of those symbols other
than in currency figures, please tell me.

However, on my standard Italian QWERTY keyboard, all these symbols are dead keys
that can be displayed only by pressing key combinations of "Shift", "Control",
and "Alt", so it isn't easy to input them by mistake.

Indeed, I found this issue by deliberately typing the words quoted in my first
message while filing a bug report for the Seamonkey browser that, incredibly,
has an identical issue with strings passed to search engines.

Nevertheless, you should consider how easy or difficult it is to mistakenly type
those symbols on other international keyboards, according to their key
positions, before leaving the situation as it is now.
Comment 9 auberon 2009-08-03 10:37:18 UTC
What if we write "read+able gives readable"?
I’m not sure it would please users if they are forced to separate these signs
with spaces.

We should not care that "/" is used in URLs, as there is no chance we put URLs in
dictionaries.

--

In French, I think % and + are word breakers. I am not sure about &, but it
probably is one too.
Comment 10 thomas.lange 2009-08-03 10:50:51 UTC
tl->luctur: I don't think that how easy it is to make a mistake should be the measure
of which characters are considered word breakers.

Spellchecking works because of the combined effort of the spell check
implementation and the dictionaries provided by many dedicated users.
Thus we should not make either of those tasks overly complicated. Adding
new characters to the ones allowed within a word will likely require
many of the existing dictionaries to be revised in order to give the best result
in combination with the spell checker. Thus we should burden the dictionary
providers only with the really needed changes. 
From my point of view that means only characters that are part of actually
existing words/terms should be considered.
Comment 11 luctur 2009-08-03 11:18:42 UTC
@tl: then, if it's a question of whether the change is worthwhile (rather than
of the correctness of the compound words), all symbols listed in this issue
should be considered word breakers.

It's the least painful solution for dictionary makers (I know how hard that work
is; I was the maintainer of the Italian dictionary).

Furthermore, it's rather unlikely that this issue affects a large user base.
Comment 12 thomas.lange 2009-08-03 11:55:23 UTC
tl->luctur: well, compound words should of course be handled by the
dictionary/spell-checker. Thus any character that is used in compound words
should not be a word breaker. At least that's what I believe.