I see two different questions being posed:
a) The correctness of using an ndash within a word.
b) The ability to search for words containing ndash or any kind of dash, 
including a simple hyphen.

I'll start with my conclusion: Changing the ndash to a simple hyphen does not 
really address the questions.

Regarding correctness:
The usage of ndash in the KJV is within names only. At the bottom, I've 
included a list of the names having an ndash. In the 2003 version of the 1769 
KJV, these words were not hyphenated. They were hyphenated with an ndash in the 
2006 cleanup. As an interesting aside, I looked at some of the non-name words 
that are hyphenated in the 1769 KJV and compared them to a photocopy of the 
1611. These are words such as God-ward, us-ward, thee-ward, joint-heirs, .... My 
search was not exhaustive, but the 1611 didn't hyphenate them: it either 
concatenated the words, as with the -ward suffixes, or separated them with a 
space, as in joint heirs. The other thing I noticed was that in each case where the KJV (either 
1769 or 1611) had a hyphenated name, it was a Hebrew transliteration of some 
sort and had an attached note to at least one of the instances.

One question is whether they should be taken as a whole or as parts. So, is 
Beth–el equivalent to Beth el or to Bethel? Another question: does a dash 
(hyphen, ndash, mdash, ...) have the same meaning today as it did hundreds of 
years ago? Same question but regarding different languages: Do different 
languages use a dash with different semantics than modern English?

Regarding search:
This raises several questions:
How does Lucene handle these different characters?
What does an end user want/expect?
Can we leverage that to meet user expectation?

Lucene's handling:
Lucene uses an Analyzer to split text into words on punctuation for indexing 
and for search. JSword uses SimpleAnalyzer because it makes no further 
assumptions about the text. The SWORD library uses StandardAnalyzer, which 
does. I think 
the StandardAnalyzer has special rules for hyphens. In Lucene 3.6 the 
StandardAnalyzer behavior changes to use UAX 29 rules for splitting the text. 
This is a huge step forward. I don't know whether it handles '-' differently 
than other punctuation. (JSword switched from the StandardAnalyzer to the 
SimpleAnalyzer very early on because of the extra assumptions that 
StandardAnalyzer makes about what the user wants to index and not index and 
because it was significantly slower.)

With the SimpleAnalyzer, any dash (hyphen, ndash, mdash) splits the text into 
separate words, so a hyphenated name is searched as a phrase. As such Beth–el, 
Beth-el and "Beth el" are equivalent. (This is with Lucene 3.0.3; earlier 
versions may differ.) Note, it really doesn't matter that it's a dash; any 
punctuation will do. I don't think this is the case with the StandardAnalyzer.
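
To illustrate the splitting described above, here is a minimal Python sketch. 
It is not Lucene code: simple_tokens is a hypothetical helper that only 
approximates what I understand SimpleAnalyzer to do (keep runs of letters, 
then lower-case them).

```python
from itertools import groupby


def simple_tokens(text):
    """Approximate SimpleAnalyzer: keep runs of letters, lower-cased.

    Hypothetical helper for illustration; not part of Lucene or JSword.
    """
    return [
        "".join(chars).lower()
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter
    ]


if __name__ == "__main__":
    # En dash (U+2013), ASCII hyphen and space all act as separators.
    for name in ("Beth\u2013el", "Beth-el", "Beth el", "Bethel"):
        print(repr(name), simple_tokens(name))
```

The first three forms produce the same two tokens, while Bethel stays a single 
token, which is why the unhyphenated spelling does not match the hyphenated one.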

One of the impacts of having hyphenated words is that searching for Bethlehem 
won't find Beth–lehem. (The NT and OT differ on the spelling in the KJV.) It 
doesn't matter what kind of dash is used. The user cannot omit the hyphen to 
concatenate the words.

Another impact of hyphenated words is that it is much harder to do a wild card 
search. It doesn't matter what kind of dash is used. If the search request has 
a dash, a * cannot be used.

So Lucene can do the right thing wrt the ndash and hyphen. They are identical 
wrt indexing and searching. The user does not have to know the form that is 
used in the file and match that.

The other feature that Lucene offers out of the box is Fuzzy Searching. It will 
find close approximations to the word that you are requesting. All that needs 
to be done is append a ~ to the end of the word. For example, Abimelek~ finds 
Abimael, Abimelech, Abiezer and Ahimelech. This is not a Soundex search, so the 
results are often surprising. Bethelham~ finds Meshullam and Bethlehem~ finds 
betrothed but not Bethlehem.
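
Lucene's fuzzy search is based on edit (Levenshtein) distance rather than on 
how a word sounds. A small Python sketch of that distance, as an illustration 
only: edit_distance is a hypothetical helper, and how Lucene turns the distance 
into a score and a cutoff is not shown here.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    query = "abimelek"
    for candidate in ("abimelech", "ahimelech", "abimael", "abiezer"):
        print(candidate, edit_distance(query, candidate))
```

Because only character edits are counted, names that look alike on paper score 
well even when they would never be confused when spoken.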

Some front-ends don't use Lucene for indexing. Some use an older version. So 
the behavior can differ.
Also, SWORD doesn't require indexing for "slow" search. Don't know if the SWORD 
"slow" search treats the various dashes the same or differently. (I think this 
is the Multi-word search mentioned by David)

User expectation:
The hyphenation of these names is not common in other translations. I think 
that most users would expect Bethel and not Beth–el or Beth-el. Together this 
makes searching multiple Bibles at the same time very difficult.

I think it is reasonable to expect that a user will not know the proper 
spelling of more than a few of them, let alone that they are hyphenated. 

Leveraging:
I think that if StandardAnalyzer does not give expected behavior then 
SimpleAnalyzer should be used.

I think that hyphenated words should also be indexed as unhyphenated.
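
A rough Python sketch of what indexing both forms could mean. The names 
(DASH, index_terms) are made up for illustration, and the set of dash 
characters is an assumption. In Lucene this would presumably be done in a 
custom token filter, which is not shown.

```python
import re

# ASCII hyphen plus U+2010..U+2015 (hyphen, non-breaking hyphen,
# figure dash, en dash, em dash, horizontal bar). Assumed set.
DASH = re.compile("[-\u2010-\u2015]")


def index_terms(word):
    """Return the terms to index for one word.

    A hyphenated word yields its parts plus the joined form, so that
    Beth-lehem can be found as beth, lehem and bethlehem.
    """
    parts = [part.lower() for part in DASH.split(word) if part]
    if len(parts) > 1:
        return parts + ["".join(parts)]
    return parts


if __name__ == "__main__":
    for word in ("Beth\u2013lehem", "Beth-lehem", "Bethlehem"):
        print(repr(word), index_terms(word))
```

With the joined form in the index, a search for Bethlehem would match both the 
OT and the NT spelling.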

Adding a simple filter to change different forms of dashes into a single form 
for both search and index is a good solution, but it would break backward 
compatibility with existing indexes. Changing from StandardAnalyzer to 
SimpleAnalyzer would be as much of a pain, but it is a better solution (at 
least until 3.6, which I have not evaluated to see if it changes the behavior 
sufficiently).
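
For reference, the kind of dash filter discussed here could look roughly like 
this in Python. It is a sketch only: normalize_dashes and DASH_MAP are 
hypothetical names, and the list of characters is an assumption that would need 
checking against the actual modules. The same function would have to be applied 
to the text at index time and to the user's query.

```python
# Map hyphen-like characters to the ASCII hyphen-minus.
# U+2010 hyphen, U+2011 non-breaking hyphen, U+2012 figure dash,
# U+2013 en dash, U+2014 em dash, U+2015 horizontal bar, U+2212 minus.
DASH_MAP = {
    ord(ch): "-"
    for ch in "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"
}


def normalize_dashes(text):
    """Replace every dash variant in DASH_MAP with a plain hyphen."""
    return text.translate(DASH_MAP)


if __name__ == "__main__":
    indexed = normalize_dashes("Maher\u2013shalal\u2013hash\u2013baz")
    query = normalize_dashes("Maher-shalal-hash-baz")
    print(indexed == query)
```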

Conclusion: Changing the ndash to a simple hyphen does not really address the 
problems.

In Him,
        DM

Abed–nego
Abel–beth–maachah
Abel–maim
Abel–meholah
Abel–mizraim
Abel–shittim
Abi–albon
Abi–ezer
Abi–ezrite
Adoni–bezek
Adoni–zedek
Allon–bachuth
Almon–diblathaim
Ashdoth–pisgah
Ataroth–adar
Ataroth–addar
Aznoth–tabor
Baalath–beer
Baal–berith
Baal–gad
Baal–hamon
Baal–hanan
Baal–hazor
Baal–hermon
Baal–meon
Baal–peor
Baal–perazim
Baal–shalisha
Baal–tamar
Baal–zebub
Baal–zephon
Bamoth–baal
Bashan–havoth–jair
Bath–rabbim
Bath–sheba
Bath–shua
Beer–elim
Beer–lahai–roi
Beer–sheba
Beesh–terah
Ben–ammi
Bene–berak
Bene–jaakan
Ben–hadad
Ben–hail
Ben–hanan
Ben–oni
Ben–zoheth
Berodach–baladan
Beth–anath
Beth–anoth
Beth–arabah
Beth–aram
Beth–arbel
Beth–aven
Beth–azmaveth
Beth–baal–meon
Beth–barah
Beth–birei
Beth–car
Beth–dagon
Beth–diblathaim
Beth–el
Beth–emek
Beth–ezel
Beth–gader
Beth–gamul
Beth–haccerem
Beth–haran
Beth–hoglah
Beth–hogla
Beth–horon
Beth–jeshimoth
Beth–jesimoth
Beth–lebaoth
Beth–lehem–judah
Beth–lehem
Beth–maachah
Beth–marcaboth
Beth–meon
Beth–nimrah
Beth–palet
Beth–pazzez
Beth–peor
Beth–phelet
Beth–rapha
Beth–rehob
Beth–shan
Beth–shean
Beth–shemesh
Beth–shemite
Beth–shittah
Beth–tappuah
Beth–zur
Caleb–ephratah
Chephar–haammonai
Chisloth–tabor
Chor–ashan
Chushan–rishathaim
Col–hozeh
Dan–jaan
Dibon–gad
Ebed–melech
Eben–ezer
El–beth–el
El–elohe–Israel
Elon–beth–hanan
El–paran
En–eglaim
En–gannim
En–gedi
En–haddah
En–hakkore
En–hazor
En–mishpat
En–rimmon
En–rogel
En–shemesh
En–tappuah
Ephes–dammim
Esar–haddon
Esh–baal
Evil–merodach
Ezion–gaber
Ezion–geber
Gath–hepher
Gath–rimmon
Gibeah–haaraloth
Gittah–hepher
Gur–baal
Hamath–zobah
Hammoth–dor
Hamon–gog
Havoth–jair
Hazar–addar
Hazar–enan
Hazar–gaddah
Hazar–hatticon
Hazar–maveth
Hazar–shual
Hazar–susah
Hazar–susim
Hazazon–tamar
Hazezon–tamar
Helkath–hazzurim
Hephzi–bah
Hor–hagidgad
I–chabod
Ije–abarim
Ir–nahash
Ir–shemesh
Ishbi–benob
Ish–bosheth
Ish–tob
Ittah–kazin
Jaare–oregim
Jabesh–gilead
Jashubi–lehem
Jegar–sahadutha
Jehovah–jireh
Jehovah–nissi
Jehovah–shalom
Jiphthah–el
Jushab–hesed
Kadesh–barnea
Kedesh–naphtali
Keren–happuch
Kibroth–hattaavah
Kir–haraseth
Kir–hareseth
Kir–haresh
Kir–heres
Kirjath–arba
Kirjath–arim
Kirjath–baal
Kirjath–huzoth
Kirjath–jearim
Kirjath–sannah
Kirjath–sepher
Lahai–roi
Lo–ammi
Lo–debar
Lo–ruhamah
Maaleh–acrabbim
Magor–missabib
Mahaneh–dan
Maher–shalal–hash–baz
Malchi–shua
Me–jarkon
Melchi–shua
Meribah–Kadesh
Merib–baal
Merodach–baladan
Metheg–ammah
Migdal–el
Migdal–gad
Misrephoth–maim
Moresheth–gath
Nathan–melech
Nebuzar–adan
Nergal–sharezer
Obed–edom
Padan–aram
Pahath–moab
Pas–dammim
Perez–uzzah
Perez–uzza
Pharaoh–hophra
Pharaoh–nechoh
Pharaoh–necho
Pi–beseth
Pi–hahiroth
Poti–pherah
Rab–saris
Rab–shakeh
Ramathaim–zophim
Ramath–lehi
Ramath–mizpeh
Ramoth–gilead
Regem–melech
Remmon–methoar
Rimmon–parez
Romamti–ezer
Ru–hamah
Samgar–nebo
Sela–hammahlekoth
Shear–jashub
Shethar–boznai
Shihor–libnath
Shimron–meron
Succoth–benoth
Syria–damascus
Syria–maachah
Taanath–shiloh
Tahtim–hodshi
Tel–abib
Tel–haresha
Tel–harsa
Tel–melah
Tiglath–pileser
Tilgath–pilneser
Timnath–heres
Timnath–serah
Tob–adonijah
Tubal–cain
Uzzen–sherah
Zareth–shahar
Zaphnath–paaneah


On Mar 2, 2013, at 6:01 AM, Chris Burrell <ch...@burrell.me.uk> wrote:

> Can't this be done with a simple filter, i.e. always change the '-' to one 
> kind regardless of the length. And when the user input comes in, do the same.
> Chris
> 
> 
> On 2 March 2013 02:36, Nic Carter <niccar...@mac.com> wrote:
> 
> Do you have a proposed solution to this, David?
> 
> I know that on my iPhone it is very simple to use a proper ndash & so I will 
> always use the correct type of dash according to what I am writing. (same 
> with on a Mac!)
> However, the more significant issue is simply that people don't know there is 
> a difference (or why they are different lengths, etc)...  ;)
> 
> On 25/02/2013, at 2:48 AM, David Haslam <dfh...@googlemail.com> wrote:
> 
> > In the KJV module, if you want to search for [say] the hyphenated name
> > "Maher–shalal–hash–baz", you first have to be aware that this module uses
> > the ndash in place of the hyphen.
> >
> > btw.  It's not so easy to enter the ndash from a keyboard, and probably even
> > harder in an Android tablet or mobile.
> >
> > If you use ordinary hyphen/minus for the search key hyphen for this module,
> > you don't find anything with "Exact phrase".
> > If you use "Multi-word", you do find "Maher" highlighted in the found verse.
> > (e.g. using Xiphos).
> >
> > For modules in general, however, the user cannot usually know in advance
> > whether hyphenated words use the ndash, the hyphen or something else.
> >
> > Has anyone else looked into this aspect of the search feature?
> >
> > David
> >
> >
> >
> >
> >
> > --
> > View this message in context: 
> > http://sword-dev.350566.n4.nabble.com/Searching-for-hyphenated-words-tp4652016.html
> > Sent from the SWORD Dev mailing list archive at Nabble.com.
> >
> > _______________________________________________
> > sword-devel mailing list: sword-devel@crosswire.org
> > http://www.crosswire.org/mailman/listinfo/sword-devel
> > Instructions to unsubscribe/change your settings at above page
> 
> 
