Re: Can SOLR Index UTF-16 Text

2012-10-03 Thread vybe3142
Thanks for all the responses. Problem partially solved (see below)

1. In a sense, my question is theoretical, since the input to our Solr server
is (currently) UTF-8 files produced by a third-party text extraction utility
(not Tika). On the server side, we read and index the text via a custom data
handler. Last week, I tried a UTF-16 file to see what would happen, and it
wasn't handled correctly, as explained in my original question.

2. The file is UTF-16


3. We can either (a) stream the data to Solr in the call, or (b) use the
stream.file parameter to provide the file path to the Solr handler.

Assuming case (a)

Here's how the SolrJ request is constructed (code edited for conciseness):



If I replace the last line with

things work 

What would I need to do in case (b), where the raw file is loaded
remotely, i.e. my handler reads the file directly?



In this case, how can I control what the content type is?
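For case (b), Solr's remote-streaming support also understands a stream.contentType parameter alongside stream.file, which is one way to declare the charset without touching the handler. A rough sketch of building such a request URL (the host, core, and handler path here are assumptions, not necessarily your actual setup):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class StreamFileUrl {

    // Build a remote-streaming update URL: stream.file tells Solr to read the
    // file from its own disk, and stream.contentType declares the MIME type
    // and charset so the body can be decoded correctly.
    public static String buildUpdateUrl(String solrBase, String filePath, String contentType) {
        try {
            return solrBase + "/update/extract"
                    + "?stream.file=" + URLEncoder.encode(filePath, "UTF-8")
                    + "&stream.contentType=" + URLEncoder.encode(contentType, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUpdateUrl("http://localhost:8983/solr",
                "/data/tesla-utf16.txt", "text/plain; charset=utf-16"));
    }
}
```

Note that remote streaming must be switched on (enableRemoteStreaming="true" on the requestParsers element in solrconfig.xml) before stream.file is honored.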

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-SOLR-Index-UTF-16-Text-tp4010834p4011634.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Can SOLR Index UTF-16 Text

2012-10-03 Thread Fuad Efendi
Something is missing from the body of your email... As I pointed out in my
previous message, in general Solr can index _everything_ (provided that
you have a Tokenizer for it); but in addition to _indexing_ you need an
HTTP-based _search_ which must understand UTF-16 (for instance).

Easiest solution is to convert the files to UTF-8 before indexing and to use
UTF-8 as the default Java character encoding ( java -Dfile.encoding=UTF-8
...; including even the Tomcat HTTP settings). This is really the simplest...
and the fastest performance-wise... and you should be able to use the
Highlighter feature, etc.
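If converting up front is acceptable, re-encoding a UTF-16 file as UTF-8 is only a few lines of plain JDK code (a sketch; the class name and paths are made up):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ToUtf8 {

    // Decode the whole file as UTF-16 (the decoder honours a leading BOM and
    // falls back to big-endian when none is present), then write it out as UTF-8.
    public static void convert(Path in, Path out) throws IOException {
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_16);
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```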


-Fuad Efendi
http://www.tokenizer.ca









RE: Can SOLR Index UTF-16 Text

2012-10-03 Thread Fuad Efendi
Hi, my previous message was partially wrong:


Please note that ANY IMAGINABLE SOLUTION will use encoding/decoding; the
real question is where it should happen:
A. The (Solr) Java container is responsible for UTF-16 -> Java String
B. The client converts the UTF-16 data to UTF-8 before submitting it to the
(Solr) Java container

And the correct answer is A, because Java internally stores everything in
UTF-16. So the overhead of (Document) UTF-16 -> (Java) UTF-16 is absolutely
minimal (and performance is the best possible, although file sizes could be
higher...)

You need to start SOLR (Tomcat Java) with the parameter 

java -Dfile.encoding=UTF-16

http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html


And, possibly, configure the HTTP Connector of Tomcat to UTF-16:
<Connector port="8080" URIEncoding="UTF-16"/>

(and use proper encoding HTTP Request Headers when you POST your file to
Solr)



-Fuad Efendi
http://www.tokenizer.ca










Re: Can SOLR Index UTF-16 Text

2012-10-02 Thread Lance Norskog
If it is a simple text file, does that text file start with the UTF-16 BOM 
marker?
http://unicode.org/faq/utf_bom.html

Also, do UTF-8 files work? If not, then your setup has a basic encoding problem.
And, when you post such a text file (for example, with curl), use the UTF-16 
charset mime-type: I think it is text/plain; charset=utf-16.
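Checking for the BOM from code is straightforward if you have the first few bytes of the file; a small sketch (class and method names are mine, and it ignores the UTF-32 BOMs):

```java
public class BomSniffer {

    // Return the charset name implied by a leading byte-order mark,
    // or null when the head bytes carry no recognizable BOM.
    public static String detectBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        // FF FE followed by "A" in little-endian UTF-16
        System.out.println(detectBom(new byte[]{(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}));
    }
}
```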




RE: Can SOLR Index UTF-16 Text

2012-10-02 Thread Fuad Efendi
Solr can index byte arrays too: unigram, bigram, trigram... even bitsets, 
tritsets, qatrisets ;- ) 
LOL, I've got a strong cold... 
BTW, don't forget to configure UTF-8 as your default (Java) container 
encoding...
-Fuad






Re: Can SOLR Index UTF-16 Text

2012-09-28 Thread Shawn Heisey

On 9/27/2012 2:55 PM, vybe3142 wrote:

Our SOLR setup  (4.0.BETA on Tomcat 6) works as expected when indexing UTF-8
files. Recently, however, we noticed that it has issues with indexing
certain text files eg. UTF-16 files.


I'd wait for a yes/no vote on this from one of the actual experts on 
this mailing list, rather than just taking my word for it.  Here is my guess 
based on what I know:


Solr uses and expects UTF-8. If the program you are using to index the 
files (which you didn't specify) is capable of working in more than one 
character set, you should be able to make it work.  In order to do so, 
it must be aware that it is reading UTF-16 on the input and translate it 
(either implicitly or explicitly) into UTF-8 when it sends the data to 
Solr.  Your results suggest that the program is assuming UTF-8 on the 
input, perhaps because it can't detect the encoding on its own with text 
files, so if it is capable of multiple character sets, you may have to 
tell it what it's reading.


I have no idea if the typical way of reading text/word/pdf/other 
documents (which I think is SolrCell / Tika) can do this, as I have 
never used it.  The data for my Solr index comes from MySQL, which is 
working entirely in UTF-8.


Thanks,
Shawn



Re: Can SOLR Index UTF-16 Text

2012-09-28 Thread Chris Hostetter

: Our SOLR setup  (4.0.BETA on Tomcat 6) works as expected when indexing UTF-8
: files. Recently, however, we noticed that it has issues with indexing
: certain text files eg. UTF-16 files.  See attachment for an example
: (tarred+zipped)
: 
: tesla-utf16.txt
: http://lucene.472066.n3.nabble.com/file/n4010834/tesla-utf16.txt  

No attachment came through to the list, and the URL nabble seems to have 
provided when you posted your message leads to a 404.

In general, the question of "is indexing a UTF-16 file supported" largely 
depends on *how* you are indexing this file -- if it's plain text, are you 
parsing it yourself using some client code and then sending it to Solr? 
Are you using DIH to read it from disk? Are you using 
ExtractingRequestHandler?

Those are all very different ways to index data in Solr -- and what you 
are doing determines how/where the encoding of that file is 
processed.


-Hoss