Re: Problem with directories in hebrew and jcifs

Yossi Nachum Mon, 20 May 2013 22:14:50 -0700

Thanks again for your help.
I used the metadata option that you suggested and it works fine. Now people
know where the documents that they find in Solr are located. The URL still
does not work when I click on it, but at least we know where the docs are.
Sorry about the late response; I was on vacation.
On May 6, 2013 3:36 PM, "Karl Wright" <[email protected]> wrote:


> Hi Yossi,
>
> I looked into this further over the weekend, to try and recall some of the
> thinking that went into how our file IRI's are constructed.
>
> (1) There is a constraint, which comes from certain output connectors, and
> which may no longer be valid, that all file IRI's must be legal URI's.  If
> that is still true, it REQUIRES us to %-encode non-ASCII characters.  The
> standard URI encoding is UTF-8, which is why we use that encoding.
>
> (2) For other characters that cannot be legally put in a URI, such as "+"
> and " " and "#", browsers I have access to behave as follows:
>
> <?xml version="1.0" encoding="utf-8"?>
> <html>
> <body>
>     <a href="file:///c:/test/test.html">click here for test</a><br/>
>     <a href="file:///c:/test/hi#there.html">click here for
> hi#there</a>(works on IE 8, but not Firefox - base form on IE8)<br/>
>     <a href="file:///c:/test/hi%23there.html">click here for
> hi%23there</a>(works on both, base form on Firefox)<br/>
>     <a href="file:///c:/test/hi<there.html">click here for
> hi&lt;there</a>(works on both but represents a file that can't be
> loaded)<br/>
>     <a href="file:///c:/test/hi%3cthere.html">click here for
> hi%3cthere</a>(works on both but represents a file that can't be loaded -
> base form on both)<br/>
>     <a href="file:///c:/test/hi there.html">click here for hi
> there</a>(works on both, base form on both)<br/>
>     <a href="file:///c:/test/hi%20there.html">click here for
> hi%20there</a>(works on both)<br/>
> </body>
> </html>
>
> As you can see, there's some common ground, but always the common ground
> requires more encoding rather than less.
>
> (3) Even assuming we relax the URI requirement, non-encoded, non-ASCII
> characters are interpreted in the encoding of the document they are
> embedded in.  So, for instance, if you want to include Hebrew characters
> in a file IRI, you will need a web page that is encoded in
> something that can represent Hebrew characters.
>
> Karl
>
>
>
> On Fri, May 3, 2013 at 10:05 AM, Karl Wright <[email protected]> wrote:
>
>> I should clarify.  IF you can propose a better IRI form than the one the
>> connector generates, AND it will work for all languages/encodings and most
>> modern browsers, we should consider changing the connector code.
>>
>> Karl
>>
>>
>> On Fri, May 3, 2013 at 8:32 AM, Karl Wright <[email protected]> wrote:
>>
>>> Here is the code in the JCIFS connector:
>>>
>>>               String pathAttributeValue = documentIdentifier;
>>>               // 3/13/2008
>>>               // In looking at what comes into the path metadata attribute by default, and cogitating a bit, I've concluded that
>>>               // the smb:// and the server/domain name at the start of the path are just plain old noise, and should be stripped.
>>>               // This changes a behavior that has been around for a while, so there is a risk, but a quick back-and-forth with the
>>>               // SE's leads me to believe that this is safe.
>>>
>>>               if (pathAttributeValue.startsWith("smb://"))
>>>               {
>>>                 int index = pathAttributeValue.indexOf("/","smb://".length());
>>>                 if (index == -1)
>>>                   index = pathAttributeValue.length();
>>>                 pathAttributeValue = pathAttributeValue.substring(index);
>>>               }
>>>               // Now, translate
>>>               pathAttributeValue = matchMap.translate(pathAttributeValue);
>>>               pack(sb,pathAttributeValue,'+');
>>>             }
>>>             else
>>>               sb.append('-');
>>>
>>> Since the JCIFS connection determines the server name, the document
>>> identifier does not need to repeat that information.  If you need to send
>>> the server name to Solr for some reason, you can certainly do that on a
>>> per-job basis by putting in yet another bit of metadata, via the "Forced
>>> Metadata" tab in your job.  If you have a really strong reason for
>>> including the server name in the same path, it would also be possible to
>>> add another feature to the JCIFS connector to do it based on a checkbox or
>>> some such; but this would complicate further an already very complicated
>>> user interface.
>>>
>>> It looks, however, like you are trying to construct an IRI, which the
>>> JCIFS connector is supposed to be doing.  Can you explain what your needs
>>> are here?  What do you believe is the correct form of an IRI?
>>>
>>> Karl
>>>
>>>
>>>
>>> On Fri, May 3, 2013 at 7:49 AM, Yossi Nachum <[email protected]>wrote:
>>>
>>>> That is working. I created a path field in my schema and use the "path
>>>> attribute".
>>>> I have one problem: I don't see the name of the CIFS server, just the
>>>> path inside it.
>>>> I tried to use "Match Regexp" in the metadata tab with the following
>>>> values:
>>>> Match regexp: "(.*)"
>>>> Replace string: "file:////server_name/$(1)"
>>>>
>>>> but it did not work. I still see only the path.
>>>>
>>>> What am I doing wrong? How can I add my server name to the path?
>>>>
>>>> Thanks
>>>> Yossi
>>>>
>>>>
>>>>
>>>> On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <[email protected]>wrote:
>>>>
>>>>> Thanks I will try that
>>>>> On May 1, 2013 3:54 PM, "Karl Wright" <[email protected]> wrote:
>>>>>
>>>>>> There is also a different way to do this entirely - there is a path
>>>>>> attribute you can send as metadata to Solr.  Just include the entire
>>>>>> path, and put it into a different field that you declare in your schema.
>>>>>> See "path attribute" in the end-user documentation for the JCIFS connector.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <[email protected]>wrote:
>>>>>>
>>>>>>> IE 6 is extremely old and I believe we developed for IE 7 at a
>>>>>>> minimum (there were two different versions with different functionality
>>>>>>> we had to support there), and made further changes for IE 8 when it came
>>>>>>> out.  I have no idea what IE 9 or IE 10 do.
>>>>>>>
>>>>>>> The only way to change the encoding of the IRI is to modify the
>>>>>>> JCIFS connector code.  But please bear in mind that unless you can show
>>>>>>> your modifications will work across a wide variety of browsers, we are
>>>>>>> unlikely to accept these changes back into the code base.
>>>>>>>
>>>>>>> The alternative is, since the encoding IS deterministic and
>>>>>>> reversible, you could readily write a Tika plugin that would modify at
>>>>>>> least the URL field in the manner you desire.  But you could not modify
>>>>>>> the ID field since ManifoldCF uses this to delete documents that have
>>>>>>> disappeared.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <[email protected]>wrote:
>>>>>>>
>>>>>>>> The IRI is not working in my IE. I am using an old version of IE, V6
>>>>>>>> SP3.
>>>>>>>> But what I really want is to display the correct name of the path
>>>>>>>> with Hebrew characters.
>>>>>>>> If I understand you right, then I need to change the representation
>>>>>>>> of the IRI. How can I do that?
>>>>>>>> On May 1, 2013 3:14 PM, "Karl Wright" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Right, that is exactly what I would expect.
>>>>>>>>>
>>>>>>>>> ManifoldCF uses a URL (which is constructed by the connector) as
>>>>>>>>> the primary key for every document as indexed in the search engine.
>>>>>>>>> The URL has two purposes: first, it is supposed to be unique, and
>>>>>>>>> second, it is supposed to allow someone who browses to that result to
>>>>>>>>> locate the document.  In the case of JCIFS, the environment is presumed
>>>>>>>>> to be the local active directory domain(s), and the "URL" generated is
>>>>>>>>> really a file IRI, usually of the form
>>>>>>>>> "file://///server.domain/path/filename".  You thus should be able to
>>>>>>>>> paste the "URL" of the document from Solr into a browser on a machine
>>>>>>>>> in the domain, and see the document load.
>>>>>>>>>
>>>>>>>>> As I said before, however, there are already certain problems with
>>>>>>>>> this because each version of IE differs somewhat in how it deals with
>>>>>>>>> non-ASCII characters.  IRI legal character rules are somewhat
>>>>>>>>> different than URL rules, but IRI's are still nevertheless escaped in
>>>>>>>>> various ways.  There are also multiple equivalent ways of representing
>>>>>>>>> the same file path with different IRI's.
>>>>>>>>>
>>>>>>>>> It is not typical that the ID and URL fields of a document are
>>>>>>>>> presented to the user in any meaningful way, so your question is
>>>>>>>>> usually academic in most settings.  If you have a problem with the
>>>>>>>>> IRI's not actually working in a browser, that's of more immediate
>>>>>>>>> interest.  Please let us know if that's the case.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum 
>>>>>>>>> <[email protected]>wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for your response
>>>>>>>>>> I am seeing these characters in solr when I search these files.
>>>>>>>>>> I am using the solr example site and these characters show up in
>>>>>>>>>> the ID field and URL field.
>>>>>>>>>> BTW I am running solr and mcf on a linux server
>>>>>>>>>>  On May 1, 2013 1:11 PM, "Karl Wright" <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Where are you seeing these characters?  Are you talking about
>>>>>>>>>>> the file IRI's that the JCIFS connector generates?  Those IRI's are
>>>>>>>>>>> supposed to be constructed so that your browser would find them if
>>>>>>>>>>> you paste them into the browser URL window.  Unfortunately, there is
>>>>>>>>>>> no good standard, and people follow IE's behavior, and IE has changed
>>>>>>>>>>> multiple times in how it deals with non-latin-1 characters.
>>>>>>>>>>>
>>>>>>>>>>> Please provide a bit more information so that we can provide a
>>>>>>>>>>> better answer.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>>>>>> I want to index my NetApp files over CIFS and I have a problem
>>>>>>>>>>>> with Hebrew files and directories.
>>>>>>>>>>>> When I search for these files in Solr I see "%D7%91%D7%..."
>>>>>>>>>>>> instead of the directory path that contains Hebrew characters.
>>>>>>>>>>>> I tried to run the Java process with "-Djcifs.encoding=cp1255"
>>>>>>>>>>>> but it didn't help.
>>>>>>>>>>>> Can anyone help and tell me how I can index directories/files
>>>>>>>>>>>> in Hebrew?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Yossi
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
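
For readers who land on this thread with the same symptom: strings such as
"%D7%91%D7%..." are the UTF-8 bytes of the Hebrew path, written as %XX escapes,
which is what the thread means by the encoding being deterministic and
reversible. The sketch below is only an illustration of that round trip. The
class FileIriCodec, its method names, and the set of characters it leaves
unescaped are assumptions made for this example; it is not the code the JCIFS
connector uses, and the connector's exact escaping rules may differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of ManifoldCF or JCIFS.
final class FileIriCodec {
    // ASCII characters left unescaped besides letters and digits (an assumption).
    private static final String KEEP = "-._~/";

    // Percent-encode each UTF-8 byte of the path that is not a letter, digit, or in KEEP.
    static String encode(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean plain = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9') || KEEP.indexOf(v) >= 0;
            if (plain) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }

    // Reverse of encode: turn %XX escapes back into bytes, then read the bytes as UTF-8.
    static String decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            int hi = -1;
            int lo = -1;
            if (encoded.charAt(i) == '%' && i + 2 < encoded.length()) {
                hi = Character.digit(encoded.charAt(i + 1), 16);
                lo = Character.digit(encoded.charAt(i + 2), 16);
            }
            if (hi >= 0 && lo >= 0) {
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Not a valid escape: copy the character through as its UTF-8 bytes.
                int cp = encoded.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hebrew letters are written as Unicode escapes so the source file stays ASCII.
        String path = "/docs/\u05d1\u05d3\u05d9\u05e7\u05d4/a b.txt";
        String encoded = encode(path);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(path));
    }
}
```

The decoder is written by hand rather than with java.net.URLDecoder because
URLDecoder follows HTML form rules and turns "+" into a space, which would
corrupt file names that contain a literal "+". Something along these lines could
be used on the display side to show a readable path while leaving the indexed
ID field untouched.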
