Hi Yossi,

I looked into this further over the weekend, to try and recall some of the
thinking that went into how our file IRI's are constructed.

(1) There is a constraint, which comes from certain output connectors, and
which may no longer be valid, that all file IRI's must be legal URI's.  If
that is still true, it REQUIRES us to %-encode non-ASCII characters.  The
standard URI encoding is UTF-8, which is why we use that encoding.
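
To make point (1) concrete, here is a minimal sketch of that encoding step. It is not the connector's actual code: the class name FileIriNonAscii and the method encodeNonAscii are made up for illustration, and the sketch deliberately leaves ASCII characters alone so that only the UTF-8 handling is visible.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ManifoldCF or JCIFS.
final class FileIriNonAscii {

    // Percent-encodes every non-ASCII character of a path as its UTF-8 bytes.
    // ASCII bytes (including reserved ones such as '#') pass through untouched.
    static String encodeNonAscii(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;              // treat the byte as unsigned
            if (v < 0x80) {
                sb.append((char) v);       // plain ASCII, copy as-is
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+05D1 (Hebrew letter bet) is the two UTF-8 bytes D7 91.
        System.out.println(encodeNonAscii("/\u05D1/a.txt"));
    }
}
```

The Hebrew letter bet comes out as %D7%91, which is the same prefix that shows up in the Solr ID and URL fields.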

(2) For other characters that are reserved or cannot be legally put in a
URI, such as "+" and " " and "#", the browsers I have access to behave as
noted next to each link in this test page:

<?xml version="1.0" encoding="utf-8"?>
<html>
<body>
    <a href="file:///c:/test/test.html">click here for test</a><br/>
    <a href="file:///c:/test/hi#there.html">click here for
hi#there</a>(works on IE 8, but not Firefox - base form on IE8)<br/>
    <a href="file:///c:/test/hi%23there.html">click here for
hi%23there</a>(works on both, base form on Firefox)<br/>
    <a href="file:///c:/test/hi<there.html">click here for
hi&lt;there</a>(works on both but represents a file that can't be
loaded)<br/>
    <a href="file:///c:/test/hi%3cthere.html">click here for
hi%3cthere</a>(works on both but represents a file that can't be loaded -
base form on both)<br/>
    <a href="file:///c:/test/hi there.html">click here for hi
there</a>(works on both, base form on both)<br/>
    <a href="file:///c:/test/hi%20there.html">click here for
hi%20there</a>(works on both)<br/>
</body>
</html>

As you can see, there's some common ground, but the common ground always
requires more encoding rather than less.
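
For reference, here is a hedged sketch of producing the more-encoded forms from the test page with java.net.URI from the JDK. This is not what the connector does today; FileUriQuoting and toFileUri are hypothetical names, and the sketch assumes an absolute path that starts with a single "/".

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of ManifoldCF or JCIFS.
final class FileUriQuoting {

    // Builds a file URI string from an absolute path. The multi-argument
    // URI constructor quotes characters that are illegal in a path
    // (for example '#', ' ' and '<'); toASCIIString() additionally
    // %-encodes any non-ASCII characters as UTF-8.
    static String toFileUri(String absolutePath) {
        try {
            URI pathOnly = new URI(null, null, absolutePath, null);
            return "file://" + pathOnly.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a usable path: " + absolutePath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toFileUri("/c:/test/hi#there.html"));
        System.out.println(toFileUri("/c:/test/hi there.html"));
        System.out.println(toFileUri("/c:/test/hi<there.html"));
    }
}
```

The three paths come out with %23, %20 and %3C respectively, matching the encoded links above apart from the case of the hex digits.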

(3) Even assuming we relax the URI requirement, non-encoded, non-ASCII
characters are interpreted in the encoding of the document they are
embedded in.  So, for instance, if you wanted to include unencoded Hebrew
characters in a file IRI, the web page containing it would have to be
encoded in something that can represent Hebrew characters.
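
If the goal is only to show a readable path to the user, the %-encoded form can be decoded on the display side instead. The following is a sketch under that assumption, using java.net.URI, whose getPath() decodes escaped octets as UTF-8. FileIriDisplay and displayPath are hypothetical names, and the sample IRI is invented for the example.

```java
import java.net.URI;

// Hypothetical helper, not part of ManifoldCF or JCIFS.
final class FileIriDisplay {

    // Returns the human-readable path of a %-encoded file URI.
    // URI.getPath() decodes %XX escapes, interpreting the bytes as UTF-8.
    // The stored ID/URL value itself is left untouched.
    static String displayPath(String encodedFileUri) {
        return URI.create(encodedFileUri).getPath();
    }

    public static void main(String[] args) {
        // %D7%91 is the UTF-8 encoding of U+05D1 (Hebrew letter bet).
        System.out.println(displayPath("file:///c:/test/%D7%91/hi%20there.html"));
    }
}
```

Only the displayed string changes here; the value that ManifoldCF uses as the document ID would stay in its encoded form.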

Karl



On Fri, May 3, 2013 at 10:05 AM, Karl Wright <[email protected]> wrote:

> I should clarify.  IF you can propose a better IRI form than the one the
> connector generates, AND it will work for all languages/encodings and most
> modern browsers, we should consider changing the connector code.
>
> Karl
>
>
> On Fri, May 3, 2013 at 8:32 AM, Karl Wright <[email protected]> wrote:
>
>> Here is the code in the JCIFS connector:
>>
>>               String pathAttributeValue = documentIdentifier;
>>               // 3/13/2008
>>               // In looking at what comes into the path metadata attribute
>>               // by default, and cogitating a bit, I've concluded that the
>>               // smb:// and the server/domain name at the start of the path
>>               // are just plain old noise, and should be stripped.
>>               // This changes a behavior that has been around for a while,
>>               // so there is a risk, but a quick back-and-forth with the
>>               // SE's leads me to believe that this is safe.
>>
>>               if (pathAttributeValue.startsWith("smb://"))
>>               {
>>                 int index = pathAttributeValue.indexOf("/","smb://".length());
>>                 if (index == -1)
>>                   index = pathAttributeValue.length();
>>                 pathAttributeValue = pathAttributeValue.substring(index);
>>               }
>>               // Now, translate
>>               pathAttributeValue = matchMap.translate(pathAttributeValue);
>>               pack(sb,pathAttributeValue,'+');
>>             }
>>             else
>>               sb.append('-');
>>
>> Since the JCIFS connection determines the server name, the document
>> identifier does not need to repeat that information.  If you need to send
>> the server name to Solr for some reason, you can certainly do that on a
>> per-job basis by putting in yet another bit of metadata, via the "Forced
>> Metadata" tab in your job.  If you have a really strong reason for
>> including the server name in the same path, it would also be possible to
>> add another feature to the JCIFS connector to do it based on a checkbox or
>> some such; but this would complicate further an already very complicated
>> user interface.
>>
>> It looks, however, like you are trying to construct an IRI, which the
>> JCIFS connector is supposed to be doing.  Can you explain what your needs
>> are here?  What do you believe is the correct form of an IRI?
>>
>> Karl
>>
>>
>>
>> On Fri, May 3, 2013 at 7:49 AM, Yossi Nachum <[email protected]> wrote:
>>
>>> That is working. I created a path field in my schema and used the "path
>>> attribute".
>>> I have one problem: I don't see the name of the CIFS server, just the
>>> path inside it.
>>> I tried to use "Match Regexp" in the metadata tab with the following
>>> values:
>>> Match regexp: "(.*)"
>>> Replace string: "file:////server_name/$(1)"
>>>
>>> but it did not work. I am still seeing the path only.
>>>
>>> What am I doing wrong? How can I add my server name to the path?
>>>
>>> Thanks
>>> Yossi
>>>
>>>
>>>
>>> On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <[email protected]> wrote:
>>>
>>>> Thanks I will try that
>>>> On May 1, 2013 3:54 PM, "Karl Wright" <[email protected]> wrote:
>>>>
>>>>> There is also a different way to do this entirely - there is a path
>>>>> attribute you can send as metadata to Solr.  Just include the entire path,
>>>>> and put it into a different field that you declare in your schema.  See
>>>>> "path attribute" in the end-user documentation for the JCIFS connector.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <[email protected]>wrote:
>>>>>
>>>>>> IE 6 is extremely old and I believe we developed for IE 7 at a
>>>>>> minimum (there were two different versions with different
>>>>>> functionality we had to support there), and made further changes
>>>>>> for IE 8 when it came out.  I have no idea what IE 9 or IE 10 do.
>>>>>>
>>>>>> The only way to change the encoding of the IRI is to modify the JCIFS
>>>>>> connector code.  But please bear in mind that unless you can show your
>>>>>> modifications will work across a wide variety of browsers, we are
>>>>>> unlikely to accept these changes back into the code base.
>>>>>>
>>>>>> The alternative is, since the encoding IS deterministic and
>>>>>> reversible, you could readily write a Tika plugin that would modify at
>>>>>> least the URL field in the manner you desire.  But you could not
>>>>>> modify the ID field since ManifoldCF uses this to delete documents
>>>>>> that have disappeared.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <[email protected]>wrote:
>>>>>>
>>>>>>> The IRI is not working in my IE. I am using an old version of IE,
>>>>>>> V6 SP3.
>>>>>>> But what I really want is to display the correct name of the path
>>>>>>> with Hebrew characters.
>>>>>>> If I understand you correctly, I need to change the representation
>>>>>>> of the IRI. How can I do that?
>>>>>>> On May 1, 2013 3:14 PM, "Karl Wright" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Right, that is exactly what I would expect.
>>>>>>>>
>>>>>>>> ManifoldCF uses a URL (which is constructed by the connector) as
>>>>>>>> the primary key for every document as indexed in the search
>>>>>>>> engine.  The URL has two purposes: first, it is supposed to be
>>>>>>>> unique, and second, it is supposed to allow someone who browses to
>>>>>>>> that result to locate the document.  In the case of JCIFS, the
>>>>>>>> environment is presumed to be the local active directory
>>>>>>>> domain(s), and the "URL" generated is really a file IRI, usually
>>>>>>>> of the form "file://///server.domain/path/filename".  You thus
>>>>>>>> should be able to paste the "URL" of the document from Solr into
>>>>>>>> a browser on a machine in the domain, and see the document load.
>>>>>>>>
>>>>>>>> As I said before, however, there are already certain problems with
>>>>>>>> this because each version of IE differs somewhat in how it deals
>>>>>>>> with non-ASCII characters.  IRI legal character rules are somewhat
>>>>>>>> different than URL rules, but IRI's are still nevertheless escaped
>>>>>>>> in various ways.  There are also multiple equivalent ways of
>>>>>>>> representing the same file path with different IRI's.
>>>>>>>>
>>>>>>>> It is not typical that the ID and URL fields of a document are
>>>>>>>> presented to the user in any meaningful way, so your question is
>>>>>>>> academic in most settings.  If you have a problem with the IRI's
>>>>>>>> not actually working in a browser, that's of more immediate
>>>>>>>> interest.  Please let us know if that's the case.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for your response.
>>>>>>>>> I am seeing these characters in Solr when I search for these files.
>>>>>>>>> I am using the Solr example site, and these characters show up in
>>>>>>>>> the ID field and URL field.
>>>>>>>>> BTW, I am running Solr and MCF on a Linux server.
>>>>>>>>>  On May 1, 2013 1:11 PM, "Karl Wright" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Where are you seeing these characters?  Are you talking about the
>>>>>>>>>> file IRI's that the JCIFS connector generates?  Those IRI's are
>>>>>>>>>> supposed to be constructed so that your browser would find them
>>>>>>>>>> if you paste them into the browser URL window.  Unfortunately,
>>>>>>>>>> there is no good standard, and people follow IE's behavior, and
>>>>>>>>>> IE has changed multiple times in how it deals with non-latin-1
>>>>>>>>>> characters.
>>>>>>>>>>
>>>>>>>>>> Please provide a bit more information so that we can provide a
>>>>>>>>>> better answer.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>>>>> I want to index my NetApp files over CIFS, and I have a problem
>>>>>>>>>>> with Hebrew files and directories.
>>>>>>>>>>> When I search for these files in Solr I see "%D7%91%D7%..."
>>>>>>>>>>> instead of the directory path that contains Hebrew characters.
>>>>>>>>>>> I tried to run the Java process with "-Djcifs.encoding=cp1255" but
>>>>>>>>>>> it didn't help.
>>>>>>>>>>> Can anyone help and tell me how I can index directories/files in
>>>>>>>>>>> Hebrew?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Yossi
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>>
>
