Thanks again for your help.
I used the metadata option that you suggested and it works fine.
Now people know where the documents they find in Solr are located.
The URL still does not work when I click on it, but at least we know where the docs are.
Sorry about the late response; I was on vacation.

On May 6, 2013 3:36 PM, "Karl Wright" <[email protected]> wrote:
> Hi Yossi,
>
> I looked into this further over the weekend, to try and recall some of the
> thinking that went into how our file IRI's are constructed.
>
> (1) There is a constraint, which comes from certain output connectors, and
> which may no longer be valid, that all file IRI's must be legal URI's. If
> that is still true, it REQUIRES us to %-encode non-ASCII characters. The
> standard URI encoding is UTF-8, which is why we use that encoding.
>
> (2) For other characters that cannot be legally put in a URI, such as "+"
> and " " and "#", browsers I have access to behave as follows:
>
> <?xml version="1.0" encoding="utf-8"?>
> <html>
> <body>
> <a href="file:///c:/test/test.html">click here for test</a><br/>
> <a href="file:///c:/test/hi#there.html">click here for hi#there</a>(works on IE 8, but not Firefox - base form on IE8)<br/>
> <a href="file:///c:/test/hi%23there.html">click here for hi%23there</a>(works on both, base form on Firefox)<br/>
> <a href="file:///c:/test/hi<there.html">click here for hi<there</a>(works on both but represents a file that can't be loaded)<br/>
> <a href="file:///c:/test/hi%3cthere.html">click here for hi%3cthere</a>(works on both but represents a file that can't be loaded - base form on both)<br/>
> <a href="file:///c:/test/hi there.html">click here for hi there</a>(works on both, base form on both)<br/>
> <a href="file:///c:/test/hi%20there.html">click here for hi%20there</a>(works on both)<br/>
> </body>
> </html>
>
> As you can see, there's some common ground, but always the common ground
> requires more encoding rather than less.
>
> (3) Even assuming we relax the URI requirement, non-encoded, non-ASCII
> characters are interpreted in the encoding of the document they are
> embedded in. So, for instance, if you wanted to include Hebrew characters
> in a file IRI, you will have to have a web page that is encoded in
> something that can represent Hebrew characters.
>
> Karl
>
>
> On Fri, May 3, 2013 at 10:05 AM, Karl Wright <[email protected]> wrote:
>
>> I should clarify. IF you can propose a better IRI form than the one the
>> connector generates, AND it will work for all languages/encodings and most
>> modern browsers, we should consider changing the connector code.
>>
>> Karl
>>
>>
>> On Fri, May 3, 2013 at 8:32 AM, Karl Wright <[email protected]> wrote:
>>
>>> Here is the code in the JCIFS connector:
>>>
>>>           String pathAttributeValue = documentIdentifier;
>>>           // 3/13/2008
>>>           // In looking at what comes into the path metadata attribute by default, and cogitating a bit, I've concluded that
>>>           // the smb:// and the server/domain name at the start of the path are just plain old noise, and should be stripped.
>>>           // This changes a behavior that has been around for a while, so there is a risk, but a quick back-and-forth with the
>>>           // SE's leads me to believe that this is safe.
>>>
>>>           if (pathAttributeValue.startsWith("smb://"))
>>>           {
>>>             int index = pathAttributeValue.indexOf("/","smb://".length());
>>>             if (index == -1)
>>>               index = pathAttributeValue.length();
>>>             pathAttributeValue = pathAttributeValue.substring(index);
>>>           }
>>>           // Now, translate
>>>           pathAttributeValue = matchMap.translate(pathAttributeValue);
>>>           pack(sb,pathAttributeValue,'+');
>>>         }
>>>         else
>>>           sb.append('-');
>>>
>>> Since the JCIFS connection determines the server name, the document
>>> identifier does not need to repeat that information. If you need to send
>>> the server name to Solr for some reason, you can certainly do that on a
>>> per-job basis by putting in yet another bit of metadata, via the "Forced
>>> Metadata" tab in your job. If you have a really strong reason for
>>> including the server name in the same path, it would also be possible to
>>> add another feature to the JCIFS connector to do it based on a checkbox or
>>> some such; but this would complicate further an already very complicated
>>> user interface.
>>>
>>> It looks, however, like you are trying to construct an IRI, which the
>>> JCIFS connector is supposed to be doing. Can you explain what your needs
>>> are here? What do you believe is the correct form of an IRI?
>>>
>>> Karl
>>>
>>>
>>> On Fri, May 3, 2013 at 7:49 AM, Yossi Nachum <[email protected]> wrote:
>>>
>>>> That is working. I created a path field in my schema and use the "path
>>>> attribute".
>>>> I have one problem: I don't see the name of the CIFS server, just the
>>>> path inside it.
>>>> I tried to use "Match Regexp" in the metadata tab with the following
>>>> values:
>>>> Match regexp: "(.*)"
>>>> Replace string: "file:////server_name/$(1)"
>>>>
>>>> but it did not work. I am still seeing the path only.
>>>>
>>>> What am I doing wrong? How can I add my server name to the path?
>>>>
>>>> Thanks
>>>> Yossi
>>>>
>>>>
>>>> On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <[email protected]> wrote:
>>>>
>>>>> Thanks, I will try that.
>>>>> On May 1, 2013 3:54 PM, "Karl Wright" <[email protected]> wrote:
>>>>>
>>>>>> There is also a different way to do this entirely - there is a path
>>>>>> attribute you can send as metadata to Solr. Just include the entire
>>>>>> path, and put it into a different field that you declare in your
>>>>>> schema. See "path attribute" in the end-user documentation for the
>>>>>> JCIFS connector.
>>>>>>
>>>>>>
>>>>>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> IE 6 is extremely old and I believe we developed for IE 7 at a
>>>>>>> minimum (there were two different versions with different
>>>>>>> functionality we had to support there), and made further changes for
>>>>>>> IE 8 when it came out. I have no idea what IE 9 or IE 10 do.
>>>>>>>
>>>>>>> The only way to change the encoding of the IRI is to modify the
>>>>>>> JCIFS connector code.
>>>>>>> But please bear in mind that unless you can show
>>>>>>> your modifications will work across a wide variety of browsers, we are
>>>>>>> unlikely to accept these changes back into the code base.
>>>>>>>
>>>>>>> The alternative is, since the encoding IS deterministic and
>>>>>>> reversible, you could readily write a Tika plugin that would modify at
>>>>>>> least the URL field in the manner you desire. But you could not modify
>>>>>>> the ID field since ManifoldCF uses this to delete documents that have
>>>>>>> disappeared.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <[email protected]> wrote:
>>>>>>>
>>>>>>>> The IRI is not working in my IE. I am using an old version of IE,
>>>>>>>> V6 SP3.
>>>>>>>> But what I really want is to display the correct name of the path
>>>>>>>> with Hebrew characters.
>>>>>>>> If I understand you right, then I need to change the representation
>>>>>>>> of the IRI. How can I do that?
>>>>>>>> On May 1, 2013 3:14 PM, "Karl Wright" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Right, that is exactly what I would expect.
>>>>>>>>>
>>>>>>>>> ManifoldCF uses a URL (which is constructed by the connector) as
>>>>>>>>> the primary key for every document as indexed in the search engine.
>>>>>>>>> The URL has two purposes: first, it is supposed to be unique, and
>>>>>>>>> second, it is supposed to allow someone who browses to that result
>>>>>>>>> to locate the document. In the case of JCIFS, the environment is
>>>>>>>>> presumed to be the local active directory domain(s), and the "URL"
>>>>>>>>> generated is really a file IRI, usually of the form
>>>>>>>>> "file://///server.domain/path/filename". You thus should be able to
>>>>>>>>> paste the "URL" of the document from Solr into a browser on a
>>>>>>>>> machine in the domain, and see the document load.
>>>>>>>>>
>>>>>>>>> As I said before, however, there are already certain problems with
>>>>>>>>> this because each version of IE differs somewhat in how it deals
>>>>>>>>> with non-ASCII characters. IRI legal character rules are somewhat
>>>>>>>>> different than URL rules, but IRI's are still nevertheless escaped
>>>>>>>>> in various ways. There are also multiple equivalent ways of
>>>>>>>>> representing the same file path with different IRI's.
>>>>>>>>>
>>>>>>>>> It is not typical that the ID and URL fields of a document are
>>>>>>>>> presented to the user in any meaningful way, so your question is
>>>>>>>>> usually academic in most settings. If you have a problem with the
>>>>>>>>> IRI's not actually working in a browser, that's of more immediate
>>>>>>>>> interest. Please let us know if that's the case.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for your response
>>>>>>>>>> I am seeing these characters in solr when I search these files.
>>>>>>>>>> I am using the solr example site and these characters show up in
>>>>>>>>>> the ID field and URL field.
>>>>>>>>>> BTW I am running solr and mcf on a linux server
>>>>>>>>>> On May 1, 2013 1:11 PM, "Karl Wright" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Where are you seeing these characters? Are you talking about
>>>>>>>>>>> the file IRI's that the JCIFS connector generates? Those IRI's are
>>>>>>>>>>> supposed to be constructed so that your browser would find them if
>>>>>>>>>>> you paste them into the browser URL window. Unfortunately, there
>>>>>>>>>>> is no good standard, and people follow IE's behavior, and IE has
>>>>>>>>>>> changed multiple times in how it deals with non-latin-1 characters.
>>>>>>>>>>>
>>>>>>>>>>> Please provide a bit more information so that we can provide a
>>>>>>>>>>> better answer.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>>>>>> I want to index my NetApp files over CIFS and I have a problem
>>>>>>>>>>>> with Hebrew files and directories.
>>>>>>>>>>>> When I search for these files in Solr I see "%D7%91%D7%..."
>>>>>>>>>>>> instead of the directory path that contains Hebrew characters.
>>>>>>>>>>>> I tried to run the java process with "-Djcifs.encoding=cp1255"
>>>>>>>>>>>> but it didn't help.
>>>>>>>>>>>> Can anyone help and tell me how I can index directories/files
>>>>>>>>>>>> in Hebrew?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Yossi
>>>>>>>>>>>>

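A note on the "%D7%91%D7%..." values discussed in this thread: they are the UTF-8 bytes of the Hebrew path, %-encoded, and as mentioned the encoding is deterministic and reversible. The following is only a minimal sketch of how a display value could be recovered from such a string, not the connector's own code. The class name IriDisplayDecoder, the method toDisplayForm, and the sample path are made up for illustration. It relies on java.net.URLDecoder, which performs form decoding and would turn "+" into a space, so a literal "+" is protected first.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration only; not part of ManifoldCF or JCIFS.
class IriDisplayDecoder {

  // Turns a %-encoded UTF-8 string (as seen in the Solr ID/URL fields)
  // back into readable text, for display purposes only.
  public static String toDisplayForm(String encoded) {
    try {
      // URLDecoder treats '+' as an encoded space, so re-encode any literal
      // '+' as %2B before decoding to keep it intact.
      String protectedPlus = encoded.replace("+", "%2B");
      return URLDecoder.decode(protectedPlus, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Assumed sample value; the server, share, and file names are invented.
    String encoded =
        "file://///server.domain/share/%D7%91%D7%93%D7%99%D7%A7%D7%94/hi%20there.txt";
    System.out.println(toDisplayForm(encoded));
  }
}
```

The decoded value is suitable for a separate display field. The ID field itself should be left untouched, since ManifoldCF uses it to delete documents that have disappeared.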