I should clarify. IF you can propose a better IRI form than the one the connector generates, AND it will work for all languages/encodings and most modern browsers, we should consider changing the connector code.
Karl

On Fri, May 3, 2013 at 8:32 AM, Karl Wright <[email protected]> wrote:

> Here is the code in the JCIFS connector:
>
>     String pathAttributeValue = documentIdentifier;
>     // 3/13/2008
>     // In looking at what comes into the path metadata attribute by default, and cogitating a bit, I've concluded that
>     // the smb:// and the server/domain name at the start of the path are just plain old noise, and should be stripped.
>     // This changes a behavior that has been around for a while, so there is a risk, but a quick back-and-forth with the
>     // SE's leads me to believe that this is safe.
>
>     if (pathAttributeValue.startsWith("smb://"))
>     {
>       int index = pathAttributeValue.indexOf("/","smb://".length());
>       if (index == -1)
>         index = pathAttributeValue.length();
>       pathAttributeValue = pathAttributeValue.substring(index);
>     }
>     // Now, translate
>     pathAttributeValue = matchMap.translate(pathAttributeValue);
>     pack(sb,pathAttributeValue,'+');
>   }
>   else
>     sb.append('-');
>
> Since the JCIFS connection determines the server name, the document identifier does not need to repeat that information. If you need to send the server name to Solr for some reason, you can certainly do that on a per-job basis by putting in yet another bit of metadata, via the "Forced Metadata" tab in your job. If you have a really strong reason for including the server name in the same path, it would also be possible to add another feature to the JCIFS connector to do it based on a checkbox or some such; but this would complicate further an already very complicated user interface.
>
> It looks, however, like you are trying to construct an IRI, which the JCIFS connector is supposed to be doing. Can you explain what your needs are here? What do you believe is the correct form of an IRI?
>
> Karl
>
> On Fri, May 3, 2013 at 7:49 AM, Yossi Nachum <[email protected]> wrote:
>
>> That is working. I created a path field in my schema and use the "path attribute".
>> I have one problem: I don't see the name of the CIFS server, just the path inside it.
>> I tried to use "Match Regexp" in the metadata tab with the following values:
>> Match regexp: "(.*)"
>> Replace string: "file:////server_name/$(1)"
>>
>> but it did not work. I am still seeing the path only.
>>
>> What am I doing wrong? How can I add my server name to the path?
>>
>> Thanks
>> Yossi
>>
>> On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <[email protected]> wrote:
>>
>>> Thanks, I will try that.
>>> On May 1, 2013 3:54 PM, "Karl Wright" <[email protected]> wrote:
>>>
>>>> There is also a different way to do this entirely - there is a path attribute you can send as metadata to Solr. Just include the entire path, and put it into a different field that you declare in your schema. See "path attribute" in the end-user documentation for the JCIFS connector.
>>>>
>>>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> IE 6 is extremely old and I believe we developed for IE 7 at a minimum (there were two different versions with different functionality we had to support there), and made further changes for IE 8 when it came out. I have no idea what IE 9 or IE 10 do.
>>>>>
>>>>> The only way to change the encoding of the IRI is to modify the JCIFS connector code. But please bear in mind that unless you can show your modifications will work across a wide variety of browsers, we are unlikely to accept these changes back into the code base.
>>>>>
>>>>> The alternative is, since the encoding IS deterministic and reversible, you could readily write a Tika plugin that would modify at least the URL field in the manner you desire. But you could not modify the ID field since ManifoldCF uses this to delete documents that have disappeared.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <[email protected]> wrote:
>>>>>
>>>>>> The IRI is not working in my IE. I am using an old version of IE, V6 SP3.
>>>>>> But what I really want is to display the correct name of the path with Hebrew characters.
>>>>>> If I understand you right, then I need to change the representation of the IRI. How can I do that?
>>>>>> On May 1, 2013 3:14 PM, "Karl Wright" <[email protected]> wrote:
>>>>>>
>>>>>>> Right, that is exactly what I would expect.
>>>>>>>
>>>>>>> ManifoldCF uses a URL (which is constructed by the connector) as the primary key for every document as indexed in the search engine. The URL has two purposes: first, it is supposed to be unique, and second, it is supposed to allow someone who browses to that result to locate the document. In the case of JCIFS, the environment is presumed to be the local active directory domain(s), and the "URL" generated is really a file IRI, usually of the form "file://///server.domain/path/filename". You thus should be able to paste the "URL" of the document from Solr into a browser on a machine in the domain, and see the document load.
>>>>>>>
>>>>>>> As I said before, however, there are already certain problems with this because each version of IE differs somewhat in how it deals with non-ASCII characters. IRI legal character rules are somewhat different than URL rules, but IRI's are still nevertheless escaped in various ways. There are also multiple equivalent ways of representing the same file path with different IRI's.
>>>>>>>
>>>>>>> It is not typical that the ID and URL fields of a document are presented to the user in any meaningful way, so your question is usually academic in most settings.
>>>>>>> If you have a problem with the IRI's not actually working in a browser, that's of more immediate interest. Please let us know if that's the case.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>> I am seeing these characters in Solr when I search these files.
>>>>>>>> I am using the Solr example site and these characters show up in the ID field and URL field.
>>>>>>>> BTW I am running Solr and MCF on a Linux server.
>>>>>>>> On May 1, 2013 1:11 PM, "Karl Wright" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Where are you seeing these characters? Are you talking about the file IRI's that the JCIFS connector generates? Those IRI's are supposed to be constructed so that your browser would find them if you paste them into the browser URL window. Unfortunately, there is no good standard, and people follow IE's behavior, and IE has changed multiple times in how it deals with non-latin-1 characters.
>>>>>>>>>
>>>>>>>>> Please provide a bit more information so that we can provide a better answer.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>>>> I want to index my NetApp files over CIFS and I have a problem with Hebrew files and directories.
>>>>>>>>>> When I search for these files in Solr I see "%D7%91%D7%..." instead of the directory path that contains Hebrew characters.
>>>>>>>>>> I tried to run the Java process with "-Djcifs.encoding=cp1255" but it didn't help.
>>>>>>>>>> Can anyone help and tell me how I can index directories/files in Hebrew?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Yossi
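
For readers following the thread: the escapes reported in the first message ("%D7%91%D7%...") look like percent-encoded UTF-8 bytes, and the thread notes that the encoding is deterministic and reversible. The sketch below shows one way to turn such a value back into readable text for display only, using the standard java.net.URLDecoder. The class and method names (IriPathDecoder, decodeForDisplay) are hypothetical and are not part of ManifoldCF or the JCIFS connector, and the assumption that the connector escapes UTF-8 bytes is inferred from the sample value, not confirmed by the thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not ManifoldCF code: decodes a percent-encoded
// path or file IRI into readable text for display purposes.
class IriPathDecoder {

    // Assumption: the escapes are UTF-8 bytes, as the "%D7%91" sample suggests.
    static String decodeForDisplay(String encoded) {
        try {
            // URLDecoder follows form-encoding rules and would turn '+' into a
            // space, so literal plus signs are protected before decoding.
            return URLDecoder.decode(encoded.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // "%D7%91" is the UTF-8 encoding of the Hebrew letter bet (U+05D1).
        // The truncated sample from the thread ("%D7%91%D7%...") cannot be
        // decoded as-is, so a complete escape sequence is used here.
        String url = "file://///server.domain/share/%D7%91/report.txt";
        System.out.println(decodeForDisplay(url));
    }
}
```

As the thread points out, a transformation like this should only be applied to a display or URL field; the ID field has to stay exactly as the connector produced it, because ManifoldCF uses it to delete documents that have disappeared.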
