The Elasticsearch output connector always base64-encodes the document content, whichever repository connector it came from. I gather that is standard for Elasticsearch.
Karl

On Mon, Feb 11, 2013 at 4:00 PM, Tony Edgin <[email protected]> wrote:
> Thanks again.
>
> I just ran an example setup to better understand what you said.
>
> As you said, the web page URL gets set to the _id field.
> The metadata that is sent to Elastic Search is as follows:
>
> header-Content-Type: "text/html; charset=UTF-8"
> header-Content-Length: "3278"
> header-Keep-Alive: "timeout=5, max=100"
> header-Server: "Apache/2.2"
> header-Connection: "Keep-Alive"
> type: "attachment"
> file: ...
>
> The file field looks to be base64 encoded. Is this always the case, or is
> this unique to web repo + elastic search?
>
> This must be the web page. I'm guessing header-Content-Type field holds the
> document type and not the type field.
>
>
> On Mon, Feb 11, 2013 at 1:17 PM, Karl Wright <[email protected]> wrote:
>>
>> What emerges from the web connector is the following:
>>
>> - metadata, which you define on the web connector’s “Metadata” tab,
>> that are named however you want;
>> - forced acls, which get added to the document based on what you
>> select on the “Security” tab;
>> - the document’s content type;
>> - the document’s url;
>> - the document itself.
>>
>> What the elastic search connector does is:
>> - Map the document’s url to ElasticSearch’s document id field (which I
>> guess shows up in Elastic Search as the ‘uri’ field)
>> - Output all the metadata directly to ElasticSearch using the name
>> provided by the repository connector
>> - Set the file value to “” (which seems wrong, since that could be
>> helpful if available - let me know if you think a fix for this would
>> be useful)
>> - NONE of the rest of the document fields (content type, acls, etc)
>> are communicated to Elastic Search at all right now, except for the
>> document itself.
>>
>> Karl
>>
>>
>> On Mon, Feb 11, 2013 at 2:55 PM, Tony Edgin <[email protected]> wrote:
>> > Thanks for the speedy response!
>> >
>> > I eventually want to index the contents of our local website with
>> > Elastic Search.
>> >
>> > I would use the Web repository connector with the no authority connector
>> > and the Elasticsearch output connector. Would you mind letting me know the
>> > names and meanings of the metadata that gets passed to Elastic Search?
>> >
>> > Thanks again.
>> >
>> >
>> > On Mon, Feb 11, 2013 at 12:45 PM, Karl Wright <[email protected]> wrote:
>> >>
>> >> So let me get this clear - you are looking to find out what the
>> >> names/meanings are of the metadata that gets passed to the output
>> >> connector, for a given repository connection?
>> >>
>> >> If this is what you are looking for, I'm afraid that while at one
>> >> point the end-user documentation described this pretty accurately, it
>> >> is now significantly out of date. While it's not terribly hard to
>> >> compile this information from source code etc., the work definitely
>> >> needs to be repeated by somebody.
>> >>
>> >> If you want to ask this question about a specific connector, I can
>> >> certainly try to answer it, though. If you want to contribute either
>> >> the information or a documentation patch, this would be great too.
>> >>
>> >> Karl
>> >>
>> >> On Mon, Feb 11, 2013 at 2:38 PM, Tony Edgin <[email protected]> wrote:
>> >> > I'm sure this is documented somewhere, and I apologize in advance for
>> >> > not being able to find it.
>> >> >
>> >> > How do I determine the model or schema of the document passed to the
>> >> > search engine by a given job?
>> >> >
>> >> > For instance, I'm running a job that crawls a directory on my local
>> >> > file system and passes it to Elastic Search. Interrogating Elastic
>> >> > Search, I can determine that the document has three fields, "file",
>> >> > "type" and "uri", all strings. How would I have known that in advance?
>> >> >
>> >> > Thanks for any help.
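
For anyone reading this thread later, here is a rough Python sketch of the document shape discussed above: the URL becomes the document id, the metadata is passed through under the names the repository connector supplied, and the content lands base64-encoded in the "file" field. This is not the connector's actual code. The helper names (build_indexed_document, decode_file_field) and the example URL are made up for illustration, and the "type" and "file" field names are simply the ones Tony observed in his index.

```python
import base64
import json


def build_indexed_document(url, metadata, content_bytes):
    """Hypothetical helper: assemble a document the way the thread describes.

    Assumptions (taken from the observations in the thread, not from the
    connector source): the URL is used as the document id, metadata keeps
    the names supplied by the repository connector, and the raw content is
    base64-encoded into a "file" field alongside type "attachment".
    """
    doc_id = url
    body = dict(metadata)          # metadata passed through unchanged
    body["type"] = "attachment"    # value observed in the indexed document
    body["file"] = base64.b64encode(content_bytes).decode("ascii")
    return doc_id, body


def decode_file_field(body):
    """Recover the original bytes from the base64-encoded "file" field."""
    return base64.b64decode(body["file"])


html = b"<html><body>Hello</body></html>"
doc_id, body = build_indexed_document(
    "http://example.com/index.html",   # placeholder URL
    {"header-Content-Type": "text/html; charset=UTF-8"},
    html,
)

print(doc_id)
print(json.dumps(body, indent=2))
print(decode_file_field(body) == html)
```

Decoding the "file" value of a document fetched from the index in the same way is a quick check that it really is the crawled page.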
