Hi Karl,

Thanks a lot for your help.

My Datafari setup uses a file system crawler to crawl files (repository
connector -> job), from which text is extracted via the Tika plugin. The
text is then sent to Solr via the Solr output connector. I am already using
a transformation connector to add a field based on the name of the job
(using the file system repository connector), so that I can distinguish the
origin of an indexed file later.

Actually, I had arrived at the same solution you presented (but I did not
mention it beforehand, so as not to spoil the answers): writing my own
transformation connector to retrieve the information from the database. The
connector should:

- know the file name
- compile a SQL statement from the file name
- send this SQL statement to the database
- retrieve the file number
- add it to a certain field

I know little to nothing about Java, but I am able to teach myself if
necessary. Is there a good starting point for developing my own
transformation connector?

Thanks in advance,

Wilhelm

On Wednesday, 22 February 2017, 13:15:23 CET, Karl Wright wrote:
> Hi Wilhelm,
>
> I don't know anything about how datafari uses ManifoldCF to crawl. All I
> can do is describe how ManifoldCF works, and then maybe you can see how it
> integrates with datafari.
>
> MCF gets documents from a repository using one of many kinds of repository
> connector. It then can transform the document in many different ways,
> before sending the (transformed) document to one of many output
> connectors. I gather that datafari injects documents primarily into Solr.
>
> Each job in MCF has its own "pipeline", which describes the flow of a
> document through the system for that job.
>
> The transformations that are available in MCF include:
>
> - ability to extract metadata from the document (using Tika)
> - ability to modify or add metadata properties (you specify this in the job
>   UI)
> - OpenNLP metadata extraction
> - Filter out documents based on characteristics of the document
>
> Writing connectors is relatively straightforward and there are online
> materials available to help you do this. I can provide a link, if you need
> it. Without any more information as to what exactly you are using for a
> repository connector, and what that connector provides as part of the
> document information, I can't really give you the best approach here, but
> it may be possible to write a transformation connector that would look up
> the information you want to add as metadata from your database and include
> that in the document that gets sent to Solr.
>
> Please let us know how we can help.
>
> Thanks,
> Karl
>
>
> On Wed, Feb 22, 2017 at 7:01 AM, Wilhelm Eger <[email protected]> wrote:
> > Hi!
> >
> > I am using a setup of datafari (www.datafari.com), which more or less
> > combines a ManifoldCF file index with SolR as a search engine.
> >
> > My setup consists of ~350000 files, which are composed mainly of doc(x),
> > xls(x), msg and pdf files. pdf files are ocr'd externally before they are
> > added to the ManifoldCF index. Only remaining image files (png, jpg) are
> > ocr'd on-the-fly, when being imported.
> >
> > The files are actually part of an external file management system (files
> > in the literal meaning of files, not files in the meaning of entities
> > saved on the hard disk), which is not related to ManifoldCF/SolR at all.
> > This system unfortunately does not provide a proper full text search,
> > hence I implemented it as outlined above.
> >
> > However, the users are used to certain file numbers provided by this file
> > management system.
> > These file numbers are stored in a MSSQL database, which is
> > accessible from the host my setup is running on. I can easily get the file
> > number by sending a respective SQL statement based on the file name (of
> > the entity saved on the hard disk) to the SQL Server. Hence, for each file
> > name, there is a file number stored in the database. I would like to have
> > these file numbers to be stored in a specific field of the solr index to
> > be shown by the (tomcat) output, e.g:
> >
> > File name: /data/1003234234.docx
> > Content: "This is the content. You searched for _text_."
> > File name belongs to file number: SUI-G-25-A
> >
> > Is there any possibility to achieve that? Did I understand it correctly
> > that this could happen either in ManifoldCF during indexing or in SolR
> > during importing?
> >
> > I know that there is a tika plugin to talk to databases, which could be
> > fed with a SQL statement. But how to connect it with the data retrieved
> > from the files crawler?
> >
> > Alternatively, I could also call an external script (bash, python) to
> > retrieve the respective data from the database using bsqldb.
> >
> > Any hint in the right direction is very much appreciated.
> >
> > Thanks in advance,
> >
> > Wilhelm
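
A minimal sketch of the lookup discussed in this thread (file name -> SQL
statement -> file number -> index field), written in Python in the spirit of
the external-script variant mentioned in the original message. It is not a
ManifoldCF connector and uses no ManifoldCF API. Several things here are
assumptions that do not come from the thread: an in-memory SQLite database
stands in for the MSSQL server, the table file_numbers and its columns
file_name and file_number are hypothetical, and the lookup key is assumed to
be the base name of the crawled file path.

```python
import os
import sqlite3


def lookup_key(file_path):
    """Derive the database lookup key from a crawled file path.

    Assumption: the key is the base name, e.g. "1003234234.docx".
    """
    return os.path.basename(file_path)


def fetch_file_number(conn, file_path):
    """Return the file number stored for file_path, or None if no row matches."""
    # Parameterized query: the file name is passed as data, never spliced
    # into the SQL text. Table and column names are hypothetical.
    row = conn.execute(
        "SELECT file_number FROM file_numbers WHERE file_name = ?",
        (lookup_key(file_path),),
    ).fetchone()
    return row[0] if row else None


def add_file_number(metadata, conn, file_path, field="file_number"):
    """Return a copy of metadata with the file number added when one is found."""
    enriched = dict(metadata)
    number = fetch_file_number(conn, file_path)
    if number is not None:
        enriched[field] = number
    return enriched


def demo_connection():
    """Build an in-memory SQLite stand-in for the MSSQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_numbers (file_name TEXT PRIMARY KEY, file_number TEXT)"
    )
    # Sample row taken from the example in the original message.
    conn.execute(
        "INSERT INTO file_numbers VALUES (?, ?)",
        ("1003234234.docx", "SUI-G-25-A"),
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(add_file_number({"source_job": "fileshare"}, conn, "/data/1003234234.docx"))
```

The query is parameterized rather than assembled by string concatenation, so
unusual characters in a file name cannot change the statement. A
transformation connector would presumably perform the same steps for each
document before handing it to the output connector, but that part is not
shown here.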
