Re: Integrating OAuth2 with Solr MailEntityProcessor
Hi again,

Anybody interested in this feature for the Solr MailEntityProcessor? WDYT?

Thanks,
Dileepa

On Thu, Jan 30, 2014 at 11:00 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

I think OAuth2 integration is a valid use case for Solr when it comes to importing data from user accounts such as email, social networks, enterprise stores etc. Do you think OAuth2 integration in Solr would be a useful feature? If so, I would like to start working on this. I feel this could also be a good project for GSoC 2014.

Thanks,
Dileepa

On Wed, Jan 29, 2014 at 3:57 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

I'm doing a research project on Email Reputation Analysis, and for this project I'm planning to use the Apache Solr, Tika and Mahout projects to analyse, store and query the reputation of emails and correspondents. For indexing emails in Solr I'm going to use the MailEntityProcessor [1]. But I see that it requires the user to provide their email credentials to the DIH, which is a security risk. I also feel the current MailEntityProcessor doesn't allow importing data from multiple mailboxes.

What do you think of integrating an authorization mechanism like OAuth2 in Solr? I'd appreciate your ideas on using this for indexing multiple mailboxes without requiring users to give out their usernames and passwords.

<document>
  <entity processor="MailEntityProcessor"
          user="someb...@gmail.com"
          password="something"
          host="imap.gmail.com"
          protocol="imaps"
          folders="x,y,z"/>
</document>

Regards,
Dileepa

[1] http://wiki.apache.org/solr/MailEntityProcessor
API to get documents imported in dataimport from EventListener.onEvent(Context cntxt)
Hi All,

Is there a way to retrieve the documents being imported in a dataimport request from an EventListener configured to run at onImportEnd? I need to get the set of values of the field "content" from all the imported documents in order to perform an enhancement task. Is there a way to retrieve the documents imported in a dataimport from my EventListener?

My example data-config is as below;

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test" user="usr1"
              password="pass1" batchSize="1"/>
  <document name="stanboldata" onImportEnd="com.solr.stanbol.processor.StanbolEventListener">
    <entity name="stanbolrequest" query="SELECT * FROM documents">
      <field column="id" name="id"/>
      <field column="content" name="content"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

Currently what I do is as below:

1. The /dataimport request handler is configured with a custom UpdateRequestProcessor which intercepts the documents being imported, gets the value of the field I want, and updates a static Map<Long, String> contents in my custom EventListener class with the document ID and the content String.
2. At the end of the import process the StanbolEventListener is triggered onImportEnd of the dataimport; in the onEvent(Context cntxt) method, the contents Map is iterated and all the content field values are sent to an external server to be enhanced.
3. The documents with the IDs (the keys of the contents Map) are updated with the enhanced fields and committed.

This mechanism works fine for a single dataimport process at a time. But when there are concurrent dataimport requests, the system behaves erratically. I suspect the static Map<Long, String> contents is updated unpredictably by the concurrent update requests initiated by the dataimport processes. To make the contents Map thread safe I used a ConcurrentHashMap implementation; however, I still get inconsistent results in the update process. What I'm looking for is an alternative that avoids this concurrency handling in EventListeners.
I think this can be achieved if the whole dataimport process is executed as a single transaction, and at the onImportEnd EventListener all the imported documents are retrieved to get the content field of each document. Is there a way to access the set of imported documents in the onEvent(Context context) method of an EventListener in Solr? Can I use the Context object to access my documents? Any suggestions?
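One way to sidestep the shared-map clash described above, while staying with the listener approach, is to key the accumulator by an import-request identifier so concurrent imports never write into each other's entries. A rough stdlib-only sketch; the class and method names here are hypothetical and not Solr APIs, and producing a per-request id is left to the caller:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ImportAccumulator {
    // One inner map per import request, so concurrent /dataimport runs
    // never interleave their (docId -> content) entries.
    private static final Map<String, Map<Long, String>> RUNS =
            new ConcurrentHashMap<String, Map<Long, String>>();

    // Called from the update processor for each intercepted document.
    public static void record(String requestId, long docId, String content) {
        Map<Long, String> run = RUNS.get(requestId);
        if (run == null) {
            RUNS.putIfAbsent(requestId, new ConcurrentHashMap<Long, String>());
            run = RUNS.get(requestId);
        }
        run.put(docId, content);
    }

    // Called from the onImportEnd listener: removes and returns this
    // request's documents, leaving other concurrent runs untouched.
    public static Map<Long, String> drain(String requestId) {
        Map<Long, String> run = RUNS.remove(requestId);
        return run != null ? run : Collections.<Long, String>emptyMap();
    }
}
```

Because drain() removes the request's map in one step, a second import that finishes later cannot see or disturb the first run's contents.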
Concurrency handling in DataImportHandler
Hi All,

Can I please know how concurrency is handled in the DIH? What happens if multiple /dataimport requests are issued to the same datasource?

I'm doing some custom processing at the end of the dataimport process, as an EventListener configured in the data-config.xml as below.

<document name="stanboldata" onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

Will each DIH request create a new EventListener object? I'm copying some field values from my custom processor configured in the /dataimport request handler to a static Map in my StanbolEventListener class. I need to figure out how to handle concurrency when data is copied to my EventListener object to perform the rest of my update process.

Thanks,
Dileepa
Re: Concurrency handling in DataImportHandler
I would particularly like to know how DIH handles concurrency in JDBC database connections during dataimport.

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/solrtest" user="usr1"
            password="123" batchSize="1"/>

Thanks,
Dileepa

On Thu, Jan 30, 2014 at 4:05 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

Can I please know how concurrency is handled in the DIH? What happens if multiple /dataimport requests are issued to the same datasource?

I'm doing some custom processing at the end of the dataimport process, as an EventListener configured in the data-config.xml as below.

<document name="stanboldata" onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

Will each DIH request create a new EventListener object? I'm copying some field values from my custom processor configured in the /dataimport request handler to a static Map in my StanbolEventListener class. I need to figure out how to handle concurrency when data is copied to my EventListener object to perform the rest of my update process.

Thanks,
Dileepa
Re: Concurrency handling in DataImportHandler
Hi All,

I triggered a /dataimport for the first 100 rows from my database, and while it was running issued another import request for rows 101-200. In my log I see the exception below; it seems multiple JDBC connections cannot be opened. Does this mean concurrency is not supported in DIH for JDBC datasources? Please share your thoughts on how to tackle concurrency in dataimport.

[Thread-15] ERROR org.apache.solr.handler.dataimport.JdbcDataSource - Ignoring Error when closing connection
java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@1e820764 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
    at com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:3314)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2477)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2731)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2809)
    at com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:5165)
    at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:5048)
    at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4654)
    at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1630)
    at org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:410)
    at org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:395)
    at org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:284)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)

Thanks,
Dileepa

On Thu, Jan 30, 2014 at 4:13 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

I would particularly like to know how DIH handles concurrency in JDBC database connections during dataimport.

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/solrtest" user="usr1"
            password="123" batchSize="1"/>

Thanks,
Dileepa

On Thu, Jan 30, 2014 at 4:05 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

Can I please know how concurrency is handled in the DIH? What happens if multiple /dataimport requests are issued to the same datasource?

I'm doing some custom processing at the end of the dataimport process, as an EventListener configured in the data-config.xml as below.

<document name="stanboldata" onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

Will each DIH request create a new EventListener object? I'm copying some field values from my custom processor configured in the /dataimport request handler to a static Map in my StanbolEventListener class. I need to figure out how to handle concurrency when data is copied to my EventListener object to perform the rest of my update process.

Thanks,
Dileepa
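A note on the exception above: the MySQL driver message itself states the constraint, namely that a connection with an open streaming result set cannot serve any further statements, so the second import collides with the first one's still-open result set. One pragmatic workaround, assuming the imports are fired from your own client code, is to serialize the triggers so a second import waits for the first to finish. A sketch; ImportGate is a hypothetical helper and not part of Solr, and the Runnable stands in for whatever code sends the /dataimport request:

```java
import java.util.concurrent.Semaphore;

public class ImportGate {
    // Fair semaphore with one permit: imports run strictly one at a time,
    // in arrival order, so the streaming connection is never shared.
    private static final Semaphore ONE_AT_A_TIME = new Semaphore(1, true);

    public static void runImport(Runnable triggerImport) {
        ONE_AT_A_TIME.acquireUninterruptibly(); // block until the previous import finishes
        try {
            triggerImport.run();
        } finally {
            ONE_AT_A_TIME.release();
        }
    }
}
```

This trades concurrency for correctness; whether that is acceptable depends on how urgent the second import is.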
Integrating OAuth2 with Solr MailEntityProcessor
Hi All,

I'm doing a research project on Email Reputation Analysis, and for this project I'm planning to use the Apache Solr, Tika and Mahout projects to analyse, store and query the reputation of emails and correspondents. For indexing emails in Solr I'm going to use the MailEntityProcessor [1]. But I see that it requires the user to provide their email credentials to the DIH, which is a security risk. I also feel the current MailEntityProcessor doesn't allow importing data from multiple mailboxes.

What do you think of integrating an authorization mechanism like OAuth2 in Solr? I'd appreciate your ideas on using this for indexing multiple mailboxes without requiring users to give out their usernames and passwords.

<document>
  <entity processor="MailEntityProcessor"
          user="someb...@gmail.com"
          password="something"
          host="imap.gmail.com"
          protocol="imaps"
          folders="x,y,z"/>
</document>

Regards,
Dileepa

[1] http://wiki.apache.org/solr/MailEntityProcessor
Re: Integrating OAuth2 with Solr MailEntityProcessor
Hi All,

I think OAuth2 integration is a valid use case for Solr when it comes to importing data from user accounts such as email, social networks, enterprise stores etc. Do you think OAuth2 integration in Solr would be a useful feature? If so, I would like to start working on this. I feel this could also be a good project for GSoC 2014.

Thanks,
Dileepa

On Wed, Jan 29, 2014 at 3:57 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

I'm doing a research project on Email Reputation Analysis, and for this project I'm planning to use the Apache Solr, Tika and Mahout projects to analyse, store and query the reputation of emails and correspondents. For indexing emails in Solr I'm going to use the MailEntityProcessor [1]. But I see that it requires the user to provide their email credentials to the DIH, which is a security risk. I also feel the current MailEntityProcessor doesn't allow importing data from multiple mailboxes.

What do you think of integrating an authorization mechanism like OAuth2 in Solr? I'd appreciate your ideas on using this for indexing multiple mailboxes without requiring users to give out their usernames and passwords.

<document>
  <entity processor="MailEntityProcessor"
          user="someb...@gmail.com"
          password="something"
          host="imap.gmail.com"
          protocol="imaps"
          folders="x,y,z"/>
</document>

Regards,
Dileepa

[1] http://wiki.apache.org/solr/MailEntityProcessor
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi Varun and all,

Thanks for your input.

On Mon, Jan 27, 2014 at 11:29 AM, Varun Thacker varunthacker1...@gmail.com wrote:

Hi Dileepa,

If I understand correctly, this is what happens in your system currently:
1. DIH sends data to Solr.
2. You have written a custom update processor (http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your Stanbol server for metadata, adds it to the document and then indexes it.

It's the part where you query the Stanbol server and wait for the response which takes time, and you want to reduce this.

Yes, this is what I'm trying to achieve. For each document I'm sending the value of the content field to Stanbol, and I process the Stanbol response to add certain metadata fields to the document in my UpdateRequestProcessor.

According to me, instead of waiting for your response from the Stanbol server and then indexing it, you could send the required field data from the doc to your Stanbol server and continue. Once Stanbol has enriched the document, you re-index the document and update it with the metadata.

To update a document I need to invoke an /update request with the doc id and the field to update/add. So in the method you have suggested, for each Stanbol request I will need to process the response and create a Solr /update query to update the document with the Stanbol enhancements. To Stanbol I just send the value of the content to be enhanced, and no document ID is sent. How would you recommend executing the Stanbol request-response handling process separately?

Currently what I have done in my custom update processor is as below; I process the Stanbol response and add NLP fields to the document in the processAdd() method of my UpdateRequestProcessor.

    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        String request = "";
        for (String field : STANBOL_REQUEST_FIELDS) {
            if (null != doc.getFieldValue(field)) {
                request += (String) doc.getFieldValue(field) + ". ";
            }
        }
        try {
            EnhancementResult result = stanbolPost(request, getBaseURI());
            Collection<TextAnnotation> textAnnotations = result.getTextAnnotations();
            // extracting text annotations
            Set<String> personSet = new HashSet<String>();
            Set<String> orgSet = new HashSet<String>();
            for (TextAnnotation text : textAnnotations) {
                String type = text.getType();
                String language = text.getLanguage();
                langSet.add(language);
                String selectedText = text.getSelectedText();
                if (null != type && null != selectedText) {
                    if (type.equalsIgnoreCase(StanbolConstants.PERSON)) {
                        personSet.add(selectedText);
                    } else if (type.equalsIgnoreCase(StanbolConstants.ORGANIZATION)) {
                        orgSet.add(selectedText);
                    }
                }
            }
            Collection<EntityAnnotation> entityAnnotations = result.getEntityAnnotations();
            for (String person : personSet) {
                doc.addField(NLP_PERSON, person);
            }
            for (String org : orgSet) {
                doc.addField(NLP_ORGANIZATION, org);
            }
            cmd.solrDoc = doc;
            super.processAdd(cmd);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    private EnhancementResult stanbolPost(String request, URI uri) {
        Client client = Client.create();
        WebResource webResource = client.resource(uri);
        ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
                .accept(new MediaType("application", "rdf+xml"))
                .entity(request, MediaType.TEXT_PLAIN)
                .post(ClientResponse.class);
        int status = response.getStatus();
        if (status != 200 && status != 201 && status != 202) {
            throw new RuntimeException("Failed : HTTP error code : " + response.getStatus());
        }
        String output = response.getEntity(String.class);
        // Parse the RDF model
        Model model = ModelFactory.createDefaultModel();
        StringReader reader = new StringReader(output);
        model.read(reader, null);
        return new EnhancementResult(model);
    }

Thanks,
Dileepa

This method makes you re-index the document but the changes from your client would be visible faster.
Alternatively, you could do the same thing at the DIH level by writing a custom Transformer (http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers).

On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi Ahmet,

On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi,

Here is what I understand from your question. You have a custom update processor that runs with DIH, but it is slow. You want to run that text enhancement component after DIH. How would this help to speed things up? In this approach you will read/query/search already indexed and committed Solr documents and run the text enhancement thing on them. Probably this process will add new additional fields, and then you will update these Solr documents? Did I understand your use case correctly?

Yes, that is exactly what I want to achieve. I want to separate the enhancement process from the dataimport process. The dataimport process will be invoked by a client when new data is added/updated in the mysql database
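Varun's suggestion of indexing first and updating later lends itself to a fire-and-forget pattern: hand the content to a background pool at index time, then issue a separate /update once the enhancement returns. A minimal stdlib sketch of that shape; AsyncEnhancer is a hypothetical helper, the Function stands in for the Stanbol HTTP call, and the BiConsumer for the later SolrJ update of (docId, enhanced value):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;
import java.util.function.Function;

public class AsyncEnhancer {
    // Background pool so the /dataimport thread never waits on Stanbol.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // enhance: content -> enhanced metadata; updateDoc: (docId, enhanced) -> /update call.
    public Future<?> enqueue(final String docId, final String content,
                             final Function<String, String> enhance,
                             final BiConsumer<String, String> updateDoc) {
        return pool.submit(new Runnable() {
            public void run() {
                updateDoc.accept(docId, enhance.apply(content));
            }
        });
    }
}
```

The key point is that the document id travels with the task, so the slow enhancement response can still be matched back to the right document, which addresses the "no document ID is sent to Stanbol" concern.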
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi All,

I have implemented my requirement as an EventListener which runs on importEnd of the DataImportHandler. I'm running a SolrJ-based client to send Stanbol enhancement updates to the documents within my EventListener.

Thanks,
Dileepa

On Mon, Jan 27, 2014 at 4:34 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi Varun and all,

Thanks for your input.

On Mon, Jan 27, 2014 at 11:29 AM, Varun Thacker varunthacker1...@gmail.com wrote:

Hi Dileepa,

If I understand correctly, this is what happens in your system currently:
1. DIH sends data to Solr.
2. You have written a custom update processor (http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your Stanbol server for metadata, adds it to the document and then indexes it.

It's the part where you query the Stanbol server and wait for the response which takes time, and you want to reduce this.

Yes, this is what I'm trying to achieve. For each document I'm sending the value of the content field to Stanbol, and I process the Stanbol response to add certain metadata fields to the document in my UpdateRequestProcessor.

According to me, instead of waiting for your response from the Stanbol server and then indexing it, you could send the required field data from the doc to your Stanbol server and continue. Once Stanbol has enriched the document, you re-index the document and update it with the metadata.

To update a document I need to invoke an /update request with the doc id and the field to update/add. So in the method you have suggested, for each Stanbol request I will need to process the response and create a Solr /update query to update the document with the Stanbol enhancements. To Stanbol I just send the value of the content to be enhanced, and no document ID is sent. How would you recommend executing the Stanbol request-response handling process separately?
Currently what I have done in my custom update processor is as below; I process the Stanbol response and add NLP fields to the document in the processAdd() method of my UpdateRequestProcessor.

    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        String request = "";
        for (String field : STANBOL_REQUEST_FIELDS) {
            if (null != doc.getFieldValue(field)) {
                request += (String) doc.getFieldValue(field) + ". ";
            }
        }
        try {
            EnhancementResult result = stanbolPost(request, getBaseURI());
            Collection<TextAnnotation> textAnnotations = result.getTextAnnotations();
            // extracting text annotations
            Set<String> personSet = new HashSet<String>();
            Set<String> orgSet = new HashSet<String>();
            for (TextAnnotation text : textAnnotations) {
                String type = text.getType();
                String language = text.getLanguage();
                langSet.add(language);
                String selectedText = text.getSelectedText();
                if (null != type && null != selectedText) {
                    if (type.equalsIgnoreCase(StanbolConstants.PERSON)) {
                        personSet.add(selectedText);
                    } else if (type.equalsIgnoreCase(StanbolConstants.ORGANIZATION)) {
                        orgSet.add(selectedText);
                    }
                }
            }
            Collection<EntityAnnotation> entityAnnotations = result.getEntityAnnotations();
            for (String person : personSet) {
                doc.addField(NLP_PERSON, person);
            }
            for (String org : orgSet) {
                doc.addField(NLP_ORGANIZATION, org);
            }
            cmd.solrDoc = doc;
            super.processAdd(cmd);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    private EnhancementResult stanbolPost(String request, URI uri) {
        Client client = Client.create();
        WebResource webResource = client.resource(uri);
        ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
                .accept(new MediaType("application", "rdf+xml"))
                .entity(request, MediaType.TEXT_PLAIN)
                .post(ClientResponse.class);
        int status = response.getStatus();
        if (status != 200 && status != 201 && status != 202) {
            throw new RuntimeException("Failed : HTTP error code : " + response.getStatus());
        }
        String output = response.getEntity(String.class);
        // Parse the RDF model
        Model model = ModelFactory.createDefaultModel();
        StringReader reader = new StringReader(output);
        model.read(reader, null);
        return new EnhancementResult(model);
    }

Thanks,
Dileepa

This method makes you re-index the document but the changes from your client would be visible faster. Alternatively, you could do the same thing at the DIH level by writing a custom Transformer (http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers).

On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi Ahmet,

On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi,

Here is what I understand from your question. You have a custom update processor that runs with DIH, but it is slow. You want to run that text enhancement component after DIH. How would this help to speed things up? In this approach you will read/query/search already indexed and committed Solr
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi all,

Any ideas on how to run a reindex update process for all the documents imported from a /dataimport query? Appreciate your help.

Thanks,
Dileepa

On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

I did some research on this and found some alternatives useful to my use case. Please give your ideas.

Can I update all documents indexed after a /dataimport query using the last_index_time in dataimport.properties? If so, can anyone please give me some pointers? What I currently have in mind is something like below;

1. Store the indexing timestamp of the document as a field, e.g.
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
2. Read the last_index_time from dataimport.properties.
3. Query all document ids indexed after the last_index_time and send them through the Stanbol update processor.

But I have a question here: does the last_index_time refer to when the dataimport is started (onImportStart) or when the dataimport is finished (onImportEnd)? If it's the onImportEnd timestamp, then this solution won't work, because the timestamp indexed in the document field will be: onImportStart < doc-index-timestamp < onImportEnd.

Another alternative I can think of is to trigger an update chain via an EventListener configured to run after a dataimport is processed (onImportEnd). In this case, can the Context in DIH give the list of document ids processed in the /dataimport request? If so, I can send those doc ids with an /update query to run the Stanbol update process.

Please give me your ideas and suggestions.

Thanks,
Dileepa

On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

I have a Solr requirement to send all the documents imported from a /dataimport query through another update chain as a separate background process. Currently I have configured my custom update chain in the /dataimport handler itself. But since my custom update process needs to connect to an external enhancement engine (Apache Stanbol) to enhance the documents with some NLP fields, it has a negative impact on the /dataimport process. The solution will be to have a separate update process running to enhance the content of the documents imported from /dataimport.

Currently I have configured my custom Stanbol processor as below in my /dataimport handler.

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

What I need now is to separate the two processes of dataimport and Stanbol enhancement. So this is like running a separate re-indexing process periodically over the documents imported from /dataimport for Stanbol fields. The question is how to trigger my Stanbol update process on the documents imported from /dataimport. In Solr, to trigger an /update query we need to know the id and the fields of the document to be updated. In my case I need to run all the documents imported from the previous /dataimport process through a stanbol update.chain. Is there a way to keep track of the document ids imported from /dataimport?

Any advice or pointers will be really helpful.

Thanks,
Dileepa
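For step 3 of the timestamp alternative above, the Solr range query can be built from the stored last_index_time. A small sketch; it assumes the property holds a local "yyyy-MM-dd HH:mm:ss" value (treated as UTC here for simplicity) and that the schema field from step 1 is named "timestamp", both of which are assumptions from this thread rather than fixed Solr behaviour:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class LastIndexQuery {
    // Convert DIH's "yyyy-MM-dd HH:mm:ss" into the ISO-8601 form Solr
    // date queries expect, and wrap it in a range query up to NOW.
    public static String rangeQuery(String lastIndexTime) throws ParseException {
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        in.setTimeZone(TimeZone.getTimeZone("UTC"));  // assumption: property is UTC
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        Date d = in.parse(lastIndexTime);
        return "timestamp:[" + out.format(d) + " TO NOW]";
    }
}
```

The resulting string can be passed as the q (or fq) parameter of a /select request whose matching ids are then fed to the Stanbol update chain.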
What is the last_index_time in dataimport.properties?
Hi All,

Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started, or the time that the last dataimport process finished?

Thanks,
Dileepa
Re: What is the last_index_time in dataimport.properties?
Hi Ahmet,

Thanks a lot. It means I can use the last_index_time to query documents indexed during the last dataimport request? I need to run a subsequent update process on all documents imported from a dataimport.

Thanks,
Dileepa

On Mon, Jan 27, 2014 at 1:33 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi Dileepa,

It is the time that the last dataimport process started, so it is safe to use it when considering updated documents during the import.

Ahmet

On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started, or the time that the last dataimport process finished?

Thanks,
Dileepa
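On reading the value itself: dataimport.properties is an ordinary Java properties file, where the colons inside the timestamp are backslash-escaped on disk (e.g. last_index_time=2014-01-26 12\:00\:00), so java.util.Properties unescapes them for you. A small sketch, loading from a String for illustration; against a real core you would open a FileReader on conf/dataimport.properties instead:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DihProps {
    // Parse the properties text and return the raw last_index_time value,
    // with the "\:" escapes already resolved by Properties.load().
    public static String lastIndexTime(String propsText) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(propsText));
        return p.getProperty("last_index_time");
    }
}
```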
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi Ahmet,

On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi,

Here is what I understand from your question. You have a custom update processor that runs with DIH, but it is slow. You want to run that text enhancement component after DIH. How would this help to speed things up? In this approach you will read/query/search already indexed and committed Solr documents and run the text enhancement thing on them. Probably this process will add new additional fields, and then you will update these Solr documents? Did I understand your use case correctly?

Yes, that is exactly what I want to achieve. I want to separate the enhancement process from the dataimport process. The dataimport process will be invoked by a client when new data is added/updated in the mysql database. Therefore the documents, with their mandatory fields, should be indexed as soon as possible. The mandatory fields are mapped to the data table columns in data-config.xml, and the normal /dataimport process doesn't take much time.

The enhancements are done in my custom processor by sending the content field of the document to an external Stanbol [1] server to detect NLP enhancements. New NLP fields (detected persons, organizations and places in the content) are then added to the document in the custom update processor, and if this is executed during the dataimport process it takes a lot of time. The NLP fields are not mandatory for the primary usage of the application, which is to query documents by their mandatory fields; they are only required for custom queries on Person and Organization entities. Therefore the NLP update process should run as a background process detached from the primary /dataimport process, and it should not slow down the existing /dataimport process. That's why I am looking for the best way to achieve my objective: a way to separately update the imported documents from /dataimport to detect NLP enhancements.
Currently I'm considering a timestamp-based approach: trigger an /update query for all documents imported after the last_index_time in dataimport.properties and update them with NLP fields.

Hope my requirement is clear :). Appreciate your suggestions.

[1] http://stanbol.apache.org/

On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi all,

Any ideas on how to run a reindex update process for all the documents imported from a /dataimport query? Appreciate your help.

Thanks,
Dileepa

On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

I did some research on this and found some alternatives useful to my use case. Please give your ideas.

Can I update all documents indexed after a /dataimport query using the last_index_time in dataimport.properties? If so, can anyone please give me some pointers? What I currently have in mind is something like below;

1. Store the indexing timestamp of the document as a field, e.g.
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
2. Read the last_index_time from dataimport.properties.
3. Query all document ids indexed after the last_index_time and send them through the Stanbol update processor.

But I have a question here: does the last_index_time refer to when the dataimport is started (onImportStart) or when the dataimport is finished (onImportEnd)? If it's the onImportEnd timestamp, then this solution won't work, because the timestamp indexed in the document field will be: onImportStart < doc-index-timestamp < onImportEnd.

Another alternative I can think of is to trigger an update chain via an EventListener configured to run after a dataimport is processed (onImportEnd). In this case, can the Context in DIH give the list of document ids processed in the /dataimport request? If so, I can send those doc ids with an /update query to run the Stanbol update process.

Please give me your ideas and suggestions.
Thanks,
Dileepa

On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

I have a Solr requirement to send all the documents imported from a /dataimport query through another update chain as a separate background process. Currently I have configured my custom update chain in the /dataimport handler itself. But since my custom update process needs to connect to an external enhancement engine (Apache Stanbol) to enhance the documents with some NLP fields, it has a negative impact on the /dataimport process. The solution will be to have a separate update process running to enhance the content of the documents imported from /dataimport.

Currently I have configured my custom Stanbol processor as below in my /dataimport handler.

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name
Re: What is the last_index_time in dataimport.properties?
Yes Ahmet. I want to use the last_index_time to find the documents imported in the last /dataimport process and send them through an update process. I have explained this requirement in my other thread.

Thanks,
Dileepa

On Mon, Jan 27, 2014 at 3:23 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi,

last_index_time is traditionally used to query the database. But it seems that you want to query Solr, right?

On Sunday, January 26, 2014 11:15 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi Ahmet,

Thanks a lot. It means I can use the last_index_time to query documents indexed during the last dataimport request? I need to run a subsequent update process on all documents imported from a dataimport.

Thanks,
Dileepa

On Mon, Jan 27, 2014 at 1:33 AM, Ahmet Arslan iori...@yahoo.com wrote:

Hi Dileepa,

It is the time that the last dataimport process started, so it is safe to use it when considering updated documents during the import.

Ahmet

On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote:

Hi All,

Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started, or the time that the last dataimport process finished?

Thanks,
Dileepa
How to run a subsequent update query to documents indexed from a dataimport query
Hi All, I have a Solr requirement to send all the documents imported from a /dataimport query through another update chain as a separate background process. Currently I have configured my custom update chain in the /dataimport handler itself. But since my custom update process needs to connect to an external enhancement engine (Apache Stanbol) to enhance the documents with some NLP fields, it has a negative impact on the /dataimport process. The solution would be to have a separate update process running to enhance the content of the documents imported from /dataimport. Currently I have configured my custom Stanbol processor as below in my /dataimport handler:

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

What I need now is to separate the 2 processes of dataimport and stanbol-enhancement. So this is like running a separate re-indexing process periodically over the documents imported from /dataimport for the Stanbol fields. The question is how to trigger my Stanbol update process for the documents imported from /dataimport. In Solr, to trigger an /update query we need to know the id and the fields of the document to be updated. In my case I need to run all the documents imported from the previous /dataimport process through a stanbol update.chain. Is there a way to keep track of the document ids imported from /dataimport? Any advice or pointers will be really helpful. Thanks, Dileepa
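One way to approach the "keep track of document ids" question is to have the interceptor in the update chain record each imported id into a thread-safe queue that the follow-up enhancement job drains. A minimal plain-Java sketch (the class and method names here are hypothetical, not Solr APIs; in Solr the record() call would sit inside an UpdateRequestProcessor's processAdd()):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical thread-safe collector: an update processor records each
// document id it sees, and the post-import enhancement job drains them.
public class ImportedIdCollector {
    private final ConcurrentLinkedQueue<String> ids = new ConcurrentLinkedQueue<>();

    // Called once per imported document (e.g. from processAdd()).
    public void record(String docId) {
        ids.add(docId);
    }

    // Drained by the post-import job; poll() is safe under concurrent record()s.
    public List<String> drain() {
        List<String> snapshot = new ArrayList<>();
        String id;
        while ((id = ids.poll()) != null) {
            snapshot.add(id);
        }
        return snapshot;
    }
}
```

The drained ids could then be sent through the Stanbol update chain without touching the /dataimport request path.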
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi All, I did some research on this and found some alternatives useful for my usecase. Please give your ideas. Can I update all documents indexed after a /dataimport query using the last_index_time in dataimport.properties? If so, can anyone please give me some pointers? What I currently have in mind is something like below:

1. Store the indexing timestamp of the document as a field, e.g.:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
2. Read the last_index_time from dataimport.properties.
3. Query all document ids indexed after the last_index_time and send them through the Stanbol update processor.

But I have a question here: does the last_index_time refer to when the dataimport is started (onImportStart) or when the dataimport is finished (onImportEnd)? If it's the onImportEnd timestamp, then this solution won't work, because the timestamp indexed in the document field will satisfy: onImportStart < doc-index-timestamp < onImportEnd. Another alternative I can think of is to trigger an update chain via an EventListener configured to run after a dataimport is processed (onImportEnd). In this case, can the Context in DIH give the list of document ids processed in the /dataimport request? If so I can send those doc ids with an /update query to run the Stanbol update process. Please give me your ideas and suggestions. Thanks, Dileepa On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I have a Solr requirement to send all the documents imported from a /dataimport query through another update chain as a separate background process. Currently I have configured my custom update chain in the /dataimport handler itself. But since my custom update process needs to connect to an external enhancement engine (Apache Stanbol) to enhance the documents with some NLP fields, it has a negative impact on the /dataimport process.
The solution would be to have a separate update process running to enhance the content of the documents imported from /dataimport. Currently I have configured my custom Stanbol processor as below in my /dataimport handler:

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

What I need now is to separate the 2 processes of dataimport and stanbol-enhancement. So this is like running a separate re-indexing process periodically over the documents imported from /dataimport for the Stanbol fields. The question is how to trigger my Stanbol update process for the documents imported from /dataimport. In Solr, to trigger an /update query we need to know the id and the fields of the document to be updated. In my case I need to run all the documents imported from the previous /dataimport process through a stanbol update.chain. Is there a way to keep track of the document ids imported from /dataimport? Any advice or pointers will be really helpful. Thanks, Dileepa
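The last_index_time approach discussed above can be sketched in plain Java. Assumptions, not confirmed by the thread: DIH writes last_index_time into dataimport.properties in a form like "2014-01-26 21:10:00", and "timestamp" is the hypothetical stored date field that defaults to NOW at indexing time:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Sketch: read last_index_time from dataimport.properties and build the
// Solr range query that selects documents indexed by the last import run.
public class DeltaQueryBuilder {

    public static String buildQuery(Reader propsSource) {
        Properties props = new Properties();
        try {
            props.load(propsSource);
        } catch (IOException e) {
            throw new RuntimeException("could not read dataimport.properties", e);
        }
        String lastIndexTime = props.getProperty("last_index_time");
        // DIH stores e.g. "2014-01-26 21:10:00"; Solr date syntax wants "...T...Z".
        return "timestamp:[" + lastIndexTime.replace(" ", "T") + "Z TO NOW]";
    }
}
```

The resulting query string would be sent as the q parameter of a search whose matching ids are then pushed through the Stanbol update chain.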
Concurrent request configurations for Solr Processors
Hi All, I have written a custom update request processor and configured an UpdateRequestProcessor chain in solrconfig.xml as below:

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Can I please know how to configure the number of concurrent requests for my processor? What is the default number of concurrent requests per Solr processor? Thanks, Dileepa
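As far as I know, Solr has no per-processor concurrency setting — update processors simply run on the servlet container's request threads, so the effective concurrency is whatever the container allows. If the goal is to cap concurrent work done inside the processor (for example outbound enhancement calls), one option is a shared Semaphore; the class below is an illustrative plain-Java sketch, not a Solr API:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: cap how many threads may be inside an expensive section
// (e.g. an outbound enhancement request) at the same time.
public class BoundedSection {
    private final Semaphore permits;
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    public BoundedSection(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public void run(Runnable work) {
        permits.acquireUninterruptibly();   // blocks once maxConcurrent are in flight
        try {
            int now = active.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);  // track observed concurrency
            work.run();
        } finally {
            active.decrementAndGet();
            permits.release();
        }
    }

    public int observedPeak() { return peak.get(); }
}
```

The processor factory would hold one BoundedSection instance and wrap each enhancement call in run(), so request threads queue up instead of flooding the external server.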
Re: Passing a Parameter to a Custom Processor
Thanks a lot for the info Koji. I'm going through the source code to find out. Regards, Dileepa On Fri, Dec 13, 2013 at 5:40 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Dileepa, The stanbolInterceptor processor chain will be used in multiple request handlers. Then I will have to pass the stanbol.enhancer.url param in each of those request handlers, which will cause redundant configuration. Therefore I need to pass the param to the processor directly. But when I pass the params to the processor as below, the parameter is not received by my ProcessorFactory class:

<processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory">
  <str name="stanbol.enhancer.url">http://localhost:8080/enhancer</str>
</processor>

Can someone point out what might be wrong here? Can someone please advise on how to pass parameters directly to the processor? I don't know why your Processor cannot get the parameters, but the Processor should get them. For example, StatelessScriptUpdateProcessorFactory can get the script parameter like this:

<processor class="solr.StatelessScriptUpdateProcessorFactory">
  <str name="script">updateProcessor.js</str>
</processor>

http://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html So why don't you consult the source code of StatelessScriptUpdateProcessorFactory, etc? koji -- http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
Passing a Parameter to a Custom Processor
Hi All, I have written a custom update-request processor and need to pass certain parameters to the processor. I believe solrconfig.xml is the place to pass these parameters. At the moment I define my parameter in the request handler as below:

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
    <str name="stanbol.enhancer.url">http://localhost:8080/enhancer</str>
  </lst>
</requestHandler>

My processor is defined in the stanbolInterceptor update.chain as below:

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The stanbolInterceptor processor chain will be used in multiple request handlers. Then I will have to pass the stanbol.enhancer.url param in each of those request handlers, which will cause redundant configuration. Therefore I need to pass the param to the processor directly. But when I pass the params to the processor as below, the parameter is not received by my ProcessorFactory class:

<processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory">
  <str name="stanbol.enhancer.url">http://localhost:8080/enhancer</str>
</processor>

Can someone point out what might be wrong here? Can someone please advise on how to pass parameters directly to the processor? Thanks, Dileepa
Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode
Thanks all, for your valuable ideas on this matter. I will try them. :) Regards, Dileepa On Sun, Dec 1, 2013 at 6:05 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is no support for throttling built into DIH. You can probably write a Transformer which sleeps a while after every N requests to simulate throttling. On 26 Nov 2013 14:21, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I have a requirement to import a large amount of data from a MySQL database and index the documents (about 1000 documents). During the indexing process I need to do special processing of a field by sending enhancement requests to an external Apache Stanbol server. I have configured my dataimport handler in solrconfig.xml to use the StanbolContentProcessor in the update chain, as below:

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

My sample data-config.xml is as below:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/solrTest"
              user="test" password="test123" batchSize="1"/>
  <document name="stanboldata">
    <entity name="stanbolrequest" query="SELECT * FROM documents">
      <field column="id" name="id"/>
      <field column="content" name="content"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

When running a large import with about 1000 documents, my Stanbol server goes down, I suspect due to heavy load from the above Solr StanbolInterceptor. I would like to throttle the dataimport in batches, so that Stanbol can process a manageable number of requests concurrently. Is this achievable using the batchSize parameter in the dataSource element of the data-config?
Can someone please give some ideas to throttle the dataimport load in Solr? Thanks, Dileepa
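Shalin's sleep-after-every-N-rows suggestion reduces to logic like the following plain-Java sketch. In a real DIH Transformer the counter and sleep would live inside transformRow(); the class and parameter names here are illustrative:

```java
// Sketch of "sleep after every N rows" throttling, as suggested above.
// A DIH Transformer would call onRow() once per transformed row.
public class BatchThrottle {
    private final int batchSize;
    private final long pauseMillis;
    private int seen = 0;
    private int pauses = 0;

    public BatchThrottle(int batchSize, long pauseMillis) {
        this.batchSize = batchSize;
        this.pauseMillis = pauseMillis;
    }

    // Call once per row; pauses after every batchSize rows.
    public void onRow() {
        seen++;
        if (seen % batchSize == 0) {
            pauses++;
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve interrupt status
            }
        }
    }

    public int pausesTaken() { return pauses; }
}
```

With batchSize=10 and a pause of a few hundred milliseconds, a 1000-document import would give the external server regular breathing room at the cost of a slower import.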
Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode
I actually tweaked the Stanbol server to handle more results and successfully ran a 10K-document import within 30 minutes with no server issue. I'm now looking at further improving the results with regard to efficiency and NLP accuracy. Thanks, Dileepa On Sun, Dec 1, 2013 at 8:17 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Thanks all, for your valuable ideas on this matter. I will try them. :) Regards, Dileepa On Sun, Dec 1, 2013 at 6:05 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is no support for throttling built into DIH. You can probably write a Transformer which sleeps a while after every N requests to simulate throttling.
How to use batchSize in DataImportHandler to throttle updates in a batch-mode
Hi All, I have a requirement to import a large amount of data from a MySQL database and index the documents (about 1000 documents). During the indexing process I need to do special processing of a field by sending enhancement requests to an external Apache Stanbol server. I have configured my dataimport handler in solrconfig.xml to use the StanbolContentProcessor in the update chain, as below:

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

My sample data-config.xml is as below:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/solrTest"
              user="test" password="test123" batchSize="1"/>
  <document name="stanboldata">
    <entity name="stanbolrequest" query="SELECT * FROM documents">
      <field column="id" name="id"/>
      <field column="content" name="content"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

When running a large import with about 1000 documents, my Stanbol server goes down, I suspect due to heavy load from the above Solr StanbolInterceptor. I would like to throttle the dataimport in batches, so that Stanbol can process a manageable number of requests concurrently. Is this achievable using the batchSize parameter in the dataSource element of the data-config? Can someone please give some ideas to throttle the dataimport load in Solr? Thanks, Dileepa
Re: An UpdateHandler to run following a MySql DataImport
I found out that you can configure any requestHandler to run an update processor chain. So in my /dataimport requestHandler I just referenced my custom chain:

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

It works. Thanks, Dileepa On Fri, Nov 15, 2013 at 6:08 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, don't quite know the answer to that, but when things start getting complex with DIH, you should seriously consider a SolrJ solution unless someone comes up with a quick fix. Here's an example: http://searchhub.org/2012/02/14/indexing-with-solrj/ Best, Erick On Fri, Nov 15, 2013 at 2:48 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I have written a custom update request processor to do some custom processing of documents, and configured the /update handler to use my custom processor in the default update.chain. The same chain should be configured for the data-import handler when it loads documents into the Solr index. Is there a way to configure the dataimport handler to use my custom update chain? If not, how can I perform the required custom processing of the documents while importing data from a MySQL database? Thanks, Dileepa
An UpdateHandler to run following a MySql DataImport
Hi All, I have written a custom update request processor to do some custom processing of documents, and configured the /update handler to use my custom processor in the default update.chain. The same chain should be configured for the data-import handler when it loads documents into the Solr index. Is there a way to configure the dataimport handler to use my custom update chain? If not, how can I perform the required custom processing of the documents while importing data from a MySQL database? Thanks, Dileepa
Re: Indexing a token to a different field in a custom filter
I need to index the processed token into a different field (e.g. stanbolResponse) in the same document that's being indexed. I am looking for a way to retrieve the document id from the TokenStream so that I can update the same document with new field values. (In my sample code above I'm adding a new document instead of updating the same document.) Any pointers please? Thanks, Dileepa On Tue, Nov 12, 2013 at 12:01 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, In my custom filter, I need to index the processed token into a different field. The processed token is a Stanbol enhancement response. The solution I have so far found is to use a Solr client (SolrJ) to add a new document with my processed field into Solr. Below is the sample code segment:

SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("stanbolResponse", response);
try {
    server.add(doc1);
    server.commit();
} catch (SolrServerException e) {
    e.printStackTrace();
}

This mechanism requires a new HTTP call to the local Solr server for every token I process for the stanbolRequest field, and I feel it's not very efficient. Is there any other alternative way to invoke an update request to add a new field to the document being indexed within the filter (without making an explicit HTTP call using SolrJ)? Thanks, Dileepa
Re: Indexing a token to a different field in a custom filter
Thanks all for your valuable inputs. I looked at the suggested solutions and I too feel a custom update processor during indexing will be the best solution to handle the content field, changing the value and storing it in another field. Do I only need to change the below request handler to intercept all indexed documents and perform my custom analysis during indexing? Or do I need to change any other request handler as well?

<requestHandler name="/update" class="solr.UpdateRequestHandler">

Thanks, Dileepa On Tue, Nov 12, 2013 at 7:37 PM, Jack Krupansky j...@basetechnology.com wrote: Any kind of cross-field processing is best done in an update processor. There are a lot of built-in update processors as well as a JavaScript script update processor. -- Jack Krupansky -----Original Message----- From: Dileepa Jayakody Sent: Tuesday, November 12, 2013 1:31 AM To: solr-user@lucene.apache.org Subject: Indexing a token to a different field in a custom filter Hi All, In my custom filter, I need to index the processed token into a different field. The processed token is a Stanbol enhancement response. The solution I have so far found is to use a Solr client (SolrJ) to add a new document with my processed field into Solr. Below is the sample code segment:

SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("stanbolResponse", response);
try {
    server.add(doc1);
    server.commit();
} catch (SolrServerException e) {
    e.printStackTrace();
}

This mechanism requires a new HTTP call to the local Solr server for every token I process for the stanbolRequest field, and I feel it's not very efficient. Is there any other alternative way to invoke an update request to add a new field to the document being indexed within the filter (without making an explicit HTTP call using SolrJ)? Thanks, Dileepa
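Jack's advice boils down to "read one field of the in-flight document, write another" before it is indexed. Here is a plain-Java analogue of that cross-field step over a Map; the real hook would be UpdateRequestProcessor.processAdd() working on a SolrInputDocument, and the enhance() method below is a placeholder, not the actual Stanbol call:

```java
import java.util.Map;

// Plain-Java analogue of a cross-field update processor: derive a new
// field from an existing one before the document reaches the index.
public class CrossFieldStep {

    // Placeholder for the real enhancement call (e.g. posting to Stanbol).
    static String enhance(String content) {
        return "enhanced:" + content;
    }

    // Reads "content", writes the derived value to "stanbolResponse".
    public static Map<String, Object> process(Map<String, Object> doc) {
        Object content = doc.get("content");
        if (content != null) {
            doc.put("stanbolResponse", enhance(content.toString()));
        }
        return doc;
    }
}
```

Because the processor sees the whole document at once, no second /update call or document-id lookup is needed — which is exactly why cross-field work is easier here than inside a token filter.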
HTTP 500 error when invoking a REST client in Solr Analyzer
Hi All, I am working on a custom analyzer in Solr to post content to Apache Stanbol for enhancement during indexing. To post content to Stanbol, inside my custom analyzer's incrementToken() method I have written the below code using the Jersey client API sample [1]:

public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }
    char[] buffer = charTermAttr.buffer();
    String content = new String(buffer);
    Client client = Client.create();
    WebResource webResource = client.resource("http://localhost:8080/enhancer");
    ClientResponse response = webResource.type("text/plain")
            .accept(new MediaType("application", "rdf+xml"))
            .post(ClientResponse.class, content);
    int status = response.getStatus();
    if (status != 200 && status != 201 && status != 202) {
        throw new RuntimeException("Failed : HTTP error code : " + response.getStatus());
    }
    String output = response.getEntity(String.class);
    System.out.println(output);
    charTermAttr.setEmpty();
    char[] newBuffer = output.toCharArray();
    charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);
    return true;
}

When testing the analyzer I always get an HTTP 500 response from the Stanbol server and I cannot process the enhancement response properly. But I could successfully execute the same Jersey client code in a standalone Java application (in a main method) and retrieve the desired enhancement response from Stanbol. Any ideas why I always get an HTTP 500 error when invoking a REST endpoint in a Solr analyzer? Could it be a permission problem in my Solr analyzer? Appreciate your help.
Thanks, Dileepa [1] https://blogs.oracle.com/enterprisetechtips/entry/consuming_restful_web_services_with [2] 6424 [qtp918598659-11] ERROR org.apache.solr.core.SolrCore – java.lang.RuntimeException: Failed : HTTP error code : 500 at com.solr.test.analyzer.ContentFilter.incrementToken(ContentFilter.java:70) at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeTokenStream(AnalysisRequestHandlerBase.java:179) at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:126) at org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:221) at org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:190) at org.apache.solr.handler.FieldAnalysisRequestHandler.doAnalysis(FieldAnalysisRequestHandler.java:101) at org.apache.solr.handler.AnalysisRequestHandlerBase.handleRequestBody(AnalysisRequestHandlerBase.java:59) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at
Re: HTTP 500 error when invoking a REST client in Solr Analyzer
This seems to be a weird intermittent issue when I use the Analysis UI (http://localhost:8983/solr/#/collection1/analysis) for testing my analyzer. It works fine when I hard-code the input value in the analyzer and index. I gave the same input, "Tim Bernes Lee is a professor at MIT", both hard-coded in the analyzer class and from the Solr Analysis UI. The UI response failed intermittently when I adjusted the field value. It could be a problem with the character encoding of the field value, it seems. Thanks, Dileepa On Tue, Nov 12, 2013 at 1:33 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I am working on a custom analyzer in Solr to post content to Apache Stanbol for enhancement during indexing. To post content to Stanbol, inside my custom analyzer's incrementToken() method I have written the below code using the Jersey client API sample [1]:

public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }
    char[] buffer = charTermAttr.buffer();
    String content = new String(buffer);
    Client client = Client.create();
    WebResource webResource = client.resource("http://localhost:8080/enhancer");
    ClientResponse response = webResource.type("text/plain")
            .accept(new MediaType("application", "rdf+xml"))
            .post(ClientResponse.class, content);
    int status = response.getStatus();
    if (status != 200 && status != 201 && status != 202) {
        throw new RuntimeException("Failed : HTTP error code : " + response.getStatus());
    }
    String output = response.getEntity(String.class);
    System.out.println(output);
    charTermAttr.setEmpty();
    char[] newBuffer = output.toCharArray();
    charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);
    return true;
}

When testing the analyzer I always get an HTTP 500 response from the Stanbol server and I cannot process the enhancement response properly. But I could successfully execute the same Jersey client code in a standalone Java application (in a main method) and retrieve the desired enhancement response from Stanbol.
Any ideas why I always get a HTTP 500 error when invoking a rest endpoint in Solr analyzer? Could it be a permission problem in my Solr analyzer ? Appreciate your help. Thanks, Dileepa [1] https://blogs.oracle.com/enterprisetechtips/entry/consuming_restful_web_services_with [2] 6424 [qtp918598659-11] ERROR org.apache.solr.core.SolrCore – java.lang.RuntimeException: Failed : HTTP error code : 500 at com.solr.test.analyzer.ContentFilter.incrementToken(ContentFilter.java:70) at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeTokenStream(AnalysisRequestHandlerBase.java:179) at org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:126) at org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:221) at org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:190) at org.apache.solr.handler.FieldAnalysisRequestHandler.doAnalysis(FieldAnalysisRequestHandler.java:101) at org.apache.solr.handler.AnalysisRequestHandlerBase.handleRequestBody(AnalysisRequestHandlerBase.java:59) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368
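A plausible cause of the intermittent failures above — an assumption, not something the thread confirms: charTermAttr.buffer() returns the attribute's whole backing array, which is usually longer than the current token, so new String(buffer) can pick up stale characters left over from previous, longer tokens and send garbage to the enhancer. Only the first charTermAttr.length() characters are valid, as this self-contained demonstration shows:

```java
// Demonstrates why new String(buffer) over a term attribute's backing
// array is unsafe: the array is reused and exceeds the current token.
public class BufferSlice {

    public static String wrong(char[] buffer) {
        return new String(buffer);                 // includes stale tail chars
    }

    public static String right(char[] buffer, int length) {
        return new String(buffer, 0, length);      // only the current token
    }
}
```

In the incrementToken() code this would mean building the request body with new String(charTermAttr.buffer(), 0, charTermAttr.length()) instead of new String(buffer).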
Indexing a token to a different field in a custom filter
Hi All, In my custom filter, I need to index the processed token into a different field. The processed token is a Stanbol enhancement response. The solution I have so far found is to use a Solr client (SolrJ) to add a new document with my processed field into Solr. Below is the sample code segment:

SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("stanbolResponse", response);
try {
    server.add(doc1);
    server.commit();
} catch (SolrServerException e) {
    e.printStackTrace();
}

This mechanism requires a new HTTP call to the local Solr server for every token I process for the stanbolRequest field, and I feel it's not very efficient. Is there any other alternative way to invoke an update request to add a new field to the document being indexed within the filter (without making an explicit HTTP call using SolrJ)? Thanks, Dileepa
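Whatever mechanism is chosen, the per-token HTTP overhead can be cut by buffering documents and flushing them in batches. A generic plain-Java sketch (with SolrJ, the flush action would be a single server.add(batch) followed by one commit at the end of the run):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Generic sketch: buffer items and flush them in batches so each item
// does not cost its own round trip to the server.
public class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushAction;
    private final List<T> buffer = new ArrayList<>();

    public Batcher(int batchSize, Consumer<List<T>> flushAction) {
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    // Add one item; flushes automatically when the batch is full.
    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Flush whatever remains (call once at the end of the stream).
    public void flush() {
        if (!buffer.isEmpty()) {
            flushAction.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

With a batch size of a few hundred, 1000 token results become a handful of requests instead of 1000, and commit() is invoked once rather than per document.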
Re: Error instantiating a Custom Filter in Solr
Thanks guys, I got the problem resolved. It was a constructor API mismatch between the code I wrote and the library I used. I used the latest lucene-common 4.5.0 with my sample code and the startup issue was resolved. Related stackoverflow discussion: http://stackoverflow.com/questions/19840129/error-instantiating-the-custom-filterfactory-class-in-solr Regards, Dileepa On Fri, Nov 8, 2013 at 9:21 PM, Jack Krupansky j...@basetechnology.com wrote: Thanks for the plug Erick, but my deep dive doesn't go quite that deep (yet.) But I'm sure a 2,500 page book on how to develop all manner of custom Solr plugins would indeed be valuable. But I do have plenty of examples of using the many built-in Solr analysis filters. -- Jack Krupansky -----Original Message----- From: Erick Erickson Sent: Friday, November 08, 2013 10:36 AM To: solr-user@lucene.apache.org Subject: Re: Error instantiating a Custom Filter in Solr Well, I think Jack Krupansky's book has some examples, at $10 it's probably a steal. Best, Erick On Fri, Nov 8, 2013 at 1:49 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi Erick, Thanks a lot for the pointer. I looked at the LowerCaseFilterFactory class [1] and its parent abstract class AbstractAnalysisFactory API [2], and modified my custom filter factory class as below:

public class ContentFilterFactory extends TokenFilterFactory {

    public ContentFilterFactory() {
        super();
    }

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
    }

    @Override
    public ContentFilter create(TokenStream input) {
        assureMatchVersion();
        return new ContentFilter(input);
    }
}

I have called the parent's init method as above, but I'm still getting the same error: java.lang.NoSuchMethodException: com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map) Any input on this? Can someone please point me to a doc/blog or any sample to implement a custom filter with Solr 4.0? I'm using a Solr 4.5.0 server.
Thanks, Dileepa [1] http://search-lucene.com/c/Lucene:analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseFilterFactory.java [2] https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html On Fri, Nov 8, 2013 at 4:25 AM, Erick Erickson erickerick...@gmail.com wrote: Well, the example you linked to is based on 3.6, and things have changed assuming you're using 4.0. It's probably that your ContentFilter isn't implementing what it needs to, or it's not subclassing from the correct class for 4.0. Maybe take a look at something simple like LowerCaseFilterFactory and use that as a model, although you probably don't need to implement the MultiTermAware bit. FWIW, Erick On Thu, Nov 7, 2013 at 1:31 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I'm a novice in Solr and I'm continuously bumping into problems with the custom filter I'm trying to use for analyzing a fieldType during indexing, as below;

<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.solr.test.analyzer.ContentFilterFactory"/>
  </analyzer>
</fieldType>

Below is my custom FilterFactory class;

public class ContentFilterFactory extends TokenFilterFactory {

    public ContentFilterFactory() {
        super();
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new ContentFilter(input);
    }
}

I'm getting the error stack trace below [1], caused by a NoSuchMethodException when starting the server. Solr complains that it cannot init the Plugin (my custom filter) as the FilterFactory class doesn't have an init method. But the example [2] I was following didn't have any notion of an init method in the FilterFactory class, nor was I required to override an init method when extending the TokenFilterFactory class. Can someone please help me resolve this error and get my custom filter working?
Thanks, Dileepa [1] Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType stanbolRequestType: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:468) ... 13 more Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer
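For reference, the "constructor API mismatch" named in the resolution above is that in Lucene/Solr 4.x analysis factories receive their configuration through a constructor taking the argument map; Solr looks that constructor up reflectively, which is exactly what produces the NoSuchMethodException against the old init(Map) style. The sketch below illustrates the 4.x-style factory; TokenFilterFactoryStub and the Object-typed create are local stand-ins so the snippet is self-contained, whereas real code would extend org.apache.lucene.analysis.util.TokenFilterFactory and return a TokenStream.

```java
import java.util.Map;

// Local stand-in for org.apache.lucene.analysis.util.TokenFilterFactory.
abstract class TokenFilterFactoryStub {
    protected TokenFilterFactoryStub(Map<String, String> args) { }
    public abstract Object create(Object input);
}

// 4.x-style factory: configuration arrives via the constructor, not init().
// Solr instantiates this reflectively via a Map-taking constructor lookup.
class ContentFilterFactory extends TokenFilterFactoryStub {
    public ContentFilterFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            // Unconsumed schema.xml attributes indicate a misconfiguration.
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public Object create(Object input) {
        // Real code: return new ContentFilter((TokenStream) input);
        return input;
    }
}
```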
Re: Help to find BaseTokenFilterFactory to write a Custom TokenFilter
Thanks Anuj, The jar containing the class can be found here : http://www.java2s.com/Code/JarDownload/lucene/lucene-analyzers-common-4.2.0.jar.zip On Thu, Nov 7, 2013 at 2:18 PM, Anuj Kumar anujs...@gmail.com wrote: http://stackoverflow.com/questions/13149627/where-did-basetokenfilterfactory-go-in-solr-4-0 On Thu, Nov 7, 2013 at 1:05 PM, Dileepa Jayakody dileepajayak...@gmail.comwrote: Hi All, I am writing a custom TokenFilter to post a token value to Apache Stanbol for enhancement. In this Custom TokenFilter I'm trying to retrieve the response from Stanbol and index it as a new document in Solr. I'm following [1] to write a custom filter, but I'm having trouble locating BaseTokenFilterFactory to create a TokenFactory. Can someone please point me to a Jar location to get this library? Thanks, Dileepa [1] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
Re: Help to find BaseTokenFilterFactory to write a Custom TokenFilter
) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType stanbolRequestType: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory'. Schema file is /home/dileepa/MyData/desk/solr/solr-4.5.0/example/solr/collection1/schema.xml at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608) at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166) at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55) at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69) at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:521) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:559) ... 8 more Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType stanbolRequestType: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:468) ... 13 more Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:400) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) ...
14 more Caused by: org.apache.solr.common.SolrException: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:556) at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:382) at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:376) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) ... 18 more Caused by: java.lang.NoSuchMethodException: com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map) at java.lang.Class.getConstructor0(Class.java:2810) at java.lang.Class.getConstructor(Class.java:1718) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:552) ... 21 more On Thu, Nov 7, 2013 at 2:39 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Thanks Anuj, The jar containing the class can be found here: http://www.java2s.com/Code/JarDownload/lucene/lucene-analyzers-common-4.2.0.jar.zip On Thu, Nov 7, 2013 at 2:18 PM, Anuj Kumar anujs...@gmail.com wrote: http://stackoverflow.com/questions/13149627/where-did-basetokenfilterfactory-go-in-solr-4-0 On Thu, Nov 7, 2013 at 1:05 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I am writing a custom TokenFilter to post a token value to Apache Stanbol for enhancement. In this custom TokenFilter I'm trying to retrieve the response from Stanbol and index it as a new document in Solr. I'm following [1] to write a custom filter, but I'm having trouble locating BaseTokenFilterFactory to create a TokenFilterFactory. Can someone please point me to a Jar location to get this library? Thanks, Dileepa [1] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
Error instantiating a Custom Filter in Solr
Hi All, I'm a novice in Solr and I'm continuously bumping into problems with the custom filter I'm trying to use for analyzing a fieldType during indexing, as below;

<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.solr.test.analyzer.ContentFilterFactory"/>
  </analyzer>
</fieldType>

Below is my custom FilterFactory class;

public class ContentFilterFactory extends TokenFilterFactory {

    public ContentFilterFactory() {
        super();
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new ContentFilter(input);
    }
}

I'm getting the error stack trace below [1], caused by a NoSuchMethodException when starting the server. Solr complains that it cannot init the Plugin (my custom filter) as the FilterFactory class doesn't have an init method. But the example [2] I was following didn't have any notion of an init method in the FilterFactory class, nor was I required to override an init method when extending the TokenFilterFactory class. Can someone please help me resolve this error and get my custom filter working? Thanks, Dileepa [1] Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType stanbolRequestType: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:468) ...
13 more Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:400) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) ... 14 more Caused by: org.apache.solr.common.SolrException: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:556) at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:382) at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:376) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) ... 18 more Caused by: java.lang.NoSuchMethodException: com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map) at java.lang.Class.getConstructor0(Class.java:2810) at java.lang.Class.getConstructor(Class.java:1718) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:552) ... 21 more [2] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
Re: Error instantiating a Custom Filter in Solr
Hi Erick, Thanks a lot for the pointer. I looked at the LowerCaseFilterFactory class [1] and its parent abstract class AbstractAnalysisFactory API [2], and modified my custom filter factory class as below;

public class ContentFilterFactory extends TokenFilterFactory {

    public ContentFilterFactory() {
        super();
    }

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
    }

    @Override
    public ContentFilter create(TokenStream input) {
        assureMatchVersion();
        return new ContentFilter(input);
    }
}

I have called the parent's init method as above, but I'm still getting the same error: java.lang.NoSuchMethodException: com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map) Any input on this? Can someone please point me to a doc/blog or any sample to implement a custom filter with Solr 4.x? I'm using a Solr 4.5.0 server. Thanks, Dileepa [1] http://search-lucene.com/c/Lucene:analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseFilterFactory.java [2] https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html On Fri, Nov 8, 2013 at 4:25 AM, Erick Erickson erickerick...@gmail.com wrote: Well, the example you linked to is based on 3.6, and things have changed assuming you're using 4.0. It's probably that your ContentFilter isn't implementing what it needs to, or it's not subclassing from the correct class for 4.0. Maybe take a look at something simple like LowerCaseFilterFactory and use that as a model, although you probably don't need to implement the MultiTermAware bit.
FWIW, Erick On Thu, Nov 7, 2013 at 1:31 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I'm a novice in Solr and I'm continuously bumping into problems with the custom filter I'm trying to use for analyzing a fieldType during indexing, as below;

<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.solr.test.analyzer.ContentFilterFactory"/>
  </analyzer>
</fieldType>

Below is my custom FilterFactory class;

public class ContentFilterFactory extends TokenFilterFactory {

    public ContentFilterFactory() {
        super();
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new ContentFilter(input);
    }
}

I'm getting the error stack trace below [1], caused by a NoSuchMethodException when starting the server. Solr complains that it cannot init the Plugin (my custom filter) as the FilterFactory class doesn't have an init method. But the example [2] I was following didn't have any notion of an init method in the FilterFactory class, nor was I required to override an init method when extending the TokenFilterFactory class. Can someone please help me resolve this error and get my custom filter working? Thanks, Dileepa [1] Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType stanbolRequestType: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:468) ...
13 more Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177) at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:400) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) ... 14 more Caused by: org.apache.solr.common.SolrException: Error instantiating class: 'com.solr.test.analyzer.ContentFilterFactory' at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:556) at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:382) at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:376) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) ... 18 more Caused by: java.lang.NoSuchMethodException: com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map) at java.lang.Class.getConstructor0(Class.java:2810) at java.lang.Class.getConstructor(Class.java:1718) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:552) ... 21 more [2] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
Help to find BaseTokenFilterFactory to write a Custom TokenFilter
Hi All, I am writing a custom TokenFilter to post a token value to Apache Stanbol for enhancement. In this custom TokenFilter I'm trying to retrieve the response from Stanbol and index it as a new document in Solr. I'm following [1] to write a custom filter, but I'm having trouble locating BaseTokenFilterFactory to create a TokenFilterFactory. Can someone please point me to a Jar location to get this library? Thanks, Dileepa [1] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
Writing a Solr custom analyzer to post content to Stanbol {was: Need additional data processing in Data Import Handler prior to indexing}
Hi All, I went through possible solutions for my requirement of triggering a Stanbol enhancement during Solr indexing, and I got the requirement simplified. I only need to process the field named content to perform the Stanbol enhancement that extracts Persons and Organizations. So I think it will be easier to do the Stanbol request while indexing the content field, after the data is imported (from DIH). I think the best solution will be to write a custom Analyzer to process the content and post it to Stanbol. In the analyzer I also need to process the Stanbol enhancement response. The response should be processed so that the identified Person and Organization entities are indexed and stored in a field called extractedEntities. So my current idea is as follows; in the schema.xml:

<copyField source="content" dest="stanbolRequest"/>
<field name="stanbolRequest" type="stanbolRequestType" indexed="true" stored="true" docValues="true" required="false"/>
<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer class="MyCustomAnalyzer"/>
</fieldType>

In the MyCustomAnalyzer class the content will be posted to and enhanced by Stanbol. The Person and Organization entities in the response should be indexed into the Solr field extractedEntities. Am I on the right path for my requirement? Please share your ideas. Appreciate any relevant pointers to samples/documentation. Thanks, Dileepa On Wed, Oct 30, 2013 at 11:26 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Thanks guys for your ideas. I will go through them and come back with questions. Regards, Dileepa On Wed, Oct 30, 2013 at 7:00 AM, Erick Erickson erickerick...@gmail.com wrote: Third time tonight I've been able to paste this link. Also, you can consider just moving to SolrJ and taking DIH out of the process, see: http://searchhub.org/2012/02/14/indexing-with-solrj/ Whichever approach fits your needs of course.
Best, Erick On Tue, Oct 29, 2013 at 7:15 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: It's also possible to combine an Update Request Processor with DIH. That way if a debug entry needs to be inserted it could go through the same Stanbol process. Just define a processing chain in the DIH handler and write a custom URP to call out to the Stanbol web service. You have access to the full record in the URP, so you can add/delete/change the fields at will. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Oct 30, 2013 at 4:09 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Hi Dileepa, You can write your own Transformers in Java. If it doesn't make sense to run Stanbol calls in a Transformer, maybe setting up a web service that grabs a record out of MySQL, sends the data to Stanbol, and displays the results could be used in conjunction with HttpDataSource rather than JdbcDataSource. http://wiki.apache.org/solr/DIHCustomTransformer http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2FHTTP_Datasource Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Tue, Oct 29, 2013 at 4:47 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I'm a newbie to Solr, and I have a requirement to import data from a mysql database, enhance the imported content to identify Persons mentioned, and index that as a separate field in Solr along with the other fields defined for the original db query.
I'm using Apache Stanbol [1] for the content enhancement requirement. I can get enhancement results for 'Person' type data in the content as the enhancement result. The data flow will be: mysql-db -> Solr data-import handler -> Stanbol enhancer -> Solr index. For the above requirement I need to perform additional processing at the data-import handler prior to indexing: send a request to Stanbol and process the enhancement response. I found some related examples on modifying the mysql data import handler to customize the query results in db-data-config.xml by using a transformer script. As per my requirement, in the data-import handler I need to send a request to Stanbol and process the response prior to indexing. But I'm not sure if this can be achieved using a simple javascript. Is there any other better way
Need additional data processing in Data Import Handler prior to indexing
Hi All, I'm a newbie to Solr, and I have a requirement to import data from a mysql database, enhance the imported content to identify Persons mentioned, and index that as a separate field in Solr along with the other fields defined for the original db query. I'm using Apache Stanbol [1] for the content enhancement requirement. I can get enhancement results for 'Person' type data in the content as the enhancement result. The data flow will be: mysql-db -> Solr data-import handler -> Stanbol enhancer -> Solr index. For the above requirement I need to perform additional processing at the data-import handler prior to indexing: send a request to Stanbol and process the enhancement response. I found some related examples on modifying the mysql data import handler to customize the query results in db-data-config.xml by using a transformer script. As per my requirement, in the data-import handler I need to send a request to Stanbol and process the response prior to indexing. But I'm not sure if this can be achieved using a simple javascript. Is there any other better way of achieving my requirement? Maybe writing a custom filter in Solr? Please share your thoughts. Appreciate any pointers as I'm a beginner with Solr. Thanks, Dileepa [1] https://stanbol.apache.org
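The custom-Transformer suggestion from the thread can be sketched as follows. This is a self-contained outline: callStanbol is a placeholder for the HTTP call to the Stanbol enhancer, and StanbolTransformer is an illustrative name; a real DIH transformer would implement org.apache.solr.handler.dataimport.Transformer and its transformRow(Map, Context) method.

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of a DIH-style transformer: for each imported row, send the
// "content" column to an enhancement service and store the result in a
// new "persons" column before the row is indexed.
class StanbolTransformer {
    private final Function<String, String> callStanbol; // placeholder for the HTTP call

    StanbolTransformer(Function<String, String> callStanbol) {
        this.callStanbol = callStanbol;
    }

    public Map<String, Object> transformRow(Map<String, Object> row) {
        Object content = row.get("content");
        if (content != null) {
            // A real implementation would parse the enhancement response
            // and extract only the Person entities here.
            row.put("persons", callStanbol.apply(content.toString()));
        }
        return row;
    }
}
```

The transformer would then be referenced from the entity definition in db-data-config.xml via the transformer attribute, with persons mapped to a stored field in schema.xml.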
Re: Need additional data processing in Data Import Handler prior to indexing
Thanks guys for your ideas. I will go through them and come back with questions. Regards, Dileepa On Wed, Oct 30, 2013 at 7:00 AM, Erick Erickson erickerick...@gmail.com wrote: Third time tonight I've been able to paste this link. Also, you can consider just moving to SolrJ and taking DIH out of the process, see: http://searchhub.org/2012/02/14/indexing-with-solrj/ Whichever approach fits your needs of course. Best, Erick On Tue, Oct 29, 2013 at 7:15 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: It's also possible to combine an Update Request Processor with DIH. That way if a debug entry needs to be inserted it could go through the same Stanbol process. Just define a processing chain in the DIH handler and write a custom URP to call out to the Stanbol web service. You have access to the full record in the URP, so you can add/delete/change the fields at will. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Oct 30, 2013 at 4:09 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Hi Dileepa, You can write your own Transformers in Java. If it doesn't make sense to run Stanbol calls in a Transformer, maybe setting up a web service that grabs a record out of MySQL, sends the data to Stanbol, and displays the results could be used in conjunction with HttpDataSource rather than JdbcDataSource. http://wiki.apache.org/solr/DIHCustomTransformer http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2FHTTP_Datasource Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc.
“The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Tue, Oct 29, 2013 at 4:47 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I'm a newbie to Solr, and I have a requirement to import data from a mysql database, enhance the imported content to identify Persons mentioned, and index that as a separate field in Solr along with the other fields defined for the original db query. I'm using Apache Stanbol [1] for the content enhancement requirement. I can get enhancement results for 'Person' type data in the content as the enhancement result. The data flow will be: mysql-db -> Solr data-import handler -> Stanbol enhancer -> Solr index. For the above requirement I need to perform additional processing at the data-import handler prior to indexing: send a request to Stanbol and process the enhancement response. I found some related examples on modifying the mysql data import handler to customize the query results in db-data-config.xml by using a transformer script. As per my requirement, in the data-import handler I need to send a request to Stanbol and process the response prior to indexing. But I'm not sure if this can be achieved using a simple javascript. Is there any other better way of achieving my requirement? Maybe writing a custom filter in Solr? Please share your thoughts. Appreciate any pointers as I'm a beginner with Solr. Thanks, Dileepa [1] https://stanbol.apache.org
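The Update Request Processor route suggested in the thread could look roughly like this. The sketch uses a plain map in place of SolrInputDocument and a placeholder enhance function (StanbolEnhancingProcessor and extractedEntities are illustrative names); a real URP would extend org.apache.solr.update.processor.UpdateRequestProcessor and override processAdd(AddUpdateCommand).

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of the update-request-processor approach: intercept each document
// on its way into the index and attach the Stanbol enhancement result as an
// extra field, so no second round-trip back into Solr is needed.
class StanbolEnhancingProcessor {
    private final Function<String, String> enhance; // placeholder for the Stanbol call

    StanbolEnhancingProcessor(Function<String, String> enhance) {
        this.enhance = enhance;
    }

    // In a real URP this logic lives in processAdd(AddUpdateCommand cmd),
    // operating on cmd.getSolrInputDocument().
    public void processAdd(Map<String, Object> doc) {
        Object content = doc.get("content");
        if (content != null) {
            doc.put("extractedEntities", enhance.apply(content.toString()));
        }
    }
}
```

Registered in an updateRequestProcessorChain and attached to the /dataimport handler, this enriches documents inline, which also sidesteps the shared-state concurrency problem of collecting documents in an EventListener.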