Re: Integrating Oauth2 with Solr MailEntityProcessor

2014-02-12 Thread Dileepa Jayakody
Hi again,

Anybody interested in this feature for Solr MailEntityProcessor?
WDYT?

Thanks,
Dileepa


On Thu, Jan 30, 2014 at 11:00 AM, Dileepa Jayakody 
dileepajayak...@gmail.com wrote:

 Hi All,

 I think OAuth2 integration is a valid use case for Solr when it comes to
 importing data from user accounts like email, social networks, enterprise
 stores etc.
 Do you think OAuth2 integration in Solr would be a useful feature? If so I
 would like to start working on this.
 I feel this could also be a good project for GSoC 2014.

 Thanks,
 Dileepa


 On Wed, Jan 29, 2014 at 3:57 PM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:

 Hi All,

 I'm doing a research project on Email Reputation Analysis, and for this
 project I'm planning to use the Apache Solr, Tika and Mahout projects to
 analyse, store and query the reputation of emails and correspondents.

 For indexing emails in Solr I'm going to use the MailEntityProcessor [1].
 But I see that it requires the user to provide their email credentials to
 the DIH, which is a security risk. Also, I feel the current MailEntityProcessor
 doesn't allow importing data from multiple mailboxes.

 What do you think of integrating an authorization mechanism like OAuth2
 in Solr?
 I'd appreciate your ideas on using this for indexing multiple mailboxes
 without requiring users to hand over their usernames and passwords.

 <document>
   <entity processor="MailEntityProcessor" user="someb...@gmail.com"
           password="something" host="imap.gmail.com" protocol="imaps"
           folders="x,y,z"/>
 </document>
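For illustration, a minimal sketch of the kind of OAuth2-based IMAP login that
could replace the password above. This is not existing MailEntityProcessor
functionality; it assumes a JavaMail version with XOAUTH2 SASL support, and the
account and token values are placeholders:

import java.util.Properties;
import javax.mail.Session;
import javax.mail.Store;

public class OAuth2ImapSketch {
    public static void main(String[] args) throws Exception {
        String user = "somebody@gmail.com";   // placeholder account
        String accessToken = args[0];         // token obtained via the OAuth2 flow
        Properties props = new Properties();
        // Enable the XOAUTH2 SASL mechanism (supported by recent JavaMail versions)
        props.put("mail.imaps.sasl.enable", "true");
        props.put("mail.imaps.sasl.mechanisms", "XOAUTH2");
        Session session = Session.getInstance(props);
        Store store = session.getStore("imaps");
        // The access token goes in the password slot; no real password is exposed.
        store.connect("imap.gmail.com", user, accessToken);
        System.out.println("Connected: " + store.isConnected());
        store.close();
    }
}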

 Regards,
 Dileepa

 [1] http://wiki.apache.org/solr/MailEntityProcessor





API to get documents imported in dataimport from EventListener.onEvent(Context cntxt)

2014-02-04 Thread Dileepa Jayakody
Hi All,

Is there a way to retrieve the documents being imported in a dataimport
request from an EventListener configured to run at onImportEnd?
I need to get the values of the content field of all the documents
imported, to perform an enhancement task. Is there a way to retrieve the
documents imported in a dataimport from my EventListener?

My example data-config is as below:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test" user="usr1" password="pass1"
              batchSize="1"/>
  <document name="stanboldata"
            onImportEnd="com.solr.stanbol.processor.StanbolEventListener">
    <entity name="stanbolrequest" query="SELECT * FROM documents">
      <field column="id" name="id"/>
      <field column="content" name="content"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

Currently what I do is as below:

1. The /dataimport request handler is configured with a custom
UpdateRequestProcessor which intercepts the documents being imported, gets
the value of the field I want and updates a static Map<Long, String>
contents in my custom EventListener class with the document ID and the
content String.
2. At the end of the import process the StanbolEventListener is triggered
onImportEnd of the dataimport; in the onEvent(Context cntxt) method, the
contents Map is iterated and all the content field values are sent to an
external server to be enhanced.
3. The documents with the IDs (keys of the contents Map) are updated with the
enhanced fields and committed.

This mechanism works fine for a single dataimport process at a time.
But when there are concurrent dataimport requests, the system behaves
erratically. I suspect the static Map<Long, String> contents is corrupted by
concurrent update requests initiated by the dataimport processes. To make
the contents Map thread safe, I used a ConcurrentHashMap implementation.
However I still get inconsistent results in the update process.
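For concreteness, a minimal sketch (class and member names are illustrative,
not the actual implementation) of the shared-map pattern described above:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EventListener;

public class StanbolEventListener implements EventListener {
    // Filled by the custom UpdateRequestProcessor during the import.
    // Being static, it is shared by ALL concurrent imports, which is why
    // concurrent /dataimport requests interfere with each other even though
    // the map itself is thread safe.
    public static final Map<Long, String> CONTENTS = new ConcurrentHashMap<Long, String>();

    @Override
    public void onEvent(Context ctx) {
        for (Map.Entry<Long, String> e : CONTENTS.entrySet()) {
            String enhanced = enhance(e.getValue());
            // ... issue an /update for document id e.getKey() with the
            // enhanced fields ...
        }
        CONTENTS.clear();
    }

    // Stub: the real code posts the content to the external Stanbol server.
    private String enhance(String content) {
        return content;
    }
}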

What I'm looking for is an alternative that bypasses data concurrency
handling in EventListeners.
I think this could be achieved if the whole dataimport process were executed
as a single transaction and, in the onImportEnd EventListener, all the
documents imported were retrieved to get the content field of each document.
Is there a way to access the set of documents imported in the
onEvent(Context context) method of an EventListener in Solr? Can I use the
Context object to access my documents?

Any suggestions?


Concurrency handling in DataImportHandler

2014-01-30 Thread Dileepa Jayakody
Hi All,

Can I please know how concurrency is handled in the DIH?
What happens if multiple /dataimport requests are issued to the same
Datasource?

I'm doing some custom processing at the end of the dataimport process via an
EventListener configured in the data-config.xml as below.
 <document name="stanboldata"
           onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

Will each DIH request create a new EventListener object?

I'm copying some field values from my custom processor configured in the
/dataimport request handler to a static Map in my StanbolEventListener
class.
I need to figure out how to handle concurrency when data is copied to my
EventListener object to perform the rest of my update process.

Thanks,
Dileepa


Re: Concurrency handling in DataImportHandler

2014-01-30 Thread Dileepa Jayakody
I would particularly like to know how DIH handles concurrency in JDBC
database connections during a dataimport.

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/solrtest" user="usr1" password="123"
            batchSize="1"/>

Thanks,
Dileepa


On Thu, Jan 30, 2014 at 4:05 PM, Dileepa Jayakody dileepajayak...@gmail.com
 wrote:

 Hi All,

 Can I please know how concurrency is handled in the DIH?
 What happens if multiple /dataimport requests are issued to the same
 Datasource?

 I'm doing some custom processing at the end of the dataimport process via an
 EventListener configured in the data-config.xml as below.
  <document name="stanboldata"
            onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

 Will each DIH request create a new EventListener object?

 I'm copying some field values from my custom processor configured in the
 /dataimport request handler to a static Map in my StanbolEventListener
 class.
 I need to figure out how to handle concurrency when data is copied to my
 EventListener object to perform the rest of my update process.

 Thanks,
 Dileepa



Re: Concurrency handling in DataImportHandler

2014-01-30 Thread Dileepa Jayakody
Hi All,

I triggered a /dataimport for the first 100 rows from my database and, while
it was running, issued another import request for rows 101-200.

In my log I see the exception below; it seems multiple JDBC connections
cannot be opened. Does this mean concurrency is not supported in DIH for JDBC
datasources?

Please share your thoughts on how to tackle concurrency in dataimport.

[Thread-15] ERROR org.apache.solr.handler.dataimport.JdbcDataSource  -
Ignoring Error when closing connection
java.sql.SQLException: Streaming result set
com.mysql.jdbc.RowDataDynamic@1e820764 is still active. No statements may
be issued when any streaming result sets are open and in use on a given
connection. Ensure that you have called .close() on any active streaming
result sets before attempting more queries.
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:927)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
at
com.mysql.jdbc.MysqlIO.checkForOutstandingStreamingData(MysqlIO.java:3314)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2477)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2731)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2809)
at com.mysql.jdbc.ConnectionImpl.rollbackNoChecks(ConnectionImpl.java:5165)
at com.mysql.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:5048)
at com.mysql.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:4654)
at com.mysql.jdbc.ConnectionImpl.close(ConnectionImpl.java:1630)
at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:410)
at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:395)
at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:284)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)


Thanks,
Dileepa


On Thu, Jan 30, 2014 at 4:13 PM, Dileepa Jayakody dileepajayak...@gmail.com
 wrote:

 I would particularly like to know how DIH handles concurrency in JDBC
 database connections during a dataimport.

 <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
             url="jdbc:mysql://localhost:3306/solrtest" user="usr1" password="123"
             batchSize="1"/>

 Thanks,
 Dileepa


 On Thu, Jan 30, 2014 at 4:05 PM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:

 Hi All,

 Can I please know how concurrency is handled in the DIH?
 What happens if multiple /dataimport requests are issued to the same
 Datasource?

 I'm doing some custom processing at the end of the dataimport process via an
 EventListener configured in the data-config.xml as below.
  <document name="stanboldata"
            onImportEnd="com.solr.stanbol.processor.StanbolEventListener">

 Will each DIH request create a new EventListener object?

 I'm copying some field values from my custom processor configured in the
 /dataimport request handler to a static Map in my StanbolEventListener
 class.
 I need to figure out how to handle concurrency when data is copied to my
 EventListener object to perform the rest of my update process.

 Thanks,
 Dileepa





Integrating Oauth2 with Solr MailEntityProcessor

2014-01-29 Thread Dileepa Jayakody
Hi All,

I'm doing a research project on Email Reputation Analysis, and for this
project I'm planning to use the Apache Solr, Tika and Mahout projects to
analyse, store and query the reputation of emails and correspondents.

For indexing emails in Solr I'm going to use the MailEntityProcessor [1].
But I see that it requires the user to provide their email credentials to
the DIH, which is a security risk. Also, I feel the current MailEntityProcessor
doesn't allow importing data from multiple mailboxes.

What do you think of integrating an authorization mechanism like OAuth2 in
Solr?
I'd appreciate your ideas on using this for indexing multiple mailboxes without
requiring users to hand over their usernames and passwords.

<document>
  <entity processor="MailEntityProcessor" user="someb...@gmail.com"
          password="something" host="imap.gmail.com" protocol="imaps"
          folders="x,y,z"/>
</document>

Regards,
Dileepa

[1] http://wiki.apache.org/solr/MailEntityProcessor


Re: Integrating Oauth2 with Solr MailEntityProcessor

2014-01-29 Thread Dileepa Jayakody
Hi All,

I think OAuth2 integration is a valid use case for Solr when it comes to
importing data from user accounts like email, social networks, enterprise
stores etc.
Do you think OAuth2 integration in Solr would be a useful feature? If so I
would like to start working on this.
I feel this could also be a good project for GSoC 2014.

Thanks,
Dileepa


On Wed, Jan 29, 2014 at 3:57 PM, Dileepa Jayakody dileepajayak...@gmail.com
 wrote:

 Hi All,

 I'm doing a research project on Email Reputation Analysis, and for this
 project I'm planning to use the Apache Solr, Tika and Mahout projects to
 analyse, store and query the reputation of emails and correspondents.

 For indexing emails in Solr I'm going to use the MailEntityProcessor [1].
 But I see that it requires the user to provide their email credentials to
 the DIH, which is a security risk. Also, I feel the current MailEntityProcessor
 doesn't allow importing data from multiple mailboxes.

 What do you think of integrating an authorization mechanism like OAuth2 in
 Solr?
 I'd appreciate your ideas on using this for indexing multiple mailboxes
 without requiring users to hand over their usernames and passwords.

 <document>
   <entity processor="MailEntityProcessor" user="someb...@gmail.com"
           password="something" host="imap.gmail.com" protocol="imaps"
           folders="x,y,z"/>
 </document>

 Regards,
 Dileepa

 [1] http://wiki.apache.org/solr/MailEntityProcessor



Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-27 Thread Dileepa Jayakody
Hi Varun and all,

Thanks for your input.

On Mon, Jan 27, 2014 at 11:29 AM, Varun Thacker
varunthacker1...@gmail.com wrote:

 Hi Dileepa,

 If I understand correctly, this is what happens in your system:

 1. DIH sends data to Solr
 2. You have written a custom update processor (
 http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your
 Stanbol server for metadata, adds it to the document and then indexes it.

 It's the part where you query the Stanbol server and wait for the response
 which takes time, and you want to reduce this.


Yes, this is what I'm trying to achieve. For each document I'm sending the
value of the content field to Stanbol and I process the Stanbol response to
add certain metadata fields to the document in my UpdateRequestProcessor.


 In my opinion, instead of waiting for the response from the Stanbol
 server and then indexing, you could send the required field data from
 the doc to your Stanbol server and continue. Once Stanbol has enriched the
 document, you re-index the document and update it with the meta-data.

 To update a document I need to invoke a /update request with the doc id
and the field to update/add. So in the method you have suggested, for each
Stanbol request I will need to process the response and create a Solr
/update query to update the document with the Stanbol enhancements.
To Stanbol I just send the value of the content to be enhanced and no
document ID is sent. How would you recommend executing the Stanbol
request-response handling process separately?
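As one sketch of Varun's suggestion (the class name, thread pool and field
names are my own assumptions; atomic "set" updates require Solr 4.x with an
updateLog), the enrichment could run as a background task that posts the
content to Stanbol and then issues an atomic update for the document id:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AsyncEnricher {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    public void submit(final String docId, final String content) {
        pool.submit(new Runnable() {
            public void run() {
                try {
                    String person = enhance(content);   // post to Stanbol, parse response
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", docId);
                    Map<String, Object> set = new HashMap<String, Object>();
                    set.put("set", person);             // atomic update: set the field value
                    doc.addField("NLP_PERSON", set);
                    solr.add(doc);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }

    // Stub for the Stanbol round trip shown in the code below.
    private String enhance(String content) {
        return content;
    }
}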

Currently what I have done in my custom update processor is as below; I
process the Stanbol response and add NLP fields to the document in the
processAdd() method of my UpdateRequestProcessor.

public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    String request = "";
    for (String field : STANBOL_REQUEST_FIELDS) {
        if (null != doc.getFieldValue(field)) {
            request += (String) doc.getFieldValue(field) + ". ";
        }
    }
    try {
        EnhancementResult result = stanbolPost(request, getBaseURI());
        Collection<TextAnnotation> textAnnotations = result
                .getTextAnnotations();
        // extracting text annotations
        Set<String> personSet = new HashSet<String>();
        Set<String> orgSet = new HashSet<String>();

        for (TextAnnotation text : textAnnotations) {
            String type = text.getType();
            String language = text.getLanguage();
            langSet.add(language);
            String selectedText = text.getSelectedText();
            if (null != type && null != selectedText) {
                if (type.equalsIgnoreCase(StanbolConstants.PERSON)) {
                    personSet.add(selectedText);
                } else if (type
                        .equalsIgnoreCase(StanbolConstants.ORGANIZATION)) {
                    orgSet.add(selectedText);
                }
            }
        }
        Collection<EntityAnnotation> entityAnnotations =
                result.getEntityAnnotations();
        for (String person : personSet) {
            doc.addField(NLP_PERSON, person);
        }
        for (String org : orgSet) {
            doc.addField(NLP_ORGANIZATION, org);
        }
        cmd.solrDoc = doc;
        super.processAdd(cmd);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private EnhancementResult stanbolPost(String request, URI uri) {
    Client client = Client.create();
    WebResource webResource = client.resource(uri);
    ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
            .accept(new MediaType("application", "rdf+xml"))
            .entity(request, MediaType.TEXT_PLAIN)
            .post(ClientResponse.class);

    int status = response.getStatus();
    if (status != 200 && status != 201 && status != 202) {
        throw new RuntimeException("Failed : HTTP error code : "
                + response.getStatus());
    }
    String output = response.getEntity(String.class);
    // Parse the RDF model
    Model model = ModelFactory.createDefaultModel();
    StringReader reader = new StringReader(output);
    model.read(reader, null);
    return new EnhancementResult(model);
}

Thanks,
Dileepa

This method makes you re-index the document, but the changes from your
 client would be visible faster.

 Alternatively you could do the same thing at the DIH level by writing a
 custom Transformer (
 http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers)


 On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:

  Hi Ahmet,
 
 
 
  On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote:
 
   Hi,
  
   Here is what I understand from your Question.
  
   You have a custom update processor that runs with DIH. But it is slow.
   You want to run that text enhancement component after DIH. How would this
   help to speed up things?
 
 
   In this approach you will read/query/search already indexed and committed
   solr documents and run the text enhancement thing on them. Probably this
   process will add new additional fields. And then you will update these
   solr documents?
  
   Did I understand your use case correctly?
  
 
  Yes, that is exactly what I want to achieve.
  I want to separate out the enhancement process from the dataimport process.
  The dataimport process will be invoked by a client when new data is
  added/updated to the mysql database

Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-27 Thread Dileepa Jayakody
Hi All,

I have implemented my requirement as an EventListener which runs on
importEnd of the DataImportHandler.

I'm running a SolrJ-based client to send Stanbol enhancement updates to the
documents within my EventListener.

Thanks,
Dileepa


On Mon, Jan 27, 2014 at 4:34 PM, Dileepa Jayakody dileepajayak...@gmail.com
 wrote:

 Hi Varun and all,

 Thanks for your input.

 On Mon, Jan 27, 2014 at 11:29 AM, Varun Thacker 
 varunthacker1...@gmail.com wrote:

 Hi Dileepa,

 If I understand correctly, this is what happens in your system:

 1. DIH sends data to Solr
 2. You have written a custom update processor (
 http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your
 Stanbol server for metadata, adds it to the document and then indexes it.

 It's the part where you query the Stanbol server and wait for the response
 which takes time, and you want to reduce this.


 Yes, this is what I'm trying to achieve. For each document I'm sending the
 value of the content field to Stanbol and I process the Stanbol response to
 add certain metadata fields to the document in my UpdateRequestProcessor.


 In my opinion, instead of waiting for the response from the Stanbol
 server and then indexing, you could send the required field data from
 the doc to your Stanbol server and continue. Once Stanbol has enriched the
 document, you re-index the document and update it with the meta-data.

 To update a document I need to invoke a /update request with the doc id
 and the field to update/add. So in the method you have suggested, for each
 Stanbol request I will need to process the response and create a Solr
 /update query to update the document with the Stanbol enhancements.
 To Stanbol I just send the value of the content to be enhanced and no
 document ID is sent. How would you recommend executing the Stanbol
 request-response handling process separately?

 Currently what I have done in my custom update processor is as below; I
 process the Stanbol response and add NLP fields to the document in the
 processAdd() method of my UpdateRequestProcessor.

 public void processAdd(AddUpdateCommand cmd) throws IOException {
     SolrInputDocument doc = cmd.getSolrInputDocument();
     String request = "";
     for (String field : STANBOL_REQUEST_FIELDS) {
         if (null != doc.getFieldValue(field)) {
             request += (String) doc.getFieldValue(field) + ". ";
         }
     }
     try {
         EnhancementResult result = stanbolPost(request, getBaseURI());
         Collection<TextAnnotation> textAnnotations = result
                 .getTextAnnotations();
         // extracting text annotations
         Set<String> personSet = new HashSet<String>();
         Set<String> orgSet = new HashSet<String>();

         for (TextAnnotation text : textAnnotations) {
             String type = text.getType();
             String language = text.getLanguage();
             langSet.add(language);
             String selectedText = text.getSelectedText();
             if (null != type && null != selectedText) {
                 if (type.equalsIgnoreCase(StanbolConstants.PERSON)) {
                     personSet.add(selectedText);
                 } else if (type
                         .equalsIgnoreCase(StanbolConstants.ORGANIZATION)) {
                     orgSet.add(selectedText);
                 }
             }
         }
         Collection<EntityAnnotation> entityAnnotations =
                 result.getEntityAnnotations();
         for (String person : personSet) {
             doc.addField(NLP_PERSON, person);
         }
         for (String org : orgSet) {
             doc.addField(NLP_ORGANIZATION, org);
         }
         cmd.solrDoc = doc;
         super.processAdd(cmd);
     } catch (Exception ex) {
         ex.printStackTrace();
     }
 }

 private EnhancementResult stanbolPost(String request, URI uri) {
     Client client = Client.create();
     WebResource webResource = client.resource(uri);
     ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
             .accept(new MediaType("application", "rdf+xml"))
             .entity(request, MediaType.TEXT_PLAIN)
             .post(ClientResponse.class);

     int status = response.getStatus();
     if (status != 200 && status != 201 && status != 202) {
         throw new RuntimeException("Failed : HTTP error code : "
                 + response.getStatus());
     }
     String output = response.getEntity(String.class);
     // Parse the RDF model
     Model model = ModelFactory.createDefaultModel();
     StringReader reader = new StringReader(output);
     model.read(reader, null);
     return new EnhancementResult(model);
 }

 Thanks,
 Dileepa

 This method makes you re-index the document, but the changes from your
 client would be visible faster.

 Alternatively you could do the same thing at the DIH level by writing a
 custom Transformer (
 http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers
 )


 On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:

  Hi Ahmet,
 
 
 
  On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com
 wrote:
 
   Hi,
  
   Here is what I understand from your Question.
  
   You have a custom update processor that runs with DIH. But it is slow.
   You want to run that text enhancement component after DIH. How would this
   help to speed up things?


   In this approach you will read/query/search already indexed and committed
   solr

Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-26 Thread Dileepa Jayakody
Hi all,

Any ideas on how to run a reindex update process for all the imported
documents from a /dataimport query?
Appreciate your help.


Thanks,
Dileepa


On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody 
dileepajayak...@gmail.com wrote:

 Hi All,

 I did some research on this and found some alternatives useful to my
 usecase. Please give your ideas.

 Can I update all documents indexed after a /dataimport query using the
 last_index_time in dataimport.properties?
 If so can anyone please give me some pointers?
 What I currently have in mind is something like below:

 1. Store the indexing timestamp of the document as a field
 eg: <field name="timestamp" type="date" indexed="true" stored="true"
     default="NOW" multiValued="false"/>

 2. Read the last_index_time from the dataimport.properties

 3. Query all document ids indexed after the last_index_time and send them
 through the Stanbol update processor.

 But I have a question here;
 Does the last_index_time refer to when the dataimport is
 started (onImportStart) or when the dataimport is finished (onImportEnd)?
 If it's the onImportEnd timestamp, then this solution won't work, because the
 timestamp indexed in the document field will be: onImportStart <
 doc-index-timestamp < onImportEnd.


 Another alternative I can think of is to trigger an update chain via an
 EventListener configured to run after a dataimport is processed
 (onImportEnd).
 In this case, can the context in DIH give the list of document ids
 processed in the /dataimport request? If so I can send those doc ids with
 an /update query to run the Stanbol update process.

 Please give me your ideas and suggestions.

 Thanks,
 Dileepa




 On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:

 Hi All,

 I have a Solr requirement to send all the documents imported from a
 /dataimport query to go through another update chain as a separate
 background process.

 Currently I have configured my custom update chain in the /dataimport
 handler itself. But since my custom update process needs to connect to an
 external enhancement engine (Apache Stanbol) to enhance the documents with
 some NLP fields, it has a negative impact on the /dataimport process.
 The solution will be to have a separate update process running to enhance
 the content of the documents imported from /dataimport.

 Currently I have configured my custom Stanbol Processor as below in my
 /dataimport handler.

 <requestHandler name="/dataimport" class="solr.DataImportHandler">
   <lst name="defaults">
     <str name="config">data-config.xml</str>
     <str name="update.chain">stanbolInterceptor</str>
   </lst>
 </requestHandler>

 <updateRequestProcessorChain name="stanbolInterceptor">
   <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
   <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>


 What I need now is to separate the 2 processes of dataimport and
 stanbol-enhancement.
 So this is like running a separate re-indexing process periodically over
 the documents imported from /dataimport for Stanbol fields.

 The question is how to trigger my Stanbol update process to the documents
 imported from /dataimport?
 In Solr to trigger /update query we need to know the id and the fields of
 the document to be updated. In my case I need to run all the documents
 imported from the previous /dataimport process through a stanbol
 update.chain.

 Is there a way to keep track of the documents ids imported from
 /dataimport?
 Any advice or pointers will be really helpful.

 Thanks,
 Dileepa





What is the last_index_time in dataimport.properties?

2014-01-26 Thread Dileepa Jayakody
Hi All,

Can I please know what timestamp in the dataimport process is recorded as
the last_index_time in dataimport.properties?

Is it the time that the last dataimport process started?
OR
Is it the time that the last dataimport process finished?


Thanks,
Dileepa


Re: What is the last_index_time in dataimport.properties?

2014-01-26 Thread Dileepa Jayakody
Hi Ahmet,

Thanks a lot.
Does this mean I can use the last_index_time to query documents indexed
during the last dataimport request?
I need to run a subsequent update process on all documents imported from a
dataimport.

Thanks,
Dileepa


On Mon, Jan 27, 2014 at 1:33 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Dileepa,

 It is the time that the last dataimport process started. So it is safe to
 use it when considering updated documents during the import.

 Ahmet



 On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:
 Hi All,

 Can I please know what timestamp in the dataimport process is recorded as
 the last_index_time in dataimport.properties?

 Is it the time that the last dataimport process started ?
 OR
 Is it the time that the last dataimport process finished?


 Thanks,
 Dileepa




Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-26 Thread Dileepa Jayakody
Hi Ahmet,



On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi,

 Here is what I understand from your Question.

 You have a custom update processor that runs with DIH. But it is slow. You
 want to run that text enhancement component after DIH. How would this help
 to speed up things?


 In this approach you will read/query/search already indexed and committed
 solr documents and run the text enhancement thing on them. Probably this
 process will add new additional fields. And then you will update these solr
 documents?

 Did I understand your use case correctly?


Yes, that is exactly what I want to achieve.
I want to separate out the enhancement process from the dataimport process.
The dataimport process will be invoked by a client when new data is
added/updated in the mysql database. Therefore the documents, with their
mandatory fields, should be indexed as soon as possible.
Mandatory fields are mapped to the data table columns in the
data-config.xml, and the normal /dataimport process doesn't take much time.
The enhancements are done in my custom processor by sending the content
field of the document to an external Stanbol [1] server to detect NLP
enhancements. Then new NLP fields are added to the document (detected
persons, organizations, places in the content) in the custom update
processor, and if this is executed during the dataimport process, it takes a
lot of time.

The NLP fields are not mandatory for the primary usage of the application
which is to query documents with mandatory fields. The NLP fields are
required for custom queries for Person, Organization entities. Therefore
the NLP update process should be run as a background process detached from
the primary /dataimport process. It should not slow down the existing
/dataimport process.

That's why I am looking for the best way to achieve my objective. I want to
implement a way to separately update the documents imported from
/dataimport to detect NLP enhancements. Currently my idea is to adopt a
timestamp-based approach: trigger an /update query for all documents
imported after the last_index_time in dataimport.properties and update
them with NLP fields.

Hope my requirement is clear :). Appreciate your suggestions.

[1] http://stanbol.apache.org/





 On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:
 Hi all,

 Any ideas on how to run a reindex update process for all the imported
 documents from a /dataimport query?
 Appreciate your help.


 Thanks,
 Dileepa



 On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:

  Hi All,
 
  I did some research on this and found some alternatives useful to my
  usecase. Please give your ideas.
 
  Can I update all documents indexed after a /dataimport query using the
  last_index_time in dataimport.properties?
  If so can anyone please give me some pointers?
  What I currently have in mind is something like below;
 
  1. Store the indexing timestamp of the document as a field
  eg: <field name="timestamp" type="date" indexed="true" stored="true"
      default="NOW" multiValued="false"/>
 
  2. Read the last_index_time from the dataimport.properties
 
  3. Query all document ids indexed after the last_index_time and send them
  through the Stanbol update processor.
 
  But I have a question here;
  Does the last_index_time refer to when the dataimport is
  started(onImportStart) or when the dataimport is finished (onImportEnd)?
  If it's the onImportEnd timestamp, then this solution won't work, because the
  timestamp indexed in the document field will be: onImportStart <
  doc-index-timestamp < onImportEnd.
 
 
  Another alternative I can think of is to trigger an update chain via an
  EventListener configured to run after a dataimport is processed
  (onImportEnd).
  In this case can the context in DIH give the list of document ids
  processed in the /dataimport request? If so I can send those doc ids with
  an /update query to run the Stanbol update process.
 
  Please give me your ideas and suggestions.
 
  Thanks,
  Dileepa
 
 
 
 
  On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody 
  dileepajayak...@gmail.com wrote:
 
  Hi All,
 
  I have a Solr requirement to send all the documents imported from a
  /dataimport query to go through another update chain as a separate
  background process.
 
  Currently I have configured my custom update chain in the /dataimport
  handler itself. But since my custom update process needs to connect to an
  external enhancement engine (Apache Stanbol) to enhance the documents with
  some NLP fields, it has a negative impact on the /dataimport process.
  The solution will be to have a separate update process running to enhance
  the content of the documents imported from /dataimport.
 
  Currently I have configured my custom Stanbol Processor as below in my
  /dataimport handler.
 
  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="update.chain">stanbolInterceptor</str>
    </lst>
  </requestHandler>

Re: What is the last_index_time in dataimport.properties?

2014-01-26 Thread Dileepa Jayakody
Yes Ahmet.
I want to use the last_index_time to find the documents imported in the
last /dataimport process and send them through an update process. I have
explained this requirement in my other thread.

Thanks,
Dileepa


On Mon, Jan 27, 2014 at 3:23 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi,

 last_index_time is traditionally used to query the database. But it seems that
 you want to query Solr, right?




 On Sunday, January 26, 2014 11:15 PM, Dileepa Jayakody 
 dileepajayak...@gmail.com wrote:
 Hi Ahmet,

 Thanks a lot.
 Does this mean I can use the last_index_time to query documents indexed
 during the last dataimport request?
 I need to run a subsequent update process on all documents imported from a
 dataimport.

 Thanks,
 Dileepa



 On Mon, Jan 27, 2014 at 1:33 AM, Ahmet Arslan iori...@yahoo.com wrote:

  Hi Dileepa,
 
  It is the time that the last dataimport process started. So it is safe to
  use it when considering updated documents during the import.
 
  Ahmet
 
 
 
  On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody 
  dileepajayak...@gmail.com wrote:
  Hi All,
 
  Can I please know what timestamp in the dataimport process is recorded as
  the last_index_time in dataimport.properties?
 
  Is it the time that the last dataimport process started ?
  OR
  Is it the time that the last dataimport process finished?
 
 
  Thanks,
  Dileepa
 
 




How to run a subsequent update query to documents indexed from a dataimport query

2014-01-22 Thread Dileepa Jayakody
Hi All,

I have a Solr requirement to send all the documents imported from a
/dataimport query to go through another update chain as a separate
background process.

Currently I have configured my custom update chain in the /dataimport
handler itself. But since my custom update process needs to connect to an
external enhancement engine (Apache Stanbol) to enhance the documents with
some NLP fields, it has a negative impact on the /dataimport process.
The solution will be to have a separate update process running to enhance
the content of the documents imported from /dataimport.

Currently I have configured my custom Stanbol Processor as below in my
/dataimport handler.

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>


What I need now is to separate the 2 processes of dataimport and
stanbol-enhancement.
So this is like running a separate re-indexing process periodically over the
documents imported from /dataimport for Stanbol fields.

The question is how to trigger my Stanbol update process to the documents
imported from /dataimport?
In Solr to trigger /update query we need to know the id and the fields of
the document to be updated. In my case I need to run all the documents
imported from the previous /dataimport process through a stanbol
update.chain.

Is there a way to keep track of the document ids imported from
/dataimport?
Any advice or pointers will be really helpful.
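One possible shape for this (a sketch only; the class and field names are
illustrative, not an actual implementation): a pass-through
UpdateRequestProcessor in the /dataimport chain that records each document id
as it flows through:

import java.io.IOException;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class IdTrackingProcessor extends UpdateRequestProcessor {
    // Ids seen during the current import; read (and cleared) after the import ends.
    public static final Set<String> IMPORTED_IDS =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    public IdTrackingProcessor(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object id = doc.getFieldValue("id");
        if (id != null) {
            IMPORTED_IDS.add(id.toString());
        }
        super.processAdd(cmd);   // pass the document on unchanged
    }
}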

Thanks,
Dileepa


Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-22 Thread Dileepa Jayakody
Hi All,

I did some research on this and found some alternatives useful to my
usecase. Please give your ideas.

Can I update all documents indexed after a /dataimport query using the
last_index_time in dataimport.properties?
If so can anyone please give me some pointers?
What I currently have in mind is something like below;

1. Store the indexing timestamp of the document as a field
eg: <field name="timestamp" type="date" indexed="true" stored="true"
    default="NOW" multiValued="false"/>

2. Read the last_index_time from the dataimport.properties

3. Query all document ids indexed after the last_index_time and send them
through the Stanbol update processor.
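A sketch of step 3 with SolrJ (Solr 4.x API is assumed; the timestamp field is
the one defined in step 1, and the cutoff value would be read from
dataimport.properties and converted to ISO-8601):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ReindexAfterImport {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        // Value read from dataimport.properties, converted to ISO-8601.
        String lastIndexTime = "2014-01-22T18:14:00Z";
        SolrQuery q = new SolrQuery("timestamp:[" + lastIndexTime + " TO NOW]");
        q.setFields("id");
        q.setRows(1000);
        for (SolrDocument d : solr.query(q).getResults()) {
            Object id = d.getFieldValue("id");
            // ... send id through the Stanbol update process, e.g. via an
            // /update request routed to the stanbolInterceptor chain ...
        }
    }
}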

But I have a question here;
Does the last_index_time refer to when the dataimport is
started(onImportStart) or when the dataimport is finished (onImportEnd)?
If it's the onImportEnd timestamp, then this solution won't work, because the
timestamp indexed in the document field will be: onImportStart <
doc-index-timestamp < onImportEnd.


Another alternative I can think of is to trigger an update chain via an
EventListener configured to run after a dataimport is processed
(onImportEnd).
In this case can the context in DIH give the list of document ids processed
in the /dataimport request? If so I can send those doc ids with an /update
query to run the Stanbol update process.

Please give me your ideas and suggestions.

Thanks,
Dileepa




On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody dileepajayak...@gmail.com
 wrote:

 Hi All,

 I have a Solr requirement to send all the documents imported from a
 /dataimport query to go through another update chain as a separate
 background process.

 Currently I have configured my custom update chain in the /dataimport
 handler itself. But since my custom update process needs to connect to an
 external enhancement engine (Apache Stanbol) to enhance the documents with
 some NLP fields, it has a negative impact on the /dataimport process.
 The solution will be to have a separate update process running to enhance
 the content of the documents imported from /dataimport.

 Currently I have configured my custom Stanbol Processor as below in my
 /dataimport handler.

 <requestHandler name="/dataimport" class="solr.DataImportHandler">
   <lst name="defaults">
     <str name="config">data-config.xml</str>
     <str name="update.chain">stanbolInterceptor</str>
   </lst>
 </requestHandler>

 <updateRequestProcessorChain name="stanbolInterceptor">
   <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
   <processor class="solr.RunUpdateProcessorFactory"/>
 </updateRequestProcessorChain>


 What I need now is to separate the 2 processes of dataimport and
 stanbol-enhancement.
 So this is like running a separate re-indexing process periodically over
 the documents imported from /dataimport for Stanbol fields.

 The question is how to trigger my Stanbol update process to the documents
 imported from /dataimport?
 In Solr to trigger /update query we need to know the id and the fields of
 the document to be updated. In my case I need to run all the documents
 imported from the previous /dataimport process through a stanbol
 update.chain.

 Is there a way to keep track of the document ids imported from
 /dataimport?
 Any advice or pointers will be really helpful.

 Thanks,
 Dileepa



Concurrent request configurations for Solr Processors

2013-12-18 Thread Dileepa Jayakody
Hi All,

I have written a custom update request processor and configured an
UpdateRequestProcessor chain in solrconfig.xml as below;

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Can I please know how I can configure the number of concurrent requests for
my processor? What is the default number of concurrent requests per Solr
processor?

Thanks,
Dileepa


Re: Passing a Parameter to a Custom Processor

2013-12-17 Thread Dileepa Jayakody
Thanks a lot for the info, Koji. I'm going through the source code to find
out.

Regards,
Dileepa


On Fri, Dec 13, 2013 at 5:40 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 Hi Dileepa,

  The stanbolInterceptor processor chain will be used in multiple request
 handlers. Then I will have to pass the stanbol.enhancer.url param in each
 of those request handlers, which will cause redundant configuration.
 Therefore I need to pass the param to the processor directly.

 But when I pass the params to the Processor as below, the parameter does not
 reach my ProcessorFactory class;

 <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory">
   <str name="stanbol.enhancer.url">http://localhost:8080/enhancer</str>
 </processor>

 Can someone point out what might be wrong here? Can someone please advise
 on how to pass parameters directly to the Processor?


 I don't know why your Processor cannot get the parameters, but a Processor
 should be able to get them. For example, StatelessScriptUpdateProcessorFactory
 can get the script parameter like this:

 <processor class="solr.StatelessScriptUpdateProcessorFactory">
   <str name="script">updateProcessor.js</str>
 </processor>

 http://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

 So why don't you consult the source code of
 StatelessScriptUpdateProcessorFactory, etc?

 koji
 --
 http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-
 wikipedia.html



Passing a Parameter to a Custom Processor

2013-12-13 Thread Dileepa Jayakody
Hi All,

I have written a custom update-request processor and need to pass certain
parameters to the Processor.
I believe solrconfig.xml is the place to pass these parameters. At the
moment I define my parameter in the request handler as below;

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
    <str name="stanbol.enhancer.url">http://localhost:8080/enhancer</str>
  </lst>
</requestHandler>

My processor is defined in the stanbolInterceptor update.chain as below;

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The stanbolInterceptor processor chain will be used in multiple request
handlers. Then I will have to pass the stanbol.enhancer.url param in each
of those request handlers, which will cause redundant configuration.
Therefore I need to pass the param to the processor directly.

But when I pass the params to the Processor as below, the parameter does not
reach my ProcessorFactory class;

<processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory">
  <str name="stanbol.enhancer.url">http://localhost:8080/enhancer</str>
</processor>
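For reference, a minimal factory sketch (Solr 4.x API; this is not the actual
class, and the name is illustrative) showing how a per-processor <str>
parameter is normally picked up via init(NamedList):

import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class MyProcessorFactory extends UpdateRequestProcessorFactory {
    private String enhancerUrl;

    @Override
    public void init(NamedList args) {
        // Value of <str name="stanbol.enhancer.url"> under this <processor>
        Object url = args.get("stanbol.enhancer.url");
        if (url != null) {
            enhancerUrl = url.toString();
        }
    }

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        // A real factory would construct its processor with enhancerUrl here;
        // this sketch just passes documents through.
        return next;
    }
}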

Can someone point out what might be wrong here? Can someone please advise
on how to pass parameters directly to the Processor?

Thanks,
Dileepa


Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode

2013-12-01 Thread Dileepa Jayakody
Thanks all for your valuable ideas on this matter. I will try them. :)

Regards,
Dileepa


On Sun, Dec 1, 2013 at 6:05 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 There is no support for throttling built into DIH. You can probably write a
 Transformer which sleeps a while after every N requests to simulate
 throttling.
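 A sketch of the Transformer approach Shalin describes (the class name, batch
 size and pause length are illustrative):

 import java.util.Map;
 import java.util.concurrent.atomic.AtomicLong;
 import org.apache.solr.handler.dataimport.Context;
 import org.apache.solr.handler.dataimport.Transformer;

 public class ThrottleTransformer extends Transformer {
     private static final int ROWS_PER_BATCH = 50;
     private static final long PAUSE_MS = 1000;
     private final AtomicLong count = new AtomicLong();

     @Override
     public Object transformRow(Map<String, Object> row, Context context) {
         if (count.incrementAndGet() % ROWS_PER_BATCH == 0) {
             try {
                 Thread.sleep(PAUSE_MS);   // give the downstream Stanbol server a breather
             } catch (InterruptedException e) {
                 Thread.currentThread().interrupt();
             }
         }
         return row;                       // the row itself passes through unchanged
     }
 }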
 On 26 Nov 2013 14:21, Dileepa Jayakody dileepajayak...@gmail.com
 wrote:

  Hi All,
 
  I have a requirement to import a large amount of data from a mysql database
  and index documents (about 1000 documents).
  During the indexing process I need to do special processing of a field by
  sending enhancement requests to an external Apache Stanbol server.
  I have configured my dataimport-handler in solrconfig.xml to use the
  StanbolContentProcessor in the update chain, as below;
 
  <updateRequestProcessorChain name="stanbolInterceptor">
    <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="update.chain">stanbolInterceptor</str>
    </lst>
  </requestHandler>
 
  My sample data-config.xml is as below;

  <dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/solrTest" user="test" password="test123"
                batchSize="1"/>
    <document name="stanboldata">
      <entity name="stanbolrequest" query="SELECT * FROM documents">
        <field column="id" name="id"/>
        <field column="content" name="content"/>
        <field column="title" name="title"/>
      </entity>
    </document>
  </dataConfig>
 
  When running a large import with about 1000 documents, my stanbol server
  goes down, I suspect due to heavy load from the above Solr
  StanbolInterceptor.
  I would like to throttle the dataimport in batches, so that Stanbol can
  process a manageable number of requests concurrently.
  Is this achievable using the batchSize parameter in the dataSource element
  in the data-config?
  Can someone please give some ideas on how to throttle the dataimport load
  in Solr?
 
  Thanks,
  Dileepa
 



Re: How to use batchSize in DataImportHandler to throttle updates in a batch-mode

2013-12-01 Thread Dileepa Jayakody
I actually tweaked the Stanbol server to handle more results and
successfully ran 10K imports within 30 minutes with no server issues.
I'm looking at further improving the results with regard to efficiency and
NLP accuracy.

Thanks,
Dileepa


On Sun, Dec 1, 2013 at 8:17 PM, Dileepa Jayakody
dileepajayak...@gmail.com wrote:

 Thanks all for your valuable ideas on this matter. I will try them. :)

 Regards,
 Dileepa


 On Sun, Dec 1, 2013 at 6:05 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 There is no support for throttling built into DIH. You can probably write
 a
 Transformer which sleeps a while after every N requests to simulate
 throttling.
 On 26 Nov 2013 14:21, Dileepa Jayakody dileepajayak...@gmail.com
 wrote:

  Hi All,
 
  I have a requirement to import a large amount of data from a mysql database
  and index documents (about 1000 documents).
  During the indexing process I need to do special processing of a field by
  sending enhancement requests to an external Apache Stanbol server.
  I have configured my dataimport-handler in solrconfig.xml to use the
  StanbolContentProcessor in the update chain, as below;
 
  <updateRequestProcessorChain name="stanbolInterceptor">
    <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="update.chain">stanbolInterceptor</str>
    </lst>
  </requestHandler>
 
  My sample data-config.xml is as below;
 
  <dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/solrTest" user="test" password="test123"
                batchSize="1"/>
    <document name="stanboldata">
      <entity name="stanbolrequest" query="SELECT * FROM documents">
        <field column="id" name="id"/>
        <field column="content" name="content"/>
        <field column="title" name="title"/>
      </entity>
    </document>
  </dataConfig>
 
  When running a large import with about 1000 documents, my stanbol server
  goes down, I suspect due to heavy load from the above Solr
  StanbolInterceptor.
  I would like to throttle the dataimport in batches, so that Stanbol can
  process a manageable number of requests concurrently.
  Is this achievable using the batchSize parameter in the dataSource element
  in the data-config?
  Can someone please give some ideas on how to throttle the dataimport load
  in Solr?
 
  Thanks,
  Dileepa
 





How to use batchSize in DataImportHandler to throttle updates in a batch-mode

2013-11-26 Thread Dileepa Jayakody
Hi All,

I have a requirement to import a large amount of data from a mysql database
and index documents (about 1000 documents).
During the indexing process I need to do special processing of a field by
sending enhancement requests to an external Apache Stanbol server.
I have configured my dataimport-handler in solrconfig.xml to use the
StanbolContentProcessor in the update chain, as below;

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

My sample data-config.xml is as below;

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/solrTest" user="test" password="test123"
              batchSize="1"/>
  <document name="stanboldata">
    <entity name="stanbolrequest" query="SELECT * FROM documents">
      <field column="id" name="id"/>
      <field column="content" name="content"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

When running a large import with about 1000 documents, my stanbol server
goes down, I suspect due to heavy load from the above Solr
StanbolInterceptor.
I would like to throttle the dataimport in batches, so that Stanbol can
process a manageable number of requests concurrently.
Is this achievable using the batchSize parameter in the dataSource element in
the data-config?
Can someone please give some ideas on how to throttle the dataimport load in Solr?

Thanks,
Dileepa


Re: An UpdateHandler to run following a MySql DataImport

2013-11-15 Thread Dileepa Jayakody
I found out that you can configure any requestHandler to run an update
request processor chain.
So in my /dataimport requestHandler I just referenced my custom chain;

eg:

 <requestHandler name="/dataimport" class="solr.DataImportHandler">
   <lst name="defaults">
     <str name="config">data-config.xml</str>
     <str name="update.chain">stanbolInterceptor</str>
   </lst>
 </requestHandler>

It works.

Thanks,
Dileepa


On Fri, Nov 15, 2013 at 6:08 PM, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, don't quite know the answer to that, but when things
 start getting complex with DIH, you should seriously consider
 a SolrJ solution unless someone comes up with a quick fix.
 Here's an example.

 http://searchhub.org/2012/02/14/indexing-with-solrj/

 Best,
 Erick


 On Fri, Nov 15, 2013 at 2:48 AM, Dileepa Jayakody 
 dileepajayak...@gmail.com
  wrote:

  Hi All,
 
  I have written a custom update request handler to do some custom processing
  of documents and configured the /update handler to use my custom handler in
  the default update.chain.

  The same requestHandler should be configured for the data-import-handler
  when it loads documents into the Solr index.
  Is there a way to configure the dataimport handler to use my custom
  update handler in an update.chain?

  If not, how can I perform the required custom processing of the document
  while importing data from a mysql database?
 
  Thanks,
  Dileepa
 



An UpdateHandler to run following a MySql DataImport

2013-11-14 Thread Dileepa Jayakody
Hi All,

I have written a custom update request handler to do some custom processing
of documents and configured the /update handler to use my custom handler in
the default update.chain.

The same requestHandler should be configured for the data-import-handler
when it loads documents into the Solr index.
Is there a way to configure the dataimport handler to use my custom
update handler in an update.chain?

If not, how can I perform the required custom processing of the document
while importing data from a mysql database?

Thanks,
Dileepa


Re: Indexing a token to a different field in a custom filter

2013-11-12 Thread Dileepa Jayakody
I need to index the processed token to a different field (eg:
stanbolResponse) in the same document that's being indexed.

I am looking for a way to retrieve the document id from the TokenStream so
that I can update the same document with new field values. (In my sample
code above I'm adding a new document instead of updating the same document.)
Any pointers please?

Thanks,
Dileepa


On Tue, Nov 12, 2013 at 12:01 PM, Dileepa Jayakody 
dileepajayak...@gmail.com wrote:

 Hi All,

 In my custom filter, I need to index the processed token into a different
 field. The processed token is a Stanbol enhancement response.

 The solution I have found so far is to use a Solr client (SolrJ) to add a
 new Document with my processed field into Solr. Below is the sample code
 segment;

  SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
  SolrInputDocument doc1 = new SolrInputDocument();
  doc1.addField("id", "id1", 1.0f);
  doc1.addField("stanbolResponse", response);
  try {
      server.add(doc1);
      server.commit();
  } catch (SolrServerException e) {
      e.printStackTrace();
  }


 This mechanism requires a new HTTP call to the local Solr server for every
 token I process for the stanbolRequest field, and I feel it's not very
 efficient.

 Is there any other alternative way to invoke an update request to add a new
 field to the document being indexed within the filter (without making an
 explicit HTTP call using SolrJ)?

 Thanks,
 Dileepa



Re: Indexing a token to a different field in a custom filter

2013-11-12 Thread Dileepa Jayakody
Thanks all for your valuable inputs.

I looked at the suggested solutions and I too feel a *custom update
processor* during indexing will be the best solution: handle the content
field by changing the value and storing it in another field.

Do I only need to change the below request handler to intercept all
indexing documents and perform my custom analysis during indexing? Or do I
need to change any other request handler also?
 <requestHandler name="/update" class="solr.UpdateRequestHandler">

Thanks,
Dileepa


On Tue, Nov 12, 2013 at 7:37 PM, Jack Krupansky j...@basetechnology.com wrote:

 Any kind of cross-field processing is best done in an update processor.
 There are a lot of built-in update processors as well as a JavaScript
 script update processor.

 -- Jack Krupansky

 -Original Message- From: Dileepa Jayakody
 Sent: Tuesday, November 12, 2013 1:31 AM
 To: solr-user@lucene.apache.org
 Subject: Indexing a token to a different field in a custom filter


 Hi All,

 In my custom filter, I need to index the processed token into a different
 field. The processed token is a Stanbol enhancement response.

 The solution I have found so far is to use a Solr client (SolrJ) to add a
 new Document with my processed field into Solr. Below is the sample code
 segment;

 SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
 SolrInputDocument doc1 = new SolrInputDocument();
 doc1.addField("id", "id1", 1.0f);
 doc1.addField("stanbolResponse", response);
 try {
     server.add(doc1);
     server.commit();
 } catch (SolrServerException e) {
     e.printStackTrace();
 }


 This mechanism requires a new HTTP call to the local Solr server for every
 token I process for the stanbolRequest field, and I feel it's not very
 efficient.

 Is there any other alternative way to invoke an update request to add a new
 field to the document being indexed within the filter (without making an
 explicit HTTP call using SolrJ)?

 Thanks,
 Dileepa



HTTP 500 error when invoking a REST client in Solr Analyzer

2013-11-11 Thread Dileepa Jayakody
Hi All,

I am working on a custom analyzer in Solr to post content to Apache Stanbol
for enhancement during indexing. To post content to Stanbol, inside my
custom analyzer's incrementToken() method I have written the code below,
based on the Jersey client API sample [1];

public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }
    char[] buffer = charTermAttr.buffer();
    // Note: the term buffer can be longer than the term itself; using only
    // the first charTermAttr.length() chars avoids sending trailing garbage.
    String content = new String(buffer, 0, charTermAttr.length());
    Client client = Client.create();
    WebResource webResource = client.resource("http://localhost:8080/enhancer");
    ClientResponse response = webResource.type("text/plain").accept(new
            MediaType("application", "rdf+xml")).post(ClientResponse.class, content);
    int status = response.getStatus();
    if (status != 200 && status != 201 && status != 202) {
        throw new RuntimeException("Failed : HTTP error code : "
                + response.getStatus());
    }

    String output = response.getEntity(String.class);
    System.out.println(output);
    charTermAttr.setEmpty();
    char[] newBuffer = output.toCharArray();
    charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);
    return true;
}

When testing the analyzer I always get an HTTP 500 response from the Stanbol
server and I cannot process the enhancement response properly. But I could
successfully execute the same Jersey client code above in a standalone Java
application (in a main method) and retrieve the desired enhancement response
from Stanbol.

Any ideas why I always get an HTTP 500 error when invoking a REST endpoint
in a Solr analyzer? Could it be a permission problem in my Solr analyzer?
Appreciate your help.

Thanks,
Dileepa

[1]
https://blogs.oracle.com/enterprisetechtips/entry/consuming_restful_web_services_with

[2]
6424 [qtp918598659-11] ERROR org.apache.solr.core.SolrCore  –
java.lang.RuntimeException: Failed : HTTP error code : 500
at
com.solr.test.analyzer.ContentFilter.incrementToken(ContentFilter.java:70)
at
org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeTokenStream(AnalysisRequestHandlerBase.java:179)
at
org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:126)
at
org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:221)
at
org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:190)
at
org.apache.solr.handler.FieldAnalysisRequestHandler.doAnalysis(FieldAnalysisRequestHandler.java:101)
at
org.apache.solr.handler.AnalysisRequestHandlerBase.handleRequestBody(AnalysisRequestHandlerBase.java:59)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at

Re: HTTP 500 error when invoking a REST client in Solr Analyzer

2013-11-11 Thread Dileepa Jayakody
This seems to be a weird intermittent issue when I use the Analysis UI (
http://localhost:8983/solr/#/collection1/analysis) for testing my Analyzer.
It works fine when I hard-code the input value in the Analyzer and index. I
gave the same input, "Tim Bernes Lee is a professor at MIT", both hard-coded
in the Analyzer class and via the Solr Analysis UI. The UI request failed
intermittently when I adjusted the field value.
It seems this could be a problem with the character encoding of the field value.
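
Looking at the filter code again, another possibility (just a guess on my
part): charTermAttr.buffer() returns the attribute's whole internal array,
which can carry stale characters beyond the current term length, so the
posted content may include leftover garbage from longer previous values.
Copying only the valid region would rule that out:

// only the first length() chars of buffer() are valid term characters
String content = new String(charTermAttr.buffer(), 0, charTermAttr.length());
// (equivalently: String content = charTermAttr.toString();)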

Thanks,
Dileepa



Indexing a token to a different field in a custom filter

2013-11-11 Thread Dileepa Jayakody
Hi All,

In my custom filter, I need to index the processed token into a different
field. The processed token is a Stanbol enhancement response.

The solution I have found so far is to use a Solr client (SolrJ) to add a
new Document with my processed field into Solr. Below is the sample code
segment;

SolrServer server = new HttpSolrServer("http://localhost:8983/solr/");
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "id1", 1.0f);
doc1.addField("stanbolResponse", response);
try {
  server.add(doc1);
  server.commit();
} catch (SolrServerException e) {
  e.printStackTrace();
}


This mechanism requires a new HTTP call to the local Solr server for every
token I process for the stanbolRequest field, and I feel it's not very
efficient.

Is there any other alternative way to invoke an update request to add a new
field to the indexing document within the filter (without making an
explicit HTTP call using SolrJ)?

Thanks,
Dileepa


Re: Error instantiating a Custom Filter in Solr

2013-11-10 Thread Dileepa Jayakody
Thanks guys,

I got the problem resolved. It was a constructor API mismatch between the
code I wrote and the library I used.

I used the latest lucene-common 4.5.0 with my sample code and the startup
issue was resolved.

related stackoverflow discussion :
http://stackoverflow.com/questions/19840129/error-instantiating-the-custom-filterfactory-class-in-solr
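
For anyone else hitting this: in recent 4.x releases the analysis factories
receive their arguments through a Map constructor instead of init(Map), so
the factory ends up looking roughly like this (a sketch against Lucene 4.5):

import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class ContentFilterFactory extends TokenFilterFactory {

  // 4.x factories take the schema args via this constructor; init(Map) is gone
  public ContentFilterFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new ContentFilter(input);
  }
}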

Regards,
Dileepa


On Fri, Nov 8, 2013 at 9:21 PM, Jack Krupansky j...@basetechnology.com wrote:

 Thanks for the plug Erick, but my deep dive doesn't go quite that deep
 (yet.)

 But I'm sure a 2,500-page book on how to develop all manner of custom Solr
 plugins would indeed be valuable though.

 But I do have plenty of examples of using the many built-in Solr analysis
 filters.

 -- Jack Krupansky

 -----Original Message----- From: Erick Erickson
 Sent: Friday, November 08, 2013 10:36 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Error instantiating a Custom Filter in Solr


 Well, I think Jack Krupansky's book has some examples, at $10 it's probably
 a steal.

 Best,
 Erick




Re: Help to find BaseTokenFilterFactory to write a Custom TokenFilter

2013-11-07 Thread Dileepa Jayakody
Thanks Anuj,
The jar containing the class can be found here:
http://www.java2s.com/Code/JarDownload/lucene/lucene-analyzers-common-4.2.0.jar.zip
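
For Maven users, the same classes should be available via (a sketch; pick
the version matching your Solr/Lucene release):

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>4.5.0</version>
</dependency>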


On Thu, Nov 7, 2013 at 2:18 PM, Anuj Kumar anujs...@gmail.com wrote:


 http://stackoverflow.com/questions/13149627/where-did-basetokenfilterfactory-go-in-solr-4-0


 On Thu, Nov 7, 2013 at 1:05 PM, Dileepa Jayakody
 dileepajayak...@gmail.com wrote:

  Hi All,
 
  I am writing a custom TokenFilter to post a token value to Apache Stanbol
  for enhancement. In this Custom TokenFilter I'm trying to retrieve the
  response from Stanbol and index it as a new document in Solr.
 
  I'm following [1] to write a custom filter, but I'm having trouble
  locating BaseTokenFilterFactory to create a TokenFactory. Can someone
  please point me to a Jar location to get this library?
 
  Thanks,
  Dileepa
 
  [1] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
 



Re: Help to find BaseTokenFilterFactory to write a Custom TokenFilter

2013-11-07 Thread Dileepa Jayakody
)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType stanbolRequestType: Plugin init failure for
[schema.xml] analyzer/filter: Error instantiating class:
'com.solr.test.analyzer.ContentFilterFactory'. Schema file is
/home/dileepa/MyData/desk/solr/solr-4.5.0/example/solr/collection1/schema.xml
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)
at
org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at
org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at
org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:521)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:559)
... 8 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType stanbolRequestType: Plugin init failure for
[schema.xml] analyzer/filter: Error instantiating class:
'com.solr.test.analyzer.ContentFilterFactory'
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:468)
... 13 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] analyzer/filter: Error instantiating class:
'com.solr.test.analyzer.ContentFilterFactory'
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at
org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:400)
at
org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
at
org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
... 14 more
Caused by: org.apache.solr.common.SolrException: Error instantiating class:
'com.solr.test.analyzer.ContentFilterFactory'
at
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:556)
at
org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:382)
at
org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:376)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
... 18 more
Caused by: java.lang.NoSuchMethodException:
com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map)
at java.lang.Class.getConstructor0(Class.java:2810)
at java.lang.Class.getConstructor(Class.java:1718)
at
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:552)
... 21 more






Error instantiating a Custom Filter in Solr

2013-11-07 Thread Dileepa Jayakody
Hi All,

I'm a novice in Solr and I'm continuously bumping into problems with the
custom filter I'm trying to use for analyzing a fieldType during indexing,
as below;

<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.solr.test.analyzer.ContentFilterFactory"/>
  </analyzer>
</fieldType>

Below is my custom FilterFactory class;

public class ContentFilterFactory extends TokenFilterFactory {

  public ContentFilterFactory() {
    super();
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new ContentFilter(input);
  }
}

I'm getting the below error stack trace [1], caused by a NoSuchMethodException,
when starting the server.
Solr complains that it cannot init the plugin (my custom filter) as the
FilterFactory class doesn't have an init method; but the example [2] I was
following didn't have any notion of an init method in the FilterFactory
class, nor was I required to override an init method when extending the
TokenFilterFactory class.

Can someone please help me resolve this error and get my custom filter
working?

Thanks,
Dileepa

[1]
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType stanbolRequestType: Plugin init failure for
[schema.xml] analyzer/filter: Error instantiating class:
'com.solr.test.analyzer.ContentFilterFactory'
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:468)
... 13 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] analyzer/filter: Error instantiating class:
'com.solr.test.analyzer.ContentFilterFactory'
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at
org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:400)
at
org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
at
org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
... 14 more
Caused by: org.apache.solr.common.SolrException: Error instantiating class:
'com.solr.test.analyzer.ContentFilterFactory'
at
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:556)
at
org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:382)
at
org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:376)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
... 18 more
Caused by: java.lang.NoSuchMethodException:
com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map)
at java.lang.Class.getConstructor0(Class.java:2810)
at java.lang.Class.getConstructor(Class.java:1718)
at
org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:552)
... 21 more

[2] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/


Re: Error instantiating a Custom Filter in Solr

2013-11-07 Thread Dileepa Jayakody
Hi Erick,

Thanks a lot for the pointer.
I looked at the LowerCaseFilterFactory class [1] and its parent abstract
class AbstractAnalysisFactory [2], and modified my custom filter
factory class as below;

public class ContentFilterFactory extends TokenFilterFactory {

  public ContentFilterFactory() {
    super();
  }

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
  }

  @Override
  public ContentFilter create(TokenStream input) {
    assureMatchVersion();
    return new ContentFilter(input);
  }
}

I have called the parent's init method as above, but I'm still getting the
same error: java.lang.NoSuchMethodException:
com.solr.test.analyzer.ContentFilterFactory.init(java.util.Map)

Any input on this?
Can someone please point me to a doc/blog or any sample implementing a
custom filter with Solr > 4.0?
I'm using a Solr 4.5.0 server.

Thanks,
Dileepa

[1]
http://search-lucene.com/c/Lucene:analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseFilterFactory.java
[2]
https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html


On Fri, Nov 8, 2013 at 4:25 AM, Erick Erickson erickerick...@gmail.com wrote:

 Well, the example you linked to is based on 3.6, and things have
 changed assuming you're using 4.0.

 It's probably that your ContentFilter isn't implementing what it needs to
 or it's not subclassing from the correct class for 4.0.

 Maybe take a look at something simple like LowerCaseFilterFactory
 and use that as a model, although you probably don't need to implement
 the MultiTermAware bit.

 FWIW,
 Erick




Help to find BaseTokenFilterFactory to write a Custom TokenFilter

2013-11-06 Thread Dileepa Jayakody
Hi All,

I am writing a custom TokenFilter to post a token value to Apache Stanbol
for enhancement. In this Custom TokenFilter I'm trying to retrieve the
response from Stanbol and index it as a new document in Solr.

I'm following [1] to write a custom filter, but I'm having trouble
locating BaseTokenFilterFactory to create a TokenFactory. Can someone
please point me to a Jar location to get this library?

Thanks,
Dileepa

[1] http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/


Writing a Solr custom analyzer to post content to Stanbol {was: Need additional data processing in Data Import Handler prior to indexing}

2013-11-02 Thread Dileepa Jayakody
Hi All,

I went through possible solutions for my requirement of triggering a
Stanbol enhancement during Solr indexing, and I have simplified the
requirement.

I only need to process the field named content to perform the Stanbol
enhancement to extract Persons and Organizations.
So I think it will be easier to do the Stanbol request while indexing the
content field, after the data is imported (from DIH).

I think the best solution will be to write a custom Analyzer to process the
content and post it to Stanbol.
In the analyzer I also need to process the Stanbol enhancement response.
The response should be processed as a new document to index and store the
identified Person and Organization entities in a field called
extractedEntities.

So my current idea is as follows;

in the schema.xml

<copyField source="content" dest="stanbolRequest" />

<field name="stanbolRequest" type="stanbolRequestType" indexed="true"
stored="true" docValues="true" required="false" />

<fieldType name="stanbolRequestType" class="solr.TextField">
  <analyzer class="MyCustomAnalyzer" />
</fieldType>

In the MyCustomAnalyzer class the content will be posted to Stanbol and
enhanced. The Person and Organization entities in the response should be
indexed into the Solr field extractedEntities.
Am I going down the correct path for my requirement? Please share your ideas.
Appreciate any relevant pointers to samples/documentation.
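
For what it's worth, a skeleton of such an analyzer might look like this
(assuming Lucene/Solr 4.x; ContentFilter is the custom Stanbol filter, and
the class must be on Solr's classpath under the name used in schema.xml):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;

public class MyCustomAnalyzer extends Analyzer {

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // one token per field value; ContentFilter posts it to Stanbol and
    // replaces it with the enhancement response
    Tokenizer source = new KeywordTokenizer(reader);
    return new TokenStreamComponents(source, new ContentFilter(source));
  }
}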

Thanks,
Dileepa


Need additional data processing in Data Import Handler prior to indexing

2013-10-29 Thread Dileepa Jayakody
Hi All,

I'm a newbie to Solr, and I have a requirement to import data from a MySQL
database, enhance the imported content to identify Persons mentioned, and
index that as a separate field in Solr along with the other fields defined
for the original db query.

I'm using Apache Stanbol [1] for the content enhancement requirement.
I can get enhancement results for 'Person' type data in the content.

The data flow will be;
mysql-db -> Solr data-import handler -> Stanbol enhancer -> Solr index

For the above requirement I need to perform additional processing in the
data-import handler prior to indexing: send a request to Stanbol and
process the enhancement response. I found some related examples on
modifying the MySQL data import handler to customize the query results in
db-data-config.xml by using a transformer script.
As per my requirement, in the data-import handler I need to send a request
to Stanbol and process the response prior to indexing, but I'm not sure if
this can be achieved using a simple JavaScript transformer.

Is there any other better way of achieving my requirement? Maybe writing a
custom filter in Solr?
Please share your thoughts. Appreciate any pointers as I'm a beginner with
Solr.

Thanks,
Dileepa


[1] https://stanbol.apache.org


Re: Need additional data processing in Data Import Handler prior to indexing

2013-10-29 Thread Dileepa Jayakody
Thanks guys for your ideas.

I will go through them and come back with questions.

Regards,
Dileepa


On Wed, Oct 30, 2013 at 7:00 AM, Erick Erickson erickerick...@gmail.com wrote:

 Third time tonight I've been able to paste this link

 Also, you can consider just moving to SolrJ and
 taking DIH out of the process, see:
 http://searchhub.org/2012/02/14/indexing-with-solrj/
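
 For reference, the SolrJ route would look roughly like this (a sketch with
 placeholder connection details; the Stanbol call could happen per row):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JdbcIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/mydb", "user", "pass"); // placeholders
    ResultSet rs = conn.createStatement()
        .executeQuery("SELECT id, title, content FROM documents");
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("title", rs.getString("title"));
      doc.addField("content", rs.getString("content"));
      // enhancement (e.g. the Stanbol call) can be done right here, per row
      solr.add(doc);
    }
    solr.commit();
    conn.close();
  }
}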

 Whichever approach fits your needs of course.

 Best,
 Erick


 On Tue, Oct 29, 2013 at 7:15 PM, Alexandre Rafalovitch
  arafa...@gmail.com wrote:

  It's also possible to combine Update Request Processor with DIH. That way
  if a debug entry needs to be inserted it could go through the same
 Stanbol
  process.
 
   Just define a processing chain in the DIH handler and write a custom URP
   to call out to the Stanbol web service. You have access to the full
   record in a URP, so you can add/delete/change the fields at will.
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all at
  once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
  On Wed, Oct 30, 2013 at 4:09 AM, Michael Della Bitta 
  michael.della.bi...@appinions.com wrote:
 
   Hi Dileepa,
  
   You can write your own Transformers in Java. If it doesn't make sense
 to
   run Stanbol calls in a Transformer, maybe setting up a web service that
   grabs a record out of MySQL, sends the data to Stanbol, and displays
 the
   results could be used in conjunction with HttpDataSource rather than
   JdbcDataSource.
  
   http://wiki.apache.org/solr/DIHCustomTransformer
  
  
 
 http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2FHTTP_Datasource
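
 A transformRow sketch along those lines (class and field names are made up;
 the Stanbol call is stubbed):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class StanbolTransformer extends Transformer {

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object content = row.get("content");
    if (content != null) {
      // call the Stanbol enhancer and keep the result as an extra column
      row.put("stanbolResponse", enhance(content.toString()));
    }
    return row;
  }

  // stub: POST the content to Stanbol and return the enhancement response
  private String enhance(String content) {
    return content;
  }
}

 It would be wired in from db-data-config.xml with
 transformer="com.example.StanbolTransformer" on the relevant <entity>.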
  
   Michael Della Bitta
  
   Applications Developer
  
   o: +1 646 532 3062  | c: +1 917 477 7906
  
   appinions inc.
  
   “The Science of Influence Marketing”
  
   18 East 41st Street
  
   New York, NY 10017
  
   t: @appinions https://twitter.com/Appinions | g+:
   plus.google.com/appinions
  
 
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
   
   w: appinions.com http://www.appinions.com/
  
  