Carrot2 using raw text of field for clustering
Is there any workaround in Solr/Carrot2 so that we could pass tokens that have been filtered with custom tokenizers/filters, instead of the raw text that it currently uses for clustering? I read about this issue at the following link too: https://issues.apache.org/jira/browse/SOLR-2917

Is writing our own parsers to filter text documents before indexing into Solr currently the only viable approach? Please let me know if anyone has come across this issue and has other, better suggestions.

-- Chandan Tamrakar
Re: timeAllowed flag in the response
Hi Laurent,

alas, there is currently no such option. The time limit is handled by an internal TimeLimitingCollector, which is used inside SolrIndexSearcher. Since the calling method only returns the DocList and doesn't have access to the QueryResult, it won't be easy to return this information in a clean way.

Aborted queries don't feed the caches, so you could perhaps check whether the cache fill rate has changed. Of course, this is not a reasonable approach in a production environment. The only way you can get the information is by patching Solr with a dirty hack.

Greetings, Kuli

On 07.06.2012 22:14, Laurent Vaills wrote: Hi everyone, We have some grouping queries that take quite long to execute. Some take too long and are not acceptable. We have set up a timeout for the socket, but with this we get no result and the query is still running on the Solr side. So we are now using the timeAllowed parameter, which is a good compromise. However, how can we tell from the response that the query was stopped because it took too long? I need this information for monitoring and to tell the user that the results are not complete. Regards, Laurent
Re: Sorting performance
Hi, probably this may help you get started: https://issues.apache.org/jira/browse/SOLR-1297

Dmitry

On Mon, Jun 4, 2012 at 9:51 PM, Gau gauravshe...@gmail.com wrote: Here is the use case: I am using synonym expansion at query time to get results. This is essentially a name search, so a search for Jim may be expanded at query time to James, Jung, Jimmy, etc. Ranking factors like TF, IDF, and norms do not mean anything to me, so I just reset them to zero, and all the results I get have the same rank. I have used a copyField to boost the weight of exact matches, so Jim would be boosted to the top. However, I want the other results like Jimmy, Jung, and James to be sorted by Levenshtein distance with respect to the word Jim (the original query). The number of results returned is quite large, so a general strdist sort takes 6-7 seconds. Is there any option other than applying a sort= in the query to achieve the same functionality? Any particular way to index the data to achieve the same result? Any idea to boost the performance and still get the intended functionality?

-- Regards, Dmitry Kan
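For reference, the sort being discussed can be expressed directly with Solr's strdist function query; the field name below (name_exact) is a hypothetical single-token field holding the name:

sort=strdist("jim", name_exact, edit) desc

Here "edit" selects Levenshtein distance (jw and ngram are the other supported measures). This does not remove the cost concern above, since the distance is still computed per matching document at query time.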
RE: per-fieldtype similarity not working
Thanks Robert, The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse, BM25F? Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might raise some further confusion down the line. I'll open an issue for the lack of the Similarity impl. in the debug output when per-field similarity is enabled. Cheers!

-Original message- From: Robert Muir rcm...@gmail.com Sent: Fri 01-Jun-2012 18:16 To: solr-user@lucene.apache.org Subject: Re: per-fieldtype similarity not working

On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi! Ah, it makes sense now! This globally configured similarity returns a fieldType-defined similarity if available, and if not, the standard Lucene similarity. This would, I assume, mean that the two defined similarities below without per-fieldType declared similarities would always yield the same results?

Not true: note that two methods (coord and queryNorm) are not per-field but global across the entire query tree. By default these are disabled in the wrapper, as they only skew or confuse most modern scoring algorithms (e.g. all the new ranking algorithms in Lucene 4). So if you want to do per-field scoring where *all* of your sims are vector-space, it could make sense to customize (e.g. subclass) SchemaSimilarityFactory and do something useful for these methods. -- lucidimagination.com
Re: Carrot2 using raw text of field for clustering
Is there any workaround in Solr/Carrot2 so that we could pass tokens that have been filtered with custom tokenizers/filters, instead of the raw text that it currently uses for clustering? I read about this issue at the following link too: https://issues.apache.org/jira/browse/SOLR-2917 Is writing our own parsers to filter text documents before indexing into Solr currently the only viable approach? Please let me know if anyone has come across this issue and has other, better suggestions.

Until SOLR-2917 is resolved, that solution seems the easiest to implement. Alternatively, you could provide a custom implementation of Carrot2's tokenizer (http://download.carrot2.org/stable/javadoc/org/carrot2/text/analysis/ITokenizer.html) through the appropriate factory attribute (http://doc.carrot2.org/#section.attribute.lingo.PreprocessingPipeline.tokenizerFactory). The custom implementation would need to apply the required filtering.

Regardless of the approach, one thing to keep in mind is that Carrot2 draws cluster labels from the input text, so if your filtered stream omits e.g. prepositions, the labels will be less readable.

Staszek
Re: what's better for in-memory searching?
Yes, use MMapDirectory. It is faster and uses memory more efficiently than RAMDirectory. This sounds wrong, but it is true: with RAMDirectory, Java has to work harder doing garbage collection.

On Fri, Jun 8, 2012 at 1:30 AM, Li Li fancye...@gmail.com wrote: hi all, I want to use Lucene 3.6 to provide a search service. My data is not very large (the raw data is less than 1 GB) and I want to load all indexes into memory, but I also need to persist all indexes to disk. I originally wanted to use RAMDirectory, but then I read its javadoc: "Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments. It is recommended to materialize large indexes on disk and use MMapDirectory, which is a high-performance directory implementation working directly on the file system cache of the operating system, so copying data to Java heap space is not useful." Should I use MMapDirectory instead? Has anyone compared it with RAMDirectory?

-- Lance Norskog goks...@gmail.com
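For reference, a minimal sketch of the recommended MMapDirectory setup against the Lucene 3.6 API (the index path is a placeholder):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class MMapSearchExample {
  public static void main(String[] args) throws Exception {
    // The index lives on disk; the OS file-system cache keeps hot pages
    // in RAM, so a ~1 GB index is effectively memory-resident.
    Directory dir = new MMapDirectory(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    // ... run queries with searcher ...
    searcher.close();
    reader.close();
    dir.close();
  }
}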
Re: timeAllowed flag in the response
Hi Michael,

Thanks for the details, which helped me take a deeper look at the source code. I noticed that each time a TimeExceededException is caught, the method setPartialResults(true) is called... which seems to be what I'm looking for. I have to investigate, since this partialResults flag does not seem to be set for sharded queries. Maybe there is a way to write a not-so-dirty patch with a new ...

Regards, Laurent

2012/6/8 Michael Kuhlmann k...@solarier.de: Hi Laurent, alas, there is currently no such option. The time limit is handled by an internal TimeLimitingCollector, which is used inside SolrIndexSearcher. Since the calling method only returns the DocList and doesn't have access to the QueryResult, it won't be easy to return this information in a clean way. Aborted queries don't feed the caches, so you could perhaps check whether the cache fill rate has changed. Of course, this is not a reasonable approach in a production environment. The only way you can get the information is by patching Solr with a dirty hack. Greetings, Kuli

On 07.06.2012 22:14, Laurent Vaills wrote: Hi everyone, We have some grouping queries that take quite long to execute. Some take too long and are not acceptable. We have set up a timeout for the socket, but with this we get no result and the query is still running on the Solr side. So we are now using the timeAllowed parameter, which is a good compromise. However, how can we tell from the response that the query was stopped because it took too long? I need this information for monitoring and to tell the user that the results are not complete. Regards, Laurent
Re: How to cap facet counts beyond a specified limit
On Thu, 2012-06-07 at 10:01 +0200, Andrew Laird wrote: For our needs we don't really need to know that a particular facet has exactly 14,203,527 matches - just knowing that there are more than a million is enough. If I could somehow limit the hit counts to a million (say) [...]

It should be feasible to stop the collector after 1M documents have been processed, if nothing else then just by ignoring subsequent IDs. However, the IDs received would be in index order, which normally means old-to-new. If the nature of the corpus, and thereby the facet values, changes over time, this change would not be reflected in the facet values that have many hits, as the collector never reaches the newer documents.

it seems like that could decrease the work required to compute the values (just stop counting after the limit is reached) and potentially improve faceted search time - especially when we have 20-30 fields to facet on. Has anyone else tried to do something like this?

The current Solr facet implementation treats every facet structure individually. That works fine in a lot of areas, but it also means that the list of IDs for matching documents is iterated once for every facet: in the sample case, 14M+ hits * 25 fields = 350M+ hits processed. I have been experimenting with an alternative approach (SOLR-2412) that packs the terms of the facets into a single structure under the hood, which means only 14M+ hits are processed in the current case. Unfortunately it is not mature and only works for text fields.

- Toke Eskildsen, State and University Library, Denmark
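To illustrate the stop-after-N idea in the first paragraph of the reply, here is a hypothetical, untested sketch against the Lucene 3.x Collector API. Note that wiring this into Solr's faceting is the harder part, since facet counting works from the collected DocSet rather than a user-supplied collector:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Delegating collector that stops forwarding hits once a cap is reached.
// Caveat from the mail: docs arrive in index order, so newer documents
// beyond the cap are never seen by the delegate.
public class CappedCollector extends Collector {
  private final Collector delegate;
  private final int cap;
  private int count;

  public CappedCollector(Collector delegate, int cap) {
    this.delegate = delegate;
    this.cap = cap;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    delegate.setScorer(scorer);
  }

  @Override
  public void collect(int doc) throws IOException {
    if (count++ < cap) {
      delegate.collect(doc);
    }
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    delegate.setNextReader(reader, docBase);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return delegate.acceptsDocsOutOfOrder();
  }
}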
Text appears garbled when I use DIH from an Oracle database
Hello: when I use DIH with an Oracle database, the imported text appears garbled. Why? PS: my Oracle database uses GBK encoding with Chinese text. How can I solve the problem? Thanks!
Re: per-fieldtype similarity not working
On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Thanks Robert, The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse, BM25F?

I think it's easiest to compare the two TF normalization functions. DefaultSimilarity really needs something like this because its function (sqrt) grows very fast for a single term. On the other hand, consider BM25's tf/(tf+lengthNorm): it saturates rather quickly for a single term, so when multiple terms are being scored, huge numbers of occurrences of a single term won't dominate the overall score. You can see this visually here (give it a second to load, and imagine documentLength = averageDocumentLength and k = 1.2): http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100

Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might raise some further confusion down the line.

That's OK: I'd rather the very expert case (per-field scoring) be trickier than have a trap for people who try to use any algorithm other than TFIDFSimilarity.

-- lucidimagination.com
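To put rough numbers on that comparison (assuming k = 1.2 and a document of average length, so the BM25-style denominator constant is 1.2):

tf = 1:   sqrt(tf) = 1.00,  tf/(tf+1.2) = 0.45
tf = 10:  sqrt(tf) = 3.16,  tf/(tf+1.2) = 0.89
tf = 100: sqrt(tf) = 10.00, tf/(tf+1.2) = 0.99

From tf = 1 to tf = 100 the sqrt contribution grows by a factor of 10, while the saturating contribution barely doubles; that saturation is what keeps one heavily repeated term from dominating a multi-term query.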
track unused parts of config, schema
Hi, Our configs and schemas are quite big. Are there any tools, code snippets (in any language), or methodologies that people use to clean these up? By methodologies I mean, for instance, things to look for that are almost always present and almost never used, so I can look at those first. Thanks, Bryan Rasmussen
Re: ExtendedDisMax Question - Strange behaviour
Thanks, Jack. It is exactly this. My mistake. Thanks!

*And you shall know the truth, and the truth shall set you free. (John 8:32)*
andre.maldonado@gmail.com - (11) 9112-4227

On Wed, Jun 6, 2012 at 5:50 PM, Jack Krupansky j...@basetechnology.com wrote: First, it appears that you are using the dismax query parser, not the extended dismax (edismax) query parser. My hunch is that some of those fields may be non-tokenized string fields in which one or more of your search keywords do appear, but not as the full string value, or maybe with a different case than in the query. But when you do a copyField from a string field to a tokenized text field, those strings are broken up into individual keywords and probably lowercased. So it will be easier for a document to match the combined text field than the source string fields. A fair percentage of the terms may occur in both text and string fields, but it looks like a fair percentage may occur only in the string fields. Identify a specific document that is returned by the first query and not the second, then examine each non-text string field value of that document to see if the query terms would match after text-field analysis but are not exact string matches for the string fields in which the terms do occur.

-- Jack Krupansky

-Original Message- From: André Maldonado Sent: Wednesday, June 06, 2012 9:23 AM To: solr-user@lucene.apache.org Subject: Re: ExtendedDisMax Question - Strange behaviour

Erick, thanks for your reply and sorry for the confusion in the last e-mail. But it is hard to explain the situation without that bunch of code. ...
Re: timeAllowed flag in the response
On 08.06.2012 11:55, Laurent Vaills wrote: Hi Michael, Thanks for the details, which helped me take a deeper look at the source code. I noticed that each time a TimeExceededException is caught, the method setPartialResults(true) is called... which seems to be what I'm looking for. I have to investigate, since this partialResults flag does not seem to be set for sharded queries.

Ah, I simply was too blind! ;) The partial results flag indeed is set in the response header. Then I think it is a bug that it's not filled in a sharded response, or it simply is not there when sharding.

Greetings, Kuli
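For reference, a minimal SolrJ sketch of reading that flag, assuming the 3.x SolrJ API (CommonsHttpSolrServer) and a placeholder URL. As noted above, this works for non-sharded queries; the sharded case apparently does not carry the flag:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PartialResultsCheck {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("some query");
    q.set("timeAllowed", 500); // milliseconds
    QueryResponse rsp = server.query(q);
    // The response header carries the partialResults flag when the
    // query was aborted by timeAllowed.
    Object partial = rsp.getResponseHeader().get("partialResults");
    if (Boolean.TRUE.equals(partial)) {
      System.out.println("Query was aborted by timeAllowed; results are incomplete.");
    }
  }
}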
RE: per-fieldtype similarity not working
Excellent! Thanks

-Original message- From: Robert Muir rcm...@gmail.com Sent: Fri 08-Jun-2012 13:06 To: Markus Jelsma markus.jel...@openindex.io Cc: solr-user@lucene.apache.org Subject: Re: per-fieldtype similarity not working

On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Thanks Robert, The difference in scores is clear now, so it shouldn't matter, as queryNorm doesn't affect ranking but coord does. Can you explain why coord is left out now, why it is considered to skew results, and why queryNorm skews results? And which specific new ranking algorithms do they confuse, BM25F?

I think it's easiest to compare the two TF normalization functions. DefaultSimilarity really needs something like this because its function (sqrt) grows very fast for a single term. On the other hand, consider BM25's tf/(tf+lengthNorm): it saturates rather quickly for a single term, so when multiple terms are being scored, huge numbers of occurrences of a single term won't dominate the overall score. You can see this visually here (give it a second to load, and imagine documentLength = averageDocumentLength and k = 1.2): http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100

Also, I would expect the default SchemaSimilarityFactory to behave the same as DefaultSimilarity; this might raise some further confusion down the line.

That's OK: I'd rather the very expert case (per-field scoring) be trickier than have a trap for people who try to use any algorithm other than TFIDFSimilarity.

-- lucidimagination.com
defaultSearchField and param df are messed up in 3.6.x
Unfortunately I must say that defaultSearchField and the df param are pretty much messed up in Solr 3.6.x. Yes, I have seen issues SOLR-2724 and SOLR-3292.

If defaultSearchField has been removed (deprecated) from schema.xml, then why are there still calls to org.apache.solr.schema.IndexSchema.getDefaultSearchFieldName()? All these calls get no result, because there is no defaultSearchField. This also breaks edismax (ExtendedDismaxQParserPlugin) and several others. For example, in method parse() it tries:

queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
if (0 == queryFields.size()) {
  queryFields.put(req.getSchema().getDefaultSearchFieldName(), 1.0f);
}

Guess what: no result and an empty search :-(

A grep for getDefaultSearchFieldName pointed out that there are several places where this method is still in use in Solr 3.6.x.

A workaround is to enable defaultSearchField in schema.xml again. Or to fix all places in the code; e.g., for ExtendedDismaxQParserPlugin, method parse() must then read:

queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
if (0 == queryFields.size()) {
  queryFields.put(solrParams.get("df"), 1.0f);
}

or something similar.

I would also recommend enabling defaultOperator in schema.xml again, just in case they forgot to fix places that try to access defaultOperator.

Regards, Bernd
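For clarity, the workaround mentioned above is a single line in schema.xml; use whichever field should act as the default ("text" here is just the common example):

<defaultSearchField>text</defaultSearchField>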
Re: highlighter not respecting sentence boundary
Hi, here is how I get the snippet when "iPhone" is highlighted:

, a car charger and a battery backup for iPods and iPhones.

I expect this to start at the beginning of the sentence. Here is my Solr config:

<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    <boundaryScanner class="solr.highlight.SimpleBoundaryScanner" default="false" name="simple">
      <lst name="defaults">
        <str name="hl.bs.maxScan">200</str>
        <str name="hl.bs.chars">.</str>
      </lst>
    </boundaryScanner>
    <boundaryScanner class="solr.highlight.BreakIteratorBoundaryScanner" default="true" name="breakIterator">
      <lst name="defaults">
        <str name="hl.bs.type">SENTENCE</str>
        <str name="hl.bs.language">en</str>
        <str name="hl.bs.country">US</str>
      </lst>
    </boundaryScanner>
  </highlighting>
</searchComponent>

I am using the default breakIterator. This specific snippet gets better if I use a large fragsize like fragsize=300, but then some other snippets still do not start at the beginning of a sentence.
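A note for anyone experimenting with this: the hl.bs.* values above are also plain request parameters, so (with a hypothetical field name "body") the equivalent query-side experiment would be:

...&hl=true&hl.fl=body&hl.fragsize=300&hl.useFastVectorHighlighter=true&hl.bs.type=SENTENCE&hl.bs.language=en&hl.bs.country=US

One thing to check: the boundaryScanner is only consulted by the FastVectorHighlighter, which in turn requires termVectors="true" termPositions="true" termOffsets="true" on the highlighted field; with the regular highlighter these settings are silently ignored.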
Re: defaultSearchField and param df are messed up in 3.6.x
Besides the obvious need to clean up the getDefaultSearchFieldName references, I would also suggest that the df param have a hard-wired default of text, since that is the obvious default.

-- Jack Krupansky

-Original Message- From: Bernd Fehling Sent: Friday, June 08, 2012 10:15 AM To: solr-user@lucene.apache.org Subject: defaultSearchField and param df are messed up in 3.6.x

Unfortunately I must say that defaultSearchField and the df param are pretty much messed up in Solr 3.6.x. Yes, I have seen issues SOLR-2724 and SOLR-3292. If defaultSearchField has been removed (deprecated) from schema.xml, then why are there still calls to org.apache.solr.schema.IndexSchema.getDefaultSearchFieldName()? All these calls get no result, because there is no defaultSearchField. This also breaks edismax (ExtendedDismaxQParserPlugin) and several others. For example, in method parse() it tries:

queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
if (0 == queryFields.size()) {
  queryFields.put(req.getSchema().getDefaultSearchFieldName(), 1.0f);
}

Guess what: no result and an empty search :-( A grep for getDefaultSearchFieldName pointed out that there are several places where this method is still in use in Solr 3.6.x. A workaround is to enable defaultSearchField in schema.xml again. Or to fix all places in the code; e.g., for ExtendedDismaxQParserPlugin, method parse() must then read:

queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
if (0 == queryFields.size()) {
  queryFields.put(solrParams.get("df"), 1.0f);
}

or something similar. I would also recommend enabling defaultOperator in schema.xml again, just in case they forgot to fix places that try to access defaultOperator. Regards, Bernd
terms count in a multivalued field
Is it possible to get the number of entries present in a multivalued field via a Solr query? Let's say I want to query Solr to get all documents having a count of some multivalued field greater than 1. Is this possible in Solr?

-- Thanks and Regards, Preetesh Dubey
Re: ContentStreamUpdateRequest method addFile in 4.0 release.
For the ExtractingRequestHandler, you can put anything into the request contentType. Try: addFile(file, "application/octet-stream") - but anything should work.

ryan

On Thu, Jun 7, 2012 at 2:32 PM, Koorosh Vakhshoori kvakhsho...@gmail.com wrote: In the latest 4.0 release, the addFile() method has a new argument 'contentType': addFile(File file, String contentType). In the context of Solr Cell, how should the addFile() method be called? Specifically I refer to the wiki example:

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File("mailing_lists.pdf"));
up.setParam("literal.id", "mailing_lists.pdf");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
result = server.request(up);
assertNotNull("Couldn't upload mailing_lists.pdf", result);
rsp = server.query(new SolrQuery("*:*"));
Assert.assertEquals(1, rsp.getResults().getNumFound());

given at URL: http://wiki.apache.org/solr/ExtractingRequestHandler

Since Solr Cell calls Tika under the hood, isn't the file content type already identified by Tika? Looking at the code, it seems passing NULL would do the job; is that correct? Also, for Solr Cell, is the ContentStreamUpdateRequest class the right one to use, or is there a different class that is more appropriate here? Thanks
Re: Help! Confused about using Jquery for the Search query - Want to ditch it
Hi, what you want to do is not that difficult; you can use JSON, e.g.:

    try:
        conn = urllib.urlopen(url, params)
        page = conn.read()
        rsp = simplejson.loads(page)
        conn.close()
        return rsp
    except Exception, e:
        log.error(str(e))
        log.error(page)
        raise e

But this way you are initiating a connection each time, which is expensive; it would be better to pool the connections. As you can see, though, you can get JSON or XML either way.

Another option is to use solrpy:

import solr
import urllib

# create a connection to a solr server
s = solr.SolrConnection('http://localhost:8984/solr')
s.select = solr.SearchHandler(s, '/invenio')

def search(query, kwargs=None, fields=['id'], qt='invenio'):
    # do a remote search in solr
    url_params = urllib.urlencode([(k, v) for k, v in kwargs.items()
                                   if k not in ['_', 'req']])
    if 'rg' in kwargs and kwargs['rg']:
        rows = min(kwargs['rg'], 100)  # inv maximum limit is 100
    else:
        rows = 25
    response = s.query(query, fields=fields, rows=rows, qt=qt,
                       inv_params=url_params)
    num_found = response.numFound
    q_time = response.header['QTime']
    # more and return r

On Thu, Jun 7, 2012 at 3:16 PM, Ben Woods bwo...@quincyinc.com wrote: But, check out things like httplib2 and urllib2.

-Original Message- From: Spadez [mailto:james_will...@hotmail.com] Sent: Thursday, June 07, 2012 2:09 PM To: solr-user@lucene.apache.org Subject: RE: Help! Confused about using Jquery for the Search query - Want to ditch it

Thank you, that helps. The bit I am still confused about is how the Solr server sends the response back to the Python server, though. I get the impression that there are different ways this could be done, but is sending an XML response back to the Python server the best way to do it?
Writing custom data import handler for Solr.
Hi, I am planning to write a custom data import handler for Solr for some data source. Could you give me some pointers to documentation and examples on how to write a custom data import handler and how to integrate it with Solr? Thank you for the help. Thanks and regards, Ram Anam.
Re: Writing custom data import handler for Solr.
You need to back up a bit and describe _why_ you want to do this; perhaps there's an easy way to do what you want. This could easily be an XY problem...

For instance, you can write a SolrJ program to index data, which _might_ be what you want. It's a separate process, runnable anywhere. See: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

Best, Erick

On Fri, Jun 8, 2012 at 1:29 PM, ram anam ram_a...@hotmail.com wrote: Hi, I am planning to write a custom data import handler for Solr for some data source. Could you give me some pointers to documentation and examples on how to write a custom data import handler and how to integrate it with Solr? Thank you for the help. Thanks and regards, Ram Anam.
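For reference, the SolrJ route Erick mentions can be very small; a minimal sketch using the 3.x SolrJ API (the URL and field names are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    // point at the Solr core that should receive the documents
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");           // uniqueKey field
    doc.addField("title", "hello world");  // any field from your schema
    server.add(doc);
    server.commit();
  }
}

The same loop can pull rows from any data source reachable from Java, which is often simpler than packaging the logic as a DIH plugin.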
Adding Custom-Parser to Tika
Hi, I have written a new parser for Tika. The problem is that I have to edit org.apache.tika.parser.Parser in the tika.jar, but I do not want to edit the jar. Is there another way to register the new parser? It must work with a plain AutoDetectParser, since this is used in other parsers directly (e.g. RFC822Parser). Thank you.
Re: Adding Custom-Parser to Tika
Solr will find libs in the top-level directory solr/lib (next to solr.xml) or in a lib/ directory inside each core directory. You can put your new parser in a jar file in one of those places, like this:

solr/
solr/solr.xml
solr/lib
solr/lib/yourjar.jar
solr/collection1
solr/collection1/conf
solr/collection1/lib
solr/collection1/lib/yourjar.jar

On Fri, Jun 8, 2012 at 12:35 PM, spr...@gmx.eu wrote: Hi, I have written a new parser for Tika. The problem is that I have to edit org.apache.tika.parser.Parser in the tika.jar, but I do not want to edit the jar. Is there another way to register the new parser? It must work with a plain AutoDetectParser, since this is used in other parsers directly (e.g. RFC822Parser). Thank you.

-- Lance Norskog goks...@gmail.com
RE: Adding Custom-Parser to Tika
The parser must get registered in the service registry (META-INF/services/org.apache.tika.parser.Parser). Just being on the classpath does not work.

-Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Friday, 8 June 2012 22:38 To: solr-user@lucene.apache.org Subject: Re: Adding Custom-Parser to Tika

Solr will find libs in the top-level directory solr/lib (next to solr.xml) or in a lib/ directory inside each core directory. You can put your new parser in a jar file in one of those places, like this:

solr/
solr/solr.xml
solr/lib
solr/lib/yourjar.jar
solr/collection1
solr/collection1/conf
solr/collection1/lib
solr/collection1/lib/yourjar.jar

On Fri, Jun 8, 2012 at 12:35 PM, spr...@gmx.eu wrote: Hi, I have written a new parser for Tika. The problem is that I have to edit org.apache.tika.parser.Parser in the tika.jar, but I do not want to edit the jar. Is there another way to register the new parser? It must work with a plain AutoDetectParser, since this is used in other parsers directly (e.g. RFC822Parser). Thank you.

-- Lance Norskog goks...@gmail.com
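Concretely, the service registration is a plain text file inside the jar. With a hypothetical parser class com.example.MyCustomParser, the jar would contain a file at:

META-INF/services/org.apache.tika.parser.Parser

whose content is one fully qualified class name per line:

com.example.MyCustomParser

Tika's AutoDetectParser discovers parsers through this file, so no edit of tika.jar itself is needed.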
RE: Adding Custom-Parser to Tika
You can specify a tika.config option pointing at your own tika-config.xml file that the ExtractingRequestHandler will use to configure Tika with...

http://wiki.apache.org/solr/ExtractingRequestHandler

"The tika.config entry points to a file containing a Tika configuration. You would only need this if you have customized your own Tika configuration. The Tika config contains info about parsers, mime types, etc."

-Hoss
Re: Boost by Nested Query / Join Needed?
: For posterity, I think we're going to remove 'preference' data from Solr : indexing and go in the custom Function Query direction with a key-value : store.

That would be my suggestion. Assuming you really are modeling candy users, my guess is that the number of distinct candies you have is very large and the number of distinct users you have is very large, but the number of preferences per user is small to medium.

You can probably go very far by just storing your $user->[candy,weight] preference data in the key+val store of your choice, and then whenever a $user does a $search, augment the $search with the boost params based on the $user->[candy,weight] prefs.

If you find that you have too many prefs for some users, put a cap on the number of preferences you let influence the query (ie: only the top N weights, or only the N most confident weights, or N most recent prefs) or aggregate some prefs into category/manufacturer prefs instead of specific $candies, etc...

Having said all that: with the new Solr NRT stuff and the /get handler real time gets, you can treat another Solr core/server as your key+val store if you want -- but using straight SolrJoin won't let you take advantage of the weight boostings.

-Hoss
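As a hypothetical illustration of "augment the $search with the boost params" (the field name and weights are invented), each user's top preferences from the key-value store could be appended as a dismax/edismax boost query at request time:

q=chocolate&defType=edismax&qf=name description&bq=candy_id:123^4.0 candy_id:456^2.5

where the candy_id:123^4.0 clauses come straight from that user's [candy,weight] rows, capped as described above.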
RE: Writing custom data import handler for Solr.
Hi Erick, I cannot disclose the data source which we are planning to index in Solr, as it is confidential. But the client wants it to be in the form of an import handler. We plan to install Solr and our custom data import handlers so that the client can just consume them. Could you please provide me pointers to examples of custom data import handlers? Thanks and regards, Ram Anam.

Date: Fri, 8 Jun 2012 13:59:34 -0400 Subject: Re: Writing custom data import handler for Solr. From: erickerick...@gmail.com To: solr-user@lucene.apache.org

You need to back up a bit and describe _why_ you want to do this; perhaps there's an easy way to do what you want. This could easily be an XY problem... For instance, you can write a SolrJ program to index data, which _might_ be what you want. It's a separate process, runnable anywhere. See: http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/ Best, Erick

On Fri, Jun 8, 2012 at 1:29 PM, ram anam ram_a...@hotmail.com wrote: Hi, I am planning to write a custom data import handler for Solr for some data source. Could you give me some pointers to documentation and examples on how to write a custom data import handler and how to integrate it with Solr? Thank you for the help. Thanks and regards, Ram Anam.
Re: Writing custom data import handler for Solr.
The DataImportHandler is a toolkit in Solr with a few different kinds of plugins, so it is very possible that you do not have to write any Java code. If you have an unusual external data feed (database, file system, Amazon S3 buckets), then you would write a DataSource. The only examples are the source code in trunk/solr/contrib/dataimporthandler.

http://wiki.apache.org/solr/DataImportHandler

On Fri, Jun 8, 2012 at 8:35 PM, ram anam ram_a...@hotmail.com wrote: Hi Erick, I cannot disclose the data source which we are planning to index in Solr, as it is confidential. But the client wants it to be in the form of an import handler. We plan to install Solr and our custom data import handlers so that the client can just consume them. Could you please provide me pointers to examples of custom data import handlers? Thanks and regards, Ram Anam. [...]

-- Lance Norskog goks...@gmail.com
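To make the DataSource option concrete, here is a hedged sketch of a custom DIH DataSource; the feed details are hypothetical placeholders, and the class would be referenced from data-config.xml via its fully qualified name:

import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

public class MyFeedDataSource extends DataSource<Reader> {
  @Override
  public void init(Context context, Properties initProps) {
    // read connection settings (URL, credentials, ...) from the
    // <dataSource .../> attributes in data-config.xml
  }

  @Override
  public Reader getData(String query) {
    // fetch the raw data that 'query' identifies in your feed and
    // return it as a Reader for the entity processor to consume
    return new StringReader("...");
  }

  @Override
  public void close() {
    // release any connections or handles held by the feed client
  }
}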
Re: Adding Custom-Parser to Tika
The doc is old. Tika hunts for parsers in the classpath now.

http://www.lucidimagination.com/search/link?url=https://issues.apache.org/jira/browse/SOLR-2116?focusedCommentId=12977072#action_12977072

On Fri, Jun 8, 2012 at 2:20 PM, Chris Hostetter hossman_luc...@fucit.org wrote: You can specify a tika.config option pointing at your own tika-config.xml file that the ExtractingRequestHandler will use to configure Tika with... http://wiki.apache.org/solr/ExtractingRequestHandler "The tika.config entry points to a file containing a Tika configuration. You would only need this if you have customized your own Tika configuration. The Tika config contains info about parsers, mime types, etc." -Hoss

-- Lance Norskog goks...@gmail.com
What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory
Greetings,

I am in the process of updating custom code and schema from Solr 1.4 to 3.6.0 and have run into the following issue with our two custom Tokenizer and TokenFilter components. I've been banging my head against this one for far too long, especially since it must be something obvious I'm missing.

I have custom Tokenizer and TokenFilter components along with corresponding factories. The code for all looks very similar to the Tokenizer and TokenFilter (and factory) code that is standard with 3.6.0 (and I have also read through http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters).

I have ensured my custom code is on the classpath; it is in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar:

---output snip---
Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load
INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en
Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader
Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar' to classloader
Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create
---snip---

After successfully parsing the schema and creating many fields, etc., the following is logged:

---snip---
Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : com.company.MyCustomTokenizerFactory
Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory
    at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148)
    at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
    at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
    at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
    at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
    at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
    at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:102)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
    at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:748)
    at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:249)
    at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1222)
    at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:676)
    at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:455)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
    at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
    at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
    at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
    at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
    at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
    at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
    at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
    at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
    at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
    at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
    at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
    at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
    at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
    at ...
Re: What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory
Just in case it is helpful, here are the relevant pieces of my schema.xml:

---snip---
<fieldtype name="customfield" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.company.MyCustomTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
  </analyzer>
</fieldtype>
---snip---

and

---snip---
<fieldtype name="customterms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.company.MyCustomFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\-" replacement=" " replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="&amp;amp;" replacement="&amp;" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
---snip---

On Sat, Jun 9, 2012 at 12:03 AM, Aaron Daubman daub...@gmail.com wrote: Greetings, I am in the process of updating custom code and schema from Solr 1.4 to 3.6.0 and have run into the following issue with our two custom Tokenizer and TokenFilter components. [...]