Carrot2 using rawtext of field for clustering

2012-06-08 Thread Chandan Tamrakar
Is there any workaround in Solr/Carrot2 so that we could pass tokens that have
been filtered with our custom tokenizers/filters, instead of the raw text that it
currently uses for clustering?

I also read about a related issue at the following link:

https://issues.apache.org/jira/browse/SOLR-2917


Is writing our own parsers to filter text documents before indexing to Solr
currently the only right approach? Please let me know if anyone has come across
this issue and has other, better suggestions.

-- 
Chandan Tamrakar


Re: timeAllowed flag in the response

2012-06-08 Thread Michael Kuhlmann

Hi Laurent,

Alas, there is currently no such option. The time limit is handled by an 
internal TimeLimitingCollector, which is used inside SolrIndexSearcher. 
Since the method that uses it only returns the DocList and doesn't have access 
to the QueryResult, it won't be easy to return this information in a 
clean way.


Aborted queries don't feed the caches, so you could perhaps check whether 
the cache fill rate has changed. Of course, this is not a reasonable 
approach in a production environment.


The only way you can get the information is by patching Solr with a 
dirty hack.


Greetings,
Kuli

On 07.06.2012 22:14, Laurent Vaills wrote:

Hi everyone,

We have some grouping queries that take quite long to execute. Some take too
long and are not acceptable. We have set up a timeout on the socket, but with
that we get no result and the query keeps running on the Solr side.
So we are now using the timeAllowed parameter, which is a good compromise.
However, how can we tell from the response that the query was stopped
because it took too long?

I need this information for monitoring and to tell the user that the
results are not complete.

Regards,
Laurent





Re: Sorting performance

2012-06-08 Thread Dmitry Kan
Hi,

probably this may help you start:

https://issues.apache.org/jira/browse/SOLR-1297
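
Once sorting by a function query is available (which is what SOLR-1297 is about),
a Levenshtein-based sort can be expressed roughly like this (just a sketch; the
field name is hypothetical):

sort=strdist("jim", name_exact, edit) desc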

Dmitry

On Mon, Jun 4, 2012 at 9:51 PM, Gau gauravshe...@gmail.com wrote:

 Here is the use case:
 I am using synonym expansion at query time to get results. This is
 essentially a name search, so a search for Jim may be expanded at query time
 to James, Jung, Jimmy, etc.

 So ranking factors like TF, IDF, and norms do not mean anything to me; I just
 reset them to zero, so all the results I get have the same rank. I
 have used a copy field to boost the weight of exact matches, so Jim would be
 boosted to the top.

 However, I want the other results like Jimmy, Jung, James to be sorted by
 Levenshtein distance with respect to the word Jim (the original query). The
 number of results returned is quite large, so a general strdist sort takes
 6-7 seconds. Is there any other option than applying a sort= in the query
 to achieve the same functionality? Any particular way to index the data to
 achieve the same result? Any idea how to boost the performance and still get
 the intended functionality?





-- 
Regards,

Dmitry Kan


RE: per-fieldtype similarity not working

2012-06-08 Thread Markus Jelsma
Thanks Robert,

The difference in scores is clear now, so it shouldn't matter, as queryNorm 
doesn't affect ranking but coord does. Can you explain why coord is now left 
out, why it is considered to skew results, and why queryNorm skews results? 
And which specific new ranking algorithms do they confuse, BM25F? 

Also, I would expect the default SchemaSimilarityFactory to behave the same as 
DefaultSimilarity; this might raise some further confusion down the line.

I'll open an issue for the lack of Similarity impl. in the debug output when 
per-field similarity is enabled.

Cheers!

 
 
-Original message-
 From:Robert Muir rcm...@gmail.com
 Sent: Fri 01-Jun-2012 18:16
 To: solr-user@lucene.apache.org
 Subject: Re: per-fieldtype similarity not working
 
 On Fri, Jun 1, 2012 at 11:39 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
  Hi!
 
 
  Ah, it makes sense now! This globally configured similarity now returns a 
  fieldType-defined similarity if available, and if not, the standard Lucene 
  similarity. This would, I assume, mean that the two defined similarities 
  below, without per-fieldType declared similarities, would always yield the 
  same results?
 
  Not true: note that two methods (coord and queryNorm) are not per-field
  but global across the entire query tree.
  
  By default these are disabled in the wrapper, as they respectively only skew or
  confuse most modern scoring algorithms (e.g. all the new ranking
  algorithms in Lucene 4).
 
 So if you want to do per-field scoring where *all* of your sims are
 vector-space, it could make sense to customize (e.g. subclass)
 SchemaSimilarityFactory and do something useful for these methods.
 
 
 -- 
 lucidimagination.com
 


Re: Carrot2 using rawtext of field for clustering

2012-06-08 Thread Stanislaw Osinski

 Is there any workaround in Solr/Carrot2 so that we could pass tokens that have
  been filtered with our custom tokenizers/filters, instead of the raw text that
  it currently uses for clustering?

  I also read about a related issue at the following link:

  https://issues.apache.org/jira/browse/SOLR-2917


  Is writing our own parsers to filter text documents before indexing to Solr
  currently the only right approach? Please let me know if anyone has come across
  this issue and has other, better suggestions.


Until SOLR-2917 is resolved, this solution seems the easiest to implement.
Alternatively, you could provide a custom implementation of Carrot2's
tokenizer (
http://download.carrot2.org/stable/javadoc/org/carrot2/text/analysis/ITokenizer.html)
through the appropriate factory attribute (
http://doc.carrot2.org/#section.attribute.lingo.PreprocessingPipeline.tokenizerFactory).
The custom implementation would need to apply the required filtering.
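
If it helps, Carrot2 attributes can be passed to the clustering engine in
solrconfig.xml, so wiring in a custom tokenizer factory might look roughly like
this (a sketch only; the class name is hypothetical and the exact attribute key
should be checked against the Carrot2 documentation linked above):

<lst name="engine">
  <str name="name">default</str>
  <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  <str name="PreprocessingPipeline.tokenizerFactory">com.example.MyCarrot2TokenizerFactory</str>
</lst>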

Regardless of the approach, one thing to keep in mind is that Carrot2 draws
labels from the input text, so if your filtered stream omits e.g.
prepositions, the labels will be less readable.

Staszek


Re: what's better for in memory searching?

2012-06-08 Thread Lance Norskog
Yes, use MMapDirectory. It is faster and uses memory more efficiently
than RAMDirectory. This sounds wrong, but it is true. With
RAMDirectory, Java has to work harder doing garbage collection.
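
For example, opening an on-disk index through mmap is just a few lines (a minimal
sketch against the Lucene 3.x API; the index path is a placeholder):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class MMapSearchExample {
  public static void main(String[] args) throws Exception {
    // The index stays on disk; the OS file system cache keeps the hot parts in RAM.
    Directory dir = new MMapDirectory(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    // ... run queries with 'searcher' ...
    searcher.close();
    reader.close();
  }
}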

On Fri, Jun 8, 2012 at 1:30 AM, Li Li fancye...@gmail.com wrote:
 Hi all,
   I want to use Lucene 3.6 to provide a search service. My data is
  not very large; the raw data is less than 1GB, and I want to load all
  indexes into memory. I also need to save all indexes to disk
  persistently.
   I originally wanted to use RAMDirectory, but then I read its javadoc:

   Warning: This class is not intended to work with huge indexes. Everything
  beyond several hundred megabytes will waste resources (GC cycles), because it
  uses an internal buffer size of 1024 bytes, producing millions of byte[1024]
  arrays. This class is optimized for small memory-resident indexes. It also has
  bad concurrency on multithreaded environments. It is recommended to materialize
  large indexes on disk and use MMapDirectory, which is a high-performance
  directory implementation working directly on the file system cache of the
  operating system, so copying data to Java heap space is not useful.

    Should I use MMapDirectory? It seems it is instantiated as another contrib.
  Has anyone tested it against RAMDirectory?



-- 
Lance Norskog
goks...@gmail.com


Re: timeAllowed flag in the response

2012-06-08 Thread Laurent Vaills
Hi Michael,

Thanks for the details that helped me to take a deeper look in the source
code. I noticed that each time a TimeExceededException is caught the method
 setPartialResults(true) is called...which seems to be what I'm looking for.
I have to investigate, since this partialResults does not seem to be set
for the sharded queries.

Regards,
Laurent

Maybe there is a way to write a not so dirty patch with a new .

2012/6/8 Michael Kuhlmann k...@solarier.de

 Hi Laurent,

 alas there is currently no such option. The time limit is handled by an
 internal TimeLimitingCollector, which is used inside SolrIndexSearcher.
 Since the using method only returns the DocList and doesn't have access to
 the QueryResult, it won't be easy to return this information in a beautiful
 way.

 Aborted Queries don't feed the caches, so you maybe can check whether the
 cache fill rate has changed, Of course, this is no reasonable approach in
 production environment.

 The only way you can get the information is by patching Solr with a dirty
 hack.

 Greetings,
 Kuli

 On 07.06.2012 22:14, Laurent Vaills wrote:

  Hi everyone,

 We have some grouping queries that are quite long to execute. Some are too
 long to execute and are not acceptable. We have setup timeout for the
 socket but with this we get no result and the query is still running on
 the
 Solr side.
 So, we are now using the timeAllowed parameter which is a good compromise.
 However, in the response, how can we know that the query was stopped
 because it was too long ?

 I need this information for monitoring and to tell the user that the
 results are not complete.

 Regards,
 Laurent





Re: How to cap facet counts beyond a specified limit

2012-06-08 Thread Toke Eskildsen
On Thu, 2012-06-07 at 10:01 +0200, Andrew Laird wrote:
 For our needs we don't really need to know that a particular facet has
 exactly 14,203,527 matches - just knowing that there are more than a
 million is enough.  If I could somehow limit the hit counts to a
 million (say) [...]

It should be feasible to stop the collector after 1M documents have been
processed, if nothing else then just by ignoring subsequent IDs.
However, the IDs received would be in index order, which normally means
old-to-new. If the nature of the corpus, and thereby the facet values,
changes over time, this change would not be reflected in the facets that
have many hits, as the collector never reaches the newer documents.
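
To illustrate the "ignore subsequent IDs" idea, a capping wrapper around an
existing Collector could look roughly like this (a sketch against the Lucene 3.x
API, not something Solr ships with):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class CappedCollector extends Collector {
  private final Collector delegate;
  private final int maxHits;
  private int collected = 0;

  public CappedCollector(Collector delegate, int maxHits) {
    this.delegate = delegate;
    this.maxHits = maxHits;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    delegate.setScorer(scorer);
  }

  @Override
  public void collect(int doc) throws IOException {
    // Simply stop forwarding hits once the cap is reached.
    if (collected++ < maxHits) {
      delegate.collect(doc);
    }
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    delegate.setNextReader(reader, docBase);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return delegate.acceptsDocsOutOfOrder();
  }
}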

 it seems like that could decrease the work required to
 compute the values (just stop counting after the limit is reached) and
 potentially improve faceted search time - especially when we have 20-30
 fields to facet on.  Has anyone else tried to do something like this?

The current Solr facet implementation treats every facet structure
individually. It works fine in a lot of areas, but it also means that the
list of IDs for matching documents is iterated once for every facet: in
the sample case, 14M+ hits * 25 fields = 350M+ hits processed.

I have been experimenting with an alternative approach (SOLR-2412) that
packs the terms of all the facets into a single structure under the hood,
which means only 14M+ hits processed in the current case. Unfortunately
it is not mature and only works for text fields.

- Toke Eskildsen, State and University Library, Denmark



appear garbled when I use DIH from oracle database

2012-06-08 Thread 涂小刚
Hello:
 When I use DIH to import from an Oracle database, the imported text appears
garbled. Why? PS: my Oracle database uses GBK encoding with Chinese text.
How can I solve the problem?
Thanks!


Re: per-fieldtype similarity not working

2012-06-08 Thread Robert Muir
On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Thanks Robert,

 The difference in scores is clear now so it shouldn't matter as queryNorm 
 doesn't affect ranking but coord does. Can you explain why coord is left out 
 now and why it is considered to skew results and why queryNorm skews results? 
 And which specific new ranking algorithms they confuse, BM25F?

I think it's easiest to compare the two TF normalization functions.
DefaultSimilarity really needs something like this because its
function (sqrt) grows very fast for a single term.
On the other hand, consider BM25's: tf/(tf+lengthNorm). It saturates
rather quickly for a single term, so when multiple terms are being
scored, huge numbers of occurrences of a single term won't dominate
the overall score.

You can see this visually here (give it a second to load, and imagine
documentLength = averageDocumentLength and k=1.2):
http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
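
(For a concrete feel: at 100 occurrences of a term, sqrt(100) = 10 and keeps
growing, while 100/(100 + 1.2) is roughly 0.99 and has essentially saturated.)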


 Also, i would expect the default SchemaSimilarityFactory to behave the same 
 as DefaultSimilarity this might raise some further confusion down the line.

That's OK: I'd rather the very expert case (per-field scoring) be
trickier than have a trap for people who try to use any algorithm
other than TFIDFSimilarity.

-- 
lucidimagination.com


track unused parts of config, schema

2012-06-08 Thread bryan rasmussen
Hi,

Our configs and schemas are quite big. Are there any tools, code snippets
(in any language), or methodologies that people use to clean them up?

By methodologies I mainly mean things to look for that are
almost always there and almost never used, so I can look at those
first.

Thanks,
Bryan Rasmussen


Re: ExtendedDisMax Question - Strange behaviour

2012-06-08 Thread André Maldonado
Thanks, Jack. It is exactly this. My mistake.

Thanks
--
*And you shall know the truth, and the truth shall set you free. (João 8:32)*

 *andre.maldonado*@gmail.com andre.maldon...@gmail.com
 (11) 9112-4227

http://www.orkut.com.br/Main#Profile?uid=2397703412199036664
http://www.facebook.com/profile.php?id=10659376883
  http://twitter.com/andremaldonado http://www.delicious.com/andre.maldonado
  https://profiles.google.com/105605760943701739931
http://www.linkedin.com/pub/andr%C3%A9-maldonado/23/234/4b3
  http://www.youtube.com/andremaldonado




On Wed, Jun 6, 2012 at 5:50 PM, Jack Krupansky j...@basetechnology.com wrote:

 First, it appears that you are using the dismax query parser, not the
 extended dismax (edismax) query parser.

 My hunch is that some of those fields may be non-tokenized string fields
 in which one or more of your search keywords do appear but not as the full
 string value or maybe with a different case than in the query. But when you
 do a copyField from a string field to a tokenized text field those
 strings
 would be broken up into individual keywords and probably lowercased. So, it
 will be easier for a document to match the combined text field than the
 source string fields. A fair percentage of the terms may occur in both
 text and string fields, but it looks like a fair percentage may occur
 only in the string fields.

 Identify a specific document that is returned by the first query and not
 the
 second. Then examine each non-text string field value of that document to
 see if the query terms would match after text field analysis but are not
 exact string matches for the string fields in which the terms do occur.

 -- Jack Krupansky
 -Original Message- From: André Maldonado
 Sent: Wednesday, June 06, 2012 9:23 AM
 To: solr-user@lucene.apache.org
 Subject: Re: ExtendedDisMax Question - Strange behaviour


 Erick, thanks for your reply and sorry for the confusion in the last e-mail.
 But it is hard to explain the situation without that bunch of code.
 ...




Re: timeAllowed flag in the response

2012-06-08 Thread Michael Kuhlmann

On 08.06.2012 11:55, Laurent Vaills wrote:

Hi Michael,

Thanks for the details that helped me to take a deeper look in the source
code. I noticed that each time a TimeExceededException is caught the method
  setPartialResults(true) is called...which seems to be what I'm looking for.
I have to investigate, since this partialResults does not seem to be set
for the sharded queries.


Ah, I was simply too blind! ;) The partial results flag is indeed set in 
the response header.
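
For completeness, a client can read that flag from the response header; with
SolrJ it would look roughly like this (a sketch, not something from this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PartialResultsCheck {
  public static boolean isPartial(SolrServer server, String q) throws Exception {
    SolrQuery query = new SolrQuery(q);
    query.set("timeAllowed", 500); // time budget in milliseconds
    QueryResponse rsp = server.query(query);
    // Solr adds this header entry when the time limit aborted the search.
    return Boolean.TRUE.equals(rsp.getResponseHeader().get("partialResults"));
  }
}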


Then I think it is a bug that it's not filled in for a sharded response, 
or it simply is not set when sharding.


Greetings,
Kuli


RE: per-fieldtype similarity not working

2012-06-08 Thread Markus Jelsma
Excellent!
Thanks

 
 
-Original message-
 From:Robert Muir rcm...@gmail.com
 Sent: Fri 08-Jun-2012 13:06
 To: Markus Jelsma markus.jel...@openindex.io
 Cc: solr-user@lucene.apache.org
 Subject: Re: per-fieldtype similarity not working
 
 On Fri, Jun 8, 2012 at 5:04 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
  Thanks Robert,
 
  The difference in scores is clear now so it shouldn't matter as queryNorm 
  doesn't affect ranking but coord does. Can you explain why coord is left 
  out now and why it is considered to skew results and why queryNorm skews 
  results? And which specific new ranking algorithms they confuse, BM25F?
 
 I think its easiest to compare the two TF normalization functions,
 DefaultSimilarity really needs something like this because its
 function (sqrt) grows very fast for a single term.
 On the other hand, consider BM25's: tf/(tf+lengthNorm), it saturates
 rather quickly for a single term, so when multiple terms are being
 scored, huge numbers of occurrences of a single term won't dominate
 the overall score.
 
 You can see this visually here (give it a second to load, and imagine
 documentLength = averageDocumentLength and k=1.2):
 http://www.wolframalpha.com/input/?i=plot+sqrt%28x%29%2C+x%2F%28x%2B1.2%29%2C+x%3D1+to+100
 
 
  Also, i would expect the default SchemaSimilarityFactory to behave the same 
  as DefaultSimilarity this might raise some further confusion down the line.
 
 Thats ok: I'd rather the very expert case (Per-Field scoring) be
 trickier than have a trap for people that try to use any algorithm
 other than TFIDFSimilarity
 
 -- 
 lucidimagination.com
 


defaultSearchField and param df are messed up in 3.6.x

2012-06-08 Thread Bernd Fehling
Unfortunately I must see that defaultSearchField and param df are
pretty much messed up in solr 3.6.x
Yes, I have seen issue SOLR-2724 and SOLR-3292.

So if defaultSearchField has been removed (deprecated) from schema.xml, then why
are there still calls to
org.apache.solr.schema.IndexSchema.getDefaultSearchFieldName()?

All these calls get no result, because there is no defaultSearchField.
This also breaks edismax (ExtendedDismaxQParserPlugin) and several others.
For example, in the parse() method it does:
...
queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
if (0 == queryFields.size()) {
  queryFields.put(req.getSchema().getDefaultSearchFieldName(), 1.0f);
}
...

Guess what: no result and an empty search :-(
A grep for getDefaultSearchFieldName showed that there are several
places where this method is still in use in Solr 3.6.x.

A workaround is to enable defaultSearchField in schema.xml again.
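
For example (assuming your catch-all field is called text):

<defaultSearchField>text</defaultSearchField>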

Or to fix all places in the code; e.g. the ExtendedDismaxQParserPlugin method
parse() must then read
...
queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
if (0 == queryFields.size()) {
  queryFields.put(solrParams.get("df"), 1.0f);
}
...

or something similar.

I would also recommend enabling defaultOperator in schema.xml again, just in
case they forgot to fix places that try to access defaultOperator.


Regards
Bernd


Re: highlighter not respecting sentence boundary

2012-06-08 Thread abhayd
Hi,
Here is the snippet I get when "i phone" is highlighted:
==
, a car charger and a battery backup for iPods and iPhones.

I expect this to start from the beginning of the sentence.

Here is my Solr config:
===
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    <boundaryScanner class="solr.highlight.SimpleBoundaryScanner"
                     default="false" name="simple">
      <lst name="defaults">
        <str name="hl.bs.maxScan">200</str>
        <str name="hl.bs.chars">.</str>
      </lst>
    </boundaryScanner>
    <boundaryScanner class="solr.highlight.BreakIteratorBoundaryScanner"
                     default="true" name="breakIterator">
      <lst name="defaults">
        <str name="hl.bs.type">SENTENCE</str>
        <str name="hl.bs.language">en</str>
        <str name="hl.bs.country">US</str>
      </lst>
    </boundaryScanner>
==
I'm using the default breakIterator. This specific snippet gets better
if I use a large fragSize like fragSize=300, but then some other snippets
still do not start at the beginning of a sentence.





Re: defaultSearchField and param df are messed up in 3.6.x

2012-06-08 Thread Jack Krupansky
Besides the obvious need to clean up the getDefaultSearchFieldName 
references, I would also suggest that the df param have a hard-wired 
default of "text", since that is the obvious default.


-- Jack Krupansky

-Original Message- 
From: Bernd Fehling

Sent: Friday, June 08, 2012 10:15 AM
To: solr-user@lucene.apache.org
Subject: defaultSearchField and param df are messed up in 3.6.x

Unfortunately I must see that defaultSearchField and param df are
pretty much messed up in solr 3.6.x
Yes, I have seen issue SOLR-2724 and SOLR-3292.

So if defaultSearchField has been removed (deprecated) from schema.xml then 
why
are the still calls to 
org.apache.solr.schema.IndexSchema.getDefaultSearchFieldName()?


All these calls get no result, because there is no defaultSearchField.
This also breaks edismax (ExtendedDismaxQParserPlugin) and several other.
As example in method parse() it tries
...
   queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
   if (0 == queryFields.size()) {
 queryFields.put(req.getSchema().getDefaultSearchFieldName(), 1.0f);
   }
...

Guess what, yes no result and an empty search :-(
A grep for getDefaultSearchFieldName pointed out that there are several
places where this method is still in use for sorl 3.6.x.

A workaround is to enable defaultSearchField in schema.xml again.

Or to fix all places in the code, e.g. for ExtendedDismaxQParserPlugin 
method parse()

must then read
...
   queryFields = U.parseFieldBoosts(solrParams.getParams(DMP.QF));
   if (0 == queryFields.size()) {
 queryFields.put(solrParams.getParams(df));
   }
...

or something similar.

I would also recommend to enable defaultOperator in schema.xml again. Just 
in case

they forgot to fix places where they try to access defaultOperator.


Regards
Bernd 



terms count in multivalues field

2012-06-08 Thread preetesh dubey
Is it possible to get the number of entries present in a multivalued field via a
Solr query? Let's say I want to query Solr to get all documents having a
*count* of some multivalued field > 1. Is this possible in Solr?

-- 
Thanks  Regards
Preetesh Dubey


Re: ContentStreamUpdateRequest method addFile in 4.0 release.

2012-06-08 Thread Ryan McKinley
For the ExtractingRequestHandler, you can put anything into the
request contentType.

Try:
addFile(file, "application/octet-stream")

but anything should work.
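
In the wiki example quoted below, that would be roughly (using the wiki's own
placeholder file name):

up.addFile(new File("mailing_lists.pdf"), "application/octet-stream");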

ryan




On Thu, Jun 7, 2012 at 2:32 PM, Koorosh Vakhshoori
kvakhsho...@gmail.com wrote:
 In the latest 4.0 release, the addFile() method has a new argument 'contentType':

 addFile(File file, String contentType)

 In the context of Solr Cell, how should the addFile() method be called?
 Specifically, I refer to the wiki example:

 ContentStreamUpdateRequest up = new
 ContentStreamUpdateRequest("/update/extract");
 up.addFile(new File("mailing_lists.pdf"));
 up.setParam("literal.id", "mailing_lists.pdf");
 up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
 result = server.request(up);
 assertNotNull("Couldn't upload mailing_lists.pdf", result);
 rsp = server.query(new SolrQuery("*:*"));
 Assert.assertEquals(1, rsp.getResults().getNumFound());

 given at URL: http://wiki.apache.org/solr/ExtractingRequestHandler

 Since Solr Cell calls Tika under the hood, isn't the file
 content-type already identified by Tika? Looking at the code, it seems
 passing NULL would do the job; is that correct? Also, for Solr Cell, is the
 ContentStreamUpdateRequest class the right one to use, or is there a
 different class that is more appropriate here?

 Thanks




Re: Help! Confused about using Jquery for the Search query - Want to ditch it

2012-06-08 Thread Roman Chyla
Hi,
what you want to do is not that difficult; you can use JSON, e.g.:

import urllib
import simplejson

# (wrapped in a function here so the snippet is self-contained;
#  'log' is assumed to be a module-level logger)
def solr_search(url, params):
    try:
        conn = urllib.urlopen(url, params)
        page = conn.read()
        rsp = simplejson.loads(page)
        conn.close()
        return rsp
    except Exception, e:
        log.error(str(e))
        log.error(page)
        raise e

but this way you are initiating a connection each time, which is
expensive; it would be better to pool the connections.

But as you can see, you can get JSON or XML either way.

another option is to use solrpy

import solr
import urllib

# create a connection to a solr server
s = solr.SolrConnection('http://localhost:8984/solr')
s.select = solr.SearchHandler(s, '/invenio')

def search(query, kwargs=None, fields=['id'], qt='invenio'):
    # do a remote search in solr
    url_params = urllib.urlencode([(k, v) for k, v in kwargs.items()
                                   if k not in ['_', 'req']])

    if 'rg' in kwargs and kwargs['rg']:
        rows = min(kwargs['rg'], 100)  # inv maximum limit is 100
    else:
        rows = 25
    response = s.query(query, fields=fields, rows=rows, qt=qt,
                       inv_params=url_params)
    num_found = response.numFound
    q_time = response.header['QTime']
    # more and return


On Thu, Jun 7, 2012 at 3:16 PM, Ben Woods bwo...@quincyinc.com wrote:
 But, check out things like httplib2 and urllib2.

 -Original Message-
 From: Spadez [mailto:james_will...@hotmail.com]
 Sent: Thursday, June 07, 2012 2:09 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Help! Confused about using Jquery for the Search query - Want to 
 ditch it

 Thank you, that helps. The bit I am still confused about is how the Solr server 
 sends the response to the Python server, though. I get the impression that there 
 are different ways this could be done, but is sending an XML response back 
 to the Python server the best way to do it?







Writing custom data import handler for Solr.

2012-06-08 Thread ram anam

Hi,
 
I am planning to write a custom data import handler for Solr for some data
source. Could you give me some pointers to documentation and examples on how to
write a custom data import handler and how to integrate it with Solr? Thank you
for the help.

Thanks and regards,
Ram Anam.

Re: Writing custom data import handler for Solr.

2012-06-08 Thread Erick Erickson
You need to back up a bit and describe _why_ you want to do this;
perhaps there's an easy way to do what you want. This could easily be an XY problem...

For instance, you can write a SolrJ program to index data, which _might_ be
what you want. It's a separate process runnable anywhere. See:
http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
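
For instance, the core of such a program is only a few lines (a sketch using the
SolrJ 3.x API; the URL and field names are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");              // unique key field
    doc.addField("title", "hello world"); // any other field in your schema
    server.add(doc);
    server.commit();                      // make the document searchable
  }
}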

Best
Erick

On Fri, Jun 8, 2012 at 1:29 PM, ram anam ram_a...@hotmail.com wrote:

 Hi,

 I am planning to write a custom data import handler for SOLR for some data 
 source. Could you give me some pointers to documentation, examples on how to 
 write a custom data import handler and how to integrate it with SOLR. Thank 
 you for help. Thanks and regards,Ram Anam.


Adding Custom-Parser to Tika

2012-06-08 Thread spring
Hi,

I have written a new parser for Tika. The problem is that I have to edit
org.apache.tika.parser.Parser in the tika.jar, but I do not want to edit the
jar. Is there another way to register the new parser? It must work with a
plain AutoDetectParser, since this is used in other parsers directly (e.g.
RFC822Parser).

Thank you.



Re: Adding Custom-Parser to Tika

2012-06-08 Thread Lance Norskog
Solr will find libs in top-level directory solr/lib (next to solr.xml)
or a lib/ directory inside each core directory. You can put your new
parser in a jar file in one of those places. Like this:

solr/
solr/solr.xml
solr/lib
solr/lib/yourjar.jar
solr/collection1
solr/collection1/conf
solr/collection1/lib
solr/collection1/lib/yourjar.jar

On Fri, Jun 8, 2012 at 12:35 PM,  spr...@gmx.eu wrote:
 Hi,

 I have written a new parser for tika. The problem is, that I have to edit
 org.apache.tika.parser.Parser in the tika.jar. But I do not want to edit the
 jar. Is the another way to register the new parser? It must work with a
 plain AutoDetectParser, since this is used in oder Parsers directly (e.g.
 RFC822Parser).

 Thank you.




-- 
Lance Norskog
goks...@gmail.com


RE: Adding Custom-Parser to Tika

2012-06-08 Thread spring
The parser must get registered in the service registry
(META-INF/services/org.apache.tika.parser.Parser). Just being in the
classpath does not work. 
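
For reference, that registration is just a plain text file on the classpath: a
file named META-INF/services/org.apache.tika.parser.Parser containing one fully
qualified parser class name per line, e.g. (class name hypothetical):

com.example.MyCustomParser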

 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com] 
 Sent: Freitag, 8. Juni 2012 22:38
 To: solr-user@lucene.apache.org
 Subject: Re: Adding Custom-Parser to Tika
 
 Solr will find libs in top-level directory solr/lib (next to solr.xml)
 or a lib/ directory inside each core directory. You can put your new
 parser in a jar file in one of those places. Like this:
 
 solr/
 solr/solr.xml
 solr/lib
 solr/lib/yourjar.jar
 solr/collection1
 solr/collection1/conf
 solr/collection1/lib
 solr/collection1/lib/yourjar.jar
 
 On Fri, Jun 8, 2012 at 12:35 PM,  spr...@gmx.eu wrote:
  Hi,
 
  I have written a new parser for tika. The problem is, that 
 I have to edit
  org.apache.tika.parser.Parser in the tika.jar. But I do not 
 want to edit the
  jar. Is the another way to register the new parser? It must 
 work with a
  plain AutoDetectParser, since this is used in oder Parsers 
 directly (e.g.
  RFC822Parser).
 
  Thank you.
 
 
 
 
 -- 
 Lance Norskog
 goks...@gmail.com
 



RE: Adding Custom-Parser to Tika

2012-06-08 Thread Chris Hostetter
You can specify a tika.config option pointing at your own 
tika-config.xml file that ExtractingRequestHandler will use to configure 
Tika with...

http://wiki.apache.org/solr/ExtractingRequestHandler

The tika.config entry points to a file containing a Tika configuration. 
You would only need this if you have customized your own Tika 
configuration. The Tika config contains info about parsers, mime types, 
etc.


-Hoss


Re: Boost by Nested Query / Join Needed?

2012-06-08 Thread Chris Hostetter

: For posterity, I think we're going to remove 'preference' data from Solr
: indexing and go in the custom Function Query direction with a key-value
: store.

that would be my suggestion.

Assuming you really are modeling candy & users, my guess is the number of 
distinct candies you have is very large and the number of distinct users 
you have is very large, but the number of preferences per user is small 
to medium.

You can probably go very far by just storing your $user->[candy,weight] 
preference data in the key+val store of your choice, and then whenever a 
$user does a $search, augment the $search with boost params based on 
the $user->[candy,weight] prefs.
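
For example, a user's preferences might translate into something roughly like
this on the request (field names and weights are made up):

defType=edismax&q=chocolate&bq=candy_id:123^4.5&bq=candy_id:987^2.0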

if you find that you have too many prefs from some users, put a cap on the 
number of preferences you let influence the query (i.e. only the top N 
weights, or only the N most confident weights, or the N most recent prefs) or 
aggregate some prefs into category/manufacturer prefs instead of specific 
$candies, etc...

Having said all that: with the new Solr NRT stuff and the /get handler 
real time gets, you can treat another solr core/server as your key+val 
store if you want -- but using straight SolrJoin won't let you take 
advantage of the weight boostings.


-Hoss


RE: Writing custom data import handler for Solr.

2012-06-08 Thread ram anam

Hi Eric,
I cannot disclose the data source which we are planning to index inside Solr, as
it is confidential. But the client wants it to be in the form of an import
handler. We plan to install Solr and our custom data import handlers so that the
client can just consume them. Could you please provide me with pointers to
examples of custom data import handlers?

Thanks and regards,
Ram Anam.

 Date: Fri, 8 Jun 2012 13:59:34 -0400
 Subject: Re: Writing custom data import handler for Solr.
 From: erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 
 You need to back up a bit and describe _why_ you want to do this,
 perhaps there's
 an easy way to do what you want. This could easily be an XY problem...
 
 For instance, you can write a SolrJ program to index data, which _might_ be
 what you want. It's a separate process runnable anywhere. See:
 http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
 
 Best
 Erick
 
 On Fri, Jun 8, 2012 at 1:29 PM, ram anam ram_a...@hotmail.com wrote:
 
  Hi,
 
  I am planning to write a custom data import handler for SOLR for some data 
  source. Could you give me some pointers to documentation, examples on how 
  to write a custom data import handler and how to integrate it with SOLR. 
  Thank you for help. Thanks and regards,Ram Anam.
  

Re: Writing custom data import handler for Solr.

2012-06-08 Thread Lance Norskog
The DataImportHandler is a toolkit in Solr. It has a few different
kinds of plugins. It is very possible that you do not have to write
any Java code.

If you have an unusual external data feed (database, file system,
Amazon S3 buckets), then you would write a DataSource. The only
examples are the source code in trunk/solr/contrib/dataimporthandler.

http://wiki.apache.org/solr/DataImportHandler
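
A custom DataSource is essentially a small subclass like the following (a rough
sketch against the DIH API; the class name and behaviour are only illustrative),
which can then be referenced from data-config.xml by its class name:

import java.util.Properties;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataSource;

public class MyDataSource extends DataSource<String> {

  @Override
  public void init(Context context, Properties initProps) {
    // read connection settings passed from data-config.xml
  }

  @Override
  public String getData(String query) {
    // fetch and return the raw data identified by 'query'
    return "";
  }

  @Override
  public void close() {
    // release connections and other resources
  }
}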

On Fri, Jun 8, 2012 at 8:35 PM, ram anam ram_a...@hotmail.com wrote:

 Hi Eric,
 I cannot disclose the data source which we are planning to index inside SOLR 
 as it is confidential. But client wants it be in the form of Import Handler. 
 We plan to install Solr and our custom data import handlers so that client 
 can just consume it. Could you please provide me the pointers to examples of 
 Custom Data Import Handlers.

 Thanks and regards,Ram Anam.

 Date: Fri, 8 Jun 2012 13:59:34 -0400
 Subject: Re: Writing custom data import handler for Solr.
 From: erickerick...@gmail.com
 To: solr-user@lucene.apache.org

 You need to back up a bit and describe _why_ you want to do this,
 perhaps there's
 an easy way to do what you want. This could easily be an XY problem...

 For instance, you can write a SolrJ program to index data, which _might_ be
 what you want. It's a separate process runnable anywhere. See:
 http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

 Best
 Erick

 On Fri, Jun 8, 2012 at 1:29 PM, ram anam ram_a...@hotmail.com wrote:
 
  Hi,
 
  I am planning to write a custom data import handler for SOLR for some data 
  source. Could you give me some pointers to documentation, examples on how 
  to write a custom data import handler and how to integrate it with SOLR. 
  Thank you for help. Thanks and regards,Ram Anam.




-- 
Lance Norskog
goks...@gmail.com


Re: Adding Custom-Parser to Tika

2012-06-08 Thread Lance Norskog
The doc is old. Tika hunts for parsers in the classpath now.

http://www.lucidimagination.com/search/link?url=https://issues.apache.org/jira/browse/SOLR-2116?focusedCommentId=12977072#action_12977072

On Fri, Jun 8, 2012 at 2:20 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 You canspecify a tika.config option pointing at your own
 tika-config.xml files that ExtractionRequestHandler will use to configure
 Tika with...

 http://wiki.apache.org/solr/ExtractingRequestHandler

 The tika.config entry points to a file containing a Tika configuration.
 You would only need this if you have customized your own Tika
 configuration. The Tika config contains info about parsers, mime types,
 etc.


 -Hoss



-- 
Lance Norskog
goks...@gmail.com


What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory

2012-06-08 Thread Aaron Daubman
Greetings,

I am in the process of updating custom code and schema from Solr 1.4 to
3.6.0 and have run into the following issue with our two custom Tokenizer
and Token Filter components.

I've been banging my head against this one for far too long, especially
since it must be something obvious I'm missing.

I have custom Tokenizer and Token Filter components along with
corresponding factories. The code for all of them looks very similar to the
Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0
(and I have also read through
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters).

I have ensured my custom code is on the classpath, it is
in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar:
---output snip---
Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load
INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en
Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar'
to classloader
Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader
INFO: Adding
'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar'
to classloader
Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create
--snip---

After successfully parsing the schema and creating many fields, etc.. the
following is logged:
---snip---
Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader
load
INFO: created : com.company.MyCustomTokenizerFactory
Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory
cannot be cast to org.apache.solr.analysis.TokenizerFactory
at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148)
at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219)
at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:161)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:102)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:748)
at
org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:249)
at
org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1222)
at
org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:676)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:455)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
at
org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
at
org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
at
org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
at
org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
at
org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
at
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
at
org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
at
org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
at

Re: What would cause: SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory

2012-06-08 Thread Aaron Daubman
Just in case it is helpful, here are the relevant pieces of my schema.xml:

---snip--
<fieldtype name="customfield" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.company.MyCustomTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!--filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/-->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!--filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/-->
  </analyzer>
</fieldtype>
---snip---

and

---snip---
<fieldtype name="customterms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="com.company.MyCustomFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\-" replacement=" " replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="&amp;amp;" replacement="&amp;" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
---snip---

On Sat, Jun 9, 2012 at 12:03 AM, Aaron Daubman daub...@gmail.com wrote:

 Greetings,

 I am in the process of updating custom code and schema from Solr 1.4 to
 3.6.0 and have run into the following issue with our two custom Tokenizer
 and Token Filter components.

 I've been banging my head against this one for far too long, especially
 since it must be something obvious I'm missing.

 I have  custom Tokenizer and Token Filter components along with
 corresponding factories. The code for all looks very similar to the
 Tokenizer and TokenFilter (and Factory) code that is standard with 3.6.0
 (and I have also read through
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 I have ensured my custom code is on the classpath, it is
 in ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar:
 ---output snip---
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer load
 INFO: loading shared library: /opt/test_artists_solr/jetty-solr/lib/en
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENSolrComponents-1.0-SNAPSHOT-jar-with-dependencies.jar'
 to classloader
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.SolrResourceLoader
 replaceClassLoader
 INFO: Adding
 'file:/opt/test_artists_solr/jetty-solr/lib/en/ENUtil-1.0-SNAPSHOT-jar-with-dependencies.jar'
 to classloader
 Jun 8, 2012 10:41:00 PM org.apache.solr.core.CoreContainer create
 --snip---

 After successfully parsing the schema and creating many fields, etc.. the
 following is logged:
 ---snip---
 Jun 8, 2012 10:41:00 PM org.apache.solr.util.plugin.AbstractPluginLoader
 load
 INFO: created : com.company.MyCustomTokenizerFactory
 Jun 8, 2012 10:41:00 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.ClassCastException: com.company.MyCustomTokenizerFactory
 cannot be cast to org.apache.solr.analysis.TokenizerFactory
 at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966)
  at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148)
 at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
  at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
 at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
  at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433)
 at
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
 at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:123)
  at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
 at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
  at org.apache.solr.core.CoreContainer.load(CoreContainer.java:219)
 at