How to index on basis of a condition?

2010-10-25 Thread Pawan Darira
Hi

I want to index a particular field based on an if() condition. Can I do it
through DIH?

Please suggest.

-- 
Thanks,
Pawan Darira


RE: FieldCache

2010-10-25 Thread Mathias Walter
I don't think it is an XY problem.

I indexed about 90 million sentences and the PAS (predicate argument 
structures) they consist of (about 500 million). Then I try to do NER (named 
entity recognition) by searching about 5 million entities. For each entity I 
need all the search results, not just the top X. Since about 10 percent of 
the entities are highly frequent (i.e. there are more than 5 million hits for 
"human"), it takes very long to obtain the data from the index. Very long 
means about a day with 15 distributed Katta nodes. Katta is just a 
distribution and shard balancing solution on top of Lucene.

Initially, I tried distributed search with Solr. But it was too slow to 
retrieve a large set of documents. Then I switched to Lucene and made some 
improvements. I enabled the field cache for my ID field and another single 
char field (PAS type) to get the benefit of accessing the fields with an 
array. Unfortunately, the IDs are too large to fit in memory. I gave 12 GB of 
RAM to each node and also tried to use the MMapDirectory and/or 
CompressedOops. Lucene always runs out of memory.

Then I investigated the storage of the fields. String fields are stored in 
UTF-8 encoding. But my ID will never contain UTF-8 characters. It follows a 
numeric schema but does not fit into a single long. I encoded it into a byte 
array of 11 bytes (compared to 30 bytes of UTF-8 encoding). Then I changed 
the field description in schema.xml to binary. I still use the 
EmbeddedSolrServer to create the indices.
Also, I had to remove the uniqueKey node, because binary fields cannot be 
indexed, which is a requirement for the unique key.

After reindexing I discovered that non-indexed or binary fields cannot be used 
with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array. 
The size increased to 7 characters (= 14 bytes), which is still a gain of 
more than 50 percent compared to the UTF-8 encoding. BTW: I found no sample 
of how to use the IndexableBinaryStringTools class except in the unit tests.

Unfortunately, I was not able to use it with the EmbeddedSolrServer and the 
Lucene client. The search result never looked identical to the IDs used to 
create the SolrInputDocument.

I assume that the char[] returned from IndexableBinaryStringTools.encode is 
encoded in UTF-8 again and then stored. At some point the information is 
lost and cannot be recovered.

Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from 
FieldCache.DEFAULT.getTerms directly. But the bytes are encoded in a form 
unknown to me and cannot be decoded with IndexableBinaryStringTools.decode.

The question now is: how can I increase the performance of binary field 
retrieval without exploding memory?

I also read some comments suggesting the use of payloads, but I never tried 
that approach. Also, the column-stride fields approach (LUCENE-2186) looks 
promising but is not released yet.

BTW: I made some tests with a smaller index and the ID encoded as string. Using 
the field cache improves the hit retrieval
dramatically (from 18 seconds down to 2 seconds per query, with a large number 
of results).

--
Kind regards,
Mathias

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Saturday, 23 October 2010 21:40
 To: solr-user@lucene.apache.org
 Subject: Re: FieldCache
 
 Why do you want to? Basically, the caches are there to improve
 #searching#. To search something, you must index it. Retrieving
 it is usually a rare enough operation that caching is irrelevant.
 
 This smells like an XY problem, see:
 http://people.apache.org/~hossman/#xyproblem
 
 If this seems like gibberish, could you explain your problem
 a little more?
 
 Best
 Erick
 
 On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter 
 mathias.wal...@gmx.net wrote:
 
  Hi,
 
  does a field which should be cached need to be indexed?
 
  I have a binary field which is just stored. Retrieving it via
  FieldCache.DEFAULT.getTerms returns empty ByteRefs.
 
  Then I found the following post:
  http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html
 
  How can I use the FieldCache with a binary field?
 
  --
  Kind regards,
  Mathias
 
 



Re: How to index on basis of a condition?

2010-10-25 Thread Jan Høydahl / Cominvent
Do you want to use a field's content to decide whether the document should be 
indexed or not?
You could write an UpdateProcessor for that, simply aborting the chain for the 
docs that don't pass your test.

@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.getSolrInputDocument();
  String value = (String) doc.getFieldValue("myfield");
  String condition = "foobar";
  if (condition.equals(value)) {  // compare Strings with equals(), not ==
    super.processAdd(cmd);
  }
}

But if what you meant was to skip only that field if it does not match the 
condition, you could use doc.removeField(name) instead. Now you can feed your 
content using whatever method you like.
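
In case it helps, this is roughly how such a processor could be wired into
solrconfig.xml (a sketch; the factory class name is made up):

<updateRequestProcessorChain name="conditional" default="true">
  <processor class="com.example.ConditionalIndexProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>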

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 25. okt. 2010, at 08.38, Pawan Darira wrote:

 Hi
 
 I want to index a particular field based on an if() condition. Can I do it
 through DIH?
 
 Please suggest.
 
 -- 
 Thanks,
 Pawan Darira



Re: a bug of solr distributed search

2010-10-25 Thread Toke Eskildsen
On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
 But it shows a problem of distributed search without common IDF.
 A doc will get a different score in different shards.

Bingo.

I really don't understand why this fundamental problem with sharding
isn't mentioned more often. Every time the advice "use sharding" is
given, it should be followed with a "but be aware that it will make
relevance ranking unreliable".

Regards,
Toke Eskildsen



Seattle Scalability Meetup: Rackspace OpenStack, Karmasphere Hadoop, Wed Oct 27

2010-10-25 Thread Bradford Stephens
Link/Details:
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/calendar/13704371/

This meetup focuses on scalability and technologies to enable handling
large amounts of data: Hadoop, HBase, distributed NoSQL databases, and
more! There's not only a focus on technology, but also on everything
surrounding it, including operations, management, business use cases,
etc. We've had great success in the past and are growing quickly,
including guests from LinkedIn, Amazon, Twitter, Facebook, Cloudant,
and 10gen/MongoDB.

This month's guests:
Mike Mayo, Rackspace: learn details of Rackspace's new Open Cloud
offering -- a complete scalable cloud stack, but open source!
Abe Taha, VP Engineering, Karmasphere: Karmasphere produces a Hadoop
development environment. Learn more about working with Hadoop
effectively, and see their exciting new offerings.

Location:
Amazon HQ, Von Vorst Building, 426 Terry Ave N., Seattle, WA 98109-5210

Afterparty:
Feierabend, 422 Yale Ave N


--
Bradford Stephens,
Founder, Drawn to Scale
drawntoscalehq.com
727.697.7528

http://www.drawntoscalehq.com --  The intuitive, cloud-scale data
solution. Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


Re: Import From MYSQL database

2010-10-25 Thread virtas

Why don't you paste the log excerpt that is generated when you try to import 
the data?


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Import-From-MYSQL-database-tp1738753p1766375.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-10-25 Thread Andrzej Bialecki
On 2010-10-25 11:22, Toke Eskildsen wrote:
 On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
 But it shows a problem of distributed search without common IDF.
 A doc will get a different score in different shards.
 
 Bingo.
 
 I really don't understand why this fundamental problem with sharding
 isn't mentioned more often. Every time the advice "use sharding" is
 given, it should be followed with a "but be aware that it will make
 relevance ranking unreliable".

The reason is twofold, I think:

* there is an exact solution to this problem, namely to make two
distributed calls instead of one (the first call collects per-shard IDFs
for the given query terms, the second submits a query rewritten with the
global IDFs). This solution is implemented in SOLR-1632, with some
caching to reduce the cost for common queries. However, this means that
now for every query you need to make two calls instead of one, which
potentially doubles the time to return results (for simple common
queries - for rare complex queries the time will still be dominated by
the query runtime on the shard servers).

* another reason is that in many, many cases the difference between using
the exact global IDF and per-shard IDFs is not that significant. If shards
are more or less homogeneous (e.g. you assign documents to shards by
hash(docId)) then term distributions will also be similar. So the
question is whether you can accept an N% variance in scores across
shards, or whether you want to bear the cost of an additional
distributed RPC for every query...

To summarize, I would qualify your statement with: "...if the
composition of your shards is drastically different". Otherwise the cost
of using global IDF is not worth it, IMHO.
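
(For concreteness: the shard-local statistic at issue is the idf factor,
which Lucene's DefaultSimilarity computes roughly as

  idf(t) = 1 + log( numDocs / (docFreq(t) + 1) )

where both numDocs and docFreq(t) come from the local shard - which is
exactly why the same document can score differently on different shards.)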

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Solr Javascript+JSON not optimized for SEO

2010-10-25 Thread Nick Jenkin
The solution is to offer both, and provide a fallback for browsers that
don't support javascript (e.g. Googlebot).
I would also ponder the question "how does this ajax feature help my
users?". If you can't find a good answer to that, you should probably
just not use ajax. (NB: "it's faster" is not a valid answer!)
-Nick

On Sun, Oct 24, 2010 at 12:30 AM, PeterKerk vettepa...@hotmail.com wrote:

 Unfortunately it's not online yet, but is there anything I can clarify in more
 detail?

 Thanks!
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Javascript-JSON-not-optimized-for-SEO-tp1751641p1758054.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Javascript+JSON not optimized for SEO

2010-10-25 Thread PeterKerk

Offering both... that sounds to me like duplicating development efforts? Or am
I overlooking something here?


Nick Jenkin-2 wrote:
 
 NB: "it's faster" is not a valid answer!
 
Why is it not valid? Because it's not necessarily faster, or...?

And what about user experience? Instead of needing to refresh the entire
page I can now do partial page updates?

Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Javascript-JSON-not-optimized-for-SEO-tp1751641p1766762.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-10-25 Thread Toke Eskildsen
On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
 * there is an exact solution to this problem, namely to make two
 distributed calls instead of one (first call to collect per-shard IDFs
 for given query terms, second call to submit a query rewritten with the
 global IDF-s). This solution is implemented in SOLR-1632, with some
 caching to reduce the cost for common queries.

I must admit that I have not tried the patch myself. Looking at
https://issues.apache.org/jira/browse/SOLR-1632
I see that the last comment is from LiLi with a failed patch, but as
there are no further comments it is unclear whether the problem is general or
just with LiLi's setup. I might be a bit harsh here, but the other
comments on the JIRA issue also indicate that one would have to be
somewhat adventurous to run this in production.

 * another reason is that in many many cases the difference between using
 exact global IDF and per-shard IDFs is not that significant. If shards
 are more or less homogenous (e.g. you assign documents to shards by
 hash(docId)) then term distributions will be also similar.

While I agree on the validity of the solution, it does put some serious
constraints on the shard-setup.

 To summarize, I would qualify your statement with: ...if the
 composition of your shards is drastically different. Otherwise the cost
 of using global IDF is not worth it, IMHO.

Do you know of any studies of the differences in ranking with regard to
indexing-distribution by hashing, logical grouping and distributed IDF?

Regards,
Toke Eskildsen



solr 1.4 suggester component

2010-10-25 Thread abhayd

hi
I was looking into using the Solr suggester component as described in
http://wiki.apache.org/solr/Suggester

I have a file which has words and phrases in it.

I was wondering how to make the following possible.

The file has:
-
rebate form
form

When I look for "form", or even "for", I would like "rebate form" to be
included too.
I tried using
  <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
but no luck. The wiki suggests some one-liner change to get fuzzy suggestions,
but I'm not sure what that one-liner change would be.

Also the wiki suggests "If you want to use a dictionary file that contains
phrases (actually, strings that can be split into multiple tokens by the
default QueryConverter) then define a different QueryConverter",
but I don't see the desired result.
Here is my solrconfig.xml:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <!--
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    -->
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="sourceLocation">american-english.txt</str>
    <str name="field">name</str>  <!-- the indexed field to derive suggestions from -->
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">false</str>
    <queryConverter name="queryConverter"
        class="org.apache.solr.spelling.MySpellingQueryConverter"/>
  </lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler"
    name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">false</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-1-4-suggester-component-tp1766915p1766915.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: FieldCache

2010-10-25 Thread Toke Eskildsen
On Mon, 2010-10-25 at 09:41 +0200, Mathias Walter wrote:
 [...] I enabled the field cache for my ID field and another
 single char field (PAS type) to get the benefit of accessing
 the fields with an array. Unfortunately, the IDs are too
 large to fit in memory. I gave 12 GB of RAM to each node and
 also tried to use the MMapDirectory and/or CompressedOops.
 Lucene always runs out of memory.

That is a known problem with Lucene 3.x and earlier. The cache uses Strings
for the terms, which has a lot of overhead. As you discovered, reducing the
length of the IDs does not help much.

[Encoding ID as 11 stored bytes]

 Recently I upgraded to trunk (4.0) and tried to use the BytesRefs
 from FieldCache.DEFAULT.getTerms directly. But the bytes are
 encoded in a form unknown to me and cannot be decoded
 with IndexableBinaryStringTools.decode.

It depends on what you put into it, but if you represent your IDs as
normal Strings at index time, they will be stored in UTF-8 encoding.
Since you're using 11 ASCII characters for an ID, this means 11 bytes.
You can get your Strings back by calling myBytesRef.utf8ToString().

The overhead of BytesRefs is a lot lower than that of Strings, so simply
indexing your IDs and using the field cache might solve your problem
when you're using trunk.
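
An untested sketch of that approach against the trunk API at the time
(assuming FieldCache.DocTerms; reader and docId are whatever you already
have in hand):

  FieldCache.DocTerms ids = FieldCache.DEFAULT.getTerms(reader, "id");
  BytesRef ref = new BytesRef();
  ids.getTerm(docId, ref);           // fills ref with the term bytes
  String id = ref.utf8ToString();    // back to the original String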

- Toke



Re: Modelling Access Control

2010-10-25 Thread Paul Carey
Many thanks for all the responses. I now plan on benchmarking and
validating both the filter query approach, and maintaining the ACL
entirely outside of Solr. I'll decide from there.

Paul


Re: FieldCache

2010-10-25 Thread Robert Muir
On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter mathias.wal...@gmx.net wrote:
 I indexed about 90 million sentences and the PAS (predicate argument 
 structures) they consist of (about 500 million). Then
 I try to do NER (named entity recognition) by searching about 5 million 
 entities. For each entity I need all the search results, not
 just the top X. Since about 10 percent of the entities are highly frequent
 (i.e. there are more than 5 million hits for "human"), it
 takes very long to obtain the data from the index. Very long means about a 
 day with 15 distributed Katta nodes. Katta is just a
 distribution and shard balancing solution on top of Lucene.

if you aren't getting top-N results/doing search, are you sure a
search engine library/server is the right tool for this job?

 Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array. 
 The size increased to 7 characters (= 14 bytes),
 which is still a gain of more than 50 percent compared to the UTF-8 encoding. 
 BTW: I found no sample of how to use the
 IndexableBinaryStringTools class except in the unit tests.

it is deprecated in trunk, because you can index binary terms (your
own byte[]) directly if you want. To do this, you need to use a custom
AttributeFactory.

See src/test/org/apache/lucene/index/Test2BTerms or
https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how
to do this.


Re: Integrating Carrot2/Solr Deafult Example

2010-10-25 Thread Grant Ingersoll

On Oct 24, 2010, at 1:45 PM, Eric Martin wrote:

 Hello,
 
 
 
 Welcome to all. I am a very basic user. I have limited knowledge. I read the
 documentation, I have an 'example' Solr installation working on my server. I
 have Drupal 6. I have Drupal using Solr (apachesolr) as its default search
 engine. I have 1 document in the database that is searchable for testing
 purposes. I would like to know, if I am using all default paths in my Solr
 installation, how do I enable Carrot2? Once enabled, how do I verify that it
 is clustering properly?

You would verify it is working by asking it to do some clustering and getting 
back cluster results.  Can you run the example in the wiki page and get results?

 
 
 
 Carrot2 doc I read:
 http://download.carrot2.org/head/manual/index.html#chapter.application-suite
 
 Clustering Wiki Solr I read: http://wiki.apache.org/solr/ClusteringComponent
 
 
 
 I know this is really basic stuff and I really appreciate the help. I
 fumbled my way through installing Solr on my own, setting up Drupal, etc. I
 am a former Natural V2 3270 programmer (basic flat file OO) and have limited
 experience in PHP, Java, Jetty etc. However, I can read code, decipher what
 it is doing, and find a solution and then implement it. I just really have
 no foundation for Carrot2/Solr, yet.
 
 
 
 Any help, pointers, and "look here"s would be very much appreciated. 




RE: FieldCache

2010-10-25 Thread Steven A Rowe
Hi Mathias,

 [...] I tried to use IndexableBinaryStringTools to re-encode my 11-byte
 array. The size increased to 7 characters (= 14 bytes),
 which is still a gain of more than 50 percent compared to the UTF-8
 encoding. BTW: I found no sample of how to use the
 IndexableBinaryStringTools class except in the unit tests.

IndexableBinaryStringTools will eventually be deprecated and then dropped, in 
favor of native indexable/searchable binary terms.  More work is required 
before these are possible, though.

Well-maintained unit tests are not a bad way to describe functionality...
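
For reference, a minimal round-trip sketch (assuming the ByteBuffer-based
API of Lucene 2.9/3.x; untested):

  import java.nio.ByteBuffer;
  import java.nio.CharBuffer;
  import org.apache.lucene.util.IndexableBinaryStringTools;

  byte[] id = new byte[11];                // your 11-byte ID
  CharBuffer encoded =
      IndexableBinaryStringTools.encode(ByteBuffer.wrap(id));
  String indexable = encoded.toString();   // index/store this value
  ByteBuffer decoded =
      IndexableBinaryStringTools.decode(CharBuffer.wrap(indexable));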
 
  I assume that the char[] returned from IndexableBinaryStringTools.encode
  is encoded in UTF-8 again and then stored. At some point
  the information is lost and cannot be recovered.

Can you give an example?  This should not happen.

Steve



RE: FieldCache

2010-10-25 Thread Steven A Rowe
Hi Robert,

On 10/25/2010 at 8:20 AM, Robert Muir wrote:
 it is deprecated in trunk, because you can index binary terms (your
 own byte[]) directly if you want. To do this, you need to use a custom
 AttributeFactory.

It's not actually deprecated yet.

 See src/test/org/apache/lucene/index/Test2BTerms or
 https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how
 to do this.

AFAICT, Test2BTerms only deals with the indexing side of this issue, and 
doesn't test searching.

LUCENE-2551 does, however, test searching.  Why hasn't this been committed yet? 
 I had just assumed that it was because fully indexable/searchable binary terms 
were not yet ready for prime time.

I hadn't realized that native binary terms were fully functional - is there any 
reason why integers (for example) could not be directly indexable/searchable?

Steve



Re: Modelling Access Control

2010-10-25 Thread Israel Ekpo
On Mon, Oct 25, 2010 at 8:16 AM, Paul Carey paul.p.ca...@gmail.com wrote:

 Many thanks for all the responses. I now plan on benchmarking and
 validating both the filter query approach, and maintaining the ACL
 entirely outside of Solr. I'll decide from there.

 Paul



Great.

I am looking forward to some feedback on the benchmarks.
-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: EmbeddedSolrServer with one core and schema.xml loaded via ClassLoader, is it possible?

2010-10-25 Thread Paolo Castagna


I've found two ways which allow me to load all the config files from a
jar file, however with the first solution I cannot specify the dataDir.

This is the first way:

System.setProperty("solr.solr.home", solrHome);
CoreContainer.Initializer initializer =
  new CoreContainer.Initializer();
CoreContainer coreContainer =
  initializer.initialize();
EmbeddedSolrServer server =
  new EmbeddedSolrServer(coreContainer, coreName);

This is what http://wiki.apache.org/solr/Solrj suggests; however, with
this approach it's not possible to specify the dataDir, which is, by default,
${solr.solr.home}/data/index.


This is my attempt to do the same, but in a way I can specify the
dataDir:

System.setProperty("solr.solr.home", solrHome);
System.setProperty("solr.core.dataDir", dataDir);
CoreContainer coreContainer = new CoreContainer();
SolrConfig solrConfig = new SolrConfig();
IndexSchema indexSchema =
  new IndexSchema(solrConfig, null, null);
SolrCore core =
  new SolrCore(dataDir, indexSchema);
core.setName(coreName);
coreContainer.register(core, false);
EmbeddedSolrServer server =
  new EmbeddedSolrServer(coreContainer, coreName);


Do you see any problems with the second solution?

Is there a better way?

Paolo

Paolo Castagna wrote:

Hi,
I am trying to use EmbeddedSolrServer with just one core and I'd like to
load solrconfig.xml, schema.xml and other configuration files from a jar
via getResourceAsStream(...).

I've tried to use SolrResourceLoader, but all my attempts failed with a
RuntimeException: Can't find resource [...].

Is it possible to construct an EmbeddedSolrServer loading all the config
files from a jar file?

Thank you in advance for your help,
Paolo




RE: How to index on basis of a condition?

2010-10-25 Thread Ephraim Ofir
Assuming you're talking about data that comes from a DB, I find it easiest to 
do this kind of logic on the DB's side (MySQL example):
SELECT IF(someField = 'someValue', 'desiredValue', NULL) AS desiredName FROM 
someTable

If that's not possible, you can use the RegexTransformer
(http://wiki.apache.org/solr/DataImportHandler#RegexTransformer)
or (worst case and worst performance) the ScriptTransformer
(http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer)
and actually write a JS script to do your logic, as sketched below.
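
A rough, untested sketch of the ScriptTransformer variant (data-config.xml;
the field, table, and function names are made up):

<dataConfig>
  <script><![CDATA[
    function conditionalField(row) {
      // drop the field unless it matches the condition
      if (row.get('someField') != 'someValue') {
        row.remove('someField');
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" query="SELECT id, someField FROM someTable"
            transformer="script:conditionalField"/>
  </document>
</dataConfig>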

Ephraim Ofir

-Original Message-
From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com] 
Sent: Monday, October 25, 2010 10:23 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index on basis of a condition?

Do you want to use a field's content to decide whether the document should be 
indexed or not?
You could write an UpdateProcessor for that, simply aborting the chain for the 
docs that don't pass your test.

@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.getSolrInputDocument();
  String value = (String) doc.getFieldValue("myfield");
  String condition = "foobar";
  if (condition.equals(value)) {  // compare Strings with equals(), not ==
    super.processAdd(cmd);
  }
}

But if what you meant was to skip only that field if it does not match the 
condition, you could use doc.removeField(name) instead. Now you can feed your 
content using whatever method you like.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 25. okt. 2010, at 08.38, Pawan Darira wrote:

 Hi
 
 I want to index a particular field based on an if() condition. Can I do it
 through DIH?
 
 Please suggest.
 
 -- 
 Thanks,
 Pawan Darira



Re: FieldCache

2010-10-25 Thread Robert Muir
On Mon, Oct 25, 2010 at 9:00 AM, Steven A Rowe sar...@syr.edu wrote:
 It's not actually deprecated yet.

you are right! only in my patch!

 AFAICT, Test2BTerms only deals with the indexing side of this issue, and 
 doesn't test searching.

 LUCENE-2551 does, however, test searching.  Why hasn't this been committed 
 yet?  I had just assumed that it was because fully indexable/searchable 
 binary terms were not yet ready for prime time.

 I hadn't realized that native binary terms were fully functional - is there 
 any reason why integers (for example) could not be directly 
 indexable/searchable?

they are! Term itself now holds a BytesRef behind the scenes, and
pretty much everything is fully-functional (for example, the collated
sort use case works with the patch in LUCENE-2551)

But, the short answer is we still need to fix TermRangeQuery to just
work on bytes.
The problem is I didn't link the dependent issue: LUCENE-2514 (I just did this).

There is a patch to fix all the range query stuff there... it's not
finished, but not far off. The basic idea is to make using
[ICU]CollationAnalyzer the supported way of doing this, including
queryparser support, etc.

The long answer is even after LUCENE-2514 is resolved, there are still
some things to figure out: for example how should we properly expose
stuff like this in Solr? Do we really need to modify the
TokenizerFactories to take AttributeFactory and add
AttributeFactoryFactory?

Or is it better to add a Solr fieldtype for these kind of things, and
do it that way? Or we could just add a special
CollatedKeywordTokenizerFactory with the current model that supports
the sorting use case easily, but we still want range query support I
think...


ApacheCon Atlanta next week

2010-10-25 Thread Grant Ingersoll
Hi All,

Just a couple of notes about ApacheCon next week for those who either are 
attending or are thinking of attending.

1. There will be Lucene and Solr 2 day trainings done by Erik Hatcher (Solr) 
and me (Lucene).  It's not too late to sign up.  See 
http://na.apachecon.com/c/acna2010/schedule/grid

2. We've got a good deal of content on Lucene, Solr, Tika, Mahout, etc. planned 
for the week (Thursday and Friday). Again, see 
http://na.apachecon.com/c/acna2010/schedule/grid

3. There will be a Meetup on Tuesday night.  See 
http://wiki.apache.org/apachecon/ApacheMeetupsNa10.  On this front, we are 
looking for people interested in giving 20-30 min. presentations on what they 
are doing with any of the Lucene ecosystem technologies.  If you are 
interested, let me know.  Otherwise, we will likely make it more informal as a 
networking/QA meetup.

Hope to see you there,
Grant

Re: solr 1.4 suggester component

2010-10-25 Thread Erick Erickson
Try here:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

For the infix-type match you're using, you might not want the
edge version of ngram...
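
In case it helps, an untested schema.xml sketch of a non-edge ngram field
type for infix matching (names are made up):

<fieldType name="text_infix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>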

Best
Erick

On Mon, Oct 25, 2010 at 8:16 AM, abhayd ajdabhol...@hotmail.com wrote:


 hi
 I was looking into using the Solr suggester component as described in
 http://wiki.apache.org/solr/Suggester

 I have a file which has words and phrases in it.

 I was wondering how to make the following possible.

 The file has:
 -
 rebate form
 form

 When I look for "form", or even "for", I would like "rebate form" to be
 included too.
 I tried using
   <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
 but no luck. The wiki suggests some one-liner change to get fuzzy suggestions,
 but I'm not sure what that one-liner change would be.

 Also the wiki suggests "If you want to use a dictionary file that contains
 phrases (actually, strings that can be split into multiple tokens by the
 default QueryConverter) then define a different QueryConverter",
 but I don't see the desired result.
 Here is my solrconfig.xml:

 <searchComponent class="solr.SpellCheckComponent" name="suggest">
   <lst name="spellchecker">
     <str name="name">suggest</str>
     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
     <!--
     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
     -->
     <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
     <str name="sourceLocation">american-english.txt</str>
     <str name="field">name</str>  <!-- the indexed field to derive suggestions from -->
     <float name="threshold">0.005</float>
     <str name="buildOnCommit">false</str>
     <queryConverter name="queryConverter"
         class="org.apache.solr.spelling.MySpellingQueryConverter"/>
   </lst>
 </searchComponent>
 <requestHandler class="org.apache.solr.handler.component.SearchHandler"
     name="/suggest">
   <lst name="defaults">
     <str name="spellcheck">false</str>
     <str name="spellcheck.dictionary">suggest</str>
     <str name="spellcheck.onlyMorePopular">true</str>
     <str name="spellcheck.count">5</str>
     <str name="spellcheck.collate">true</str>
   </lst>
   <arr name="components">
     <str>suggest</str>
   </arr>
 </requestHandler>
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/solr-1-4-suggester-component-tp1766915p1766915.html
 Sent from the Solr - User mailing list archive at Nabble.com.



How to use AND as opposed to OR as the default query operator.

2010-10-25 Thread Swapnonil Mukherjee
Hi Everybody,

I simply want to use AND as the default operator in queries. When a user 
searches for Jennifer Lopez, Solr converts this to a Jennifer OR Lopez query. 
On the other hand, I want Solr to treat this query as Jennifer AND Lopez and 
not as Jennifer OR Lopez.

In other words, I want a default AND behavior in multi-term queries instead 
of OR.

I have seen in this presentation 
http://www.slideshare.net/pittaya/using-apache-solr on Slide number 52 that 
this OR behavior is configurable.

Could you please tell me where this configuration is located? I could not 
locate it in schema.xml.

Swapnonil Mukherjee
+91-40092712
+91-9007131999





Re: How to use AND as opposed to OR as the default query operator.

2010-10-25 Thread Markus Jelsma
http://wiki.apache.org/solr/SchemaXml#Default_query_parser_operator
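
It's the solrQueryParser element in schema.xml; a minimal example:

  <solrQueryParser defaultOperator="AND"/>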

On Monday 25 October 2010 15:41:50 Swapnonil Mukherjee wrote:
 Hi Everybody,
 
 I simply want to use AND as the default operator in queries. When a user
 searches for Jennifer Lopez, Solr converts this to a Jennifer OR Lopez
 query. On the other hand, I want Solr to treat this query as Jennifer AND
 Lopez and not as Jennifer OR Lopez.
 
 In other words, I want a default AND behavior in multi-term queries instead
 of OR.
 
 I have seen in this presentation
 http://www.slideshare.net/pittaya/using-apache-solr on Slide number 52
 that this OR behavior is configurable.
 
 Could you please tell me where this configuration is located? I could not
 locate it in schema.xml.
 
 Swapnonil Mukherjee
 +91-40092712
 +91-9007131999

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


DataImporter using pure solr add XML

2010-10-25 Thread Dario Rigolin
Looking at the DataImporter, I'm not sure if it's possible to import using a 
standard <add><doc>...</doc></add> XML document representing a document add 
operation.
Generating <add><doc> XML is quite expensive in my application, and I have 
cached all those documents into a text column in a MySQL database.
It would be easier for me to push all updated documents directly from the 
database instead of passing them via multiple XML files posted in stream mode 
to Solr.
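
(For what it's worth, DIH's XPathEntityProcessor supports
useSolrAddSchema="true", which reads standard <add><doc> XML; an untested
sketch with made-up table/column names:

<dataConfig>
  <dataSource name="db" type="JdbcDataSource"
      driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb"/>
  <dataSource name="xml" type="FieldReaderDataSource"/>
  <document>
    <entity name="rows" dataSource="db"
            query="SELECT doc_xml FROM cached_docs">
      <entity name="docs" dataSource="xml" processor="XPathEntityProcessor"
              dataField="rows.doc_xml" useSolrAddSchema="true"/>
    </entity>
  </document>
</dataConfig>)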

Thank you.

Dario.


Re: How to use AND as opposed to OR as the default query operator.

2010-10-25 Thread Pradeep Singh
Which query handler are you using? For a standard query handler you can set
q.op per request or set defaultOperator in schema.xml.

For a dismax handler you will have to work with min should match.
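
For example, with the standard handler, q.op can be set per request
(illustrative URL, not from the original thread):

  http://localhost:8983/solr/select?q=jennifer+lopez&q.op=AND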

On Mon, Oct 25, 2010 at 6:41 AM, Swapnonil Mukherjee 
swapnonil.mukher...@gettyimages.com wrote:

 Hi Everybody,

 I simply want to use AND as the default operator in queries. When a user
 searches for Jennifer Lopez, Solr converts this to a Jennifer OR Lopez query.
 On the other hand, I want Solr to treat this query as Jennifer AND Lopez and
 not as Jennifer OR Lopez.

 In other words, I want a default AND behavior in multi-term queries instead
 of OR.

 I have seen in this presentation
 http://www.slideshare.net/pittaya/using-apache-solr on Slide number 52
 that this OR behavior is configurable.

 Could you please tell me where this configuration is located? I could not
 locate it in schema.xml.

 Swapnonil Mukherjee
 +91-40092712
 +91-9007131999






Re: How to use AND as opposed to OR as the default query operator.

2010-10-25 Thread Swapnonil Mukherjee
Hi Pradeep,

I am using the standard query parser. I made the changes in schema.xml and it 
works.
It is also good to know that this can be done on a per-query basis as well.

Swapnonil Mukherjee



On 25-Oct-2010, at 7:48 PM, Pradeep Singh wrote:

 Which query handler are you using? For a standard query handler you can set
 q.op per request or set defaultOperator in schema.xml.
 
 For a dismax handler you will have to work with min should match.
 
 On Mon, Oct 25, 2010 at 6:41 AM, Swapnonil Mukherjee 
 swapnonil.mukher...@gettyimages.com wrote:
 
 Hi Everybody,
 
 I simply want to use AND as the default operator in queries. When a user
 searches for Jennifer Lopez, Solr converts this to a Jennifer OR Lopez query.
 On the other hand, I want Solr to treat this query as Jennifer AND Lopez and
 not as Jennifer OR Lopez.
 
 In other words, I want a default AND behavior in multi-term queries instead
 of OR.
 
 I have seen in this presentation
 http://www.slideshare.net/pittaya/using-apache-solr on Slide number 52
 that this OR behavior is configurable.
 
 Could you please tell me where this configuration is located? I could not
 locate it in schema.xml.
 
 Swapnonil Mukherjee
 +91-40092712
 +91-9007131999
 
 
 
 



Re: solr 1.4 suggester component

2010-10-25 Thread abhayd

hi erick,
Thanks for the link.

The problem is we don't want to have another Solr core to implement this, so I
was trying the suggester component, as it allows file-based auto-suggest.

It works fine; the only issue is how to get the prefix ignored. Any idea?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-1-4-suggester-component-tp1766915p1767639.html
Sent from the Solr - User mailing list archive at Nabble.com.


London open-source search social - 28th Oct - NEW VENUE

2010-10-25 Thread Richard Marr
Just a reminder that we're meeting this Thursday near St James Park/Westminster.

Details on the Meetup page:
http://www.meetup.com/london-search-social/

Rich


-- 
Richard Marr


Re: OutOfMemory and auto-commit

2010-10-25 Thread Jonathan Rochkind

Yes, that's my question too.  Anyone?

Dennis Gearon wrote:

How is this avoided?

Dennis Gearon




--- On Thu, 10/21/10, Lance Norskog goks...@gmail.com wrote:

 From: Lance Norskog goks...@gmail.com
 Subject: Re: OutOfMemory and auto-commit
 To: solr-user@lucene.apache.org
 Date: Thursday, October 21, 2010, 9:53 PM
 Yes. Indexing activity suspends until the commit finishes, then
 starts. Having both queries and indexing on the same Solr will have
 this memory problem.

 Lance

 On Thu, Oct 21, 2010 at 1:16 PM, Jonathan Rochkind rochk...@jhu.edu
 wrote:

  If I do _not_ have any auto-commit enabled, and add 500k documents and
  commit at the end, no problem.

  If I instead set auto-commit maxDocs to 100000 (a pretty large number), and
  try to add 500k docs, with autocommits theoretically happening every 100k...
  I run into an OutOfMemory error.

  Can anyone think of any reasons that would cause this, and how to resolve
  it?
  All I can think of is that in the first case, my newSearcher and
  firstSearcher warming queries don't run until the 'document add' is
  completely done. In the second case, there are newSearcher and firstSearcher
  warming queries happening at the same time another process is continuing to
  stream 'add's to Solr. Although at a maxDocs of 100000, I shouldn't (I
  think) get _overlapping_ warming queries; the warming queries should be done
  before the next commit. I think. But nonetheless, just the fact that warming
  queries are happening at the same time 'add's are continuing to stream,
  could that be enough to somehow increase memory usage enough to run into
  OOM?

 --
 Lance Norskog
 goks...@gmail.com




Re: Modelling Access Control

2010-10-25 Thread Jonathan Rochkind

Dennis Gearon wrote:
 why use filter queries?

 Wouldn't reducing the set headed into the filters by putting it in the main
 query be faster? (A question to learn, since I do NOT know :-)

No. At least as I understand it. In the best case, the filter query will 
be a lot faster, because filter queries are cached separately in the 
filter cache.  So if the existing filter query can be found in the 
cache, it'll be a lot faster. If it's not in the cache, the performance 
should be pretty much the same as if you had included it as an 
additional clause in the main q query.


The reasons to put it in an fq filter are:

1) The caching behavior. You can have that certain part of the query be 
cached on its own, speeding up any subsequent queries that use that 
same fq.

2) Simplification of client code. You can leave your 'q' however you 
want it, using whatever kind of query parser you want (dismax, 
whatever), and just add on the 'fq' without touching the 'q'. This is 
a lot easier to do, and especially when you're using it for access 
control like this, a lot harder for a bug to creep in.
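
For illustration (the field name is made up):

  q=user+entered+query&fq=allowed_groups:staff

The fq clause is cached in the filter cache independently of q, so the next
request with the same fq reuses the cached document set.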


Jonathan




Re: How to use AND as opposed to OR as the default query operator.

2010-10-25 Thread Jonathan Rochkind
However, for user-entered queries, I suggest you take a look at dismax, 
which is a lot more suitable for user-entered queries than the standard 
Solr/Lucene query parsers.


Markus Jelsma wrote:

http://wiki.apache.org/solr/SchemaXml#Default_query_parser_operator

On Monday 25 October 2010 15:41:50 Swapnonil Mukherjee wrote:

 Hi Everybody,

 I simply want to use AND as the default operator in queries. When a user
 searches for Jennifer Lopez, Solr converts this to a Jennifer OR Lopez
 query. On the other hand, I want Solr to treat this query as Jennifer AND
 Lopez and not as Jennifer OR Lopez.

 In other words, I want a default AND behavior in multi-term queries instead
 of OR.

 I have seen in this presentation
 http://www.slideshare.net/pittaya/using-apache-solr on Slide number 52
 that this OR behavior is configurable.

 Could you please tell me where this configuration is located? I could not
 locate it in schema.xml.

 Swapnonil Mukherjee
 +91-40092712
 +91-9007131999


Re: Modelling Access Control

2010-10-25 Thread Dennis Gearon
I'll also be interested in how that works for you. Bringing out the whole 
dataset, not filtered for some kind of access control, will mean that you will 
then have to do the filtering of the result set in your server-side/command-line 
program.

So the speed comparison of the filter query vs. the outside-language 
environment will be very interesting :-)

I will also do this, but in about 3-5 months. I will report it then.


Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/25/10, Paul Carey paul.p.ca...@gmail.com wrote:

 From: Paul Carey paul.p.ca...@gmail.com
 Subject: Re: Modelling Access Control
 To: solr-user@lucene.apache.org
 Date: Monday, October 25, 2010, 5:16 AM
 Many thanks for all the responses. I
 now plan on benchmarking and
 validating both the filter query approach, and maintaining
 the ACL
 entirely outside of Solr. I'll decide from there.
 
 Paul



Re: Modelling Access Control

2010-10-25 Thread Dennis Gearon
Thanks for that insight, a lot.

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/25/10, Jonathan Rochkind rochk...@jhu.edu wrote:

 From: Jonathan Rochkind rochk...@jhu.edu
 Subject: Re: Modelling Access Control
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Date: Monday, October 25, 2010, 8:19 AM
 Dennis Gearon wrote:
  why use filter queries?
 
  Wouldn't reducing the set headed into the filters by
  putting it in the main query be faster? (A question to
  learn, since I do NOT know :-)
 No. At least as I understand it. In the best case, the
 filter query will be a lot faster, because filter queries
 are cached separately in the filter cache.  So if the
 existing filter query can be found in the cache, it'll be a
 lot faster. If it's not in the cache, the performance should
 be pretty much the same as if you had included it as an
 additional clause in the main q query.
 
 The reasons to put it in a fq filter are:
 
 1) The caching behavior. You can have that certain part of
 the query be cached on it's own, speeding up any subsequent
 queries that use that same fq.
 
 2) Simplification of client code. You can leave your 'q'
 however you want it, using whatever kind of query parser you
 want (dismax, whatever), and just add on the 'fq'
 without touching the 'q'.   This is a lot
 easier to do, and especially when you're using it for access
 control like this, a lot harder for a bug to creep in.
 
 Jonathan
 
 



Does anyone notice this site?

2010-10-25 Thread scott chu

I happened to bump into this site: http://www.solr.biz/

They said they are also developing a search engine? Does this have any 
connection to the open-source Solr? 



RE: Does anyone notice this site?

2010-10-25 Thread Eric Martin
This is not legal advice. Take this as it is. Just off my head and what I
know. I did not research this, but could, if Solr wants me to.

From a marketing stand-point, probably. 

From a legal standpoint: they can do whatever they want with the name Solr
so long as they maintain a distance between any trademarked name and the
fundamental use of the trademark, unless there is a substantial connection
between the trademark name and recognition. Of course, that is to be
determined by a few factors: length in business, trademarks carried, whether
or not the offended trademark holder makes a claim (not making a claim limits
your recovery substantially and may even nullify it). They are also in South
Africa. So, throw in international law.

Of course, you also have fair use law. Well, this can get tricky. Here is an
example: myspace.com and moremyspace.com. If moremyspace.com is used as a
social networking site, then MySpace has a claim. If it is used as a social
networking site in parody, then MySpace has no legal claim whatsoever.

Another example is booble.com (not work safe link!) That case lasted many
years and google lost. 

Trademarks are a very tricky business and one that I will never practice.
Anyway, seeing as how they are making a search engine, they are using a
lower-level FQDN, and they have not made a dent in the industry, it would be
futile to do anything but send them an email laying claim to the name Solr.

*If you do not send them a letter/email laying claim to Solr, you will lose
your rights to fight that battle with IANA, etc., or the ability to seek legal
remedy.*

Eric
Law Student - Second Year



-Original Message-
From: scott chu [mailto:scott@udngroup.com] 
Sent: Monday, October 25, 2010 9:55 AM
To: solr-user@lucene.apache.org
Subject: Does anyone notice this site?

I happened to bump into this site: http://www.solr.biz/

They said they are also developing a search engine? Does this have any 
connection to the open-source Solr? 



Re: Does anyone notice this site?

2010-10-25 Thread Grant Ingersoll

On Oct 25, 2010, at 12:54 PM, scott chu wrote:

 I happened to bump into this site: http://www.solr.biz/
 
 They said they are also developing a search engine? Does this have any 
 connection to the open-source Solr? 


No, there is no connection, and they likely should not be using the name that 
way, as Solr is a TM of the ASF.



Re: a bug of solr distributed search

2010-10-25 Thread Andrzej Bialecki
On 2010-10-25 13:37, Toke Eskildsen wrote:
 On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
 * there is an exact solution to this problem, namely to make two
 distributed calls instead of one (first call to collect per-shard IDFs
 for given query terms, second call to submit a query rewritten with the
 global IDF-s). This solution is implemented in SOLR-1632, with some
 caching to reduce the cost for common queries.
 
 I must admit that I have not tried the patch myself. Looking at
 https://issues.apache.org/jira/browse/SOLR-1632
  I see that the last comment is from LiLi with a failed patch, but as
  there are no further comments it is unclear whether the problem is general or
 just with LiLi's setup. I might be a bit harsh here, but the other
 comments for the JIRA issue also indicate that one would have to be
 somewhat adventurous to run this in production. 

Oh, definitely this is not production quality yet - there are known
bugs, for example, that I need to fix, and then it needs to be
forward-ported to trunk. It shouldn't be too much work to bring it back
into a usable state.

 * another reason is that in many many cases the difference between using
 exact global IDF and per-shard IDFs is not that significant. If shards
 are more or less homogenous (e.g. you assign documents to shards by
 hash(docId)) then term distributions will be also similar.
 
 While I agree on the validity of the solution, it does put some serious
 constraints on the shard-setup.

True. But this is the simplest setup that just may be enough.

 
 To summarize, I would qualify your statement with: ...if the
 composition of your shards is drastically different. Otherwise the cost
 of using global IDF is not worth it, IMHO.
 
 Do you know of any studies of the differences in ranking with regard to
 indexing-distribution by hashing, logical grouping and distributed IDF?

Unfortunately, this information is surprisingly scarce - research
predating year 2000 is often not applicable, and most current research
concentrates on P2P systems, which are really a different ball of wax.
Here's a few papers that I found that are related to this issue:

* Global Term Weights in Distributed Environments, H. Witschel, 2007
(Elsevier)

* KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel,
P. Triantafillou, G. Weikum, VLDB'05 (ACM)

* Exploring the Stability of IDF Term Weighting, Xin Fu and  Miao Chen,
2008 (Springer Verlag)

* A Comparison of Techniques for Estimating IDF Values to Generate
Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM)

* Comparison of different Collection Fusion Models in Distributed
Information Retrieval, Alexander Steidinger - this paper gives a nice
comparison framework for different strategies for joining partial
results; apparently we use the most primitive strategy explained there,
based on raw scores...

These papers likely don't fully answer your question, but at least they
provide a broader picture of the issue...

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



DIH wiht several Cores

2010-10-25 Thread stockiii

Hello.

I have 7 cores. Each core has its own index and its own import.

I want one DIH with a URL like http://host/solr/dih.
Is it possible for the DIH to use different index folders? Or is it necessary
for each core to use its own DIH with the solrconfig from each core?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-wiht-several-Cores-tp1767883p1767883.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Does anyone notice this site?

2010-10-25 Thread Peter Keegan
fwiw, our proxy server has blocked this site for malicious content.

Peter

On Mon, Oct 25, 2010 at 1:25 PM, Grant Ingersoll gsing...@apache.orgwrote:


 On Oct 25, 2010, at 12:54 PM, scott chu wrote:

  I happen to bump into this site: http://www.solr.biz/
 
  They said they are also developing a search engine? Is this any
 connection to open source Solr?


 No, it is not a connection and they likely should not be using the name
 that way, as Solr is a TM of the ASF.




Re: Solr ExtractingRequestHandler with Compressed files

2010-10-25 Thread Jayendra Patil
There was an issue with the previous version of Solr wherein only the
file names from the zip got indexed.
We had faced the same issue and ended up using the Solr trunk, which has the
Tika version upgraded and works fine.

The Solr version 1.4.1 should also have the fix included. Try using it.

Regards,
Jayendra

On Fri, Oct 22, 2010 at 6:02 PM, Joey Hanzel phan...@nearinfinity.comwrote:

 Hi,

 Has anyone had success using ExtractingRequestHandler and Tika with any of
  the compressed file formats (zip, tar, gz, etc.)?

 I am sending Solr the archived.tar file using curl:

 curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true"
 -H 'Content-type:application/octet-stream' --data-binary
 @/home/archived.tar
 The result I get when I query the document is that the filenames inside the
 archive are indexed as the body_texts, but the content of those files is
 not extracted or included.  This is not the behavior I expected. Ref:

 http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example
 .
 When I send one of the actual documents inside the archive using the same
 curl command, the extracted content is then stored in the body_texts field.
 Am I missing a step for the compressed files?

 I have added all the extraction depednenices as indicated by mat in
 http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
 am able to succesfully extract data from MS Word, PDF, HTML documents.

 I'm using the following library versions.
  Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4

 Given everything I have read, this version of Tika should support extracting
 data from all files within a compressed file.  Any help or suggestions would
 be appreciated.



RE: FieldCache

2010-10-25 Thread Mathias Walter
Hi,

 On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter mathias.wal...@gmx.net
 wrote:
  I indexed about 90 million sentences and the PAS (predicate argument
 structures) they consist of (about 500 million). Then
  I try to do NER (named entity recognition) by searching about 5 million
 entities. For each entity I need all the search results, not
  just the top X. Since about 10 percent of the entities are highly frequent
 (i.e. there are more than 5 million hits for "human"), it
  takes very long to obtain the data from the index. Very long means about a
 day with 15 distributed Katta nodes. Katta is just a
  distribution and shard balancing solution on top of Lucene.
 
 if you aren't getting top-N results/doing search, are you sure a
 search engine library/server is the right tool for this job?

No, I'm not sure, but I didn't find another solution. Any other solution also 
has to create some kind of index and has to provide some search API. Because I 
need SpanNearQuery and PhraseQuery to find some multi-term entities, I think 
Solr/Lucene is a good starting point. Also, I need the classic top-N results 
for the web application. So a single solution is preferred.

  Then I tried to use IndexableBinaryStringTools to re-encode my 11 byte
 array. The size was increased to 7 characters (= 14 bytes)
  which is still a gain of more than 50 percent compared to the UTF8 encoding.
 BTW: I found no sample how to use the
  IndexableBinaryStringTools class except in the unit tests.
 
 it is deprecated in trunk, because you can index binary terms (your
 own byte[]) directly if you want. To do this, you need to use a custom
 AttributeFactory.

How do I use it with Solr, i.e. how do I set up a schema.xml using a custom 
AttributeFactory?

--
Kind regards,
Mathias



Re: FieldCache

2010-10-25 Thread Robert Muir
On Mon, Oct 25, 2010 at 3:41 PM, Mathias Walter mathias.wal...@gmx.net wrote:

 How do I use it with Solr, i.e. how do I set up a schema.xml using a custom 
 AttributeFactory?


at the moment there is no way to specify an AttributeFactory
(AttributeFactoryFactory? heh) in the schema.xml, nor do the
TokenizerFactories have any way to use any but the default.

So, in order to do this at the moment, you need to make a custom
TokenizerFactory hardwired to your AttributeFactory... take a look at
KeywordTokenizerFactory, you could make MyKeywordTokenizerFactory that
instead of invoking:

new KeywordTokenizer(input);

in its create() method, would use the
KeywordTokenizer(AttributeFactory, Reader, int) ctor with your custom
AttributeFactory.
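
A rough, untested sketch of that (the custom AttributeFactory itself is your
code; Lucene/Solr do not ship one for binary terms):

import java.io.Reader;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class MyKeywordTokenizerFactory extends BaseTokenizerFactory {
  // placeholder - replace with your custom AttributeFactory
  private static final AttributeFactory MY_FACTORY =
      AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;

  public Tokenizer create(Reader input) {
    // the KeywordTokenizer(AttributeFactory, Reader, int) ctor mentioned above
    return new KeywordTokenizer(MY_FACTORY, input,
        KeywordTokenizer.DEFAULT_BUFFER_SIZE);
  }
}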


command line to check if Solr is up running

2010-10-25 Thread Xin Li
As we know, we can use a browser to check whether Solr is running by going to 
http://$hostName:$portNumber/$masterName/admin, say 
http://localhost:8080/solr1/admin. My question is: are there any ways to check 
it using the command line? I used curl "http://localhost:8080" to check my 
Tomcat, and it worked fine. However, I get no response if I try curl 
"http://localhost:8080/solr1/admin" (even when my Solr is running). Does anyone 
know any command line alternatives?

Thanks,
Xin


Re: command line to check if Solr is up running

2010-10-25 Thread Rob Casson
you could look at the ping stuff:

 http://wiki.apache.org/solr/SolrConfigXml#The_Admin.2BAC8-GUI_Section

cheers,
rob

On Mon, Oct 25, 2010 at 3:56 PM, Xin Li x...@book.com wrote:
 As we know, we can use a browser to check whether Solr is running by going to 
 http://$hostName:$portNumber/$masterName/admin, say 
 http://localhost:8080/solr1/admin. My question is: are there any ways to 
 check it using the command line? I used curl "http://localhost:8080" to check 
 my Tomcat, and it worked fine. However, I get no response if I try curl 
 "http://localhost:8080/solr1/admin" (even when my Solr is running). Does 
 anyone know any command line alternatives?

 Thanks,
 Xin



Re: command line to check if Solr is up running

2010-10-25 Thread Ahmet Arslan
 My question is: are there any ways to check it using the command line? I
 used curl "http://localhost:8080" to check my Tomcat, and it worked fine.
 However, I get no response if I try curl "http://localhost:8080/solr1/admin"
 (even when my Solr is running). Does anyone know any command line
 alternatives?


What about curl "solr/admin/ping?echoParams=none&omitHeader=on"?
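
For scripting, one convenient variant (a sketch; host, port, and path as in
the examples above, and it assumes the ping handler is enabled in
solrconfig.xml) is to have curl print just the HTTP status code:

curl -s -o /dev/null -w "%{http_code}" "http://localhost:8080/solr1/admin/ping"

A 200 means Solr answered the ping; anything else (or no output at all) means
it is down or unhealthy.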



  


RE: command line to check if Solr is up running

2010-10-25 Thread Xin Li
Thanks Rob and Ahmet,

curl "http://localhost:8080/solr1/admin/ping" works fine :)

Xin



-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Monday, October 25, 2010 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: command line to check if Solr is up running

 My question is: are there any ways to check it using the command line? I
 used curl "http://localhost:8080" to check my Tomcat, and it worked fine.
 However, I get no response if I try curl "http://localhost:8080/solr1/admin"
 (even when my Solr is running). Does anyone know any command line
 alternatives?


What about curl "solr/admin/ping?echoParams=none&omitHeader=on"?



  



error in Solr log when adding documents?

2010-10-25 Thread Jonathan Rochkind
Has anyone seen anything like this before? The error message does not give
me very much information, and I'm not sure what's going on.


Oct 25, 2010 4:11:02 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR adding document SolrInputDocument [lengthy serialized hash of document being added is here]
        at org.apache.solr.handler.BinaryUpdateRequestHandler$2.document(BinaryUpdateRequestHandler.java:81)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:136)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:126)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:112)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175)
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:141)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:68)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:46)
        at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:55)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:619)



Re: DataImporter using pure solr add XML

2010-10-25 Thread Ken Stanley
On Mon, Oct 25, 2010 at 10:12 AM, Dario Rigolin
dario.rigo...@comperio.itwrote:

 Looking at DataImporter, I'm not sure if it's possible to import using a
 standard <add><doc>... XML document representing a document add operation.
 Generating <add><doc> is quite expensive in my application, and I have cached
 all those documents in a text column in a MySQL database.
 It would be easier for me to push all updated documents directly from the
 database instead of passing multiple XML files in stream mode to Solr.

 Thank you.

 Dario.



Dario,

Technically nothing is stopping you from using the DIH to import your XML
document(s). However, note that the <add><doc>...</doc></add> structure is not
required. In fact, you can make up your own structure for the documents, so
long as you configure the DIH to recognize it. At minimum, you should be
able to use something to the effect of:

<dataSource type="FileDataSource" encoding="UTF-8" />

<document>
    <entity
        name="some_unique_name_for_the_entity"
        rootEntity="false"
        dataSource="null"
        processor="FileListEntityProcessor"
        fileName="some_regex_matching_your_files.*\.xml$"
        baseDir="/path/to/xml/files"
        newerThan="${dataimporter.some_unique_name_for_the_entity.last_index_time}">

        <entity
            name="another_unique_entity_name"
            dataSource="some_unique_name_for_the_entity"
            processor="XPathEntityProcessor"
            url="${some_unique_name_for_the_entity.fileAbsolutePath}"
            forEach="/XMLROOT/CHILD_NODE"
            stream="true">

            <!-- An optional list of <field /> definitions if your XML
                 schema does not match that of Solr -->
        </entity>
    </entity>
</document>

The breakdown is as follows:

The <dataSource /> defines the document encoding that Solr should use for
your XML files.

The top-level <entity /> creates the list of files to parse (which is why the
fileName attribute supports regular expressions). The dataSource attribute
needs to be set to "null" here (I'm using 1.4.1, and AFAIK this is the same in
1.3 as well). The rootEntity="false" is important to tell Solr that it
should not try to define fields from this entity.

The second-level <entity /> is where the documents found in the file list
are processed and parsed. The dataSource attribute needs to be the name of
the top-level <entity />. The url attribute is defined as the absolute path
to the file generated by the top-level entity. The forEach is the key
component here; this is the minimum XPath needed to iterate over your
document structure. So, if for example you had:

<XMLROOT>
    <CHILD_NODE>
        <field1>data</field1>
        <field2>more data</field2>
        ...
    </CHILD_NODE>
</XMLROOT>

then forEach="/XMLROOT/CHILD_NODE" would iterate over each CHILD_NODE element.
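
If the fields inside CHILD_NODE do not line up with your Solr schema, the
optional <field /> list mentioned in the config above is where you map them; a
minimal sketch (the column and xpath values here are made up for illustration)
would be:

<field column="title" xpath="/XMLROOT/CHILD_NODE/field1" />
<field column="body" xpath="/XMLROOT/CHILD_NODE/field2" />

These <field /> definitions go inside the second-level <entity />.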

Also note that, in my experience, case sensitivity matters when writing your
XPath expressions.

I hope this helps!

- Ken Stanley


replication with multicores

2010-10-25 Thread Mike Zupan
On my master, for the forum core, I have the following in
forum/conf/solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

Then on the slave, for the forum core, I have the following in
forum/conf/solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://host.domain.com:8983/solr/forum/replication</str>
    <str name="pollInterval">00:00:20</str>
    <str name="compression">internal</str>
  </lst>
</requestHandler>

If I then hit the following URL for my master:

http://host.domain.com:8983/solr/forum/admin/replication/index.jsp

I see:

Local Index    Index Version: 1278007696445, Generation: 38534
Location: /data/solr/product/index

It is replicating another core's data, and I'm not sure how or why. Any
pointers to what I might be doing wrong? Replication is working for the
product core, but I don't have anything set up in that core.
Mike
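
A quick way to see which index a core's ReplicationHandler is actually
serving is its details status command, e.g. (host and core names as in the
config above):

curl "http://host.domain.com:8983/solr/forum/replication?command=details"

If the forum core reports /data/solr/product/index there, it is worth
checking that core's dataDir setting in solrconfig.xml.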


Re: DIH with several Cores

2010-10-25 Thread markwaddle

Unfortunately, what you are asking for is not possible. The DIH needs to be
configured separately for each core. I have a similar situation with my Solr
application. I am solving it by creating a custom index feeder that is aware
of all of the cores and knows which documents to send to which cores.


Re: Failing to successfully import international characters via DIH

2010-10-25 Thread virtas

As it turns out, the issue was somewhere in MySQL. I'm not sure exactly where,
but it had something to do with BLOB.

Now I have changed the text field from BLOB to VARCHAR and started using
mysql_real_escape_string in my PHP code, and everything started working just
fine.

Thanks for the help


after the slave node pulls the index from the master, when will Solr delete the tmp index dir?

2010-10-25 Thread Chengyang
I noticed that the slave node has some temporary Index.x directories that were
created during the index sync with the master, but they are not removed even
after several days. So when will Solr delete the tmp index dir?