Re: how to get all the docIds in the search result?

2009-07-23 Thread Avlesh Singh
query.setRows(Integer.MAX_VALUE);

Cheers
Avlesh

On Thu, Jul 23, 2009 at 8:15 AM, shb suh...@gmail.com wrote:

 When I use
  SolrQuery query = new SolrQuery();
   query.set("q", "issn:0002-9505");
   query.setRows(10);
   QueryResponse response = server.query(query);
 I can only get the 10 ids in the response.

 How can I get all the docIds in the search result?  Thanks.



Re: how to get all the docIds in the search result?

2009-07-23 Thread shb
if I use query.setRows(Integer.MAX_VALUE);
the query will become very slow, because the searcher will go
and fetch the field values in the index for all the returned
documents.

So if I set query.setRows(10), are there any other ways to
get all the ids? Thanks

2009/7/23 Avlesh Singh avl...@gmail.com

 query.setRows(Integer.MAX_VALUE);

 Cheers
 Avlesh

 On Thu, Jul 23, 2009 at 8:15 AM, shb suh...@gmail.com wrote:

  When I use
   SolrQuery query = new SolrQuery();
query.set("q", "issn:0002-9505");
query.setRows(10);
QueryResponse response = server.query(query);
  I only can get the 10 ids in the response.
 
  How can i get all the docIds  in the search result?  Thanks.
 



Re: how to get all the docIds in the search result?

2009-07-23 Thread Toby Cole
Have you tried limiting the fields that you're requesting to just the  
ID?

Something along the line of:

query.setRows(Integer.MAX_VALUE);
query.setFields("id");

Might speed the query up a little.

On 23 Jul 2009, at 09:11, shb wrote:


Here id is indeed the uniqueKey of a document.
I want to get all the ids for some other usage.


2009/7/23 Shalin Shekhar Mangar shalinman...@gmail.com


On Thu, Jul 23, 2009 at 1:09 PM, shb suh...@gmail.com wrote:


if I use query.setRows(Integer.MAX_VALUE);
the query will become very slow, because searcher will go
to fetch the filed value in the index for all the returned
document.

So if I set query.setRows(10), is there any other ways to
get all the ids? thanks



You should fetch as many rows as you need and not more. Why do you  
need all
the ids? I'm assuming that by id you mean the uniqueKey of a  
document.


--
Regards,
Shalin Shekhar Mangar.



--

Toby Cole
Software Engineer, Semantico Limited
toby.c...@semantico.com tel:+44 1273 358 238
Registered in England and Wales no. 03841410, VAT no. GB-744614334.
Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.

Check out all our latest news and thinking on the Discovery blog
http://blogs.semantico.com/discovery-blog/



Re: how to get all the docIds in the search result?

2009-07-23 Thread shb
I have tried the following code:
 query.setRows(Integer.MAX_VALUE);
 query.setFields("id");

When it returns 1,000,000 records, it takes about 22s.
This is very slow. Is there any other way?


2009/7/23 Toby Cole toby.c...@semantico.com

 Have you tried limiting the fields that you're requesting to just the ID?
 Something along the line of:

 query.setRows(Integer.MAX_VALUE);
  query.setFields("id");

 Might speed the query up a little.


 On 23 Jul 2009, at 09:11, shb wrote:

  Here id is indeed the uniqueKey of a document.
 I want to get all the ids  for some other  useage.


 2009/7/23 Shalin Shekhar Mangar shalinman...@gmail.com

  On Thu, Jul 23, 2009 at 1:09 PM, shb suh...@gmail.com wrote:

  if I use query.setRows(Integer.MAX_VALUE);
 the query will become very slow, because searcher will go
 to fetch the filed value in the index for all the returned
 document.

 So if I set query.setRows(10), is there any other ways to
 get all the ids? thanks


 You should fetch as many rows as you need and not more. Why do you need
 all
 the ids? I'm assuming that by id you mean the uniqueKey of a document.

 --
 Regards,
 Shalin Shekhar Mangar.


 --

 Toby Cole
 Software Engineer, Semantico Limited
 toby.c...@semantico.com tel:+44 1273 358 238
 Registered in England and Wales no. 03841410, VAT no. GB-744614334.
 Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.

 Check out all our latest news and thinking on the Discovery blog
 http://blogs.semantico.com/discovery-blog/




Index per user - thousands of indices in one Solr instance

2009-07-23 Thread Łukasz Osipiuk
Hi,

I am new to Solr and I want a quick hint on whether it is suitable for
what we want to use it for.
We are building an e-mail platform and we want to provide our users with
full-text search functionality.

We do not want to use a single index for all users, because we want
to be able to migrate a user's index from one machine
to another if the need for scaling arises. As we want a separate
index per user, a single Solr instance would have to
handle a few thousand (or hundreds of thousands of) indices (each
quite small in size).
We also need to add and remove indices online, as users register
accounts or are moved to a different computer in the cluster.

Was Solr designed with such a setup in mind? I searched the net but did
not find such a usage pattern.

We could use Lucene directly and implement the network layer and index
replication ourselves, but it would be nice to avoid that.


Best regards, Łukasz Osipiuk

-- 
Łukasz Osipiuk
mailto:luk...@osipiuk.net


Re: Index per user - thousands of indices in one Solr instance

2009-07-23 Thread Shalin Shekhar Mangar
On Thu, Jul 23, 2009 at 3:06 PM, Łukasz Osipiuk luk...@osipiuk.net wrote:


 I am new to Solr and I want to get a quick hint if it is suitable for
 what we want to use it for.
 We are building e-mail platform and we want to provide our users with
 full-text search functionality.

 We are not willing to use single index file for all users as we want
 to be able to migrate user index from one machine
 to another if need for scaling arises. As we want to have separate
 index file per user, single Solr instance would have to
 handle few thousands (or hundreds of thousands) index files (yet each
 quite small in size).
 We also need to add and remove indices online, as users register
 accounts or are moved to different computer in cluster.

 Was Solr designed with such setup in mind? I search the net but did
 not find such usage pattern.

 We can directly use Lucene and implement network layer and index
 replication by ourselves but it would be nice to avoid it.


Solr was not designed with such a setup in mind. However, we are working on
a similar use-case and building the additional features Solr would need.

See https://issues.apache.org/jira/browse/SOLR-1293

We're planning to put up a patch soon. Perhaps we can collaborate?
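
(For the add/remove-online part, note that cores can already be created and
unloaded at runtime through the CoreAdmin HTTP API; a minimal sketch, where
host, port and the per-user core name are placeholders and each core needs
its own instanceDir with a conf/ directory on disk:

http://localhost:8983/solr/admin/cores?action=CREATE&name=user123&instanceDir=user123
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=user123

What SOLR-1293 aims to add on top is making very large numbers of such cores
practical.)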

-- 
Regards,
Shalin Shekhar Mangar.


Re: Highlight arbitrary text

2009-07-23 Thread Anders Melchiorsen
On Tue, 21 Jul 2009 14:25:52 +0200, Anders Melchiorsen wrote:

 On Fri, 17 Jul 2009 16:04:24 +0200, Anders Melchiorsen wrote:

 However, in the normal highlighter, I am using usePhraseHighlighter and
 highlightMultiTerm and it seems that there is no way to turn these on in
 FieldAnalysisRequestHandler ?

 In case these options are not available with the
 FieldAnalysisRequestHandler,
 would it be simple to implement them with a plugin? The
highlightMultiTerm
 is absolutely needed, as we use a lot of prefix searches.

I tried following the FieldAnalysisRequestHandler code, but I could not
find
a place to plug in wildcard searching. Is it supposed to be simple (like
enabling a single option somewhere), or will it need a bunch of new code?



In related news, the highlighter is not exactly working correctly, because
I use the PatternTokenizer for the indexed fields, and
HTMLStripWhiteSpaceTokenizer
obviously gives slightly different results on the presentation field.

So, I tried creating my own plugin:

public class HTMLStripPatternTokenizerFactory extends PatternTokenizerFactory {
    public TokenStream create(Reader input) {
        return super.create(new org.apache.solr.analysis.HTMLStripReader(input));
    }
}

It seems to work, but is that the proper way to mix the HTML stripper and
the Pattern tokenizer? Obviously, I would prefer not having to maintain a
plugin,
even if it is a tiny one.


- Anders



Re: DataImportHandler / Import from DB : one data set comes in multiple rows

2009-07-23 Thread Glen Newton
Chantal,

You might consider LuSql[1].
It has much better performance than Solr DIH. It runs 4-10 times faster on a
multicore machine, and can run in 1/20th the heap size Solr needs. It
produces a Lucene index.

See slides 22-25 in this presentation comparing Solr DIH with LuSql:
 http://code4lib.org/files/glen_newton_LuSql.pdf

[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Disclosure: I am the author of LuSql.

Glen Newton
http://zzzoot.blogspot.com/

2009/7/22 Chantal Ackermann chantal.ackerm...@btelligent.de:
 Hi all,

 this is my first post, as I am new to SOLR (some Lucene exp).

 I am trying to load data from an existing datamart into SOLR using the
 DataImportHandler but in my opinion it is too slow due to the special
 structure of the datamart I have to use.

 Root Cause:
 This datamart uses a row based approach (pivot) to present its data. It was
 so done to allow adding more attributes to the data set without having to
 change the table structure.

 Impact:
 To use the DataImportHandler, I have to pivot the data to create again one
 row per data set. Unfortunately, this results in more queries, and less
 performant ones. Moreover, there are sometimes multiple rows for a single
 attribute, which require separate queries - or trickier subselects that
 probably don't speed things up.

 Here is an example of the relation between DB requests, row fetches and
 actual number of documents created:

 <lst name="statusMessages">
   <str name="Total Requests made to DataSource">3737</str>
   <str name="Total Rows Fetched">5380</str>
   <str name="Total Documents Skipped">0</str>
   <str name="Full Dump Started">2009-07-22 18:19:06</str>
   <str name="">
   Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
   </str>
   <str name="Committed">2009-07-22 18:22:29</str>
   <str name="Optimized">2009-07-22 18:22:29</str>
   <str name="Time taken ">0:3:22.484</str>
 </lst>

 (Full index creation.)
 There are about half a million data sets, in total. That would require about
 30h for indexing? My feeling is that there are far too many row fetches per
 data set.

 I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
 around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
 factor 10, ram buffer size 32).

 Possible solutions?
 A) Write my own DataImportHandler?
 B) Write my own MultiRowTransformer that accepts several rows as input
 argument (not sure this is a valid option)?
 C) Approach the DB developers to add a flat table with one data set per row?
 D) ...?

 If someone would like to share their experiences, that would be great!

 Thanks a lot!
 Chantal



 --
 Chantal Ackermann




-- 

-


Question re SOLR-920 Cache and reuse schema

2009-07-23 Thread Brian Klippel
https://issues.apache.org/jira/browse/SOLR-920


Where would the shared schema.xml be located (same as solr.xml?),  and how 
would dynamic schema play into this? Would each core's dynamic schema still be 
independent?




Re: Question re SOLR-920 Cache and reuse schema

2009-07-23 Thread Noble Paul നോബിള്‍ नोब्ळ्
shareSchema checks whether the schema.xml from a given file and
timestamp is already loaded. If yes, the old object is re-used.

All the cores which load the same file will share a single object.
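
A sketch of what this might look like in solr.xml (assuming the shareSchema
attribute from SOLR-920 goes on the cores element):

<solr persistent="true">
  <cores adminPath="/admin/cores" shareSchema="true">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

Cores that load the same schema.xml file (same path and timestamp) then share
a single IndexSchema instance.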

On Thu, Jul 23, 2009 at 3:32 PM, Brian Klippelbr...@theport.com wrote:
 https://issues.apache.org/jira/browse/SOLR-920


 Where would the shared schema.xml be located (same as solr.xml?),  and how 
 would dynamic schema play into this? Would each core's dynamic schema still 
 be independent?






-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Question re SOLR-920 Cache and reuse schema

2009-07-23 Thread Shalin Shekhar Mangar
On Thu, Jul 23, 2009 at 3:32 PM, Brian Klippel br...@theport.com wrote:

 https://issues.apache.org/jira/browse/SOLR-920


   and how would dynamic schema play into this? Would each core's dynamic
 schema still be independent?


I guess you mean dynamic fields. If so, then yes, you will still be able to
add values to dynamic fields for each core independently.
-- 
Regards,
Shalin Shekhar Mangar.


Re: Index per user - thousands of indices in one Solr instance

2009-07-23 Thread Shalin Shekhar Mangar
On Thu, Jul 23, 2009 at 4:30 PM, Łukasz Osipiuk luk...@osipiuk.net wrote:


  See https://issues.apache.org/jira/browse/SOLR-1293
 
  We're planning to put up a patch soon. Perhaps we can collaborate?

 What is your estimate for having this patch ready? We have quite
 tight deadlines
 and cannot afford months of development.
 If you are finishing up and have some well-separated tasks we can certainly
 help (preferably ones which do not require deep knowledge of Solr internals).
 Otherwise we will probably go for a quick hack using Lucene directly.


It is mostly done with some caveats (some features like alias/unalias are
not supported). We've been doing extensive performance testing with this
patch and we've already seen up to a 5x improvement in throughput.

We'll post the patch by tomorrow so you can take a look and get started.
I'll also start a wiki page and document the various features, configuration
options and performance benchmark results.

-- 
Regards,
Shalin Shekhar Mangar.


Re: DataImportHandler / Import from DB : one data set comes in multiple rows

2009-07-23 Thread Chantal Ackermann

Hi Paul, hi Glen, hi all,

thank you for your answers.

I have followed Paul's solution (as I received it earlier). (I'll keep 
your suggestion in mind, though, Glen.)


It looks good, except that it's not creating any documents... ;-)
It is most probably some misunderstanding on my side, and maybe you can 
help me correct that?


So, I have subclassed the SqlEntityProcessor, basically overriding
nextRow() as Paul suggested:


public Map<String, Object> nextRow() {
    if (rowcache != null)
        return getFromRowCache();
    if (rowIterator == null) {
        String q = getQuery();
        initQuery(resolver.replaceTokens(q));
    }
    Map<String, Object> pivottedRow = new HashMap<String, Object>();
    Map<String, Object> fieldRow = getNext();
    while (fieldRow != null) {
        // populate pivottedRow
        fieldRow = getNext();
    }
    pivottedRow = applyTransformer(pivottedRow);
    log.info("Returning: " + pivottedRow);
    return pivottedRow;
}

This seems to work as intended. From the log output, I can see that I 
get only the rows that I expect for one iteration in the correct 
key-value structure. I can also see, that the returned pivottedRow is 
what I want it to be: a map containing columns where each column 
contains what previously was input as a row.


Example (shortened):
INFO: Next fieldRow: {value=2, name=audio, id=1}
INFO: Next fieldRow: {value=773, name=cat, id=23}
INFO: Next fieldRow: {value=642058, name=sid, id=17}

INFO: Returning: {sid=642058, cat=[773], audio=2}

The entity declaration in the dih config (db_data_config.xml) looks like 
this (shortened):

<entity name="my_value" processor="PivotSqlEntityProcessor"
        columnValue="value" columnName="name"
        query="select id, name, value from datamart where
               parent_id=${id_definition.ID} and id in (1,23,17)">

    <field column="sid" name="sid" />
    <field column="audio" name="audio" />
    <field column="cat" name="cat" />
</entity>

id_definition is the root entity. Per parent_id there are several rows 
in the datamart table which describe one data set (=lucene document).


The object type of value is either String, String[] or List. I am not 
handling that explicitly, yet. If that'd be the problem it would throw 
an exception, wouldn't it?


But it is not creating any documents at all, although the data seems to 
be returned correctly from the processor, so it's probably something far 
more fundamental.

<str name="Total Requests made to DataSource">1069</str>
<str name="Total Rows Fetched">1069</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-23 12:57:07</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>

Any help / hint on what the root cause is or how to debug it would be 
greatly appreciated.


Thank you!
Chantal


Noble Paul നോബിള്‍ नोब्ळ् schrieb:

alternately, you can write your own EntityProcessor and just override
the nextRow() . I guess you can still use the JdbcDataSource

On Wed, Jul 22, 2009 at 10:05 PM, Chantal
Ackermannchantal.ackerm...@btelligent.de wrote:

Hi all,

this is my first post, as I am new to SOLR (some Lucene exp).

I am trying to load data from an existing datamart into SOLR using the
DataImportHandler but in my opinion it is too slow due to the special
structure of the datamart I have to use.

Root Cause:
This datamart uses a row based approach (pivot) to present its data. It was
so done to allow adding more attributes to the data set without having to
change the table structure.

Impact:
To use the DataImportHandler, i have to pivot the data to create again one
row per data set. Unfortunately, this results in more and less performant
queries. Moreover, there are sometimes multiple rows for a single attribute,
that require separate queries - or more tricky subselects that probably
don't speed things up.

Here is an example of the relation between DB requests, row fetches and
actual number of documents created:

 <lst name="statusMessages">
 <str name="Total Requests made to DataSource">3737</str>
 <str name="Total Rows Fetched">5380</str>
 <str name="Total Documents Skipped">0</str>
 <str name="Full Dump Started">2009-07-22 18:19:06</str>
 <str name="">
 Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
 </str>
 <str name="Committed">2009-07-22 18:22:29</str>
 <str name="Optimized">2009-07-22 18:22:29</str>
 <str name="Time taken ">0:3:22.484</str>
 </lst>

(Full index creation.)
There are about half a million data sets, in total. That would require about
30h for indexing? My feeling is that there are far too many row fetches per
data set.

I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
factor 10, ram buffer size 32).

Possible solutions?
A) Write my own DataImportHandler?
B) Write my own MultiRowTransformer that accepts several rows as input
argument (not sure this is a valid option)?
C) Approach the DB developers to add 

Facet

2009-07-23 Thread Nishant Chandra
Hi,
I am new to Solr and need help with the following use case:
I want to provide faceted browsing. For a given product, there are multiple
descriptions (feeds, the description being 100-1500 words) that my
application gets. I want to check for the presence of a fixed number of
terms or attributes (5-10 attributes for a product, e.g. weight, memory etc)
in the description. The attribute set will be different for each product
category. And then for a given product, I wish to display the numbers of
descriptions found for each attribute (the attribute text is present
somewhere in the description). A description can contain more than 1
attribute. How can this be achieved? Please help.
Thanks,
Nishant


Re: Facet

2009-07-23 Thread Ninad Raut
Try this out with SolrJ:

SolrQuery query = new SolrQuery();
query.setQuery(q);
// query.setQueryType("dismax");
query.setFacet(true);
query.addFacetField("id");
query.addFacetField("text");
query.setFacetMinCount(2);
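
To read the counts back, and to count descriptions per fixed attribute term,
facet.query may be a better fit than faceting on a full-text field; a rough
sketch (the field and term names are just placeholders):

// count descriptions that mention each fixed attribute term
query.addFacetQuery("description:weight");
query.addFacetQuery("description:memory");

QueryResponse response = server.query(query);

// facet.query counts come back keyed by the query string
for (Map.Entry<String, Integer> e : response.getFacetQuery().entrySet()) {
    System.out.println(e.getKey() + " -> " + e.getValue());
}

// facet.field counts (as set up above) are read like this
FacetField ff = response.getFacetField("text");
if (ff != null && ff.getValues() != null) {
    for (FacetField.Count c : ff.getValues()) {
        System.out.println(c.getName() + " -> " + c.getCount());
    }
}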


On Thu, Jul 23, 2009 at 5:12 PM, Nishant Chandra
nishant.chan...@gmail.comwrote:

 Hi,
 I am new to Solr and need help with the following use case:
 I want to provide faceted browsing. For a given product, there are multiple
 descriptions (feeds, the description being 100-1500 words) that my
 application gets. I want to check for the presence of a fixed number of
 terms or attributes (5-10 attributes for a product, e.g. weight, memory
 etc)
 in the description. The attribute set will be different for each product
 category. And then for a given product, I wish to display the numbers of
 descriptions found for each attribute (the attribute text is present
 somewhere in the description). A description can contain more than 1
 attribute. How can this be achieved? Please help.
 Thanks,
 Nishant



Re: DataImportHandler / Import from DB : one data set comes in multiple rows

2009-07-23 Thread Noble Paul നോബിള്‍ नोब्ळ्
Is there a uniqueKey in your schema? Are you returning a value
corresponding to that key name?

Perhaps you can paste the whole data-config.xml.



On Thu, Jul 23, 2009 at 4:59 PM, Chantal
Ackermannchantal.ackerm...@btelligent.de wrote:
 Hi Paul, hi Glen, hi all,

 thank you for your answers.

 I have followed Paul's solution (as I received it earlier). (I'll keep your
 suggestion in mind, though, Glen.)

 It looks good, except that it's not creating any documents... ;-)
 It is most probably some misunderstanding on my side, and maybe you can help
 me correct that?

 So, I have subclassed the SqlEntityProcessor by overwriting basically
 nextRow() as Paul suggested:

 public Map<String, Object> nextRow() {
        if (rowcache != null)
                return getFromRowCache();
        if (rowIterator == null) {
                String q = getQuery();
                initQuery(resolver.replaceTokens(q));
        }
        Map<String, Object> pivottedRow = new HashMap<String, Object>();
        Map<String, Object> fieldRow = getNext();
        while (fieldRow != null) {
                // populate pivottedRow
                fieldRow = getNext();
        }
        pivottedRow = applyTransformer(pivottedRow);
        log.info("Returning: " + pivottedRow);
        return pivottedRow;
 }

 This seems to work as intended. From the log output, I can see that I get
 only the rows that I expect for one iteration in the correct key-value
 structure. I can also see, that the returned pivottedRow is what I want it
 to be: a map containing columns where each column contains what previously
 was input as a row.

 Example (shortened):
 INFO: Next fieldRow: {value=2, name=audio, id=1}
 INFO: Next fieldRow: {value=773, name=cat, id=23}
 INFO: Next fieldRow: {value=642058, name=sid, id=17}

 INFO: Returning: {sid=642058, cat=[773], audio=2}

 The entity declaration in the dih config (db_data_config.xml) looks like
 this (shortened):
 <entity name="my_value" processor="PivotSqlEntityProcessor"
        columnValue="value" columnName="name"
        query="select id, name, value from datamart where
               parent_id=${id_definition.ID} and id in (1,23,17)">
        <field column="sid" name="sid" />
        <field column="audio" name="audio" />
        <field column="cat" name="cat" />
 </entity>

 id_definition is the root entity. Per parent_id there are several rows in
 the datamart table which describe one data set (=lucene document).

 The object type of value is either String, String[] or List. I am not
 handling that explicitly, yet. If that'd be the problem it would throw an
 exception, wouldn't it?

 But it is not creating any documents at all, although the data seems to be
 returned correctly from the processor, so it's pobably something far more
 fundamental.
 <str name="Total Requests made to DataSource">1069</str>
 <str name="Total Rows Fetched">1069</str>
 <str name="Total Documents Skipped">0</str>
 <str name="Full Dump Started">2009-07-23 12:57:07</str>
 <str name="">
 Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
 </str>

 Any help / hint on what the root cause is or how to debug it would be
 greatly appreciated.

 Thank you!
 Chantal


 Noble Paul നോബിള്‍ नोब्ळ् schrieb:

 alternately, you can write your own EntityProcessor and just override
 the nextRow() . I guess you can still use the JdbcDataSource

 On Wed, Jul 22, 2009 at 10:05 PM, Chantal
 Ackermannchantal.ackerm...@btelligent.de wrote:

 Hi all,

 this is my first post, as I am new to SOLR (some Lucene exp).

 I am trying to load data from an existing datamart into SOLR using the
 DataImportHandler but in my opinion it is too slow due to the special
 structure of the datamart I have to use.

 Root Cause:
 This datamart uses a row based approach (pivot) to present its data. It
 was
 so done to allow adding more attributes to the data set without having to
 change the table structure.

 Impact:
 To use the DataImportHandler, i have to pivot the data to create again
 one
 row per data set. Unfortunately, this results in more and less performant
 queries. Moreover, there are sometimes multiple rows for a single
 attribute,
 that require separate queries - or more tricky subselects that probably
 don't speed things up.

 Here is an example of the relation between DB requests, row fetches and
 actual number of documents created:

  <lst name="statusMessages">
  <str name="Total Requests made to DataSource">3737</str>
  <str name="Total Rows Fetched">5380</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-07-22 18:19:06</str>
  <str name="">
  Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
  </str>
  <str name="Committed">2009-07-22 18:22:29</str>
  <str name="Optimized">2009-07-22 18:22:29</str>
  <str name="Time taken ">0:3:22.484</str>
  </lst>

 (Full index creation.)
 There are about half a million data sets, in total. That would require
 about
 30h for indexing? My feeling is that there are far too many row fetches
 per
 data set.

 I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
 

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

2009-07-23 Thread Chantal Ackermann

Hi Paul,

no, I didn't return the unique key, though there is one defined. I added 
that to the nextRow() implementation, and I am now returning it as part 
of the map.


But it is still not creating any documents, and now that I can see the 
ID I have realized that it is always processing the same - the first - 
data set. It's like it tries to create the first document but does not, 
then reiterates over that same data, fails again, and so on. I mean, it 
doesn't even create one document. So it cannot be a simple iteration 
that updates the same document over and over again (as there is none).


I haven't changed the log level. I see no error message in the output 
(catalina.log in my case).


The complete entity definition:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="oracle.jdbc.driver.OracleDriver" ... />

  <document name="doc">
    <entity name="epg_definition" pk="ID"
            query="select ID from DEFINITION">
      <!-- originally I would set the field id (unique key) on this level,
           doesn't work either -->
      <entity name="value" pk="DEF_ID"
              processor="PivotSqlEntityProcessor"
              query="select DEF_ID, id, name, value from datamart where
                     parent_id=${id_definition.ID} and id in (1,23,17)">

        <field column="DEF_ID" name="id" />
        <field column="sid" name="sid" />
        <field column="audio" name="audio" />
        <field column="cat" name="cat" />
      </entity>
    </entity>
  </document>
</dataConfig>

schema:
<field name="id" type="long" indexed="true" stored="true" required="true" />
<field name="sid" type="long" indexed="true" stored="true" required="true" />
<field name="audio" type="text_ws" indexed="true" stored="false"
       omitNorms="true" multiValued="true"/>
<field name="cat" type="text_ws" indexed="true" stored="true"
       omitNorms="true" multiValued="true"/>


I am using more fields, but I removed them to make it easier to read. I 
am thinking about removing them from my test to be sure they don't 
interfere.


Thanks for your help!
Chantal


Noble Paul നോബിള്‍ नोब्ळ् schrieb:

Is there a uniqueKey in your schema ? are you returning a value
corresponding to that key name?

probably you can paste the whole data-config.xml



On Thu, Jul 23, 2009 at 4:59 PM, Chantal
Ackermannchantal.ackerm...@btelligent.de wrote:

Hi Paul, hi Glen, hi all,

thank you for your answers.

I have followed Paul's solution (as I received it earlier). (I'll keep your
suggestion in mind, though, Glen.)

It looks good, except that it's not creating any documents... ;-)
It is most probably some misunderstanding on my side, and maybe you can help
me correct that?

So, I have subclassed the SqlEntityProcessor by overwriting basically
nextRow() as Paul suggested:

public Map<String, Object> nextRow() {
   if (rowcache != null)
       return getFromRowCache();
   if (rowIterator == null) {
       String q = getQuery();
       initQuery(resolver.replaceTokens(q));
   }
   Map<String, Object> pivottedRow = new HashMap<String, Object>();
   Map<String, Object> fieldRow = getNext();
   while (fieldRow != null) {
       // populate pivottedRow
       fieldRow = getNext();
   }
   pivottedRow = applyTransformer(pivottedRow);
   log.info("Returning: " + pivottedRow);
   return pivottedRow;
}

This seems to work as intended. From the log output, I can see that I get
only the rows that I expect for one iteration in the correct key-value
structure. I can also see, that the returned pivottedRow is what I want it
to be: a map containing columns where each column contains what previously
was input as a row.

Example (shortened):
INFO: Next fieldRow: {value=2, name=audio, id=1}
INFO: Next fieldRow: {value=773, name=cat, id=23}
INFO: Next fieldRow: {value=642058, name=sid, id=17}

INFO: Returning: {sid=642058, cat=[773], audio=2}

The entity declaration in the dih config (db_data_config.xml) looks like
this (shortened):
<entity name="my_value" processor="PivotSqlEntityProcessor"
   columnValue="value" columnName="name"
   query="select id, name, value from datamart where
          parent_id=${id_definition.ID} and id in (1,23,17)">
   <field column="sid" name="sid" />
   <field column="audio" name="audio" />
   <field column="cat" name="cat" />
</entity>

id_definition is the root entity. Per parent_id there are several rows in
the datamart table which describe one data set (=lucene document).

The object type of value is either String, String[] or List. I am not
handling that explicitly, yet. If that'd be the problem it would throw an
exception, wouldn't it?

But it is not creating any documents at all, although the data seems to be
returned correctly from the processor, so it's pobably something far more
fundamental.
<str name="Total Requests made to DataSource">1069</str>
<str name="Total Rows Fetched">1069</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-23 12:57:07</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>

Any help / hint on what the root 

Re: DataImportHandler / Import from DB : one data set comes in multiple rows

2009-07-23 Thread Otis Gospodnetic
Note that the statement about LuSql (or really any other tool, LuSql is just an 
example because it was mentioned) is true only if Solr is underutilized because 
DIH uses a single thread to talk to Solr (is this correct?) vs. LuSql using 
multiple (I'm guessing that's the case because of the multicore comment).

But, if the DB itself is your bottleneck, and I've seen plenty of such cases, 
then speed of DIH vs. LuSql vs. something else matters less.  Glen, please 
correct me if I'm wrong about this - I know you have done plenty of 
benchmarking. :)

 Otis
--
Sematext is hiring: http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Glen Newton glen.new...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 5:52:43 AM
 Subject: Re: DataImportHandler / Import from DB : one data set comes in  
 multiple rows
 
 Chantal,
 
 You might consider LuSql[1].
 It has much better performance than Solr DIH. It runs 4-10 times faster on a
 multicore machine, and can run in 1/20th the heap size Solr needs. It
 produces a Lucene index.
 
 See slides 22-25 in this presentation comparing Solr DIH with LuSql:
 http://code4lib.org/files/glen_newton_LuSql.pdf
 
 [1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
 
 Disclosure: I am the author of LuSql.
 
 Glen Newton
 http://zzzoot.blogspot.com/
 
 2009/7/22 Chantal Ackermann :
  Hi all,
 
  this is my first post, as I am new to SOLR (some Lucene exp).
 
  I am trying to load data from an existing datamart into SOLR using the
  DataImportHandler but in my opinion it is too slow due to the special
  structure of the datamart I have to use.
 
  Root Cause:
  This datamart uses a row based approach (pivot) to present its data. It was
  so done to allow adding more attributes to the data set without having to
  change the table structure.
 
  Impact:
  To use the DataImportHandler, i have to pivot the data to create again one
  row per data set. Unfortunately, this results in more and less performant
  queries. Moreover, there are sometimes multiple rows for a single attribute,
  that require separate queries - or more tricky subselects that probably
  don't speed things up.
 
  Here is an example of the relation between DB requests, row fetches and
  actual number of documents created:
 
  
  <lst name="statusMessages">
  <str name="Total Requests made to DataSource">3737</str>
  <str name="Total Rows Fetched">5380</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-07-22 18:19:06</str>
  <str name="">
  Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
  </str>
  <str name="Committed">2009-07-22 18:22:29</str>
  <str name="Optimized">2009-07-22 18:22:29</str>
  <str name="Time taken ">0:3:22.484</str>
  </lst>
  
 
  (Full index creation.)
  There are about half a million data sets, in total. That would require about
  30h for indexing? My feeling is that there are far too many row fetches per
  data set.
 
  I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
  around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
  factor 10, ram buffer size 32).
 
  Possible solutions?
  A) Write my own DataImportHandler?
  B) Write my own MultiRowTransformer that accepts several rows as input
  argument (not sure this is a valid option)?
  C) Approach the DB developers to add a flat table with one data set per row?
  D) ...?
 
  If someone would like to share their experiences, that would be great!
 
  Thanks a lot!
  Chantal
 
 
 
  --
  Chantal Ackermann
 
 
 
 
 -- 
 
 -



Re: how to get all the docIds in the search result?

2009-07-23 Thread Otis Gospodnetic
You could pull the IDs directly from the Lucene index; that may be a little 
faster.
You can also use Lucene's TermEnum to get to this.
And you should make sure that the id field is the first field in your documents 
(when you index them).

But no matter what you do, this will not be subsecond for non-trivial indices - 
it's the equivalent of a full table scan in the RDBMS world.
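
A rough sketch of the TermEnum approach (Lucene 2.x API; the index path and
the assumption that the uniqueKey field is named "id" are placeholders, and
exception handling is omitted):

IndexReader reader = IndexReader.open("/path/to/solr/data/index");
TermEnum terms = reader.terms(new Term("id", ""));
try {
    do {
        Term t = terms.term();
        // stop once we run off the end of the "id" field
        if (t == null || !"id".equals(t.field())) break;
        System.out.println(t.text());   // one indexed id value per term
    } while (terms.next());
} finally {
    terms.close();
    reader.close();
}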

Otis
--
Sematext is hiring: http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: shb suh...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 5:35:29 AM
 Subject: Re: how to get all the docIds in the search result?
 
 I have tried the following code:
 query.setRows(Integer.MAX_VALUE);
 query.setFields("id");
 
 when it return 1000,000 records, it will take about 22s.
 This is very slow. Is there any other way?
 
 
 2009/7/23 Toby Cole 
 
  Have you tried limiting the fields that you're requesting to just the ID?
  Something along the line of:
 
  query.setRows(Integer.MAX_VALUE);
  query.setFields("id");
 
  Might speed the query up a little.
 
 
  On 23 Jul 2009, at 09:11, shb wrote:
 
   Here id is indeed the uniqueKey of a document.
  I want to get all the ids  for some other  useage.
 
 
  2009/7/23 Shalin Shekhar Mangar 
 
   On Thu, Jul 23, 2009 at 1:09 PM, shb wrote:
 
   if I use query.setRows(Integer.MAX_VALUE);
  the query will become very slow, because searcher will go
  to fetch the filed value in the index for all the returned
  document.
 
  So if I set query.setRows(10), is there any other ways to
  get all the ids? thanks
 
 
  You should fetch as many rows as you need and not more. Why do you need
  all
  the ids? I'm assuming that by id you mean the uniqueKey of a document.
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 
  --
 
  Toby Cole
  Software Engineer, Semantico Limited
  
  Registered in England and Wales no. 03841410, VAT no. GB-744614334.
  Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.
 
  Check out all our latest news and thinking on the Discovery blog
  http://blogs.semantico.com/discovery-blog/
 
 



Sort field

2009-07-23 Thread Jörg Agatz
Hello...

I have a problem...

I want to sort on a field.

At the moment the field type is text, but I have also tested it with string
and date.
The content of the field looks like 22.07.09 - it is a date.

When I sort, I get:

failed to open stream: HTTP request failed! HTTP/1.1 500
there_are_more_terms_than_documents_in_field_ERP_ERP_FILE_CONTENT_DATUM_but_its_impossible_to_sort_on_tokenized_f
in /var/www/search.php on line 23

What happened?




Re: Sort field

2009-07-23 Thread Erik Hatcher


On Jul 23, 2009, at 11:03 AM, Jörg Agatz wrote:


Hallo...

I have a problem...

i want to sort a field

at the Moment the field type is text, but i have test it with  
string or

date
the content of the field looks like 22.07.09 it is a Date.

when i sort, i get :

failed to open stream: HTTP request failed! HTTP/1.1 500
there_are_more_terms_than_documents_in_field_ERP_ERP_FILE_CONTENT_DATUM_but_its_impossible_to_sort_on_tokenized_f
in */var/www/search.php* on line *23

What happen?


You have to sort on a field that only has a single indexed term per  
document.  A string with indexed=true is one option.  Use copyField  
to copy your text field to a string version if it is as simple as that  
for your sorting needs.
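
A sketch of what that could look like in schema.xml, assuming the field from
the error message is ERP_ERP_FILE_CONTENT_DATUM and using a hypothetical
DATUM_SORT destination field:

<field name="DATUM_SORT" type="string" indexed="true" stored="false"/>

<copyField source="ERP_ERP_FILE_CONTENT_DATUM" dest="DATUM_SORT"/>

Then sort with sort=DATUM_SORT asc. Note that a dd.MM.yy value like 22.07.09
sorts lexically, so for chronological order you would also want to index a
sortable form (e.g. 20090722, or a real date field).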


Erik



Re: how to get all the docIds in the search result?

2009-07-23 Thread Erik Hatcher
Rather than trying to get all document id's in one call to Solr,  
consider paging through the results.  Set rows=1000 or probably  
larger, then check the numFound and continue making requests to Solr  
incrementing start parameter accordingly until done.
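
A minimal sketch of that paging loop with SolrJ (reusing the query and server
from earlier in the thread; the page size of 1000 is only an example):

SolrQuery query = new SolrQuery("issn:0002-9505");
query.setFields("id");
query.setRows(1000);

int start = 0;
long numFound = 0;
do {
    query.setStart(start);
    QueryResponse response = server.query(query);
    SolrDocumentList page = response.getResults();
    numFound = page.getNumFound();
    for (SolrDocument doc : page) {
        Object id = doc.getFieldValue("id");   // collect/process the id here
    }
    start += page.size();
} while (start < numFound);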


Erik

On Jul 23, 2009, at 5:35 AM, shb wrote:


I have tried the following code:
query.setRows(Integer.MAX_VALUE);
query.setFields("id");

when it return 1000,000 records, it will take about 22s.
This is very slow. Is there any other way?


2009/7/23 Toby Cole toby.c...@semantico.com

Have you tried limiting the fields that you're requesting to just  
the ID?

Something along the line of:

query.setRows(Integer.MAX_VALUE);
query.setFields("id");

Might speed the query up a little.


On 23 Jul 2009, at 09:11, shb wrote:

Here id is indeed the uniqueKey of a document.

I want to get all the ids  for some other  useage.


2009/7/23 Shalin Shekhar Mangar shalinman...@gmail.com

On Thu, Jul 23, 2009 at 1:09 PM, shb suh...@gmail.com wrote:


if I use query.setRows(Integer.MAX_VALUE);

the query will become very slow, because searcher will go
to fetch the filed value in the index for all the returned
document.

So if I set query.setRows(10), is there any other ways to
get all the ids? thanks


You should fetch as many rows as you need and not more. Why do  
you need

all
the ids? I'm assuming that by id you mean the uniqueKey of a  
document.


--
Regards,
Shalin Shekhar Mangar.



--

Toby Cole
Software Engineer, Semantico Limited
toby.c...@semantico.com tel:+44 1273 358 238
Registered in England and Wales no. 03841410, VAT no. GB-744614334.
Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.

Check out all our latest news and thinking on the Discovery blog
http://blogs.semantico.com/discovery-blog/






Re: DataImportHandler / Import from DB : one data set comes in multiple rows

2009-07-23 Thread Glen Newton
Hi Otis,

Yes, you are right: LuSql is heavily optimized for multi-thread/multi-core.
It also performs better on a single core with multiple threads, due to
the heavily I/O-bound nature of Lucene indexing.

So if the DB is the bottleneck, well, yes, then LuSql and any other
tool are not going to help. Resolve the DB bottleneck, and then decide
what tool best serves your indexing requirements.

Only slightly off topic: I have noticed one problem with DBs (with
LuSql and custom JDBC clients processing records) when the fetch size
is too large and the amount of processing of each record gets too
large: sometimes the connection times out because the time between
getting the next batch takes too long (due to the accumulated delay
from processing all the records). This is solved by reducing the fetch size.
I am not sure if Solr/DIH users have experienced this. LuSql allows
setting the fetch size (like DIH, I believe) and (in an unreleased version)
re-issues the SQL and offsets to the last+1 record when this happens.
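
In plain JDBC terms that is just a hint on the Statement; a rough sketch
(connection details and the query are placeholders -- for DIH, I believe the
equivalent knob is the batchSize attribute on JdbcDataSource):

Connection conn = DriverManager.getConnection(jdbcUrl, dbUser, dbPassword);
Statement stmt = conn.createStatement();
// keep batches small enough that per-record processing does not
// outlast the connection between fetches
stmt.setFetchSize(500);
ResultSet rs = stmt.executeQuery("select id, name, value from datamart");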

-glen

2009/7/23 Otis Gospodnetic otis_gospodne...@yahoo.com:
 Note that the statement about LuSql (or really any other tool, LuSql is just 
 an example because it was mentioned) is true only if Solr is underutilized 
 because DIH uses a single thread to talk to Solr (is this correct?) vs. LuSql 
 using multiple (I'm guessing that's the case becase of the multicore comment).

 But, if the DB itself if your bottleneck, and I've seen plenty of such cases, 
 then speed of DIH vs. LuSql vs. something else matters less.  Glen, please 
 correct me if I'm wrong about this - I know you have done plenty of 
 benchmarking. :)

  Otis
 --
 Sematext is hiring: http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



 - Original Message 
 From: Glen Newton glen.new...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 5:52:43 AM
 Subject: Re: DataImportHandler / Import from DB : one data set comes in  
 multiple rows

 Chantal,

 You might consider LuSql[1].
 It has much better performance than Solr DIH. It runs 4-10 times faster on a
 multicore machine, and can run in 1/20th the heap size Solr needs. It
 produces a Lucene index.

 See slides 22-25 in this presentation comparing Solr DIH with LuSql:
 http://code4lib.org/files/glen_newton_LuSql.pdf

 [1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

 Disclosure: I am the author of LuSql.

 Glen Newton
 http://zzzoot.blogspot.com/

 2009/7/22 Chantal Ackermann :
  Hi all,
 
  this is my first post, as I am new to SOLR (some Lucene exp).
 
  I am trying to load data from an existing datamart into SOLR using the
  DataImportHandler but in my opinion it is too slow due to the special
  structure of the datamart I have to use.
 
  Root Cause:
  This datamart uses a row based approach (pivot) to present its data. It was
  so done to allow adding more attributes to the data set without having to
  change the table structure.
 
  Impact:
  To use the DataImportHandler, i have to pivot the data to create again one
  row per data set. Unfortunately, this results in more and less performant
  queries. Moreover, there are sometimes multiple rows for a single 
  attribute,
  that require separate queries - or more tricky subselects that probably
  don't speed things up.
 
  Here is an example of the relation between DB requests, row fetches and
  actual number of documents created:
 
 
  <lst name="statusMessages">
  <str name="Total Requests made to DataSource">3737</str>
  <str name="Total Rows Fetched">5380</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-07-22 18:19:06</str>
  <str name="">
  Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
  </str>
  <str name="Committed">2009-07-22 18:22:29</str>
  <str name="Optimized">2009-07-22 18:22:29</str>
  <str name="Time taken ">0:3:22.484</str>
  </lst>
 
 
  (Full index creation.)
  There are about half a million data sets, in total. That would require 
  about
  30h for indexing? My feeling is that there are far too many row fetches per
  data set.
 
  I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
  around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
  factor 10, ram buffer size 32).
 
  Possible solutions?
  A) Write my own DataImportHandler?
  B) Write my own MultiRowTransformer that accepts several rows as input
  argument (not sure this is a valid option)?
  C) Approach the DB developers to add a flat table with one data set per 
  row?
  D) ...?
 
  If someone would like to share their experiences, that would be great!
 
  Thanks a lot!
  Chantal
 
 
 
  --
  Chantal Ackermann
 



 --

 -





-- 

-


Re: excluding certain terms from facet counts when faceting based on indexed terms of a field

2009-07-23 Thread Bill Au
I want to exclude a very small number of terms which will be different for
each query.  So I think my best bet is to use localParam.

Bill

On Wed, Jul 22, 2009 at 4:16 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : I am faceting based on the indexed terms of a field by using facet.field.
 : Is there any way to exclude certain terms from the facet counts?

 if you're talking about a lot of terms, and they're going to be the same
 for *all* queries, the best approach is to strip them out when indexing
 (StopWordFilter is your friend)

 -Hoss




Re: excluding certain terms from facet counts when faceting based on indexed terms of a field

2009-07-23 Thread Erik Hatcher
Given it is a small number of terms, it seems like just excluding them
from use/visibility on the client would be reasonable.


Erik

On Jul 23, 2009, at 11:43 AM, Bill Au wrote:

I want to exclude a very small number of terms which will be  
different for

each query.  So I think my best bet is to use localParam.

Bill

On Wed, Jul 22, 2009 at 4:16 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:



: I am faceting based on the indexed terms of a field by using  
facet.field.

: Is there any way to exclude certain terms from the facet counts?

if you're talking about a lot of terms, and they're going to be hte  
same
for *all* queries, the best appraoch is to strip them out when  
indexing

(StopWordFilter is your freind)

-Hoss






Re: excluding certain terms from facet counts when faceting based on indexed terms of a field

2009-07-23 Thread Bill Au
That's actually what we have been doing.  I was just wondering if there is
any way to move this work from the client back into Solr.

Bill

On Thu, Jul 23, 2009 at 11:47 AM, Erik Hatcher
e...@ehatchersolutions.comwrote:

 Give it is a small number of terms, seems like just excluding them from
 use/visibility on the client would be reasonable.

Erik


 On Jul 23, 2009, at 11:43 AM, Bill Au wrote:

  I want to exclude a very small number of terms which will be different for
 each query.  So I think my best bet is to use localParam.

 Bill

 On Wed, Jul 22, 2009 at 4:16 PM, Chris Hostetter
 hossman_luc...@fucit.orgwrote:


 : I am faceting based on the indexed terms of a field by using
 facet.field.
 : Is there any way to exclude certain terms from the facet counts?

 if you're talking about a lot of terms, and they're going to be hte same
 for *all* queries, the best appraoch is to strip them out when indexing
 (StopWordFilter is your freind)

 -Hoss






Re: how to get all the docIds in the search result?

2009-07-23 Thread Otis Gospodnetic
And if I may add another thing - if you are using Solr in this fashion, have a 
look at your caches, esp. the document cache. If your queries of this type are 
repeated, you may benefit from a large cache.  Or, if they are not, you may 
completely disable some caches.
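
For reference, the document cache is configured in solrconfig.xml; a sketch
with purely illustrative sizes, matching the stock layout:

<documentCache class="solr.LRUCache"
               size="16384"
               initialSize="4096"
               autowarmCount="0"/>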

 Otis
--
Sematext is hiring: http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Erik Hatcher e...@ehatchersolutions.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 11:15:45 AM
 Subject: Re: how to get all the docIds in the search result?
 
 Rather than trying to get all document id's in one call to Solr, consider 
 paging 
 through the results.  Set rows=1000 or probably larger, then check the 
 numFound 
 and continue making requests to Solr incrementing start parameter accordingly 
 until done.
 
 Erik
 
 On Jul 23, 2009, at 5:35 AM, shb wrote:
 
  I have tried the following code:
  query.setRows(Integer.MAX_VALUE);
  query.setFields("id");
  
  when it return 1000,000 records, it will take about 22s.
  This is very slow. Is there any other way?
  
  
  2009/7/23 Toby Cole 
  
  Have you tried limiting the fields that you're requesting to just the ID?
  Something along the line of:
  
  query.setRows(Integer.MAX_VALUE);
   query.setFields("id");
  
  Might speed the query up a little.
  
  
  On 23 Jul 2009, at 09:11, shb wrote:
  
  Here id is indeed the uniqueKey of a document.
  I want to get all the ids  for some other  useage.
  
  
  2009/7/23 Shalin Shekhar Mangar 
  
  On Thu, Jul 23, 2009 at 1:09 PM, shb wrote:
  
  if I use query.setRows(Integer.MAX_VALUE);
  the query will become very slow, because searcher will go
  to fetch the filed value in the index for all the returned
  document.
  
  So if I set query.setRows(10), is there any other ways to
  get all the ids? thanks
  
  
  You should fetch as many rows as you need and not more. Why do you need
  all
  the ids? I'm assuming that by id you mean the uniqueKey of a document.
  
  --
  Regards,
  Shalin Shekhar Mangar.
  
  
  --
  
  Toby Cole
  Software Engineer, Semantico Limited
  
  Registered in England and Wales no. 03841410, VAT no. GB-744614334.
  Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.
  
  Check out all our latest news and thinking on the Discovery blog
  http://blogs.semantico.com/discovery-blog/
  
  



Re: Storing string field in solr.ExternalFieldFile type

2009-07-23 Thread Jibo John

Thanks for the response, Eric.

We have seen that the size of the index has a direct impact on the search
speed, especially when the index size is in GBs, so we are trying all
possible ways to keep the index size as low as we can.


We thought the solr.ExternalFileField type would help to keep the index
size low by storing a text field outside of the index.


Here's what we were planning: initially, all the fields except the
solr.ExternalFileField type field will be queried and displayed to the
end user. There will be subsequent calls from the UI to pull the
solr.ExternalFileField field, which will be loaded in a lazy manner.


However, we realized that solr.ExternalFileField only supports the float
type, while the data that we're planning to keep as an external field is
a string type.


Thanks,
-Jibo



On Jul 22, 2009, at 1:46 PM, Erick Erickson wrote:


Hoping the experts chime in if I'm wrong, but
As far as I know, while storing a field increases the size of an  
index,

it doesn't have much impact on the search speed. Which you could
pretty easily test by creating the index both ways and firing off some
timing queries and comparing. Although it would be time  
consuming...


I believe there's some info on the Lucene Wiki about this, but my  
memory

isn't what it used to be.

Erick


On Tue, Jul 21, 2009 at 2:42 PM, Jibo John jiboj...@mac.com wrote:


We're in the process of building a log searcher application.

In order to reduce the index size to improve the query performance,  
we're

exploring the possibility of having:

1. One field for each log line with 'indexed=true  stored=false'  
that

will be used for searching
2. Another field for each log line of type solr.ExternalFileField  
that

will be used only for display purpose.

We realized that currently solr.ExternalFileField supports only  
float type.


Is there a way we can override this to support string type? Any  
issues with

this approach?

Any ideas are welcome.


Thanks,
-Jibo







Re: Storing string field in solr.ExternalFieldFile type

2009-07-23 Thread Otis Gospodnetic
I'm not sure if there is a lot of benefit from storing the literal values in 
that external file vs. directly in the index.  There are a number of things one 
should look at first, as far as performance is concerned - JVM settings, cache 
sizes, analysis, etc.

For example, I have one index here that is 9 times the size of the original 
data because of how its fields are analyzed.  I can change one analysis-level 
setting and make that ratio go down to 2.  So I'd look at other, more straight 
forward things first.  There is a Wiki page either on Solr or Lucene Wiki 
dedicated to various search performance tricks.

 Otis
--
Sematext is hiring: http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Jibo John jiboj...@mac.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 12:08:26 PM
 Subject: Re: Storing string field in solr.ExternalFieldFile type
 
 Thanks for the response, Eric.
 
 We have seen that size of the index has a direct impact on the search speed, 
 especially when the index size is in GBs, so trying all possible ways to keep 
 the index size as low as we can.
 
 We thought solr.ExternalFileField type would help to keep the index size low 
 by 
 storing a text field out side of the index.
 
 Here's what we were planning: initially, all the fields except the 
 solr.ExternalFileField type field will be queried and will be displayed to 
 the 
 end user. . There will be subsequent calls from the UI  to pull the 
 solr.ExternalFileField field that will be loaded in a lazy manner.
 
 However, realized that solr.ExternalFileField only supports float type, 
 however, 
 the data that we're planning to keep as an external field is a string type.
 
 Thanks,
 -Jibo
 
 
 
 On Jul 22, 2009, at 1:46 PM, Erick Erickson wrote:
 
  Hoping the experts chime in if I'm wrong, but
  As far as I know, while storing a field increases the size of an index,
  it doesn't have much impact on the search speed. Which you could
  pretty easily test by creating the index both ways and firing off some
  timing queries and comparing. Although it would be time consuming...
  
  I believe there's some info on the Lucene Wiki about this, but my memory
  isn't what it used to be.
  
  Erick
  
  
  On Tue, Jul 21, 2009 at 2:42 PM, Jibo John wrote:
  
  We're in the process of building a log searcher application.
  
  In order to reduce the index size to improve the query performance, we're
  exploring the possibility of having:
  
  1. One field for each log line with 'indexed=true  stored=false' that
  will be used for searching
  2. Another field for each log line of type solr.ExternalFileField that
  will be used only for display purpose.
  
  We realized that currently solr.ExternalFileField supports only float type.
  
  Is there a way we can override this to support string type? Any issues with
  this approach?
  
  Any ideas are welcome.
  
  
  Thanks,
  -Jibo
  
  
  



index backup works only if there are committed index

2009-07-23 Thread solr jay
Hi,

I noticed that the backup request

http://master_host:port/solr/replication?command=backup

works only if there are committed index data, i.e.
core.getDeletionPolicy().getLatestCommit() is not null. Otherwise, no backup
is created. It sounds logical because if nothing has been committed since
your last backup, it doesn't help much to do a new backup. However, consider
this scenario:

1. a backup process is scheduled at 1:00AM every Monday
2. just before 1:00AM, the system is shutdown (for whatever reason), and
then restarts
3. No index is committed before 1:00AM
4. at 1:00AM, backup process starts and no committed index is found, and
therefore no backup (until next week)

The probability of this scenario is probably small, but it still could
happen, and it seems to me that if I want to back up the index, a backup should
be created whether there is newly committed index data or not.

Your thoughts?

Thanks,

-- 
J


Re: Solr and UIMA

2009-07-23 Thread Grant Ingersoll


On Jul 21, 2009, at 11:57 AM, JCodina wrote:



Hello, Grant,

there are two ways to implement this: one is payloads, and the other one is
multiple tokens at the same position.
Each of them can be useful; let me explain the way I think they can be used.

Payloads: every token has extra information that can be used in the
processing. For example, if I can add part-of-speech tags then I can develop
tokenizers that take the POS into account (or, for example, I can generate
bigrams of Noun Adjective, or Noun prep Noun, or I can have a better
stopwords algorithm).

Multiple tokens in one position: if I can have different tokens at the same
place, I can have different pieces of information like: was #verb _be, so I
can do a search for "you _be #adjective" to find all the sentences that talk
about "you", for example "you were clever", "you are tall" ..

you for example you were clever you are tall ..


This was one of the use cases for payloads as well, but it likely  
needs more Query support at the moment, as the BoostingTermQuery would  
only allow you to boost values where it's a verb, not include/exclude.





I have not understood the way that the DelimitedPayloadTokenFilterFactory
may work in Solr. What is the input format?


the DPTFF (nice acronym, eh?) allows you to send in your normal Solr  
XML, but with payloads encoded in the text.  For instance:


<field name="foo">the quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ dogs|NN</field>


The DPTFF will take the value before the delimiter as the Token and  
the value after the delimiter as the payload.  This then allows Solr  
to add Payloads without modifying a single thing in Solr, at least on  
the indexing side.
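
On the schema side, a sketch of how the factory might be wired into an
analyzer chain (the delimiter and encoder attribute values here are
assumptions based on the factory as committed for 1.4):

<fieldType name="text_payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="identity"/>
  </analyzer>
</fieldType>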




so I was thinking of generating an XML where for each token a single string
is generated, like was#verb#be,
and then there is a token filter that splits each whitespace-separated
string by #, in this case into three words, and adds the trailing character
that allows searching for the right semantic info. But it gives them the same
increment. Of course the full processing chain must be aware of this.
But I must think about multiword tokens.
increment. Of course the full processing chain must be aware of this.
But I must think on multiwords tokens



We could likely make a generic TokenFilter that can capture both
multiple tokens and payloads at the same time, simply by allowing
it to have two attributes:

1. token delimiter (#)
2. payload delimiter (|)

Then, you could do something like:
was#be|verb
or
was#be|0.3

where was and be are both tokens at the same position and verb  
or 0.3 are payloads on those tokens.  This is a nearly trivial  
variation of the DelimitedPayloadTokenFilter
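(To make the idea concrete, a rough, untested sketch against the Lucene 2.9-era attribute API; the class name is made up, and offset/type handling is left out, so treat it as an outline rather than a finished implementation.)

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class DelimitedMultiTokenFilter extends TokenFilter {
  private final String tokenDelim;   // e.g. "#"
  private final char payloadDelim;   // e.g. '|'
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private final Queue<String> pending = new LinkedList<String>();

  public DelimitedMultiTokenFilter(TokenStream input, char tokenDelim, char payloadDelim) {
    super(input);
    this.tokenDelim = String.valueOf(tokenDelim);
    this.payloadDelim = payloadDelim;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // Emit a stacked token at the same position as the previous one.
      emit(pending.poll());
      posIncrAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // Split "was#be|verb" into ["was", "be|verb"]; the first part is emitted now.
    String[] parts = termAtt.term().split(Pattern.quote(tokenDelim));
    for (int i = 1; i < parts.length; i++) {
      pending.add(parts[i]);
    }
    emit(parts[0]);
    return true;
  }

  // Set the term, and attach anything after the payload delimiter as a payload.
  private void emit(String raw) {
    int idx = raw.indexOf(payloadDelim);
    if (idx >= 0) {
      termAtt.setTermBuffer(raw.substring(0, idx));
      payloadAtt.setPayload(new Payload(raw.substring(idx + 1).getBytes()));
    } else {
      termAtt.setTermBuffer(raw);
      payloadAtt.setPayload(null);
    }
  }
}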









Grant Ingersoll-6 wrote:



On Jul 20, 2009, at 6:43 AM, JCodina wrote:


D: Break things down. The CAS would only produce XML that solr can
process.
Then different Tokenizers can be used to deal with the data in the
CAS. the
main point is that the XML has  the doc and field labels of solr.


I just committed the DelimitedPayloadTokenFilterFactory, I suspect
this is along the lines of what you are thinking, but I haven't done
all that much with UIMA.

I also suspect the Tee/Sink capabilities of Lucene could be helpful,
but they aren't available in Solr yet.





E: The set of capabilities to process the xml is defined in XML,
similar to
lucas to define the ouput and in the solr schema to define how  
this is

processed.


I want to use it in order to index something that is common but I
can't get
any tool to do that with sol: indexing a word and coding at the same
position the syntactic and semantic information. I know that in
Lucene this
is evolving and it will be possible to include metadata but for the
moment


What does Lucas do with Lucene?  Is it putting multiple tokens at the
same position or using Payloads?

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search





--
View this message in context: 
http://www.nabble.com/Solr-and-UIMA-tp24567504p24590509.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



facet.prefix question

2009-07-23 Thread Licinio Fernández Maurelo
I'm trying to do some filtering on the count list retrieved by Solr when
doing a faceting query.

I'm wondering how I can use facet.prefix to get something like this:

Query

facet.field=foo&facet.prefix=A OR B

Response

<lst name="facet_fields">
  <lst name="foo">
    <int name="A">12560</int>
    <int name="A*">5440</int>
    <int name="B**">2357</int>
    .
    .
    .
  </lst>
</lst>



How can I achieve this behaviour?
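(As far as I can tell, facet.prefix takes a single literal prefix rather than a boolean expression, so a request can only carry one prefix per field. A SolrJ sketch of what is expressible today, with the field and prefixes taken from the example above, would be one request per prefix, merging the counts client-side.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetPrefixExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        for (String prefix : new String[] { "A", "B" }) {
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);
            query.addFacetField("foo");
            query.set("facet.prefix", prefix); // one literal prefix per request
            QueryResponse rsp = server.query(query);
            System.out.println(prefix + " -> " + rsp.getFacetField("foo").getValues());
        }
    }
}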

Best Regards

-- 
Lici


Re: how to get all the docIds in the search result?

2009-07-23 Thread Chris Hostetter

: Here id is indeed the uniqueKey of a document.
: I want to get all the ids  for some other  useage.

http://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss



Re: index backup works only if there are committed index

2009-07-23 Thread Otis Gospodnetic
Another option is making backups more directly, not using the Solr backup
mechanism.

Check the green link on http://www.manning.com/hatcher3/


Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: solr jay solr...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 12:56:23 PM
 Subject: index backup works only if there are committed index
 
 Hi,
 
 I noticed that the backup request
 
 http://master_host:port/solr/replication?command=backup
 
 works only if there are committed index data, i.e.
 core.getDeletionPolicy().getLatestCommit() is not null. Otherwise, no backup
 is created. It sounds logical because if nothing has been committed since
 your last backup, it doesn't help much to do a new backup. However, consider
 this scenario:
 
 1. a backup process is scheduled at 1:00AM every Monday
 2. just before 1:00AM, the system is shutdown (for whatever reason), and
 then restarts
 3. No index is committed before 1:00AM
 4. at 1:00AM, backup process starts and no committed index is found, and
 therefore no backup (until next week)
 
 The probability of this scenario is probably small, but it still could
 happen, and it seems to me that if I want to backup index, a backup should
 be created whether there are new committed index or not.
 
 Your thoughts?
 
 Thanks,
 
 -- 
 J



Re: Storing string field in solr.ExternalFieldFile type

2009-07-23 Thread Jibo John

Thanks for the quick response, Otis.

We have been able to achieve the ratio of 2 with different settings. However,
considering the huge volume of data that we need to deal with - 600 GB of
data per day, which we need to keep in the index for 3 days - we're looking
at all possible ways to reduce the index size further.
Will definitely keep exploring the straightforward things and see if  
we can find a better setting.



Thanks,
-Jibo

On Jul 23, 2009, at 9:49 AM, Otis Gospodnetic wrote:

I'm not sure if there is a lot of benefit from storing the literal  
values in that external file vs. directly in the index.  There are a  
number of things one should look at first, as far as performance is  
concerned - JVM settings, cache sizes, analysis, etc.


For example, I have one index here that is 9 times the size of the  
original data because of how its fields are analyzed.  I can change  
one analysis-level setting and make that ratio go down to 2.  So I'd  
look at other, more straight forward things first.  There is a Wiki  
page either on Solr or Lucene Wiki dedicated to various search  
performance tricks.


Otis
--
Sematext is hiring: http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 

From: Jibo John jiboj...@mac.com
To: solr-user@lucene.apache.org
Sent: Thursday, July 23, 2009 12:08:26 PM
Subject: Re: Storing string field in solr.ExternalFieldFile type

Thanks for the response, Eric.

We have seen that the size of the index has a direct impact on search speed,
especially when the index size is in GBs, so we are trying all possible ways
to keep the index size as low as we can.

We thought the solr.ExternalFileField type would help to keep the index size
low by storing a text field outside of the index.

Here's what we were planning: initially, all the fields except the
solr.ExternalFileField field will be queried and displayed to the end user.
There will be subsequent calls from the UI to pull the solr.ExternalFileField
field, which will be loaded in a lazy manner.

However, we realized that solr.ExternalFileField only supports the float type,
while the data that we're planning to keep as an external field is of string
type.


Thanks,
-Jibo



On Jul 22, 2009, at 1:46 PM, Erick Erickson wrote:


Hoping the experts chime in if I'm wrong, but
As far as I know, while storing a field increases the size of an  
index,

it doesn't have much impact on the search speed. Which you could
pretty easily test by creating the index both ways and firing off  
some
timing queries and comparing. Although it would be time  
consuming...


I believe there's some info on the Lucene Wiki about this, but my  
memory

isn't what it used to be.

Erick


On Tue, Jul 21, 2009 at 2:42 PM, Jibo John wrote:


We're in the process of building a log searcher application.

In order to reduce the index size to improve the query  
performance, we're

exploring the possibility of having:

1. One field for each log line with 'indexed=true  stored=false'  
that

will be used for searching
2. Another field for each log line of type solr.ExternalFileField  
that

will be used only for display purpose.

We realized that currently solr.ExternalFileField supports only  
float type.


Is there a way we can override this to support string type? Any  
issues with

this approach?

Any ideas are welcome.


Thanks,
-Jibo









Re: Storing string field in solr.ExternalFieldFile type

2009-07-23 Thread Otis Gospodnetic
Jibo,

Well, there is always field compression, which lets you trade the index 
size/disk space for extra CPU time and thus some increase in indexing and 
search latency.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Jibo John jiboj...@mac.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 1:43:45 PM
 Subject: Re: Storing string field in solr.ExternalFieldFile type
 
 Thanks for the quick response, Otis.
 
 We have been able to achieve the ratio of 2 with different settings, however, 
 considering the huge volume of the data that we need to deal with - 600 GB of 
 data per day, and, we need to keep it in the index for 3 days - we're looking 
 at 
 all possible ways to reduce the index size further.
 Will definitely keep exploring the straightforward things and see if we can 
 find 
 a better setting.
 
 
 Thanks,
 -Jibo
 
 On Jul 23, 2009, at 9:49 AM, Otis Gospodnetic wrote:
 
  I'm not sure if there is a lot of benefit from storing the literal values 
  in 
 that external file vs. directly in the index.  There are a number of things 
 one 
 should look at first, as far as performance is concerned - JVM settings, 
 cache 
 sizes, analysis, etc.
  
  For example, I have one index here that is 9 times the size of the original 
 data because of how its fields are analyzed.  I can change one analysis-level 
 setting and make that ratio go down to 2.  So I'd look at other, more 
 straight 
 forward things first.  There is a Wiki page either on Solr or Lucene Wiki 
 dedicated to various search performance tricks.
  
  Otis
  --
  Sematext is hiring: http://sematext.com/about/jobs.html?mls
  Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
  
  
  
  - Original Message 
  From: Jibo John 
  To: solr-user@lucene.apache.org
  Sent: Thursday, July 23, 2009 12:08:26 PM
  Subject: Re: Storing string field in solr.ExternalFieldFile type
  
  Thanks for the response, Eric.
  
  We have seen that size of the index has a direct impact on the search 
  speed,
  especially when the index size is in GBs, so trying all possible ways to 
  keep
  the index size as low as we can.
  
  We thought solr.ExternalFileField type would help to keep the index size 
  low 
 by
  storing a text field out side of the index.
  
  Here's what we were planning: initially, all the fields except the
  solr.ExternalFileField type field will be queried and will be displayed to 
 the
  end user. . There will be subsequent calls from the UI  to pull the
  solr.ExternalFileField field that will be loaded in a lazy manner.
  
  However, realized that solr.ExternalFileField only supports float type, 
 however,
  the data that we're planning to keep as an external field is a string type.
  
  Thanks,
  -Jibo
  
  
  
  On Jul 22, 2009, at 1:46 PM, Erick Erickson wrote:
  
  Hoping the experts chime in if I'm wrong, but
  As far as I know, while storing a field increases the size of an index,
  it doesn't have much impact on the search speed. Which you could
  pretty easily test by creating the index both ways and firing off some
  timing queries and comparing. Although it would be time consuming...
  
  I believe there's some info on the Lucene Wiki about this, but my memory
  isn't what it used to be.
  
  Erick
  
  
  On Tue, Jul 21, 2009 at 2:42 PM, Jibo John wrote:
  
  We're in the process of building a log searcher application.
  
  In order to reduce the index size to improve the query performance, we're
  exploring the possibility of having:
  
  1. One field for each log line with 'indexed=true  stored=false' that
  will be used for searching
  2. Another field for each log line of type solr.ExternalFileField that
  will be used only for display purpose.
  
  We realized that currently solr.ExternalFileField supports only float 
  type.
  
  Is there a way we can override this to support string type? Any issues 
  with
  this approach?
  
  Any ideas are welcome.
  
  
  Thanks,
  -Jibo
  
  
  
  



Re: Solr Cell

2009-07-23 Thread Matt Weber
Found my own answer: use the literal parameter.  Should have dug
around before asking.  Sorry.
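(For anyone searching the archives later, a rough SolrJ sketch of what this ends up looking like; the field names after "literal." and their values are just my own hypothetical metadata fields, and the handler path assumes the default /update/extract mapping.)

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class SolrCellLiteralExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("somefile.pdf"));

        // External metadata supplied alongside the binary via literal.* parameters.
        req.setParam("literal.id", "somefile.pdf");
        req.setParam("literal.author", "Jane Doe");
        req.setParam("literal.publisher", "Acme Publishing");

        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        server.request(req);
    }
}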


Thanks,

Matt Weber
eSr Technologies
http://www.esr-technologies.com




On Jul 23, 2009, at 2:26 PM, Matt Weber wrote:

Is it possible to supply additional metadata along with the binary
file when using Solr Cell?


For example, I have a pdf called somefile.pdf and I have some  
external metadata related to that file.  Such metadata might be  
things like author, publisher, source, date published, etc.   I want  
to post the binary data for somefile.pdf to Solr Cell AND map my  
metadata into other fields in the same document that has the  
extracted text from the pdf.


I know I could do this using Tika and SolrJ directly, but it would  
be much easier if Solr Cell can do it.


Thanks,

Matt Weber
eSr Technologies
http://www.esr-technologies.com








Re: LocalSolr - order of fields on xml response

2009-07-23 Thread Daniel Cassiano
Hi Ryan,

Thanks for the information.
Is this expected to be implemented?


Regards,
-- 
Daniel Cassiano
_

http://www.apontador.com.br/
http://www.maplink.com.br/

On Wed, Jul 22, 2009 at 10:08 PM, Ryan McKinley ryan...@gmail.com wrote:

 ya...  'expected', but perhaps not ideal.  As is, LocalSolr munges the
 document on its way out the door to add the distance.

 When LocalSolr makes it into the source, it will likely use a method like:
  https://issues.apache.org/jira/browse/SOLR-705
 to augment each document with the calculated distance.

 This will at least have consistent behavior.



 On Jul 22, 2009, at 10:47 AM, Daniel Cassiano wrote:

  Hi folks,

 When I do some query with LocalSolr to get the geo_distance, the order of
 xml fields is different of a standard query.
 It's a simple query, like this:

http://myhost.com:8088/solr/core/select?qt=geo&x=-46.01&y=-23.01&radius=15&sort=geo_distance asc&q=*:*

 Is this an expected behavior of LocalSolr?


 Thanks!

 --
 Daniel Cassiano
 _
 http://www.apontador.com.br/
 http://www.maplink.com.br/





JDBC Import not exposing nested entities

2009-07-23 Thread Tagge, Tim
Hi,
I'm attempting to setup a simple joined index of some tables with the following 
structure...

EMPLOYEE            ORGANIZATION
--------            ------------
employee_id         organization_id
first_name          organization_name
last_name
edr_party_id
organization_id

When running the import, I'm getting this WARNING...
Jul 23, 2009 2:17:41 PM org.apache.solr.handler.dataimport.SolrWriter upload
WARNING: Error creating document : SolrInputDocumnt[{id=id(1.0)={42078}, 
first_name=first_name(1.0)={Mike}, last_name=last_name(1.0)={Madlock}, 
edr_party_id=edr
_party_id(1.0)={29131}, organization_id=organization_id(1.0)={138}}]
org.apache.solr.common.SolrException: Document [42078] missing required field: 
org
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:289)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:58)
at 
org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:69)
at 
org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:288)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)

As a result of this issue, no documents are searchable.  If I flip the required 
flag to false in schema.xml, the WARNING goes away and the documents are 
searchable.  However, the documents do not contain organization_name and they 
are not searchable by organization_name.  Have I overlooked a flag somewhere 
that specifies that nested entities are indexed?  Or an issue in my config?  
I've attached my full data-config and the fields section of schema.xml.  Thanks 
in advance.
Tim


schema.xml
<fields>
  <field name="id" type="integer" indexed="true" stored="true" required="true" />
  <field name="first_name" type="string" indexed="true" stored="true" required="false" />
  <field name="last_name" type="string" indexed="true" stored="true" required="false" />
  <field name="edr_party_id" type="integer" indexed="true" stored="true" required="false" />
  <field name="org" type="string" indexed="true" stored="true" required="true" />
  <field name="organization_id" type="integer" indexed="true" stored="true" required="true" />
  <!-- <field name="city" type="string" indexed="true" stored="true" required="false" /> -->
</fields>

data-config.xml
<dataConfig>
    <dataSource
        driver="oracle.jdbc.driver.OracleDriver"
        url="jdbc:oracle:thin:@hsrdb3:1521:hsint13"
        user="user"
        password="password" />

    <document name="agentDoc">
        <entity name="agent" query="SELECT e.employee_id, e.first_name,
                e.last_name, e.edr_party_id, e.organization_id
                FROM employee e
                WHERE e.disabled = 'N'
                AND rownum &lt; 1000">
            <field column="EMPLOYEE_ID" name="id" />
            <field column="FIRST_NAME" name="first_name" />
            <field column="LAST_NAME" name="last_name" />
            <field column="EDR_PARTY_ID" name="edr_party_id" />
            <field column="ORGANIZATION_ID" name="organization_id" />

            <entity name="organization" query="select
                    o.organization_name from organizations o where o.organization_id =
                    '${agent.ORGANIZATION_ID}'">
                <field name="org" column="organization_name" />
            </entity>
        </entity>
    </document>
</dataConfig>




RE: Exception searching PhoneticFilterFactory field with number

2009-07-23 Thread Robert Petersen
Sure Otis, and in fact I can narrow it down to exactly that query,
but with user queries I don't think it is right for the phonetic filter
factory to throw an exception if the user enters a number.  What I am
saying is: am I going to have to filter numerics out of user queries
before searching the double-metaphone version of my titles?
That doesn't seem good.
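(Something like this purely illustrative helper, with made-up names, is the kind of client-side filtering I mean: drop bare numeric terms before building the clause against the double-metaphone field.)

public class PhoneticQueryHelper {
    // Strip terms that are only digits; they have no useful phonetic encoding.
    public static String stripNumericTerms(String userQuery) {
        StringBuilder out = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.matches("\\d+")) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(term);
        }
        return out.toString();
    }
}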

Jul 23, 2009 2:58:17 PM org.apache.solr.core.SolrCore execute
INFO: [10017] webapp=/solr path=/select/
params={debugQuery=true&rows=10&start=0&q=allDoublemetaphone:2343)^0.5)))} hits=6873 status=500 QTime=3 
Jul 23, 2009 2:58:17 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.lang.IllegalArgumentException:
name and value cannot both be empty
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:470)
at
org.apache.solr.util.SolrPluginUtils.doStandardDebug(SolrPluginUtils.jav
a:399)
at
org.apache.solr.handler.component.DebugComponent.process(DebugComponent.
java:54)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(Search
Handler.java:177)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
ase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1205)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va:303)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:232)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
tionFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
erChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValv
e.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValv
e.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java
:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java
:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:2
86)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:84
5)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(
Http11Protocol.java:583)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.IllegalArgumentException: name and value cannot
both be empty
at org.apache.lucene.document.Field.init(Field.java:277)
at org.apache.lucene.document.Field.init(Field.java:251)
at
org.apache.solr.search.QueryParsing.writeFieldVal(QueryParsing.java:307)
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:320)
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:467)
... 19 more


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Monday, July 20, 2009 6:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Exception searching PhoneticFilterFactory field with number


Robert,

Can you narrow things down by simplifying the query?  For example, I see
allDoublemetaphone:2226, which looks suspicious in the give me
phonetic version of the input context, but if you could narrow it down,
we could probably be able to help more.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Monday, July 20, 2009 12:11:38 PM
 Subject: Exception searching PhoneticFilterFactory field with number
 
 Reposting in hopes of an answer...
 
 
 
 Hello all, 
 
 
 
 I am getting the following exception whenever a user includes a
numeric
 term in their search, and the search includes a field defined with a
 PhoneticFilterFactory and further it occurs whether I use the
 DoubleMetaphone encoder or any other.  Has this ever come up before?
I
 can replicate this with no data in the index at all, but if I search
the
 field by hand from the solr web interface there is no exception.  I am
 running the lucid imagination 1.3 certified release in a multicore
 master/slaves configuration.  I will include the field def and the
 search/exception below and let me know if I can include any more
 clues... seems like it's trying to make a field with no name/value:  
 
 
 
 
 positionIncrementGap=100
 
 
 
 
 class=solr.WhitespaceTokenizerFactory/
 
 
 synonyms=index_synonyms.txt ignoreCase=true expand=false/
 
 
 ignoreCase=true words=stopwords.txt/
 
 
 
 
 
 
 protected=protwords.txt/
 
 
 

RE: Exception searching PhoneticFilterFactory field with number

2009-07-23 Thread Robert Petersen
Hey, I just noticed that this only happens when I enable debug.  If
debugQuery=true is on the URL then the request goes through the debug
component, and that is what throws this exception.  It must be getting an
empty field object from the phonetic filter factory for numbers, or
something similar.

-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Thursday, July 23, 2009 4:12 PM
To: solr-user@lucene.apache.org
Subject: RE: Exception searching PhoneticFilterFactory field with number

Actually my first question should be, Is this a known bug or am I doing
something wrong?

The only one thing I can find on this topic is the following statement
on the solr-dev group when discussing adding the maxCodeLength, see
point two below:

Ryan McKinley updated SOLR-813:

---

Attachment: SOLR-813.patch

Here is an update that adresses two concerns: 
1. position increments -- this keeps the tokens in sync with the input 
2. previous version would stop processing after a number. That is: aaa
1234
bbb would not process bbb 3. Token types... this changes it to
DoubleMetaphone rather then ALPHANUM



-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Thursday, July 23, 2009 3:24 PM
To: solr-user@lucene.apache.org
Subject: RE: Exception searching PhoneticFilterFactory field with number

Sure Otis, and in fact I can narrow it down to just exactly that query,
but with user queries I don't think it is right to throw an exception
out of phonetic filter factory if the user enters a number.  What I am
saying is am I going to have to filter the user queries for numerics
before using it to search in my double metaphone version of my titles?
That doesn't seem good.

Jul 23, 2009 2:58:17 PM org.apache.solr.core.SolrCore execute
INFO: [10017] webapp=/solr path=/select/
params={debugQuery=true&rows=10&start=0&q=allDoublemetaphone:2343)^0.5)))} hits=6873 status=500 QTime=3 
Jul 23, 2009 2:58:17 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.lang.IllegalArgumentException:
name and value cannot both be empty
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:470)
at
org.apache.solr.util.SolrPluginUtils.doStandardDebug(SolrPluginUtils.jav
a:399)
at
org.apache.solr.handler.component.DebugComponent.process(DebugComponent.
java:54)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(Search
Handler.java:177)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
ase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1205)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va:303)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:232)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
tionFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
erChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValv
e.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValv
e.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java
:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java
:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:2
86)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:84
5)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(
Http11Protocol.java:583)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.IllegalArgumentException: name and value cannot
both be empty
at org.apache.lucene.document.Field.init(Field.java:277)
at org.apache.lucene.document.Field.init(Field.java:251)
at
org.apache.solr.search.QueryParsing.writeFieldVal(QueryParsing.java:307)
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:320)
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:467)
... 19 more


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Monday, July 20, 2009 6:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Exception searching PhoneticFilterFactory field with number


Robert,

Can you narrow things down by simplifying the query?  For example, I see
allDoublemetaphone:2226, which looks suspicious in the give me
phonetic version of the input context, but if you could narrow it down,
we could probably be able to help more.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 

RE: Exception searching PhoneticFilterFactory field with number

2009-07-23 Thread Robert Petersen
Actually, my first question should be: is this a known bug, or am I doing
something wrong?

The only thing I can find on this topic is the following statement from the
solr-dev list, made while discussing the addition of maxCodeLength; see
point two below:

Ryan McKinley updated SOLR-813:

---

Attachment: SOLR-813.patch

Here is an update that adresses two concerns: 
1. position increments -- this keeps the tokens in sync with the input 
2. previous version would stop processing after a number. That is: aaa
1234
bbb would not process bbb 3. Token types... this changes it to
DoubleMetaphone rather then ALPHANUM



-Original Message-
From: Robert Petersen [mailto:rober...@buy.com] 
Sent: Thursday, July 23, 2009 3:24 PM
To: solr-user@lucene.apache.org
Subject: RE: Exception searching PhoneticFilterFactory field with number

Sure Otis, and in fact I can narrow it down to just exactly that query,
but with user queries I don't think it is right to throw an exception
out of phonetic filter factory if the user enters a number.  What I am
saying is am I going to have to filter the user queries for numerics
before using it to search in my double metaphone version of my titles?
That doesn't seem good.

Jul 23, 2009 2:58:17 PM org.apache.solr.core.SolrCore execute
INFO: [10017] webapp=/solr path=/select/
params={debugQuery=true&rows=10&start=0&q=allDoublemetaphone:2343)^0.5)))} hits=6873 status=500 QTime=3 
Jul 23, 2009 2:58:17 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.lang.IllegalArgumentException:
name and value cannot both be empty
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:470)
at
org.apache.solr.util.SolrPluginUtils.doStandardDebug(SolrPluginUtils.jav
a:399)
at
org.apache.solr.handler.component.DebugComponent.process(DebugComponent.
java:54)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(Search
Handler.java:177)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
ase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1205)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
va:303)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
ava:232)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Applica
tionFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilt
erChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValv
e.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValv
e.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java
:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java
:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.
java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:2
86)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:84
5)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(
Http11Protocol.java:583)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.IllegalArgumentException: name and value cannot
both be empty
at org.apache.lucene.document.Field.init(Field.java:277)
at org.apache.lucene.document.Field.init(Field.java:251)
at
org.apache.solr.search.QueryParsing.writeFieldVal(QueryParsing.java:307)
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:320)
at
org.apache.solr.search.QueryParsing.toString(QueryParsing.java:467)
... 19 more


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Monday, July 20, 2009 6:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Exception searching PhoneticFilterFactory field with number


Robert,

Can you narrow things down by simplifying the query?  For example, I see
allDoublemetaphone:2226, which looks suspicious in the give me
phonetic version of the input context, but if you could narrow it down,
we could probably be able to help more.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Robert Petersen rober...@buy.com
 To: solr-user@lucene.apache.org
 Sent: Monday, July 20, 2009 12:11:38 PM
 Subject: Exception searching PhoneticFilterFactory field with number
 
 Reposting in hopes of an answer...
 
 
 
 Hello all, 
 
 
 
 I am getting the following exception whenever a user includes a
numeric
 term in their search, and the search includes a field defined with a
 PhoneticFilterFactory and further it occurs whether I use the
 DoubleMetaphone encoder or any other.  Has this 

server won't start using configs from Drupal

2009-07-23 Thread david
I've downloaded solr-2009-07-21.tgz and followed the instructions at http://drupal.org/node/343467 
including retrieving the solrconfig.xml and schema.xml files from the Drupal apachesolr module.


The server seems to start properly with the original solrconfig.xml and
schema.xml files.

When I try to start up the server with the Drupal supplied files, I get errors on the command line, 
and a 500 error from the server.


solrconfig.xml  http://pastebin.com/m23d14a2
schema.xml  http://pastebin.com/m2e79f304
output of http://localhost:8983/solr/admin/:  http://pastebin.com/m410fa74d


The following looks to me like the important bits, but I'm not a Java coder, so I
could easily be wrong.

command line extract:

22/07/2009 5:58:54 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
(plus lots of WARN messages)

extract from browser at http://localhost:8983/solr/admin/

org.apache.solr.common.SolrException: Unknown fieldtype 'text' specified on 
field title
(snip lots of stuff)
org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list
(snip lots of stuff)
org.apache.solr.common.SolrException: Error loading class 
'solr.CharStreamAwareWhitespaceTokenizerFactory'

(snip lots of stuff)
Caused by: java.lang.ClassNotFoundException: 
solr.CharStreamAwareWhitespaceTokenizerFactory

Nothing in apache logs...

solr logs contain this:
127.0.0.1 - - [22/07/2009:08:01:10 +] GET /solr/admin/ HTTP/1.1 500 10292

Any help greatly appreciated.

David.


Re: server won't start using configs from Drupal

2009-07-23 Thread Otis Gospodnetic
I think the problem is CharStreamAwareWhitespaceTokenizerFactory, which used to 
live in Solr (when Drupal schema.xml for Solr was made), but has since moved to 
Lucene.  I'm half guessing. :)

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: david da...@kenpro.com.au
 To: solr-user@lucene.apache.org
 Sent: Thursday, July 23, 2009 9:59:53 PM
 Subject: server won't start using configs from Drupal 
 
 I've downloaded solr-2009-07-21.tgz and followed the instructions at 
 http://drupal.org/node/343467 including retrieving the solrconfig.xml and 
 schema.xml files from the Drupal apachesolr module.
 
 The server seems to start properly with the original solrconfig.xml and 
 schema.xml files
 
 When I try to start up the server with the Drupal supplied files, I get 
 errors 
 on the command line, and a 500 error from the server.
 
 solrconfig.xml http://pastebin.com/m23d14a2
 schema.xml http://pastebin.com/m2e79f304
 output of http://localhost:8983/solr/admin/:  
 http://pastebin.com/m410fa74d
 
 
 Following looks to me like the important bits, but I'm not a java coder, so I 
 could easily be wrong.
 
 command line extract:
 
 22/07/2009 5:58:54 PM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: analyzer without class or 
 tokenizer  filter list
 (plus lots of WARN messages)
 
 extract from browser at http://localhost:8983/solr/admin/
 
 org.apache.solr.common.SolrException: Unknown fieldtype 'text' specified on 
 field title
 (snip lots of stuff)
 org.apache.solr.common.SolrException: analyzer without class or tokenizer  
 filter list
 (snip lots of stuff)
 org.apache.solr.common.SolrException: Error loading class 
 'solr.CharStreamAwareWhitespaceTokenizerFactory'
 (snip lots of stuff)
 Caused by: java.lang.ClassNotFoundException: 
 solr.CharStreamAwareWhitespaceTokenizerFactory
 
 Nothing in apache logs...
 
 solr logs contain this:
 127.0.0.1 - - [22/07/2009:08:01:10 +] GET /solr/admin/ HTTP/1.1 500 
 10292
 
 Any help greatly appreciated.
 
 David.