Re: Chinese chars are not indexed ?

2010-06-28 Thread Ahmet Arslan
 I am using the sample, not deploying Solr in Tomcat. Is
 there a place I can modify this setting ?


Ha, okey if you are using jetty with java -jar start.jar then it is okey.
But for Chinese you need special tokenizer since Chinese is written without 
spaces between words.

tokenizer class=solr.CJKTokenizerFactory/


Or you can search with both leading and trailing star. q=*ChineseText* should 
return something.



  


Re: Chinese chars are not indexed ?

2010-06-28 Thread go canal
oh yes, *...* works. thanks.

I saw tokenizer is defined in schema.xml. There are a few places that define 
the tokenizer. Wondering if it is enough to define one for:

fieldType name=text class=solr.TextField positionIncrementGap=100
  analyzer type=index
   !--    this is the only one I need to modify ? - --
tokenizer class=solr.WhitespaceTokenizerFactory/
   !-- - --
filter class=solr.StopFilterFactory
ignoreCase=true
words=stopwords.txt
enablePositionIncrements=true
/
filter class=solr.WordDelimiterFilterFactory generateWordParts=1 
generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 
splitOnCaseChange=1/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.SnowballPorterFilterFactory language=English 
protected=protwords.txt/
  /analyzer
  analyzer type=query
tokenizer class=solr.WhitespaceTokenizerFactory/
filter class=solr.SynonymFilterFactory synonyms=synonyms.txt 
ignoreCase=true expand=true/
filter class=solr.StopFilterFactory
ignoreCase=true
words=stopwords.txt
enablePositionIncrements=true
/
filter class=solr.WordDelimiterFilterFactory generateWordParts=1 
generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 
splitOnCaseChange=1/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.SnowballPorterFilterFactory language=English 
protected=protwords.txt/
  /analyzer/fieldType

 thanks,
canal





From: Ahmet Arslan iori...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Mon, June 28, 2010 2:54:16 PM
Subject: Re: Chinese chars are not indexed ?

 I am using the sample, not deploying Solr in Tomcat. Is
 there a place I can modify this setting ?


Ha, okey if you are using jetty with java -jar start.jar then it is okey.
But for Chinese you need special tokenizer since Chinese is written without 
spaces between words.

tokenizer class=solr.CJKTokenizerFactory/


Or you can search with both leading and trailing star. q=*ChineseText* should 
return something.


  

Re: Chinese chars are not indexed ?

2010-06-28 Thread Ahmet Arslan
 oh yes, *...* works. thanks.
 
 I saw tokenizer is defined in schema.xml. There are a few
 places that define the tokenizer. Wondering if it is enough
 to define one for:

It is better to define a brand new field type specific to Chinese. 

http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_KoreanSomething
 like:

at index time:
tokenizer class=solr.CJKTokenizerFactory/
filter class=solr.LowerCaseFilterFactory/

at query time:
tokenizer class=solr.CJKTokenizerFactory/
filter class=solr.LowerCaseFilterFactory/
filter class=solr.PositionFilterFactory /



  


one to many denormalization approach

2010-06-28 Thread Michael Delaney
Hi,

I have an architectural question about using apache solr/lucene.

I'm building a solr index for searching a CV database. Basically every CV on
there will have some fields like:

rate of pay, address, title

these fields are straight forward. The area I need advise on is, skills and
job history. For skills, someone might add an entry like: Ruby - 5 Years,
Java - 9 Years

CV:

John Smith
27
Skills:
Java, 5 Years
Sql, 4 Years
Lucene, 1 Year
Jobs:
1998-2004 Acme Search Ltd, Senior Java Developer, New York City, US
2004-2009 Software Labs Ltd, Technical Architect, San Francisco, CA,
US

So there's essentially N number of skills, each with a string name and a int
no of years. I was thinking I could use a dynamic field, *_skill, and
possibly add them like so:

1_skill: Ruby, 2_skill: Java

But how can I index the years experience? would I then add a dynamic field
like:

1_skill_years: 5, 2_skill_years: 9


How would i fit these into the index?
Any help greatly appreciated?

Regards


Question about the mailinglist (junk on my behalf)

2010-06-28 Thread MitchK

Hello community,

since a few days I recieve daily some mails with suspicious content. It is
said that some of my mails were rejected, because of the file-types of the
mail's attachements and other things.
This wonders me a lot, because I didn't send any mails with attachements and
even the eMail-adresses which want to make me aware of my rejected mails are
unknown to me.

This is the first mailinglist I have joined and I know that there are a lot
of bots out there, crawling for eMail-adresses to send junk. However, I
can't recognize any suspicious behaviour except those mails.

The number of mails that make me aware of the mentioned thing is 10 in a few
days, maybe 15 but not more. And I do not get more junk than I normally get. 

Does anyone recieves suspicious eMails on my behalf?

Thank you.
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-the-mailinglist-junk-on-my-behalf-tp927461p927461.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: is there a delete all command in updateHandler?

2010-06-28 Thread Daniel Alheiros
Hi Li,

Yes, you can issue a delete all by:
curl http://your_solr_server:your_solr_port/solr/update -H
Content-Type: text/xml --data-binary
'deletequery*:*/query/delete';

Hope it helps.

Cheers,
Daniel 

-Original Message-
From: Li Li [mailto:fancye...@gmail.com] 
Sent: 28 June 2010 03:41
To: solr-user@lucene.apache.org
Subject: is there a delete all command in updateHandler?

I want to delete all index and rebuild index frequently. I can't delete
the index files directly because I want to use replication

http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal 
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on 
it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.



preside != president

2010-06-28 Thread Darren Govoni
Hi,
  It seems to me that because the stemming does not produce
grammatically correct stems in many of the cases,
search anomalies can occur like the one I am seeing where I have a
document with president in it and it is returned
when I search for preside, a different word entirely.

Is this correct or acceptable behavior? Previous discussions here on
stemming, I was told its ok as long as all the words reduce
to the same stem, but when different words reduce to the same stem it
seems to affect search results in a bad way.

Darren


Search limit to the first 50 000 chars for one field

2010-06-28 Thread judauphant

Hi,

I use solr 1.4 for search contents in documents (pdf, doc, odt ...). I use
the module /update/extract.
When I am researching, I am limited to the first 5 characters
(approximately).
Any word or sentence after is not found (but the field has more than 5
characters when I recovered it by a search).
I searched if anyone had the same problem or if there was a setting to
resolved this but I found nothing.

How I can increase this limit ?

Line of my schema.xml for the field in which I search :
field name=text type=text indexed=true stored=true
multiValued=true termPositions=true termOffsets=true compressed=true
/
I store the content to use the module Highlighting.

And here are my search options (but without options, I have the same
problem):
/select?q=mySearchstart=0rows=1250fl=idhl=onhl.fl=textomitHeader=truehl.mergeContiguous=truehl.snippets=5hl.simple.pre=[PRE_FIND_START]hl.simple.post=[PRE_FIND_END]wt=phps


Thank you in advance for your reply.

Best regards,
Julien
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-limit-to-the-first-50-000-chars-for-one-field-tp927635p927635.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search limit to the first 50 000 chars for one field

2010-06-28 Thread Ahmet Arslan
 I use solr 1.4 for search contents in documents (pdf, doc,
 odt ...). I use
 the module /update/extract.
 When I am researching, I am limited to the first 5
 characters
 (approximately).
 Any word or sentence after is not found (but the field has
 more than 5
 characters when I recovered it by a search).
 I searched if anyone had the same problem or if there was a
 setting to
 resolved this but I found nothing.
 
 How I can increase this limit ?


maxFieldLength configuration can be done in solrconfig.xml
maxFieldLength2147483647/maxFieldLength


  


Re: Data Import Handler Rich Format Documents

2010-06-28 Thread Alexey Serba
 Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm using
 Solr Version: 1.4.0 and getting the following error:

 java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
 org.apache.solr.handler.dataimport.BinURLDataSource
It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1
release. You should use trunk / nightly builds.
https://issues.apache.org/jira/browse/SOLR-1583

 My data-config.xml looks like this:

 dataConfig
  dataSource type=JdbcDataSource
    driver=oracle.jdbc.driver.OracleDriver
    url=jdbc:oracle:thin:@whatever:12345:whatever
    user=me
    name=ds-db
    password=secret/

  dataSource type=BinURLDataSource
    name=ds-url/

  document
    entity name=my_database
     dataSource=ds-db
     query=select * from my_database where rownum lt;=2
      field column=CONTENT_ID                name=content_id/
      field column=CMS_TITLE                 name=cms_title/
      field column=FORM_TITLE                name=form_title/
      field column=FILE_SIZE                 name=file_size/
      field column=KEYWORDS                  name=keywords/
      field column=DESCRIPTION               name=description/
      field column=CONTENT_URL               name=content_url/
    /entity

    entity name=my_database_url
     dataSource=ds-url
     query=select CONTENT_URL from my_database where
 content_id='${my_database.CONTENT_ID}'
     entity processor=TikaEntityProcessor
      dataSource=ds-url
      format=text
      url=http://www.mysite.com/${my_database.content_url};
      field column=text/
     /entity
    /entity

  /document
 /dataConfig

 I added the entity name=my_database_url section to an existing (working)
 database entity to be able to have Tika index the content pointed to by the
 content_url.

 Is there anything obviously wrong with what I've tried so far?

I think you should move Tika entity into my_database entity and
simplify the whole configuration

entity name=my_database dataSource=ds-db query=select * from
my_database where rownum lt;=2
...
field column=CONTENT_URL   name=content_url/

entity processor=TikaEntityProcessor dataSource=ds-url
format=text url=http://www.mysite.com/${my_database.content_url};
field column=text/
/entity
/entity


Strange query behavior

2010-06-28 Thread Marc Ghorayeb

Hello,
I have a title that says 3DVIA Studio amp; Virtools Maya and 3dsMax 
Exporters. The analysis tool for this field gives me these 
tokens:3dviadviastudio;virtoolmaya3dsmaxdssystèmmaxexport


However, when i search for 3dsmax, i get no results :( Furthermore, if i 
search for dsmax i get the spellchecker that suggests me 3dsmax even though 
it doesn't find any results. If i search for any other token (3dvia, or max 
for example), the document is found. 3dsmax is the only token that doesn't 
seem to work!! :(
Here is my schema for this field:fieldType name=text class=solr.TextField 
positionIncrementGap=100
analyzer type=index
tokenizer class=solr.WhitespaceTokenizerFactory/

filter class=solr.WordDelimiterFilterFactory
generateWordParts=1
generateNumberParts=1
catenateWords=0
catenateNumbers=0
catenateAll=0
splitOnCaseChange=1
preserveOriginal=1
/

filter class=solr.TrimFilterFactory updateOffsets=true/
filter class=solr.LengthFilterFactory min=2 max=15/ 
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords.txt enablePositionIncrements=true /   filter 
class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true 
expand=true/

filter class=solr.LowerCaseFilterFactory/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
filter class=solr.SnowballPorterFilterFactory 
language=${Language} protected=protwords.txt/
/analyzer

analyzer type=query
tokenizer class=solr.WhitespaceTokenizerFactory /

filter class=solr.WordDelimiterFilterFactory
generateWordParts=1
generateNumberParts=1
catenateWords=1
catenateNumbers=1
catenateAll=0
splitOnCaseChange=1
preserveOriginal=1
/

filter class=solr.TrimFilterFactory updateOffsets=true/
filter class=solr.LengthFilterFactory min=2 max=15/
filter class=solr.StopFilterFactory ignoreCase=true 
words=stopwords.txt enablePositionIncrements=true /
filter class=solr.LowerCaseFilterFactory /
filter class=solr.RemoveDuplicatesTokenFilterFactory /
filter class=solr.SnowballPorterFilterFactory 
language=${Language} protected=protwords.txt /
/analyzer
/fieldType
Can anyone help me out please? :(
PS: the ${Language} is set to en (for english) in this case...
  
_
La boîte mail NOW Génération vous permet de réunir toutes vos boîtes mail dans 
Hotmail !
http://www.windowslive.fr/hotmail/nowgeneration/

Re: Search limit to the first 50 000 chars for one field

2010-06-28 Thread judauphant

Ok thanks, it works.

Best regards,
Julien
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-limit-to-the-first-50-000-chars-for-one-field-tp927635p927725.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: preside != president

2010-06-28 Thread Brendan Grainger
Hi Darren,

You might want to look at the KStemmer 
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem) instead of 
the standard PorterStemmer. It essentially has a 'dictionary' of exception 
words where stemming stops if found, so in your case president won't be stemmed 
any further than president (but presidents will be stemmed to president). You 
will have to integrate it into solr yourself, but that's straightforward. 

HTH
Brendan

 
On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:

 Hi,
  It seems to me that because the stemming does not produce
 grammatically correct stems in many of the cases,
 search anomalies can occur like the one I am seeing where I have a
 document with president in it and it is returned
 when I search for preside, a different word entirely.
 
 Is this correct or acceptable behavior? Previous discussions here on
 stemming, I was told its ok as long as all the words reduce
 to the same stem, but when different words reduce to the same stem it
 seems to affect search results in a bad way.
 
 Darren



Re: preside != president

2010-06-28 Thread darren
Thanks for the tip. Yeah, I think the stemming confounds search results as
it stands (porter stemmer).

I was also thinking of using my dictionary of 500,000 words with their
complete morphologies and conjugations and create a synonyms.txt to
provide english accurate morphology.

Is this a good idea?

Darren

 Hi Darren,

 You might want to look at the KStemmer
 (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
 instead of the standard PorterStemmer. It essentially has a 'dictionary'
 of exception words where stemming stops if found, so in your case
 president won't be stemmed any further than president (but presidents will
 be stemmed to president). You will have to integrate it into solr
 yourself, but that's straightforward.

 HTH
 Brendan


 On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:

 Hi,
  It seems to me that because the stemming does not produce
 grammatically correct stems in many of the cases,
 search anomalies can occur like the one I am seeing where I have a
 document with president in it and it is returned
 when I search for preside, a different word entirely.

 Is this correct or acceptable behavior? Previous discussions here on
 stemming, I was told its ok as long as all the words reduce
 to the same stem, but when different words reduce to the same stem it
 seems to affect search results in a bad way.

 Darren





DataImportHandler $deleteDocById question

2010-06-28 Thread André Maldonado
Hi all.

I'm trying to get $deleteDocById working, but any document is being deleted
from my index.

I'm using Full-Import (withOUT cleaning) and a script with:

row.put('$deleteDocById', row.get('codAnuncio'));

The script is passing in this line for every document it processes (for
testing purposes). The schema has:

uniqueKeycodanuncio/uniqueKey

What can be wrong?

Thank's

Então aproximaram-se os que estavam no barco, e adoraram-no, dizendo: És
verdadeiramente o Filho de Deus. (Mateus 14:33)


custom core admin handler

2010-06-28 Thread Dave Hall
Hi all,

I have been using Solr for quite a while, but I never really got into
looking at the code.  Last week that all changed, I decided to write a
custom core admin handler.  I've posted something on my blog about it,
along with a Drupal centric howto.  I'd be interested to know what
people think of it.  The post is at
http://davehall.com.au/blog/dave/2010/06/26/multi-core-apache-solr-ubuntu-1004-drupal-auto-provisioning

It's been a while since I hacked on Java, so I am sure there are bits
that can be improved.  Feel free to email me on or off list, or post a
comment on my blog.

If there is interest in including this in Solr, I would be willing to
relicense it.

Cheers

Dave



Re: Chinese chars are not indexed ?

2010-06-28 Thread Andy
What if Chinese is mixed with English?

I have text that is entered by users and it could be a mix of Chinese, English, 
etc.

What's the best way to handle that?

Thanks.

--- On Mon, 6/28/10, Ahmet Arslan iori...@yahoo.com wrote:

 From: Ahmet Arslan iori...@yahoo.com
 Subject: Re: Chinese chars are not indexed ?
 To: solr-user@lucene.apache.org
 Date: Monday, June 28, 2010, 3:44 AM
  oh yes, *...* works. thanks.
  
  I saw tokenizer is defined in schema.xml. There are a
 few
  places that define the tokenizer. Wondering if it is
 enough
  to define one for:
 
 It is better to define a brand new field type specific to
 Chinese. 
 
 http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_KoreanSomething
 like:
 
 at index time:
 tokenizer class=solr.CJKTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory/
 
 at query time:
 tokenizer class=solr.CJKTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.PositionFilterFactory /
 
 
 
       
 





Re: preside != president

2010-06-28 Thread Joe Calderon
the general consensus among people who run into the problem you have
is to use a plurals only stemmer, a synonyms file or a combination of
both (for irregular nouns etc)

if you search the archives you can find info on a plurals stemmer

On Mon, Jun 28, 2010 at 6:49 AM,  dar...@ontrenet.com wrote:
 Thanks for the tip. Yeah, I think the stemming confounds search results as
 it stands (porter stemmer).

 I was also thinking of using my dictionary of 500,000 words with their
 complete morphologies and conjugations and create a synonyms.txt to
 provide english accurate morphology.

 Is this a good idea?

 Darren

 Hi Darren,

 You might want to look at the KStemmer
 (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
 instead of the standard PorterStemmer. It essentially has a 'dictionary'
 of exception words where stemming stops if found, so in your case
 president won't be stemmed any further than president (but presidents will
 be stemmed to president). You will have to integrate it into solr
 yourself, but that's straightforward.

 HTH
 Brendan


 On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:

 Hi,
  It seems to me that because the stemming does not produce
 grammatically correct stems in many of the cases,
 search anomalies can occur like the one I am seeing where I have a
 document with president in it and it is returned
 when I search for preside, a different word entirely.

 Is this correct or acceptable behavior? Previous discussions here on
 stemming, I was told its ok as long as all the words reduce
 to the same stem, but when different words reduce to the same stem it
 seems to affect search results in a bad way.

 Darren






Re: Strange query behavior

2010-06-28 Thread Joe Calderon
splitOnCaseChange is creating multiple tokens from 3dsMax disable it
or enable catenateAll, use the analysys page in the admin tool to see
exactly how your text will be indexed by analyzers without having to
reindex your documents, once you have it right you can do a full
reindex.

On Mon, Jun 28, 2010 at 5:48 AM, Marc Ghorayeb dekay...@hotmail.com wrote:

 Hello,
 I have a title that says 3DVIA Studio  Virtools Maya and 3dsMax Exporters. 
 The analysis tool for this field gives me these 
 tokens:3dviadviastudio;virtoolmaya3dsmaxdssystèmmaxexport


 However, when i search for 3dsmax, i get no results :( Furthermore, if i 
 search for dsmax i get the spellchecker that suggests me 3dsmax even 
 though it doesn't find any results. If i search for any other token (3dvia, 
 or max for example), the document is found. 3dsmax is the only token that 
 doesn't seem to work!! :(
 Here is my schema for this field:fieldType name=text 
 class=solr.TextField positionIncrementGap=100
        analyzer type=index
                tokenizer class=solr.WhitespaceTokenizerFactory/

                filter class=solr.WordDelimiterFilterFactory
                        generateWordParts=1
                        generateNumberParts=1
                        catenateWords=0
                        catenateNumbers=0
                        catenateAll=0
                        splitOnCaseChange=1
                        preserveOriginal=1
                /

                filter class=solr.TrimFilterFactory updateOffsets=true/
                filter class=solr.LengthFilterFactory min=2 max=15/    
          filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords.txt enablePositionIncrements=true /               
 filter class=solr.SynonymFilterFactory synonyms=synonyms.txt 
 ignoreCase=true expand=true/

                filter class=solr.LowerCaseFilterFactory/
                filter class=solr.RemoveDuplicatesTokenFilterFactory/
                filter class=solr.SnowballPorterFilterFactory 
 language=${Language} protected=protwords.txt/
        /analyzer

        analyzer type=query
                tokenizer class=solr.WhitespaceTokenizerFactory /

                filter class=solr.WordDelimiterFilterFactory
                        generateWordParts=1
                        generateNumberParts=1
                        catenateWords=1
                        catenateNumbers=1
                        catenateAll=0
                        splitOnCaseChange=1
                        preserveOriginal=1
                /

                filter class=solr.TrimFilterFactory updateOffsets=true/
                filter class=solr.LengthFilterFactory min=2 max=15/
                filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords.txt enablePositionIncrements=true /
                filter class=solr.LowerCaseFilterFactory /
                filter class=solr.RemoveDuplicatesTokenFilterFactory /
                filter class=solr.SnowballPorterFilterFactory 
 language=${Language} protected=protwords.txt /
        /analyzer
 /fieldType
 Can anyone help me out please? :(
 PS: the ${Language} is set to en (for english) in this case...

 _
 La boîte mail NOW Génération vous permet de réunir toutes vos boîtes mail 
 dans Hotmail !
 http://www.windowslive.fr/hotmail/nowgeneration/


Re: questions about Solr shards

2010-06-28 Thread Joe Calderon
there is a first pass query to retrieve all matching document ids from
every shard along with relevant sorting information, the document ids
are then sorted and limited to the amount needed, then a second query
is sent for the rest of the documents metadata.

On Sun, Jun 27, 2010 at 7:32 PM, Babak Farhang farh...@gmail.com wrote:
 Otis,

 Belated thanks for your reply.

 2. The index could change between stages, e.g. a
 document that matched a
 query and was subsequently changed may no
 longer match but will still be
 retrieved.

 2. This describes the situation where, for instance, a
 document with ID=10 is updated between the 2 calls
 to the Solr instance/shard where that doc ID=10 lives.

 Can you explain why this happens? (I.e. does each query to the sharded
 index somehow involve 2 calls to each shard instance from the base
 instance?)

 -Babak

 On Thu, Jun 24, 2010 at 10:14 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Hi Babak,

 1. Yes, you are reading that correctly.

 2. This describes the situation where, for instance, a document with ID=10 
 is updated between the 2 calls to the Solr instance/shard where that doc 
 ID=10 lives.

 3. Yup, orthogonal.  You can have a master with multiple cores for sharded 
 and non-sharded indices and you can have a slave with cores that hold 
 complete indices or just their shards.
  Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: Babak Farhang farh...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thu, June 24, 2010 6:32:54 PM
 Subject: questions about Solr shards

 Hi everyone,

 There are a couple of notes on the limitations of this
 approach at

 target=_blank http://wiki.apache.org/solr/DistributedSearch which I'm
 having trouble
 understanding.

 1. When duplicate doc IDs are received,
 Solr chooses the first doc
   and discards subsequent
 ones

 Received here is from the perspective of the base Solr instance
 at
 query time, right?  I.e. if you inadvertently indexed 2 versions
 of
 the document with the same unique ID but different contents to
 2
 shards, then at query time, the first document (putting aside for
 the
 moment what exactly first means) would win.  Am I reading
 this
 right?


 2. The index could change between stages, e.g. a
 document that matched a
   query and was subsequently changed may no
 longer match but will still be
   retrieved.

 I have no idea what
 this second statement means.


 And one other question about
 shards:

 3. The examples I've seen documented do not illustrate
 sharded,
 multicore setups; only sharded monolithic cores.  I assume
 sharding
 works with multicore as well (i.e. the two issues are
 orthogonal).  Is
 this right?


 Any help on interpreting the
 above would be much appreciated.

 Thank you,
 -Babak




Too Many Open Files

2010-06-28 Thread Anderson vasconcelos
Hi all
When i send a delete query to SOLR, using the SOLRJ i received this
exception:

org.apache.solr.client.solrj.SolrServerException: java.net.SocketException:
Too many open files
11:53:06,964 INFO  [HttpMethodDirector] I/O exception
(java.net.SocketException) caught when processing request: Too many open
files

Anyone could Help me? How i can solve this?

Thanks


Re: Too Many Open Files

2010-06-28 Thread Erick Erickson
This probably means you're opening new readers without closing
old ones. But that's just a guess. I'm guessing that this really
has nothing to do with the delete itself, but the delete is what's
finally pushing you over the limit.

I know this has been discussed before, try searching the mail
archive for TooManyOpenFiles and/or File Handles

You could get much better information by providing more details, see:

http://wiki.apache.org/solr/UsingMailingLists?highlight=(most)|(users)|(list)

Best
Erick

On Mon, Jun 28, 2010 at 11:56 AM, Anderson vasconcelos 
anderson.v...@gmail.com wrote:

 Hi all
 When i send a delete query to SOLR, using the SOLRJ i received this
 exception:

 org.apache.solr.client.solrj.SolrServerException: java.net.SocketException:
 Too many open files
 11:53:06,964 INFO  [HttpMethodDirector] I/O exception
 (java.net.SocketException) caught when processing request: Too many open
 files

 Anyone could Help me? How i can solve this?

 Thanks



solr data config questions

2010-06-28 Thread Peng, Wei
Hi All,

 

I am a new user of Solr.

We are now trying to enable searching on Digg dataset.

It has story_id as the primary key and comment_id are the comment id
which commented story_id, so story_id and comment_id is one-to-many
relationship.

These comment_ids can be replied by some repliers, so comment_id and
repliers are one-to-many relationship.

 

The problem is that within a single returned document the search results
shows an array of comment_ids and an array of repliers without knowing
which repliers replied which comment.

For example: now we got comment_id:[c1,c,2...,cn],
repliers:[r1,r2,r3rm]. Can we get something like
comment_id:[c1,c,2...,cn], repliers:[{r1,r2},{},r3{rm-1,rm}] so that
{r1,r2} is corresponding to c1?

 

Our current data-config is attached:

dataConfig

dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
autoreconnect=true netTimeoutForStreamingResults=1200
url=jdbc:mysql://localhost/diggdataset batchSize=-1 user=root
password= /

document

entity name=story pk=story_id query=select * from
story

  deltaImportQuery=select * from story where
ID=='${dataimporter.delta.story_id}'

  deltaQuery=select story_id from story where
last_modified  '${dataimporter.last_index_time}'



field column=link name=link /

field column=title name=title /

field column=description name=story_content /

field column=digg name=positiveness /

field column=comment name=spreading_number /

field column=user_id name=author /

field column=profile_view name=user_popularity /

field column=topic name=topic /

field column=timestamp name=timestamp /



entity name=dugg_list  pk=story_id

query=select * from dugg_list where
story_id='${story.story_id}'

deltaQuery=select SID from dugg_list where
last_modified  '${dataimporter.last_index_time}'

parentDeltaQuery=select story_id from story where
story_id=${dugg_list.story_id}

  field name=viewer column=dugger /

/entity

 

entity name=commenttable  pk=comment_id

query=select * from commenttable where
story_id='${story.story_id}'

deltaQuery=select SID from commenttable where
last_modified  '${dataimporter.last_index_time}'

parentDeltaQuery=select story_id from story where
story_id=${commenttable.story_id}

  field name=comment_id column=comment_id /

  field name=spreading_user column=replier /

  field name=comment_positiveness column=up /

  field name=comment_negativeness column=down /

  field name=user_comment column=content /

  field name=user_comment_timestamp
column=timestamp /

 

 

entity name=replytable  

query=select * from replytable where
comment_id='${commenttable.comment_id}'

deltaQuery=select SID from replytable where
last_modified  '${dataimporter.last_index_time}'

parentDeltaQuery=select comment_id from
commenttable where comment_id=${replytable.comment_id}

  field name=replier_id column=replier_id /

  field name=reply_content column=content /

  field name=reply_positiveness column=up /

  field name=reply_negativeness column=down /

  field name=reply_timestamp column=timestamp /

/entity

 

/entity

/entity

/document

/dataConfig

 

Please help me on this.

Many thanks

 

Vivian

 

 

 



Re: preside != president

2010-06-28 Thread Jan Høydahl / Cominvent
Hi,

You might also want to check out the new Lucene-Hunspell stemmer at 
http://code.google.com/p/lucene-hunspell/
It uses OpenOffice dictionaries with known stems in combination with a large 
set of language specific rules.
It handles your example, but it is an early release, so test it thoroughly 
before deploying in production :)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 28. juni 2010, at 17.43, Joe Calderon wrote:

 the general consensus among people who run into the problem you have
 is to use a plurals only stemmer, a synonyms file or a combination of
 both (for irregular nouns etc)
 
 if you search the archives you can find info on a plurals stemmer
 
 On Mon, Jun 28, 2010 at 6:49 AM,  dar...@ontrenet.com wrote:
 Thanks for the tip. Yeah, I think the stemming confounds search results as
 it stands (porter stemmer).
 
 I was also thinking of using my dictionary of 500,000 words with their
 complete morphologies and conjugations and create a synonyms.txt to
 provide english accurate morphology.
 
 Is this a good idea?
 
 Darren
 
 Hi Darren,
 
 You might want to look at the KStemmer
 (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
 instead of the standard PorterStemmer. It essentially has a 'dictionary'
 of exception words where stemming stops if found, so in your case
 president won't be stemmed any further than president (but presidents will
 be stemmed to president). You will have to integrate it into solr
 yourself, but that's straightforward.
 
 HTH
 Brendan
 
 
 On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:
 
 Hi,
  It seems to me that because the stemming does not produce
 grammatically correct stems in many of the cases,
 search anomalies can occur like the one I am seeing where I have a
 document with president in it and it is returned
 when I search for preside, a different word entirely.
 
 Is this correct or acceptable behavior? Previous discussions here on
 stemming, I was told its ok as long as all the words reduce
 to the same stem, but when different words reduce to the same stem it
 seems to affect search results in a bad way.
 
 Darren
 
 
 
 



Re: SweetSpotSimilarity

2010-06-28 Thread Blargy


iorixxx wrote:
 
 it is in schema.xml:
 
 similarity class=org.apache.lucene.search.SweetSpotSimilarity/
 

How would you configure the tfBaselineTfFactors and LengthNormFactors when
configuring via schema.xml? Do I have to create a subclass that hardcodes
these values?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SweetSpotSimilarity-tp922546p928730.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SweetSpotSimilarity

2010-06-28 Thread Ahmet Arslan


 How would you configure the tfBaselineTfFactors and
 LengthNormFactors when
 configuring via schema.xml? 

CustomSimilarityFactory that extends org.apache.solr.schema.SimilarityFactory 
should do it. There is an example CustomSimilarityFactory.java under 
src/test/org...


  


Re: Spatial types and DIH

2010-06-28 Thread Grant Ingersoll

On Jun 24, 2010, at 12:32 AM, Eric Angel wrote:

 I'm using solr 4.0-2010-06-23_08-05-33 and can't figure out how to add the 
 spatial types (LatLon, Point, GeoHash or SpatialTile) using 
 dataimporthandler.  My lat/lngs from the database are in separate fields.  
 Does anyone know how to do his?

Can you concat the two fields together as part of your SQL statement?

Re: Spatial types and DIH

2010-06-28 Thread Eric Angel
Yes.  For now, I've gone back to Lucene 1.4 and installed Local Lucene.  I just 
couldn't get the sfilt to work.  I'm sure I was probably missing something, but 
I think I'll just wait until 1.5 is ready to be shipped.


On Jun 28, 2010, at 12:02 PM, Grant Ingersoll wrote:

 
 On Jun 24, 2010, at 12:32 AM, Eric Angel wrote:
 
 I'm using solr 4.0-2010-06-23_08-05-33 and can't figure out how to add the 
 spatial types (LatLon, Point, GeoHash or SpatialTile) using 
 dataimporthandler.  My lat/lngs from the database are in separate fields.  
 Does anyone know how to do his?
 
 Can you concat the two fields together as part of your SQL statement?



Re: SweetSpotSimilarity

2010-06-28 Thread Blargy


iorixxx wrote:
 
 CustomSimilarityFactory that extends
 org.apache.solr.schema.SimilarityFactory should do it. There is an example
 CustomSimilarityFactory.java under src/test/org...
 

This is exactly what I was looking for... this is very similar ( no put
intended ;) ) to the updateProcessorFactory configuration in
solr-config.xml. The wiki should probably include this information.

Side question. How would I know if a configuration option can also take a
factory class.. like in this instance?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SweetSpotSimilarity-tp922546p928862.html
Sent from the Solr - User mailing list archive at Nabble.com.


spellcheckcomponent and frequency thresholds

2010-06-28 Thread Matthew Goldfield
Hi,
I'm adding the spellCheckComponent to my current configuration of solr, and I 
was wondering if there was a way to set a minimum frequency threshold for the 
IndexBasedSpellChecker through solr like there is in the depreciated Spell 
Check Request Handler.  I know that you can fix most problems by changing the 
'accuracy' field, but there are small anomalies that I'd like do remove from 
the dictionary entirely, and a simple way to do this would be using a frequency 
threshold.

I've looked around for this and I havent found anything recent.

Thanks,
Matt

csn | stores shop easy
Software Development
Phone: 617-502-7694



Re: Too Many Open Files

2010-06-28 Thread Michel Bottan
Hi Anderson,

If you are using SolrJ, it's recommended to reuse the same instance per solr
server.

http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer

But there are other scenarios which may cause this situation:

1. Other application running in the same Solr JVM which doesn't close
properly sockets or control file handlers.
2. Open files limits configuration is low . Check your limits, read it from
JVM process info:
cat /proc/1234/limits (where 1234 is your process ID)

Cheers,
Michel Bottan


On Mon, Jun 28, 2010 at 1:18 PM, Erick Erickson erickerick...@gmail.comwrote:

 This probably means you're opening new readers without closing
 old ones. But that's just a guess. I'm guessing that this really
 has nothing to do with the delete itself, but the delete is what's
 finally pushing you over the limit.

 I know this has been discussed before, try searching the mail
 archive for TooManyOpenFiles and/or File Handles

 You could get much better information by providing more details, see:


 http://wiki.apache.org/solr/UsingMailingLists?highlight=(most)|(users)|(list)http://wiki.apache.org/solr/UsingMailingLists?highlight=%28most%29%7C%28users%29%7C%28list%29

 Best
 Erick

 On Mon, Jun 28, 2010 at 11:56 AM, Anderson vasconcelos 
 anderson.v...@gmail.com wrote:

  Hi all
  When i send a delete query to SOLR, using the SOLRJ i received this
  exception:
 
  org.apache.solr.client.solrj.SolrServerException:
 java.net.SocketException:
  Too many open files
  11:53:06,964 INFO  [HttpMethodDirector] I/O exception
  (java.net.SocketException) caught when processing request: Too many open
  files
 
  Anyone could Help me? How i can solve this?
 
  Thanks
 



Re: solr data config questions

2010-06-28 Thread Alexey Serba
Hi,

You can add additional commentreplyjoin entity to story entity, i.e.

entity name=story ...
...
entity name=commenttable ...
...
entity name=replytable ...
...
/entity
/entity

entity name=commentreplyjoin query=select concat(comment_id,
',', replier_id) as commentreply from commenttable left join
replytable on replytable.comment_id=commenttable.comment_id where
commenttable.story_id=${story.story_id}'
field name=commentreply column=commentreply /
/entity
/entity

Thus, you will have multivalued field commentreply that contains list
of related comment_id, reply_id (comment_id, if you don't have any
related replies for this entry) pairs. You can retrieve all values of
that field and process on a client and build complex data structure.

HTH,
Alex

On Mon, Jun 28, 2010 at 8:19 PM, Peng, Wei wei.p...@xerox.com wrote:
 Hi All,



 I am a new user of Solr.

 We are now trying to enable searching on Digg dataset.

 It has story_id as the primary key and comment_id are the comment id
 which commented story_id, so story_id and comment_id is one-to-many
 relationship.

 These comment_ids can be replied by some repliers, so comment_id and
 repliers are one-to-many relationship.



 The problem is that within a single returned document the search results
 shows an array of comment_ids and an array of repliers without knowing
 which repliers replied which comment.

 For example: now we got comment_id:[c1,c,2...,cn],
 repliers:[r1,r2,r3rm]. Can we get something like
 comment_id:[c1,c,2...,cn], repliers:[{r1,r2},{},r3{rm-1,rm}] so that
 {r1,r2} is corresponding to c1?



 Our current data-config is attached:

 dataConfig

    dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
 autoreconnect=true netTimeoutForStreamingResults=1200
 url=jdbc:mysql://localhost/diggdataset batchSize=-1 user=root
 password= /

    document

            entity name=story pk=story_id query=select * from
 story

                  deltaImportQuery=select * from story where
 ID=='${dataimporter.delta.story_id}'

                  deltaQuery=select story_id from story where
 last_modified  '${dataimporter.last_index_time}'



            field column=link name=link /

            field column=title name=title /

            field column=description name=story_content /

            field column=digg name=positiveness /

            field column=comment name=spreading_number /

            field column=user_id name=author /

            field column=profile_view name=user_popularity /

            field column=topic name=topic /

            field column=timestamp name=timestamp /



            entity name=dugg_list  pk=story_id

                    query=select * from dugg_list where
 story_id='${story.story_id}'

                    deltaQuery=select SID from dugg_list where
 last_modified  '${dataimporter.last_index_time}'

                    parentDeltaQuery=select story_id from story where
 story_id=${dugg_list.story_id}

                  field name=viewer column=dugger /

            /entity



            entity name=commenttable  pk=comment_id

                    query=select * from commenttable where
 story_id='${story.story_id}'

                    deltaQuery=select SID from commenttable where
 last_modified  '${dataimporter.last_index_time}'

                    parentDeltaQuery=select story_id from story where
 story_id=${commenttable.story_id}

                  field name=comment_id column=comment_id /

                  field name=spreading_user column=replier /

                  field name=comment_positiveness column=up /

                  field name=comment_negativeness column=down /

                  field name=user_comment column=content /

                  field name=user_comment_timestamp
 column=timestamp /





            entity name=replytable

                    query=select * from replytable where
 comment_id='${commenttable.comment_id}'

                    deltaQuery=select SID from replytable where
 last_modified  '${dataimporter.last_index_time}'

                    parentDeltaQuery=select comment_id from
 commenttable where comment_id=${replytable.comment_id}

                  field name=replier_id column=replier_id /

                  field name=reply_content column=content /

                  field name=reply_positiveness column=up /

                  field name=reply_negativeness column=down /

                  field name=reply_timestamp column=timestamp /

            /entity



            /entity

            /entity

    /document

 /dataConfig



 Please help me on this.

 Many thanks



 Vivian










Very basic questions: Indexing text

2010-06-28 Thread Peter Spam
Hi everyone,

I'm looking for a way to index a bunch of (potentially large) text files.  I 
would love to see results like Google, so I went through a few tutorials, but 
I've still got questions:

1) I can get my docs in the index, but when I search, it returns the entire 
document.  I'd love to have it only return the line (or two) around the search 
term.

2) There are one or two fields at the beginning of the file that I would like 
to search on, so these should be indexed differently, right?

3) Is there a nice front-end example anywhere?  Something that would return 
results kind of like Google?

Thanks for your time - Solr / Lucene seem to be very powerful.


-Pete


Re: Very basic questions: Indexing text

2010-06-28 Thread Ahmet Arslan
 1) I can get my docs in the index, but when I search, it
 returns the entire document.  I'd love to have it only
 return the line (or two) around the search term.

Solr can generate Google-like snippets as you describe. 
http://wiki.apache.org/solr/HighlightingParameters

 2) There are one or two fields at the beginning of the file
 that I would like to search on, so these should be indexed
 differently, right?

Probably yes. 
 
 3) Is there a nice front-end example anywhere? 
 Something that would return results kind of like Google?

http://wiki.apache.org/solr/PublicServers
http://search-lucene.com/





Re: Very basic questions: Indexing text

2010-06-28 Thread Peter Spam
Great, thanks for the pointers.


Thanks,
Peter

On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

 1) I can get my docs in the index, but when I search, it
 returns the entire document.  I'd love to have it only
 return the line (or two) around the search term.
 
 Solr can generate Google-like snippets as you describe. 
 http://wiki.apache.org/solr/HighlightingParameters
 
 2) There are one or two fields at the beginning of the file
 that I would like to search on, so these should be indexed
 differently, right?
 
 Probably yes. 
 
 3) Is there a nice front-end example anywhere? 
 Something that would return results kind of like Google?
 
 http://wiki.apache.org/solr/PublicServers
 http://search-lucene.com/
 
 
 



DIH and denormalizing

2010-06-28 Thread Shawn Heisey
I am trying to do some denormalizing with DIH from a MySQL source.  
Here's part of my data-config.xml:


entity name=dataTable pk=did
  query=SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE 
did gt; ${dataimporter.request.minDid} AND did lt;= 
${dataimporter.request.maxDid} AND (did % 
${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})

entity name=ncdat_wt
query=SELECT webtable as wt FROM ncdat_wt WHERE 
featurecode='${ncdat.feature}'

/entity
/entity

The relationship between features in ncdat and webtable in ncdat_wt (via 
featurecode) will be many-many.  The wt field in schema.xml is set up 
as multivalued.


It seems that ${ncdat.feature} is not being set.  I saw a query 
happening on the server and it was SELECT webtable as wt FROM ncdat_wt 
WHERE featurecode='' - that last part is an empty string with single 
quotes around it.  From what I can tell, there are no entries in ncdat 
where feature is blank.  I've tried this with both a 1.5-dev checked out 
months ago (which we are using in production) and a 3.1-dev checked out 
today.


Am I doing something wrong?

Thanks,
Shawn



Re: Too Many Open Files

2010-06-28 Thread Anderson vasconcelos
Thanks for responses.
I instantiate one instance of  per request (per delete query, in my case).
I have a lot of concurrency process. Reusing the same instance (to send,
delete and remove data) in solr, i will have a trouble?
My concern is if i do this, solr will commit documents with data from other
transaction.

Thanks




2010/6/28 Michel Bottan freakco...@gmail.com

 Hi Anderson,

 If you are using SolrJ, it's recommended to reuse the same instance per
 solr
 server.

 http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer

 But there are other scenarios which may cause this situation:

 1. Other application running in the same Solr JVM which doesn't close
 properly sockets or control file handlers.
 2. Open files limits configuration is low . Check your limits, read it from
 JVM process info:
 cat /proc/1234/limits (where 1234 is your process ID)

 Cheers,
 Michel Bottan


 On Mon, Jun 28, 2010 at 1:18 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  This probably means you're opening new readers without closing
  old ones. But that's just a guess. I'm guessing that this really
  has nothing to do with the delete itself, but the delete is what's
  finally pushing you over the limit.
 
  I know this has been discussed before, try searching the mail
  archive for TooManyOpenFiles and/or File Handles
 
  You could get much better information by providing more details, see:
 
 
 
 http://wiki.apache.org/solr/UsingMailingLists?highlight=(most)|(users)|(list)http://wiki.apache.org/solr/UsingMailingLists?highlight=%28most%29%7C%28users%29%7C%28list%29
 
 http://wiki.apache.org/solr/UsingMailingLists?highlight=%28most%29%7C%28users%29%7C%28list%29
 
 
  Best
  Erick
 
  On Mon, Jun 28, 2010 at 11:56 AM, Anderson vasconcelos 
  anderson.v...@gmail.com wrote:
 
   Hi all
   When i send a delete query to SOLR, using the SOLRJ i received this
   exception:
  
   org.apache.solr.client.solrj.SolrServerException:
  java.net.SocketException:
   Too many open files
   11:53:06,964 INFO  [HttpMethodDirector] I/O exception
   (java.net.SocketException) caught when processing request: Too many
 open
   files
  
   Anyone could Help me? How i can solve this?
  
   Thanks
  
 



Optimizing cache

2010-06-28 Thread Blargy

Here is a screen shot for our cache from New Relic.

http://s4.postimage.org/mmuji-31d55d69362066630eea17ad7782419c.png

Query cache: 55-65%
Filter cache: 100%
Document cache: 63%

Cache size is 512 for above 3 caches.

How do I interpret this data? What are some optimal configuration changes
given the above stats?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Optimizing-cache-tp929156p929156.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: DIH and denormalizing

2010-06-28 Thread caman

In your query 'query=SELECT webtable as wt FROM ncdat_wt WHERE 
featurecode='${ncdat.feature}'  .. instead of ${ncdat.feature} use
${dataTable.feature}  where dataTable is your parent entity name.

 

 

 

From: Shawn Heisey-4 [via Lucene]
[mailto:ml-node+929151-1527242139-124...@n3.nabble.com] 
Sent: Monday, June 28, 2010 2:24 PM
To: caman
Subject: DIH and denormalizing

 

I am trying to do some denormalizing with DIH from a MySQL source.   
Here's part of my data-config.xml: 

entity name=dataTable pk=did 
   query=SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE 
did  ${dataimporter.request.minDid} AND did = 
${dataimporter.request.maxDid} AND (did % 
${dataimporter.request.numShards}) IN (${dataimporter.request.modVal}) 
entity name=ncdat_wt 
 query=SELECT webtable as wt FROM ncdat_wt WHERE 
featurecode='${ncdat.feature}' 
/entity 
/entity 

The relationship between features in ncdat and webtable in ncdat_wt (via 
featurecode) will be many-many.  The wt field in schema.xml is set up 
as multivalued. 

It seems that ${ncdat.feature} is not being set.  I saw a query 
happening on the server and it was SELECT webtable as wt FROM ncdat_wt 
WHERE featurecode='' - that last part is an empty string with single 
quotes around it.  From what I can tell, there are no entries in ncdat 
where feature is blank.  I've tried this with both a 1.5-dev checked out 
months ago (which we are using in production) and a 3.1-dev checked out 
today. 

Am I doing something wrong? 

Thanks, 
Shawn 




  _  

View message @
http://lucene.472066.n3.nabble.com/DIH-and-denormalizing-tp929151p929151.htm
l 
To start a new topic under Solr - User, email
ml-node+472068-464289649-124...@n3.nabble.com 
To unsubscribe from Solr - User, click
 (link removed) 
GZvcnRoZW90aGVyc3R1ZmZAZ21haWwuY29tfDQ3MjA2OHwtOTM0OTI1NzEx  here. 

 


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-and-denormalizing-tp929151p929168.html
Sent from the Solr - User mailing list archive at Nabble.com.


unknown handler dataimport

2010-06-28 Thread Lance Hill
Hi,

 

I am trying to get db indexing up and running, but I am having trouble
getting it working. 

 

In the solrconfig.xml file, I added 

 

  requestHandler name=/dataimport
class=org.apache.solr.handler.dataimport.DataImportHandler

  lst name=defaults

 str name=configdata-config.xml/str

  /lst

  /requestHandler

 

I defined a couple of fields in schema.xml   

 

field name=media_id type=long stored=true /

   field name=artist_name type=text indexed=true stored=true
multiValued=true /

   field name=song_title type=text indexed=true stored=true
multiValued=true /

 

 

media_id is defined as the unique key

 

I added the dataconfig to the data-config.xml file

 

dataConfig

dataSource type=JdbcDataSource

driver=com.mysql.jdbc.Driver

url=jdbc:mysql://localhost/media

user=xxxd

password=*/

document name=media

entity name=video query=select mediaId, name, title FROM Media


field column=mediaId name=media_id type=integer
stored=true/

field column=name name=artist_name type=string
indexed=true stored=true/

field column=title name=song_title type=string
indexed=true stored=true/

/entity

/document

/dataConfig

 

 

When I start the server, I can see it is loading the dataimport handler

 

Jun 28, 2010 8:52:32 PM org.apache.solr.handler.dataimport.DataImportHandler
processConfiguration

INFO: Processing configuration from solrconfig.xml: {config=data-config.xml}

Jun 28, 2010 8:52:32 PM org.apache.solr.handler.dataimport.DataImporter
loadDataConfig

INFO: Data Configuration loaded successfully

 

 

When I go to
http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport, on the
right side, I see the message 
 
unknown handler: /dataimport

 

 

I do see a BindException: Address already in use when I restart the solr
process, but I don't see any other errors . Since the dataimport config was
successfully loaded, I don't think that is the reason /dataimport is
unknown.  

 

Did I forget to add something to the configurations? Is there another log
file I should be checking for errors?

 

Regards,

 

L. Hill



Re: DIH and denormalizing

2010-06-28 Thread Shawn Heisey

On 6/28/2010 3:28 PM, caman wrote:

In your query 'query=SELECT webtable as wt FROM ncdat_wt WHERE
featurecode='${ncdat.feature}'  .. instead of ${ncdat.feature} use
${dataTable.feature}  where dataTable is your parent entity name.
   


I knew it would be something stupid like that.  I thought I changed 
everything, looks like I forgot one.  Thank you!  From what I can tell 
now, it's working.  Sure is a lot slower now that it's got to do another 
query for every item.


Shawn



Re: DIH and denormalizing

2010-06-28 Thread Alexey Serba
 It seems that ${ncdat.feature} is not being set.
Try ${dataTable.feature} instead.


On Tue, Jun 29, 2010 at 1:22 AM, Shawn Heisey s...@elyograg.org wrote:
 I am trying to do some denormalizing with DIH from a MySQL source.  Here's
 part of my data-config.xml:

 entity name=dataTable pk=did
      query=SELECT *,FROM_UNIXTIME(post_date) as pd FROM ncdat WHERE did
 gt; ${dataimporter.request.minDid} AND did lt;=
 ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards})
 IN (${dataimporter.request.modVal})
 entity name=ncdat_wt
        query=SELECT webtable as wt FROM ncdat_wt WHERE
 featurecode='${ncdat.feature}'
 /entity
 /entity

 The relationship between features in ncdat and webtable in ncdat_wt (via
 featurecode) will be many-many.  The wt field in schema.xml is set up as
 multivalued.

 It seems that ${ncdat.feature} is not being set.  I saw a query happening on
 the server and it was SELECT webtable as wt FROM ncdat_wt WHERE
 featurecode='' - that last part is an empty string with single quotes
 around it.  From what I can tell, there are no entries in ncdat where
 feature is blank.  I've tried this with both a 1.5-dev checked out months
 ago (which we are using in production) and a 3.1-dev checked out today.

 Am I doing something wrong?

 Thanks,
 Shawn




Re: Too Many Open Files

2010-06-28 Thread Anderson vasconcelos
Other question,
Why SOLRJ d'ont close the StringWriter e OutputStreamWriter ?

thanks

2010/6/28 Anderson vasconcelos anderson.v...@gmail.com

 Thanks for responses.
 I instantiate one instance of  per request (per delete query, in my case).
 I have a lot of concurrency process. Reusing the same instance (to send,
 delete and remove data) in solr, i will have a trouble?
 My concern is if i do this, solr will commit documents with data from other
 transaction.

 Thanks




 2010/6/28 Michel Bottan freakco...@gmail.com

 Hi Anderson,

 If you are using SolrJ, it's recommended to reuse the same instance per
 solr
 server.

 http://wiki.apache.org/solr/Solrj#CommonsHttpSolrServer

 But there are other scenarios which may cause this situation:

 1. Another application running in the same Solr JVM that doesn't properly
 close sockets or file handles.
 2. The open-files limit is configured too low. Check your limits; you can
 read them from the JVM process info:
 cat /proc/1234/limits (where 1234 is your process ID)

 Cheers,
 Michel Bottan
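
A hedged shell sketch of that limits check (the PID and the new value are
placeholders; a persistent setting would go in /etc/security/limits.conf on
most Linux systems):

# show the open-files limit of the running JVM (replace 1234 with the real PID)
grep 'open files' /proc/1234/limits

# raise the soft limit for the current shell before starting Solr
ulimit -n 65536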


 On Mon, Jun 28, 2010 at 1:18 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  This probably means you're opening new readers without closing
  old ones. But that's just a guess. I'm guessing that this really
  has nothing to do with the delete itself, but the delete is what's
  finally pushing you over the limit.
 
  I know this has been discussed before, try searching the mail
  archive for TooManyOpenFiles and/or File Handles
 
  You could get much better information by providing more details, see:
  http://wiki.apache.org/solr/UsingMailingLists?highlight=%28most%29%7C%28users%29%7C%28list%29
 
  Best
  Erick
 
  On Mon, Jun 28, 2010 at 11:56 AM, Anderson vasconcelos 
  anderson.v...@gmail.com wrote:
 
    Hi all
    When I send a delete query to Solr using SolrJ, I receive this
    exception:
  
   org.apache.solr.client.solrj.SolrServerException:
  java.net.SocketException:
   Too many open files
   11:53:06,964 INFO  [HttpMethodDirector] I/O exception
   (java.net.SocketException) caught when processing request: Too many
 open
   files
  
   Anyone could Help me? How i can solve this?
  
   Thanks
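
Tying the thread together, a minimal SolrJ 1.4 sketch of the "one shared
instance per Solr server" recommendation (the URL is an assumption). Note
that commits in Solr apply to the whole index regardless of which client
sent the documents, so sharing one instance does not mix "transactions" any
more than separate instances would:

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SolrClientHolder {
    // One shared, thread-safe instance per Solr server;
    // CommonsHttpSolrServer pools its HTTP connections internally.
    private static final SolrServer SERVER = create();

    private static SolrServer create() {
        try {
            return new CommonsHttpSolrServer("http://localhost:8983/solr");
        } catch (MalformedURLException e) {
            throw new RuntimeException(e);
        }
    }

    public static SolrServer get() {
        return SERVER;
    }
}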
  
 





Re: Very basic questions: Indexing text

2010-06-28 Thread Peter Spam
On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

 1) I can get my docs in the index, but when I search, it
 returns the entire document.  I'd love to have it only
 return the line (or two) around the search term.
 
 Solr can generate Google-like snippets as you describe. 
 http://wiki.apache.org/solr/HighlightingParameters

Here's how I commit my documents:

J=0;
for i in `find . -name \*.txt`; do
    (( J++ ))
    curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F "myfile=@$i";
done;

echo "- Committing"
curl "http://localhost:8983/solr/update/extract?commit=true"


Then, I try to query using
http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing
but I only get back the document ID rather than the snippet:

<doc>
  <float name="score">0.05030759</float>
  <arr name="content_type">
    <str>text/plain</str>
  </arr>
  <str name="id">doc16</str>
</doc>

I'm using the schema.xml from the Lucid Imagination "Indexing text and HTML
files" tutorial.



-Pete


Re: Very basic questions: Indexing text

2010-06-28 Thread Erick Erickson
Try adding hl.fl=text
to specify your highlight field. I don't understand why you're only
getting the ID field back, though. Do note that the highlighting section
comes after the docs, related by the ID.

Try a (non-highlighting) query of just * to verify that you're
pointing at the index you think you are. It's possible that
you've modified a different index with SolrJ than your web
server is pointing at.

Also, Solr has no way of knowing you've modified your index
with SolrJ, so it may not automatically reopen an
IndexReader, and your recent changes may not be visible
until you force the Solr reader to reopen.
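
For example (this assumes the extracted body really landed in a stored
field named text; highlighting can only build snippets from stored fields):

curl "http://localhost:8983/solr/select?q=testing&hl=true&hl.fl=text&fl=id,score&rows=10"

If that field is indexed but not stored, you'll keep getting only the id
back no matter what hl.fl says.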

HTH
Erick

On Mon, Jun 28, 2010 at 6:49 PM, Peter Spam ps...@mac.com wrote:

 On Jun 28, 2010, at 2:00 PM, Ahmet Arslan wrote:

  1) I can get my docs in the index, but when I search, it
  returns the entire document.  I'd love to have it only
  return the line (or two) around the search term.
 
  Solr can generate Google-like snippets as you describe.
  http://wiki.apache.org/solr/HighlightingParameters

 Here's how I commit my documents:

 J=0;
 for i in `find . -name \*.txt`; do
     (( J++ ))
     curl "http://localhost:8983/solr/update/extract?literal.id=doc$J" -F "myfile=@$i";
 done;

 echo "- Committing"
 curl "http://localhost:8983/solr/update/extract?commit=true"


 Then, I try to query using
 http://localhost:8983/solr/select?rows=10&start=0&fl=*,score&hl=true&q=testing
 but I only get back the document ID rather than the snippet:

 <doc>
   <float name="score">0.05030759</float>
   <arr name="content_type">
     <str>text/plain</str>
   </arr>
   <str name="id">doc16</str>
 </doc>

  I'm using the schema.xml from the Lucid Imagination "Indexing text and
 HTML files" tutorial.



 -Pete



What is the proper procedure to reopen closed bugs?

2010-06-28 Thread Teruhiko Kurosaka
I'd like to reopen bug SOLR-1960
https://issues.apache.org/jira/browse/SOLR-1960
("http://wiki.apache.org/solr/ : non-English users get generic MoinMoin page
instead of the desired information")
as I submitted a patch, but Jira won't let me do it.
Do I have to clone it?


Teruhiko "Kuro" Kurosaka, 415-227-9600 x122
RLP + Lucene & Solr = powerful search for global contents



AutoSuggest Question

2010-06-28 Thread Neil Lott
Hi,

I've read some on the autosuggest and I would like to know if the following is 
possible with my current configuration.

I'm using solr 1.4.


<field name="title" type="text" indexed="true" stored="true" required="true"/>
<field name="titleac3" type="autocomplete3" indexed="true" stored="true"
omitNorms="true" omitTermFreqAndPositions="true"/>
<copyField source="title" dest="titleac3"/>

<fieldType name="autocomplete3" class="solr.TextField"
positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Results:

http://localhost:8984/solr/core/select/?q=titleac3:%22secret%22&version=2.2&start=0&rows=10&indent=on&fl=title

I currently get the following results:

<doc><str name="title">Pasajes Secretos</str></doc>
<doc><str name="title">Secret Agent Zero</str></doc>
<doc><str name="title">Secretos de la Ciudad</str></doc>
<doc><str name="title">Back to the Secret Garden</str></doc>
<doc><str name="title">The Making Of: The Secret Life of Bees</str></doc>
<doc><str name="title">Sexy Celebrity Secrets</str></doc>
<doc><str name="title">The Secrets of the Battle of the Bulge</str></doc>
<doc><str name="title">The Secret Life of Bees</str></doc>
<doc><str name="title">Ancient Secrets of the Bible</str></doc>
<doc><str name="title">Secrets of the Submarine War</str></doc>


I'd like a way for the results to be sorted so it looks like this:

<doc><str name="title">Secret Agent Zero</str></doc>  (found in 1st word)
<doc><str name="title">The Secrets of the Battle of the Bulge</str></doc>  (found in 1st word)
<doc><str name="title">The Secret Life of Bees</str></doc>  (found in 1st word)
<doc><str name="title">Secrets of the Submarine War</str></doc>  (found in 1st word)
<doc><str name="title">Secretos de la Ciudad</str></doc>  (found in 1st word)
<doc><str name="title">Ancient Secrets of the Bible</str></doc>  (found in 2nd word)
<doc><str name="title">Back to the Secret Garden</str></doc>  (found in 2nd word)
<doc><str name="title">The Making Of: The Secret Life of Bees</str></doc>  (found in 2nd word)
<doc><str name="title">Pasajes Secretos</str></doc>  (found in 2nd word)
<doc><str name="title">Sexy Celebrity Secrets</str></doc>  (found in 3rd word)


So I'd like the first matches to be those where "secret" occurs in the first
word (or a leading sub-word of it), grouped alphabetically.
Next, those where "secret" occurs in the second word or its leading sub-word,
grouped alphabetically.
Etc.

My specific rule is that if the first word is a stop word, it is ignored in
the sorting.

Is there a way I can get Solr to order my results as such?

Also, are there any drawbacks to using solr.LetterTokenizerFactory?

I assume maxGramSize refers to the maximum length of a gram, and that making
it larger than 25 wouldn't really help?

Is there a better way to do the autosuggest technique above aside from using 
the autocomplete3 field I've defined given what I'm trying to accomplish?
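
One common approach, as a sketch under assumptions (the extra field, type
name, and boost factor below are hypothetical, not from this thread): add a
second edge-gram field built on KeywordTokenizer, so the grams anchor to the
start of the whole title, and boost it above the word-level field:

<field name="titleac_whole" type="autocomplete_whole" indexed="true" stored="false"/>
<copyField source="title" dest="titleac_whole"/>

<fieldType name="autocomplete_whole" class="solr.TextField"
positionIncrementGap="100">
  <analyzer type="index">
    <!-- KeywordTokenizer keeps the title as a single token, so the
         edge grams only match prefixes of the entire title -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Queried through dismax with qf=titleac_whole^10 titleac3, titles that start
with the typed text rank above titles that merely contain it. This does not
by itself produce the exact per-word buckets described above; that would
need something like indexing the match position into a separate sort field.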

Thanks,

Neil

Re: Very basic questions: Indexing text

2010-06-28 Thread Michael Lackhoff
On 28.06.2010 23:00 Ahmet Arslan wrote:

 1) I can get my docs in the index, but when I search, it
 returns the entire document.  I'd love to have it only
 return the line (or two) around the search term.
 
 Solr can generate Google-like snippets as you describe. 
 http://wiki.apache.org/solr/HighlightingParameters

I didn't know this was possible and am also interested in this feature,
but even after reading the given wiki page I cannot make out which
parameter to use. The only parameter that looks similar is
'hl.maxAlternateFieldLength', where it is possible to give a length to
return, but according to the description that is for the case of no match.
And there is hl.fragmentsBuilder, but with no explanation (the referred
page SolrFragmentsBuilder does not yet exist).

Could you give an example?
E.g. let's say I have a field 'title' and a field 'fulltext' and my
search term is 'solr'. What would be the right set of parameters to get
back the whole title field but only a snippet of 50 words (or three
sentences or whatever the unit) from the fulltext field?
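
A hedged example (host and field names taken from the question above;
hl.fragsize is measured in characters, not words or sentences):

http://localhost:8983/solr/select?q=solr&fl=title&hl=true&hl.fl=fulltext&hl.snippets=1&hl.fragsize=300

The stored title field comes back whole via fl, while the highlighting
section of the response carries roughly 300-character fragments of fulltext
around the match.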


Thanks
-Michael