PDF indexing

2011-09-29 Thread Jón Helgi Jónsson
Good day,

I'm checking if Solr would work for indexing PDFs. My requirements are:

1) I must know which page has what contents.
2) Right-to-left search support, for languages such as Hebrew. This has been
the trickiest to achieve.

I would also prefer to know the position of the searched contents on the page,
but could live without it.
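
For requirement 1, one possible approach is to index every page as its own Solr document. This is only a rough sketch: it assumes the page text has already been extracted by some PDF tool (not shown), and the field names id, file, page and text are made up for illustration.

```python
def pages_to_docs(file_name, page_texts):
    """Build one Solr-style document (a plain dict) per PDF page."""
    docs = []
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{file_name}#page={number}",  # hypothetical unique key
            "file": file_name,
            "page": number,                      # lets a hit report its page
            "text": text,
        })
    return docs


docs = pages_to_docs("report.pdf", ["first page text", "second page text"])
```

A hit on such a document carries its page number directly, at the cost of one index entry per page instead of one per file.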

Any info or ideas would be greatly appreciated.

Thank you,
Jon


How do I format this query with 2 search terms?

2010-11-17 Thread Jón Helgi Jónsson
I'm using index time boosting and need to specify every field I want
to search (not use copy fields) or else the boosting won't work.

This query with 1 search term works fine, boosts look good:

http://localhost:8983/solr/select/?
q=companyName:foo
+descriptionTxt:verslun
&fl=*%20score&rows=10&start=0

However, if I have 2 words in the query and do it like this, boosting
does not seem to work:

http://localhost:8983/solr/select/?
q=companyName:foo+bar
+descriptionTxt:foo+bar
&fl=*%20score&rows=10&start=0

It's probably using the default search field for the second word, which
has no boosting configured. How do I go about this?

Thanks,
Jon


Re: How do I format this query with 2 search terms?

2010-11-17 Thread Jón Helgi Jónsson
Thanks a lot for that!

I wanted to use dismax but hit a wall because I require trailing
wildcards in some instances. Methods 1 and 3 do not work in my case.
However, on further thought I realized that in the cases where I need a
wildcard I'm only searching one field. So I'll just turn dismax on and
off as required.

Thanks again :)

On Wed, Nov 17, 2010 at 12:40 PM, Ken Stanley doh...@gmail.com wrote:
 Jon,

 You have a few options here, depending on what you want to achieve
 with your query:

 1. If you're trying to do a phrase query, you simply need to ensure
 that your phrases are quoted. The default behavior in SOLR is to split
 the phrase into multiple chunks. If a word is not preceded with a
 field definition, then SOLR will automatically apply the word(s) as if
 you had specified the default field. So for your example, SOLR would
 parse your query into companyName:foo defaultField:bar
 descriptionTxt:foo defaultField:bar.
 2. You can use the dismax query plugin instead of the standard query
 plugin. You simply configure the dismax section of your solrconfig.xml
 to your liking - you define which fields to search, apply any special
 boosts for your needs, etc
 (http://wiki.apache.org/solr/DisMaxQParserPlugin) - and then you
 simply feed the query terms without naming your fields (i.e.,
 q=foo+bar), along with telling SOLR to use dismax (i.e.,
 qt=whatever_you_named_your_dismax_handler).
 3. If phrase queries are not important to you, you can manually prefix
 each term in your query with the field you wish to search; for
 example, you would do companyName:foo companyName:bar
 descriptionTxt:foo descriptionTxt:bar.
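
Options 1 and 3 above can be sketched as small query builders. This is a minimal illustration, not a full client: the helper names are made up, the field names come from the original question, and no escaping of special query characters is attempted.

```python
from urllib.parse import urlencode


def per_field_terms(fields, words):
    """Option 3: prefix every word with every field name."""
    return " ".join(f"{field}:{word}" for field in fields for word in words)


def per_field_phrase(fields, words):
    """Option 1: one quoted phrase per field."""
    phrase = " ".join(words)
    return " ".join(f'{field}:"{phrase}"' for field in fields)


fields = ["companyName", "descriptionTxt"]
words = ["foo", "bar"]

terms_query = per_field_terms(fields, words)
phrase_query = per_field_phrase(fields, words)

# urlencode escapes spaces, quotes and colons so the query survives the URL.
url = "http://localhost:8983/solr/select/?" + urlencode(
    {"q": terms_query, "fl": "*,score", "rows": 10, "start": 0}
)
```

Building the query from a list of fields keeps every word tied to a field, so nothing falls through to the default search field.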

 Whichever way you decide to go, the best thing that you can do to
 understand SOLR and how it's working in your environment is to append
 debugQuery=on to the end of your URL; this tells SOLR to output
 information about how it parsed your query, how long each component
 took to run, and some other useful debugging information. It's very
 useful, and has come in handy here several times when I wanted to
 know why SOLR returned (or didn't return) the results that I
 expected.
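
Appending debugQuery=on can be done in one place so it is easy to switch on while testing. A small sketch, assuming the same local Solr URL as in the question; the with_debug helper is hypothetical.

```python
from urllib.parse import urlencode


def with_debug(base_url, params):
    """Return the request URL with debugQuery=on added to the parameters."""
    debug_params = dict(params, debugQuery="on")
    return base_url + "?" + urlencode(debug_params)


url = with_debug(
    "http://localhost:8983/solr/select/",
    {"q": 'companyName:"foo bar"'},
)
```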

 I hope this helps.

 - Ken



Index time boosting troubles

2009-11-16 Thread Jón Helgi Jónsson
Hi,

I had working index time boosting on documents like so: <doc boost="10.0">

Everything was great until I made some changes that I thought were not
related to the doc boost, but after that my doc boosting appears to be
missing.

I'm having a tough time debugging this and didn't have the sense to version
control this so I would have something to revert to (lesson learned).

In schema.xml I have <fieldType name="float" class="solr.FloatField"
omitNorms="false"/>
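
For reference, a document-level boost in the XML update format can be generated like the sketch below. The field names and values are invented for illustration. As far as I understand, index-time boosts are folded into the norms of the indexed fields, so it is probably the omitNorms setting on the text fields being searched that matters, not the one on the float type.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields, boost):
    """Serialize one document as an <add> message with a doc-level boost."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc", boost=str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical field names and values.
xml = build_add_xml([("id", "1"), ("companyName", "Foo hf.")], boost=10.0)
```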

Is there something else I should be watching out for? Some query parameter
perhaps?

Or something else? I think wildcards in the query affect it, but I don't have
any. Is it some setting in solrconfig.xml or schema.xml?

Thanks!
Jon


Re: How to use key with facet.prefix?

2009-08-08 Thread Jón Helgi Jónsson
Thanks for that. So perhaps using copyField in the schema to make a subcat
field identical to my category field would be the best solution?
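
A sketch of what the request could look like with that approach. It assumes the schema has a line such as <copyField source="category" dest="subcat"/> so that subcat is a real field, which is not confirmed anywhere in this thread.

```python
from urllib.parse import urlencode

# Assumption: "subcat" is a separate field filled from "category" via copyField.
# A list of pairs is used because facet.field has to appear twice.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "category"),
    ("f.category.facet.prefix", "01"),
    ("facet.field", "subcat"),
    ("f.subcat.facet.prefix", "00"),
]
query_string = urlencode(params)
```

Because each facet now has its own field name, each per-field prefix parameter has something to attach to.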

On Sat, Aug 8, 2009 at 10:17 AM, Koji Sekiguchi k...@r.email.ne.jp wrote:
 I think '!key' can be used for just a label when displaying
 the facet result. As it doesn't change its field name,
 the parameter f.subcat.facet.prefix=00 is ignored.

 Koji





How to use key with facet.prefix?

2009-08-07 Thread Jón Helgi Jónsson
I'm trying to facet multiple times on the same field using key.

This works fine except when I use prefixes for these facets.

What I have so far (not functional):
..
facet=true
&facet.field=category&f.category.facet.prefix=01
&facet.field={!key=subcat}category&f.subcat.facet.prefix=00

This will give me 2 facets in the results, one named 'category' and
another 'subcat', as expected. But the prefix for key 'subcat' is ignored
and the other prefix is used for both facets.

How do I use key with prefixes or am I barking up the wrong tree here?

Thanks!


Summing sub categories in faceting

2009-08-06 Thread Jón Helgi Jónsson
Hi, would really appreciate some help on this.

I'm doing a category browser for companies. Kind of like a yellow pages.

For each company I store each category the company is in like this:
Example for Boeing would be
03.03.02
which is a fictional ID for 'Jets'.

At the starting point I display all companies.

My query: ?q=*:*&facet=true&facet.field=categoryID&facet.mincount=1

Desired facet result:
Shops and services (4313)  ID = 01
Home and interior (2932)   ID = 02
Transportation (1144)      ID = 03


I click Transportation, ID = 03

My query:
?q=*:*&fq=categoryID:03*&facet=true&facet.field=categoryID&facet.mincount=1

Desired facet result:
Land vehicles (708)  ID = 03.01
Boats (391)          ID = 03.02
Planes (342)         ID = 03.03

Under these categories are even more subcategories and so forth.

Using facet queries like the ones above would give me a count for every single
subcategory, which will be in the hundreds, when I really only want the
sums for the level of the hierarchical category tree I am currently at.

Does this make sense?

My solution is to store multiple IDs for each company. Example for
Boeing would be to have a categoryFacet field and store 03 and 03.03
and 03.03.02, and skip the wildcard in the facet.field.
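
Generating those extra values at index time is simple. A minimal sketch, where category_paths is a made-up helper name and the dot separator follows the IDs above:

```python
def category_paths(category_id, separator="."):
    """Return the ID of every level from the root down to category_id."""
    parts = category_id.split(separator)
    return [separator.join(parts[:depth]) for depth in range(1, len(parts) + 1)]


paths = category_paths("03.03.02")  # ['03', '03.03', '03.03.02']
```

Each company then contributes one count to every ancestor category, which is what makes the per-level sums come out of plain field faceting.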

Seems kind of bloated, are there better solutions?

Thanks a bunch!


Re: Summing sub categories in faceting

2009-08-06 Thread Jón Helgi Jónsson
Did a bit more creative searching for a solution and came up with this:

http://www.mail-archive.com/solr-user@lucene.apache.org/msg15027.html

I'm using a nightly build that is a couple of days old, so unless there is
something new I should know about I'm going with that method :)


Wildcard and boosting

2009-07-29 Thread Jón Helgi Jónsson
Hey now!

I do index time boosting for my fields and just discovered that when
searching with a trailing wildcard the boosting is ignored.

Will my boosting work with a wildcard if I do it at query time? And
if so, is there much of a performance difference?

Is there some other method I can use to preserve my boosting? I do not need
highlighting.

Thanks,
Jon Helgi
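
For comparison, a query-time boost on a trailing-wildcard term would use the ^ syntax of the standard query parser. This sketch only builds the query string; the field names and boost values are made up, and whether the boost shows up in the scores depends on how the Solr version in use rewrites wildcard queries.

```python
def boosted_prefix_query(prefix, field_boosts):
    """Build one boosted trailing-wildcard clause per field."""
    return " ".join(
        f"{field}:{prefix}*^{boost}" for field, boost in field_boosts.items()
    )


# Hypothetical fields and boosts.
query = boosted_prefix_query("vers", {"companyName": 5, "descriptionTxt": 1})
```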


Re: Wildcard and boosting

2009-07-29 Thread Jón Helgi Jónsson
I just updated to a nightly build (I was using 1.2) and this does not
seem to be an issue anymore.


Re: How to install a patch?

2008-06-10 Thread Jón Helgi Jónsson
Thanks for that. The patch in question is this one:
http://issues.apache.org/jira/browse/SOLR-469
I found this patching utility for Windows, going to give it a go:
http://gnuwin32.sourceforge.net/packages/patch.htm

On Tue, Jun 10, 2008 at 12:11 PM, Jacob Singh [EMAIL PROTECTED] wrote:
 Hi Rusli,

 Is there a URL you'd like to reference for where you got the patch?
 That would probably help.

 For Windows I suppose you'll have to google around to find a version of
 patch which runs there.  Beyond Compare is a Windows app which has
 patching capabilities.  patch is a program for *nix machines wherein
 the user supplies a patch file and it patches an existing file.

 A patch file is in a certain format which explains the differences
 between the original and a modified copy.  So you already have the file
 locally, and applying the patch file to it will make the changes
 needed to make your copy like the one the author of the patch has.

 The source of the file you are looking for is probably in the handlers
 directory of the solr source.


 Hope that helps,
 Jacob


 Rusli Ruslakall wrote:
 This is a terribly simple question I bet.

 I'm running Solr on Windows and would like to use the Data Import
 RequestHandler patch. I have been trying to figure out how to install
 this patch but have been unsuccessful so far. How would I go about doing
 this?

 Thanks,
 Jon




Re: Want to drill down facet search result

2008-05-29 Thread Jón Helgi Jónsson
Thanks for that, I looked into fq and it will definitely help when I
drill into zip codes.

However I'm still having some issues, facet.prefix only got me so far
because sometimes the facet is the second word in the field.

Also I have another question with this example:

<doc>
  <field name="name">Company A</field>

  <field name="category_id">1</field>
  <field name="category_name">Car</field>
  <field name="category_alias">automobile, vehicle</field>

  <field name="category_id">2</field>
  <field name="category_name">Animals</field>
  <field name="category_alias">cat, dog, rat</field>

</doc>

Is there any way I can group category information together? So that I
know the category_id for the specific category_name?

For example, I want to facet search for 'vehicle' and count how many
companies are in parent category 1, and also get the name of that
category (Car).

I can put everything in one field value and break it apart with PHP after
the fact, but I'm wondering if there is a better way.
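
The "everything in one value" idea could look like the sketch below. The separator and helper names are assumptions, and the combined value would need to go into an untokenized (string) field so the facet values come back intact.

```python
SEPARATOR = "|"  # assumed not to occur in any category name


def pack_category(category_id, name):
    """Join the ID and the name into one facet value."""
    return f"{category_id}{SEPARATOR}{name}"


def unpack_category(value):
    """Split a facet value back into (id, name)."""
    category_id, name = value.split(SEPARATOR, 1)
    return category_id, name


packed = pack_category("1", "Car")
```

Splitting with a maximum of one split keeps names intact even if they happen to contain further separators.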

On Thu, May 29, 2008 at 5:32 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
 On Thu, May 29, 2008 at 12:22 PM, Rusli Ruslakall
 [EMAIL PROTECTED] wrote:
 searched forever before posting and of course I found it shortly after :)

 Can use facet.prefix, beautiful!

 You can also constrain both results and facets to any arbitrary query
 via fq=myquery

 -Yonik


 On Thu, May 29, 2008 at 3:43 PM, Rusli Ruslakall
 [EMAIL PROTECTED] wrote:
 Hi,

 I index something like this:

 <doc>
   <field name="name">Company A</field>
   <field name="cat">123</field>
   <field name="cat">456</field>
   <field name="cat">789</field>
 </doc>

 <doc>
   <field name="name">Company B</field>
   <field name="cat">129</field>
   <field name="cat">123</field>
   <field name="cat">987</field>
 </doc>

 So I ONLY want to display all category names starting with '12' and
 how many companies are in each one.

 In this example it should output:

 name count
 123  (2)
 129  (1)


 What I have now is:
 http://localhost:8983/solr/select/?q=cat:12&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1

 But with this I get all the categories, which I would rather not have:

 name count
 123  (2)
 456  (1) -- Rather not get this information
 789  (1) -- Rather not get this information
 129  (1)
 987  (1) -- Rather not get this information


 Is there some way of achieving this in Solr?

 Thanks a lot!
 Jon
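
The facet.prefix approach mentioned in the replies above can be illustrated with the two example documents. The counting function is only a local imitation of what the prefix filter does to the facet values, not Solr code, and using q=*:* in the request is an assumption.

```python
from collections import Counter
from urllib.parse import urlencode

docs = [
    {"name": "Company A", "cat": ["123", "456", "789"]},
    {"name": "Company B", "cat": ["129", "123", "987"]},
]


def prefix_facet_counts(docs, field, prefix):
    """Count documents per field value, keeping only values with the prefix."""
    counts = Counter()
    for doc in docs:
        for value in set(doc[field]):  # a document counts once per value
            if value.startswith(prefix):
                counts[value] += 1
    return dict(counts)


counts = prefix_facet_counts(docs, "cat", "12")

# The corresponding request parameters, with facet.prefix doing the filtering.
query_string = urlencode({
    "q": "*:*", "facet": "true", "facet.field": "cat",
    "facet.prefix": "12", "facet.mincount": 1, "facet.limit": -1,
})
```

Here counts holds 2 for 123 and 1 for 129, matching the desired output in the question.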