AndrzejBialecki to this group. Thank you!
--
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
___.,___,___,___,_._. __
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com
be only used to co-ordinate
parts of the query that matched the same document number.
--
Best regards,
Andrzej Bialecki
://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
http://research.google.com/pubs/archive/37365.pdf
--
Best regards,
Andrzej Bialecki
there.
--
Best regards,
Andrzej Bialecki
LUCENE-3622.
--
Best regards,
Andrzej Bialecki
LUCENE-1812 for another practical application of this concept.
--
Best regards,
Andrzej Bialecki
/browse/SOLR-1535
--
Best regards,
Andrzej Bialecki
dramatically (and the performance will drop then). Modern OS-es try to
keep as much data in memory as possible, so the memory usage itself is
not that informative - but check what the pagein/pageout rates are when
you start hitting the 32 vs 64 cores.
--
Best regards,
Andrzej Bialecki
that excess of memory, but it
won't be available for OS-level disk IO. Therefore reducing the heap
size may actually increase your performance.
--
Best regards,
Andrzej Bialecki
://www.lucidimagination.com/forum/ .
--
Best regards,
Andrzej Bialecki
around) then 3.1 is a safer bet. If you need a dozen or so new
exciting features (e.g. results grouping) or top performance, or if you
need LucidWorks with Click and other goodies, then use 4.x and be
prepared for an occasional full reindex.
--
Best regards,
Andrzej Bialecki
to transparently convert indexes from
one 4.x to another 4.x format, but they are not there yet.
--
Best regards,
Andrzej Bialecki
a
contrib/patch that was applied?
At the moment it's proprietary. I will have a talk at the Lucene
Revolution conference that describes the Click tools in detail.
--
Best regards,
Andrzej Bialecki
. For now it's better to pass openNew=false
and be prepared to get a null.
--
Best regards,
Andrzej Bialecki
distributed RPC for every query...
To summarize, I would qualify your statement with: ...if the
composition of your shards is drastically different. Otherwise the cost
of using global IDF is not worth it, IMHO.
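
A rough way to see when global IDF matters: compare each shard's local IDF
with the IDF computed from summed statistics. This is only a sketch. The
shard statistics are made up, the helper names are hypothetical, and the
formula assumed is Lucene's classic idf = 1 + ln(numDocs / (docFreq + 1)).

```python
import math


def idf(num_docs, doc_freq):
    # Classic Lucene-style IDF (assumed formula, see note above).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def global_idf(shards, term):
    # Sum document counts and document frequencies over all shards.
    total_docs = sum(s["num_docs"] for s in shards)
    total_df = sum(s["doc_freq"].get(term, 0) for s in shards)
    return idf(total_docs, total_df)


def max_local_deviation(shards, term):
    # Largest gap between any shard's local IDF and the global IDF.
    g = global_idf(shards, term)
    return max(
        abs(idf(s["num_docs"], s["doc_freq"].get(term, 0)) - g)
        for s in shards
    )


# Hypothetical shard statistics, for illustration only.
similar = [
    {"num_docs": 1000, "doc_freq": {"solr": 10}},
    {"num_docs": 1000, "doc_freq": {"solr": 12}},
]
skewed = [
    {"num_docs": 1000, "doc_freq": {"solr": 500}},
    {"num_docs": 1000, "doc_freq": {"solr": 1}},
]

print(max_local_deviation(similar, "solr"))  # small gap
print(max_local_deviation(skewed, "solr"))   # large gap
```

With similarly composed shards the local values stay close to the global
one, so the extra round trip buys little. With skewed shards they diverge.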
--
Best regards,
Andrzej Bialecki
On 2010-10-25 13:37, Toke Eskildsen wrote:
On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote:
* there is an exact solution to this problem, namely to make two
distributed calls instead of one (first call to collect per-shard IDFs
for given query terms, second call to submit a query
of a tokenizing chain
that could use a language detector to create different fields (or
tokenize differently) based on this decision.
--
Best regards,
Andrzej Bialecki
hitting the trunk (right, Mark? ;) ), so
medium-term I think this is your best bet.
--
Best regards,
Andrzej Bialecki
convenience components
that will make it easier.
--
Best regards,
Andrzej Bialecki
a spellchecker in another core
(instead of using the current sub-index hack).
--
Best regards,
Andrzej Bialecki
MOD won't do ;) so I
think it would be good to hide this strategy behind an
interface/abstract class. It costs nothing, and gives you flexibility in
how you implement this mapping.
--
Best regards,
Andrzej Bialecki
On 2010-09-06 22:03, Dennis Gearon wrote:
What is a 'simple MOD'?
md5(docId) % numShards
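
A minimal sketch of that mapping, hidden behind an abstract class as
suggested earlier in the thread. The class and method names (ShardMapper,
Md5ModMapper, shard_for) are hypothetical and not part of any Solr API.

```python
import hashlib
from abc import ABC, abstractmethod


class ShardMapper(ABC):
    """Strategy interface: maps a document id to a shard number."""

    @abstractmethod
    def shard_for(self, doc_id: str) -> int:
        ...


class Md5ModMapper(ShardMapper):
    """The 'simple MOD' strategy: md5(docId) % numShards."""

    def __init__(self, num_shards: int):
        if num_shards <= 0:
            raise ValueError("num_shards must be positive")
        self.num_shards = num_shards

    def shard_for(self, doc_id: str) -> int:
        # Interpret the 16-byte MD5 digest as one big integer, then reduce.
        digest = hashlib.md5(doc_id.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % self.num_shards


mapper = Md5ModMapper(4)
print(mapper.shard_for("doc-42"))
```

Callers only see ShardMapper, so a different strategy can be swapped in
later without touching the indexing code.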
--
Best regards,
Andrzej Bialecki
an awful lot of RAM... see SOLR-1316 for some
measurements.
--
Best regards,
Andrzej Bialecki
On 2010-06-03 13:38, Michael Kuhlmann wrote:
On 03.06.2010 13:02, Andrzej Bialecki wrote:
..., and deploy this
index in a separate JVM (to benefit from other CPUs than the one that
runs your Solr core)
Every known webserver is multithreaded by default, so putting different
Solr
statements that can then be fetched separately. You may want to write your
own multithreaded client to index.
SOLR-1301 is also an option if you are familiar with Hadoop ...
--
Best regards,
Andrzej Bialecki
On 2010-06-02 13:12, Grant Ingersoll wrote:
On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
On 2010-06-02 12:42, Grant Ingersoll wrote:
On Jun 1, 2010, at 9:54 PM, Blargy wrote:
We have around 5 million items in our index and each item has a description
located on a separate
On 2010-05-15 02:46, Blargy wrote:
Thanks for your help and especially your analyzer.. probably saved me a
full-import or two :)
Also, take a look at this issue:
https://issues.apache.org/jira/browse/SOLR-1316
--
Best regards,
Andrzej Bialecki
to assign different
priorities to suggestions (other than a simple IDF based priority), or
have many city names consisting of multiple tokens, then use SOLR-1316.
--
Best regards,
Andrzej Bialecki
equivalent
to the TermsComponent; or from a list of frequent queries - but you need
to build that list yourself).
--
Best regards,
Andrzej Bialecki
to
the frequency of terms/phrases in the query logs ...
TermsComponent and EdgeNGrams, while simple to use, suffer from both issues.
--
Best regards,
Andrzej Bialecki
contrib/ is a quick and perhaps
acceptable solution ...
--
Best regards,
Andrzej Bialecki
an
AddUpdateCommand to the update processor. You can obtain both the update
processor and SolrCell instance from req.getCore().
--
Best regards,
Andrzej Bialecki
.
Could you perhaps elaborate a bit on this functionality? Your
description sounds intriguing - it reminds me of ParallelReader, but I'm
probably completely wrong ...
--
Best regards,
Andrzej Bialecki
that shows fragments of XML config files.
--
Best regards,
Andrzej Bialecki
consists of 4 fields, F1, F2, F3, F4
Now I want to update the value of field F2, so if I send the update xml to
SOLR, can it keep the old field values for F1,F3,F4 and update the new value
specified for F2?
Best Regards,
Kranti K K Parisa
--
Best regards,
Andrzej Bialecki
).
Kullback-Leibler divergence?
--
Best regards,
Andrzej Bialecki
be too costly.
--
Best regards,
Andrzej Bialecki
out from the response...
You can also implement a SearchComponent that post-processes results and
and, if a field defined in the schema is missing from a document, adds an
empty node for it to the result.
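
The post-processing step itself is simple. Here is a sketch of the logic
only, on plain dictionaries. It does not use the Solr SearchComponent API,
and the function name and the empty-string placeholder are assumptions.

```python
def fill_missing_fields(docs, schema_fields, placeholder=""):
    """Return copies of docs where every schema field is present."""
    filled = []
    for doc in docs:
        # Start from placeholders, then overlay the values the doc has.
        complete = {name: placeholder for name in schema_fields}
        complete.update(doc)
        filled.append(complete)
    return filled


results = [{"id": "1", "title": "Solr"}, {"id": "2"}]
print(fill_missing_fields(results, ["id", "title"]))
```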
--
Best regards,
Andrzej Bialecki
On 2010-01-10 01:55, Lance Norskog wrote:
Make two copies of the index. In each copy, delete the records you do
not want. Optimize.
... which is essentially what the MultiPassIndexSplitter does, only it
avoids the initial copy (by deleting in the source index).
--
Best regards,
Andrzej
to solving this is to index compound words, i.e. when
producing a spellchecker dictionary add a record tommyhitfiger with a
field that points to tommy hitfiger. Details vary depending on what
spellchecking impl. you use.
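
As a sketch of the idea, independent of any particular spellchecker: build
a dictionary keyed by the run-together form that points back to the spaced
phrase. The function name and the dict-based storage are assumptions. A
real implementation would store this as a field in the spellchecker index.

```python
def add_compound_entries(dictionary, phrases):
    """For each multi-token phrase, add a run-together key pointing to it."""
    for phrase in phrases:
        tokens = phrase.split()
        if len(tokens) > 1:
            dictionary["".join(tokens)] = " ".join(tokens)
    return dictionary


compounds = add_compound_entries({}, ["tommy hitfiger", "solr"])
print(compounds.get("tommyhitfiger"))
```

Single-token entries are skipped because they need no compound record.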
--
Best regards,
Andrzej Bialecki
, etc. The cost for this flexibility is
that it needs to read index files multiple times (hence multi-pass).
--
Best regards,
Andrzej Bialecki
-1316, there are patches there that implement
such component using prefix trees.
--
Best regards,
Andrzej Bialecki
term appropriately.
--
Best regards,
Andrzej Bialecki
: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=104
INFO: [idx1] webapp=/solr path=/update/ params={} status=0 QTime=52
...
Is this a known issue?
It may be an issue with System.currentTimeMillis() resolution on some
platforms (e.g. Windows)?
--
Best regards,
Andrzej Bialecki
Yonik Seeley wrote:
On Mon, Oct 12, 2009 at 12:03 PM, Andrzej Bialecki a...@getopt.org wrote:
Solr never discarded non-positive hits, and now Lucene 2.9 no longer
does either.
Hmm ... The code that I pasted in my previous email uses
Searcher.search(Query, int), which in turn uses search(Query
(a:b in 0), product of:
1.0 = tf(termFreq(a:b)=1)
0.30685282 = idf(docFreq=1, numDocs=1)
0.5 = fieldNorm(field=a, doc=0)
bsh %
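
For reference, the factors in that explanation can be reproduced by hand.
This sketch assumes the classic Lucene similarity formulas, tf = sqrt(freq)
and idf = 1 + ln(numDocs / (docFreq + 1)). The fieldNorm of 0.5 is taken
from the output above as-is.

```python
import math


def classic_tf(freq):
    # tf(termFreq) in the classic similarity is the square root of freq.
    return math.sqrt(freq)


def classic_idf(doc_freq, num_docs):
    # idf(docFreq, numDocs) = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))


tf = classic_tf(1)        # 1.0
idf = classic_idf(1, 1)   # 1 + ln(1/2), about 0.3068528
field_norm = 0.5          # copied from the explain output

score = tf * idf * field_norm
print(round(idf, 7), round(score, 7))
```

The score is positive here because idf stays above zero for docFreq=1,
numDocs=1.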
--
Best regards,
Andrzej Bialecki
Yonik Seeley wrote:
On Mon, Oct 12, 2009 at 5:58 AM, Andrzej Bialecki a...@getopt.org wrote:
BTW, standard Collectors collect only results
with positive scores, so if you want to collect results with negative scores
as well then you need to use a custom Collector.
Solr never discarded non
Shalin Shekhar Mangar wrote:
On Fri, Oct 9, 2009 at 10:53 PM, Andrzej Bialecki a...@getopt.org wrote:
Hi,
What's the canonical way to pass an update request to another handler? I'm
implementing a handler that has to dispatch its result to different update
handlers based on its internal
dependent on deployment paths defined in solrconfig.xml.
Using SolrCore.getRequestHandlers(handler.class) often returns the
LazyRequestHandlerWrapper, from which it's not possible to retrieve the
wrapped instance of the handler ..
--
Best regards,
Andrzej Bialecki
- just get
IndexReader.terms() enumeration and traverse it.
--
Best regards,
Andrzej Bialecki
fields and term counts per field.
--
Best regards,
Andrzej Bialecki
way involves writing a SearchComponent that does the
latter part of that process on the Solr side.)
--
Best regards,
Andrzej Bialecki
a heap dump on OOM (it's a JVM flag, -XX:+HeapDumpOnOutOfMemoryError on
HotSpot) and then use a tool like
HAT to find largest objects and references to them.
--
Best regards,
Andrzej Bialecki
/apache/nutch/analysis/lang/LanguageIdentifier.html
--
Best regards,
Andrzej Bialecki
.
Phrase queries that you can construct using QueryParser can't match two
tokens separated by a hole, unless you set a slop value greater than 0.
--
Best regards,
Andrzej Bialecki
Otis Gospodnetic wrote:
You should be fine on either Linux or FreeBSD (or any other UNIX
flavour). Running on Solaris would probably give you access to
goodness like dtrace, but you can live without it.
There's dtrace on FreeBSD, too.
--
Best regards,
Andrzej Bialecki
to reindex your segments using the solrindex command, and
change the searcher configuration. See nutch-default.xml for details.
--
Best regards,
Andrzej Bialecki
a couple days - please monitor this
issue, and when it's done just download the patched code.
--
Best regards,
Andrzej Bialecki
/apache_solr_c_blue.jpg
--
Best regards,
Andrzej Bialecki
in the proceedings of SIGIR-08, which presents an interesting and
relatively simple algorithm that yields excellent results. Who has some
spare CPU cycles to implement this? ;)
http://ilpubs.stanford.edu:8090/860/
--
Best regards,
Andrzej Bialecki
) - this should work in
your case.
Ultimately, what you are probably looking for is a shingle-based
algorithm, but it's relatively costly and requires multiple passes.
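
To give an idea of what "shingle-based" means, here is a small in-memory
sketch: word-level shingles compared with Jaccard similarity. The shingle
size of 3 and the function names are assumptions. Doing this at collection
scale needs sketching and multiple passes, which this example does not
attempt.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Text shorter than one shingle: use the whole text, if any.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
print(jaccard(shingles(doc_a), shingles(doc_b)))  # 0.75
```

The two documents share 6 of 8 distinct shingles, hence 0.75. A threshold
on this value decides what counts as a near-duplicate.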
--
Best regards,
Andrzej Bialecki
the length of posting lists, which leads
to increased memory/CPU consumption during decoding and traversing of
the lists. Also, the overall increased number of positions will have an
impact on the index size.
--
Best regards,
Andrzej Bialecki
straightforward, and relieves you of the need to
de-duplicate your collection.
--
Best regards,
Andrzej Bialecki
(or Morfologik) for Polish stemming and
lemmatization. It provides a superset of Stempel features: in
addition to algorithmic stemming it provides dictionary-based
stemming, and the two methods complement each other nicely.
--
Best regards,
Andrzej Bialecki
manager
that knows the latest versions of each shard among the whole active set
- or that clients discover this dynamically by querying the shard
servers every now and then.
--
Best regards,
Andrzej Bialecki
).
--
Best regards,
Andrzej Bialecki
the correct stopword list.
--
Best regards,
Andrzej Bialecki
for which Nutch architecture is optimized.
--
Best regards,
Andrzej Bialecki
71 matches