Re: Timestamps and Cardinality in Queries

2017-03-01 Thread Aaron D. Mihalik
The mongo tests download and run a specially packaged version of Mongo.
It seems like it's having difficulty downloading. Can you hit the URL
for Mongo?

Could not open inputStream for
http://fastdl.mongodb.org/osx/mongodb-osx-x86_64-3.2.1.tgz


On Wed, Mar 1, 2017 at 6:04 PM Liu, Eric  wrote:

> Hm, maven runs now, but it’s getting this error in the Mongo tests:
> http://pastebin.com/Mt928ane
>
> On 3/1/17, 12:30 PM, "Aaron D. Mihalik"  wrote:
>
> That's really strange.  Can you hit the maven central repo [1] from
> your
> machine?
>
> I guess delete the locationtech repository definition from your pom?
>
>
> [1] http://repo1.maven.org/maven2/org/apache/apache/17/
>
> On Wed, Mar 1, 2017 at 2:31 PM Liu, Eric 
> wrote:
>
> > Hmmm, deleting the files in .m2 doesn’t stop it from searching in
> > locationtech, and using the other mvn command gives me no log output.
> >
> > On 3/1/17, 10:55 AM, "Aaron D. Mihalik" 
> wrote:
> >
> > traversing: gotcha.  I completely understand now.  And now I
> > understand
> > how the prospector table would help with sniping out those nodes.
> >
> > maven: yep, that's the right git repo.  Locationtech is required
> when
> > you
> > build with the 'geoindexing' profile.  Regardless, it's strange
> that
> > maven
> > tried to get the apache pom from locationtech.  Deleting the
> > org/apache/apache directory should force maven to download the
> apache
> > pom
> > from maven central.
> >
> > --Aaron
> >
> > On Wed, Mar 1, 2017 at 1:47 PM Liu, Eric <
> eric@capitalone.com>
> > wrote:
> >
> > > Oh, that’s not an issue, that’s what we would like to do when
> > traversing
> > > through the data. If a node has a high cardinality we don’t
> want to
> > further
> > > traverse through its children.
> > >
> > > As for installation, did I clone the right repo for Rya? The
> one I’m
> > using
> > > has locationtech repos for SNAPSHOT and RELEASE:
> > > https://github.com/apache/incubator-rya/blob/master/pom.xml
> > >
> > > On 3/1/17, 6:09 AM, "Aaron D. Mihalik" <
> aaron.miha...@gmail.com>
> > wrote:
> > >
> > > Repos: The locationtech repo is up [1].  The issue is that
> your
> > local
> > > .m2
> > > repo is in a bad state.  Maven is trying to get the apache
> pom
> > from
> > > locationtech.  Locationtech does not host that pom,
> instead it's
> > on
> > > maven
> > > central [2].
> > >
> > > Two ways to fix this issue (you should do (1) and that'll
> fix
> > it...
> > > (2) is
> > > just another option for reference).
> > >
> > > 1. Delete your apache pom directory from your local maven
> repo
> > (e.g.
> > > rm -rf
> > > ~/.m2/repository/org/apache/apache/)
> > >
> > > 2. Tell maven to ignore remote repository metadata with
> the -llr
> > flag
> > > (e.g.
> > > mvn clean install -llr -Pgeoindexing)
> > >
> > > Let me know if you have any other issues.
> > >
> > > deep/wide: okay, I don't understand this statement: "if the
> > > cardinality of
> > > a node is too high (for example, a user that owns a large
> number
> > of
> > > datasets), the neighbors of that node will not be found."
> Is
> > this a
> > > property of your current datastore, or is this an issue
> with Rya?
> > >
> > > --Aaron
> > >
> > > [1]
> > >
> > >
> >
> https://repo.locationtech.org/content/repositories/releases/org/locationtech/geomesa/
> > > [2] http://repo1.maven.org/maven2/org/apache/apache/17/
> > >
> > > On Wed, Mar 1, 2017 at 7:43 AM Puja Valiyil <
> puja...@gmail.com>
> > wrote:
> > >
> > > > Hey Eric,
> > > > Regarding the repos-- sometimes the location tech repos
> go
> > down,
> > > your best
> > > > bet is to wait a little bit and try again.  You can also
> > download the
> > > > latest artifacts off of the apache build server.
> > > > Since location tech is only used for the geo profile we
> may
> > want to
> > > move
> > > > where that repo is declared (or put it in the geo
> profile).
> > > > For your use case, you could look to use the cardinality
> in the
> > > prospector
> > > > services for individual nodes.  Though the prospector
> services
> > could
> > > be run
> > > > once and then used to be 

Re: Timestamps and Cardinality in Queries

2017-03-01 Thread Aaron D. Mihalik
traversing: gotcha.  I completely understand now.  And now I understand
how the prospector table would help with sniping out those nodes.

maven: yep, that's the right git repo.  Locationtech is required when you
build with the 'geoindexing' profile.  Regardless, it's strange that maven
tried to get the apache pom from locationtech.  Deleting the
org/apache/apache directory should force maven to download the apache pom
from maven central.

--Aaron

On Wed, Mar 1, 2017 at 1:47 PM Liu, Eric  wrote:

> Oh, that’s not an issue, that’s what we would like to do when traversing
> through the data. If a node has a high cardinality we don’t want to further
> traverse through its children.
>
> As for installation, did I clone the right repo for Rya? The one I’m using
> has locationtech repos for SNAPSHOT and RELEASE:
> https://github.com/apache/incubator-rya/blob/master/pom.xml
>
> On 3/1/17, 6:09 AM, "Aaron D. Mihalik"  wrote:
>
> Repos: The locationtech repo is up [1].  The issue is that your local
> .m2
> repo is in a bad state.  Maven is trying to get the apache pom from
> locationtech. Locationtech does not host that pom; instead it's on
> maven
> central [2].
>
> Two ways to fix this issue (you should do (1) and that'll fix it...
> (2) is
> just another option for reference).
>
> 1. Delete your apache pom directory from your local maven repo (e.g.
> rm -rf
> ~/.m2/repository/org/apache/apache/)
>
> 2. Tell maven to ignore remote repository metadata with the -llr flag
> (e.g.
> mvn clean install -llr -Pgeoindexing)
>
> Let me know if you have any other issues.
>
> deep/wide: okay, I don't understand this statement: "if the
> cardinality of
> a node is too high (for example, a user that owns a large number of
> datasets), the neighbors of that node will not be found."  Is this a
> property of your current datastore, or is this an issue with Rya?
>
> --Aaron
>
> [1]
>
> https://repo.locationtech.org/content/repositories/releases/org/locationtech/geomesa/
> [2] http://repo1.maven.org/maven2/org/apache/apache/17/
>
> On Wed, Mar 1, 2017 at 7:43 AM Puja Valiyil  wrote:
>
> > Hey Eric,
> > Regarding the repos-- sometimes the location tech repos go down,
> your best
> > bet is to wait a little bit and try again.  You can also download the
> > latest artifacts off of the apache build server.
> > Since location tech is only used for the geo profile we may want to
> move
> > where that repo is declared (or put it in the geo profile).
> > For your use case, you could look to use the cardinality in the
> prospector
> > services for individual nodes.  Though the prospector services could
> be run
> > once and then used to be representative (that wouldn't work for your
> use
> > case), you could run them regularly to keep track of counts for your
> use
> > case.  Are you using the count keyword or just manually counting
> edges?
> > The count keyword is pretty inefficient currently.  We could add
> that to
> > our list of priorities maybe.
> >
> > Sent from my iPhone
> >
> > > On Mar 1, 2017, at 3:00 AM, Liu, Eric 
> wrote:
> > >
> > > Hey Aaron,
> > >
> > > I’m currently setting up Rya to test these queries with some of our
> > data. I ran into an error when I run ‘mvn clean install’; I attached
> the
> > logs but it seems like I can’t connect to the snapshots repo you’re
> using.
> > >
> > > As for “deep/wide”, it would be something like starting at a
> dataset,
> > then fanning out looking for relations where it is either the
> subject or
> > object, such as the user who created it, the job it came from, where
> it’s
> > stored, etc. It would recurse on these neighboring nodes until a
> total
> > number of results is reached. However, if the cardinality of a node
> is too
> > high (for example, a user that owns a large number of datasets), the
> > neighbors of that node will not be found. Really, the goal is to
> find the
> > most distant relevant relationships possible, and this is our
> current
> > naïve way of doing so.
> > >
> > > Do you want to have a short call about this? I think it’d be
> easier to
> > explain/answer questions over the phone. I’m free pretty much any
> time
> > 1pm-5pm PST tomorrow (3/1).
> > >
> > > Thanks,
> > > Eric
> > >
> > > On 2/24/17, 6:18 AM, "Aaron D. Mihalik" 
> wrote:
> > >
> > >deep vs wide: I played around with the property paths sparql
> operator
> > and
> > >put up an example here [1].  This is a slightly different query
> than
> > the
> > >one I sent out before.  It would be worth it for us to look at
> how
> > this is
> > >actually executed by OpenRDF.
> > >
> > >  

Re: Timestamps and Cardinality in Queries

2017-03-01 Thread Aaron D. Mihalik
Repos: The locationtech repo is up [1].  The issue is that your local .m2
repo is in a bad state.  Maven is trying to get the apache pom from
locationtech. Locationtech does not host that pom; instead it's on maven
central [2].

Two ways to fix this issue (you should do (1) and that'll fix it... (2) is
just another option for reference).

1. Delete your apache pom directory from your local maven repo (e.g. rm -rf
~/.m2/repository/org/apache/apache/)

2. Tell maven to ignore remote repository metadata with the -llr flag (e.g.
mvn clean install -llr -Pgeoindexing)

Let me know if you have any other issues.

deep/wide: okay, I don't understand this statement: "if the cardinality of
a node is too high (for example, a user that owns a large number of
datasets), the neighbors of that node will not be found."  Is this a
property of your current datastore, or is this an issue with Rya?

--Aaron

[1]
https://repo.locationtech.org/content/repositories/releases/org/locationtech/geomesa/
[2] http://repo1.maven.org/maven2/org/apache/apache/17/

On Wed, Mar 1, 2017 at 7:43 AM Puja Valiyil  wrote:

> Hey Eric,
> Regarding the repos-- sometimes the location tech repos go down, your best
> bet is to wait a little bit and try again.  You can also download the
> latest artifacts off of the apache build server.
> Since location tech is only used for the geo profile we may want to move
> where that repo is declared (or put it in the geo profile).
> For your use case, you could look to use the cardinality in the prospector
> services for individual nodes.  Though the prospector services could be run
> once and then used to be representative (that wouldn't work for your use
> case), you could run them regularly to keep track of counts for your use
> case.  Are you using the count keyword or just manually counting edges?
> The count keyword is pretty inefficient currently.  We could add that to
> our list of priorities maybe.
>
> Sent from my iPhone
>
> > On Mar 1, 2017, at 3:00 AM, Liu, Eric  wrote:
> >
> > Hey Aaron,
> >
> > I’m currently setting up Rya to test these queries with some of our
> data. I ran into an error when I run ‘mvn clean install’; I attached the
> logs but it seems like I can’t connect to the snapshots repo you’re using.
> >
> > As for “deep/wide”, it would be something like starting at a dataset,
> then fanning out looking for relations where it is either the subject or
> object, such as the user who created it, the job it came from, where it’s
> stored, etc. It would recurse on these neighboring nodes until a total
> number of results is reached. However, if the cardinality of a node is too
> high (for example, a user that owns a large number of datasets), the
> neighbors of that node will not be found. Really, the goal is to find the
> most distant relevant relationships possible, and this is our current
> naïve way of doing so.
> >
> > Do you want to have a short call about this? I think it’d be easier to
> explain/answer questions over the phone. I’m free pretty much any time
> 1pm-5pm PST tomorrow (3/1).
> >
> > Thanks,
> > Eric
> >
> > On 2/24/17, 6:18 AM, "Aaron D. Mihalik"  wrote:
> >
> >deep vs wide: I played around with the property paths sparql operator
> and
> >put up an example here [1].  This is a slightly different query than
> the
> >one I sent out before.  It would be worth it for us to look at how
> this is
> >actually executed by OpenRDF.
> >
> >Eric: Could you clarify by "deep vs wide"?  I think I understand your
> >queries, but I don't have a good intuition about those terms and how
> >cardinality might figure into a query.  It would probably be a bit
> more
> >helpful if you provided a model or general description that is
> (somewhat)
> >representative of your data.
> >
> >--Aaron
> >
> >[1]
> >
> https://github.com/amihalik/sesame-debugging/blob/master/src/main/java/com/github/amihalik/sesame/debugging/PropertyPathsExample.java
> >
> >>On Thu, Feb 23, 2017 at 9:42 PM Adina Crainiceanu 
> wrote:
> >>
> >> Hi Eric,
> >>
> >> If you want to query by the Accumulo timestamp, something like
> >> timeRange(?ts, 13141201490, 13249201490) should work in Rya. I did not
> try
> >> it lately, but timeRange() was in Rya originally. Not sure if it was
> >> removed in later iterations or whether it would be useful for your use
> >> case. First Rya paper
> >> https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
> discusses
> >> time ranges (Section 5.3 at the link above)
> >>
> >> Adina
> >>
> >>> On Thu, Feb 23, 2017 at 8:31 PM, Puja Valiyil 
> wrote:
> >>>
> >>> Hey John,
> >>> I'm pretty sure your pull request was merged-- it was pulled in through
> >>> another pull request.  If not, sorry-- I thought it had been merged and
> >>> then just not closed.  I was going to spend some time doing merges
> >> tomorrow
> >>> so I can get it 

Re: Timestamps and Cardinality in Queries

2017-03-01 Thread Puja Valiyil
Hey Eric,
Regarding the repos-- sometimes the location tech repos go down, your best bet 
is to wait a little bit and try again.  You can also download the latest 
artifacts off of the apache build server.
Since location tech is only used for the geo profile we may want to move where 
that repo is declared (or put it in the geo profile).
For your use case, you could look to use the cardinality in the prospector 
services for individual nodes.  Though the prospector services could be run 
once and then used to be representative (that wouldn't work for your use case), 
you could run them regularly to keep track of counts for your use case.  Are 
you using the count keyword or just manually counting edges?  The count keyword 
is pretty inefficient currently.  We could add that to our list of priorities 
maybe. 
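The cardinality lookup described above can be sketched in plain Python. Everything below (the triples, resource names, and the MAX_FANOUT threshold) is invented for illustration; it is not Rya's prospector API, just the idea of precomputing per-node counts and consulting them before traversing:

```python
from collections import Counter

# Toy triple store; the resources and predicates are made up.
triples = [("userA", "owns", "dataset%d" % i) for i in range(100)]
triples += [("userB", "owns", "dataset100"),
            ("dataset100", "producedBy", "job7")]

# Prospector-style statistic: how many statements each resource
# appears in as the subject (its out-degree / cardinality).
cardinality = Counter(s for s, p, o in triples)

MAX_FANOUT = 50  # assumed tuning knob, not a Rya constant

def should_expand(node):
    """Decide whether a traversal should expand this node's children."""
    return cardinality[node] <= MAX_FANOUT
```

Here should_expand("userA") is False because userA appears as the subject of 100 statements, while should_expand("userB") is True; rerunning the count regularly keeps the statistic current, as suggested above.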

Sent from my iPhone

> On Mar 1, 2017, at 3:00 AM, Liu, Eric  wrote:
> 
> Hey Aaron,
> 
> I’m currently setting up Rya to test these queries with some of our data. I 
> ran into an error when I run ‘mvn clean install’; I attached the logs but it 
> seems like I can’t connect to the snapshots repo you’re using.
> 
> As for “deep/wide”, it would be something like starting at a dataset, then 
> fanning out looking for relations where it is either the subject or object, 
> such as the user who created it, the job it came from, where it’s stored, 
> etc. It would recurse on these neighboring nodes until a total number of 
> results is reached. However, if the cardinality of a node is too high (for 
> example, a user that owns a large number of datasets), the neighbors of that 
> node will not be found. Really, the goal is to find the most distant 
> relevant relationships possible, and this is our current naïve way of doing 
> so.
> 
> Do you want to have a short call about this? I think it’d be easier to 
> explain/answer questions over the phone. I’m free pretty much any time 
> 1pm-5pm PST tomorrow (3/1).
> 
> Thanks,
> Eric
> 
> On 2/24/17, 6:18 AM, "Aaron D. Mihalik"  wrote:
> 
>deep vs wide: I played around with the property paths sparql operator and
>put up an example here [1].  This is a slightly different query than the
>one I sent out before.  It would be worth it for us to look at how this is
>actually executed by OpenRDF.
> 
>Eric: Could you clarify by "deep vs wide"?  I think I understand your
>queries, but I don't have a good intuition about those terms and how
>cardinality might figure into a query.  It would probably be a bit more
>helpful if you provided a model or general description that is (somewhat)
>representative of your data.
> 
>--Aaron
> 
>[1]
>
> https://github.com/amihalik/sesame-debugging/blob/master/src/main/java/com/github/amihalik/sesame/debugging/PropertyPathsExample.java
> 
>>On Thu, Feb 23, 2017 at 9:42 PM Adina Crainiceanu  wrote:
>> 
>> Hi Eric,
>> 
>> If you want to query by the Accumulo timestamp, something like
>> timeRange(?ts, 13141201490, 13249201490) should work in Rya. I did not try
>> it lately, but timeRange() was in Rya originally. Not sure if it was
>> removed in later iterations or whether it would be useful for your use
>> case. First Rya paper
>> https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf discusses
>> time ranges (Section 5.3 at the link above)
>> 
>> Adina
>> 
>>> On Thu, Feb 23, 2017 at 8:31 PM, Puja Valiyil  wrote:
>>> 
>>> Hey John,
>>> I'm pretty sure your pull request was merged-- it was pulled in through
>>> another pull request.  If not, sorry-- I thought it had been merged and
>>> then just not closed.  I was going to spend some time doing merges
>> tomorrow
>>> so I can get it tomorrow.
>>> 
>>> Sent from my iPhone
>>> 
 On Feb 23, 2017, at 8:13 PM, John Smith  wrote:
 
 I have a pull request that fixes that problem.. it has been stuck in
>>> limbo
 for months.. https://github.com/apache/incubator-rya-site/pull/1  Can
 someone merge it into master?
 
> On Thu, Feb 23, 2017 at 2:00 PM, Liu, Eric 
>>> wrote:
> 
> Cool, thanks for the help.
> By the way, the link to the Rya Manual is outdated on the
>>> rya.apache.org
> site. Should be pointing at https://github.com/apache/
> incubator-rya/blob/master/extras/rya.manual/src/site/markdown/_
>> index.md
> 
> On 2/23/17, 12:34 PM, "Aaron D. Mihalik" 
>>> wrote:
> 
>   deep vs wide:
> 
>   A property path query is probably your best bet.  Something like:
> 
>   for the following data:
> 
>   s:EventA p:causes s:EventB
>   s:EventB p:causes s:EventC
>   s:EventC p:causes s:EventD
> 
> 
>   This query would start at EventB and work its way up and down the
> chain:
> 
>   SELECT * WHERE {
>   s:EventB (p:causes|^p:causes)* ?s . ?s ?p ?o
>   }
> 
> 
>   On Thu, 

Re: Timestamps and Cardinality in Queries

2017-03-01 Thread Liu, Eric
Hey Aaron,

I’m currently setting up Rya to test these queries with some of our data. I ran 
into an error when I run ‘mvn clean install’; I attached the logs but it seems 
like I can’t connect to the snapshots repo you’re using.

As for “deep/wide”, it would be something like starting at a dataset, then 
fanning out looking for relations where it is either the subject or object, 
such as the user who created it, the job it came from, where it’s stored, etc. 
It would recurse on these neighboring nodes until a total number of results is 
reached. However, if the cardinality of a node is too high (for example, a user 
that owns a large number of datasets), the neighbors of that node will not be 
> found. Really, the goal is to find the most distant relevant relationships 
possible, and this is our current naïve way of doing so.
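The fan-out traversal described above can be sketched in plain Python. The graph, node names, and both limits are made up for illustration; this shows only the cutoff heuristic itself, not anything from Rya:

```python
from collections import deque

# Toy undirected adjacency list standing in for subject/object links.
# Node names and both limits are invented for illustration.
graph = {
    "dataset1": ["userA", "job1", "storeX"],
    "userA": ["dataset1"] + ["dataset%d" % i for i in range(2, 60)],
    "job1": ["dataset1", "pipelineP"],
    "storeX": ["dataset1"],
    "pipelineP": ["job1"],
}

def fan_out(start, max_results=10, max_cardinality=20):
    """Breadth-first traversal that stops expanding high-cardinality nodes."""
    seen, results = {start}, []
    queue = deque([start])
    while queue and len(results) < max_results:
        node = queue.popleft()
        results.append(node)
        # A node with too many neighbors (e.g. a user owning many
        # datasets) is recorded, but its children are not explored.
        if len(graph.get(node, [])) > max_cardinality:
            continue
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return results

found = fan_out("dataset1")
```

Starting from dataset1, the traversal reaches userA, job1, storeX, and pipelineP, but never userA's many datasets: high-cardinality nodes are visited, yet their neighbors go unexplored, which is the behavior described above.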

Do you want to have a short call about this? I think it’d be easier to 
explain/answer questions over the phone. I’m free pretty much any time 1pm-5pm 
PST tomorrow (3/1).

Thanks,
Eric

On 2/24/17, 6:18 AM, "Aaron D. Mihalik"  wrote:

deep vs wide: I played around with the property paths sparql operator and
put up an example here [1].  This is a slightly different query than the
one I sent out before.  It would be worth it for us to look at how this is
actually executed by OpenRDF.

Eric: Could you clarify by "deep vs wide"?  I think I understand your
queries, but I don't have a good intuition about those terms and how
cardinality might figure into a query.  It would probably be a bit more
helpful if you provided a model or general description that is (somewhat)
representative of your data.

--Aaron

[1]

https://github.com/amihalik/sesame-debugging/blob/master/src/main/java/com/github/amihalik/sesame/debugging/PropertyPathsExample.java

On Thu, Feb 23, 2017 at 9:42 PM Adina Crainiceanu  wrote:

> Hi Eric,
>
> If you want to query by the Accumulo timestamp, something like
> timeRange(?ts, 13141201490, 13249201490) should work in Rya. I did not try
> it lately, but timeRange() was in Rya originally. Not sure if it was
> removed in later iterations or whether it would be useful for your use
> case. First Rya paper
> https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf discusses
> time ranges (Section 5.3 at the link above)
>
> Adina
>
> On Thu, Feb 23, 2017 at 8:31 PM, Puja Valiyil  wrote:
>
> > Hey John,
> > I'm pretty sure your pull request was merged-- it was pulled in through
> > another pull request.  If not, sorry-- I thought it had been merged and
> > then just not closed.  I was going to spend some time doing merges
> tomorrow
> > so I can get it tomorrow.
> >
> > Sent from my iPhone
> >
> > > On Feb 23, 2017, at 8:13 PM, John Smith  wrote:
> > >
> > > I have a pull request that fixes that problem.. it has been stuck in
> > limbo
> > > for months.. https://github.com/apache/incubator-rya-site/pull/1  Can
> > > someone merge it into master?
> > >
> > >> On Thu, Feb 23, 2017 at 2:00 PM, Liu, Eric 
> > wrote:
> > >>
> > >> Cool, thanks for the help.
> > >> By the way, the link to the Rya Manual is outdated on the
> > rya.apache.org
> > >> site. Should be pointing at https://github.com/apache/
> > >> incubator-rya/blob/master/extras/rya.manual/src/site/markdown/_
> index.md
> > >>
> > >> On 2/23/17, 12:34 PM, "Aaron D. Mihalik" 
> > wrote:
> > >>
> > >>deep vs wide:
> > >>
> > >>A property path query is probably your best bet.  Something like:
> > >>
> > >>for the following data:
> > >>
> > >>s:EventA p:causes s:EventB
> > >>s:EventB p:causes s:EventC
> > >>s:EventC p:causes s:EventD
> > >>
> > >>
> > >>This query would start at EventB and work its way up and down the
> > >> chain:
> > >>
> > >>SELECT * WHERE {
> > >>s:EventB (p:causes|^p:causes)* ?s . ?s ?p ?o
> > >>}
> > >>
> > >>
> > >>On Thu, Feb 23, 2017 at 2:58 PM Meier, Caleb <
> > caleb.me...@parsons.com>
> > >>wrote:
> > >>
> > >>> Yes, that's a good place to start.  If you have external timestamps
> > >> that
> > >>> are built into your graph using the time ontology in OWL (e.g. you
> > >> have
> > >>> triples of the form (event123, time:inDateTime, 2017-02-23T14:29)),
> > >> the
> > >>> temporal index is exactly what you want.  If you are hoping to query
> > >> based
> > >>> on the internal timestamps that Accumulo assigns to your triples,
> > >> then
> > >>> there are some slight tweaks that can be done to facilitate this,
> > >> but it
> > >>> won't be nearly as